[x264-devel] Re: Very small optimizations
Loren Merritt
lorenm at u.washington.edu
Thu Dec 1 22:52:41 CET 2005
On Thu, 1 Dec 2005, David Pio wrote:
> encoder/me.c
> function 'x264_me_search_ref'
>
> various search methods use 'i_me_range/2' or 'i_me_range/4' in the
> conditional part of the looping structure.
> i_me_range seems not to change within the context of the function, so should
> those divides be taken out of the looping structure? say create 2 new
> variables i_me_range_div2 and i_me_range_div4?? It would save some divide
> CPU cycles.
>
> Or does the compiler optimize this out?
Division symbol != division instruction. The compiler knows to use shifts.
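
A minimal sketch of the point above, not code from x264: the two loops below are hypothetical stand-ins for a search loop bounded by i_me_range/2, one with the divide in the condition and one with the bound hoisted by hand. The function names are made up for illustration. For non-negative ranges the divide is equivalent to a right shift, which is what an optimizing compiler would be expected to emit.

```c
#include <assert.h>

/* Hypothetical stand-in for a search loop bounded by i_me_range/2.
 * The real loops do motion search work; this one only counts
 * iterations. */
int iters_divide_in_condition(int i_me_range)
{
    int n = 0;
    for (int i = 0; i < i_me_range / 2; i++)
        n++;
    return n;
}

/* Same loop with the bound hoisted by hand into a local variable. */
int iters_hoisted_bound(int i_me_range)
{
    const int i_me_range_div2 = i_me_range / 2;
    int n = 0;
    for (int i = 0; i < i_me_range_div2; i++)
        n++;
    return n;
}

/* For non-negative values, dividing by 2 or 4 gives the same result
 * as shifting right by 1 or 2, and both loop forms run the same
 * number of iterations. */
void check_div_matches_shift(void)
{
    for (int r = 0; r <= 64; r++) {
        assert(r / 2 == r >> 1);
        assert(r / 4 == r >> 2);
        assert(iters_divide_in_condition(r) == iters_hoisted_bound(r));
    }
}
```

Compiling both functions with something like `gcc -O2 -S` and comparing the assembly is the direct way to confirm what a given compiler does with the divide.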
> common/mc.c
>
> line 247 and 278:
> int filter1 = (hpel1x & 1) + ( (hpel1y & 1) << 1 );
> could be
> int filter1 = (hpel1x & 1) | ( (hpel1y & 1) << 1 );
>
> replacing an addition with a bitwise OR, should save some CPU cycles?
Is there any modern CPU where ADD and OR are not equally fast?
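
A small sketch, assuming nothing beyond the expression quoted above: because (x & 1) only occupies bit 0 and ((y & 1) << 1) only occupies bit 1, the two operands never share a set bit, so ADD and OR produce the same value. The function names are hypothetical and exist only for this comparison.

```c
#include <assert.h>

/* Filter index as written in the quoted line from common/mc.c. */
int filter_index_add(int hpelx, int hpely)
{
    return (hpelx & 1) + ((hpely & 1) << 1);
}

/* Same index built with bitwise OR instead of addition. */
int filter_index_or(int hpelx, int hpely)
{
    return (hpelx & 1) | ((hpely & 1) << 1);
}

/* The operands occupy disjoint bits (bit 0 and bit 1), so no carry
 * can occur and the two forms agree for every input. */
void check_add_matches_or(void)
{
    for (int x = 0; x < 4; x++)
        for (int y = 0; y < 4; y++)
            assert(filter_index_add(x, y) == filter_index_or(x, y));
}
```

The change is therefore safe, but it only pays off on a CPU where OR is cheaper than ADD.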
> lines 314 to 317:
> const int cA = (8-d8x)*(8-d8y);
> const int cB = d8x *(8-d8y);
> const int cC = (8-d8x)*d8y;
> const int cD = d8x *d8y;
>
> could be rewritten as:
> int d8x_times8 = d8x * 8;
> int d8y_times8 = d8y * 8;
> const int cD = d8x * d8y;
> const int cC = d8y_times8 - cD;
> const int cB = d8x_times8 - cD;
> const int cA = 64 - d8x_times8 - d8y_times8 + cD;
>
> 4 subtractions and 4 multiplications are replaced by 1 multiplication, 4
> subtractions, 1 addition, and 2 bit shifts (the *8 should be optimized by
> the compiler to a 3-bit left shift, right?)
> In the SSE version it is accomplished with 2 subtractions and 4
> multiplications.
> I couldn't find how many cycles a multiplication takes, but additions,
> subtractions, and bit shifts take about 1 cycle each, right?
This does remove 3 imul instructions, but doesn't save any cycles on my
Athlon 64. Maybe it would help with the P4's slower imul?
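
A sketch for checking that the rewrite is at least arithmetically equivalent, assuming d8x and d8y are the 1/8-pel fractions in the range 0..7. The struct and function names are hypothetical; only the coefficient expressions come from the quoted code, with d8y_times8 computed from d8y.

```c
#include <assert.h>

/* Hypothetical container for the four bilinear weights. */
typedef struct { int cA, cB, cC, cD; } chroma_weights;

/* Weights as in the quoted original: four multiplications. */
chroma_weights weights_original(int d8x, int d8y)
{
    chroma_weights w;
    w.cA = (8 - d8x) * (8 - d8y);
    w.cB = d8x * (8 - d8y);
    w.cC = (8 - d8x) * d8y;
    w.cD = d8x * d8y;
    return w;
}

/* Proposed rewrite: one general multiplication plus two
 * multiplications by 8, which the compiler can turn into shifts. */
chroma_weights weights_rewritten(int d8x, int d8y)
{
    const int d8x_times8 = d8x * 8;
    const int d8y_times8 = d8y * 8;
    chroma_weights w;
    w.cD = d8x * d8y;
    w.cC = d8y_times8 - w.cD;
    w.cB = d8x_times8 - w.cD;
    w.cA = 64 - d8x_times8 - d8y_times8 + w.cD;
    return w;
}

/* Both forms agree for every fraction, and the weights sum to 64. */
void check_weights_match(void)
{
    for (int x = 0; x < 8; x++) {
        for (int y = 0; y < 8; y++) {
            chroma_weights a = weights_original(x, y);
            chroma_weights b = weights_rewritten(x, y);
            assert(a.cA == b.cA && a.cB == b.cB);
            assert(a.cC == b.cC && a.cD == b.cD);
            assert(b.cA + b.cB + b.cC + b.cD == 64);
        }
    }
}
```

Equivalence says nothing about speed. Whether the second form is faster depends on the target CPU's multiply latency and has to be measured.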
--Loren Merritt