[x264-devel] SSE4 Optimizations
Loren Merritt
lorenm at u.washington.edu
Sun Jan 13 00:34:37 CET 2008
On Thu, 20 Dec 2007, Mike Kazmier wrote:
> Has anyone looked at or considered optimizations for SSE4 as described
> here: http://softwarecommunity.intel.com/articles/eng/1246.htm
>
> It looks quite promising. I don't personally have the skills to do
> this, but can supply the hardware if anyone wants to take a stab at it.
There are 3 possible approaches:
1) use MPSADBW in full search, as proposed in the above webpage,
2) use MPSADBW in a new iterative search,
3) use PHMINPOSUW in the existing iterative searches.
Full search can be further divided into:
  - PSADBW ESA
  - MPSADBW ESA
  - PSADBW SEA (current x264)
  - MPSADBW SEA
("ESA" = exhaustive, "SEA" = successive elimination. They have identical
results, but SEA is algorithmically faster.)
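To illustrate why the two give identical results, here is a minimal scalar sketch in C. It is not x264's code: the names full_search, sad_block and sum_block are made up for this example, motion vector cost is ignored, and the block sums are recomputed per candidate rather than precomputed as a real implementation would do. The elimination relies on the bound |sum(cur) - sum(cand)| <= SAD(cur, cand).

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between two w x h pixel blocks. */
static int sad_block(const uint8_t *a, int a_stride,
                     const uint8_t *b, int b_stride, int w, int h)
{
    int sum = 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            sum += abs(a[y * a_stride + x] - b[y * b_stride + x]);
    return sum;
}

/* Sum of all pixels in a w x h block. */
static int sum_block(const uint8_t *p, int stride, int w, int h)
{
    int sum = 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            sum += p[y * stride + x];
    return sum;
}

/* Hypothetical full search over candidate positions (mx, my) with
 * 0 <= mx, my < range, measured from the top-left of ref.
 * use_sea == 0: exhaustive (ESA), every candidate gets a SAD.
 * use_sea != 0: successive elimination (SEA), candidates whose lower
 *               bound cannot beat the current best are skipped.
 * Returns the best SAD, stores its position, and counts SADs computed. */
static int full_search(const uint8_t *cur, int cur_stride,
                       const uint8_t *ref, int ref_stride,
                       int w, int h, int range, int use_sea,
                       int *best_x, int *best_y, int *sads_done)
{
    assert(range > 0);
    const int cur_sum = sum_block(cur, cur_stride, w, h);
    int best = INT_MAX;
    *best_x = *best_y = 0;
    *sads_done = 0;
    for (int my = 0; my < range; my++) {
        for (int mx = 0; mx < range; mx++) {
            const uint8_t *cand = ref + my * ref_stride + mx;
            if (use_sea) {
                /* The bound never exceeds the true SAD, so skipping when
                 * bound >= best cannot discard a strictly better candidate. */
                int bound = abs(cur_sum - sum_block(cand, ref_stride, w, h));
                if (bound >= best)
                    continue;
            }
            int cost = sad_block(cur, cur_stride, cand, ref_stride, w, h);
            (*sads_done)++;
            if (cost < best) {
                best = cost;
                *best_x = mx;
                *best_y = my;
            }
        }
    }
    return best;
}
```

Calling full_search twice on the same data, once with use_sea = 0 and once with use_sea = 1, should return the same SAD and position, with the SEA run reporting no more SADs in sads_done than the ESA run.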
MPSADBW ESA is definitely faster than PSADBW ESA, but both are slower than
PSADBW SEA. I previously thought MPSADBW SEA was impossible, but after
some experimentation I revise that. MPSADBW evaluates 8 consecutive mvs,
and successive elimination in general leaves sparse mvs, but in practice
they're sufficiently clumped that MPSADBW should work with only a moderate
amount of wasted computation. Still, the SAD part of the search accounts
for less cpu-time than the successive elimination part, so don't expect
anything major here.
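As a rough illustration of the "8 consecutive mvs" point, the following C sketch models in scalar code what one MPSADBW computes for a fixed choice of offsets (the real instruction is reachable through the SSE4.1 intrinsic _mm_mpsadbw_epu8 and also takes an immediate selecting which 4-byte groups to use). The helper count_batches is purely hypothetical: it assumes a greedy policy of starting an 8-wide batch at each surviving mv, and estimates how many of the evaluated positions are wasted for a given row of SEA survivors.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of one MPSADBW: the SAD of a 4-byte group from the current
 * block against 8 consecutive 4-byte windows of the reference row.
 * ref must provide 11 bytes (offsets 0..7, each window 4 bytes wide). */
static void mpsadbw_model(const uint8_t cur[4], const uint8_t ref[11],
                          uint16_t out[8])
{
    for (int i = 0; i < 8; i++) {
        int sum = 0;
        for (int j = 0; j < 4; j++)
            sum += abs(ref[i + j] - cur[j]);
        out[i] = (uint16_t)sum;
    }
}

/* Hypothetical cost estimate: survives[i] is nonzero if mv i in a row was
 * not eliminated. Each batch starts at the next surviving mv and covers 8
 * consecutive positions. Returns the number of batches and stores how many
 * evaluated positions were not survivors (wasted work). */
static int count_batches(const uint8_t *survives, int n, int *wasted)
{
    assert(n >= 0);
    int batches = 0, kept = 0;
    for (int i = 0; i < n; ) {
        if (!survives[i]) {
            i++;
            continue;
        }
        batches++;
        for (int k = i; k < i + 8 && k < n; k++)
            kept += survives[k] != 0;
        i += 8;
    }
    *wasted = batches * 8 - kept;
    return batches;
}
```

The more clumped the survivors are, the smaller the wasted count per batch; fully scattered survivors would waste 7 of every 8 positions.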
New iterative search: I'm not exactly sure what form it should take.
Perhaps an 8x5 rectangle (with only 3 rows, the other 2 being searched
after the iterative part halts.) This should have convergence properties
similar to Hex search, but whether it ends up faster will be determined by
exactly how fast MPSADBW is compared to independent SADs.
PHMINPOSUW gets ugly, because it doesn't replace a DSP function, it
replaces code that's optimally written in inline C when not using SSE4.
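For reference, PHMINPOSUW (intrinsic _mm_minpos_epu16) returns the minimum of 8 unsigned 16-bit words together with the index of that minimum. A scalar model in C, with the hypothetical name phminposuw_model, might look like the sketch below. It shows the kind of compare-and-update loop the instruction would stand in for; it is not taken from x264.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of PHMINPOSUW: returns the smallest of 8 unsigned 16-bit
 * costs and stores its position in *index. On ties the lowest index wins,
 * because a later value replaces the best only if strictly smaller. */
static unsigned phminposuw_model(const uint16_t v[8], unsigned *index)
{
    assert(index != NULL);
    unsigned best = v[0], pos = 0;
    for (unsigned i = 1; i < 8; i++) {
        if (v[i] < best) {
            best = v[i];
            pos = i;
        }
    }
    *index = pos;
    return best;
}
```

Note that the costs have to be gathered into one 8-element vector before the instruction can be used, which is part of what makes it awkward to drop into code that currently compares each cost as soon as it is computed.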
--Loren Merritt