[x265] [PATCH] Fixed 32 bit bug in intrapred dc4 sse2

Steve Borho steve at borho.org
Thu Feb 26 20:18:11 CET 2015


On 02/26, dave wrote:
> FYI,
> 
> I forgot to comment in the commit message:
> 
> I kept the original code for 64-bit because, while using r2b works
> in 64-bit mode, using it hurt performance so severely that the
> routine fell well below the C code.
> 
> This is probably due to how things like instruction order, length,
> and layout in memory can affect performance (see Agner Fog's docs).
> 
> This probably leaves some room for performance improvements, since
> most x265 assembly is written primarily to implement x265
> functionality and does not take all aspects of processor behavior
> into account. While writing optimized assembly for every processor
> is unrealistic, the assembly for each SIMD level could be optimized
> for the latest processor that tops out at that level (i.e. the SSE4
> assembly could be optimized for the latest processor that supports
> only up to SSE4).
> 
> One downside is that these types of optimizations are more likely
> to produce code that looks like the jumbled output of a compiler,
> and is thus harder to read, understand, and maintain. Of course,
> more comments can help here.
> 
> Is x265 interested in such optimizations?

Off-hand, I would suggest we only optimize routines for older CPUs if
there is another routine using a higher SIMD architecture that is
optimized for newer CPUs (for instance, only if the SSE4 primitive is
also covered by an AVX primitive).  I.e., only optimize (for older
CPUs) primitives that are never used by newer CPUs.

Over time, the older CPUs become less and less relevant, and we don't
want to rewrite these same primitives again to optimize for newer
architectures.
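
As a rough sketch of that rule (this is not x265's actual dispatch
code; the SIMD levels, primitive names, and table contents below are
all made up for illustration), assume each primitive is served by the
highest SIMD level that implements it. A routine is then a candidate
for old-CPU tuning only when the newest CPUs pick a different routine:

```cpp
#include <cassert>

// Hypothetical SIMD levels, ordered from oldest to newest.
enum SimdLevel { LEVEL_C = 0, LEVEL_SSE2, LEVEL_SSE4, LEVEL_AVX2, LEVEL_COUNT };

// Hypothetical primitive slots; a real encoder has many more.
enum Primitive { PRIM_INTRA_DC4 = 0, PRIM_SAD_16x16, PRIM_COUNT };

// implemented[level][prim] is true when a routine exists for that
// primitive at that SIMD level. These entries are invented examples.
constexpr bool implemented[LEVEL_COUNT][PRIM_COUNT] = {
    /* C    */ { true,  true  },
    /* SSE2 */ { true,  false },
    /* SSE4 */ { true,  true  },
    /* AVX2 */ { false, true  },
};

// The routine a CPU would run if maxLevel is the best level it
// supports: the highest level at or below maxLevel that implements
// prim, falling back to plain C.
SimdLevel selected(Primitive prim, SimdLevel maxLevel)
{
    for (int l = maxLevel; l > LEVEL_C; --l)
        if (implemented[l][prim])
            return static_cast<SimdLevel>(l);
    return LEVEL_C;
}

// True when the routine for prim at `level` is overridden on the
// newest CPUs, so tuning it for older CPUs cannot slow newer ones.
bool safeToTuneForOldCpus(Primitive prim, SimdLevel level)
{
    assert(implemented[level][prim]); // only ask about existing routines
    return selected(prim, LEVEL_AVX2) != level;
}
```

With this made-up table, the SSE4 version of PRIM_SAD_16x16 is fair
game because AVX2 CPUs use their own routine, while the SSE4 version
of PRIM_INTRA_DC4 is not, since AVX2 CPUs still fall back to it.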

-- 
Steve Borho

