[x265] [PATCH 02 of 12] asm: interp_4tap_vert_ps_4x2 sse2

Steve Borho steve at borho.org
Tue May 19 02:27:41 CEST 2015


On 05/18, dave wrote:
> On 05/18/2015 09:42 AM, chen wrote:
> > [MC] yes, it is faster on AMD CPUs; on Intel, these instructions
> > choke Port5, while PADD executes on Port1.  I often choose the faster
> > instruction for Intel because my PC uses an Intel CPU
> >
> ... and of course, while I don't follow it very closely, I do believe Intel
> still dominates the market.
> 
> Do we have any way to determine what the target build is?  Something like..
> 
> %if INTEL
>     optimal intel code
> %elif AMD
>     optimal amd code
> %endif

The *build* machine should not matter at all; it is a question of
runtime detection and selection of the best routines for each CPU.
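
A minimal sketch of what runtime selection looks like, assuming
hypothetical names (the SK_CPU_* flags, sk_primitives and
sk_setup_primitives are made up for illustration and are not x265's
actual API). The flags would normally come from CPUID at startup; here
they are passed in so the selection logic stands on its own:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bits; not x265's actual flag names. */
enum {
    SK_CPU_SSE2 = 1u << 0,
    SK_CPU_SSE4 = 1u << 1
};

typedef int (*sk_sum_fn)(const int16_t *src, size_t n);

/* Portable reference routine. */
static int sk_sum_c(const int16_t *src, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += src[i];
    return sum;
}

/* Stand-ins for assembly routines. In a real encoder each of these
 * would be a separate hand-written SSE2 or SSE4 implementation. */
static int sk_sum_sse2(const int16_t *src, size_t n) { return sk_sum_c(src, n); }
static int sk_sum_sse4(const int16_t *src, size_t n) { return sk_sum_c(src, n); }

/* Hypothetical function table, filled once at startup. */
typedef struct {
    sk_sum_fn sum;
} sk_primitives;

/* Start from the C fallback, then let each more capable instruction
 * set override the previous choice. */
static void sk_setup_primitives(sk_primitives *p, uint32_t cpu_flags)
{
    assert(p != NULL);
    p->sum = sk_sum_c;
    if (cpu_flags & SK_CPU_SSE2)
        p->sum = sk_sum_sse2;
    if (cpu_flags & SK_CPU_SSE4)
        p->sum = sk_sum_sse4;
}
```

All callers go through the table, so the same binary picks different
routines on different machines, and a CPU that reports SSE4 never runs
the SSE2 routine.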

x264 does have a lot more knowledge than x265 about older CPUs (since
their code base is more than a decade older). They have this code in
common/cpu.c:

if( ecx&0x00000040 ) /* SSE4a, AMD only */
{
    int family = ((eax>>8)&0xf) + ((eax>>20)&0xff);
    cpu |= X264_CPU_SSE2_IS_FAST;      /* Phenom and later CPUs have fast SSE units */
    if( family == 0x14 )
    {
        cpu &= ~X264_CPU_SSE2_IS_FAST; /* SSSE3 doesn't imply fast SSE anymore... */
        cpu |= X264_CPU_SSE2_IS_SLOW;  /* Bobcat has 64-bit SIMD units */
        cpu |= X264_CPU_SLOW_PALIGNR;  /* palignr is insanely slow on Bobcat */
    }
    if( family == 0x16 )
    {
        cpu |= X264_CPU_SLOW_PSHUFB;   /* Jaguar's pshufb isn't that slow, but it's slow enough
                                        * compared to alternate instruction sequences that this
                                        * is equal or faster on almost all such functions. */
    }
}

They keep track of particularly slow instructions, then write different
versions of key functions or macros for both types of CPUs.  I don't
believe x265 needs to support such differentiation for decade-old CPUs,
but this is generally how it has to be done.
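
A rough sketch of that pattern, under stated assumptions: the SK_*
names and the byte-reversal routines below are hypothetical stand-ins,
not x264 or x265 code. Only the family arithmetic is taken from the
excerpt above, and the caller is assumed to have already established
that the CPU is an AMD part, as the SSE4a check does there:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical "slow instruction" bits, loosely modelled on the
 * X264_CPU_SLOW_* flags quoted above. */
enum {
    SK_SLOW_PALIGNR = 1u << 0,
    SK_SLOW_PSHUFB  = 1u << 1
};

/* Same arithmetic as the x264 excerpt: base family (bits 8..11) plus
 * extended family (bits 20..27) of CPUID leaf 1 EAX. */
static unsigned sk_amd_family(uint32_t eax)
{
    return ((eax >> 8) & 0xf) + ((eax >> 20) & 0xff);
}

/* Map an AMD family to the instructions worth avoiding on it. */
static uint32_t sk_slow_flags(uint32_t eax)
{
    unsigned family = sk_amd_family(eax);
    uint32_t flags = 0;
    if (family == 0x14)              /* Bobcat */
        flags |= SK_SLOW_PALIGNR;
    if (family == 0x16)              /* Jaguar */
        flags |= SK_SLOW_PSHUFB;
    return flags;
}

typedef void (*sk_reverse_fn)(uint8_t dst[16], const uint8_t src[16]);

/* Stand-in for a pshufb-style routine: table-driven byte permutation. */
static void sk_reverse_shuffle(uint8_t dst[16], const uint8_t src[16])
{
    static const uint8_t idx[16] = { 15, 14, 13, 12, 11, 10, 9, 8,
                                     7, 6, 5, 4, 3, 2, 1, 0 };
    assert(dst != src);              /* in-place use is not supported */
    for (int i = 0; i < 16; i++)
        dst[i] = src[idx[i]];
}

/* Stand-in for the alternate sequence that avoids the slow instruction. */
static void sk_reverse_plain(uint8_t dst[16], const uint8_t src[16])
{
    assert(dst != src);              /* in-place use is not supported */
    for (int i = 0; i < 16; i++)
        dst[15 - i] = src[i];
}

/* Both variants produce identical output; only the slow flag decides
 * which one a given CPU runs. */
static sk_reverse_fn sk_pick_reverse(uint32_t slow_flags)
{
    return (slow_flags & SK_SLOW_PSHUFB) ? sk_reverse_plain
                                         : sk_reverse_shuffle;
}
```

The cost of this approach is maintaining two versions of each affected
function, which is why it only pays off for key routines.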

My only point is that, at this point in time, the vast majority of
non-SSE4-capable CPUs are probably made by AMD, so it is OK to tune
for AMD when writing SSE2 and SSE3 functions which have SSE4 or higher
counterparts.

This patch series looks OK. I've queued it locally for testing and will
probably push it soon. Go ahead and make any follow-up changes as new
patches.

-- 
Steve Borho

