[x265] [PATCH 02 of 12] asm: interp_4tap_vert_ps_4x2 sse2

dave dtyx265 at gmail.com
Tue May 19 02:49:50 CEST 2015


On 05/18/2015 05:27 PM, Steve Borho wrote:
> On 05/18, dave wrote:
>> On 05/18/2015 09:42 AM, chen wrote:
>>> [MC] Yes, it is faster on AMD CPUs; on Intel, these instructions
>>> choke Port 5, while PADD executes on Port 1.  I often choose the
>>> faster instruction for Intel because my PC uses an Intel CPU
>>>
>> ... and of course, while I don't follow it very closely, I do believe Intel
>> still dominates the market.
>>
>> Do we have any way to determine what the target build is?  Something like..
>>
>> %if INTEL
>>      optimal intel code
>> %elif AMD
>>      optimal amd code
>> %endif
> The *build* machine should not matter at all, it is a question of
> runtime detection and selection of the best routines for each CPU.
I was thinking more of INTEL/AMD as a build-time option, particularly if
the differences are small tweaks like this one.
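
A rough sketch of that idea on the C side, just to illustrate. The macro
TUNE_AMD and the names below are invented for this example; nothing like
them exists in the x265 build today, and the asm equivalent would use the
assembler's own conditional directives.

```c
/* Hypothetical build-time vendor tuning switch (not an existing x265 option).
 * The build system would pass -DTUNE_AMD=1 to tune for AMD; the default
 * below tunes for Intel. */
#ifndef TUNE_AMD
#define TUNE_AMD 0
#endif

enum tune_target { TUNE_FOR_INTEL, TUNE_FOR_AMD };

/* Reports which variant was compiled in. In real code the two branches
 * would hold the differently tuned instruction sequences. */
static enum tune_target build_tune_target(void)
{
#if TUNE_AMD
    return TUNE_FOR_AMD;
#else
    return TUNE_FOR_INTEL;
#endif
}
```

The obvious drawback is that a binary built this way is tuned for one
vendor no matter which CPU it ends up running on.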
>
> x264 does have a lot more knowledge than x265 about older CPUs (since
> their code base is more than a decade older). They have this code in
> common/cpu.c:
>
> if( ecx&0x00000040 ) /* SSE4a, AMD only */
> {
>      int family = ((eax>>8)&0xf) + ((eax>>20)&0xff);
>      cpu |= X264_CPU_SSE2_IS_FAST;      /* Phenom and later CPUs have fast SSE units */
>      if( family == 0x14 )
>      {
>          cpu &= ~X264_CPU_SSE2_IS_FAST; /* SSSE3 doesn't imply fast SSE anymore... */
>          cpu |= X264_CPU_SSE2_IS_SLOW;  /* Bobcat has 64-bit SIMD units */
>          cpu |= X264_CPU_SLOW_PALIGNR;  /* palignr is insanely slow on Bobcat */
>      }
>      if( family == 0x16 )
>      {
>          cpu |= X264_CPU_SLOW_PSHUFB;   /* Jaguar's pshufb isn't that slow, but it's slow enough
>                                          * compared to alternate instruction sequences that this
>                                          * is equal or faster on almost all such functions. */
>      }
> }
>
> They keep track of particularly slow instructions, then write different
> versions of key functions or macros for both types of CPUs.  I don't
> believe x265 needs to support such differentiation for decade-old CPUs,
> but this is generally how it has to be done.
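For anyone following along, the runtime selection described above could be
sketched roughly like this. It is only an illustration: the flag names, the
kernel names and the selection function are made up for the example and are
not the actual x264 or x265 API, and the "kernels" are plain C stand-ins
for what would really be assembly routines.

```c
#include <stdint.h>

/* Hypothetical capability flags, in the spirit of the X264_CPU_* bits. */
enum {
    CPU_SSSE3       = 1 << 0,
    CPU_SLOW_PSHUFB = 1 << 1  /* set by detection code where pshufb is slow */
};

typedef int (*kernel_fn)(const int16_t *src);

/* Stand-ins for three implementations of the same primitive. All of them
 * compute the same result; only the instruction mix would differ. */
static int kernel_c(const int16_t *s)            { return s[0] + s[1] + s[2] + s[3]; }
static int kernel_ssse3_pshufb(const int16_t *s) { return s[0] + s[1] + s[2] + s[3]; }
static int kernel_ssse3_noshuf(const int16_t *s) { return s[0] + s[1] + s[2] + s[3]; }

/* Choose a routine once at startup from the detected CPU flags. */
static kernel_fn select_kernel(uint32_t cpu)
{
    if (!(cpu & CPU_SSSE3))
        return kernel_c;             /* no SIMD variant available */
    if (cpu & CPU_SLOW_PSHUFB)
        return kernel_ssse3_noshuf;  /* avoid the slow instruction */
    return kernel_ssse3_pshufb;      /* default SSSE3 path */
}
```

The caller stores the returned pointer once and uses it from then on, so
the same binary picks the better variant on whatever CPU it runs on.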
>
> My only point is that, at this point in time, the vast majority of
> non-SSE4 capable CPUs are probably made by AMD and so it is ok to tune
> for AMD when writing SSE2 and SSE3 functions which have SSE4 or higher
> counterparts.
>
> This patch series looks ok, I've queued it locally for testing and will
> probably push it soon. Go ahead and make any followup changes as new
> patches.
OK, once it is pushed I will submit a follow up patch.


