[x264-devel] [PATCH 2/3] RFC: checkasm: Warn if a better SIMD function is slower than the simpler one

Fri Aug 14 08:28:17 CEST 2015

On Fri, 14 Aug 2015, Janne Grunau wrote:

> On 2015-08-14 00:26:55 +0300, Martin Storsjö wrote:
>> On Thu, 13 Aug 2015, Henrik Gramner wrote:
>>
>>> On Thu, Aug 13, 2015 at 11:00 PM, Martin Storsjö <martin at martin.st> wrote:
>>>> ---
>>>> This naively assumes that a later tested SIMD function is supposed
>>>> to be better than the earlier ones - this probably doesn't
>>>> hold for all x86 SIMD flags.
>>>
>>> This would most likely result in a huge amount of false positives.
>>> There are plenty of AVX functions for example that are neither slower
>>> nor faster than non-AVX functions on many CPUs which would often
>>> trigger the warning since the cycle counter can drift a bit from run
>>> to run for multiple reasons.
>>
>> Yeah, I guess so. With some amount of margin it might be more useful
>> though (e.g. N * nop?). Even though it's prone to false positives,
>> it can also be a useful hint to investigate things - at least for
>> arm I found a few surprises where the C versions were faster than
>
> Have you looked at the functions with slower asm than C? I guess 4 pixel
> wide prediction functions are a likely target. Even on ARM this is
> probably CPU dependent.

Yes, pretty much.

Some of the existing functions that are slower than the C version, on A8:

coeff_last15_c: 368
coeff_last15_neon: 400
coeff_last16_c: 396
coeff_last16_neon: 400
intra_predict_4x4_dct_c: 190
intra_predict_4x4_dct_neon: 211
intra_predict_8x8_dc_c: 460
intra_predict_8x8_dc_neon: 522
intra_predict_8x8_v_c: 210
intra_predict_8x8_v_neon: 300
intra_predict_8x8c_dc_c: 520
intra_predict_8x8c_dc_neon: 534
intra_predict_8x8c_dcl_c: 380
intra_predict_8x8c_dcl_neon: 453
intra_predict_16x16_dcl_c: 912
intra_predict_16x16_dcl_neon: 1268

Additionally on A9 (for the sad functions, the NEON version is slower
than the armv6 one, not slower than C):

memzero_aligned_c: 1247
memzero_aligned_neon: 1738
sad_4x8_c: 1378
sad_4x8_armv6: 521
sad_4x8_neon_fast_mrc: 549
sad_aligned_4x4_c: 776
sad_aligned_4x4_armv6: 199
sad_aligned_4x4_neon_fast_mrc: 297
sad_aligned_4x8_c: 1376
sad_aligned_4x8_armv6: 337
sad_aligned_4x8_neon_fast_mrc: 537

Most of these aren't significantly slower than the simpler version though,
so one probably doesn't need to do anything about them, except possibly
intra_predict_16x16_dcl_neon, which is almost 40% slower than C on A8
(1268 vs. 912).
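
To illustrate the margin idea discussed above, here is a rough sketch of
a check that only flags a function when the SIMD version is slower than
the reference by more than a given percentage. This is not what the patch
does; the names (bench_pair, slower_by_margin, report_slower) and the 10%
margin are made up for illustration, and the table just repeats the A8
numbers from the list above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record: one function, timed as reference and as SIMD. */
struct bench_pair {
    const char *name;
    int ref_cycles;   /* simpler version, e.g. C */
    int simd_cycles;  /* supposedly better version, e.g. NEON */
};

/* Returns 1 if simd is slower than ref by more than margin_pct percent.
 * Integer form of: simd > ref * (1 + margin_pct / 100). */
static int slower_by_margin(int ref_cycles, int simd_cycles, int margin_pct)
{
    return (long long)simd_cycles * 100 >
           (long long)ref_cycles * (100 + margin_pct);
}

/* Prints every pair that exceeds the margin and returns how many did. */
static int report_slower(const struct bench_pair *b, int n, int margin_pct)
{
    int flagged = 0;
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (slower_by_margin(b[i].ref_cycles, b[i].simd_cycles, margin_pct)) {
            printf("%s: %d vs %d (+%d%%)\n", b[i].name,
                   b[i].ref_cycles, b[i].simd_cycles,
                   (b[i].simd_cycles - b[i].ref_cycles) * 100 / b[i].ref_cycles);
            flagged++;
        }
    }
    return flagged;
}

/* The A8 C vs. NEON numbers quoted in this mail. */
static const struct bench_pair a8_results[] = {
    { "coeff_last15",             368,  400 },
    { "coeff_last16",             396,  400 },
    { "intra_predict_4x4_dct",    190,  211 },
    { "intra_predict_8x8_dc",     460,  522 },
    { "intra_predict_8x8_v",      210,  300 },
    { "intra_predict_8x8c_dc",    520,  534 },
    { "intra_predict_8x8c_dcl",   380,  453 },
    { "intra_predict_16x16_dcl",  912, 1268 },
};
```

Calling report_slower(a8_results, 8, 10) would skip the coeff_last and
8x8c_dc entries, which are within 10% of C, and list the rest. Whether a
fixed percentage is a good enough margin against cycle counter drift is a
separate question.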

// Martin