[x265] [PATCH 2 of 2] asm:intra pred planar32 sse2 high bit

dave dtyx265 at gmail.com
Tue Mar 10 23:01:08 CET 2015


On 03/10/2015 12:12 PM, Steve Borho wrote:
> On 03/10, dave wrote:
>>>> This produces some interesting numbers.
>> sorry, I mixed these two up.
>>>>>> incorrect:Without using registers for constants
>>>>>> with using registers
>>>>>> x265 [info]: I32: Intra 100%(DC 0% P 40% Ang 58%)
>>>>>>
>>>>>> encoded 2000 frames in 95.98s (20.84 fps), 1020.04 kb/s
>>>>>>
>>>>>> incorrect:With using registers for constants
>>>>>> without using registers
>>>>>> x265 [info]: I32: Intra 99%(DC 39% P 16% Ang 43%)
>>>>>>
>>>>>> encoded 2000 frames in 93.10s (21.48 fps), 1008.63 kb/s
>>>>>>
>>>>>> I just added --cu-stats to the same command options that I used
>>>>>> previously and I ran it several times and got exactly the same
>>>>>> percentages.  Times varied by less than a second for each build.  So
>>>>>> how can simple register usage in one primitive affect intra pred
>>>>>> decisions?
>>>>> it shouldn't, the behavior must be wrong in one of the cases. no change
>>>>> in performance should be able to impact the encoder output (or any
>>>>> coding decisions)
>>>>>
>>>> So execution time isn't directly measured for decision making?
>>>>
>>>> The output is also different.
>>>>
>>>> ls -l bridge-close*
>>>> -rw-r--r-- 1 shakezula shakezula 8432204 Mar 10 09:25 bridge-close1.y4m
>>>> -rw-r--r-- 1 shakezula shakezula 8527219 Mar 10 07:49 bridge-close.y4m
>>>>
>>>>   bridge-close1.y4m was generated without the use of registers to hold
>>>> constants.
>>> yeah, definitely a bug in one of the two versions and if the testbench
>>> doesn't catch it that's really bad.
>> I am using the same source tree for both so the only difference is
>> the register usage.
>>
>> The unpatched tip, which is going to use c code for planar32,
>> produces the same intra pred decision percentages as not using
>> registers for constants but different encoded output.
>>
>> x265 [info]: I32: Intra 99%(DC 39% P 16% Ang 43%)
>>
>> encoded 2000 frames in 101.82s (19.64 fps), 1008.64 kb/s
>>
>> ls -l bridge-close.*
>> -rw-r--r-- 1 shakezula shakezula   8432239 Mar 10 10:03 bridge-close.hevc
>>
>> The reconstructed output of all three looks the same.
>>
>> Just to test for overflow I modified the testbench to test with all
>> maximum 10-bit values of 0x3FF instead of random values and it
>> passes.  One more bit, 0x4FF, and it fails.  Though the y4m file has
>> 8 bit depth.
> this sounds like your outputs would be non-deterministic if you just ran
> the same encode multiple times? That would be a different class of bug,
> perhaps unrelated to your work on the intra primitives.
>
> I don't think we often check for non-determinism on older architectures.
> we regularly test --no-asm against fully optimized outputs but this
> only tests primitives normally used on our test machines.
According to Agner (The microarchitecture of Intel and AMD CPUs, p. 169),
there is a non-deterministic aspect of my processor, but it should only
affect execution time, not output.  Most of the primitives that I have
worked on give variable timings in the testbench when run repeatedly.
A few even seem to alternate randomly between two distinct execution
times, something that you might expect from Agner's findings.
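
When timings jump between two values like this, one way to compare builds
is to look at the minimum and median of many runs rather than a single
number.  A minimal sketch, not part of the x265 testbench;
summarize_timings is a hypothetical helper and the sample values in the
note below are made up for illustration:

```python
import statistics


def summarize_timings(samples):
    """Return (minimum, median, spread) for a list of per-run timings.

    The minimum approximates the best case, the median is robust against
    a run that lands in the slower of two modes, and the spread shows how
    far apart the fastest and slowest runs are.
    """
    ordered = sorted(samples)
    return ordered[0], statistics.median(ordered), ordered[-1] - ordered[0]
```

For example, summarize_timings([100, 140, 100, 140, 100]) gives a minimum
of 100, a median of 100 and a spread of 40, so the two modes are visible
in the spread without distorting the typical value.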

The qp and bits generated are consistent across repeated encodes.  The
qp is mostly consistent across builds, varying by 0.01 if at all, but
the bits vary more across builds.
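
To check determinism more strictly than by file size, the outputs of
repeated encodes can be compared byte for byte through a hash.  A minimal
sketch, assuming the encoded files are already on disk; file_digest and
outputs_identical are hypothetical helper names, not x265 tools:

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_identical(paths):
    """True if every file in paths has the same contents.

    An empty list yields False, since there is nothing to compare.
    """
    return len({file_digest(path) for path in paths}) == 1
```

Running outputs_identical on the bitstreams from several runs of one
build tests run-to-run determinism; running it on bitstreams from
different builds tests whether a primitive changed the encoder output.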

After all this, which is preferred: constants copied to registers or
used from memory?  The testbench says memory runs faster (I believe they
are cached in a temp register, see Agner).  Encodes are less conclusive
since each build doesn't use planar32 equally.
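
On the 0x3FF versus 0x4FF testbench result quoted above, a possible
explanation can be sketched with plain arithmetic.  This assumes the
standard HEVC planar formula, where the four weights for a 32x32 block
always sum to 64 and a rounding offset of 32 is added before the shift,
and it assumes the SSE2 code accumulates that sum in unsigned 16-bit
lanes, which I have not verified against the patch.  planar_worst_sum is
a hypothetical helper written only for this check:

```python
def planar_worst_sum(block_size, max_sample):
    """Largest pre-shift planar sum when every neighbour equals max_sample."""
    n = block_size
    worst = 0
    for y in range(n):
        for x in range(n):
            total = ((n - 1 - x) * max_sample    # weight on left[y]
                     + (x + 1) * max_sample      # weight on top-right
                     + (n - 1 - y) * max_sample  # weight on top[x]
                     + (y + 1) * max_sample      # weight on bottom-left
                     + n)                        # rounding offset
            worst = max(worst, total)
    return worst


def fits_in_u16(value):
    """True if value is representable in an unsigned 16-bit lane."""
    return 0 <= value <= 0xFFFF


for sample in (0x3FF, 0x4FF):
    worst = planar_worst_sum(32, sample)
    print(hex(sample), worst, fits_in_u16(worst))
# prints:
# 0x3ff 65504 True
# 0x4ff 81888 False
```

Under those assumptions the worst-case sum for 10-bit input is 65504,
just below 65536, so all-0x3FF input passes, while any neighbour value
above 0x3FF can overflow.  That would be expected behaviour for a 10-bit
primitive rather than a sign of a bug.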


