[x265] [PATCH 2 of 2] asm:intra pred planar32 sse2 high bit

dave dtyx265 at gmail.com
Wed Mar 11 01:21:16 CET 2015


On 03/10/2015 04:17 PM, chen wrote:
> At 2015-03-11 06:01:08,dave <dtyx265 at gmail.com> wrote:
> >On 03/10/2015 12:12 PM, Steve Borho wrote:
> >> On 03/10, dave wrote:
> >>>>> This produces some interesting numbers.
> >>> sorry, I mixed these two up.
> >>>>>>> incorrect:Without using registers for constants
> >>>>>>> with using registers
> >>>>>>> x265 [info]: I32: Intra 100%(DC 0% P 40% Ang 58%)
> >>>>>>>
> >>>>>>> encoded 2000 frames in 95.98s (20.84 fps), 1020.04 kb/s
> >>>>>>>
> >>>>>>> incorrect:With using registers for constants
> >>>>>>> without using registers
> >>>>>>> x265 [info]: I32: Intra 99%(DC 39% P 16% Ang 43%)
> >>>>>>>
> >>>>>>> encoded 2000 frames in 93.10s (21.48 fps), 1008.63 kb/s
> >>>>>>>
> >>>>>>> I just added --cu-stats to the same command options that I used
> >>>>>>> previously and I ran it several times and got exactly the same
> >>>>>>> percentages.  Times varied by less than a second for each build.  So
> >>>>>>> how can simple register usage in one primitive affect intra pred
> >>>>>>> decisions?
> >>>>>> it shouldn't, the behavior must be wrong in one of the cases. no change
> >>>>>> in performance should be able to impact the encoder output (or any
> >>>>>> coding decisions)
> >>>>>>
> >>>>> So execution time isn't directly measured for decision making?
> >>>>>
> >>>>> The output is also different.
> >>>>>
> >>>>> ls -l bridge-close*
> >>>>> -rw-r--r-- 1 shakezula shakezula 8432204 Mar 10 09:25 bridge-close1.y4m
> >>>>> -rw-r--r-- 1 shakezula shakezula 8527219 Mar 10 07:49 bridge-close.y4m
> >>>>>
> >>>>>   bridge-close1.y4m was generated without the use of registers to hold
> >>>>> constants.
> >>>> yeah, definitely a bug in one of the two versions and if the testbench
> >>>> doesn't catch it that's really bad.
> >>> I am using the same source tree for both so the only difference is
> >>> the register usage.
> >>>
> >>> The unpatched tip, which is going to use c code for planar32,
> >>> produces the same intra pred decision percentages as not using
> >>> registers for constants but different encoded output.
> >>>
> >>> x265 [info]: I32: Intra 99%(DC 39% P 16% Ang 43%)
> >>>
> >>> encoded 2000 frames in 101.82s (19.64 fps), 1008.64 kb/s
> >>>
> >>> ls -l bridge-close.*
> >>> -rw-r--r-- 1 shakezula shakezula   8432239 Mar 10 10:03 bridge-close.hevc
> >>>
> >>> The reconstructed output of all three looks the same.
> >>>
> >>> Just to test for overflow I modified the testbench to test with all
> >>> maximum 10-bit values of 0x3FF instead of random values and it
> >>> passes.  One more bit, 0x4FF, and it fails.  Though the y4m file has
> >>> 8 bit depth.
> >> this sounds like your outputs would be non-deterministic if you just ran
> >> the same encode multiple times? That would be a different class of bug,
> >> perhaps unrelated to your work on the intra primitives.
> >>
> >> I don't think we often check for non-determinism on older architectures.
> >> we regularly test --no-asm against fully optimized outputs but this
> >> only tests primitives normally used on our test machines.
> >According to Agner (The microarchitecture of Intel and AMD CPUs, p. 169)
> >there is a non-deterministic aspect of my processor but it should only
> >affect execution time, not output.  Most of the primitives that I have
> >worked on have variable results from the testbench when run repeatedly.
> >A few even seem to randomly alternate between two distinct execution
> >times, something that you might expect from Agner's findings.
> >
> >The qp and bits generated are consistent across encodes.  The qp is
> >mostly consistent across builds, only varying by .01 if at all, but the
> >bits vary more across builds.
> >
> >After all this, what is preferred?  Constants copied to registers or
> >used from memory?  The testbench says memory runs faster (I believe they
> >are cached in a temp register, see Agner).  Encodes are less conclusive
> >since each build doesn't use planar32 equally.
> >
> Are you saving and restoring the extra constant XMM registers? The compiler stores floats in them.
Oops.  That's it.  I left it at 13 when it should have been 16.  Once
corrected, the stats are the same and performance between the two also
looks the same.  I can submit a patch with whatever you prefer.
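
For reference, the range arithmetic behind the all-0x3FF testbench run quoted above can be sketched in C. This is only a sketch under assumptions: it uses the standard HEVC planar formula with a 32-bit intermediate, and the names planar_max_sum and planar32_ref and the above[]/left[] layout (N samples plus the corner sample at index N) are made up for illustration, not the x265 primitive signature. The four weights always add up to 2*N = 64, so with every neighbour at 0x3FF the sum before the shift still fits in 16 bits, while 0x4FF does not. Whether that is why the 0x4FF run fails depends on the intermediate width the asm uses, which is not shown in this thread.

```c
#include <stdint.h>

#define N      32   /* block size for planar32 */
#define LOG2N  5

/* Largest pre-shift sum when every neighbour sample equals `sample`.
 * The four weights (N-1-x), (x+1), (N-1-y), (y+1) always total 2*N. */
static uint32_t planar_max_sum(uint32_t sample)
{
    return 2u * N * sample + N;
}

/* Reference planar prediction with a 32-bit intermediate.
 * Hypothetical layout: above[0..N-1] is the row above, above[N] is topRight;
 * left[0..N-1] is the left column, left[N] is bottomLeft. */
static void planar32_ref(uint16_t *dst, int stride,
                         const uint16_t *above, const uint16_t *left)
{
    const uint32_t topRight   = above[N];
    const uint32_t bottomLeft = left[N];

    for (int y = 0; y < N; y++) {
        for (int x = 0; x < N; x++) {
            uint32_t sum = (uint32_t)(N - 1 - x) * left[y]
                         + (uint32_t)(x + 1)     * topRight
                         + (uint32_t)(N - 1 - y) * above[x]
                         + (uint32_t)(y + 1)     * bottomLeft
                         + N;                       /* rounding offset */
            dst[y * stride + x] = (uint16_t)(sum >> (LOG2N + 1));
        }
    }
}
```

planar_max_sum(0x3FF) is 64 * 1023 + 32 = 65504, just under 0xFFFF, and planar_max_sum(0x4FF) is 64 * 1279 + 32 = 81888, which no longer fits in an unsigned 16-bit value.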
>   
>
>
> _______________________________________________
> x265-devel mailing list
> x265-devel at videolan.org
> https://mailman.videolan.org/listinfo/x265-devel
