[x265] [PATCH 2 of 2] asm:intra pred planar32 sse2 high bit

dave dtyx265 at gmail.com
Tue Mar 10 16:02:11 CET 2015


On 03/09/2015 11:40 PM, Steve Borho wrote:
> On 03/09, dave wrote:
>> On 03/09/2015 08:25 PM, Steve Borho wrote:
>>> On 03/09, dave wrote:
>>>> Interesting.  Performance is almost identical
>>>>
>>>> original code
>>>>
>>>> ./x265 -I 1 --input ~/Videos/bridge-close-cif/bridge-close.y4m -o
>>>> bridge-close.y4m
>>>> y4m  [info]: 352x288 fps 30/1 i420p8 frames 0 - 1999 of 2000
>>>> x265 [info]: HEVC encoder version 1.5+162-4d1d54d28cb1
>>>> x265 [info]: build info [Linux][GCC 4.7.2][64 bit] 16bpp
>>>> x265 [info]: using cpu capabilities: MMX2 SSE2Slow SlowCTZ
>>>> x265 [info]: Main 10 profile, Level-2 (Main tier)
>>>> x265 [info]: Thread pool created using 2 threads
>>>> x265 [info]: frame threads / pool features       : 1 / wpp(5 rows)
>>>> x265 [info]: Internal bit depth                  : 10
>>>> x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
>>>> x265 [info]: Residual QT: max TU size, max depth : 32 / 1 inter / 1 intra
>>>> x265 [info]: ME / range / subpel / merge         : hex / 57 / 2 / 2
>>>> x265 [info]: Keyframe min / max / scenecut       : 1 / 1 / 40
>>>> x265 [info]: Lookahead / bframes / badapt        : 20 / 0 / 0
>>>> x265 [info]: b-pyramid / weightp / weightb / refs: 0 / 1 / 0 / 3
>>>> x265 [info]: Rate Control / AQ-Strength / CUTree : CRF-28.0 / 1.0 / 1
>>>> x265 [info]: tools: rd=3 psy-rd=0.30 deblock sao signhide tmvp
>>>> x265 [info]: frame I:   2000, Avg QP:33.08  kb/s: 973.45
>>>> x265 [info]: global :   2000, Avg QP:33.08  kb/s: 973.45
>>>> x265 [info]: consecutive B-frames: 100.0%
>>>>
>>>> encoded 2000 frames in 414.26s (4.83 fps), 973.45 kb/s
>>>>
>>>> and using registers to hold constants
>>>>
>>>> ./x265 -I 1 --input ~/Videos/bridge-close-cif/bridge-close.y4m -o
>>>> bridge-close.y4m
>>>> y4m  [info]: 352x288 fps 30/1 i420p8 frames 0 - 1999 of 2000
>>>> x265 [info]: HEVC encoder version 1.5+162-4d1d54d28cb1
>>>> x265 [info]: build info [Linux][GCC 4.7.2][64 bit] 16bpp
>>>> x265 [info]: using cpu capabilities: MMX2 SSE2Slow SlowCTZ
>>>> x265 [info]: Main 10 profile, Level-2 (Main tier)
>>>> x265 [info]: Thread pool created using 2 threads
>>>> x265 [info]: frame threads / pool features       : 1 / wpp(5 rows)
>>>> x265 [info]: Internal bit depth                  : 10
>>>> x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
>>>> x265 [info]: Residual QT: max TU size, max depth : 32 / 1 inter / 1 intra
>>>> x265 [info]: ME / range / subpel / merge         : hex / 57 / 2 / 2
>>>> x265 [info]: Keyframe min / max / scenecut       : 1 / 1 / 40
>>>> x265 [info]: Lookahead / bframes / badapt        : 20 / 0 / 0
>>>> x265 [info]: b-pyramid / weightp / weightb / refs: 0 / 1 / 0 / 3
>>>> x265 [info]: Rate Control / AQ-Strength / CUTree : CRF-28.0 / 1.0 / 1
>>>> x265 [info]: tools: rd=3 psy-rd=0.30 deblock sao signhide tmvp
>>>> x265 [info]: frame I:   2000, Avg QP:33.08  kb/s: 973.45
>>>> x265 [info]: global :   2000, Avg QP:33.08  kb/s: 973.45
>>>> x265 [info]: consecutive B-frames: 100.0%
>>>>
>>>> encoded 2000 frames in 414.28s (4.83 fps), 973.45 kb/s
>>>>
>>>> The closest I could find to forcing intra planar32 to be used is -I 1
>>> If you enable DETAILED_CU_STATS it will report the amount of time spent
>>> in intra analysis.
>>>
>> Without using extra registers
>>
>> ./x265 --ctu 32 --min-cu-size 32 -I 1 --input
>> ~/Videos/bridge-close-cif/bridge-close.y4m -o bridge-close.y4m
>> y4m  [info]: 352x288 fps 30/1 i420p8 frames 0 - 1999 of 2000
>> x265 [info]: HEVC encoder version 1.5+162-4d1d54d28cb1
>> x265 [info]: build info [Linux][GCC 4.7.2][64 bit] 16bpp
>> x265 [info]: using cpu capabilities: MMX2 SSE2Slow SlowCTZ
>> x265 [info]: Main 10 profile, Level-2 (Main tier)
>> x265 [info]: Thread pool created using 2 threads
>> x265 [info]: frame threads / pool features       : 1 / wpp(9 rows)
>> x265 [info]: Internal bit depth                  : 10
>> x265 [info]: Coding QT: max CU size, min CU size : 32 / 32
>> x265 [info]: Residual QT: max TU size, max depth : 32 / 1 inter / 1 intra
>> x265 [info]: ME / range / subpel / merge         : hex / 57 / 2 / 2
>> x265 [info]: Keyframe min / max / scenecut       : 1 / 1 / 40
>> x265 [info]: Lookahead / bframes / badapt        : 20 / 0 / 0
>> x265 [info]: b-pyramid / weightp / weightb / refs: 0 / 1 / 0 / 3
>> x265 [info]: Rate Control / AQ-Strength / CUTree : CRF-28.0 / 1.0 / 1
>> x265 [info]: tools: rd=3 psy-rd=0.30 deblock sao signhide tmvp
>> x265 [info]: frame I:   2000, Avg QP:31.94  kb/s: 1008.63
>> x265 [info]: global :   2000, Avg QP:31.94  kb/s: 1008.63
>> x265 [info]: consecutive B-frames: 100.0%
>> x265 [info]: CU: %00.00 time spent in motion estimation, averaging
>> 0.000 CU inter modes per CTU
>> x265 [info]: CU: %23.55 time spent in intra analysis, averaging
>> 1.000 Intra PUs per CTU
>> x265 [info]: CU: %00.00 time spent in inter RDO, measuring 0.000
>> inter/merge predictions per CTU
>> x265 [info]: CU: %45.03 time spent in intra RDO, measuring 5.331
>> intra predictions per CTU
>> x265 [info]: CU: %09.23 time spent in loop filters, average 0.899 ms
>> per call
>> x265 [info]: CU: %03.88 time spent in slicetypeDecide (avg 0.720ms)
>> and prelookahead (avg 2.683ms)
>> x265 [info]: CU: %18.30 time spent in other tasks
>> x265 [info]: CU: Intra RDO time  per depth %100.00 %00.00 %00.00 %00.00
>> x265 [info]: CU: Intra RDO calls per depth %100.00 %00.00 %00.00 %00.00
>> x265 [info]: CU: 198000 32X32 CTUs compressed in 175.257 seconds,
>> 1129.772 CTUs per worker-second
>> x265 [info]: CU: 1.766 average worker utilization, %88.31 of
>> theoretical maximum utilization
>>
>> encoded 2000 frames in 99.23s (20.15 fps), 1008.63 kb/s
>>
>> With extra registers for constants
>>
>> ./x265 --ctu 32 --min-cu-size 32 -I 1 --input
>> ~/Videos/bridge-close-cif/bridge-close.y4m -o bridge-close.y4m
>> y4m  [info]: 352x288 fps 30/1 i420p8 frames 0 - 1999 of 2000
>> x265 [info]: HEVC encoder version 1.5+162-4d1d54d28cb1
>> x265 [info]: build info [Linux][GCC 4.7.2][64 bit] 16bpp
>> x265 [info]: using cpu capabilities: MMX2 SSE2Slow SlowCTZ
>> x265 [info]: Main 10 profile, Level-2 (Main tier)
>> x265 [info]: Thread pool created using 2 threads
>> x265 [info]: frame threads / pool features       : 1 / wpp(9 rows)
>> x265 [info]: Internal bit depth                  : 10
>> x265 [info]: Coding QT: max CU size, min CU size : 32 / 32
>> x265 [info]: Residual QT: max TU size, max depth : 32 / 1 inter / 1 intra
>> x265 [info]: ME / range / subpel / merge         : hex / 57 / 2 / 2
>> x265 [info]: Keyframe min / max / scenecut       : 1 / 1 / 40
>> x265 [info]: Lookahead / bframes / badapt        : 20 / 0 / 0
>> x265 [info]: b-pyramid / weightp / weightb / refs: 0 / 1 / 0 / 3
>> x265 [info]: Rate Control / AQ-Strength / CUTree : CRF-28.0 / 1.0 / 1
>> x265 [info]: tools: rd=3 psy-rd=0.30 deblock sao signhide tmvp
>> x265 [info]: frame I:   2000, Avg QP:31.94  kb/s: 1020.04
>> x265 [info]: global :   2000, Avg QP:31.94  kb/s: 1020.04
>> x265 [info]: consecutive B-frames: 100.0%
>> x265 [info]: CU: %00.00 time spent in motion estimation, averaging
>> 0.000 CU inter modes per CTU
>> x265 [info]: CU: %22.94 time spent in intra analysis, averaging
>> 1.000 Intra PUs per CTU
>> x265 [info]: CU: %00.00 time spent in inter RDO, measuring 0.000
>> inter/merge predictions per CTU
>> x265 [info]: CU: %46.41 time spent in intra RDO, measuring 5.462
>> intra predictions per CTU
>> x265 [info]: CU: %09.03 time spent in loop filters, average 0.900 ms
>> per call
>> x265 [info]: CU: %03.71 time spent in slicetypeDecide (avg 0.673ms)
>> and prelookahead (avg 2.661ms)
>> x265 [info]: CU: %17.92 time spent in other tasks
>> x265 [info]: CU: Intra RDO time  per depth %100.00 %00.00 %00.00 %00.00
>> x265 [info]: CU: Intra RDO calls per depth %100.00 %00.00 %00.00 %00.00
>> x265 [info]: CU: 198000 32X32 CTUs compressed in 179.521 seconds,
>> 1102.938 CTUs per worker-second
>> x265 [info]: CU: 1.767 average worker utilization, %88.36 of
>> theoretical maximum utilization
>>
>> encoded 2000 frames in 101.59s (19.69 fps), 1020.04 kb/s
>>
>> It doesn't say much about planar 32 usage.
> No, but the command line option --cu-stats does show how much it is
> called (but not how long it took)
>
This produces some interesting numbers.

Without using registers for constants

x265 [info]: I32: Intra 100%(DC 0% P 40% Ang 58%)

encoded 2000 frames in 95.98s (20.84 fps), 1020.04 kb/s

Using registers for constants

x265 [info]: I32: Intra 99%(DC 39% P 16% Ang 43%)

encoded 2000 frames in 93.10s (21.48 fps), 1008.63 kb/s

I just added --cu-stats to the same command options that I used
previously, ran each build several times, and got exactly the same
percentages every time.  Times varied by less than a second for each
build.  So how can simple register usage in one primitive affect intra
pred decisions?
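
One way to narrow this down would be to compare each build's planar32
output against a plain C++ reference, since a pure register-allocation
change should leave the predicted block bit-identical.  Below is a
minimal sketch, assuming the standard HEVC planar formula.  The function
names and the neighbour layout (33 entries each, with the top-right and
bottom-left samples stored last) are hypothetical and are not x265's
actual primitive signature or buffer layout.

```cpp
#include <cassert>
#include <cstdint>

typedef uint16_t pixel; // 16bpp (high bit depth) build

// Hypothetical layout, for illustration only:
//   above[0..31] = row above the block, above[32] = top-right sample
//   left[0..31]  = column left of the block, left[32] = bottom-left sample
void planar32_ref(pixel* dst, int stride, const pixel* above, const pixel* left)
{
    const int N = 32;
    const int shift = 6; // log2(N) + 1
    assert(stride >= N);

    const int topRight = above[N];
    const int bottomLeft = left[N];

    for (int y = 0; y < N; y++)
    {
        for (int x = 0; x < N; x++)
        {
            // Horizontal and vertical linear interpolation, summed and rounded.
            int sum = (N - 1 - x) * left[y] + (x + 1) * topRight
                    + (N - 1 - y) * above[x] + (y + 1) * bottomLeft + N;
            dst[y * stride + x] = (pixel)(sum >> shift);
        }
    }
}

// Returns true when two 32x32 blocks with the same stride are bit-identical.
bool blocks_match(const pixel* a, const pixel* b, int stride)
{
    for (int y = 0; y < 32; y++)
        for (int x = 0; x < 32; x++)
            if (a[y * stride + x] != b[y * stride + x])
                return false;
    return true;
}
```

If the output of either asm version fails blocks_match() against the
reference for some input, the difference in mode decisions would come
from the primitive's output rather than from timing.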

