[x264-devel] Optimization of the x264
Loren Merritt
lorenm at u.washington.edu
Wed Oct 10 04:48:49 CEST 2007
On Tue, 9 Oct 2007, jogging song wrote:
> On 10/9/07, Gabriel Bouvigne <gabriel.bouvigne at joost.com> wrote:
>> Victor Mateevitsi wrote:
>>
>>> I found that there are three main functions (excluding CAVLC, CABAC
>>> and NAL) that consume lots of CPU power:
>>>
>>> x264_macroblock_cache_load
>>> x264_macroblock_cache_save
>>> x264_macroblock_analyse
Those 3 functions are less than 1% CPU each at the slower settings, rising
to about 1-2% each in the -m3 profile. Worthy of optimization if you
can pull it off, but hardly main.
And x264_nal_encode (is that what you mean by NAL?) takes next to nothing.
oprofile results attached (postprocessed to merge similar functions).
>>> Reading cache_load, I realized that there is a struct named cache that
>>> stores the cached data.
>>> Why not use pointers in this struct, instead of copying the data
>>> from one struct to the other?
>>> I think we would gain some FPS there.
>>
>> The purpose is to have "compact" data within the CPU cache. If your
>> source data was already within the CPU cache, the copy will be very fast
>> (assuming there is still space available within the cache). If your
>> source data was not already within the CPU cache, you will have a slower
>> copy, but future accesses will be fast. By using pointers, you would have
>> to use a lot of CPU prefetch hints in order to avoid being stalled
>> by a cache miss during compute-intensive parts.
>
> But with the cache data structure, access to the cache goes through
> x264_scan8, which requires a lot of address calculation.
As opposed to pointers, which would require a non-constant stride
(dependent on video resolution), and thus even more address calculation.
--Loren Merritt
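
To illustrate the stride argument above, here is a minimal sketch in C. It is
not x264 code: demo_scan, CACHE_STRIDE and the demo_* functions are
hypothetical names, and the table uses plain raster order rather than the real
x264_scan8 layout. It assumes one int8_t value per 4x4 block and a macroblock
that has real neighbours above and to the left. The point is only that the
cached lookups use offsets known at compile time, while the direct lookups
have to multiply by a stride that depends on the video resolution.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_STRIDE 8                  /* fixed at compile time */
#define CACHE_SIZE   (5 * CACHE_STRIDE) /* 1 neighbour row + 4 block rows */

/* Hypothetical table (raster order, NOT x264's real x264_scan8):
 * block i sits at column 4 + i%4, row 1 + i/4, which leaves room
 * for one row above and one column to the left of the macroblock. */
static const int demo_scan[16] = {
    4+1*8, 5+1*8, 6+1*8, 7+1*8,
    4+2*8, 5+2*8, 6+2*8, 7+2*8,
    4+3*8, 5+3*8, 6+3*8, 7+3*8,
    4+4*8, 5+4*8, 6+4*8, 7+4*8,
};

/* Copy one macroblock's per-block values, plus its top row and left
 * column, from a frame-wide array into the compact cache. The frame
 * stride is only known at run time. */
static void demo_cache_load(int8_t cache[CACHE_SIZE], const int8_t *frame,
                            int frame_stride, int mb_x, int mb_y)
{
    assert(mb_x >= 1 && mb_y >= 1);     /* sketch: needs real neighbours */
    for (int y = -1; y < 4; y++)
        for (int x = -1; x < 4; x++)
            cache[(4 + x) + (1 + y) * CACHE_STRIDE] =
                frame[(mb_y * 4 + y) * frame_stride + (mb_x * 4 + x)];
}

/* Neighbour lookups through the cache: the offsets 1 and CACHE_STRIDE
 * are constants, so no multiplication by a run-time value is needed. */
static int8_t demo_left_cached(const int8_t cache[CACHE_SIZE], int idx)
{
    return cache[demo_scan[idx] - 1];
}

static int8_t demo_top_cached(const int8_t cache[CACHE_SIZE], int idx)
{
    return cache[demo_scan[idx] - CACHE_STRIDE];
}

/* The same lookups straight from the frame: each access multiplies
 * by frame_stride, which varies with the video resolution. */
static int8_t demo_left_direct(const int8_t *frame, int frame_stride,
                               int mb_x, int mb_y, int idx)
{
    int x = mb_x * 4 + idx % 4, y = mb_y * 4 + idx / 4;
    return frame[y * frame_stride + x - 1];
}

static int8_t demo_top_direct(const int8_t *frame, int frame_stride,
                              int mb_x, int mb_y, int idx)
{
    int x = mb_x * 4 + idx % 4, y = mb_y * 4 + idx / 4;
    return frame[(y - 1) * frame_stride + x];
}
```

Calling demo_cache_load once per macroblock and then using the *_cached
lookups should return the same values as the *_direct versions. Whether the
copy pays for itself depends on how many lookups follow it, which is what the
profiles below measure for the real encoder.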
-------------- next part --------------
x264 --crf 20 -b3 -r6 -m7 -t2 -8 -w --b-rdo --mixed-refs --me umh -Aall --bime --direct auto --no-fast-pskip
CPU: Core 2, speed 2400.75 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples % symbol name
19339273 23.9161 quant_trellis_cabac
17540289 21.6911 pixel_sad_*_mmx
8512890 10.5275 pixel_satd_*_mmx
4353627 5.3839 pixel_avg_*_mmx
4251986 5.2583 me_search_ref
3453580 4.2709 mc_chroma_mmx
2131654 2.6361 block_residual_write_cabac
1480012 1.8303 refine_subpel
1435113 1.7747 dct_*_mmx
1377895 1.7040 macroblock_encode
1271179 1.5720 cabac_size_decision_noup
1210223 1.4966 cabac_size_decision
1178132 1.4569 get_ref_mmx
915569 1.1323 dequant_*_mmx
912835 1.1289 zigzag_scan_*
811803 1.0039 mb_encode_8x8_chroma
811720 1.0038 cabac_size_decision2
755941 0.9349 mb_predict_mv_*
695673 0.8602 predict_intra_*
661921 0.8186 idct_*_mmx
493708 0.6105 pixel_sa8d_*_mmx
482573 0.5967 pixel_ssd_*_mmx
434550 0.5374 macroblock_size_cabac
428571 0.5299 mb_analyse_inter_*
410358 0.5075 macroblock_cache_load
392712 0.4857 mb_analyse_intra_*
386242 0.4777 hpel_filter_mmx
352885 0.4363 mc_copy_*_mmx
316615 0.3915 mb_mc_*
305143 0.3774 frame_init_lowres
289818 0.3584 macroblock_cache_save
274721 0.3398 cabac_mb_mvd
256562 0.3173 cabac_encode_decision
251457 0.3110 quant_*_mmx
233637 0.2889 macroblock_analyse
194533 0.2406 analyse_update_cache
187355 0.2317 frame_deblock_row
164529 0.2035 plane_copy_mmx
148518 0.1837 mc_luma_mmx
140143 0.1733 macroblock_probe_skip
129736 0.1604 deblock_*_mmx
129345 0.1600 me_refine_qpel_rd
113775 0.1407 predict_8x8_filter
110969 0.1372 mb_encode_i*
109943 0.1360 slicetype_mb_cost
89930 0.1112 macroblock_encode_p8x8
78711 0.0973 mb_predict_intra4x4_mode
75743 0.0937 macroblock_write_cabac
71767 0.0887 cabac_mb_type
59747 0.0739 intra_rd_refine
56389 0.0697 deblock_*_c
55789 0.0690 rd_cost_mb
55713 0.0689 partition_size_cabac
50158 0.0620 slices_write
43416 0.0537 .plt
40376 0.0500 mb_analyse_*_rd
39013 0.0483 cabac_mb_intra4x4_pred_mode
38353 0.0474 me_refine_bidir
30348 0.0375 rd_cost_part
21491 0.0266 rd_cost_i8x8
19277 0.0238 prefetch_fenc_mmx
19013 0.0235 plane_expand_border
18062 0.0223 cabac_encode_bypass
17906 0.0221 prefetch_ref_mmx
17222 0.0213 mb_load_mv_direct8x8
16352 0.0202 rd_cost_i4x4
13483 0.0167 cabac_mb_ref
9746 0.0121 cabac_mb8x8_mvd
9274 0.0115 ratecontrol_qp
8837 0.0109 cabac_mb_skip
8053 0.0100 rd_cost_i8x8_chroma
-------------- next part --------------
x264 --crf 20 -b3 -r4 -m6 -t1 -8 -w --b-rdo --mixed-refs --me umh --merange 12
CPU: Core 2, speed 2400.75 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples % symbol name
11659572 25.2962 pixel_sad_*_mmx
7030913 15.2541 pixel_satd_*_mmx
3417912 7.4155 pixel_avg_*_mmx
2983057 6.4719 me_search_ref
2303283 4.9971 mc_chroma_mmx
2004425 4.3487 quant_trellis_cabac
1299367 2.8191 macroblock_encode
1261859 2.7376 block_residual_write_cabac
1130087 2.4518 refine_subpel
968260 2.1006 dct_*_mmx
892915 1.9372 get_ref_mmx
696490 1.5111 cabac_size_decision
638272 1.3847 zigzag_scan_*
635907 1.3796 mb_encode_8x8_chroma
628438 1.3635 mb_predict_mv_*
617196 1.3390 predict_intra_*
587410 1.2744 dequant_*_mmx
479187 1.0396 pixel_sa8d_*_mmx
434150 0.9420 idct_*_mmx
389361 0.8447 macroblock_cache_load
388074 0.8420 hpel_filter_mmx
369224 0.8011 mb_analyse_intra_*
359333 0.7796 quant_*_mmx
349169 0.7575 macroblock_size_cabac
323632 0.7021 pixel_ssd_*_mmx
313496 0.6801 mb_analyse_inter_*
305625 0.6631 macroblock_cache_save
305018 0.6618 frame_init_lowres
291058 0.6315 mc_copy_*_mmx
272469 0.5911 cabac_encode_decision
271118 0.5883 mb_mc_*
266287 0.5777 macroblock_analyse
251281 0.5452 analyse_update_cache
202119 0.4386 cabac_mb_mvd
173994 0.3775 frame_deblock_row
163174 0.3540 plane_copy_mmx
133649 0.2900 cabac_size_decision_noup
121308 0.2633 deblock_*_mmx
117640 0.2552 mc_luma_mmx
107665 0.2336 slicetype_mb_cost
99981 0.2169 predict_8x8_filter
80173 0.1739 mb_encode_i*
76440 0.1658 macroblock_write_cabac
76012 0.1649 cabac_size_decision2
68867 0.1494 cabac_mb_type
58684 0.1273 deblock_*_c
55166 0.1197 mb_predict_intra4x4_mode
53330 0.1157 slices_write
47315 0.1027 mb_analyse_*_rd
42898 0.0931 macroblock_probe_skip
42454 0.0921 rd_cost_mb
30975 0.0672 mb_load_mv_direct8x8
25404 0.0551 .plt
22827 0.0495 cabac_mb_intra4x4_pred_mode
19536 0.0424 plane_expand_border
18740 0.0407 prefetch_fenc_mmx
18091 0.0392 cabac_encode_bypass
16955 0.0368 prefetch_ref_mmx
11587 0.0252 cabac_mb_ref
11275 0.0245 cabac_mb_skip
10127 0.0220 cabac_mb8x8_mvd
8752 0.0190 ratecontrol_qp
8032 0.0174 cabac_encode_terminal
5737 0.0124 mb_cache_mv_*
4853 0.0105 slicetype_frame_cost
-------------- next part --------------
x264 --crf 20 -b3 -r1 -m3 -t0 -8
CPU: Core 2, speed 2400.75 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples % symbol name
2155684 16.7003 pixel_satd_*_mmx
1816710 14.0743 pixel_sad_*_mmx
819444 6.3483 pixel_avg_*_mmx
675700 5.2347 me_search_ref
493839 3.8258 pixel_sa8d_*_mmx
477614 3.7005 predict_intra_*
386837 2.9969 hpel_filter_mmx
337768 2.6167 dct_*_mmx
307622 2.3832 frame_init_lowres
302414 2.3428 get_ref_mmx
296038 2.2935 mc_chroma_mmx
288773 2.2372 refine_subpel
284917 2.2073 cabac_encode_decision
280814 2.1755 macroblock_cache_load
272193 2.1088 mb_analyse_intra_*
249413 1.9322 block_residual_write_cabac
248528 1.9254 macroblock_cache_save
239293 1.8539 mb_predict_mv_*
218882 1.6957 mc_copy_*_mmx
215115 1.6665 dequant_*_mmx
213712 1.6557 zigzag_scan_*
205101 1.5889 macroblock_encode
167681 1.2991 plane_copy_mmx
163885 1.2696 frame_deblock_row
160222 1.2413 mb_analyse_inter_*
157446 1.2197 idct_*_mmx
134216 1.0398 macroblock_analyse
123591 0.9575 macroblock_probe_skip
123189 0.9544 quant_*_mmx
122915 0.9523 mb_mc_*
115551 0.8953 deblock_*_mmx
107384 0.8319 slicetype_mb_cost
93534 0.7246 mb_encode_8x8_chroma
79385 0.6150 deblock_*_c
75653 0.5861 predict_8x8_filter
68604 0.5315 macroblock_write_cabac
60181 0.4662 mc_luma_mmx
52160 0.4041 analyse_update_cache
43143 0.3342 mb_encode_i*
33294 0.2579 slices_write
32089 0.2486 mb_predict_intra4x4_mode
29736 0.2304 cabac_mb_mvd
22062 0.1709 prefetch_fenc_mmx
21621 0.1675 plane_expand_border
17903 0.1387 prefetch_ref_mmx
15300 0.1185 cabac_encode_bypass
12925 0.1001 mb_load_mv_direct8x8
12198 0.0945 cabac_mb_type
10696 0.0829 cabac_encode_terminal
6640 0.0514 .plt
6416 0.0497 ratecontrol_qp
6327 0.0490 cabac_mb_skip
6205 0.0481 cabac_mb_intra4x4_pred_mode
5187 0.0402 me_refine_qpel
4975 0.0385 slicetype_frame_cost
4682 0.0363 nal_encode
3963 0.0307 cabac_mb_intra_chroma_pred_mode
3162 0.0245 mb_transform_8x8_allowed
3081 0.0238 mb_cache_mv_*
2972 0.0230 cabac_mb_type_intra
1895 0.0147 cabac_encode_ue_bypass
-------------- next part --------------
x264 -q0 -m5 -Aall
CPU: Core 2, speed 2400.75 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples % symbol name
5276007 20.0626 cabac_encode_decision
4416561 16.7944 block_residual_write_cabac
4061745 15.4452 pixel_sad_*_mmx
1739546 6.6148 pixel_avg_*_mmx
1418887 5.3955 mc_chroma_mmx
1309726 4.9804 me_search_ref
1252722 4.7636 refine_subpel
969617 3.6871 get_ref_mmx
745883 2.8363 hpel_filter_mmx
599794 2.2807 predict_intra_*
493527 1.8766 mb_analyse_inter_*
453190 1.7233 mb_analyse_intra_*
415030 1.5782 cabac_encode_bypass
406677 1.5464 mb_predict_mv_*
318068 1.2095 macroblock_cache_load
298472 1.1350 zigzag_sub_4x4_frame
296410 1.1271 macroblock_encode
255290 0.9708 macroblock_cache_save
233376 0.8874 macroblock_analyse
168506 0.6408 plane_copy_mmx
164371 0.6250 macroblock_write_cabac
139207 0.5293 mb_encode_8x8_chroma
126266 0.4801 zigzag_sub_4x4ac_frame
118014 0.4487 mc_copy_*_mmx
92060 0.3501 nal_encode
89158 0.3390 cabac_mb_mvd
49925 0.1898 analyse_update_cache
47560 0.1808 mb_mc_*
46217 0.1757 mb_encode_i*
42186 0.1604 mb_predict_intra4x4_mode
41823 0.1590 slices_write
34016 0.1293 plane_expand_border
28148 0.1070 prefetch_ref_mmx
21125 0.0803 mc_luma_mmx
19226 0.0731 prefetch_fenc_mmx
17731 0.0674 cabac_mb_type
16523 0.0628 cabac_encode_ue_bypass
14246 0.0542 me_refine_qpel
12995 0.0494 cabac_encode_terminal
12446 0.0473 cabac_mb_intra4x4_pred_mode
6383 0.0243 cabac_mb_skip
5553 0.0211 .plt
5109 0.0194 ratecontrol_qp
4191 0.0159 cabac_mb_intra_chroma_pred_mode
2727 0.0104 cabac_mb_type_intra