[x264-devel] Optimization of the x264

Loren Merritt lorenm at u.washington.edu
Wed Oct 10 04:48:49 CEST 2007


On Tue, 9 Oct 2007, jogging song wrote:
> On 10/9/07, Gabriel Bouvigne <gabriel.bouvigne at joost.com> wrote:
>> Victor Mateevitsi wrote:
>>
>>> I found out that there are three main functions (excluding CAVLC, CABAC
>>> and NAL) that consume a lot of CPU power:
>>>
>>> x264_macroblock_cache_load
>>> x264_macroblock_cache_save
>>> x264_macroblock_analyse

Those 3 functions take less than 1% of CPU time each at the slower settings,
and roughly 1-2% each in the faster runs. Worthy of optimization if you
can pull it off, but hardly main.
And x264_nal_encode (is that what you mean by NAL?) takes next to nothing.
oprofile results attached (postprocessed to merge similar functions).
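
To put "hardly main" in numbers, here is a small C sketch of an Amdahl-style upper bound: it sums the profile shares of a set of functions and computes the best possible overall speedup if those functions cost nothing at all. The helper names combined_share and max_speedup are made up for this illustration, and the bound assumes that removing the functions changes nothing else (for example, no extra cache misses elsewhere).

```c
#include <assert.h>

/* Sum the profile shares (in percent) of n functions. */
static double combined_share(const double *percent, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += percent[i];
    return sum;
}

/* Upper bound on the overall speedup if code taking `percent` of all
 * cycles were made free (Amdahl's law with an infinite local speedup). */
static double max_speedup(double percent)
{
    assert(percent >= 0.0 && percent < 100.0);
    return 1.0 / (1.0 - percent / 100.0);
}
```

Fed the macroblock_cache_load, macroblock_cache_save and macroblock_analyse rows of the first attached profile (0.5075, 0.3584, 0.2889), the combined share is 1.1548% and the bound is about 1.012x. With the rows of the fastest profile (2.1755, 1.9254, 1.0398), the combined share is 5.1407% and the bound is about 1.054x.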

>>> Reading cache_load, I realized that there is a struct named cache, that
>>> stores the cached data.
>>> Why not use pointers in this struct, without having to copy the data
>>> from one struct to the other ?
>>> I think we will gain some FPS there.
>>
>> The purpose is to keep the data "compact" within the CPU cache. If your
>> source data is already in the CPU cache, the copy will be very fast
>> (assuming there is still space available in the cache). If your
>> source data is not yet in the CPU cache, the copy will be slower,
>> but future accesses will be fast. With pointers, you would have
>> to use a lot of CPU prefetch hints to avoid stalling on
>> cache misses during the compute-intensive parts.
>
> But with the cache data structure, access to the cache goes through x264_scan8,
> which requires a lot of address calculation.

As opposed to pointers, which would require a non-constant stride 
(dependent on video resolution), and thus even more address calculation.
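
To illustrate the point about strides, here is a minimal C sketch. It is not x264's actual code: CACHE_STRIDE, cache_slot, cache_load, pred_cached and pred_direct are hypothetical names, and the 8-wide layout with one neighbor row and column is an assumption made for the example. With the copy, neighbor offsets are compile-time constants; with pointers into the frame, the offset to the row above depends on a stride known only at run time.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout for illustration only. */
enum { CACHE_STRIDE = 8, CACHE_ROWS = 5 };   /* fixed at compile time */

/* Slot of 4x4 block (bx, by) in the cache; bx or by == -1 addresses the
 * left column or top row, which hold the neighboring macroblocks' data. */
static inline int cache_slot(int bx, int by)
{
    assert(bx >= -1 && bx < 4 && by >= -1 && by < 4);
    return (bx + 1) + (by + 1) * CACHE_STRIDE;
}

/* Copy one macroblock's per-block values, plus its top and left
 * neighbors, from a frame-wide array into the fixed-stride cache.
 * Positions outside the frame are marked unavailable with -1. */
static void cache_load(int8_t cache[CACHE_ROWS * CACHE_STRIDE],
                       const int8_t *frame, int frame_stride,
                       int mb_x, int mb_y)
{
    for (int y = -1; y < 4; y++)
        for (int x = -1; x < 4; x++) {
            int fx = mb_x * 4 + x;
            int fy = mb_y * 4 + y;
            cache[cache_slot(x, y)] =
                (fx < 0 || fy < 0) ? -1 : frame[fx + fy * frame_stride];
        }
}

/* Toy predictor (left + top) through the cache: the neighbor offsets
 * -1 and -CACHE_STRIDE are constants the compiler can fold. */
static int pred_cached(const int8_t *cache, int bx, int by)
{
    int i = cache_slot(bx, by);
    return cache[i - 1] + cache[i - CACHE_STRIDE];
}

/* The same predictor through a frame pointer: both the base address and
 * the offset to the row above depend on the run-time frame_stride. */
static int pred_direct(const int8_t *frame, int frame_stride, int fx, int fy)
{
    const int8_t *p = frame + fx + fy * frame_stride;
    return p[-1] + p[-frame_stride];
}
```

Both predictors return the same value for an interior macroblock; the difference is only in how much of the address arithmetic is known at compile time.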

--Loren Merritt
-------------- next part --------------
x264 --crf 20 -b3 -r6 -m7 -t2 -8 -w --b-rdo --mixed-refs --me umh -Aall --bime --direct auto --no-fast-pskip
CPU: Core 2, speed 2400.75 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        symbol name
19339273 23.9161  quant_trellis_cabac
17540289 21.6911  pixel_sad_*_mmx
8512890  10.5275  pixel_satd_*_mmx
4353627   5.3839  pixel_avg_*_mmx
4251986   5.2583  me_search_ref
3453580   4.2709  mc_chroma_mmx
2131654   2.6361  block_residual_write_cabac
1480012   1.8303  refine_subpel
1435113   1.7747  dct_*_mmx
1377895   1.7040  macroblock_encode
1271179   1.5720  cabac_size_decision_noup
1210223   1.4966  cabac_size_decision
1178132   1.4569  get_ref_mmx
915569    1.1323  dequant_*_mmx
912835    1.1289  zigzag_scan_*
811803    1.0039  mb_encode_8x8_chroma
811720    1.0038  cabac_size_decision2
755941    0.9349  mb_predict_mv_*
695673    0.8602  predict_intra_*
661921    0.8186  idct_*_mmx
493708    0.6105  pixel_sa8d_*_mmx
482573    0.5967  pixel_ssd_*_mmx
434550    0.5374  macroblock_size_cabac
428571    0.5299  mb_analyse_inter_*
410358    0.5075  macroblock_cache_load
392712    0.4857  mb_analyse_intra_*
386242    0.4777  hpel_filter_mmx
352885    0.4363  mc_copy_*_mmx
316615    0.3915  mb_mc_*
305143    0.3774  frame_init_lowres
289818    0.3584  macroblock_cache_save
274721    0.3398  cabac_mb_mvd
256562    0.3173  cabac_encode_decision
251457    0.3110  quant_*_mmx
233637    0.2889  macroblock_analyse
194533    0.2406  analyse_update_cache
187355    0.2317  frame_deblock_row
164529    0.2035  plane_copy_mmx
148518    0.1837  mc_luma_mmx
140143    0.1733  macroblock_probe_skip
129736    0.1604  deblock_*_mmx
129345    0.1600  me_refine_qpel_rd
113775    0.1407  predict_8x8_filter
110969    0.1372  mb_encode_i*
109943    0.1360  slicetype_mb_cost
89930     0.1112  macroblock_encode_p8x8
78711     0.0973  mb_predict_intra4x4_mode
75743     0.0937  macroblock_write_cabac
71767     0.0887  cabac_mb_type
59747     0.0739  intra_rd_refine
56389     0.0697  deblock_*_c
55789     0.0690  rd_cost_mb
55713     0.0689  partition_size_cabac
50158     0.0620  slices_write
43416     0.0537  .plt
40376     0.0500  mb_analyse_*_rd
39013     0.0483  cabac_mb_intra4x4_pred_mode
38353     0.0474  me_refine_bidir
30348     0.0375  rd_cost_part
21491     0.0266  rd_cost_i8x8
19277     0.0238  prefetch_fenc_mmx
19013     0.0235  plane_expand_border
18062     0.0223  cabac_encode_bypass
17906     0.0221  prefetch_ref_mmx
17222     0.0213  mb_load_mv_direct8x8
16352     0.0202  rd_cost_i4x4
13483     0.0167  cabac_mb_ref
9746      0.0121  cabac_mb8x8_mvd
9274      0.0115  ratecontrol_qp
8837      0.0109  cabac_mb_skip
8053      0.0100  rd_cost_i8x8_chroma
-------------- next part --------------
x264 --crf 20 -b3 -r4 -m6 -t1 -8 -w --b-rdo --mixed-refs --me umh --merange 12
CPU: Core 2, speed 2400.75 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        symbol name
11659572 25.2962  pixel_sad_*_mmx
7030913  15.2541  pixel_satd_*_mmx
3417912   7.4155  pixel_avg_*_mmx
2983057   6.4719  me_search_ref
2303283   4.9971  mc_chroma_mmx
2004425   4.3487  quant_trellis_cabac
1299367   2.8191  macroblock_encode
1261859   2.7376  block_residual_write_cabac
1130087   2.4518  refine_subpel
968260    2.1006  dct_*_mmx
892915    1.9372  get_ref_mmx
696490    1.5111  cabac_size_decision
638272    1.3847  zigzag_scan_*
635907    1.3796  mb_encode_8x8_chroma
628438    1.3635  mb_predict_mv_*
617196    1.3390  predict_intra_*
587410    1.2744  dequant_*_mmx
479187    1.0396  pixel_sa8d_*_mmx
434150    0.9420  idct_*_mmx
389361    0.8447  macroblock_cache_load
388074    0.8420  hpel_filter_mmx
369224    0.8011  mb_analyse_intra_*
359333    0.7796  quant_*_mmx
349169    0.7575  macroblock_size_cabac
323632    0.7021  pixel_ssd_*_mmx
313496    0.6801  mb_analyse_inter_*
305625    0.6631  macroblock_cache_save
305018    0.6618  frame_init_lowres
291058    0.6315  mc_copy_*_mmx
272469    0.5911  cabac_encode_decision
271118    0.5883  mb_mc_*
266287    0.5777  macroblock_analyse
251281    0.5452  analyse_update_cache
202119    0.4386  cabac_mb_mvd
173994    0.3775  frame_deblock_row
163174    0.3540  plane_copy_mmx
133649    0.2900  cabac_size_decision_noup
121308    0.2633  deblock_*_mmx
117640    0.2552  mc_luma_mmx
107665    0.2336  slicetype_mb_cost
99981     0.2169  predict_8x8_filter
80173     0.1739  mb_encode_i*
76440     0.1658  macroblock_write_cabac
76012     0.1649  cabac_size_decision2
68867     0.1494  cabac_mb_type
58684     0.1273  deblock_*_c
55166     0.1197  mb_predict_intra4x4_mode
53330     0.1157  slices_write
47315     0.1027  mb_analyse_*_rd
42898     0.0931  macroblock_probe_skip
42454     0.0921  rd_cost_mb
30975     0.0672  mb_load_mv_direct8x8
25404     0.0551  .plt
22827     0.0495  cabac_mb_intra4x4_pred_mode
19536     0.0424  plane_expand_border
18740     0.0407  prefetch_fenc_mmx
18091     0.0392  cabac_encode_bypass
16955     0.0368  prefetch_ref_mmx
11587     0.0252  cabac_mb_ref
11275     0.0245  cabac_mb_skip
10127     0.0220  cabac_mb8x8_mvd
8752      0.0190  ratecontrol_qp
8032      0.0174  cabac_encode_terminal
5737      0.0124  mb_cache_mv_*
4853      0.0105  slicetype_frame_cost
-------------- next part --------------
x264 --crf 20 -b3 -r1 -m3 -t0 -8
CPU: Core 2, speed 2400.75 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        symbol name
2155684  16.7003  pixel_satd_*_mmx
1816710  14.0743  pixel_sad_*_mmx
819444    6.3483  pixel_avg_*_mmx
675700    5.2347  me_search_ref
493839    3.8258  pixel_sa8d_*_mmx
477614    3.7005  predict_intra_*
386837    2.9969  hpel_filter_mmx
337768    2.6167  dct_*_mmx
307622    2.3832  frame_init_lowres
302414    2.3428  get_ref_mmx
296038    2.2935  mc_chroma_mmx
288773    2.2372  refine_subpel
284917    2.2073  cabac_encode_decision
280814    2.1755  macroblock_cache_load
272193    2.1088  mb_analyse_intra_*
249413    1.9322  block_residual_write_cabac
248528    1.9254  macroblock_cache_save
239293    1.8539  mb_predict_mv_*
218882    1.6957  mc_copy_*_mmx
215115    1.6665  dequant_*_mmx
213712    1.6557  zigzag_scan_*
205101    1.5889  macroblock_encode
167681    1.2991  plane_copy_mmx
163885    1.2696  frame_deblock_row
160222    1.2413  mb_analyse_inter_*
157446    1.2197  idct_*_mmx
134216    1.0398  macroblock_analyse
123591    0.9575  macroblock_probe_skip
123189    0.9544  quant_*_mmx
122915    0.9523  mb_mc_*
115551    0.8953  deblock_*_mmx
107384    0.8319  slicetype_mb_cost
93534     0.7246  mb_encode_8x8_chroma
79385     0.6150  deblock_*_c
75653     0.5861  predict_8x8_filter
68604     0.5315  macroblock_write_cabac
60181     0.4662  mc_luma_mmx
52160     0.4041  analyse_update_cache
43143     0.3342  mb_encode_i*
33294     0.2579  slices_write
32089     0.2486  mb_predict_intra4x4_mode
29736     0.2304  cabac_mb_mvd
22062     0.1709  prefetch_fenc_mmx
21621     0.1675  plane_expand_border
17903     0.1387  prefetch_ref_mmx
15300     0.1185  cabac_encode_bypass
12925     0.1001  mb_load_mv_direct8x8
12198     0.0945  cabac_mb_type
10696     0.0829  cabac_encode_terminal
6640      0.0514  .plt
6416      0.0497  ratecontrol_qp
6327      0.0490  cabac_mb_skip
6205      0.0481  cabac_mb_intra4x4_pred_mode
5187      0.0402  me_refine_qpel
4975      0.0385  slicetype_frame_cost
4682      0.0363  nal_encode
3963      0.0307  cabac_mb_intra_chroma_pred_mode
3162      0.0245  mb_transform_8x8_allowed
3081      0.0238  mb_cache_mv_*
2972      0.0230  cabac_mb_type_intra
1895      0.0147  cabac_encode_ue_bypass
-------------- next part --------------
x264 -q0 -m5 -Aall
CPU: Core 2, speed 2400.75 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        symbol name
5276007  20.0626  cabac_encode_decision
4416561  16.7944  block_residual_write_cabac
4061745  15.4452  pixel_sad_*_mmx
1739546   6.6148  pixel_avg_*_mmx
1418887   5.3955  mc_chroma_mmx
1309726   4.9804  me_search_ref
1252722   4.7636  refine_subpel
969617    3.6871  get_ref_mmx
745883    2.8363  hpel_filter_mmx
599794    2.2807  predict_intra_*
493527    1.8766  mb_analyse_inter_*
453190    1.7233  mb_analyse_intra_*
415030    1.5782  cabac_encode_bypass
406677    1.5464  mb_predict_mv_*
318068    1.2095  macroblock_cache_load
298472    1.1350  zigzag_sub_4x4_frame
296410    1.1271  macroblock_encode
255290    0.9708  macroblock_cache_save
233376    0.8874  macroblock_analyse
168506    0.6408  plane_copy_mmx
164371    0.6250  macroblock_write_cabac
139207    0.5293  mb_encode_8x8_chroma
126266    0.4801  zigzag_sub_4x4ac_frame
118014    0.4487  mc_copy_*_mmx
92060     0.3501  nal_encode
89158     0.3390  cabac_mb_mvd
49925     0.1898  analyse_update_cache
47560     0.1808  mb_mc_*
46217     0.1757  mb_encode_i*
42186     0.1604  mb_predict_intra4x4_mode
41823     0.1590  slices_write
34016     0.1293  plane_expand_border
28148     0.1070  prefetch_ref_mmx
21125     0.0803  mc_luma_mmx
19226     0.0731  prefetch_fenc_mmx
17731     0.0674  cabac_mb_type
16523     0.0628  cabac_encode_ue_bypass
14246     0.0542  me_refine_qpel
12995     0.0494  cabac_encode_terminal
12446     0.0473  cabac_mb_intra4x4_pred_mode
6383      0.0243  cabac_mb_skip
5553      0.0211  .plt
5109      0.0194  ratecontrol_qp
4191      0.0159  cabac_mb_intra_chroma_pred_mode
2727      0.0104  cabac_mb_type_intra

