[x264-devel] add optimised functions (hpel_filter, ssd, ssim)

Guillaume POIRIER poirierg at gmail.com
Sat Nov 24 15:26:29 CET 2007


Hello,

On Nov 24, 2007 2:21 PM, Firas Al-Tahan <firearse at gmail.com> wrote:
> So with the continued support for G4 PPC / G5 - is it still worth holding
> onto that dual G4 tower?!

If it suits your needs, there's no need to change it, right? ;-)


> Is there much to optimize with regards to AltiVec?

Let's have a look at a profile of x264 with reasonably advanced settings:

# Report 0 - SSD Altivec - Advanced options - Time Profile (All Thread
States) of x264.mshark - Time Profile (All Thread States) of x264
[G5.local]
SharkProfileViewer
# Generated from the visible portion of the outline view
- 9.6% mc_chroma_altivec (x264)
- 8.0% pixel_satd_8x8_altivec (x264)
- 7.5% get_ref_altivec (x264)
- 6.3% pixel_satd_16x16_altivec (x264)
- 5.6% pixel_satd_4x4_altivec (x264)
- 4.8% block_residual_write_cabac (x264)
- 3.7% x264_add4x4_idct_altivec (x264)
- 3.6% x264_macroblock_encode (x264)
- 3.5% pixel_sad_x4_16x16_altivec (x264)
- 2.6% x264_cabac_size_decision (x264)
- 2.5% pixel_sad_16x16_altivec (x264)
- 2.4% pixel_sad_x3_16x16_altivec (x264)
- 2.2% x264_mb_encode_8x8_chroma (x264)
- 2.1% x264_me_search_ref (x264)
- 2.0% read (libSystem.B.dylib)
- 1.7% x264_quant_4x4_altivec (x264)
- 1.6% x264_dequant_4x4_altivec (x264)
- 1.5% pixel_sad_x4_8x8_altivec (x264)
- 1.4% x264_hpel_filter_altivec (x264)
- 1.3% pixel_sad_x3_8x8_altivec (x264)
- 1.2% refine_subpel (x264)
- 1.2% predict_16x16_p (x264)
- 1.1% x264_pixel_ssd_8x8 (x264)
- 1.1% x264_sub8x8_dct_altivec (x264)
- 1.1% x264_sub4x4_dct_altivec (x264)
- 1.0% 0xffff88c8 [536B] (Unknown Library)
- 1.0% pixel_sad_8x8_altivec (x264)
- 0.8% zigzag_scan_4x4ac_frame (x264)
- 0.8% predict_8x8c_p (x264)
- 0.8% x264_cabac_encode_decision (x264)
- 0.6% x264_mb_analyse_intra (x264)
- 0.6% zigzag_scan_4x4_frame (x264)
- 0.6% x264_pixel_ssd_8x8_4x4 (x264)
- 0.5% x264_macroblock_cache_load (x264)
- 0.4% x264_macroblock_analyse (x264)
  0.4% x264_deblock_h_luma_altivec (x264)
- 0.4% x264_frame_init_lowres (x264)
- 0.4% x264_slicetype_mb_cost (x264)
- 0.4% x264_rd_cost_i4x4 (x264)
  0.4% x264_frame_deblock_row (x264)
- 0.4% pixel_ssd_16x16_altivec (x264)
- 0.4% deblock_v_chroma_c (x264)
- 0.3% mc_copy_w16 (x264)
- 0.3% deblock_v_luma_intra_c (x264)
- 0.3% x264_macroblock_size_cabac (x264)
- 0.3% write (libSystem.B.dylib)
- 0.3% pixel_satd_8x4_altivec (x264)
  0.3% deblock_h_luma_intra_c (x264)
- 0.2% x264_quant_2x2_dc_altivec (x264)
- 0.2% x264_mb_predict_intra4x4_mode (x264)
- 0.2% x264_mb_encode_i4x4 (x264)
- 0.2% x264_mb_analyse_intra_chroma (x264)
- 0.2% x264_macroblock_write_cabac (x264)
  0.2% x264_macroblock_cache_save (x264)
- 0.2% predict_8x8c_dc (x264)
- 0.2% predict_4x4_vr (x264)
- 0.2% predict_4x4_vl (x264)
- 0.2% pixel_satd_8x16_altivec (x264)
- 0.2% pixel_satd_16x8_altivec (x264)
- 0.2% mc_copy_w8 (x264)
- 0.2% 0xffff8600 [276B] (Unknown Library)
- 0.2% x264_mb_predict_mv_ref16x16 (x264)
- 0.2% x264_analyse_update_cache (x264)
- 0.2% ssim_4x4x2_core_altivec (x264)
- 0.2% predict_4x4_hu (x264)
- 0.2% predict_4x4_ddr (x264)
- 0.2% predict_4x4_dc (x264)
  0.2% deblock_v_chroma_intra_c (x264)
- 0.2% deblock_h_chroma_c (x264)
- 0.2% predict_4x4_hd (x264)
- 0.2% 0xffff8ae4 [148B] (Unknown Library)
- 0.1% x264_mb_predict_mv (x264)
  0.1% x264_mb_dequant_2x2_dc (x264)
- 0.1% x264_intra_rd_refine (x264)
- 0.1% x264_cabac_mb_mvd (x264)
  0.1% x264_cabac_mb_intra4x4_pred_mode (x264)
- 0.1% predict_4x4_ddl (x264)
- 0.1% pixel_sad_16x8_altivec (x264)
- 0.1% dct4x4dc (x264)
- 0.1% x264_slice_write (x264)
- 0.1% x264_rd_cost_mb (x264)
- 0.1% x264_predict_8x8_filter (x264)
- 0.1% x264_mb_dequant_4x4_dc (x264)
- 0.1% x264_cabac_encode_bypass (x264)
- 0.1% predict_8x8_vr (x264)
- 0.1% predict_8x8_vl (x264)
- 0.1% predict_8x8_hd (x264)
- 0.1% predict_16x16_h (x264)
- 0.1% predict_16x16_dc (x264)
- 0.1% pixel_satd_4x8_altivec (x264)
- 0.1% pixel_sad_x4_8x16_altivec (x264)
- 0.1% pixel_sad_8x16_altivec (x264)
- 0.1% 0xffff8780 [208B] (Unknown Library)
- 0.1% x264_mb_analyse_p_rd (x264)
  0.1% x264_mb_analyse_inter_p16x16 (x264)
  0.1% x264_deblock_v_luma_altivec (x264)
- 0.1% ssim_end4 (x264)
- 0.1% predict_8x8c_h (x264)
- 0.1% predict_8x8_hu (x264)
- 0.1% predict_8x8_ddl (x264)
- 0.1% predict_4x4_h (x264)
- 0.1% pixel_sad_x4_16x8_altivec (x264)
- 0.1% pixel_sad_x3_8x16_altivec (x264)
- 0.1% memcpy (libSystem.B.dylib)
- 0.1% dyld_stub_memset (x264)
- 0.1% dct2x2dc (x264)
  0.0% x264_slicetype_frame_cost (x264)
  0.0% x264_rd_cost_i8x8_chroma (x264)
- 0.0% x264_quant_4x4_dc_altivec (x264)
- 0.0% x264_pixel_ssd_wxh (x264)
  0.0% x264_mb_predict_mv_16x16 (x264)
  0.0% x264_cabac_mb_type (x264)
- 0.0% x264_cabac_encode_terminal (x264)
  0.0% x264_add16x16_idct_altivec (x264)
- 0.0% prefetch_fenc_null (x264)
- 0.0% predict_16x16_v (x264)
  0.0% pixel_sad_x3_16x8_altivec (x264)
- 0.0% memset (libSystem.B.dylib)
  0.0% mc_luma_altivec (x264)
- 0.0% dyld_stub_memcpy (x264)
  0.0% deblock_h_chroma_intra_c (x264)


As you can see, the non-Altivec routines already take up little of the
time; I'd say that x264 has pretty much all of its important
routines ported to Altivec, which is quite nice.

There are two things that can be done to improve performance on PPC machines:

* Audit the "hot" Altivec routines, and see if there's something that
can be done to make them even faster, e.g. smarter memory accesses
or fewer instructions.

* From the profile list, take these 4 items, each of which takes up
more than 0.2% of overall time:
- 1.2% predict_16x16_p
- 0.8% predict_8x8c_p
- 0.4% deblock_v_chroma_c (x264)
  0.3% deblock_h_luma_intra_c (x264)

... and write an Altivec version of each (all of these have an MMX/SSE
counterpart).
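
For anyone who wants to build that kind of shortlist from a Shark text
export rather than by eye, here is a minimal sketch. It assumes the
report lines have the "percent symbol (library)" shape shown above; the
helper names parse_profile and scalar_candidates are made up for this
example, and SAMPLE is only a handful of lines copied from the profile,
not the whole report. Note that the filter only finds x264 routines
without an _altivec suffix; deciding which of those have an MMX/SSE
counterpart worth porting still has to be done by hand.

```python
import re

# A few lines copied from the Shark report above (not the full profile).
SAMPLE = """\
- 9.6% mc_chroma_altivec (x264)
- 4.8% block_residual_write_cabac (x264)
- 2.0% read (libSystem.B.dylib)
- 1.2% predict_16x16_p (x264)
- 0.8% predict_8x8c_p (x264)
- 0.4% deblock_v_chroma_c (x264)
  0.3% deblock_h_luma_intra_c (x264)
- 0.2% predict_4x4_dc (x264)
"""

# Optional "- " outline marker, a percentage, a symbol, then "(library)".
# Lines that do not fit this shape (e.g. raw addresses) are skipped.
LINE_RE = re.compile(r"^[-\s]*(\d+(?:\.\d+)?)%\s+(\S+)\s+\((.+)\)\s*$")


def parse_profile(text):
    """Return a list of (percent, symbol, library) tuples."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            rows.append((float(m.group(1)), m.group(2), m.group(3)))
    return rows


def scalar_candidates(rows, min_percent=0.2):
    """x264 symbols without an _altivec suffix, strictly above min_percent."""
    return [
        (pct, sym)
        for pct, sym, lib in rows
        if lib == "x264" and not sym.endswith("_altivec") and pct > min_percent
    ]


if __name__ == "__main__":
    for pct, sym in scalar_candidates(parse_profile(SAMPLE)):
        print(f"{pct:4.1f}%  {sym}")
```

On this sample the filter also returns block_residual_write_cabac, which
is CABAC bitstream code rather than one of the four routines picked
above, so the output is a starting point for manual pruning and not a
to-do list.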


Other than that, keep in mind that today's Intel CPUs are simply faster
than the G4/G5, so even with extensive optimization, our PPCs won't ever
be as fast as today's Intel CPUs. :-(

Guillaume
-- 
A soldier will fight long and hard for a bit of colored ribbon.
 -- Napoleon Bonaparte


