[x264-devel] x264 performance measured on ARM Cortex-M4

Leon Woestenberg leon.woestenberg at gmail.com
Mon Sep 22 14:56:38 CEST 2014


Peter,

Interesting benchmark. Do you also have fps figures that take memory access
into account?

Did you need to modify Makefiles / compiler settings much, and if so, are
you able to share these modifications?

Regards,

Leon

On Thu, Sep 4, 2014 at 5:12 AM, Peter Lawrence <majbthrd at gmail.com> wrote:

> I wanted to share some possibly interesting x264 benchmark numbers for a
> processor that I couldn't find any previous mention of on the mailing list.
>
> I've been porting x264 to run on an ARM Cortex-M4 (ARMv7E-M instruction
> set) to determine how viable it would be for low resolution, low frame rate
> applications.  The following figures might be of interest to someone
> contemplating the same notion.
>
> The Cortex-M4 does have some DSP instructions that are nowhere near as
> useful as NEON is for x264, but might offer some small increase in
> performance.  (For example, dual 16-bit multiplies and additions per
> instruction.)  However, these tests are just with the generic C code
> without using any processor-specific optimizations (other than the one for
> endian_swap32).
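
To make the "dual 16-bit multiplies and additions per instruction" remark concrete, here is a minimal portable C model of what a dual multiply-accumulate such as SMLAD computes. This is only a sketch: the names pack16x2, smlad_model and dot16 are hypothetical and appear nowhere in x264, the model ignores overflow and the Q flag, and on a real Cortex-M4 one would presumably use the compiler's or CMSIS's intrinsic rather than this emulation.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word (lo in bits 15:0). */
uint32_t pack16x2(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Portable model of a dual 16-bit multiply-accumulate:
 *   acc + lo(a)*lo(b) + hi(a)*hi(b)
 * Overflow behaviour of the real instruction is NOT modelled. */
int32_t smlad_model(uint32_t a, uint32_t b, int32_t acc)
{
    /* Narrowing to int16_t relies on the usual two's-complement behaviour. */
    int16_t a_lo = (int16_t)(a & 0xFFFFu);
    int16_t a_hi = (int16_t)(a >> 16);
    int16_t b_lo = (int16_t)(b & 0xFFFFu);
    int16_t b_hi = (int16_t)(b >> 16);
    return acc + (int32_t)a_lo * b_lo + (int32_t)a_hi * b_hi;
}

/* Dot product of n 16-bit samples, consuming two per step; n must be even. */
int32_t dot16(const int16_t *x, const int16_t *y, int n)
{
    assert(n >= 0 && n % 2 == 0);
    int32_t acc = 0;
    for (int i = 0; i < n; i += 2)
        acc = smlad_model(pack16x2(x[i], x[i + 1]),
                          pack16x2(y[i], y[i + 1]), acc);
    return acc;
}
```

The point is only that one such instruction retires two multiplies and two additions, so a 16-bit dot product needs half as many multiply-accumulate steps as the scalar C loop.
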
>
> Source material is the "carphone" video sequence at QCIF (176 x 144) in
> I420 format.
>
> x264 presets are "ultrafast" plus "zerolatency"; profile is "baseline".
>
> I wanted to see if defeating the chroma encoding would save an appreciable
> amount of CPU time.  To my surprise, it actually seems to make it worse.
>
> Given these parameters:
>
> param.rc.i_bitrate = 80
> param.rc.i_vbv_max_bitrate = 80
> param.rc.i_vbv_buffer_size = 100
> param.rc.i_rc_method = X264_RC_ABR
>
> Here is how many million processor clock cycles were consumed for each of
> the first ten frames of the "carphone" video sequence:
>
> chroma ostensibly defeated (and analyse.b_chroma_me = 0)
> 22 27 26 30 30 27 30 29 31 31
>
> default mode
> 24 18 20 21 22 22 26 22 27 25
>
> (For default mode with analyse.b_chroma_me = 0, the numbers are
> essentially those shown for "default mode".)
>
> That disabling chroma made things worse made no sense to me.  Then, I
> excitedly thought perhaps disabling it freed enough bits to cause the rate
> control to reduce the quantization and thus decrease the number of zeros in
> the DCT (perhaps increasing the complexity for the VLC encoder), so I tried
> another test in constant quality mode:
>
> param.rc.i_qp_constant = 37
> param.rc.i_rc_method = X264_RC_CQP
>
> Here is how many million processor clock cycles were consumed for each of
> the first ten frames of the "carphone" video sequence:
>
> chroma ostensibly defeated (and analyse.b_chroma_me = 0)
> 18 22 23 23 23 22 22 22 23 24
>
> default mode
> 20 16 13 16 15 14 17 16 18 19
>
> So, under this (flawed?) test setup, it seems worse to defeat chroma in
> both rate control scenarios.
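
The per-frame cycle counts above can be turned into rough frame rates with a few lines of C. This is a sketch under stated assumptions: it uses the 180 MHz clock quoted for the STM32F429 in the test environment description, the function and array names are hypothetical, and it is not stated whether the cycle counts include the copy of each frame from flash to SDRAM.

```c
#include <assert.h>
#include <stddef.h>

/* Mean cost per frame, in millions of cycles, over n samples. */
double mean_mcycles(const unsigned *mcycles, size_t n)
{
    assert(n > 0);
    unsigned long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += mcycles[i];
    return (double)sum / (double)n;
}

/* Frames per second = clock (MHz) / cost (millions of cycles per frame). */
double fps_at(double clock_mhz, double mcycles_per_frame)
{
    assert(mcycles_per_frame > 0.0);
    return clock_mhz / mcycles_per_frame;
}

/* Figures copied from the four lists in the quoted message. */
const unsigned abr_no_chroma[10] = {22, 27, 26, 30, 30, 27, 30, 29, 31, 31};
const unsigned abr_default[10]   = {24, 18, 20, 21, 22, 22, 26, 22, 27, 25};
const unsigned cqp_no_chroma[10] = {18, 22, 23, 23, 23, 22, 22, 22, 23, 24};
const unsigned cqp_default[10]   = {20, 16, 13, 16, 15, 14, 17, 16, 18, 19};
```

Calling fps_at(180.0, mean_mcycles(series, 10)) on each series gives roughly 6.4 and 7.9 fps for the ABR runs (chroma defeated, default) and roughly 8.1 and 11.0 fps for the constant-QP runs, averaged over the ten frames listed.
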
>
> The output bitstream with chroma ostensibly defeated is marginally smaller
> (9238 bytes versus 9262 bytes) in the constant quality mode. (Comparing
> file sizes in a rate control mode seemed pointless, since the rate control
> should rightfully use any spare bits.)  Moreover, the output supposedly
> without chroma decodes in VideoLAN Client as grayscale images (versus
> color for the normal mode).
>
> Surely my method of defeating chroma must be flawed.  I did it by forcing
> "chroma" to zero inside x264_macroblock_encode_internal(), which *looked*
> like it would disable computing the chroma DCTs and generation of the
> associated VLC coding.
>
> Specifics of the test environment are:
>
> Target is a STM32F429 Discovery Kit (180MHz STM32F429 with 8MBytes of
> SDRAM).
>
> The code is running on "bare-metal" (i.e. not as an application under some
> "embedded" Linux).
>
> I needed practically all of the 8MBytes of SDRAM to hold all the x264 data
> structures.  The STM32F429 has a sizable 2MBytes of flash, so I used the
> upper 1MByte to store 27 frames of the "carphone" video sequence and the
> lower half for code storage.  (The video frames were memcpy-ed from flash
> to the x264_picture_t structure prior to each x264_encoder_encode() call,
> rather than accessing them directly in place.)
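
The figure of 27 frames in 1 MByte of flash can be checked with a short calculation. This is a minimal sketch, assuming I420 means one full-resolution luma plane plus two chroma planes subsampled by two in each direction, and assuming "1MByte" means 2^20 bytes; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in one I420 frame: full-resolution luma plus two chroma planes,
 * each subsampled by 2 horizontally and vertically. */
size_t i420_frame_bytes(size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    size_t luma = width * height;
    size_t chroma = (width / 2) * (height / 2);
    return luma + 2 * chroma;
}

/* Number of whole frames that fit in a storage region. */
size_t frames_that_fit(size_t region_bytes, size_t frame_bytes)
{
    assert(frame_bytes > 0);
    return region_bytes / frame_bytes;
}
```

For QCIF, i420_frame_bytes(176, 144) is 38016 bytes, and frames_that_fit(1024 * 1024, 38016) is 27, so 27 frames is the most that fits in the upper half of the flash under these assumptions.
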
>
> Code was compiled using Clang v3.4.1 (-O1 optimization) in the Rowley
> Crossworks for ARM development environment.
>



-- 
Leon