[x264-devel] x264 performance measured on ARM Cortex-M4
Peter Lawrence
majbthrd at gmail.com
Thu Sep 4 05:12:03 CEST 2014
I wanted to share some possibly interesting x264 benchmark numbers for a
processor that I couldn't find any previous mention of on the mailing list.
I've been porting x264 to run on an ARM Cortex-M4 (ARMv7E-M instruction
set) to determine how viable it would be for low resolution, low frame
rate applications. The following figures might be of interest to
someone contemplating the same notion.
The Cortex-M4 does have some DSP instructions (for example, dual 16-bit
multiplies and additions in a single instruction) that are nowhere near
as useful to x264 as NEON would be, but might offer a small increase in
performance.  However, these tests use just the generic C code, without
any processor-specific optimizations (other than the one for
endian_swap32).
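(As an aside, and not something used in these tests: those dual 16-bit
operations are reachable from C through the CMSIS intrinsics.  A toy
dot product using __SMLAD might look like the sketch below; the header
name assumes a typical STM32 CMSIS setup, and none of this is code from
my port.)

#include <stdint.h>
#include <string.h>
#include "stm32f4xx.h"   /* device header; pulls in the CMSIS core
                            definitions that provide __SMLAD */

/* Toy dot product of two int16_t arrays (n assumed even) using the
 * Cortex-M4 dual 16-bit multiply-accumulate instruction (SMLAD). */
static int32_t dot16_dsp( const int16_t *a, const int16_t *b, int n )
{
    int32_t acc = 0;
    for( int i = 0; i < n; i += 2 )
    {
        uint32_t pa, pb;
        memcpy( &pa, &a[i], 4 );        /* pack two 16-bit lanes */
        memcpy( &pb, &b[i], 4 );
        acc = __SMLAD( pa, pb, acc );   /* acc += a[i]*b[i] + a[i+1]*b[i+1] */
    }
    return acc;
}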
Source material is the "carphone" video sequence at QCIF (176 x 144) in
I420 format.
x264 preset is "ultrafast" with the "zerolatency" tune; profile is "baseline".
I wanted to see if defeating the chroma encoding would save an
appreciable amount of CPU time. To my surprise, it actually seems to
make it worse.
Given these parameters:
param.rc.i_bitrate = 80
param.rc.i_vbv_max_bitrate = 80
param.rc.i_vbv_buffer_size = 100
param.rc.i_rc_method = X264_RC_ABR
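(For concreteness, the whole configuration amounts to roughly the
sketch below.  This is not my exact code; it just assumes the standard
x264_param_default_preset / x264_param_apply_profile entry points and
QCIF I420 input.)

#include <x264.h>

static x264_t *open_encoder( void )
{
    x264_param_t param;
    x264_param_default_preset( &param, "ultrafast", "zerolatency" );
    param.i_width  = 176;                       /* QCIF */
    param.i_height = 144;
    param.i_csp    = X264_CSP_I420;
    param.rc.i_rc_method       = X264_RC_ABR;
    param.rc.i_bitrate         = 80;            /* kbit/s */
    param.rc.i_vbv_max_bitrate = 80;
    param.rc.i_vbv_buffer_size = 100;
    x264_param_apply_profile( &param, "baseline" );
    return x264_encoder_open( &param );
}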
Here is how many million processor clock cycles were consumed for each
of the first ten frames of the "carphone" video sequence:
chroma ostensibly defeated (and analyse.b_chroma_me = 0)
22 27 26 30 30 27 30 29 31 31
default mode
24 18 20 21 22 22 26 22 27 25
(For default mode with analyse.b_chroma_me = 0, the numbers are
essentially those shown for "default mode".)
The fact that disabling chroma made things worse made no sense to me.
Then I excitedly thought that perhaps disabling it freed enough bits for
the rate control to lower the quantization and thus decrease the number
of zeros in the DCT coefficients (perhaps increasing the work for the
VLC encoder), so I tried another test in constant quality mode:
param.rc.i_qp_constant = 37
param.rc.i_rc_method = X264_RC_CQP
Here is how many million processor clock cycles were consumed for each
of the first ten frames of the "carphone" video sequence:
chroma ostensibly defeated (and analyse.b_chroma_me = 0)
18 22 23 23 23 22 22 22 23 24
default mode
20 16 13 16 15 14 17 16 18 19
So, under this (flawed?) test setup, it seems worse to defeat chroma in
both rate control scenarios.
The output bitstream with chroma ostensibly defeated is marginally
smaller (9238 bytes versus 9262 bytes) in the constant quality mode.
(Comparing file sizes in a rate control mode seemed pointless, since the
rate control should rightfully use any spare bits.) Moreover, the
output supposedly without chroma decodes in VLC (VideoLAN Client) as
grayscale images (versus color for the normal mode).
Surely my method of defeating chroma must be flawed. I did it by
forcing "chroma" to zero inside x264_macroblock_encode_internal(), which
*looked* like it would disable computing the chroma DCTs and generation
of the associated VLC coding.
Specifics of the test environment are:
Target is a STM32F429 Discovery Kit (180MHz STM32F429 with 8MBytes of
SDRAM).
The code is running on "bare-metal" (i.e. not some Linux "embedded"
application).
I needed practically all of the 8MBytes of SDRAM to hold all the x264
data structures. The STM32F429 has a sizable 2MBytes of flash, so I
used the upper 1MByte to store 27 frames of the "carphone" video
sequence and the lower half for code storage. (The video frames were
memcpy-ed from flash to the x264_picture_t structure prior to each
x264_encoder_encode() call, rather than accessing them directly in place.)
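(The per-frame flow was roughly the sketch below.  The flash address is
illustrative, and the whole-plane memcpy assumes the unpadded strides
that x264_picture_alloc() gives for I420.)

#include <stdint.h>
#include <string.h>
#include <x264.h>

#define QCIF_W   176
#define QCIF_H   144
#define Y_SIZE   (QCIF_W * QCIF_H)         /* 25344 bytes            */
#define C_SIZE   (Y_SIZE / 4)              /*  6336 bytes            */
#define FRAME_SZ (Y_SIZE + 2 * C_SIZE)     /* 38016 bytes per frame  */

/* Frames stored in the upper 1MByte of flash (address is illustrative;
 * STM32F4 flash starts at 0x08000000, so the upper half of a 2MByte
 * part begins at 0x08100000). */
static const uint8_t *flash_frames = (const uint8_t *)0x08100000;

static void encode_all( x264_t *enc, int n_frames )
{
    x264_picture_t pic, pic_out;
    x264_nal_t *nal;
    int i_nal;

    x264_picture_alloc( &pic, X264_CSP_I420, QCIF_W, QCIF_H );
    for( int i = 0; i < n_frames; i++ )
    {
        const uint8_t *src = flash_frames + (size_t)i * FRAME_SZ;
        /* copy the Y, U and V planes out of flash into SDRAM */
        memcpy( pic.img.plane[0], src,                   Y_SIZE );
        memcpy( pic.img.plane[1], src + Y_SIZE,          C_SIZE );
        memcpy( pic.img.plane[2], src + Y_SIZE + C_SIZE, C_SIZE );
        pic.i_pts = i;
        x264_encoder_encode( enc, &nal, &i_nal, &pic, &pic_out );
        /* nal[0..i_nal-1] holds the encoded frame */
    }
    x264_picture_clean( &pic );
}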
Code was compiled using Clang v3.4.1 (-O1 optimization) in the Rowley
Crossworks for ARM development environment.
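For reference, per-frame cycle counts like these can be taken with the
Cortex-M4's DWT cycle counter.  A minimal sketch, assuming the usual
CMSIS device header:

#include <stdint.h>
#include "stm32f4xx.h"   /* CMSIS definitions for DWT and CoreDebug */

/* Enable the DWT cycle counter once at startup. */
static void cyccnt_init( void )
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable trace block */
    DWT->CYCCNT = 0;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start counting */
}

/* Cycles elapsed since an earlier DWT->CYCCNT snapshot. */
static inline uint32_t cyccnt_since( uint32_t start )
{
    return DWT->CYCCNT - start;
}

Snapshotting DWT->CYCCNT just before each x264_encoder_encode() call
and calling cyccnt_since() afterwards gives per-frame figures; at
180MHz the 32-bit counter only wraps after roughly 23 seconds, so a
single frame fits comfortably.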