[x264-devel] x264 performance measured on ARM Cortex-M4
Peter Lawrence
majbthrd at gmail.com
Thu Sep 4 05:12:03 CEST 2014
I wanted to share some possibly interesting x264 benchmark numbers for a
processor that I couldn't find any previous mention of on the mailing list.
I've been porting x264 to run on an ARM Cortex-M4 (ARMv7E-M instruction
set) to determine how viable it would be for low resolution, low frame
rate applications. The following figures might be of interest to
someone contemplating the same notion.
The Cortex-M4 does have some DSP instructions (for example, dual 16-bit
multiplies and additions in a single instruction) that are nowhere near
as useful to x264 as NEON would be, but might offer a small increase in
performance.  However, these tests use just the generic C code, without
any processor-specific optimizations (other than the one for
endian_swap32).
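(As an aside, and not something used in these tests: those dual 16-bit
operations are reachable from C through the CMSIS intrinsics.  A toy
dot product using __SMLAD might look like the sketch below; the header
name assumes a typical STM32 CMSIS setup, and none of this is code from
my port.)

#include <stdint.h>
#include <string.h>
#include "stm32f4xx.h"   /* device header; pulls in the CMSIS core
                            definitions that provide __SMLAD */

/* Toy dot product of two int16_t arrays (n assumed even) using the
 * Cortex-M4 dual 16-bit multiply-accumulate instruction (SMLAD). */
static int32_t dot16_dsp( const int16_t *a, const int16_t *b, int n )
{
    int32_t acc = 0;
    for( int i = 0; i < n; i += 2 )
    {
        uint32_t pa, pb;
        memcpy( &pa, &a[i], 4 );        /* pack two 16-bit lanes */
        memcpy( &pb, &b[i], 4 );
        acc = __SMLAD( pa, pb, acc );   /* acc += a[i]*b[i] + a[i+1]*b[i+1] */
    }
    return acc;
}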
Source material is the "carphone" video sequence at QCIF (176 x 144) in
I420 format.
x264 preset is "ultrafast" with the "zerolatency" tune; profile is "baseline".
I wanted to see if defeating the chroma encoding would save an
appreciable amount of CPU time. To my surprise, it actually seems to
make it worse.
Given these parameters:
param.rc.i_bitrate = 80
param.rc.i_vbv_max_bitrate = 80
param.rc.i_vbv_buffer_size = 100
param.rc.i_rc_method = X264_RC_ABR
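(For concreteness, the whole configuration amounts to roughly the
sketch below.  This is not my exact code; it just assumes the standard
x264_param_default_preset / x264_param_apply_profile entry points and
QCIF I420 input.)

#include <x264.h>

static x264_t *open_encoder( void )
{
    x264_param_t param;
    x264_param_default_preset( &param, "ultrafast", "zerolatency" );
    param.i_width  = 176;                       /* QCIF */
    param.i_height = 144;
    param.i_csp    = X264_CSP_I420;
    param.rc.i_rc_method       = X264_RC_ABR;
    param.rc.i_bitrate         = 80;            /* kbit/s */
    param.rc.i_vbv_max_bitrate = 80;
    param.rc.i_vbv_buffer_size = 100;
    x264_param_apply_profile( &param, "baseline" );
    return x264_encoder_open( &param );
}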
Here is how many million processor clock cycles were consumed for each
of the first ten frames of the "carphone" video sequence:
chroma ostensibly defeated (and analyse.b_chroma_me = 0)
22 27 26 30 30 27 30 29 31 31
default mode
24 18 20 21 22 22 26 22 27 25
(For default mode with analyse.b_chroma_me = 0, the numbers are
essentially those shown for "default mode".)
The fact that disabling chroma made things worse made no sense to me.
Then I excitedly thought that perhaps disabling it freed enough bits for
the rate control to lower the quantization and thus decrease the number
of zeros in the DCT coefficients (perhaps increasing the work for the
VLC encoder), so I tried another test in constant quality mode:
param.rc.i_qp_constant = 37
param.rc.i_rc_method = X264_RC_CQP
Here is how many million processor clock cycles were consumed for each
of the first ten frames of the "carphone" video sequence:
chroma ostensibly defeated (and analyse.b_chroma_me = 0)
18 22 23 23 23 22 22 22 23 24
default mode
20 16 13 16 15 14 17 16 18 19
So, under this (flawed?) test setup, it seems worse to defeat chroma in
both rate control scenarios.
The output bitstream with chroma ostensibly defeated is marginally
smaller (9238 bytes versus 9262 bytes) in the constant quality mode.
(Comparing file sizes in a rate control mode seemed pointless, since the
rate control should rightfully use any spare bits.) Moreover, the
output supposedly without chroma decodes in VLC (VideoLAN Client) as
grayscale images (versus color for the normal mode).
Surely my method of defeating chroma must be flawed. I did it by
forcing "chroma" to zero inside x264_macroblock_encode_internal(), which
*looked* like it would disable computing the chroma DCTs and generation
of the associated VLC coding.
Specifics of the test environment are:
Target is a STM32F429 Discovery Kit (180MHz STM32F429 with 8MBytes of
SDRAM).
The code is running on "bare-metal" (i.e. not some Linux "embedded"
application).
I needed practically all of the 8MBytes of SDRAM to hold all the x264
data structures. The STM32F429 has a sizable 2MBytes of flash, so I
used the upper 1MByte to store 27 frames of the "carphone" video
sequence and the lower half for code storage. (The video frames were
memcpy-ed from flash to the x264_picture_t structure prior to each
x264_encoder_encode() call, rather than accessing them directly in place.)
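(The per-frame flow was roughly the sketch below.  The flash address is
illustrative, and the whole-plane memcpy assumes the unpadded strides
that x264_picture_alloc() gives for I420.)

#include <stdint.h>
#include <string.h>
#include <x264.h>

#define QCIF_W   176
#define QCIF_H   144
#define Y_SIZE   (QCIF_W * QCIF_H)         /* 25344 bytes            */
#define C_SIZE   (Y_SIZE / 4)              /*  6336 bytes            */
#define FRAME_SZ (Y_SIZE + 2 * C_SIZE)     /* 38016 bytes per frame  */

/* Frames stored in the upper 1MByte of flash (address is illustrative;
 * STM32F4 flash starts at 0x08000000, so the upper half of a 2MByte
 * part begins at 0x08100000). */
static const uint8_t *flash_frames = (const uint8_t *)0x08100000;

static void encode_all( x264_t *enc, int n_frames )
{
    x264_picture_t pic, pic_out;
    x264_nal_t *nal;
    int i_nal;

    x264_picture_alloc( &pic, X264_CSP_I420, QCIF_W, QCIF_H );
    for( int i = 0; i < n_frames; i++ )
    {
        const uint8_t *src = flash_frames + (size_t)i * FRAME_SZ;
        /* copy the Y, U and V planes out of flash into SDRAM */
        memcpy( pic.img.plane[0], src,                   Y_SIZE );
        memcpy( pic.img.plane[1], src + Y_SIZE,          C_SIZE );
        memcpy( pic.img.plane[2], src + Y_SIZE + C_SIZE, C_SIZE );
        pic.i_pts = i;
        x264_encoder_encode( enc, &nal, &i_nal, &pic, &pic_out );
        /* nal[0..i_nal-1] holds the encoded frame */
    }
    x264_picture_clean( &pic );
}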
Code was compiled using Clang v3.4.1 (-O1 optimization) in the Rowley
Crossworks for ARM development environment.
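For reference, per-frame cycle counts like these can be taken with the
Cortex-M4's DWT cycle counter.  A minimal sketch, assuming the usual
CMSIS device header:

#include <stdint.h>
#include "stm32f4xx.h"   /* CMSIS definitions for DWT and CoreDebug */

/* Enable the DWT cycle counter once at startup. */
static void cyccnt_init( void )
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable trace block */
    DWT->CYCCNT = 0;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start counting */
}

/* Cycles elapsed since an earlier DWT->CYCCNT snapshot. */
static inline uint32_t cyccnt_since( uint32_t start )
{
    return DWT->CYCCNT - start;
}

Snapshotting DWT->CYCCNT just before each x264_encoder_encode() call
and calling cyccnt_since() afterwards gives per-frame figures; at
180MHz the 32-bit counter only wraps after roughly 23 seconds, so a
single frame fits comfortably.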