[x264-devel] x264 performance measured on ARM Cortex-M4

Peter Lawrence majbthrd at gmail.com
Tue Sep 23 01:19:21 CEST 2014


Leon,

the numbers shown represent total time expressed in PCLK (processor 
clock) cycles (shown in millions); in this case, each PCLK cycle is 
1/(180 MHz), about 5.6 ns.  This total includes all subsystem delays.  
These numbers were obtained using the "DWT_CYCCNT" register available 
on some Cortex-M processors.

Using the existing Makefile was a non-starter.  There are two important 
distinctions between x264 and my experiment:

1) x264 is designed as a command-line tool, and the build environment 
and code are tailored to this.
2) x264, as written, expects a number of PC-like operating system features.

On #1, I suspect that if one were targeting a Linux platform using a 
Cortex-M3/M4 processor, one could re-use the build environment and code 
with far fewer changes.  (Note: expect the code to run slower and 
require much more memory due to the co-existence with the OS.)

The only reason I didn't previously send out all the changes I made is 
that I didn't want to offend the x264 developers.  I would feel like a 
houseguest marching into a home and rearranging the furniture.  You'll
see what I mean in my list below.  The x264 code is written the way that 
it is to support the 99.9% of users, and I'm not sure what would be a 
sensible way to support #1 and #2 above without introducing a confusing 
sea of #ifdef in the code.

My interpretation of the x264 code is that any and all platform-specific 
changes should primarily be in common.c, osdep.c, and osdep.h.

common.c :
replace/gut-out x264_log(), x264_log_default(), x264_malloc(), 
x264_free(), and x264_slurp_file()
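As a concrete example of the gutting, the allocator hooks can be reduced 
to a bump allocator over one static pool (a rough sketch, not production 
code; the pool size is a placeholder, and the 32-byte alignment is a 
conservative stand-in for the SIMD-friendly alignment x264_malloc() 
normally provides):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical bare-metal stand-ins for the common.c allocator hooks:
 * a bump allocator over one static pool.  free() is a no-op, which is
 * tolerable for a short benchmark run where memory is reclaimed at
 * reset. */
#define POOL_SIZE (1u << 20)    /* placeholder; size to the target's SDRAM */
static uint8_t mem_pool[POOL_SIZE] __attribute__((aligned(32)));
static size_t  mem_used;

void *x264_malloc(int size)
{
    size_t base = (mem_used + 31u) & ~(size_t)31u;  /* 32-byte align */
    if (size < 0 || base + (size_t)size > POOL_SIZE)
        return NULL;                                /* pool exhausted */
    mem_used = base + (size_t)size;
    return &mem_pool[base];
}

void x264_free(void *p)
{
    (void)p;                    /* no-op by design */
}
```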

osdep.c :
comment out #include <sys/time.h>

osdep.h :
comment out #include <sys/stat.h>
add #include <stdint.h>
add "const" to the lut[] declarations in x264_ctz_4bit(), x264_clz(), 
and x264_ctz() (these seem applicable to any platform)
gut out x264_is_regular_file() and x264_is_regular_file_path() (these 
assume a PC-like operating system)
kludge: since ARM_ARCH and HAVE_ARMV6 were only compatible with the 
platform within osdep.h, I put a #define of both at the beginning of 
osdep.h and a matching #undef at the end; this ensures the #ifdef 
blocks guarded by them elsewhere in the code are compiled out
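In sketch form the kludge looks like this (the values shown are 
illustrative; they are the macros the build system would normally 
provide globally):

```c
/* top of osdep.h: scoped platform defines (the kludge) */
#define ARM_ARCH   7
#define HAVE_ARMV6 1

/* ... the existing body of osdep.h goes here ... */

/* bottom of osdep.h: matching #undef, so #ifdef tests in every other
 * translation unit see the macros as absent and those blocks drop out */
#undef ARM_ARCH
#undef HAVE_ARMV6
```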

In addition to the expected files to change, I had to make the following 
additional changes:

common-cabac.c :
section attributes for x264_cabac_contexts[] to place it in SDRAM

common-set.c :
gut out x264_cqm_parse_file() (it assumes a PC-like operating system)

vlc.c :
section attributes for x264_run_before[] to place it in SDRAM
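Those section attributes are nothing exotic: a GCC/Clang 
__attribute__((section(...))) on the array plus a matching output 
section in the linker script.  A sketch (the ".sdram" name and the 
table are placeholders standing in for x264_cabac_contexts[] and 
x264_run_before[]):

```c
#include <stdint.h>

/* Sketch: push a large lookup table out of on-chip RAM into external
 * SDRAM via a named linker section.  ".sdram" must be mapped to the
 * SDRAM region in the linker script for this to take effect. */
static uint8_t big_table[64 * 1024] __attribute__((section(".sdram")));
```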

encoder.c :
remove x264_frame_dump()
gut out param.psz_dump_yuv code in x264_encoder_open() and 
x264_encoder_frame_end()
remove call to x264_macroblock_tree_read()

ratecontrol.c :
remove x264_macroblock_tree_read()
gut out param.rc.b_stat_read and param.rc.b_stat_write code in 
x264_ratecontrol_new()
gut out param.rc.b_stat_write code in x264_ratecontrol_end()

Beyond the changes listed above, I think there is an opportunity in a 
more optimized embedded scenario to remove lots of code that would 
never be used.  There are many if (param.<something>) statements that 
the compiler can't prove will always evaluate as false, but some sort 
of human or preprocessor intervention could remove them.  Again, 
though, I don't see how to do this reasonably without introducing a sea 
of extra #ifdef.
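One possible half-measure (hypothetical; this is not in the x264 tree) 
is to funnel the param tests through a macro that an embedded build pins 
to a constant, so the optimizer deletes the dead branches and everything 
only they reference:

```c
/* Hypothetical: route run-time param tests through a macro so an
 * embedded build can pin them to 0 and let the compiler drop the
 * branch at compile time. */
struct rc_params { int b_stat_read; };

#ifdef EMBEDDED_FIXED_CONFIG
#define STAT_READ(p) 0                  /* branch folds away entirely */
#else
#define STAT_READ(p) ((p)->b_stat_read) /* normal run-time test */
#endif

static int ratecontrol_init(const struct rc_params *p)
{
    if (STAT_READ(p))
        return 1;   /* stands in for the stat-file reading path */
    return 0;
}
```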

P.S.: one improvement in grayscale performance over my previous email 
was to substitute the chroma planes with an unchanging 0x7F value and 
keep chroma = 0 in x264_macroblock_encode_internal().  That keeps the 
I-frame cycle counts low (as in the chroma-defeated scenario) while 
P-frame execution time stays consistently lower (as in the normal 
chroma scenario).
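In sketch form, that substitution is just two memset() calls over the 
chroma planes before each encode (hypothetical helper; on the real 
target the plane pointers come from the x264_picture_t, plane[1] and 
plane[2] for I420):

```c
#include <string.h>
#include <stdint.h>

/* Sketch: force an I420 picture to grayscale by writing the neutral
 * chroma value 0x7F into both chroma planes.  For QCIF (176x144) each
 * chroma plane is 88x72 bytes. */
static void force_grayscale(uint8_t *plane_u, uint8_t *plane_v,
                            int width, int height)
{
    size_t chroma_bytes = (size_t)(width / 2) * (size_t)(height / 2);
    memset(plane_u, 0x7F, chroma_bytes);
    memset(plane_v, 0x7F, chroma_bytes);
}
```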

On 9/22/2014 5:56 AM, Leon Woestenberg wrote:
> Peter,
>
> interesting benchmark, do you have fps benchmarks also taking memory
> accessing into account?
>
> Did you need to modify Makefiles / compiler settings much, and if so,
> are you able to share these modifications?
>
> Regards,
>
> Leon
>
> On Thu, Sep 4, 2014 at 5:12 AM, Peter Lawrence
> <majbthrd at gmail.com> wrote:
>
>     I wanted to share some possibly interesting x264 benchmark numbers
>     for a processor that I couldn't find any previous mention of on the
>     mailing list.
>
>     I've been porting x264 to run on an ARM Cortex-M4 (ARMv7E-M
>     instruction set) to determine how viable it would be for low
>     resolution, low frame rate applications.  The following figures
>     might be of interest to someone contemplating the same notion.
>
>     The Cortex-M4 does have some DSP instructions that are nowhere near
>     as useful as NEON is for x264, but might offer some small increase
>     in performance.  (For example, dual 16-bit multiplies and additions
>     per instruction.)  However, these tests are just with the generic C
>     code without using any processor-specific optimizations (other than
>     the one for endian_swap32).
>
>     Source material is the "carphone" video sequence at QCIF (176 x 144)
>     in I420 format.
>
>     x264 presets are "ultrafast" plus "zerolatency"; profile is "baseline".
>
>     I wanted to see if defeating the chroma encoding would save an
>     appreciable amount of CPU time.  To my surprise, it actually seems
>     to make it worse.
>
>     Given these parameters:
>
>     param.rc.i_bitrate = 80
>     param.rc.i_vbv_max_bitrate = 80
>     param.rc.i_vbv_buffer_size = 100
>     param.rc.i_rc_method = X264_RC_ABR
>
>     Here is how many million processor clock cycles were consumed for
>     each of the first ten frames of the "carphone" video sequence:
>
>     chroma ostensibly defeated (and analyse.b_chroma_me = 0)
>     22 27 26 30 30 27 30 29 31 31
>
>     default mode
>     24 18 20 21 22 22 26 22 27 25
>
>     (For default mode with analyse.b_chroma_me = 0, the numbers are
>     essentially those shown for "default mode".)
>
>     That disabling chroma made things worse made no sense to me.  Then,
>     I excitedly thought perhaps disabling it freed enough bits to cause
>     the rate control to reduce the quantization and thus decrease the
>     number of zeros in the DCT (perhaps increasing the complexity for
>     the VLC encoder), so I tried another test in constant quality mode:
>
>     param.rc.i_qp_constant = 37
>     param.rc.i_rc_method = X264_RC_CQP
>
>     Here is how many million processor clock cycles were consumed for
>     each of the first ten frames of the "carphone" video sequence:
>
>     chroma ostensibly defeated (and analyse.b_chroma_me = 0)
>     18 22 23 23 23 22 22 22 23 24
>
>     default mode
>     20 16 13 16 15 14 17 16 18 19
>
>     So, under this (flawed?) test setup, it seems worse to defeat chroma
>     in both rate control scenarios.
>
>     The output bitstream with chroma ostensibly defeated is marginally
>     smaller (9238 bytes versus 9262 bytes) in the constant quality mode.
>     (Comparing file sizes in a rate control mode seemed pointless, since
>     the rate control should rightfully use any spare bits.)  Moreover,
>     the output supposedly without chroma decodes in VLC (VideoLAN
>     Client) as grayscale images (versus color for the normal mode).
>
>     Surely my method of defeating chroma must be flawed.  I did it by
>     forcing "chroma" to zero inside x264_macroblock_encode_internal(),
>     which *looked* like it would disable computing the chroma DCTs and
>     generation of the associated VLC coding.
>
>     Specifics of the test environment are:
>
>     Target is a STM32F429 Discovery Kit (180MHz STM32F429 with 8MBytes
>     of SDRAM).
>
>     The code is running on "bare-metal" (e.g. not some Linux "embedded"
>     application).
>
>     I needed practically all of the 8MBytes of SDRAM to hold all the
>     x264 data structures.  The STM32F429 has a sizable 2MBytes of flash,
>     so I used the upper 1MByte to store 27 frames of the "carphone"
>     video sequence and the lower half for code storage.  (The video
>     frames were memcpy-ed from flash to the x264_picture_t structure
>     prior to each x264_encoder_encode() call, rather than accessing them
>     directly in place.)
>
>     Code was compiled using Clang v3.4.1 (-O1 optimization) in the
>     Rowley Crossworks for ARM development environment.
>
>     _______________________________________________
>     x264-devel mailing list
>     x264-devel at videolan.org
>     https://mailman.videolan.org/listinfo/x264-devel
>
>
>
>
> --
> Leon

