[x264-devel] [PATCH 11/24] arm: Implement neon 8x16c intra predict functions

Mon Aug 24 22:41:46 CEST 2015

On 2015-08-24 23:24:07 +0300, Martin Storsjö wrote:
> On Wed, 19 Aug 2015, Janne Grunau wrote:
> 
> >On 2015-08-13 23:59:32 +0300, Martin Storsjö wrote:
> >>This implements the same functions as are implemented for 8x8c
> >>and as for 8x16c on aarch64.
> >>
> >>Some of the simpler ones actually turn out to be slower than the
> >>plain C version, at least on some CPUs.
> >
> >See 'arm64: optimize various intra_predict asm functions'
> >(<1439822360-17282-1-git-send-email-janne-x264 at jannau.net>)
> >
> >That makes all intra_predict functions at least as fast as the C version
> >on a cortex-a53 in arm64 mode.
> >
> >>checkasm timing       Cortex-A7      A8     A9
> >>intra_predict_8x16c_dc_c     1347    910    1017
> >>intra_predict_8x16c_dc_neon  1271    1366   1247
> >>intra_predict_8x16c_dcl_c    859     677    692
> >>intra_predict_8x16c_dcl_neon 1006    1209   1065
> >>intra_predict_8x16c_dct_c    871     540    590
> >>intra_predict_8x16c_dct_neon 672     511    657
> >>intra_predict_8x16c_h_c      937     712    719
> >>intra_predict_8x16c_h_neon   722     682    672
> >>intra_predict_8x16c_p_c      10184   9967   8652
> >>intra_predict_8x16c_p_neon   2617    1973   1983
> >>intra_predict_8x16c_v_c      610     380    429
> >>intra_predict_8x16c_v_neon   570     513    507
> >>---
> >> common/arm/predict-a.S |  158 ++++++++++++++++++++++++++++++++++++++++++++++++
> >> common/arm/predict-c.c |   15 +++++
> >> common/arm/predict.h   |    8 +++
> >> common/predict.c       |    4 ++
> >> 4 files changed, 185 insertions(+)
> >>
> >>diff --git a/common/arm/predict-a.S b/common/arm/predict-a.S
> >>index 7e5d9d3..228fd2e 100644
> >>--- a/common/arm/predict-a.S
> >>+++ b/common/arm/predict-a.S
> >>@@ -5,6 +5,7 @@
> >>  *
> >>  * Authors: David Conrad <lessen42 at gmail.com>
> >>  *          Mans Rullgard <mans at mansr.com>
> >>+ *          Martin Storsjo <martin at martin.st>
> >>  *
> >>  * This program is free software; you can redistribute it and/or modify
> >>  * it under the terms of the GNU General Public License as published by
> >>@@ -552,6 +553,163 @@ function x264_predict_8x8c_p_neon
> >> endfunc
> >>
> >>
> >>+function x264_predict_8x16c_dc_top_neon
> >>+    sub         r2,  r0,  #FDEC_STRIDE
> >>+    mov         r1,  #FDEC_STRIDE
> >>+    vld1.8      {d0}, [r2,:64]
> >>+    vpaddl.u8   d0,  d0
> >>+    vpadd.u16   d0,  d0,  d0
> >>+    vrshrn.u16  d0,  q0,  #2
> >>+    vdup.8      d1,  d0[1]
> >>+    vdup.8      d0,  d0[0]
> >>+    vtrn.32     d0,  d1
> >
> >vmov d1, d0
> 
> Hmm, I'm not quite sure what you're suggesting here. Should I replace the
> vtrn with vmov d1, d0, or use it instead of the vmov q1, q0 below?
> (After this vtrn, d1 and d0 should be identical, right?)

Sorry, I can't make sense of that suggestion either. Disregard it:
either I was suggesting to replace the vdup/vtrn with vmov, which is
clearly wrong, or it is a leftover.
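
For reference, a rough scalar C model of the quoted sequence (vpaddl/vpadd,
vrshrn #2, the two vdup and the vtrn.32) suggests that d0 and d1 do end up
identical after the vtrn. This is only a sketch for reasoning about the
lanes, not part of the patch; the function name dc_top_row_model and the
byte-array representation of the d registers are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of the NEON sequence in the patch.
 * top[8] stands for the 8 pixels loaded by vld1.8 {d0}, [r2,:64];
 * d0[8] and d1[8] stand for the byte lanes of the two d registers. */
static void dc_top_row_model(const uint8_t top[8], uint8_t d0[8], uint8_t d1[8])
{
    uint16_t pair[4];
    uint16_t s0, s1;
    uint8_t dc0, dc1, tmp[4];
    int i;

    /* vpaddl.u8: widen and add adjacent bytes -> four 16-bit sums */
    for (i = 0; i < 4; i++)
        pair[i] = (uint16_t)(top[2 * i] + top[2 * i + 1]);

    /* vpadd.u16: add adjacent 16-bit lanes -> sum of left 4, sum of right 4 */
    s0 = (uint16_t)(pair[0] + pair[1]);
    s1 = (uint16_t)(pair[2] + pair[3]);

    /* vrshrn.u16 #2: rounding shift right by 2, narrowed to 8 bits */
    dc0 = (uint8_t)((s0 + 2) >> 2);
    dc1 = (uint8_t)((s1 + 2) >> 2);

    /* vdup.8 d1, d0[1] and vdup.8 d0, d0[0] */
    memset(d1, dc1, 8);
    memset(d0, dc0, 8);

    /* vtrn.32 d0, d1: swap the upper 32 bits of d0 with the lower 32 of d1 */
    memcpy(tmp, d0 + 4, 4);
    memcpy(d0 + 4, d1, 4);
    memcpy(d1, tmp, 4);
}
```

In this model both registers hold dc0 in bytes 0-3 and dc1 in bytes 4-7,
which matches the reading that the vdup/vtrn pair cannot simply be replaced
by a vmov.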

> >
> >>+    vmov        q1,  q0
> >>+    b           pred8x16_dc_end
> >
> >Since we need every cycle, it probably makes sense to avoid the branch
> >and the vmov.
> 
> So copypaste (or macroize) pred8x16_dc_end here, and make it use the
> same input register for all writes to avoid the vmov q1, q0?

yes
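
In scalar terms, the idea could look roughly like the sketch below: every
one of the 16 rows is written from the same 8-byte value, so no second copy
of the row (the vmov q1, q0) is needed and the store loop can live in the
function itself. This is only an illustration of the data flow, not the asm
to use; store_8x16_from_one_row is a made-up name, and FDEC_STRIDE is
assumed to be 32 here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption for this sketch: stride of the decoded-frame scratch buffer. */
#define FDEC_STRIDE 32

/* Hypothetical helper: fill an 8x16 chroma block from one 8-byte row.
 * All 16 stores read the same source, mirroring the suggestion to use
 * one input register for every write. */
static void store_8x16_from_one_row(uint8_t *dst, const uint8_t row[8])
{
    int y;
    for (y = 0; y < 16; y++)
        memcpy(dst + y * FDEC_STRIDE, row, 8);
}
```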

Janne