<div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><pre><br>At 2016-04-20 19:14:39, radhakrishnan@multicorewareinc.com wrote:
># HG changeset patch
># User Radhakrishnan VR <radhakrishnan@multicorewareinc.com>
># Date 1460718872 -19800
># Fri Apr 15 16:44:32 2016 +0530
># Node ID 534b8e2845b8156010b3c79bfa88c81c7b0b9295
># Parent c1bee15b165dd29e524501ba969973f24ea29007
>arm: Implement dequant_normal ARM NEON
>
>diff -r c1bee15b165d -r 534b8e2845b8 source/common/arm/asm-primitives.cpp
>--- a/source/common/arm/asm-primitives.cpp Fri Apr 15 11:39:39 2016 +0530
>+++ b/source/common/arm/asm-primitives.cpp Fri Apr 15 16:44:32 2016 +0530
>@@ -45,6 +45,8 @@
> {
> // dequant_scaling
> p.dequant_scaling = PFX(dequant_scaling_neon);
>+ p.dequant_normal = PFX(dequant_normal_neon);
>+
> // luma satd
> p.pu[LUMA_4x4].satd = PFX(pixel_satd_4x4_neon);
> p.pu[LUMA_4x8].satd = PFX(pixel_satd_4x8_neon);
>diff -r c1bee15b165d -r 534b8e2845b8 source/common/arm/pixel-util.S
>--- a/source/common/arm/pixel-util.S Fri Apr 15 11:39:39 2016 +0530
>+++ b/source/common/arm/pixel-util.S Fri Apr 15 16:44:32 2016 +0530
>@@ -2027,3 +2027,46 @@
> pop {r4, r5, r6, r7}
> bx lr
> endfunc
>+
>+// void dequant_normal_c(const int16_t* quantCoef, int16_t* coef, int num, int scale, int shift)
>+function x265_dequant_normal_neon
>+ push {r4, r5, r6}
>+ ldr r4, [sp, #12] // shift
>+#if HIGH_BIT_DEPTH
>+ cmp r3, #32767
>+ jle .skip
ARM supports conditional execution of instructions, so the branch may not be needed here.</pre><pre>
>+ shr r3, (BIT_DEPTH - 8)
>+ sub r4, (BIT_DEPTH - 8)
>+.skip:
>+#endif
>+ mov r12, #1
>+ sub r5, r4, #1
>+ lsr r2, #3 // num / 8
>+ lsl r5, r12, r5 // 1 << shift - 1
>+
>+ neg r6, r4
>+ vdup.32 q0, r3
>+ vdup.32 q1, r6
>+ vdup.32 q2, r5
>+
>+dqn_loop1:
>+ vld1.16 {q3}, [r0]!
>+ vmovl.s16 q8, d6
>+ vmovl.s16 q9, d7
>+
>+ vmul.s32 q8, q0
>+ vmul.s32 q9, q0
>+ vadd.s32 q8, q2
>+ vadd.s32 q9, q2
<pre style="line-height: 23.8px;">vmovl + vmul + vadd can be replaced by a single vmlal
</pre><div><br></div>
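To make the suggestion concrete, here is a rough per-lane model in C of the two instruction sequences. It is only a sketch: the function names are made up for illustration, and it assumes scale fits in int16_t, since vmlal.s16 takes 16-bit multiplicands. The patch's own "cmp r3, #32767" suggests that may not hold in the HIGH_BIT_DEPTH path.

```c
#include <assert.h>
#include <stdint.h>

/* Lane model of the patch: vmovl.s16, then vmul.s32, then vadd.s32.
 * Unsigned arithmetic is used so the result wraps modulo 2^32 like a NEON lane. */
static int32_t widen_mul_add(int16_t x, int32_t scale, int32_t add)
{
    uint32_t wide = (uint32_t)(int32_t)x;                 /* vmovl.s16 */
    return (int32_t)(wide * (uint32_t)scale + (uint32_t)add);
}

/* Lane model of vmlal.s16: acc += (int32_t)x * (int32_t)scale16.
 * The accumulator would be preloaded with the rounding constant. */
static int32_t mlal_s16(int32_t acc, int16_t x, int16_t scale16)
{
    int32_t product = (int32_t)x * (int32_t)scale16;      /* always fits in 32 bits */
    return (int32_t)((uint32_t)acc + (uint32_t)product);
}

/* Compare both models over every 16-bit input for one scale/add pair. */
static void check_all_inputs(int16_t scale16, int32_t add)
{
    for (int x = INT16_MIN; x <= INT16_MAX; x++)
        assert(widen_mul_add((int16_t)x, scale16, add) ==
               mlal_s16(add, (int16_t)x, scale16));
}
```

In the model, vmlal needs its accumulator reloaded with the rounding constant on every iteration, so the saving depends on how that reload is scheduled.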
>+
>+ vshl.s32 q8, q1
>+ vshl.s32 q9, q1
The ARM assembly reference does not say that a negative shift count may be used; I only see the following note:</pre><pre>imm --- "1 to (size(datatype) - 1) for VSHL, VQSHL, or VQSHLU"
</pre><pre>>+ vqmovn.s32 d16, q8
>+ vqmovn.s32 d17, q9
The shift and narrowing above could be combined and replaced by vqshrun.
>+
>+ subs r2, #1
>+ vst1.16 {q8}, [r1]!
>+ bne dqn_loop1
>+ pop {r4, r5, r6}
>+ bx lr
>+endfunc
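
For reference while reviewing, this is a scalar sketch in C of what the NEON loop above appears to compute on the non-HIGH_BIT_DEPTH path. It is derived from the assembly in this patch, not copied from the project's C primitive, and the names dequant_normal_model and saturate_s16 are hypothetical. It relies on the register form of VSHL treating a negative count as a truncating right shift, and it assumes the C compiler shifts negative values arithmetically (implementation-defined in C).

```c
#include <assert.h>
#include <stdint.h>

/* Saturate a 32-bit value to int16_t, as vqmovn.s32 does per lane. */
static int16_t saturate_s16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Hypothetical scalar model of the loop in x265_dequant_normal_neon. */
static void dequant_normal_model(const int16_t* quantCoef, int16_t* coef,
                                 int num, int scale, int shift)
{
    /* The assembly computes 1 << (shift - 1), so shift must be at least 1. */
    assert(shift >= 1 && shift < 32);
    const int32_t add = (int32_t)1 << (shift - 1);

    for (int i = 0; i < num; i++)
    {
        /* vmovl.s16 + vmul.s32 + vadd.s32, wrapping modulo 2^32 like a lane */
        uint32_t wide = (uint32_t)(int32_t)quantCoef[i];
        int32_t v = (int32_t)(wide * (uint32_t)scale + (uint32_t)add);

        /* vshl.s32 with a count of -shift: arithmetic right shift by shift */
        v >>= shift;

        /* vqmovn.s32: saturating narrow to 16 bits */
        coef[i] = saturate_s16(v);
    }
}
```

Unlike the assembly, which processes num / 8 vectors, the model handles any num, so the two only agree when num is a multiple of 8.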
</pre></div>