[x265-commits] [x265] testbench(quant): the Round value must be less than (2 ^ ...

Tue Sep 9 18:19:26 CEST 2014

details:   http://hg.videolan.org/x265/rev/53e0969c605f
branches:  
changeset: 8011:53e0969c605f
user:      Min Chen <chenm003 at 163.com>
date:      Mon Sep 08 19:38:41 2014 -0700
description:
testbench(quant): the Round value must be less than (2 ^ qbits)
Subject: [x265] testbench(quant): the qBits value must be more than or equal to 8

details:   http://hg.videolan.org/x265/rev/5dbf9e8f4028
branches:  
changeset: 8012:5dbf9e8f4028
user:      Min Chen <chenm003 at 163.com>
date:      Mon Sep 08 19:38:56 2014 -0700
description:
testbench(quant): the qBits value must be more than or equal to 8
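
The two testbench changesets above, together with the X265_CHECK lines added
to quant_c() in the diff below, amount to three preconditions on the quant
primitive. A minimal sketch of them as one predicate follows. The helper name
quantArgsValid, the upper bound on qBits and the rejection of a negative add
are assumptions made for illustration; they are not taken from the x265
sources.

```cpp
#include <cstdint>

// Hypothetical helper (not part of x265): true when the arguments satisfy
// the constraints named in the commit messages and in the added checks.
inline bool quantArgsValid(int qBits, int add, int numCoeff)
{
    // qBits >= 8 keeps the derived shift count (qBits - 8) non-negative.
    // The upper bound of 30 is an assumption so that 1 << qBits fits in int.
    if (qBits < 8 || qBits > 30)
        return false;

    // The rounding offset must be less than 2 ^ qBits.
    // Rejecting negative values is an assumption of this sketch.
    if (add < 0 || add >= (1 << qBits))
        return false;

    // The diff below checks that numCoeff is a multiple of 16.
    return numCoeff > 0 && (numCoeff % 16) == 0;
}
```

A test harness could call such a predicate before handing randomly generated
parameters to both the C and the assembly versions.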
Subject: [x265] asm: improve quant by replacing variable shift with fixed shift, 19k cycles -> 16.6k cycles

details:   http://hg.videolan.org/x265/rev/277c1e05c247
branches:  
changeset: 8013:277c1e05c247
user:      Min Chen <chenm003 at 163.com>
date:      Mon Sep 08 19:39:14 2014 -0700
description:
asm: improve quant by replacing variable shift with fixed shift, 19k cycles -> 16.6k cycles
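
The rewritten SSE4 loop in the diff below appears to compute deltaU as
(tmpLevel >> (qBits - 8)) - (level << 8) instead of
(tmpLevel - (level << qBits)) >> (qBits - 8). Both forms give the same result
when qBits >= 8, because level << qBits is an exact multiple of
2 ^ (qBits - 8). The sketch below checks that identity in scalar C++. All
function names are hypothetical, and it assumes that right-shifting a negative
int is an arithmetic shift, which holds on the usual x86 compilers.

```cpp
#include <cstdint>
#include <cstdlib>

// Original form: subtract the re-scaled level, then shift by (qBits - 8).
inline int32_t deltaVariableShift(int32_t tmpLevel, int32_t level, int qBits)
{
    return (tmpLevel - level * (1 << qBits)) >> (qBits - 8);
}

// Rearranged form: shift tmpLevel first, then subtract level * 2^8.
// The multiply by 256 corresponds to a shift by the constant 8.
inline int32_t deltaFixedShift(int32_t tmpLevel, int32_t level, int qBits)
{
    return (tmpLevel >> (qBits - 8)) - level * 256;
}

// Hypothetical check: quantize one coefficient and compare both forms.
inline bool deltaFormsAgree(int32_t coef, int32_t quantCoeff, int qBits, int add)
{
    int32_t tmpLevel = std::abs(coef) * quantCoeff;
    int32_t level = (tmpLevel + add) >> qBits;
    return deltaVariableShift(tmpLevel, level, qBits) ==
           deltaFixedShift(tmpLevel, level, qBits);
}
```

The rearrangement removes one shift whose count lives in a register, which is
consistent with the cycle counts quoted in the commit message.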
Subject: [x265] asm: avx2 version of quant, improve 16.6k cycles -> 8.4k cycles

details:   http://hg.videolan.org/x265/rev/c4fb044c901b
branches:  
changeset: 8014:c4fb044c901b
user:      Min Chen <chenm003 at 163.com>
date:      Mon Sep 08 19:39:34 2014 -0700
description:
asm: avx2 version of quant, improve 16.6k cycles -> 8.4k cycles
Subject: [x265] search: remove warning from MS compiler

details:   http://hg.videolan.org/x265/rev/44cb33846e0e
branches:  
changeset: 8015:44cb33846e0e
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Tue Sep 09 10:39:52 2014 +0530
description:
search: remove warning from MS compiler
Subject: [x265] frameencoder: use x265_emms() prior to double QP clipping for VBV

details:   http://hg.videolan.org/x265/rev/a414ca1c9067
branches:  
changeset: 8016:a414ca1c9067
user:      Steve Borho <steve at borho.org>
date:      Tue Sep 09 14:24:37 2014 +0200
description:
frameencoder: use x265_emms() prior to double QP clipping for VBV
Subject: [x265] frameencoder: use simple shifts to scale 2-pass CU type counters

details:   http://hg.videolan.org/x265/rev/ebd5a0cac758
branches:  
changeset: 8017:ebd5a0cac758
user:      Steve Borho <steve at borho.org>
date:      Tue Sep 09 14:23:20 2014 +0200
description:
frameencoder: use simple shifts to scale 2-pass CU type counters

The CU type counters are summed at the end and turned into percentages, so
the base unit does not matter, only that each depth carries 4x the weight
of depth+1.
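
A minimal sketch of that weighting, assuming the weight is expressed in units
of the smallest CU. The names cuWeight and percentOf are hypothetical, and the
actual frameencoder.cpp code is not shown in this truncated diff.

```cpp
#include <cstdint>

// Weight of one CU at `depth`, in units of a CU at `maxDepth`.
// Each step towards depth 0 quadruples the area, hence a shift by 2.
inline uint64_t cuWeight(uint32_t depth, uint32_t maxDepth)
{
    return uint64_t(1) << (2 * (maxDepth - depth));
}

// Share of `part` in `total`, as a percentage.
inline double percentOf(uint64_t part, uint64_t total)
{
    return total ? 100.0 * double(part) / double(total) : 0.0;
}
```

Multiplying every counter by the same constant leaves percentOf() unchanged,
which is why any base unit works as long as the 4x ratio between depths is
kept.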
Subject: [x265] copy_cnt_4: enable fast non zero coefficient count path

details:   http://hg.videolan.org/x265/rev/0dc2cbc36ee5
branches:  
changeset: 8018:0dc2cbc36ee5
user:      Praveen Tiwari
date:      Tue Sep 09 11:07:59 2014 +0530
description:
copy_cnt_4: enable fast non zero coefficient count path
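
The fast path enabled here counts non-zero coefficients with a byte mask and
POPCNT. In the AVX2 hunk below the visible sequence is packsswb, pcmpeqb
against zero, pmovmskb, not, popcnt. The scalar sketch that follows mirrors
those steps for a 4x4 block. The function name countNonZero4x4 is
hypothetical, and the code illustrates the idea rather than translating the
assembly.

```cpp
#include <bitset>
#include <cstdint>

// Hypothetical scalar model of the mask-and-popcount path for 16 coefficients.
inline uint32_t countNonZero4x4(const int16_t* coeff)
{
    uint32_t zeroMask = 0;
    for (int i = 0; i < 16; i++)
    {
        // packsswb saturates each word to a byte, so a non-zero word stays
        // non-zero; pcmpeqb + pmovmskb then set one bit per zero byte.
        if (coeff[i] == 0)
            zeroMask |= 1u << i;
    }

    // "not ax" turns the zero mask into a non-zero mask (16 bits only).
    uint16_t nonZeroMask = static_cast<uint16_t>(~zeroMask);

    // "popcnt ax, ax" counts the set bits.
    return static_cast<uint32_t>(std::bitset<16>(nonZeroMask).count());
}
```

The saturating pack matters: plain truncation to bytes would turn a
coefficient such as 256 into zero and undercount.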
Subject: [x265] copy_cnt_4: combine mova and paddb to reduce code size, same speedup

details:   http://hg.videolan.org/x265/rev/5edcbcbb338f
branches:  
changeset: 8019:5edcbcbb338f
user:      Praveen Tiwari
date:      Tue Sep 09 11:36:58 2014 +0530
description:
copy_cnt_4: combine mova and paddb to reduce code size, same speedup
Subject: [x265] copy_cnt_4: faster AVX2 code

details:   http://hg.videolan.org/x265/rev/f7f8206a70bd
branches:  
changeset: 8020:f7f8206a70bd
user:      Praveen Tiwari
date:      Tue Sep 09 14:07:14 2014 +0530
description:
copy_cnt_4: faster AVX2 code
Subject: [x265] copy_cnt_8 AVX2 asm code, as per new interface

details:   http://hg.videolan.org/x265/rev/331ef5121676
branches:  
changeset: 8021:331ef5121676
user:      Praveen Tiwari
date:      Tue Sep 09 17:53:09 2014 +0530
description:
copy_cnt_8 AVX2 asm code, as per new interface
Subject: [x265] search: return distortion from xIntraCodingLumaBlk, do not pass by reference

details:   http://hg.videolan.org/x265/rev/a7f4f750e9d4
branches:  
changeset: 8022:a7f4f750e9d4
user:      Steve Borho <steve at borho.org>
date:      Tue Sep 09 16:33:54 2014 +0200
description:
search: return distortion from xIntraCodingLumaBlk, do not pass by reference
Subject: [x265] search: return distortion from xRecurIntraChromaCodingQT, do not pass by ref

details:   http://hg.videolan.org/x265/rev/b0a018562d29
branches:  
changeset: 8023:b0a018562d29
user:      Steve Borho <steve at borho.org>
date:      Tue Sep 09 16:39:34 2014 +0200
description:
search: return distortion from xRecurIntraChromaCodingQT, do not pass by ref
Subject: [x265] search: return distortion from xIntraCodingChromaBlk, do not pass by ref

details:   http://hg.videolan.org/x265/rev/62f6924be843
branches:  
changeset: 8024:62f6924be843
user:      Steve Borho <steve at borho.org>
date:      Tue Sep 09 16:47:11 2014 +0200
description:
search: return distortion from xIntraCodingChromaBlk, do not pass by ref
Subject: [x265] search: return distortion from xRecurIntraCodingQT

details:   http://hg.videolan.org/x265/rev/68ac5ca5d676
branches:  
changeset: 8025:68ac5ca5d676
user:      Steve Borho <steve at borho.org>
date:      Tue Sep 09 17:00:02 2014 +0200
description:
search: return distortion from xRecurIntraCodingQT
Subject: [x265] search: pass depthRange uniformly as uint32_t depthRange[2]

details:   http://hg.videolan.org/x265/rev/cead9fe7ff30
branches:  
changeset: 8026:cead9fe7ff30
user:      Steve Borho <steve at borho.org>
date:      Tue Sep 09 17:13:51 2014 +0200
description:
search: pass depthRange uniformly as uint32_t depthRange[2]

Effectively the same as uint32_t*, but compilers and debuggers can often do more
with the length info. Plus it makes the code more readable.
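
A short sketch of the language rule behind this: a function parameter declared
as uint32_t depthRange[2] is adjusted to a pointer, so the two declarations
name the same function type and the [2] only documents intent. The function
depthSpan is a hypothetical stand-in, not one of the search.cpp routines.

```cpp
#include <cstdint>
#include <type_traits>

// The array bound in a parameter is dropped by the compiler:
// both spellings declare the same function type.
static_assert(std::is_same<void(uint32_t[2]), void(uint32_t*)>::value,
              "array parameters are adjusted to pointers");

// Hypothetical stand-in for a routine that receives the range.
// The [2] tells the reader that exactly two elements are expected.
inline uint32_t depthSpan(uint32_t depthRange[2])
{
    return depthRange[1] - depthRange[0];
}
```

Because the bound is not enforced, callers still have to pass an array of at
least two elements.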
Subject: [x265] search: return distortion from xEstimateResidualQT

details:   http://hg.videolan.org/x265/rev/7d8e4935c1ca
branches:  
changeset: 8027:7d8e4935c1ca
user:      Steve Borho <steve at borho.org>
date:      Tue Sep 09 17:19:45 2014 +0200
description:
search: return distortion from xEstimateResidualQT
Subject: [x265] search: don't pass a zeroDistortion pointer if you don't want the answer

details:   http://hg.videolan.org/x265/rev/84b1d287333f
branches:  
changeset: 8028:84b1d287333f
user:      Steve Borho <steve at borho.org>
date:      Tue Sep 09 17:22:46 2014 +0200
description:
search: don't pass a zeroDistortion pointer if you don't want the answer
Subject: [x265] search: fix camel case of residualQTIntraChroma

details:   http://hg.videolan.org/x265/rev/d85792b9f373
branches:  
changeset: 8029:d85792b9f373
user:      Steve Borho <steve at borho.org>
date:      Tue Sep 09 17:24:33 2014 +0200
description:
search: fix camel case of residualQTIntraChroma
Subject: [x265] analysis: modified compressInterCU_rd0_4() with CU-specific information

details:   http://hg.videolan.org/x265/rev/2d9eb8cebb71
branches:  
changeset: 8030:2d9eb8cebb71
user:      Ashok Kumar Mishra <ashok at multicorewareinc.com>
date:      Tue Sep 09 20:02:39 2014 +0530
description:
analysis: modified compressInterCU_rd0_4() with CU-specific information

diffstat:

 source/common/dct.cpp                |    2 +
 source/common/x86/asm-primitives.cpp |    2 +
 source/common/x86/blockcopy8.asm     |  127 ++++++------------
 source/common/x86/const-a.asm        |    2 +-
 source/common/x86/pixel-util.h       |    1 +
 source/common/x86/pixel-util8.asm    |  228 ++++++++++++++++++++++++++++------
 source/encoder/analysis.cpp          |   43 ++---
 source/encoder/analysis.h            |    4 +-
 source/encoder/frameencoder.cpp      |   13 +-
 source/encoder/search.cpp            |  153 ++++++++++-------------
 source/encoder/search.h              |   42 +++---
 source/test/mbdstharness.cpp         |    4 +-
 12 files changed, 351 insertions(+), 270 deletions(-)

diffs (truncated from 1249 to 300 lines):

diff -r b5f81a839403 -r 2d9eb8cebb71 source/common/dct.cpp

--- a/source/common/dct.cpp	Mon Sep 08 22:40:00 2014 +0200
+++ b/source/common/dct.cpp	Tue Sep 09 20:02:39 2014 +0530
@@ -772,6 +772,8 @@ void dequant_scaling_c(const int16_t* qu
 
 uint32_t quant_c(int32_t* coef, int32_t* quantCoeff, int32_t* deltaU, int16_t* qCoef, int qBits, int add, int numCoeff)
 {
+    X265_CHECK(qBits >= 8, "qBits less than 8\n");
+    X265_CHECK((numCoeff % 16) == 0, "numCoeff must be multiple of 16\n");
     int qBits8 = qBits - 8;
     uint32_t numSig = 0;
 
diff -r b5f81a839403 -r 2d9eb8cebb71 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp	Mon Sep 08 22:40:00 2014 +0200
+++ b/source/common/x86/asm-primitives.cpp	Tue Sep 09 20:02:39 2014 +0530
@@ -1441,6 +1441,7 @@ void Setup_Assembly_Primitives(EncoderPr
     if (cpuMask & X265_CPU_AVX2)
     {
         p.dct[DCT_4x4] = x265_dct4_avx2;
+        p.quant = x265_quant_avx2;
         p.nquant = x265_nquant_avx2;
         p.dequant_normal = x265_dequant_normal_avx2;
     }
@@ -1739,6 +1740,7 @@ void Setup_Assembly_Primitives(EncoderPr
         p.denoiseDct = x265_denoise_dct_avx2;
 
         p.dct[DCT_4x4] = x265_dct4_avx2;
+        p.quant = x265_quant_avx2;
         p.nquant = x265_nquant_avx2;
         p.dequant_normal = x265_dequant_normal_avx2;
     }
diff -r b5f81a839403 -r 2d9eb8cebb71 source/common/x86/blockcopy8.asm
--- a/source/common/x86/blockcopy8.asm	Mon Sep 08 22:40:00 2014 +0200
+++ b/source/common/x86/blockcopy8.asm	Tue Sep 09 20:02:39 2014 +0530
@@ -3973,13 +3973,12 @@ cglobal copy_cnt_4, 3,3,3
 
     ; get count
     ; CHECK_ME: Intel documents said POPCNT is SSE4.2 instruction, but just implement after Nehalem
-%if 0
+%if 1
     pmovmskb    eax, m0
     not         ax
     popcnt      ax, ax
 %else
-    mova        m1, [pb_1]
-    paddb       m0, m1
+    paddb       m0, [pb_1]
     psadbw      m0, m2
     pshufd      m1, m0, 2
     paddw       m0, m1
@@ -3991,7 +3990,7 @@ cglobal copy_cnt_4, 3,3,3
 INIT_YMM avx2
 cglobal copy_cnt_4, 3,3,3
     add         r2d, r2d
-    xorpd       xm2, xm2
+    xorpd       m2,  m2
 
     ; row 0 & 1
     movq        xm0, [r1]
@@ -4005,11 +4004,9 @@ cglobal copy_cnt_4, 3,3,3
     vinserti128 m0, m0, xm1, 1
     movu    [r0], m0
 
-    vextractf128 xm1, m0, 1
-    packsswb     xm0, xm1
-    pcmpeqb      xm0, xm2
-
     ; get count
+    packsswb    xm0, xm1
+    pcmpeqb     xm0, xm2
     pmovmskb    eax, xm0
     not         ax
     popcnt      ax, ax
@@ -4079,85 +4076,49 @@ cglobal copy_cnt_8, 3,3,6
 
 
 INIT_YMM avx2
-%if ARCH_X86_64 == 1
-cglobal copy_cnt_8, 3,4,6
-  %define tmpd eax
-%else
-cglobal copy_cnt_8, 3,5,6
-  %define tmpd r4d
-%endif
+cglobal copy_cnt_8, 3,3,6
     add         r2d, r2d
-    pxor        m4, m4
-    lea         r3, [r2 * 3]
-
-    ; row 0
+    xorpd       m5, m5
+
+    ; row 0 - 1
     movu        xm0, [r1]
-    mova        xm2, xm0
-    pmovsxwd    m1, xm0
-    movu        [r0 + 0 * mmsize], m1
-
-    ; row 1
-    movu        xm0, [r1 + r2]
-    vinserti128 m2, m2, xm0, 1
-    pmovsxwd    m1, xm0
-    movu        [r0 + 1 * mmsize], m1
-
-    ; row 2
-    movu        xm0, [r1 + r2 * 2]
-    mova        xm5, xm0
-    pmovsxwd    m1, xm0
-    movu        [r0 + 2 * mmsize], m1
-
-    ; row 3
-    movu        xm0, [r1 + r3]
-    vinserti128 m5, m5, xm0, 1
-    packsswb    m2, m5
-    pcmpeqb     m2, m4
-    pmovmskb    tmpd, m2
-    not         tmpd
-    popcnt      tmpd, tmpd
-    pmovsxwd    m1, xm0
-    movu        [r0 + 3 * mmsize], m1
-
-    add         r0, 4 * mmsize
-    lea         r1, [r1 + r2 * 4]
-
-    ; row 4
-    movu        xm0, [r1]
-    mova        xm2, xm0
-    pmovsxwd    m1, xm0
-    movu        [r0 + 0 * mmsize], m1
-
-    ; row 5
-    movu        xm0, [r1 + r2]
-    vinserti128 m2, m2, xm0, 1
-    pmovsxwd    m1, xm0
-    movu        [r0 + 1 * mmsize], m1
-
-    ; row 6
-    movu        xm0, [r1 + r2 * 2]
-    mova        xm5, xm0
-    pmovsxwd    m1, xm0
-    movu        [r0 + 2 * mmsize], m1
-
-    ; row 7
-    movu        xm0, [r1 + r3]
-    pmovsxwd    m1, xm0
-    movu        [r0 + 3 * mmsize], m1
-    vinserti128 m5, m5, xm0, 1
+    movu        xm1, [r1 + r2]
+    vinserti128 m0, m0, xm1, 1
+    movu        [r0], m0
+
+    ; row 2 - 3
+    movu        xm1, [r1 + r2 * 2]
+    lea         r1,  [r1 + r2 * 2]
+    movu        xm2, [r1 + r2]
+    vinserti128 m1, m1, xm2, 1
+    movu        [r0 + 32], m1
+
+    ; row 4 - 5
+    movu        xm2, [r1 + r2 * 2]
+    lea         r1,  [r1 + r2 * 2]
+    movu        xm3, [r1 + r2]
+    vinserti128 m2, m2, xm3, 1
+    movu        [r0 + 64], m2
+
+    ; row 6 - 7
+    movu        xm3, [r1 + r2 * 2]
+    lea         r1,  [r1 + r2 * 2]
+    movu        xm4, [r1 + r2]
+    vinserti128 m3, m3, xm4, 1
+    movu        [r0 + 96], m3
 
     ; get count
-    packsswb    m2, m5
-    pcmpeqb     m2, m4
-    pmovmskb    r0d, m2
-    not         r0d
-    popcnt      r0d, r0d
-
-%if ARCH_X86_64 == 1
-    add         tmpd, r0d
-%else
-    add         r0d, tmpd
-%endif
+    vpacksswb    m0, m1
+    vpcmpeqb     m0, m5
+    vpmovmskb    eax, m0
+    not          eax
+    popcnt       eax, eax
+    vpacksswb    m2, m3
+    vpcmpeqb     m2, m5
+    vpmovmskb    r1d, m2
+    not          r1d
+    popcnt       r1d, r1d
+    add          eax, r1d
     RET
 
 
diff -r b5f81a839403 -r 2d9eb8cebb71 source/common/x86/const-a.asm
--- a/source/common/x86/const-a.asm	Mon Sep 08 22:40:00 2014 +0200
+++ b/source/common/x86/const-a.asm	Tue Sep 09 20:02:39 2014 +0530
@@ -76,7 +76,7 @@ const pw_ppppmmmm, dw 1,1,1,1,-1,-1,-1,-
 const pw_ppmmppmm, dw 1,1,-1,-1,1,1,-1,-1
 const pw_pmpmpmpm, dw 1,-1,1,-1,1,-1,1,-1
 const pw_pmmpzzzz, dw 1,-1,-1,1,0,0,0,0
-const pd_1,        times 4 dd 1
+const pd_1,        times 8 dd 1
 const pd_2,        times 4 dd 2
 const pd_4,        times 4 dd 4
 const pd_8,        times 4 dd 8
diff -r b5f81a839403 -r 2d9eb8cebb71 source/common/x86/pixel-util.h
--- a/source/common/x86/pixel-util.h	Mon Sep 08 22:40:00 2014 +0200
+++ b/source/common/x86/pixel-util.h	Tue Sep 09 20:02:39 2014 +0530
@@ -45,6 +45,7 @@ void x265_transpose32_sse2(pixel *dest, 
 void x265_transpose64_sse2(pixel *dest, pixel *src, intptr_t stride);
 
 uint32_t x265_quant_sse4(int32_t *coef, int32_t *quantCoeff, int32_t *deltaU, int16_t *qCoef, int qBits, int add, int numCoeff);
+uint32_t x265_quant_avx2(int32_t *coef, int32_t *quantCoeff, int32_t *deltaU, int16_t *qCoef, int qBits, int add, int numCoeff);
 uint32_t x265_nquant_sse4(int32_t *coef, int32_t *quantCoeff, int16_t *qCoef, int qBits, int add, int numCoeff);
 uint32_t x265_nquant_avx2(int32_t *coef, int32_t *quantCoeff, int16_t *qCoef, int qBits, int add, int numCoeff);
 void x265_dequant_normal_sse4(const int16_t* quantCoef, int32_t* coef, int num, int scale, int shift);
diff -r b5f81a839403 -r 2d9eb8cebb71 source/common/x86/pixel-util8.asm
--- a/source/common/x86/pixel-util8.asm	Mon Sep 08 22:40:00 2014 +0200
+++ b/source/common/x86/pixel-util8.asm	Tue Sep 09 20:02:39 2014 +0530
@@ -54,6 +54,7 @@ cextern pw_1
 cextern pw_00ff
 cextern pw_2000
 cextern pw_pixel_max
+cextern pd_1
 cextern pd_32767
 cextern pd_n32768
 
@@ -861,7 +862,6 @@ cglobal getResidual32, 4,5,7
 ;-----------------------------------------------------------------------------
 INIT_XMM sse4
 cglobal quant, 5,6,8
-
     ; fill qbits
     movd        m4, r4d         ; m4 = qbits
 
@@ -873,52 +873,45 @@ cglobal quant, 5,6,8
     movd        m5, r5m
     pshufd      m5, m5, 0       ; m5 = add
 
+    lea         r5, [pd_1]
+
     mov         r4d, r6m
     shr         r4d, 3
     pxor        m7, m7          ; m7 = numZero
 .loop:
     ; 4 coeff
     movu        m0, [r0]        ; m0 = level
-    pxor        m1, m1
-    pcmpgtd     m1, m0          ; m1 = sign
-    movu        m2, [r1]        ; m2 = qcoeff
-    pabsd       m0, m0
-    pmulld      m0, m2          ; m0 = tmpLevel1
-    paddd       m2, m0, m5
+    pabsd       m1, m0
+    pmulld      m1, [r1]        ; m0 = tmpLevel1
+    paddd       m2, m1, m5
     psrad       m2, m4          ; m2 = level1
-    pslld       m3, m2, m4
-    psubd       m0, m3
-    psrad       m0, m6          ; m0 = deltaU1
-    movu        [r2], m0
-    pxor        m0, m0
-    pcmpeqd     m0, m2          ; m0 = mask4
-    psubd       m7, m0
-
-    pxor        m2, m1
-    psubd       m2, m1
-    packssdw    m2, m2
-    movh        [r3], m2
+
+    pslld       m3, m2, 8
+    psrad       m1, m6
+    psubd       m1, m3          ; m1 = deltaU1
+
+    movu        [r2], m1
+    psignd      m3, m2, m0
+    pminud      m2, [r5]
+    paddd       m7, m2
+    packssdw    m3, m3
+    movh        [r3], m3
+
     ; 4 coeff
     movu        m0, [r0 + 16]   ; m0 = level
-    pxor        m1, m1
-    pcmpgtd     m1, m0          ; m1 = sign
-    movu        m2, [r1 + 16]   ; m2 = qcoeff
-    pabsd       m0, m0
-    pmulld      m0, m2          ; m0 = tmpLevel1
-    paddd       m2, m0, m5
+    pabsd       m1, m0
+    pmulld      m1, [r1 + 16]   ; m0 = tmpLevel1
+    paddd       m2, m1, m5
     psrad       m2, m4          ; m2 = level1
-    pslld       m3, m2, m4
-    psubd       m0, m3