[x265-commits] [x265] asm: declare asm function pointers for sad_64xN partitions
Dnyaneshwar Gorade
dnyaneshwar at multicorewareinc.com
Wed Oct 30 20:38:28 CET 2013
details: http://hg.videolan.org/x265/rev/9f9b2f8d293a
branches:
changeset: 4754:9f9b2f8d293a
user: Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date: Wed Oct 30 12:54:18 2013 +0530
description:
asm: declare asm function pointers for sad_64xN partitions
Subject: [x265] chroma interp_4tap_vert_pp all blocks asm code
details: http://hg.videolan.org/x265/rev/74bf8634037c
branches:
changeset: 4755:74bf8634037c
user: Praveen Tiwari
date: Wed Oct 30 13:44:16 2013 +0530
description:
chroma interp_4tap_vert_pp all blocks asm code
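For reference, the 4-tap vertical chroma filter these asm routines cover can be sketched in C as below; the signature, coefficient precision (taps summing to 64, shift of 6), and the 8-bit clamp are assumptions for illustration, not the exact primitive in ipfilter8.asm.

#include <algorithm>
#include <cstdint>

typedef uint8_t pixel;  // 8-bit path; HIGH_BIT_DEPTH builds use a wider type

static void interp_4tap_vert_pp_ref(const pixel *src, intptr_t srcStride,
                                    pixel *dst, intptr_t dstStride,
                                    int width, int height, const int16_t coeff[4])
{
    const int shift = 6;                  // assumed filter precision (taps sum to 64)
    const int offset = 1 << (shift - 1);  // rounding offset

    src -= srcStride;  // the four taps span rows -1 .. +2 around each output row

    for (int row = 0; row < height; row++)
    {
        for (int col = 0; col < width; col++)
        {
            int sum = src[col] * coeff[0]
                + src[col + 1 * srcStride] * coeff[1]
                + src[col + 2 * srcStride] * coeff[2]
                + src[col + 3 * srcStride] * coeff[3];
            dst[col] = (pixel)std::min(255, std::max(0, (sum + offset) >> shift));
        }

        src += srcStride;
        dst += dstStride;
    }
}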
Subject: [x265] no-rdo: use bit estimates from ME to calculate RDcost.
details: http://hg.videolan.org/x265/rev/77db80a67f4e
branches:
changeset: 4756:77db80a67f4e
user: Deepthi Devaki <deepthidevaki at multicorewareinc.com>
date: Wed Oct 30 15:16:59 2013 +0530
description:
no-rdo: use bit estimates from ME to calculate RDcost.
Bits estimated during ME are stored in the CU and used, together with distortion, to calculate the RD cost. This results in a better bitrate with no-rdo, with a small drop in PSNR.
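The cost the no-rdo path forms from these stored ME bits is the usual Lagrangian; a minimal sketch, with the function name, integer rounding, and lambda type assumed rather than taken from the patch:

#include <cstdint>

static inline uint64_t calcNoRdoCost(uint32_t distortion, uint32_t bits, double lambda)
{
    // cost = D + lambda * R, reusing the bit count produced during motion estimation
    return distortion + (uint64_t)(lambda * bits + 0.5);
}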
Subject: [x265] asm: modified common macro for pixel_sad_64xN
details: http://hg.videolan.org/x265/rev/e9340727231d
branches:
changeset: 4757:e9340727231d
user: Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date: Wed Oct 30 12:57:17 2013 +0530
description:
asm: modified common macro for pixel_sad_64xN
Subject: [x265] asm: assembly code for pixel_sad_64x16
details: http://hg.videolan.org/x265/rev/4414f3394a61
branches:
changeset: 4758:4414f3394a61
user: Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date: Wed Oct 30 13:25:12 2013 +0530
description:
asm: assembly code for pixel_sad_64x16
Subject: [x265] asm: assembly code for pixel_sad_64x32
details: http://hg.videolan.org/x265/rev/42ad273b1d4f
branches:
changeset: 4759:42ad273b1d4f
user: Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date: Wed Oct 30 13:45:38 2013 +0530
description:
asm: assembly code for pixel_sad_64x32
Subject: [x265] asm: assembly code for pixel_sad_64x48 and pixel_sad_64x64
details: http://hg.videolan.org/x265/rev/700b46a1a0cf
branches:
changeset: 4760:700b46a1a0cf
user: Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date: Wed Oct 30 14:11:41 2013 +0530
description:
asm: assembly code for pixel_sad_64x48 and pixel_sad_64x64
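All of these routines return the sum of absolute differences for a 64-wide block; a minimal C reference of the value the asm is expected to produce (the template style mirrors pixel.cpp, but this exact helper is illustrative):

#include <cstdint>
#include <cstdlib>

typedef uint8_t pixel;

template<int lx, int ly>  // instanced as <64, 16>, <64, 32>, <64, 48>, <64, 64>
int sad_ref(const pixel *fenc, intptr_t fencstride, const pixel *fref, intptr_t frefstride)
{
    int sum = 0;

    for (int y = 0; y < ly; y++)
    {
        for (int x = 0; x < lx; x++)
            sum += abs(fenc[x] - fref[x]);

        fenc += fencstride;
        fref += frefstride;
    }

    return sum;
}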
Subject: [x265] asm: filterConvertPelToShort
details: http://hg.videolan.org/x265/rev/1a51e6cb0e0c
branches:
changeset: 4761:1a51e6cb0e0c
user: Min Chen <chenm003 at 163.com>
date: Wed Oct 30 22:47:44 2013 +0800
description:
asm: filterConvertPelToShort
Subject: [x265] asm: assembly code for pixel_sad_48x64
details: http://hg.videolan.org/x265/rev/78db76b7abec
branches:
changeset: 4762:78db76b7abec
user: Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date: Wed Oct 30 15:29:59 2013 +0530
description:
asm: assembly code for pixel_sad_48x64
Subject: [x265] asm: assembly code for pixel_sad_24x32
details: http://hg.videolan.org/x265/rev/ed5d877b8452
branches:
changeset: 4763:ed5d877b8452
user: Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date: Wed Oct 30 18:11:01 2013 +0530
description:
asm: assembly code for pixel_sad_24x32
Subject: [x265] asm: assembly code for pixel_sad_12x16
details: http://hg.videolan.org/x265/rev/8ee637b11d17
branches:
changeset: 4764:8ee637b11d17
user: Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date: Wed Oct 30 19:46:15 2013 +0530
description:
asm: assembly code for pixel_sad_12x16
Subject: [x265] assembly code for pixel_sad_x3_24x32
details: http://hg.videolan.org/x265/rev/de91fbc95b4a
branches:
changeset: 4765:de91fbc95b4a
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Wed Oct 30 14:37:25 2013 +0530
description:
assembly code for pixel_sad_x3_24x32
Subject: [x265] assembly code for pixel_sad_x4_24x32
details: http://hg.videolan.org/x265/rev/f021f06f3b80
branches:
changeset: 4766:f021f06f3b80
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Wed Oct 30 15:48:55 2013 +0530
description:
assembly code for pixel_sad_x4_24x32
Subject: [x265] assembly code for pixel_sad_x3_32xN
details: http://hg.videolan.org/x265/rev/e371719c4c47
branches:
changeset: 4767:e371719c4c47
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Wed Oct 30 18:41:40 2013 +0530
description:
assembly code for pixel_sad_x3_32xN
Subject: [x265] assembly code for pixel_sad_x4_32xN
details: http://hg.videolan.org/x265/rev/c3cf2c42e854
branches:
changeset: 4768:c3cf2c42e854
user: Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date: Wed Oct 30 18:57:53 2013 +0530
description:
assembly code for pixel_sad_x4_32xN
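The x3 and x4 variants score one encode block against three or four reference candidates in a single call, so the asm can reuse the fenc loads across comparisons. A sketch of the x3 semantics, assuming the fixed FENC_STRIDE of 64 used elsewhere in x265:

#include <cstdint>
#include <cstdlib>

typedef uint8_t pixel;
enum { FENC_STRIDE = 64 };  // assumed encode-buffer stride

template<int lx, int ly>
void sad_x3_ref(const pixel *fenc, const pixel *fref0, const pixel *fref1,
                const pixel *fref2, intptr_t frefstride, int32_t *res)
{
    res[0] = res[1] = res[2] = 0;

    for (int y = 0; y < ly; y++)
    {
        for (int x = 0; x < lx; x++)
        {
            res[0] += abs(fenc[x] - fref0[x]);
            res[1] += abs(fenc[x] - fref1[x]);
            res[2] += abs(fenc[x] - fref2[x]);
        }

        fenc += FENC_STRIDE;
        fref0 += frefstride;
        fref1 += frefstride;
        fref2 += frefstride;
    }
}

The x4 variant is identical apart from a fourth reference pointer and res[3].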
Subject: [x265] pixel: remove 24 and 32 width sad intrinsic functions
details: http://hg.videolan.org/x265/rev/eccfe236169b
branches:
changeset: 4769:eccfe236169b
user: Steve Borho <steve at borho.org>
date: Wed Oct 30 13:10:52 2013 -0500
description:
pixel: remove 24 and 32 width sad intrinsic functions
These are now covered by assembly. Only 12, 48, and 64 remain because they
still lack x3 and x4 versions.
Subject: [x265] pixel: remove sad_12, sad_48, and sad_64
details: http://hg.videolan.org/x265/rev/645899ddda59
branches:
changeset: 4770:645899ddda59
user: Steve Borho <steve at borho.org>
date: Wed Oct 30 13:14:06 2013 -0500
description:
pixel: remove sad_12, sad_48, and sad_64
All single sads have asm coverage
Subject: [x265] added test code for blockcopy_pp function
details: http://hg.videolan.org/x265/rev/e8e84b67cf8f
branches:
changeset: 4771:e8e84b67cf8f
user: Praveen Tiwari
date: Wed Oct 30 20:30:17 2013 +0530
description:
added test code for blockcopy_pp function
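A harness check for a copy primitive typically fills a source buffer with random pixels, runs the C reference and the optimized candidate, and compares the outputs; a rough sketch, with buffer sizes, iteration count, and the helper name chosen for illustration rather than copied from pixelharness.cpp:

#include <cstdint>
#include <cstdlib>
#include <cstring>

typedef uint8_t pixel;
typedef void (*copy_pp_t)(pixel *dst, intptr_t dstride, pixel *src, intptr_t sstride);

static bool check_copy_pp(copy_pp_t ref, copy_pp_t opt)
{
    pixel src[64 * 64], refDst[64 * 64], optDst[64 * 64];

    for (int i = 0; i < 10; i++)
    {
        for (int j = 0; j < 64 * 64; j++)
            src[j] = (pixel)(rand() & 0xff);

        memset(refDst, 0, sizeof(refDst));
        memset(optDst, 0, sizeof(optDst));

        ref(refDst, 64, src, 64);  // C reference
        opt(optDst, 64, src, 64);  // asm / intrinsic under test

        if (memcmp(refDst, optDst, sizeof(refDst)))  // outputs must match byte-for-byte
            return false;
    }

    return true;
}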
Subject: [x265] added blockcopy_pp_c primitive according to modified argument list
details: http://hg.videolan.org/x265/rev/7f68debc632b
branches:
changeset: 4772:7f68debc632b
user: Praveen Tiwari
date: Wed Oct 30 20:23:37 2013 +0530
description:
added blockcopy_pp_c primitive according to modified argument list
diffstat:
source/Lib/TLibCommon/TComPrediction.cpp | 3 +
source/Lib/TLibEncoder/TEncSearch.cpp | 18 +-
source/Lib/TLibEncoder/TEncSearch.h | 2 +-
source/common/ipfilter.cpp | 19 +
source/common/pixel.cpp | 46 +
source/common/primitives.h | 7 +
source/common/vec/pixel-sse41.cpp | 965 +-----------------------
source/common/x86/asm-primitives.cpp | 37 +-
source/common/x86/ipfilter8.asm | 1226 ++++++++++++++++++++++++++++++
source/common/x86/ipfilter8.h | 1 +
source/common/x86/pixel.h | 8 +
source/common/x86/sad-a.asm | 1052 +++++++++++++++++++++++++-
source/encoder/compress.cpp | 6 +-
source/test/ipfilterharness.cpp | 51 +
source/test/ipfilterharness.h | 1 +
source/test/pixelharness.cpp | 56 +
source/test/pixelharness.h | 1 +
17 files changed, 2532 insertions(+), 967 deletions(-)
diffs (truncated from 3923 to 300 lines):
diff -r 65462024832b -r 7f68debc632b source/Lib/TLibCommon/TComPrediction.cpp
--- a/source/Lib/TLibCommon/TComPrediction.cpp Wed Oct 30 01:54:16 2013 -0500
+++ b/source/Lib/TLibCommon/TComPrediction.cpp Wed Oct 30 20:23:37 2013 +0530
@@ -516,6 +516,9 @@ void TComPrediction::xPredInterLumaBlk(T
int xFrac = mv->x & 0x3;
int yFrac = mv->y & 0x3;
+ assert((width % 4) + (height % 4) == 0);
+ assert(dstStride == MAX_CU_SIZE);
+
if ((yFrac | xFrac) == 0)
{
primitives.ipfilter_p2s(ref, refStride, dst, dstStride, width, height);
diff -r 65462024832b -r 7f68debc632b source/Lib/TLibEncoder/TEncSearch.cpp
--- a/source/Lib/TLibEncoder/TEncSearch.cpp Wed Oct 30 01:54:16 2013 -0500
+++ b/source/Lib/TLibEncoder/TEncSearch.cpp Wed Oct 30 20:23:37 2013 +0530
@@ -2115,7 +2115,7 @@ uint32_t TEncSearch::xGetInterPrediction
* \param bValid
* \returns void
*/
-void TEncSearch::xMergeEstimation(TComDataCU* cu, int puIdx, uint32_t& interDir, TComMvField* mvField, uint32_t& mergeIndex, uint32_t& outCost, TComMvField* mvFieldNeighbours, UChar* interDirNeighbours, int& numValidMergeCand)
+void TEncSearch::xMergeEstimation(TComDataCU* cu, int puIdx, uint32_t& interDir, TComMvField* mvField, uint32_t& mergeIndex, uint32_t& outCost, uint32_t& outbits, TComMvField* mvFieldNeighbours, UChar* interDirNeighbours, int& numValidMergeCand)
{
uint32_t absPartIdx = 0;
int width = 0;
@@ -2144,7 +2144,7 @@ void TEncSearch::xMergeEstimation(TComDa
{
uint32_t costCand = MAX_UINT;
uint32_t bitsCand = 0;
-
+
cu->getCUMvField(REF_PIC_LIST_0)->m_mv[absPartIdx] = mvFieldNeighbours[0 + 2 * mergeCand].mv;
cu->getCUMvField(REF_PIC_LIST_0)->m_refIdx[absPartIdx] = mvFieldNeighbours[0 + 2 * mergeCand].refIdx;
cu->getCUMvField(REF_PIC_LIST_1)->m_mv[absPartIdx] = mvFieldNeighbours[1 + 2 * mergeCand].mv;
@@ -2160,6 +2160,7 @@ void TEncSearch::xMergeEstimation(TComDa
if (costCand < outCost)
{
outCost = costCand;
+ outbits = bitsCand;
mvField[0] = mvFieldNeighbours[0 + 2 * mergeCand];
mvField[1] = mvFieldNeighbours[1 + 2 * mergeCand];
interDir = interDirNeighbours[mergeCand];
@@ -2226,6 +2227,8 @@ void TEncSearch::predInterSearch(TComDat
UChar interDirNeighbours[MRG_MAX_NUM_CANDS];
int numValidMergeCand = 0;
+ int totalmebits = 0;
+
for (int partIdx = 0; partIdx < numPart; partIdx++)
{
uint32_t listCost[2] = { MAX_UINT, MAX_UINT };
@@ -2495,7 +2498,8 @@ void TEncSearch::predInterSearch(TComDat
// find Merge result
uint32_t mrgCost = MAX_UINT;
- xMergeEstimation(cu, partIdx, mrgInterDir, mrgMvField, mrgIndex, mrgCost, mvFieldNeighbours, interDirNeighbours, numValidMergeCand);
+ uint32_t mrgBits = 0;
+ xMergeEstimation(cu, partIdx, mrgInterDir, mrgMvField, mrgIndex, mrgCost, mrgBits, mvFieldNeighbours, interDirNeighbours, numValidMergeCand);
if (mrgCost < meCost)
{
// set Merge result
@@ -2517,6 +2521,7 @@ void TEncSearch::predInterSearch(TComDat
#if CU_STAT_LOGFILE
meCost += mrgCost;
#endif
+ totalmebits += mrgBits;
}
else
{
@@ -2530,11 +2535,18 @@ void TEncSearch::predInterSearch(TComDat
#if CU_STAT_LOGFILE
meCost += meCost;
#endif
+ totalmebits += mebits;
}
}
+ else
+ {
+ totalmebits += mebits;
+ }
motionCompensation(cu, predYuv, REF_PIC_LIST_X, partIdx, bLuma, bChroma);
}
+ cu->m_totalBits = totalmebits;
+
setWpScalingDistParam(cu, -1, REF_PIC_LIST_X);
}
diff -r 65462024832b -r 7f68debc632b source/Lib/TLibEncoder/TEncSearch.h
--- a/source/Lib/TLibEncoder/TEncSearch.h Wed Oct 30 01:54:16 2013 -0500
+++ b/source/Lib/TLibEncoder/TEncSearch.h Wed Oct 30 20:23:37 2013 +0530
@@ -211,7 +211,7 @@ protected:
void xGetBlkBits(PartSize cuMode, bool bPSlice, int partIdx, uint32_t lastMode, uint32_t blockBit[3]);
void xMergeEstimation(TComDataCU* cu, int partIdx, uint32_t& uiInterDir,
- TComMvField* pacMvField, uint32_t& mergeIndex, uint32_t& outCost,
+ TComMvField* pacMvField, uint32_t& mergeIndex, uint32_t& outCost, uint32_t& outbits,
TComMvField* mvFieldNeighbors, UChar* interDirNeighbors, int& numValidMergeCand);
void xRestrictBipredMergeCand(TComDataCU* cu, uint32_t puIdx, TComMvField* mvFieldNeighbours,
diff -r 65462024832b -r 7f68debc632b source/common/ipfilter.cpp
--- a/source/common/ipfilter.cpp Wed Oct 30 01:54:16 2013 -0500
+++ b/source/common/ipfilter.cpp Wed Oct 30 20:23:37 2013 +0530
@@ -264,6 +264,24 @@ void filterConvertPelToShort_c(pixel *sr
}
}
+void filterConvertPelToShort_c(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height)
+{
+ int shift = IF_INTERNAL_PREC - X265_DEPTH;
+ int row, col;
+
+ for (row = 0; row < height; row++)
+ {
+ for (col = 0; col < width; col++)
+ {
+ int16_t val = src[col] << shift;
+ dst[col] = val - (int16_t)IF_INTERNAL_OFFS;
+ }
+
+ src += srcStride;
+ dst += MAX_CU_SIZE;
+ }
+}
+
template<int N>
void filterVertical_pp_c(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, int16_t const *c)
{
@@ -471,6 +489,7 @@ void Setup_C_IPFilterPrimitives(EncoderP
p.ipfilter_p2s = filterConvertPelToShort_c;
p.ipfilter_s2p = filterConvertShortToPel_c;
+ p.luma_p2s = filterConvertPelToShort_c;
p.extendRowBorder = extendCURowColBorder;
}
diff -r 65462024832b -r 7f68debc632b source/common/pixel.cpp
--- a/source/common/pixel.cpp Wed Oct 30 01:54:16 2013 -0500
+++ b/source/common/pixel.cpp Wed Oct 30 20:23:37 2013 +0530
@@ -758,6 +758,21 @@ void plane_copy_deinterleave_chroma(pixe
}
}
}
+
+template<int bx, int by>
+void blockcopy_pp_c(pixel *a, intptr_t stridea, pixel *b, intptr_t strideb)
+{
+ for (int y = 0; y < by; y++)
+ {
+ for (int x = 0; x < bx; x++)
+ {
+ a[x] = b[x];
+ }
+
+ a += stridea;
+ b += strideb;
+ }
+}
} // end anonymous namespace
namespace x265 {
@@ -798,6 +813,37 @@ void Setup_C_PixelPrimitives(EncoderPrim
p.satd[LUMA_64x16] = satd8<64, 16>;
p.satd[LUMA_16x64] = satd8<16, 64>;
+#define CHROMA(W, H) \
+ p.chroma_copy_pp[CHROMA_ ## W ## x ## H] = blockcopy_pp_c<W, H>
+#define LUMA(W, H) \
+ p.luma_copy_pp[LUMA_ ## W ## x ## H] = blockcopy_pp_c<W, H>
+
+ LUMA(4, 4);
+ LUMA(8, 8); CHROMA(4, 4);
+ LUMA(4, 8); CHROMA(2, 4);
+ LUMA(8, 4); CHROMA(4, 2);
+ LUMA(16, 16); CHROMA(8, 8);
+ LUMA(16, 8); CHROMA(8, 4);
+ LUMA( 8, 16); CHROMA(4, 8);
+ LUMA(16, 12); CHROMA(8, 6);
+ LUMA(12, 16); CHROMA(6, 8);
+ LUMA(16, 4); CHROMA(8, 2);
+ LUMA( 4, 16); CHROMA(2, 8);
+ LUMA(32, 32); CHROMA(16, 16);
+ LUMA(32, 16); CHROMA(16, 8);
+ LUMA(16, 32); CHROMA(8, 16);
+ LUMA(32, 24); CHROMA(16, 12);
+ LUMA(24, 32); CHROMA(12, 16);
+ LUMA(32, 8); CHROMA(16, 4);
+ LUMA( 8, 32); CHROMA(4, 16);
+ LUMA(64, 64); CHROMA(32, 32);
+ LUMA(64, 32); CHROMA(32, 16);
+ LUMA(32, 64); CHROMA(16, 32);
+ LUMA(64, 48); CHROMA(32, 24);
+ LUMA(48, 64); CHROMA(24, 32);
+ LUMA(64, 16); CHROMA(32, 8);
+ LUMA(16, 64); CHROMA(8, 32);
+
//sse
#if HIGH_BIT_DEPTH
SET_FUNC_PRIMITIVE_TABLE_C(sse_pp, sse, pixelcmp_t, int16_t, int16_t)
diff -r 65462024832b -r 7f68debc632b source/common/primitives.h
--- a/source/common/primitives.h Wed Oct 30 01:54:16 2013 -0500
+++ b/source/common/primitives.h Wed Oct 30 20:23:37 2013 +0530
@@ -210,6 +210,9 @@ typedef void (*plane_copy_deinterleave_t
typedef void (*filter_pp_t) (pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx);
typedef void (*filter_hv_pp_t) (pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int idxX, int idxY);
+typedef void (*filter_p2s_t)(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height);
+
+typedef void (*copy_pp_t)(pixel *dst, intptr_t dstride, pixel *src, intptr_t sstride); // dst is aligned
/* Define a structure containing function pointers to optimized encoder
* primitives. Each pointer can reference either an assembly routine,
@@ -235,6 +238,9 @@ struct EncoderPrimitives
cvt16to16_shl_t cvt16to16_shl;
cvt32to16_shr_t cvt32to16_shr;
+ copy_pp_t luma_copy_pp[NUM_LUMA_PARTITIONS];
+ copy_pp_t chroma_copy_pp[NUM_CHROMA_PARTITIONS];
+
ipfilter_pp_t ipfilter_pp[NUM_IPFILTER_P_P];
ipfilter_ps_t ipfilter_ps[NUM_IPFILTER_P_S];
ipfilter_sp_t ipfilter_sp[NUM_IPFILTER_S_P];
@@ -247,6 +253,7 @@ struct EncoderPrimitives
filter_pp_t chroma_vpp[NUM_CHROMA_PARTITIONS];
filter_pp_t luma_vpp[NUM_LUMA_PARTITIONS];
filter_hv_pp_t luma_hvpp[NUM_LUMA_PARTITIONS];
+ filter_p2s_t luma_p2s;
intra_dc_t intra_pred_dc;
intra_planar_t intra_pred_planar;
diff -r 65462024832b -r 7f68debc632b source/common/vec/pixel-sse41.cpp
--- a/source/common/vec/pixel-sse41.cpp Wed Oct 30 01:54:16 2013 -0500
+++ b/source/common/vec/pixel-sse41.cpp Wed Oct 30 20:23:37 2013 +0530
@@ -34,496 +34,6 @@ using namespace x265;
namespace {
#if !HIGH_BIT_DEPTH
template<int ly>
-// will only be instanced with ly == 16
-int sad_12(pixel *fenc, intptr_t fencstride, pixel *fref, intptr_t frefstride)
-{
- assert(ly == 16);
- __m128i sum0 = _mm_setzero_si128();
- __m128i sum1 = _mm_setzero_si128();
- __m128i T00, T01, T02, T03;
- __m128i T10, T11, T12, T13;
- __m128i T20, T21, T22, T23;
-
-#define MASK _mm_set_epi32(0x00000000, 0xffffffff, 0xffffffff, 0xffffffff)
-
-#define PROCESS_12x4(BASE) \
- T00 = _mm_load_si128((__m128i*)(fenc + (BASE + 0) * fencstride)); \
- T00 = _mm_and_si128(T00, MASK); \
- T01 = _mm_load_si128((__m128i*)(fenc + (BASE + 1) * fencstride)); \
- T01 = _mm_and_si128(T01, MASK); \
- T02 = _mm_load_si128((__m128i*)(fenc + (BASE + 2) * fencstride)); \
- T02 = _mm_and_si128(T02, MASK); \
- T03 = _mm_load_si128((__m128i*)(fenc + (BASE + 3) * fencstride)); \
- T03 = _mm_and_si128(T03, MASK); \
- T10 = _mm_loadu_si128((__m128i*)(fref + (BASE + 0) * frefstride)); \
- T10 = _mm_and_si128(T10, MASK); \
- T11 = _mm_loadu_si128((__m128i*)(fref + (BASE + 1) * frefstride)); \
- T11 = _mm_and_si128(T11, MASK); \
- T12 = _mm_loadu_si128((__m128i*)(fref + (BASE + 2) * frefstride)); \
- T12 = _mm_and_si128(T12, MASK); \
- T13 = _mm_loadu_si128((__m128i*)(fref + (BASE + 3) * frefstride)); \
- T13 = _mm_and_si128(T13, MASK); \
- T20 = _mm_sad_epu8(T00, T10); \
- T21 = _mm_sad_epu8(T01, T11); \
- T22 = _mm_sad_epu8(T02, T12); \
- T23 = _mm_sad_epu8(T03, T13); \
- sum0 = _mm_add_epi16(sum0, T20); \
- sum0 = _mm_add_epi16(sum0, T21); \
- sum0 = _mm_add_epi16(sum0, T22); \
- sum0 = _mm_add_epi16(sum0, T23)
-
- PROCESS_12x4(0);
- PROCESS_12x4(4);
- PROCESS_12x4(8);
- PROCESS_12x4(12);
-
- sum1 = _mm_shuffle_epi32(sum0, 2);
- sum0 = _mm_add_epi32(sum0, sum1);
-
- return _mm_cvtsi128_si32(sum0);
-}
-
-template<int ly>
-// always instanced for 32 rows
-int sad_24(pixel *fenc, intptr_t fencstride, pixel *fref, intptr_t frefstride)
-{
- __m128i sum0 = _mm_setzero_si128();
- __m128i sum1 = _mm_setzero_si128();
- __m128i T00, T01, T02, T03;
- __m128i T10, T11, T12, T13;
- __m128i T20, T21, T22, T23;
-
-#define PROCESS_24x4(BASE) \
- T00 = _mm_load_si128((__m128i*)(fenc + (BASE + 0) * fencstride)); \
- T01 = _mm_load_si128((__m128i*)(fenc + (BASE + 1) * fencstride)); \
- T02 = _mm_load_si128((__m128i*)(fenc + (BASE + 2) * fencstride)); \