[x265-commits] [x265] pixel: stop building 16x16, 16x8, and 8x16 intrinsic primitives
Steve Borho
steve at borho.org
Sun Oct 6 07:39:14 CEST 2013
details: http://hg.videolan.org/x265/rev/73f14d5ca8a9
branches:
changeset: 4228:73f14d5ca8a9
user: Steve Borho <steve at borho.org>
date: Sat Oct 05 19:46:23 2013 -0500
description:
pixel: stop building 16x16, 16x8, and 8x16 intrinsic primitives
Subject: [x265] asm: quit instantiating functions which are not necessary
details: http://hg.videolan.org/x265/rev/5c27d330da43
branches:
changeset: 4229:5c27d330da43
user: Steve Borho <steve at borho.org>
date: Sat Oct 05 20:35:45 2013 -0500
description:
asm: quit instantiating functions which are not necessary
Re-order functions for more clarity
Subject: [x265] pixelharness: report sad, sad_x3, and sad_x4 scores together
details: http://hg.videolan.org/x265/rev/4089b17f33ed
branches:
changeset: 4230:4089b17f33ed
user: Steve Borho <steve at borho.org>
date: Sat Oct 05 20:37:27 2013 -0500
description:
pixelharness: report sad, sad_x3, and sad_x4 scores together
Subject: [x265] primitives: move small block sa8d_inter setup to primitives.cpp
details: http://hg.videolan.org/x265/rev/83ae910874e3
branches:
changeset: 4231:83ae910874e3
user: Steve Borho <steve at borho.org>
date: Sat Oct 05 20:50:11 2013 -0500
description:
primitives: move small block sa8d_inter setup to primitives.cpp
This hack didn't belong in the assembly setup function
Subject: [x265] asm: use x265_pixel_satd_8x4_xop for p.satd[PARTITION_16x4] for 32 bit builds
details: http://hg.videolan.org/x265/rev/4f837e3ebd26
branches:
changeset: 4232:4f837e3ebd26
user: Steve Borho <steve at borho.org>
date: Sat Oct 05 21:17:12 2013 -0500
description:
asm: use x265_pixel_satd_8x4_xop for p.satd[PARTITION_16x4] for 32 bit builds
On 64bit builds, we have native sse2 functions
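For context, a 16x4 SATD score can be assembled from two 8x4 SATD calls, since SATD is additive over disjoint sub-blocks. The sketch below illustrates the idea only; the actual wrapper in asm-primitives.cpp is not part of the truncated diff, and the typedef and helper name here are assumptions.

    #include <cstdint>

    // satd_t mirrors the usual pixel-comparison signature; satd8x4 stands in
    // for x265_pixel_satd_8x4_xop.
    typedef int (*satd_t)(const uint8_t *pix1, intptr_t stride1,
                          const uint8_t *pix2, intptr_t stride2);

    // Hypothetical helper: cover a 16x4 partition with two side-by-side 8x4 calls.
    static int satd_16x4_from_8x4(satd_t satd8x4,
                                  const uint8_t *pix1, intptr_t stride1,
                                  const uint8_t *pix2, intptr_t stride2)
    {
        return satd8x4(pix1,     stride1, pix2,     stride2) +
               satd8x4(pix1 + 8, stride1, pix2 + 8, stride2);
    }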
Subject: [x265] primitives: setup square sa8d_inter function pointers from sa8d block pointers
details: http://hg.videolan.org/x265/rev/58bacc9ae3d1
branches:
changeset: 4233:58bacc9ae3d1
user: Steve Borho <steve at borho.org>
date: Sat Oct 05 21:18:08 2013 -0500
description:
primitives: setup square sa8d_inter function pointers from sa8d block pointers
Subject: [x265] primitives: fixup 12x16 and 16x12 sa8d_inter pointers
details: http://hg.videolan.org/x265/rev/884016c98502
branches:
changeset: 4234:884016c98502
user: Steve Borho <steve at borho.org>
date: Sat Oct 05 21:19:51 2013 -0500
description:
primitives: fixup 12x16 and 16x12 sa8d_inter pointers
32x12 isn't used but 12x16 and 16x12 are (for AMP)
Subject: [x265] primitives: fix off-by-one initialization of primitives
details: http://hg.videolan.org/x265/rev/6e46fabdef40
branches:
changeset: 4235:6e46fabdef40
user: Steve Borho <steve at borho.org>
date: Sat Oct 05 21:20:17 2013 -0500
description:
primitives: fix off-by-one initialization of primitives
Subject: [x265] pixel: add back intrinsics for sad_x3_4x16 and sad_x4_4x16
details: http://hg.videolan.org/x265/rev/2e8d7b261880
branches:
changeset: 4236:2e8d7b261880
user: Steve Borho <steve at borho.org>
date: Sat Oct 05 21:30:10 2013 -0500
description:
pixel: add back intrinsics for sad_x3_4x16 and sad_x4_4x16
These routines do not yet have assembly code
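As background for these primitives, sad_x3 scores one encode block against three reference candidates in a single call (sad_x4 against four). Below is a plain scalar sketch for the 4x16 shape; the fixed encode-buffer stride and the exact argument order are assumptions based on the usual x264/x265 convention, not taken from this patch.

    #include <cstdint>
    #include <cstdlib>

    static const intptr_t FENC_STRIDE_ASSUMED = 64; // assumed fixed encode-buffer stride

    // Scalar sketch of sad_x3 for a 4x16 block: three SADs computed in one pass.
    static void sad_x3_4x16_c(const uint8_t *fenc,
                              const uint8_t *fref0, const uint8_t *fref1,
                              const uint8_t *fref2, intptr_t frefstride,
                              int32_t *res)
    {
        res[0] = res[1] = res[2] = 0;
        for (int y = 0; y < 16; y++)
        {
            for (int x = 0; x < 4; x++)
            {
                res[0] += abs(fenc[x] - fref0[x]);
                res[1] += abs(fenc[x] - fref1[x]);
                res[2] += abs(fenc[x] - fref2[x]);
            }
            fenc  += FENC_STRIDE_ASSUMED;
            fref0 += frefstride;
            fref1 += frefstride;
            fref2 += frefstride;
        }
    }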
Subject: [x265] testbench: fix off-by-one initialization of primitives
details: http://hg.videolan.org/x265/rev/e352d1f1a7c6
branches:
changeset: 4237:e352d1f1a7c6
user: Steve Borho <steve at borho.org>
date: Sat Oct 05 21:45:17 2013 -0500
description:
testbench: fix off-by-one initialization of primitives
Subject: [x265] asm: simplify generation of sa8d_inter functions from 8x8 and 16x16 blocks
details: http://hg.videolan.org/x265/rev/276f98fe1c59
branches:
changeset: 4238:276f98fe1c59
user: Steve Borho <steve at borho.org>
date: Sat Oct 05 22:01:24 2013 -0500
description:
asm: simplify generation of sa8d_inter functions from 8x8 and 16x16 blocks
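The simplification relies on sa8d being additive over 8x8 tiles, so any rectangle whose sides are multiples of 8 can be scored by iterating a single 8x8 (or 16x16) block primitive. A minimal sketch of that tiling, with assumed names and the 8-bit pixel type:

    #include <cstdint>

    typedef uint8_t pixel; // uint16_t in HIGH_BIT_DEPTH builds

    typedef int (*sa8d_block_t)(const pixel *pix1, intptr_t stride1,
                                const pixel *pix2, intptr_t stride2);

    // Sketch: accumulate an lx-by-ly sa8d_inter cost by tiling with an 8x8
    // sa8d block primitive; lx and ly are assumed to be multiples of 8.
    static int sa8d_inter_tiled(sa8d_block_t sa8d_8x8, int lx, int ly,
                                const pixel *pix1, intptr_t stride1,
                                const pixel *pix2, intptr_t stride2)
    {
        int cost = 0;
        for (int y = 0; y < ly; y += 8)
            for (int x = 0; x < lx; x += 8)
                cost += sa8d_8x8(pix1 + y * stride1 + x, stride1,
                                 pix2 + y * stride2 + x, stride2);
        return cost;
    }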
Subject: [x265] asm: cleanup the assignment of SSD primitives
details: http://hg.videolan.org/x265/rev/dc74d9932a3f
branches:
changeset: 4239:dc74d9932a3f
user: Steve Borho <steve at borho.org>
date: Sat Oct 05 22:15:45 2013 -0500
description:
asm: cleanup the assignment of SSD primitives
Subject: [x265] pixel: drop SSE primitives that have assembly
details: http://hg.videolan.org/x265/rev/08b4bb1e5dbe
branches:
changeset: 4240:08b4bb1e5dbe
user: Steve Borho <steve at borho.org>
date: Sat Oct 05 22:24:57 2013 -0500
description:
pixel: drop SSE primitives that have assembly
Subject: [x265] asm: don't build wrappers for functions with intrinsic implementations
details: http://hg.videolan.org/x265/rev/da37cd44a77c
branches:
changeset: 4241:da37cd44a77c
user: Steve Borho <steve at borho.org>
date: Sat Oct 05 22:36:20 2013 -0500
description:
asm: don't build wrappers for functions with intrinsic implementations
Subject: [x265] pixel: add missing sse_pp_12x16, untemplatize others
details: http://hg.videolan.org/x265/rev/017aab1983dd
branches:
changeset: 4242:017aab1983dd
user: Steve Borho <steve at borho.org>
date: Sat Oct 05 22:41:58 2013 -0500
description:
pixel: add missing sse_pp_12x16, untemplatize others
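sse_pp is the pixel-versus-pixel sum of squared differences. A scalar reference for the newly covered 12x16 shape looks like the sketch below; the SSE4.1 code vectorizes the same computation, and the typedef here is an assumption for the 8-bit case.

    #include <cstdint>

    typedef uint8_t pixel; // uint16_t in HIGH_BIT_DEPTH builds

    // Scalar sketch of sse_pp for a 12x16 block.
    static int sse_pp_12x16_c(const pixel *pix1, intptr_t stride1,
                              const pixel *pix2, intptr_t stride2)
    {
        int sum = 0;
        for (int y = 0; y < 16; y++)
        {
            for (int x = 0; x < 12; x++)
            {
                int d = pix1[x] - pix2[x];
                sum += d * d;
            }
            pix1 += stride1;
            pix2 += stride2;
        }
        return sum;
    }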
Subject: [x265] pixel: fix HIGH_BIT_DEPTH builds
details: http://hg.videolan.org/x265/rev/bc3d1a8ebc89
branches:
changeset: 4243:bc3d1a8ebc89
user: Steve Borho <steve at borho.org>
date: Sat Oct 05 22:59:45 2013 -0500
description:
pixel: fix HIGH_BIT_DEPTH builds
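For readers following the build fix: HIGH_BIT_DEPTH widens the pixel type from 8 to 16 bits, so byte-oriented intrinsics such as _mm_sad_epu8 cannot be applied unchanged. The convention is along these lines (a sketch; the defining header is not part of this diff):

    #if HIGH_BIT_DEPTH
    typedef uint16_t pixel; // deeper-than-8-bit samples stored in 16-bit words
    #else
    typedef uint8_t  pixel; // 8-bit samples
    #endif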
Subject: [x265] pixel: simplify sad_16 to make it easier to maintain
details: http://hg.videolan.org/x265/rev/bf5852bbf75f
branches:
changeset: 4244:bf5852bbf75f
user: Steve Borho <steve at borho.org>
date: Sat Oct 05 23:30:38 2013 -0500
description:
pixel: simplify sad_16 to make it easier to maintain
Subject: [x265] pixel: simplify sad_x3_16 and sad_x4_16 to make them easier to maintain
details: http://hg.videolan.org/x265/rev/d27d01ffa4f0
branches:
changeset: 4245:d27d01ffa4f0
user: Steve Borho <steve at borho.org>
date: Sun Oct 06 00:07:09 2013 -0500
description:
pixel: simplify sad_x3_16 and sad_x4_16 to make them easier to maintain
Subject: [x265] asm: simplify setup of HEVC partitions for SATD primitives
details: http://hg.videolan.org/x265/rev/484d1d98710b
branches:
changeset: 4246:484d1d98710b
user: Steve Borho <steve at borho.org>
date: Sun Oct 06 00:36:00 2013 -0500
description:
asm: simplify setup of HEVC partitions for SATD primitives
Subject: [x265] pixel: fix eoln damage to pixel-avx2.cpp
details: http://hg.videolan.org/x265/rev/2190f2f036a1
branches:
changeset: 4247:2190f2f036a1
user: Steve Borho <steve at borho.org>
date: Sun Oct 06 00:36:24 2013 -0500
description:
pixel: fix eoln damage to pixel-avx2.cpp
diffstat:
source/common/primitives.cpp | 16 +-
source/common/vec/pixel-avx2.cpp | 28 +-
source/common/vec/pixel-sse41.cpp | 2898 ++++++++++++---------------------
source/common/x86/asm-primitives.cpp | 565 +-----
source/test/pixelharness.cpp | 24 +-
source/test/testbench.cpp | 2 +-
6 files changed, 1232 insertions(+), 2301 deletions(-)
diffs (truncated from 3813 to 300 lines):
diff -r 19b319c9a6aa -r 2190f2f036a1 source/common/primitives.cpp
--- a/source/common/primitives.cpp Sat Oct 05 19:43:50 2013 -0500
+++ b/source/common/primitives.cpp Sun Oct 06 00:36:24 2013 -0500
@@ -128,7 +128,7 @@ void x265_setup_primitives(x265_param_t
Setup_C_Primitives(primitives);
- for (int i = 2; i < cpuid; i++)
+ for (int i = 2; i <= cpuid; i++)
{
#if ENABLE_VECTOR_PRIMITIVES
Setup_Vector_Primitives(primitives, 1 << i);
@@ -138,6 +138,20 @@ void x265_setup_primitives(x265_param_t
#endif
}
+ primitives.sa8d_inter[PARTITION_8x8] = primitives.sa8d[BLOCK_8x8];
+ primitives.sa8d_inter[PARTITION_16x16] = primitives.sa8d[BLOCK_16x16];
+ primitives.sa8d_inter[PARTITION_32x32] = primitives.sa8d[BLOCK_32x32];
+ primitives.sa8d_inter[PARTITION_64x64] = primitives.sa8d[BLOCK_64x64];
+
+ // SA8D devolves to SATD for blocks not even multiples of 8x8
+ primitives.sa8d_inter[PARTITION_4x4] = primitives.satd[PARTITION_4x4];
+ primitives.sa8d_inter[PARTITION_4x8] = primitives.satd[PARTITION_4x8];
+ primitives.sa8d_inter[PARTITION_4x16] = primitives.satd[PARTITION_4x16];
+ primitives.sa8d_inter[PARTITION_8x4] = primitives.satd[PARTITION_8x4];
+ primitives.sa8d_inter[PARTITION_16x4] = primitives.satd[PARTITION_16x4];
+ primitives.sa8d_inter[PARTITION_16x12] = primitives.satd[PARTITION_16x12];
+ primitives.sa8d_inter[PARTITION_12x16] = primitives.satd[PARTITION_12x16];
+
#if ENABLE_VECTOR_PRIMITIVES
if (param->logLevel >= X265_LOG_INFO) fprintf(stderr, " intrinsic");
#endif
diff -r 19b319c9a6aa -r 2190f2f036a1 source/common/vec/pixel-avx2.cpp
--- a/source/common/vec/pixel-avx2.cpp Sat Oct 05 19:43:50 2013 -0500
+++ b/source/common/vec/pixel-avx2.cpp Sun Oct 06 00:36:24 2013 -0500
@@ -448,22 +448,22 @@ namespace x265 {
void Setup_Vec_PixelPrimitives_avx2(EncoderPrimitives &p)
{
p.sad[0] = p.sad[0];
-#define SET_SADS(W, H) \
- p.sad[PARTITION_##W##x##H] = sad_avx2_##W<H>; \
- p.sad_x3[PARTITION_##W##x##H] = sad_avx2_x3_##W<H>; \
- p.sad_x4[PARTITION_##W##x##H] = sad_avx2_x4_##W<H>; \
-
+#define SET_SADS(W, H) \
+ p.sad[PARTITION_##W##x##H] = sad_avx2_##W<H>; \
+ p.sad_x3[PARTITION_##W##x##H] = sad_avx2_x3_##W<H>; \
+ p.sad_x4[PARTITION_##W##x##H] = sad_avx2_x4_##W<H>; \
+
#if !HIGH_BIT_DEPTH
#if (defined(__GNUC__) || defined(__INTEL_COMPILER))
- SET_SADS(32, 8);
- SET_SADS(32, 16);
- SET_SADS(32, 24);
- SET_SADS(32, 32);
- SET_SADS(32, 64);
- SET_SADS(64, 16);
- SET_SADS(64, 32);
- SET_SADS(64, 48);
- SET_SADS(64, 64);
+ SET_SADS(32, 8);
+ SET_SADS(32, 16);
+ SET_SADS(32, 24);
+ SET_SADS(32, 32);
+ SET_SADS(32, 64);
+ SET_SADS(64, 16);
+ SET_SADS(64, 32);
+ SET_SADS(64, 48);
+ SET_SADS(64, 64);
#endif
#endif
}
diff -r 19b319c9a6aa -r 2190f2f036a1 source/common/vec/pixel-sse41.cpp
--- a/source/common/vec/pixel-sse41.cpp Sat Oct 05 19:43:50 2013 -0500
+++ b/source/common/vec/pixel-sse41.cpp Sun Oct 06 00:36:24 2013 -0500
@@ -334,228 +334,49 @@ int sad_12(pixel *fenc, intptr_t fencstr
template<int ly>
int sad_16(pixel * fenc, intptr_t fencstride, pixel * fref, intptr_t frefstride)
{
- assert((ly % 4) == 0);
-
__m128i sum0 = _mm_setzero_si128();
__m128i sum1 = _mm_setzero_si128();
__m128i T00, T01, T02, T03;
__m128i T10, T11, T12, T13;
__m128i T20, T21, T22, T23;
- if (ly == 4)
+#define PROCESS_16x4(BASE)\
+ T00 = _mm_load_si128((__m128i*)(fenc + (BASE + 0) * fencstride)); \
+ T01 = _mm_load_si128((__m128i*)(fenc + (BASE + 1) * fencstride)); \
+ T02 = _mm_load_si128((__m128i*)(fenc + (BASE + 2) * fencstride)); \
+ T03 = _mm_load_si128((__m128i*)(fenc + (BASE + 3) * fencstride)); \
+ T10 = _mm_loadu_si128((__m128i*)(fref + (BASE + 0) * frefstride)); \
+ T11 = _mm_loadu_si128((__m128i*)(fref + (BASE + 1) * frefstride)); \
+ T12 = _mm_loadu_si128((__m128i*)(fref + (BASE + 2) * frefstride)); \
+ T13 = _mm_loadu_si128((__m128i*)(fref + (BASE + 3) * frefstride)); \
+ T20 = _mm_sad_epu8(T00, T10); \
+ T21 = _mm_sad_epu8(T01, T11); \
+ T22 = _mm_sad_epu8(T02, T12); \
+ T23 = _mm_sad_epu8(T03, T13); \
+ sum0 = _mm_add_epi16(sum0, T20); \
+ sum0 = _mm_add_epi16(sum0, T21); \
+ sum0 = _mm_add_epi16(sum0, T22); \
+ sum0 = _mm_add_epi16(sum0, T23)
+
+ PROCESS_16x4(0);
+ if (ly >= 8)
{
- T00 = _mm_load_si128((__m128i*)(fenc + (0) * fencstride));
- T01 = _mm_load_si128((__m128i*)(fenc + (1) * fencstride));
- T02 = _mm_load_si128((__m128i*)(fenc + (2) * fencstride));
- T03 = _mm_load_si128((__m128i*)(fenc + (3) * fencstride));
-
- T10 = _mm_loadu_si128((__m128i*)(fref + (0) * frefstride));
- T11 = _mm_loadu_si128((__m128i*)(fref + (1) * frefstride));
- T12 = _mm_loadu_si128((__m128i*)(fref + (2) * frefstride));
- T13 = _mm_loadu_si128((__m128i*)(fref + (3) * frefstride));
-
- T20 = _mm_sad_epu8(T00, T10);
- T21 = _mm_sad_epu8(T01, T11);
- T22 = _mm_sad_epu8(T02, T12);
- T23 = _mm_sad_epu8(T03, T13);
-
- sum0 = _mm_add_epi16(sum0, T20);
- sum0 = _mm_add_epi16(sum0, T21);
- sum0 = _mm_add_epi16(sum0, T22);
- sum0 = _mm_add_epi16(sum0, T23);
+ PROCESS_16x4(4);
}
- else if (ly == 8)
+ if (ly >= 12)
{
- T00 = _mm_load_si128((__m128i*)(fenc + (0) * fencstride));
- T01 = _mm_load_si128((__m128i*)(fenc + (1) * fencstride));
- T02 = _mm_load_si128((__m128i*)(fenc + (2) * fencstride));
- T03 = _mm_load_si128((__m128i*)(fenc + (3) * fencstride));
-
- T10 = _mm_loadu_si128((__m128i*)(fref + (0) * frefstride));
- T11 = _mm_loadu_si128((__m128i*)(fref + (1) * frefstride));
- T12 = _mm_loadu_si128((__m128i*)(fref + (2) * frefstride));
- T13 = _mm_loadu_si128((__m128i*)(fref + (3) * frefstride));
-
- T20 = _mm_sad_epu8(T00, T10);
- T21 = _mm_sad_epu8(T01, T11);
- T22 = _mm_sad_epu8(T02, T12);
- T23 = _mm_sad_epu8(T03, T13);
-
- sum0 = _mm_add_epi16(sum0, T20);
- sum0 = _mm_add_epi16(sum0, T21);
- sum0 = _mm_add_epi16(sum0, T22);
- sum0 = _mm_add_epi16(sum0, T23);
-
- T00 = _mm_load_si128((__m128i*)(fenc + (4) * fencstride));
- T01 = _mm_load_si128((__m128i*)(fenc + (5) * fencstride));
- T02 = _mm_load_si128((__m128i*)(fenc + (6) * fencstride));
- T03 = _mm_load_si128((__m128i*)(fenc + (7) * fencstride));
-
- T10 = _mm_loadu_si128((__m128i*)(fref + (4) * frefstride));
- T11 = _mm_loadu_si128((__m128i*)(fref + (5) * frefstride));
- T12 = _mm_loadu_si128((__m128i*)(fref + (6) * frefstride));
- T13 = _mm_loadu_si128((__m128i*)(fref + (7) * frefstride));
-
- T20 = _mm_sad_epu8(T00, T10);
- T21 = _mm_sad_epu8(T01, T11);
- T22 = _mm_sad_epu8(T02, T12);
- T23 = _mm_sad_epu8(T03, T13);
-
- sum0 = _mm_add_epi16(sum0, T20);
- sum0 = _mm_add_epi16(sum0, T21);
- sum0 = _mm_add_epi16(sum0, T22);
- sum0 = _mm_add_epi16(sum0, T23);
+ PROCESS_16x4(8);
}
- else if (ly == 16)
+ if (ly >= 16)
{
- T00 = _mm_load_si128((__m128i*)(fenc + (0) * fencstride));
- T01 = _mm_load_si128((__m128i*)(fenc + (1) * fencstride));
- T02 = _mm_load_si128((__m128i*)(fenc + (2) * fencstride));
- T03 = _mm_load_si128((__m128i*)(fenc + (3) * fencstride));
-
- T10 = _mm_loadu_si128((__m128i*)(fref + (0) * frefstride));
- T11 = _mm_loadu_si128((__m128i*)(fref + (1) * frefstride));
- T12 = _mm_loadu_si128((__m128i*)(fref + (2) * frefstride));
- T13 = _mm_loadu_si128((__m128i*)(fref + (3) * frefstride));
-
- T20 = _mm_sad_epu8(T00, T10);
- T21 = _mm_sad_epu8(T01, T11);
- T22 = _mm_sad_epu8(T02, T12);
- T23 = _mm_sad_epu8(T03, T13);
-
- sum0 = _mm_add_epi16(sum0, T20);
- sum0 = _mm_add_epi16(sum0, T21);
- sum0 = _mm_add_epi16(sum0, T22);
- sum0 = _mm_add_epi16(sum0, T23);
-
- T00 = _mm_load_si128((__m128i*)(fenc + (4) * fencstride));
- T01 = _mm_load_si128((__m128i*)(fenc + (5) * fencstride));
- T02 = _mm_load_si128((__m128i*)(fenc + (6) * fencstride));
- T03 = _mm_load_si128((__m128i*)(fenc + (7) * fencstride));
-
- T10 = _mm_loadu_si128((__m128i*)(fref + (4) * frefstride));
- T11 = _mm_loadu_si128((__m128i*)(fref + (5) * frefstride));
- T12 = _mm_loadu_si128((__m128i*)(fref + (6) * frefstride));
- T13 = _mm_loadu_si128((__m128i*)(fref + (7) * frefstride));
-
- T20 = _mm_sad_epu8(T00, T10);
- T21 = _mm_sad_epu8(T01, T11);
- T22 = _mm_sad_epu8(T02, T12);
- T23 = _mm_sad_epu8(T03, T13);
-
- sum0 = _mm_add_epi16(sum0, T20);
- sum0 = _mm_add_epi16(sum0, T21);
- sum0 = _mm_add_epi16(sum0, T22);
- sum0 = _mm_add_epi16(sum0, T23);
-
- T00 = _mm_load_si128((__m128i*)(fenc + (8) * fencstride));
- T01 = _mm_load_si128((__m128i*)(fenc + (9) * fencstride));
- T02 = _mm_load_si128((__m128i*)(fenc + (10) * fencstride));
- T03 = _mm_load_si128((__m128i*)(fenc + (11) * fencstride));
-
- T10 = _mm_loadu_si128((__m128i*)(fref + (8) * frefstride));
- T11 = _mm_loadu_si128((__m128i*)(fref + (9) * frefstride));
- T12 = _mm_loadu_si128((__m128i*)(fref + (10) * frefstride));
- T13 = _mm_loadu_si128((__m128i*)(fref + (11) * frefstride));
-
- T20 = _mm_sad_epu8(T00, T10);
- T21 = _mm_sad_epu8(T01, T11);
- T22 = _mm_sad_epu8(T02, T12);
- T23 = _mm_sad_epu8(T03, T13);
-
- sum0 = _mm_add_epi16(sum0, T20);
- sum0 = _mm_add_epi16(sum0, T21);
- sum0 = _mm_add_epi16(sum0, T22);
- sum0 = _mm_add_epi16(sum0, T23);
-
- T00 = _mm_load_si128((__m128i*)(fenc + (12) * fencstride));
- T01 = _mm_load_si128((__m128i*)(fenc + (13) * fencstride));
- T02 = _mm_load_si128((__m128i*)(fenc + (14) * fencstride));
- T03 = _mm_load_si128((__m128i*)(fenc + (15) * fencstride));
-
- T10 = _mm_loadu_si128((__m128i*)(fref + (12) * frefstride));
- T11 = _mm_loadu_si128((__m128i*)(fref + (13) * frefstride));
- T12 = _mm_loadu_si128((__m128i*)(fref + (14) * frefstride));
- T13 = _mm_loadu_si128((__m128i*)(fref + (15) * frefstride));
-
- T20 = _mm_sad_epu8(T00, T10);
- T21 = _mm_sad_epu8(T01, T11);
- T22 = _mm_sad_epu8(T02, T12);
- T23 = _mm_sad_epu8(T03, T13);
-
- sum0 = _mm_add_epi16(sum0, T20);
- sum0 = _mm_add_epi16(sum0, T21);
- sum0 = _mm_add_epi16(sum0, T22);
- sum0 = _mm_add_epi16(sum0, T23);
+ PROCESS_16x4(12);
}
- else if ((ly % 8) == 0)
+ if (ly > 16)
{
- for (int i = 0; i < ly; i += 8)
+ for (int i = 16; i < ly; i += 8)
{
- T00 = _mm_load_si128((__m128i*)(fenc + (i + 0) * fencstride));
- T01 = _mm_load_si128((__m128i*)(fenc + (i + 1) * fencstride));
- T02 = _mm_load_si128((__m128i*)(fenc + (i + 2) * fencstride));
- T03 = _mm_load_si128((__m128i*)(fenc + (i + 3) * fencstride));
-
- T10 = _mm_loadu_si128((__m128i*)(fref + (i + 0) * frefstride));
- T11 = _mm_loadu_si128((__m128i*)(fref + (i + 1) * frefstride));
- T12 = _mm_loadu_si128((__m128i*)(fref + (i + 2) * frefstride));
- T13 = _mm_loadu_si128((__m128i*)(fref + (i + 3) * frefstride));
-
- T20 = _mm_sad_epu8(T00, T10);
- T21 = _mm_sad_epu8(T01, T11);
- T22 = _mm_sad_epu8(T02, T12);
- T23 = _mm_sad_epu8(T03, T13);
-
- sum0 = _mm_add_epi16(sum0, T20);
- sum0 = _mm_add_epi16(sum0, T21);
- sum0 = _mm_add_epi16(sum0, T22);
- sum0 = _mm_add_epi16(sum0, T23);
-
- T00 = _mm_load_si128((__m128i*)(fenc + (i + 4) * fencstride));
- T01 = _mm_load_si128((__m128i*)(fenc + (i + 5) * fencstride));
- T02 = _mm_load_si128((__m128i*)(fenc + (i + 6) * fencstride));
- T03 = _mm_load_si128((__m128i*)(fenc + (i + 7) * fencstride));
-
- T10 = _mm_loadu_si128((__m128i*)(fref + (i + 4) * frefstride));
- T11 = _mm_loadu_si128((__m128i*)(fref + (i + 5) * frefstride));
- T12 = _mm_loadu_si128((__m128i*)(fref + (i + 6) * frefstride));
- T13 = _mm_loadu_si128((__m128i*)(fref + (i + 7) * frefstride));
-
- T20 = _mm_sad_epu8(T00, T10);
- T21 = _mm_sad_epu8(T01, T11);
- T22 = _mm_sad_epu8(T02, T12);
- T23 = _mm_sad_epu8(T03, T13);
-