[x265-commits] [x265] disable SIGPIPE on Windows platform

Wed Apr 15 08:37:35 CEST 2015

details:   http://hg.videolan.org/x265/rev/dd456de98c23
branches:  
changeset: 10170:dd456de98c23
user:      Min Chen <chenm003 at 163.com>
date:      Tue Apr 14 13:41:40 2015 +0800
description:
disable SIGPIPE on Windows platform
Subject: [x265] asm: improve sub_ps[16x16] (477 -> 461) and reduce code size

details:   http://hg.videolan.org/x265/rev/e21ede5958ea
branches:  
changeset: 10171:e21ede5958ea
user:      Sumalatha Polureddy
date:      Mon Apr 13 16:25:08 2015 +0530
description:
asm: improve sub_ps[16x16] (477 -> 461) and reduce code size
Subject: [x265] asm: intra_pred_ang32_18 improved by ~45% over SSE4

details:   http://hg.videolan.org/x265/rev/becd2f63197d
branches:  
changeset: 10172:becd2f63197d
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Tue Apr 14 11:49:12 2015 +0530
description:
asm: intra_pred_ang32_18 improved by ~45% over SSE4

AVX2:
intra_ang_32x32[18]     33.10x   354.58          11737.10

SSE4:
intra_ang_32x32[18]     17.51x   650.80          11396.64
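
For readers unfamiliar with this mode, a scalar reference may help explain why it maps onto byte shifts (palignr) so well: angular mode 18 is the pure diagonal, so each output row is the previous row shifted right by one sample, with one more left-neighbour sample entering at the front. The sketch below is not the x265 C primitive. The function name is hypothetical, and the reference layout (srcPix[0] is the top-left sample, srcPix[1..64] the row above, srcPix[65..128] the left column) is an assumption inferred from the loads in the AVX2 routine in this patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference for 32x32 angular mode 18 (8-bit pixels).
// Assumed layout: srcPix[0] = top-left, srcPix[1..64] = above row,
// srcPix[65..128] = left column (top to bottom).
void intraAng18_32x32(uint8_t* dst, ptrdiff_t dstStride, const uint8_t* srcPix)
{
    const int N = 32;
    for (int y = 0; y < N; y++)
    {
        for (int x = 0; x < N; x++)
        {
            // Offset along the diagonal: >= 0 reads top-left/above,
            // < 0 reads the left column.
            const int k = x - y;
            dst[y * dstStride + x] = (k >= 0) ? srcPix[k] : srcPix[2 * N - k];
        }
    }
}
```

Under that layout, row 0 is a plain copy of srcPix[0..31], which matches the first store in the AVX2 code.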
Subject: [x265] sao: add saoCuOrgE3_2Rows function to process 2 rows

details:   http://hg.videolan.org/x265/rev/6fce6c27e22b
branches:  
changeset: 10173:6fce6c27e22b
user:      Divya Manivannan <divya at multicorewareinc.com>
date:      Tue Apr 14 10:18:29 2015 +0530
description:
sao: add saoCuOrgE3_2Rows function to process 2 rows
Subject: [x265] asm: avx2 code for satd_32xN

details:   http://hg.videolan.org/x265/rev/8e583b1e4de8
branches:  
changeset: 10174:8e583b1e4de8
user:      Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date:      Tue Apr 14 14:13:31 2015 +0530
description:
asm: avx2 code for satd_32xN

AVX2:
satd[ 32x8]        8.40x    957.22          8040.38
satd[32x16]        8.31x    1950.86         16214.44
satd[32x24]        8.50x    2897.62         24636.81
satd[32x32]        8.88x    3952.35         35115.40
satd[32x64]        9.18x    7334.90         67312.13

AVX:
satd[ 32x8]        4.63x    1738.62         8048.18
satd[32x16]        5.01x    3249.63         16295.51
satd[32x24]        5.30x    4767.54         25279.60
satd[32x32]        5.67x    6156.74         34895.57
satd[32x64]        5.59x    11708.14        65479.60
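
As background on what these kernels measure, SATD is the sum of absolute values of the Hadamard-transformed difference between two blocks. The following is a minimal scalar sketch, not the x265 reference code: the function names are hypothetical, and both the tiling into 4x4 transforms and the final halving are assumptions borrowed from common encoder practice rather than taken from this patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical 4x4 SATD: Hadamard transform of the difference block,
// then the sum of absolute coefficients, halved (assumed normalisation).
int satd4x4(const uint8_t* a, ptrdiff_t sa, const uint8_t* b, ptrdiff_t sb)
{
    int t[4][4];
    for (int y = 0; y < 4; y++)
    {
        // Difference plus horizontal butterflies for one row.
        const int d0 = a[y * sa + 0] - b[y * sb + 0];
        const int d1 = a[y * sa + 1] - b[y * sb + 1];
        const int d2 = a[y * sa + 2] - b[y * sb + 2];
        const int d3 = a[y * sa + 3] - b[y * sb + 3];
        const int s0 = d0 + d1, s1 = d0 - d1, s2 = d2 + d3, s3 = d2 - d3;
        t[y][0] = s0 + s2; t[y][1] = s1 + s3;
        t[y][2] = s0 - s2; t[y][3] = s1 - s3;
    }
    int sum = 0;
    for (int x = 0; x < 4; x++)
    {
        // Vertical butterflies for one column, accumulated as absolutes.
        const int s0 = t[0][x] + t[1][x], s1 = t[0][x] - t[1][x];
        const int s2 = t[2][x] + t[3][x], s3 = t[2][x] - t[3][x];
        sum += std::abs(s0 + s2) + std::abs(s1 + s3)
             + std::abs(s0 - s2) + std::abs(s1 - s3);
    }
    return sum >> 1;
}

// Hypothetical helper: larger blocks as a sum of 4x4 tiles.
// Width and height are assumed to be multiples of 4.
int satdBlock(const uint8_t* a, ptrdiff_t sa, const uint8_t* b, ptrdiff_t sb,
              int width, int height)
{
    int sum = 0;
    for (int y = 0; y < height; y += 4)
        for (int x = 0; x < width; x += 4)
            sum += satd4x4(a + y * sa + x, sa, b + y * sb + x, sb);
    return sum;
}
```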
Subject: [x265] asm: avx code for chroma copy_ss 32x64, reused luma code (2616 -> 1313)

details:   http://hg.videolan.org/x265/rev/dc4e269d1dec
branches:  
changeset: 10175:dc4e269d1dec
user:      Sumalatha Polureddy
date:      Tue Apr 14 15:51:59 2015 +0530
description:
asm: avx code for chroma copy_ss 32x64, reused luma code (2616 -> 1313)

sse2
[i422] copy_ss[32x64]  8.36x    2616.62         21881.62

avx
[i422] copy_ss[32x64]  16.80x   1313.77         22065.42
Subject: [x265] asm: ssse3 10bit code for convert_p2s[4xN]

details:   http://hg.videolan.org/x265/rev/07848ecda186
branches:  
changeset: 10176:07848ecda186
user:      Rajesh Paulraj <rajesh at multicorewareinc.com>
date:      Tue Apr 14 18:10:45 2015 +0530
description:
asm: ssse3 10bit code for convert_p2s[4xN]

     convert_p2s[4x4](2.70x), convert_p2s[4x8](3.53x), convert_p2s[4x16](3.82x)
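
For context, convert_p2s turns pixels into the 16-bit intermediate format used by the interpolation filters. The sketch below is a plain scalar version under stated assumptions: the 14-bit internal precision, the offset of half that range, and the function and constant names are not shown in this patch and are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed internal precision of the intermediate samples (illustrative).
static const int kInternalPrec = 14;

// Hypothetical scalar pixel-to-short conversion for high bit depth input.
// Each pixel is scaled up to the internal precision and re-centred on zero.
void pixelToShort(const uint16_t* src, ptrdiff_t srcStride,
                  int16_t* dst, ptrdiff_t dstStride,
                  int width, int height, int bitDepth)
{
    const int shift = kInternalPrec - bitDepth;      // 4 for 10-bit input
    const int offset = 1 << (kInternalPrec - 1);     // 8192
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
            dst[x] = (int16_t)((src[x] << shift) - offset);
        src += srcStride;
        dst += dstStride;
    }
}
```

With these assumptions a 10-bit mid-grey sample of 512 maps to 0, which is why the result fits comfortably in a signed 16-bit lane.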
Subject: [x265] asm: ssse3 10bit code for convert_p2s[8xN]

details:   http://hg.videolan.org/x265/rev/3adff3b58196
branches:  
changeset: 10177:3adff3b58196
user:      Rajesh Paulraj <rajesh at multicorewareinc.com>
date:      Tue Apr 14 18:28:02 2015 +0530
description:
asm: ssse3 10bit code for convert_p2s[8xN]

     convert_p2s[8x4](4.06x), convert_p2s[8x8](5.07x), convert_p2s[8x16](6.00x),
     convert_p2s[8x32](6.42x)
Subject: [x265] asm: ssse3 10bit code for convert_p2s[16xN]

details:   http://hg.videolan.org/x265/rev/c6d0421a367d
branches:  
changeset: 10178:c6d0421a367d
user:      Rajesh Paulraj <rajesh at multicorewareinc.com>
date:      Tue Apr 14 18:50:02 2015 +0530
description:
asm: ssse3 10bit code for convert_p2s[16xN]

     convert_p2s[16x4](8.18x), convert_p2s[16x8](10.59x),
     convert_p2s[16x12](11.01x), convert_p2s[16x16](11.00x),
     convert_p2s[16x32](11.59x), convert_p2s[16x64](11.68x)
Subject: [x265] asm: ssse3 10bit code for convert_p2s[32xN],[64xN]

details:   http://hg.videolan.org/x265/rev/565829b7a970
branches:  
changeset: 10179:565829b7a970
user:      Rajesh Paulraj <rajesh at multicorewareinc.com>
date:      Tue Apr 14 19:05:49 2015 +0530
description:
asm: ssse3 10bit code for convert_p2s[32xN],[64xN]

     convert_p2s[32x8](9.51x), convert_p2s[32x16](10.44x),
     convert_p2s[32x24](9.64x), convert_p2s[32x32](10.70x),
     convert_p2s[32x64](11.52x), convert_p2s[64x16](10.35x),
     convert_p2s[64x32](9.12x), convert_p2s[64x48](10.05x),
     convert_p2s[64x64](9.00x)
Subject: [x265] asm: ssse3 10bit code for convert_p2s[24xN]

details:   http://hg.videolan.org/x265/rev/00b90fb64d5f
branches:  
changeset: 10180:00b90fb64d5f
user:      Rajesh Paulraj <rajesh at multicorewareinc.com>
date:      Tue Apr 14 19:18:46 2015 +0530
description:
asm: ssse3 10bit code for convert_p2s[24xN]

     convert_p2s[24x32](14.57x)
Subject: [x265] improve rdoQuant() by reduce count of code group scan

details:   http://hg.videolan.org/x265/rev/b26385e20632
branches:  
changeset: 10181:b26385e20632
user:      Min Chen <chenm003 at 163.com>
date:      Tue Apr 14 21:18:58 2015 +0800
description:
improve rdoQuant() by reduce count of code group scan
Subject: [x265] improve rdoQuant() by use non-zero coeff group mask to reduce count of coeff scan

details:   http://hg.videolan.org/x265/rev/3a87866c76ad
branches:  
changeset: 10182:3a87866c76ad
user:      Min Chen <chenm003 at 163.com>
date:      Tue Apr 14 21:19:02 2015 +0800
description:
improve rdoQuant() by use non-zero coeff group mask to reduce count of coeff scan
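
The idea can be illustrated in isolation: if a 64-bit mask records which coefficient groups contain a non-zero coefficient, the sign-hiding loop can skip empty groups with one bit test instead of scanning their 16 coefficients. The helper below is a hypothetical stand-alone sketch, not code from quant.cpp; only the bit test mirrors the expression used in the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the coefficient-group indices, in reverse
// scan order, that still need a per-coefficient scan.
// sigGroupMask has bit scanCG[g] set when group g holds a non-zero coeff.
std::vector<int> groupsToVisit(uint64_t sigGroupMask,
                               const uint16_t* scanCG, int lastGroup)
{
    std::vector<int> visited;
    for (int subSet = lastGroup; subSet >= 0; subSet--)
    {
        // One bit test replaces a scan over the whole group.
        if (!(sigGroupMask & (1ULL << scanCG[subSet])))
            continue;
        visited.push_back(subSet);
    }
    return visited;
}
```

Because a set bit guarantees a non-zero coefficient in the group, the later "nothing found" branch can become an assertion, as the patch does with X265_CHECK.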
Subject: [x265] improve rdoQuant() by block fill on non-zero coeff group

details:   http://hg.videolan.org/x265/rev/44edfb7f0a0a
branches:  
changeset: 10183:44edfb7f0a0a
user:      Min Chen <chenm003 at 163.com>
date:      Tue Apr 14 21:19:06 2015 +0800
description:
improve rdoQuant() by block fill on non-zero coeff group
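
A small sketch of the block-fill step: once a whole 4x4 coefficient group is known to be uncoded, it can be cleared as four short row fills rather than 16 scattered stores through the scan table. The function name is hypothetical; the four-row memset pattern follows the loop added to quant.cpp in this patch.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: zero one 4x4 coefficient group inside a
// trSize x trSize block. blkPos is the raster index of the group's
// top-left coefficient.
void clearCoeffGroup(int16_t* coeff, uint32_t blkPos, uint32_t trSize)
{
    for (uint32_t row = 0; row < 4; row++)
        std::memset(&coeff[blkPos + row * trSize], 0, 4 * sizeof(*coeff));
}
```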
Subject: [x265] asm: improve algorithm logic on saoCuOrgE3

details:   http://hg.videolan.org/x265/rev/7f32086318d9
branches:  
changeset: 10184:7f32086318d9
user:      Min Chen <chenm003 at 163.com>
date:      Wed Apr 15 14:08:36 2015 +0800
description:
asm: improve algorithm logic on saoCuOrgE3
Subject: [x265] regression: typo in rc tests

details:   http://hg.videolan.org/x265/rev/737edf5ac008
branches:  
changeset: 10185:737edf5ac008
user:      mahesh pittala <mahesh at multicorewareinc.com>
date:      Wed Apr 15 10:58:54 2015 +0530
description:
regression: typo in rc tests

diffstat:

 source/common/loopfilter.cpp         |   24 +
 source/common/primitives.h           |    2 +
 source/common/quant.cpp              |   23 +-
 source/common/x86/asm-primitives.cpp |   32 +
 source/common/x86/intrapred.h        |    1 +
 source/common/x86/intrapred8.asm     |   94 +++++
 source/common/x86/ipfilter16.asm     |  586 +++++++++++++++++++++++++++++++---
 source/common/x86/ipfilter8.h        |   23 +
 source/common/x86/loopfilter.asm     |   40 +-
 source/common/x86/pixel-a.asm        |  300 +++++++++++++++++
 source/common/x86/pixel-util8.asm    |   50 ++-
 source/encoder/sao.cpp               |   24 +-
 source/output/reconplay.cpp          |    4 +
 source/test/pixelharness.cpp         |   46 ++-
 source/test/pixelharness.h           |    3 +
 source/test/rate-control-tests.txt   |    4 +-
 16 files changed, 1146 insertions(+), 110 deletions(-)

diffs (truncated from 1572 to 300 lines):

diff -r abfbfdf724a0 -r 737edf5ac008 source/common/loopfilter.cpp
--- a/source/common/loopfilter.cpp	Mon Apr 13 14:13:19 2015 -0700
+++ b/source/common/loopfilter.cpp	Wed Apr 15 10:58:54 2015 +0530
@@ -122,6 +122,29 @@ void processSaoCUE3(pixel *rec, int8_t *
     }
 }
 
+void processSaoCUE3_2Rows(pixel *rec, int8_t *upBuff1, int8_t *offsetEo, intptr_t stride, int startX, int endX, int8_t* signDown)
+{
+    int8_t signDown1;
+    int8_t edgeType;
+
+    for (int y = 0; y < 2; y++)
+    {
+        edgeType = signDown[y] + upBuff1[startX] + 2;
+        upBuff1[startX - 1] = -signDown[y];
+        rec[startX] = x265_clip(rec[startX] + offsetEo[edgeType]);
+
+        for (int x = startX + 1; x < endX; x++)
+        {
+            signDown1 = signOf(rec[x] - rec[x + stride]);
+            edgeType = signDown1 + upBuff1[x] + 2;
+            upBuff1[x - 1] = -signDown1;
+            rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
+        }
+        upBuff1[endX - 1] = signOf(rec[endX - 1 + stride + 1] - rec[endX]);
+        rec += stride + 1;
+    }
+}
+
 void processSaoCUB0(pixel* rec, const int8_t* offset, int ctuWidth, int ctuHeight, intptr_t stride)
 {
     #define SAO_BO_BITS 5
@@ -146,6 +169,7 @@ void setupLoopFilterPrimitives_c(Encoder
     p.saoCuOrgE1_2Rows = processSaoCUE1_2Rows;
     p.saoCuOrgE2 = processSaoCUE2;
     p.saoCuOrgE3 = processSaoCUE3;
+    p.saoCuOrgE3_2Rows = processSaoCUE3_2Rows;
     p.saoCuOrgB0 = processSaoCUB0;
     p.sign = calSign;
 }
diff -r abfbfdf724a0 -r 737edf5ac008 source/common/primitives.h
--- a/source/common/primitives.h	Mon Apr 13 14:13:19 2015 -0700
+++ b/source/common/primitives.h	Wed Apr 15 10:58:54 2015 +0530
@@ -172,6 +172,7 @@ typedef void (*saoCuOrgE0_t)(pixel* rec,
 typedef void (*saoCuOrgE1_t)(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
 typedef void (*saoCuOrgE2_t)(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride);
 typedef void (*saoCuOrgE3_t)(pixel* rec, int8_t* upBuff1, int8_t* m_offsetEo, intptr_t stride, int startX, int endX);
+typedef void (*saoCuOrgE3_2Rows_t)(pixel* rec, int8_t* upBuff1, int8_t* m_offsetEo, intptr_t stride, int startX, int endX, int8_t* signDown);
 typedef void (*saoCuOrgB0_t)(pixel* rec, const int8_t* offsetBo, int ctuWidth, int ctuHeight, intptr_t stride);
 typedef void (*sign_t)(int8_t *dst, const pixel *src1, const pixel *src2, const int endX);
 typedef void (*planecopy_cp_t) (const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift);
@@ -277,6 +278,7 @@ struct EncoderPrimitives
     saoCuOrgE1_t          saoCuOrgE1, saoCuOrgE1_2Rows;
     saoCuOrgE2_t          saoCuOrgE2;
     saoCuOrgE3_t          saoCuOrgE3;
+    saoCuOrgE3_2Rows_t    saoCuOrgE3_2Rows;
     saoCuOrgB0_t          saoCuOrgB0;
 
     downscale_t           frameInitLowres;
diff -r abfbfdf724a0 -r 737edf5ac008 source/common/quant.cpp
--- a/source/common/quant.cpp	Mon Apr 13 14:13:19 2015 -0700
+++ b/source/common/quant.cpp	Wed Apr 15 10:58:54 2015 +0530
@@ -981,26 +981,41 @@ uint32_t Quant::rdoQuant(const CUData& c
         dstCoeff[blkPos] = (int16_t)((level ^ mask) - mask);
     }
 
+    // Average 49.62 pixels
     /* clean uncoded coefficients */
-    for (int pos = bestLastIdx; pos <= lastScanPos; pos++)
+    for (int pos = bestLastIdx; pos <= fastMin(lastScanPos, (bestLastIdx | (SCAN_SET_SIZE - 1))); pos++)
+    {
         dstCoeff[codeParams.scan[pos]] = 0;
+    }
+    for (int pos = (bestLastIdx & ~(SCAN_SET_SIZE - 1)) + SCAN_SET_SIZE; pos <= lastScanPos; pos += SCAN_SET_SIZE)
+    {
+        const uint32_t blkPos = codeParams.scan[pos];
+        memset(&dstCoeff[blkPos + 0 * trSize], 0, 4 * sizeof(*dstCoeff));
+        memset(&dstCoeff[blkPos + 1 * trSize], 0, 4 * sizeof(*dstCoeff));
+        memset(&dstCoeff[blkPos + 2 * trSize], 0, 4 * sizeof(*dstCoeff));
+        memset(&dstCoeff[blkPos + 3 * trSize], 0, 4 * sizeof(*dstCoeff));
+    }
 
     /* rate-distortion based sign-hiding */
     if (cu.m_slice->m_pps->bSignHideEnabled && numSig >= 2)
     {
+        const int realLastScanPos = (bestLastIdx - 1) >> LOG2_SCAN_SET_SIZE;
         int lastCG = true;
-        for (int subSet = cgLastScanPos; subSet >= 0; subSet--)
+        for (int subSet = realLastScanPos; subSet >= 0; subSet--)
         {
             int subPos = subSet << LOG2_SCAN_SET_SIZE;
             int n;
 
+            if (!(sigCoeffGroupFlag64 & (1ULL << codeParams.scanCG[subSet])))
+                continue;
+
             /* measure distance between first and last non-zero coef in this
              * coding group */
             for (n = SCAN_SET_SIZE - 1; n >= 0; --n)
                 if (dstCoeff[codeParams.scan[n + subPos]])
                     break;
-            if (n < 0)
-                continue;
+
+            X265_CHECK(n >= 0, "non-zero coeff scan failuare!\n");
 
             int lastNZPosInCG = n;
 
diff -r abfbfdf724a0 -r 737edf5ac008 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp	Mon Apr 13 14:13:19 2015 -0700
+++ b/source/common/x86/asm-primitives.cpp	Wed Apr 15 10:58:54 2015 +0530
@@ -948,6 +948,30 @@ void setupAssemblyPrimitives(EncoderPrim
         p.cu[BLOCK_16x16].count_nonzero = x265_count_nonzero_16x16_ssse3;
         p.cu[BLOCK_32x32].count_nonzero = x265_count_nonzero_32x32_ssse3;
         p.frameInitLowres = x265_frame_init_lowres_core_ssse3;
+
+        p.pu[LUMA_4x4].convert_p2s = x265_filterPixelToShort_4x4_ssse3;
+        p.pu[LUMA_4x8].convert_p2s = x265_filterPixelToShort_4x8_ssse3;
+        p.pu[LUMA_4x16].convert_p2s = x265_filterPixelToShort_4x16_ssse3;
+        p.pu[LUMA_8x4].convert_p2s = x265_filterPixelToShort_8x4_ssse3;
+        p.pu[LUMA_8x8].convert_p2s = x265_filterPixelToShort_8x8_ssse3;
+        p.pu[LUMA_8x16].convert_p2s = x265_filterPixelToShort_8x16_ssse3;
+        p.pu[LUMA_8x32].convert_p2s = x265_filterPixelToShort_8x32_ssse3;
+        p.pu[LUMA_16x4].convert_p2s = x265_filterPixelToShort_16x4_ssse3;
+        p.pu[LUMA_16x8].convert_p2s = x265_filterPixelToShort_16x8_ssse3;
+        p.pu[LUMA_16x12].convert_p2s = x265_filterPixelToShort_16x12_ssse3;
+        p.pu[LUMA_16x16].convert_p2s = x265_filterPixelToShort_16x16_ssse3;
+        p.pu[LUMA_16x32].convert_p2s = x265_filterPixelToShort_16x32_ssse3;
+        p.pu[LUMA_16x64].convert_p2s = x265_filterPixelToShort_16x64_ssse3;
+        p.pu[LUMA_32x8].convert_p2s = x265_filterPixelToShort_32x8_ssse3;
+        p.pu[LUMA_32x16].convert_p2s = x265_filterPixelToShort_32x16_ssse3;
+        p.pu[LUMA_32x24].convert_p2s = x265_filterPixelToShort_32x24_ssse3;
+        p.pu[LUMA_32x32].convert_p2s = x265_filterPixelToShort_32x32_ssse3;
+        p.pu[LUMA_32x64].convert_p2s = x265_filterPixelToShort_32x64_ssse3;
+        p.pu[LUMA_64x16].convert_p2s = x265_filterPixelToShort_64x16_ssse3;
+        p.pu[LUMA_64x32].convert_p2s = x265_filterPixelToShort_64x32_ssse3;
+        p.pu[LUMA_64x48].convert_p2s = x265_filterPixelToShort_64x48_ssse3;
+        p.pu[LUMA_64x64].convert_p2s = x265_filterPixelToShort_64x64_ssse3;
+        p.pu[LUMA_24x32].convert_p2s = x265_filterPixelToShort_24x32_ssse3;
     }
     if (cpuMask & X265_CPU_SSE4)
     {
@@ -1516,6 +1540,7 @@ void setupAssemblyPrimitives(EncoderPrim
         p.chroma[X265_CSP_I420].cu[CHROMA_420_16x16].copy_ss = x265_blockcopy_ss_16x16_avx;
         p.chroma[X265_CSP_I420].cu[CHROMA_420_32x32].copy_ss = x265_blockcopy_ss_32x32_avx;
         p.chroma[X265_CSP_I422].cu[CHROMA_422_16x32].copy_ss = x265_blockcopy_ss_16x32_avx;
+        p.chroma[X265_CSP_I422].cu[CHROMA_422_32x64].copy_ss = x265_blockcopy_ss_32x64_avx;
 
         p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].copy_pp = x265_blockcopy_pp_32x8_avx;
         p.pu[LUMA_32x8].copy_pp = x265_blockcopy_pp_32x8_avx;
@@ -1669,6 +1694,12 @@ void setupAssemblyPrimitives(EncoderPrim
         p.pu[LUMA_8x16].satd  = x265_pixel_satd_8x16_avx2;
         p.pu[LUMA_8x8].satd   = x265_pixel_satd_8x8_avx2;
 
+        p.pu[LUMA_32x8].satd   = x265_pixel_satd_32x8_avx2;
+        p.pu[LUMA_32x16].satd   = x265_pixel_satd_32x16_avx2;
+        p.pu[LUMA_32x24].satd   = x265_pixel_satd_32x24_avx2;
+        p.pu[LUMA_32x32].satd   = x265_pixel_satd_32x32_avx2;
+        p.pu[LUMA_32x64].satd   = x265_pixel_satd_32x64_avx2;
+
         p.pu[LUMA_32x8].sad = x265_pixel_sad_32x8_avx2;
         p.pu[LUMA_32x16].sad = x265_pixel_sad_32x16_avx2;
         p.pu[LUMA_32x24].sad = x265_pixel_sad_32x24_avx2;
@@ -1821,6 +1852,7 @@ void setupAssemblyPrimitives(EncoderPrim
         p.cu[BLOCK_32x32].intra_pred[23] = x265_intra_pred_ang32_23_avx2;
         p.cu[BLOCK_32x32].intra_pred[22] = x265_intra_pred_ang32_22_avx2;
         p.cu[BLOCK_32x32].intra_pred[21] = x265_intra_pred_ang32_21_avx2;
+        p.cu[BLOCK_32x32].intra_pred[18] = x265_intra_pred_ang32_18_avx2;
 
         // copy_sp primitives
         p.cu[BLOCK_16x16].copy_sp = x265_blockcopy_sp_16x16_avx2;
diff -r abfbfdf724a0 -r 737edf5ac008 source/common/x86/intrapred.h
--- a/source/common/x86/intrapred.h	Mon Apr 13 14:13:19 2015 -0700
+++ b/source/common/x86/intrapred.h	Wed Apr 15 10:58:54 2015 +0530
@@ -277,6 +277,7 @@ void x265_intra_pred_ang32_24_avx2(pixel
 void x265_intra_pred_ang32_23_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang32_22_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang32_21_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_18_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_all_angs_pred_4x4_sse2(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
diff -r abfbfdf724a0 -r 737edf5ac008 source/common/x86/intrapred8.asm
--- a/source/common/x86/intrapred8.asm	Mon Apr 13 14:13:19 2015 -0700
+++ b/source/common/x86/intrapred8.asm	Wed Apr 15 10:58:54 2015 +0530
@@ -28,6 +28,7 @@
 SECTION_RODATA 32
 
 intra_pred_shuff_0_8:    times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8
+intra_pred_shuff_15_0:   times 2 db 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
 
 pb_0_8        times 8 db  0,  8
 pb_unpackbw1  times 2 db  1,  8,  2,  8,  3,  8,  4,  8
@@ -10366,6 +10367,99 @@ cglobal intra_pred_ang32_17, 4,7,8
 
     RET
 
+INIT_YMM avx2
+cglobal intra_pred_ang32_18, 4, 4, 3
+    movu           m0, [r2]
+    movu           xm1, [r2 + 1 + 64]
+    pshufb         xm1, [intra_pred_shuff_15_0]
+    mova           xm2, xm0
+    vinserti128    m1, m1, xm2, 1
+
+    lea            r3, [r1 * 3]
+
+    movu           [r0], m0
+    palignr        m2, m0, m1, 15
+    movu           [r0 + r1], m2
+    palignr        m2, m0, m1, 14
+    movu           [r0 + r1 * 2], m2
+    palignr        m2, m0, m1, 13
+    movu           [r0 + r3], m2
+
+    lea            r0, [r0 + r1 * 4]
+    palignr        m2, m0, m1, 12
+    movu           [r0], m2
+    palignr        m2, m0, m1, 11
+    movu           [r0 + r1], m2
+    palignr        m2, m0, m1, 10
+    movu           [r0 + r1 * 2], m2
+    palignr        m2, m0, m1, 9
+    movu           [r0 + r3], m2
+
+    lea            r0, [r0 + r1 * 4]
+    palignr        m2, m0, m1, 8
+    movu           [r0], m2
+    palignr        m2, m0, m1, 7
+    movu           [r0 + r1], m2
+    palignr        m2, m0, m1, 6
+    movu           [r0 + r1 * 2], m2
+    palignr        m2, m0, m1, 5
+    movu           [r0 + r3], m2
+
+    lea            r0, [r0 + r1 * 4]
+    palignr        m2, m0, m1, 4
+    movu           [r0], m2
+    palignr        m2, m0, m1, 3
+    movu           [r0 + r1], m2
+    palignr        m2, m0, m1, 2
+    movu           [r0 + r1 * 2], m2
+    palignr        m2, m0, m1, 1
+    movu           [r0 + r3], m2
+
+    lea            r0, [r0 + r1 * 4]
+    movu           [r0], m1
+
+    movu           xm0, [r2 + 64 + 17]
+    pshufb         xm0, [intra_pred_shuff_15_0]
+    vinserti128    m0, m0, xm1, 1
+
+    palignr        m2, m1, m0, 15
+    movu           [r0 + r1], m2
+    palignr        m2, m1, m0, 14
+    movu           [r0 + r1 * 2], m2
+    palignr        m2, m1, m0, 13
+    movu           [r0 + r3], m2
+
+    lea            r0, [r0 + r1 * 4]
+    palignr        m2, m1, m0, 12
+    movu           [r0], m2
+    palignr        m2, m1, m0, 11
+    movu           [r0 + r1], m2
+    palignr        m2, m1, m0, 10
+    movu           [r0 + r1 * 2], m2
+    palignr        m2, m1, m0, 9
+    movu           [r0 + r3], m2
+
+    lea            r0, [r0 + r1 * 4]
+    palignr        m2, m1, m0, 8
+    movu           [r0], m2
+    palignr        m2, m1, m0, 7
+    movu           [r0 + r1], m2
+    palignr        m2, m1, m0,6
+    movu           [r0 + r1 * 2], m2
+    palignr        m2, m1, m0, 5
+    movu           [r0 + r3], m2
+
+    lea            r0, [r0 + r1 * 4]
+    palignr        m2, m1, m0, 4
+    movu           [r0], m2
+    palignr        m2, m1, m0, 3
+    movu           [r0 + r1], m2
+    palignr        m2, m1, m0,2
+    movu           [r0 + r1 * 2], m2
+    palignr        m2, m1, m0, 1
+    movu           [r0 + r3], m2
+    RET
+
 INIT_XMM sse4
 cglobal intra_pred_ang32_18, 4,5,5
     movu        m0, [r2]               ; [15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0]
diff -r abfbfdf724a0 -r 737edf5ac008 source/common/x86/ipfilter16.asm
--- a/source/common/x86/ipfilter16.asm	Mon Apr 13 14:13:19 2015 -0700
+++ b/source/common/x86/ipfilter16.asm	Wed Apr 15 10:58:54 2015 +0530
@@ -117,6 +117,7 @@ SECTION .text
 cextern pd_32
 cextern pw_pixel_max