[x265-commits] [x265] dct: modified block copy used in dct8 with convert16to32 ...

Sat Oct 12 06:25:52 CEST 2013

details:   http://hg.videolan.org/x265/rev/855757691efc
branches:  
changeset: 4385:855757691efc
user:      Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date:      Fri Oct 11 12:42:16 2013 +0530
description:
dct: modified block copy used in dct8 with convert16to32 inline function
Subject: [x265] dct: manually inline convert16to32, for 10% improvement

details:   http://hg.videolan.org/x265/rev/ab9f6ad97d30
branches:  
changeset: 4386:ab9f6ad97d30
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 11 02:31:33 2013 -0500
description:
dct: manually inline convert16to32, for 10% improvement
Subject: [x265] intra-sse3.cpp: Created common macros PRED_INTRA_ANGLE_4_START, PRED_INTRA_ANGLE_4_END for PredIntraAng4_[ANGLE] function.

details:   http://hg.videolan.org/x265/rev/ee4f9ae07523
branches:  
changeset: 4387:ee4f9ae07523
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Fri Oct 11 12:41:47 2013 +0530
description:
intra-sse3.cpp: Created common macros PRED_INTRA_ANGLE_4_START, PRED_INTRA_ANGLE_4_END for PredIntraAng4_[ANGLE] function.
Subject: [x265] intra-sse3.cpp: Replace PredIntraAng4_26 vector class function with intrinsic using intrinsic macros PRED_INTRA_ANGLE_4_START and PRED_INTRA_ANGLE_4_END.

details:   http://hg.videolan.org/x265/rev/295973cbc020
branches:  
changeset: 4388:295973cbc020
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Fri Oct 11 12:59:03 2013 +0530
description:
intra-sse3.cpp: Replace PredIntraAng4_26 vector class function with intrinsic using intrinsic macros PRED_INTRA_ANGLE_4_START and PRED_INTRA_ANGLE_4_END.
Subject: [x265] asm: fix bug in filterHorizontal_p_p_4 with width less than 8 (seed 0x52578C72)

details:   http://hg.videolan.org/x265/rev/953a4e9f3d57
branches:  
changeset: 4389:953a4e9f3d57
user:      Min Chen <chenm003 at 163.com>
date:      Fri Oct 11 13:51:46 2013 +0800
description:
asm: fix bug in filterHorizontal_p_p_4 with width less than 8 (seed 0x52578C72)
Subject: [x265] asm: improvement filterHorizontal_p_p_4 by reorder intermedia data

details:   http://hg.videolan.org/x265/rev/080a9fdada2c
branches:  
changeset: 4390:080a9fdada2c
user:      Min Chen <chenm003 at 163.com>
date:      Fri Oct 11 14:51:58 2013 +0800
description:
asm: improvement filterHorizontal_p_p_4 by reorder intermedia data

1. replace phaddw with paddw
2. use extra load operator to split data dependency and reduce table size
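
Editor's note: the first point can be illustrated with a scalar model. This is a sketch, not the code in ipfilter8.asm. The names Lanes, phaddw_model, paddw_model and split_even_odd are hypothetical, and the even/odd layout is only one possible reordering. It shows that once the intermediate data is arranged so that the two values to be summed sit in the same lane of two registers, a lane-wise add (paddw) produces what a horizontal add (phaddw) would have produced. Horizontal adds tend to cost more than vertical adds on many x86 cores, which is the likely motivation.

```cpp
#include <array>
#include <cstdint>

// Eight 16-bit lanes, standing in for one XMM register (illustrative type).
using Lanes = std::array<int16_t, 8>;

// Scalar model of phaddw: sums adjacent pairs of a into the low four lanes
// and adjacent pairs of b into the high four lanes.
Lanes phaddw_model(const Lanes& a, const Lanes& b)
{
    Lanes r{};
    for (int i = 0; i < 4; i++)
    {
        r[i]     = static_cast<int16_t>(a[2 * i] + a[2 * i + 1]);
        r[i + 4] = static_cast<int16_t>(b[2 * i] + b[2 * i + 1]);
    }
    return r;
}

// Scalar model of paddw: plain lane-by-lane addition.
Lanes paddw_model(const Lanes& a, const Lanes& b)
{
    Lanes r{};
    for (int i = 0; i < 8; i++)
        r[i] = static_cast<int16_t>(a[i] + b[i]);
    return r;
}

// Hypothetical reordering step: place the even-indexed elements in one
// register and the odd-indexed elements in another, so that each pair to be
// summed ends up in the same lane.
void split_even_odd(const Lanes& a, const Lanes& b, Lanes& even, Lanes& odd)
{
    for (int i = 0; i < 4; i++)
    {
        even[i]     = a[2 * i];
        odd[i]      = a[2 * i + 1];
        even[i + 4] = b[2 * i];
        odd[i + 4]  = b[2 * i + 1];
    }
}
```

In the model, paddw_model(even, odd) equals phaddw_model(a, b) for any inputs, so the reordering changes only where the data lives, not the result.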
Subject: [x265] intra-sse3.cpp: Replace PredIntraAng4_21 vector class function with intrinsic.

details:   http://hg.videolan.org/x265/rev/e9b401f5c655
branches:  
changeset: 4391:e9b401f5c655
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Fri Oct 11 13:16:57 2013 +0530
description:
intra-sse3.cpp: Replace PredIntraAng4_21 vector class function with intrinsic.
Subject: [x265] dct: Replaced partialButterfly16 vector class function to intrinsic

details:   http://hg.videolan.org/x265/rev/f760de7f5596
branches:  
changeset: 4392:f760de7f5596
user:      Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date:      Fri Oct 11 14:09:28 2013 +0530
description:
dct: Replaced partialButterfly16 vector class function to intrinsic
Subject: [x265] dct: move dct8 to dct-sse41.cpp, inline convert16to32

details:   http://hg.videolan.org/x265/rev/f0eebdf90a58
branches:  
changeset: 4393:f0eebdf90a58
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 11 13:57:03 2013 -0500
description:
dct: move dct8 to dct-sse41.cpp, inline convert16to32
Subject: [x265] intra-sse3.cpp: Replace PredIntraAng4_17 vector class function with intrinsic.

details:   http://hg.videolan.org/x265/rev/17c772394df3
branches:  
changeset: 4394:17c772394df3
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Fri Oct 11 14:15:43 2013 +0530
description:
intra-sse3.cpp: Replace PredIntraAng4_17 vector class function with intrinsic.
Subject: [x265] intra-sse3.cpp: Replace PredIntraAng4_13 vector class function with intrinsic.

details:   http://hg.videolan.org/x265/rev/f3d0ced4a4f1
branches:  
changeset: 4395:f3d0ced4a4f1
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Fri Oct 11 14:17:41 2013 +0530
description:
intra-sse3.cpp: Replace PredIntraAng4_13 vector class function with intrinsic.
Subject: [x265] intra-sse3.cpp: Replace PredIntraAng4_9 vector class function with intrinsic.

details:   http://hg.videolan.org/x265/rev/e65e3714bbb9
branches:  
changeset: 4396:e65e3714bbb9
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Fri Oct 11 14:20:11 2013 +0530
description:
intra-sse3.cpp: Replace PredIntraAng4_9 vector class function with intrinsic.
Subject: [x265] intra-sse3.cpp: Replace PredIntraAng4_5 vector class function with intrinsic.

details:   http://hg.videolan.org/x265/rev/2b9f94e11cc5
branches:  
changeset: 4397:2b9f94e11cc5
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Fri Oct 11 14:22:21 2013 +0530
description:
intra-sse3.cpp: Replace PredIntraAng4_5 vector class function with intrinsic.
Subject: [x265] intra-sse3.cpp: Replace PredIntraAng4_2 vector class function with intrinsic.

details:   http://hg.videolan.org/x265/rev/bd335e21744d
branches:  
changeset: 4398:bd335e21744d
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Fri Oct 11 14:24:31 2013 +0530
description:
intra-sse3.cpp: Replace PredIntraAng4_2 vector class function with intrinsic.
Subject: [x265] intra-sse3.cpp: Replace PredIntraAng4_m_2 vector class function with intrinsic.

details:   http://hg.videolan.org/x265/rev/e4efd408f394
branches:  
changeset: 4399:e4efd408f394
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Fri Oct 11 15:38:38 2013 +0530
description:
intra-sse3.cpp: Replace PredIntraAng4_m_2 vector class function with intrinsic.
Subject: [x265] intra-sse3.cpp: Replace PredIntraAng4_m_5 vector class function with intrinsic.

details:   http://hg.videolan.org/x265/rev/87a56e0ff6a9
branches:  
changeset: 4400:87a56e0ff6a9
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Fri Oct 11 15:42:44 2013 +0530
description:
intra-sse3.cpp: Replace PredIntraAng4_m_5 vector class function with intrinsic.
Subject: [x265] intra-sse3.cpp: Replace PredIntraAng4_m_9 vector class function with intrinsic.

details:   http://hg.videolan.org/x265/rev/5c6f7106c918
branches:  
changeset: 4401:5c6f7106c918
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Fri Oct 11 15:46:32 2013 +0530
description:
intra-sse3.cpp: Replace PredIntraAng4_m_9 vector class function with intrinsic.
Subject: [x265] intra-sse3.cpp: Replace PredIntraAng4_m_13 vector class function with intrinsic.

details:   http://hg.videolan.org/x265/rev/f1013117efab
branches:  
changeset: 4402:f1013117efab
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Fri Oct 11 15:49:33 2013 +0530
description:
intra-sse3.cpp: Replace PredIntraAng4_m_13 vector class function with intrinsic.
Subject: [x265] intra-sse3.cpp: Replace PredIntraAng4_m_17 vector class function with intrinsic.

details:   http://hg.videolan.org/x265/rev/263acbde8ec1
branches:  
changeset: 4403:263acbde8ec1
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Fri Oct 11 15:52:48 2013 +0530
description:
intra-sse3.cpp: Replace PredIntraAng4_m_17 vector class function with intrinsic.
Subject: [x265] intra-sse3.cpp: Replace PredIntraAng4_m_21 vector class function with intrinsic.

details:   http://hg.videolan.org/x265/rev/90b34ae5e8de
branches:  
changeset: 4404:90b34ae5e8de
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Fri Oct 11 16:06:09 2013 +0530
description:
intra-sse3.cpp: Replace PredIntraAng4_m_21 vector class function with intrinsic.
Subject: [x265] intra-sse3.cpp: Replace PredIntraAng4_m_26 vector class function with intrinsic.

details:   http://hg.videolan.org/x265/rev/267fa83cd7b9
branches:  
changeset: 4405:267fa83cd7b9
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Fri Oct 11 16:08:52 2013 +0530
description:
intra-sse3.cpp: Replace PredIntraAng4_m_26 vector class function with intrinsic.
Subject: [x265] intra-sse3.cpp: Replace PredIntraAng4_m_32 vector class function with intrinsic.

details:   http://hg.videolan.org/x265/rev/4824f15116e6
branches:  
changeset: 4406:4824f15116e6
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Fri Oct 11 16:13:21 2013 +0530
description:
intra-sse3.cpp: Replace PredIntraAng4_m_32 vector class function with intrinsic.
Subject: [x265] dct: Replaced partialButterfly32 vector class function to intrinsic

details:   http://hg.videolan.org/x265/rev/ca00db64f5bb
branches:  
changeset: 4407:ca00db64f5bb
user:      Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date:      Fri Oct 11 16:52:58 2013 +0530
description:
dct: Replaced partialButterfly32 vector class function to intrinsic
Subject: [x265] dct: move dct32 to dct-sse41.cpp, inline convert16to32

details:   http://hg.videolan.org/x265/rev/def1551c14f0
branches:  
changeset: 4408:def1551c14f0
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 11 14:23:14 2013 -0500
description:
dct: move dct32 to dct-sse41.cpp, inline convert16to32
Subject: [x265] pixel-sse3.cpp: Replace convert32to16_shr vector class function with intrinsic.

details:   http://hg.videolan.org/x265/rev/efb230642757
branches:  
changeset: 4409:efb230642757
user:      Dnyaneshwar Gorade <dnyaneshwar at multicorewareinc.com>
date:      Fri Oct 11 17:18:44 2013 +0530
description:
pixel-sse3.cpp: Replace convert32to16_shr vector class function with intrinsic.
Subject: [x265] pixel-sse3: move convert32to16_shr to top of file, remove vector class includes

details:   http://hg.videolan.org/x265/rev/9f37e3d7818c
branches:  
changeset: 4410:9f37e3d7818c
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 11 14:27:11 2013 -0500
description:
pixel-sse3: move convert32to16_shr to top of file, remove vector class includes
Subject: [x265] dct: Replaced inversedst vector class function to intrinsic

details:   http://hg.videolan.org/x265/rev/df024b91ffd6
branches:  
changeset: 4411:df024b91ffd6
user:      Yuvaraj Venkatesh <yuvaraj at multicorewareinc.com>
date:      Fri Oct 11 18:30:11 2013 +0530
description:
dct: Replaced inversedst vector class function to intrinsic
Subject: [x265] dct-sse3: remove idst4; it uses SSE4.1 but dct-sse41.cpp already has idst4

details:   http://hg.videolan.org/x265/rev/839a9ba551e4
branches:  
changeset: 4412:839a9ba551e4
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 11 14:38:41 2013 -0500
description:
dct-sse3: remove idst4; it uses SSE4.1 but dct-sse41.cpp already has idst4
Subject: [x265] dct-sse41: reorder functions for clarity - no code change

details:   http://hg.videolan.org/x265/rev/d6dc4ebb5cbe
branches:  
changeset: 4413:d6dc4ebb5cbe
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 11 14:41:28 2013 -0500
description:
dct-sse41: reorder functions for clarity - no code change
Subject: [x265] dct-sse3: don't compile dct4 for 16bpp builds when it is not used

details:   http://hg.videolan.org/x265/rev/2267068cc7e1
branches:  
changeset: 4414:2267068cc7e1
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 11 14:43:29 2013 -0500
description:
dct-sse3: don't compile dct4 for 16bpp builds when it is not used
Subject: [x265] dct-ssse3: remove vector class includes; dct files are now clean

details:   http://hg.videolan.org/x265/rev/1cd3bc5e6881
branches:  
changeset: 4415:1cd3bc5e6881
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 11 14:45:31 2013 -0500
description:
dct-ssse3: remove vector class includes; dct files are now clean
Subject: [x265] Some fixes in applyWeight() function

details:   http://hg.videolan.org/x265/rev/b70432f7b275
branches:  
changeset: 4416:b70432f7b275
user:      Shazeb Nawaz Khan <shazeb at multicorewareinc.com>
date:      Fri Oct 11 18:28:54 2013 +0530
description:
Some fixes in applyWeight() function

These won't fix the PSNR drop but are necessary
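
Editor's note: one of the fixes in this changeset replaces "shift = shift + shiftNum" with local variables. The sketch below shows why that matters under the assumption that the base shift is state that persists across calls: adding shiftNum to it on every call would keep growing the shift, while deriving a local value each time keeps calls repeatable. WeightParams, roundFor and weightSample are hypothetical names, and the weighting formula is an assumed simplified form, not the actual weightpUniPixel primitive.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical holder for the weighting parameters (names are illustrative).
struct WeightParams
{
    int weight;
    int offset;
    int shift;   // base shift; must stay unchanged between calls
};

// Rounding term as computed in the patch: half the divisor, or 0 when
// there is no shift at all.
inline int roundFor(int shift)
{
    return shift ? (1 << (shift - 1)) : 0;
}

// Assumed simplified weighting: ((weight * x + round) >> shift) + offset,
// clipped to the 8-bit range. The shift and round are derived locally, so
// the stored parameters are never modified.
inline uint8_t weightSample(int x, const WeightParams& p, int shiftNum)
{
    int localShift = p.shift + shiftNum;
    int localRound = roundFor(localShift);
    int v = ((p.weight * x + localRound) >> localShift) + p.offset;
    return static_cast<uint8_t>(std::min(255, std::max(0, v)));
}
```

Because weightSample takes the parameters by const reference, calling it any number of times with the same input returns the same value, which is the property the local_shift/local_round change restores.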
Subject: [x265] rc: added TEncCfg instance to RateControl to reuse all the rc params directly.

details:   http://hg.videolan.org/x265/rev/ce889cef37be
branches:  
changeset: 4417:ce889cef37be
user:      Aarthi Thirumalai
date:      Fri Oct 11 16:31:21 2013 +0530
description:
rc: added TEncCfg instance to RateControl to reuse all the rc params directly.
Subject: [x265] param: added rc states for setting Aq mode and Aq strength

details:   http://hg.videolan.org/x265/rev/73d085da8533
branches:  
changeset: 4418:73d085da8533
user:      Aarthi Thirumalai
date:      Fri Oct 11 16:37:59 2013 +0530
description:
param: added rc states for setting Aq mode and Aq strength
Subject: [x265] primitves: add c primitives for the following :

details:   http://hg.videolan.org/x265/rev/725ac176cd13
branches:  
changeset: 4419:725ac176cd13
user:      Aarthi Thirumalai
date:      Fri Oct 11 16:10:11 2013 +0530
description:
primitves: add c primitives for the following :

compute AC energy for each block
copy pixels of chroma plane
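
Editor's note: the pixel_var primitive added here returns the pixel sum in the low 32 bits and the sum of squares in the high 32 bits. A minimal sketch of how a caller might turn that packed value into an AC energy follows. It assumes an 8-bit pixel type and the usual definition of AC energy as the sum of squares minus sum*sum/N. pixel_var_sketch and ac_energy_sketch are hypothetical names and this is not necessarily how the rate control code consumes the primitive.

```cpp
#include <cstdint>

typedef uint8_t pixel;   // assumption: 8-bit build

// Same packing as the C primitive in this push: sum in the low 32 bits,
// sum of squares in the high 32 bits.
template<int w, int h>
uint64_t pixel_var_sketch(const pixel* pix, intptr_t stride)
{
    uint32_t sum = 0, sqr = 0;
    for (int y = 0; y < h; y++)
    {
        for (int x = 0; x < w; x++)
        {
            sum += pix[x];
            sqr += pix[x] * pix[x];
        }
        pix += stride;
    }
    return sum + ((uint64_t)sqr << 32);
}

// Hypothetical helper: unpack the two halves and remove the DC component.
// The result is the sum of squared deviations from the block mean.
template<int w, int h>
uint32_t ac_energy_sketch(const pixel* pix, intptr_t stride)
{
    uint64_t packed = pixel_var_sketch<w, h>(pix, stride);
    uint32_t sum = (uint32_t)packed;
    uint32_t sqr = (uint32_t)(packed >> 32);
    return sqr - (uint32_t)(((uint64_t)sum * sum) / (w * h));
}
```

A flat block has zero AC energy under this definition, and any deviation from the mean raises it, which is what makes the measure useful as an input to adaptive quantization.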
Subject: [x265] intra: prevent variable shadow warnings from GCC

details:   http://hg.videolan.org/x265/rev/d97cf152f620
branches:  
changeset: 4420:d97cf152f620
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 11 22:45:22 2013 -0500
description:
intra: prevent variable shadow warnings from GCC
Subject: [x265] blockcopy-sse3: consistent naming convention

details:   http://hg.videolan.org/x265/rev/0be273b5f082
branches:  
changeset: 4421:0be273b5f082
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 11 22:55:01 2013 -0500
description:
blockcopy-sse3: consistent naming convention
Subject: [x265] blockcopy-sse3: remove vector class use from last 16bpp intrinsic

details:   http://hg.videolan.org/x265/rev/41b7ceea1e32
branches:  
changeset: 4422:41b7ceea1e32
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 11 22:58:00 2013 -0500
description:
blockcopy-sse3: remove vector class use from last 16bpp intrinsic

blockcopy files are now vector class clean
Subject: [x265] blockcopy-sse3: consistent naming convention

details:   http://hg.videolan.org/x265/rev/8518e39a2b74
branches:  
changeset: 4423:8518e39a2b74
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 11 22:59:53 2013 -0500
description:
blockcopy-sse3: consistent naming convention
Subject: [x265] intra: remove vector class header include from intra-sse41.cpp

details:   http://hg.videolan.org/x265/rev/f77efd501767
branches:  
changeset: 4424:f77efd501767
user:      Steve Borho <steve at borho.org>
date:      Fri Oct 11 23:13:03 2013 -0500
description:
intra: remove vector class header include from intra-sse41.cpp

intra-sse3.cpp is the last file with 8bpp (non-AVX2) vector class primitives

diffstat:

 source/common/common.cpp             |    2 +
 source/common/pixel.cpp              |   31 +
 source/common/primitives.h           |    4 +
 source/common/reference.cpp          |   12 +-
 source/common/vec/blockcopy-sse3.cpp |   90 ++--
 source/common/vec/dct-sse3.cpp       |  554 +---------------------------
 source/common/vec/dct-sse41.cpp      |  702 ++++++++++++++++++++++++++++++----
 source/common/vec/dct-ssse3.cpp      |    6 +-
 source/common/vec/intra-sse3.cpp     |  682 +++++++++++++++++++--------------
 source/common/vec/intra-sse41.cpp    |    5 +-
 source/common/vec/pixel-sse3.cpp     |   45 +-
 source/common/x86/ipfilter8.asm      |   50 +-
 source/encoder/encoder.cpp           |    2 +-
 source/encoder/motion.cpp            |   10 +-
 source/encoder/ratecontrol.cpp       |   27 +-
 source/encoder/ratecontrol.h         |    5 +-
 source/x265.h                        |    2 +
 17 files changed, 1162 insertions(+), 1067 deletions(-)

diffs (truncated from 2748 to 300 lines):

diff -r c6d89dc62e19 -r f77efd501767 source/common/common.cpp
--- a/source/common/common.cpp	Fri Oct 11 01:47:53 2013 -0500
+++ b/source/common/common.cpp	Fri Oct 11 23:13:03 2013 -0500
@@ -169,6 +169,8 @@ void x265_param_default(x265_param_t *pa
     param->rc.qpStep = 4;
     param->rc.rateControlMode = X265_RC_CQP;
     param->rc.qp = 32;
+    param->rc.aqMode = 0;
+    param->rc.aqStrength = 1.0;
 
     /* Quality Measurement Metrics */
     param->bEnablePsnr = 1;
diff -r c6d89dc62e19 -r f77efd501767 source/common/pixel.cpp
--- a/source/common/pixel.cpp	Fri Oct 11 01:47:53 2013 -0500
+++ b/source/common/pixel.cpp	Fri Oct 11 23:13:03 2013 -0500
@@ -688,6 +688,33 @@ float ssim_end_4(ssim_t sum0[5][4], ssim
     }
     return ssim;
 }
+
+template<int w, int h>
+uint64_t pixel_var(pixel *pix, intptr_t i_stride)
+{
+    uint32_t sum = 0, sqr = 0;
+    for (int y = 0; y < h; y++)
+    {
+        for (int x = 0; x < w; x++)
+        {
+            sum += pix[x];
+            sqr += pix[x] * pix[x];
+        }
+        pix += i_stride;
+    }
+    return sum + ((uint64_t)sqr << 32);
+}
+
+void plane_copy_deinterleave_chroma(pixel *dstu, intptr_t dstuStride, pixel *dstv, intptr_t dstvStride,
+                                    pixel *src,  intptr_t srcStride, int w, int h)
+{
+    for (int y = 0; y < h; y++, dstu += dstuStride, dstv += dstvStride, src += srcStride)
+        for (int x = 0; x < w; x++)
+        {
+            dstu[x] = src[2 * x];
+            dstv[x] = src[2 * x + 1];
+        }
+}
 }  // end anonymous namespace
 
 namespace x265 {
@@ -905,5 +932,9 @@ void Setup_C_PixelPrimitives(EncoderPrim
     p.frame_init_lowres_core = frame_init_lowres_core;
     p.ssim_4x4x2_core = ssim_4x4x2_core;
     p.ssim_end_4 = ssim_end_4;
+
+    p.var[PARTITION_16x16] = pixel_var<16,16>;
+    p.var[PARTITION_8x8] = pixel_var<8,8>;
+    p.plane_copy_deinterleave_c = plane_copy_deinterleave_chroma;
 }
 }
diff -r c6d89dc62e19 -r f77efd501767 source/common/primitives.h
--- a/source/common/primitives.h	Fri Oct 11 01:47:53 2013 -0500
+++ b/source/common/primitives.h	Fri Oct 11 23:13:03 2013 -0500
@@ -202,6 +202,8 @@ typedef void (*downscale_t)(pixel *src0,
 typedef void (*extendCURowBorder_t)(pixel* txt, intptr_t stride, int width, int height, int marginX);
 typedef void (*ssim_4x4x2_core_t)(const pixel *pix1, intptr_t stride1, const pixel *pix2, intptr_t stride2, ssim_t sums[2][4]);
 typedef float (*ssim_end4_t)(ssim_t sum0[5][4], ssim_t sum1[5][4], int width);
+typedef uint64_t (*var_t)(pixel *pix, intptr_t stride);
+typedef void (*plane_copy_deinterleave_t)(pixel *dstu, intptr_t dstuStride, pixel *dstv, intptr_t dstvStride, pixel *src,  intptr_t srcStride, int w, int h);
 
 /* Define a structure containing function pointers to optimized encoder
  * primitives.  Each pointer can reference either an assembly routine,
@@ -261,6 +263,8 @@ struct EncoderPrimitives
     downscale_t     frame_init_lowres_core;
     ssim_4x4x2_core_t ssim_4x4x2_core;
     ssim_end4_t       ssim_end_4;
+    var_t             var[NUM_PARTITIONS];
+    plane_copy_deinterleave_t plane_copy_deinterleave_c;
 };
 
 /* This copy of the table is what gets used by the encoder.
diff -r c6d89dc62e19 -r f77efd501767 source/common/reference.cpp
--- a/source/common/reference.cpp	Fri Oct 11 01:47:53 2013 -0500
+++ b/source/common/reference.cpp	Fri Oct 11 23:13:03 2013 -0500
@@ -92,7 +92,7 @@ MotionReference::~MotionReference()
 
 void MotionReference::applyWeight(int rows, int numRows)
 {
-    rows = X265_MIN(rows, numRows-1);
+    rows = X265_MIN(rows, numRows);
     if (m_numWeightedRows >= rows)
         return;
     int marginX = m_reconPic->m_lumaMarginX;
@@ -101,15 +101,15 @@ void MotionReference::applyWeight(int ro
     pixel* dst = fpelPlane + ((m_numWeightedRows * (int)g_maxCUHeight) * lumaStride);
     int width = m_reconPic->getWidth();
     int height = ((rows - m_numWeightedRows) * g_maxCUHeight);
-    if (rows == numRows - 1)
+    if (rows == numRows)
         height = ((m_reconPic->getHeight() % g_maxCUHeight) ? (m_reconPic->getHeight() % g_maxCUHeight) : g_maxCUHeight);
     size_t dstStride = lumaStride;
 
     // Computing weighted CU rows
     int shiftNum = IF_INTERNAL_PREC - X265_DEPTH;
-    shift = shift + shiftNum;
-    round = shift ? (1 << (shift - 1)) : 0;
-    primitives.weightpUniPixel(src, dst, lumaStride, dstStride, width, height, weight, round, shift, offset);
+    int local_shift = shift + shiftNum;
+    int local_round = local_shift ? (1 << (local_shift - 1)) : 0;
+    primitives.weightpUniPixel(src, dst, lumaStride, dstStride, width, height, weight, local_round, local_shift, offset);
 
     // Extending Left & Right
     primitives.extendRowBorder(dst, dstStride, width, height, marginX);
@@ -125,7 +125,7 @@ void MotionReference::applyWeight(int ro
     }
 
     // Extending Bottom
-    if (rows == (numRows - 1))
+    if (rows == numRows)
     {
         pixel *pixY = fpelPlane - marginX + (m_reconPic->getHeight() - 1) * dstStride;
         for (int y = 0; y < marginY; y++)
diff -r c6d89dc62e19 -r f77efd501767 source/common/vec/blockcopy-sse3.cpp
--- a/source/common/vec/blockcopy-sse3.cpp	Fri Oct 11 01:47:53 2013 -0500
+++ b/source/common/vec/blockcopy-sse3.cpp	Fri Oct 11 23:13:03 2013 -0500
@@ -28,8 +28,37 @@
 #include <cstring>
 
 namespace {
-#if !HIGH_BIT_DEPTH
-void blockcopy_p_p(int bx, int by, pixel *dst, intptr_t dstride, pixel *src, intptr_t sstride)
+#if HIGH_BIT_DEPTH
+void blockcopy_pp(int bx, int by, pixel *dst, intptr_t dstride, pixel *src, intptr_t sstride)
+{
+    if ((bx & 7) || (((size_t)dst | (size_t)src | sstride | dstride) & 15))
+    {
+        // slow path, irregular memory alignments or sizes
+        for (int y = 0; y < by; y++)
+        {
+            memcpy(dst, src, bx * sizeof(pixel));
+            src += sstride;
+            dst += dstride;
+        }
+    }
+    else
+    {
+        // fast path, multiples of 8 pixel wide blocks
+        for (int y = 0; y < by; y++)
+        {
+            for (int x = 0; x < bx; x += 8)
+            {
+                __m128i word = _mm_load_si128((__m128i const*)(src + x));
+                _mm_store_si128((__m128i*)&dst[x], word);
+            }
+
+            src += sstride;
+            dst += dstride;
+        }
+    }
+}
+#else
+void blockcopy_pp(int bx, int by, pixel *dst, intptr_t dstride, pixel *src, intptr_t sstride)
 {
     size_t aligncheck = (size_t)dst | (size_t)src | bx | sstride | dstride;
 
@@ -60,7 +89,7 @@ void blockcopy_p_p(int bx, int by, pixel
     }
 }
 
-void blockcopy_p_s(int bx, int by, pixel *dst, intptr_t dstride, short *src, intptr_t sstride)
+void blockcopy_ps(int bx, int by, pixel *dst, intptr_t dstride, short *src, intptr_t sstride)
 {
     size_t aligncheck = (size_t)dst | (size_t)src | bx | sstride | dstride;
     if (!(aligncheck & 15))
@@ -173,7 +202,7 @@ void pixeladd_pp(int bx, int by, pixel *
 }
 #endif /* if HIGH_BIT_DEPTH */
 
-void blockcopy_s_p(int bx, int by, short *dst, intptr_t dstride, uint8_t *src, intptr_t sstride)
+void blockcopy_sp(int bx, int by, short *dst, intptr_t dstride, uint8_t *src, intptr_t sstride)
 {
     size_t aligncheck = (size_t)dst | (size_t)src | bx | sstride | dstride;
     if (!(aligncheck & 15))
@@ -339,64 +368,27 @@ void pixeladd_ss(int bx, int by, short *
 }
 }
 
-#define INSTRSET 3
-#include "vectorclass.h"
-
-namespace {
-#if HIGH_BIT_DEPTH
-void blockcopy_p_p(int bx, int by, pixel *dst, intptr_t dstride, pixel *src, intptr_t sstride)
-{
-    if ((bx & 7) || (((size_t)dst | (size_t)src | sstride | dstride) & 15))
-    {
-        // slow path, irregular memory alignments or sizes
-        for (int y = 0; y < by; y++)
-        {
-            memcpy(dst, src, bx * sizeof(pixel));
-            src += sstride;
-            dst += dstride;
-        }
-    }
-    else
-    {
-        // fast path, multiples of 8 pixel wide blocks
-        for (int y = 0; y < by; y++)
-        {
-            for (int x = 0; x < bx; x += 8)
-            {
-                Vec8s word;
-                word.load_a(src + x);
-                word.store_a(dst + x);
-            }
-
-            src += sstride;
-            dst += dstride;
-        }
-    }
-}
-#endif
-}
-
 namespace x265 {
 void Setup_Vec_BlockCopyPrimitives_sse3(EncoderPrimitives &p)
 {
 #if HIGH_BIT_DEPTH
-    p.blockcpy_pp = blockcopy_p_p;
-    p.blockcpy_ps = (blockcpy_ps_t)blockcopy_p_p;
-    p.blockcpy_sp = (blockcpy_sp_t)blockcopy_p_p;
+    p.blockcpy_pp = blockcopy_pp;
+    p.blockcpy_ps = (blockcpy_ps_t)blockcopy_pp;
+    p.blockcpy_sp = (blockcpy_sp_t)blockcopy_pp;
 #else
     p.pixeladd_pp = pixeladd_pp;
 #endif
 
 #if HIGH_BIT_DEPTH
     // At high bit depth, a pixel is a short
-    p.blockcpy_sc = (blockcpy_sc_t)blockcopy_s_p;
+    p.blockcpy_sc = (blockcpy_sc_t)blockcopy_sp;
     p.pixeladd_pp = (pixeladd_pp_t)pixeladd_ss;
     p.pixeladd_ss = pixeladd_ss;
 #else
-    p.blockcpy_pp = blockcopy_p_p;
-    p.blockcpy_ps = blockcopy_p_s;
-    p.blockcpy_sp = blockcopy_s_p;
-    p.blockcpy_sc = blockcopy_s_p;
+    p.blockcpy_pp = blockcopy_pp;
+    p.blockcpy_ps = blockcopy_ps;
+    p.blockcpy_sp = blockcopy_sp;
+    p.blockcpy_sc = blockcopy_sp;
     p.pixelsub_sp = pixelsub_sp;
     p.pixeladd_ss = pixeladd_ss;
 #endif
diff -r c6d89dc62e19 -r f77efd501767 source/common/vec/dct-sse3.cpp
--- a/source/common/vec/dct-sse3.cpp	Fri Oct 11 01:47:53 2013 -0500
+++ b/source/common/vec/dct-sse3.cpp	Fri Oct 11 23:13:03 2013 -0500
@@ -40,6 +40,7 @@
 using namespace x265;
 
 namespace {
+#if !HIGH_BIT_DEPTH
 ALIGN_VAR_32(static const short, tab_dct_4[][8]) =
 {
     { 64, 64, 64, 64, 64, 64, 64, 64 },
@@ -120,6 +121,7 @@ void dct4(short *src, int *dst, intptr_t
     _mm_storeu_si128((__m128i*)&dst[2 * 4], T72);
     _mm_storeu_si128((__m128i*)&dst[3 * 4], T73);
 }
+#endif
 
 ALIGN_VAR_32(static const short, tab_idct_4x4[4][8]) =
 {
@@ -1730,565 +1732,13 @@ void idct32(int *src, short *dst, intptr
 }
 }
 
-
-/* Vector class primitives */
-#define INSTRSET 3
-#include "vectorclass.h"
-namespace {
-inline void partialButterfly16(short *src, short *dst, int shift, int line)
-{
-    int j;
-    int add = 1 << (shift - 1);
-
-    Vec4i zero_row(64, 64, 0, 0);
-    Vec4i four_row(83, 36, 0, 0);
-    Vec4i eight_row(64, -64, 0, 0);
-    Vec4i twelve_row(36, -83, 0, 0);
-
-    Vec4i two_row(89, 75, 50, 18);
-    Vec4i six_row(75, -18, -89, -50);
-    Vec4i ten_row(50, -89, 18, 75);
-    Vec4i fourteen_row(18, -50, 75, -89);
-
-    Vec4i one_row_first_half(90, 87, 80, 70);
-    Vec4i one_row_second_half(57, 43, 25,  9);