[x265-commits] [x265] analysis: allow intra mode in RD-0/4

Ashok Kumar Mishra ashok at multicorewareinc.com
Mon Jun 22 16:00:51 CEST 2015


details:   http://hg.videolan.org/x265/rev/8aa2bedda740
branches:  
changeset: 10675:8aa2bedda740
user:      Ashok Kumar Mishra<ashok at multicorewareinc.com>
date:      Thu Jun 04 16:40:19 2015 +0530
description:
analysis: allow intra mode in RD-0/4
Output will change for the --limit-refs 0 command line
Subject: [x265] doc: update limit-refs behaviour for intra modes

details:   http://hg.videolan.org/x265/rev/44b6b2df7016
branches:  
changeset: 10676:44b6b2df7016
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Fri Jun 19 16:43:29 2015 +0530
description:
doc: update limit-refs behaviour for intra modes
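
A minimal sketch of the gating rule that this documentation change describes (see the cli.rst hunk in the diff). The names `shouldEvaluateIntra` and `subBlockBestIsIntra` are hypothetical and this is not the code in analysis.cpp. The behaviour for limit-refs 0 is an assumption inferred from the wording "for all non-zero values".

```cpp
#include <array>
#include <cassert>

// Hypothetical helper, not taken from analysis.cpp: decides whether the
// current depth should evaluate intra mode in an inter slice.
bool shouldEvaluateIntra(int limitRefs, const std::array<bool, 4>& subBlockBestIsIntra)
{
    // Assumption: with limit-refs 0 the sub-block results do not restrict intra.
    if (limitRefs == 0)
        return true;

    // Non-zero limit-refs: evaluate intra only if at least one of the
    // four sub-blocks chose intra as its best mode.
    for (bool isIntra : subBlockBestIsIntra)
        if (isIntra)
            return true;
    return false;
}
```

Other conditions the encoder may apply (slice type, CU size, RD level) are deliberately left out of this sketch.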
Subject: [x265] param: move x265_atof into namespace "X265_NS"

details:   http://hg.videolan.org/x265/rev/10f8683f725d
branches:  
changeset: 10677:10f8683f725d
user:      Praveen Tiwari <praveen at multicorewareinc.com>
date:      Fri Jun 19 18:58:01 2015 +0530
description:
param: move x265_atof into namespace "X265_NS"
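
For readers unfamiliar with the helper being moved, here is a sketch of the same strict strtod pattern shown in the param.cpp hunk. It uses a hypothetical name, `parseDoubleStrict`, so that it can be compiled on its own. The error flag is only ever set, never cleared, so the caller is assumed to initialise it to false.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical standalone copy of the pattern used by x265_atof:
// the whole string must be consumed, otherwise bError is raised.
double parseDoubleStrict(const char* str, bool& bError)
{
    char* end;
    double v = std::strtod(str, &end);

    // end == str   : nothing could be parsed (empty or non-numeric input)
    // *end != '\0' : trailing characters follow the number
    if (end == str || *end != '\0')
        bError = true;
    return v;
}
```

With this pattern "1.5" parses cleanly, while "1.5x" and "" both raise the flag.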
Subject: [x265] testbench: costCoeffNxN and enable asm code (based on Sumalatha's patch)

details:   http://hg.videolan.org/x265/rev/43ae2f789af1
branches:  
changeset: 10678:43ae2f789af1
user:      Min Chen <chenm003 at 163.com>
date:      Fri Jun 19 17:44:56 2015 -0700
description:
testbench: costCoeffNxN and enable asm code (based on Sumalatha's patch)
Subject: [x265] asm: intrapred_angX_4x4 sse2 performance tweaks

details:   http://hg.videolan.org/x265/rev/3f004e9a1159
branches:  
changeset: 10679:3f004e9a1159
user:      David T Yuen <dtyx265 at gmail.com>
date:      Sun Jun 21 18:33:58 2015 -0700
description:
asm: intrapred_angX_4x4 sse2 performance tweaks

Created individual primitives for angles 19-25 and 27-33 to allow
individual tweaking of each angle for about 20% performance improvement

primitive        	speedup	 asm time 	 C time
intra_ang_4x4[ 3]	3.66x 	 542.46   	 1986.43
intra_ang_4x4[ 4]	4.21x 	 507.58   	 2135.09
intra_ang_4x4[ 5]	4.16x 	 510.05   	 2119.99
intra_ang_4x4[ 6]	4.43x 	 482.52   	 2135.18
intra_ang_4x4[ 7]	4.09x 	 477.58   	 1955.19
intra_ang_4x4[ 8]	4.53x 	 460.03   	 2085.06
intra_ang_4x4[ 9]	4.51x 	 462.54   	 2084.99
intra_ang_4x4[11]	4.53x 	 480.05   	 2176.00
intra_ang_4x4[12]	4.66x 	 480.00   	 2235.34
intra_ang_4x4[13]	4.24x 	 550.06   	 2330.84
intra_ang_4x4[14]	4.13x 	 567.51   	 2345.12
intra_ang_4x4[15]	4.08x 	 567.53   	 2315.21
intra_ang_4x4[16]	4.17x 	 567.52   	 2365.42
intra_ang_4x4[17]	3.98x 	 610.05   	 2425.51
intra_ang_4x4[19]	3.54x 	 514.99   	 1825.34
intra_ang_4x4[20]	3.88x 	 452.49   	 1755.41
intra_ang_4x4[21]	3.72x 	 452.66   	 1684.99
intra_ang_4x4[22]	3.79x 	 460.04   	 1745.36
intra_ang_4x4[23]	3.65x 	 470.09   	 1715.27
intra_ang_4x4[24]	4.60x 	 362.51   	 1666.24
intra_ang_4x4[25]	4.32x 	 362.62   	 1565.41
intra_ang_4x4[27]	4.24x 	 352.69   	 1496.58
intra_ang_4x4[28]	4.24x 	 352.60   	 1495.93
intra_ang_4x4[29]	3.66x 	 365.34   	 1336.02
intra_ang_4x4[30]	3.96x 	 377.61   	 1495.37
intra_ang_4x4[31]	3.68x 	 420.17   	 1545.37
intra_ang_4x4[32]	3.86x 	 400.19   	 1545.37
intra_ang_4x4[33]	3.12x 	 427.53   	 1335.37
Subject: [x265] asm: intrapred_angX_4x4 sse2 performance tweaks 10-bit

details:   http://hg.videolan.org/x265/rev/fd899b282f19
branches:  
changeset: 10680:fd899b282f19
user:      David T Yuen <dtyx265 at gmail.com>
date:      Sun Jun 21 20:52:16 2015 -0700
description:
asm: intrapred_angX_4x4 sse2 performance tweaks 10-bit

Created individual primitives for angles 19-25 and 27-33 to allow
individual tweaking of each angle for about 5% performance improvement

primitive        	speedup	 asm time 	 C time
intra_ang_4x4[ 3]	3.90x 	 487.44   	 1900.97
intra_ang_4x4[ 4]	4.51x 	 454.99   	 2050.33
intra_ang_4x4[ 5]	4.51x 	 455.00   	 2049.97
intra_ang_4x4[ 6]	4.82x 	 425.00   	 2049.97
intra_ang_4x4[ 7]	4.44x 	 427.50   	 1899.97
intra_ang_4x4[ 8]	4.71x 	 425.00   	 1999.97
intra_ang_4x4[ 9]	4.71x 	 425.00   	 1999.97
intra_ang_4x4[11]	4.76x 	 410.00   	 1951.26
intra_ang_4x4[12]	5.00x 	 410.00   	 2050.27
intra_ang_4x4[13]	4.48x 	 482.50   	 2160.44
intra_ang_4x4[14]	4.70x 	 462.50   	 2172.89
intra_ang_4x4[15]	4.57x 	 460.00   	 2100.26
intra_ang_4x4[16]	4.83x 	 455.00   	 2199.91
intra_ang_4x4[17]	3.96x 	 562.50   	 2230.17
intra_ang_4x4[19]	3.67x 	 475.00   	 1742.82
intra_ang_4x4[20]	4.32x 	 397.49   	 1715.35
intra_ang_4x4[21]	3.88x 	 402.49   	 1562.49
intra_ang_4x4[22]	4.08x 	 410.00   	 1672.74
intra_ang_4x4[23]	3.91x 	 415.00   	 1622.59
intra_ang_4x4[24]	4.09x 	 370.00   	 1513.66
intra_ang_4x4[25]	3.79x 	 372.50   	 1412.90
intra_ang_4x4[27]	4.00x 	 365.01   	 1460.97
intra_ang_4x4[28]	3.85x 	 380.01   	 1462.66
intra_ang_4x4[29]	3.73x 	 365.00   	 1359.97
intra_ang_4x4[30]	4.11x 	 367.50   	 1509.97
intra_ang_4x4[31]	4.00x 	 377.50   	 1509.97
intra_ang_4x4[32]	4.00x 	 377.50   	 1509.97
intra_ang_4x4[33]	3.44x 	 395.00   	 1359.97
Subject: [x265] winxp: partial fix for Issue #146, rename x265 to X265_NS

details:   http://hg.videolan.org/x265/rev/e8dc042008fa
branches:  
changeset: 10681:e8dc042008fa
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Mon Jun 22 11:08:01 2015 +0530
description:
winxp: partial fix for Issue #146, rename x265 to X265_NS
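
A sketch of why the hard-coded `x265::` prefix in the winxp.h macros is a problem, assuming X265_NS is a preprocessor macro that can expand to a namespace name other than `x265`. Everything below (`DEMO_NS`, `demo_10bit`, `DEMO_CONDITION_VARIABLE`, `DEMO_INIT_CV`) is a hypothetical stand-in and not x265 code.

```cpp
#include <cassert>

// Hypothetical stand-in for a build-defined namespace macro.
#define DEMO_NS demo_10bit

namespace DEMO_NS {
struct ConditionVariable { int waiters; };

inline int cond_init(ConditionVariable* cond)
{
    cond->waiters = 0;
    return 1;
}
} // namespace DEMO_NS

// Mapping through the macro keeps working whatever DEMO_NS expands to.
// A literal "x265::ConditionVariable" here would fail to compile, because
// no namespace called x265 exists in this translation unit.
#define DEMO_CONDITION_VARIABLE DEMO_NS::ConditionVariable
#define DEMO_INIT_CV            DEMO_NS::cond_init
```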
Subject: [x265] winxp: fix typo

details:   http://hg.videolan.org/x265/rev/83a7d8244424
branches:  
changeset: 10682:83a7d8244424
user:      Deepthi Nandakumar <deepthi at multicorewareinc.com>
date:      Mon Jun 22 15:15:33 2015 +0530
description:
winxp: fix typo

diffstat:

 doc/reST/cli.rst                     |     4 +
 source/common/param.cpp              |    20 +-
 source/common/param.h                |     2 +
 source/common/winxp.h                |    12 +-
 source/common/x86/asm-primitives.cpp |    58 +-
 source/common/x86/intrapred16.asm    |   987 +++++++++++++++----------
 source/common/x86/intrapred8.asm     |  1293 +++++++++++++++++++++++----------
 source/encoder/analysis.cpp          |     4 +-
 source/test/pixelharness.cpp         |   164 ++++
 source/test/pixelharness.h           |     2 +
 10 files changed, 1686 insertions(+), 860 deletions(-)

diffs (truncated from 2808 to 300 lines):

diff -r 1c6de5ac3883 -r 83a7d8244424 doc/reST/cli.rst
--- a/doc/reST/cli.rst	Thu Jun 18 15:29:11 2015 -0500
+++ b/doc/reST/cli.rst	Mon Jun 22 15:15:33 2015 +0530
@@ -620,6 +620,10 @@ the prediction quad-tree.
 	CUs and the rect/amp motion searches at that depth will only use the 
 	reference(s) selected by 2Nx2N. 
 
+	For all non-zero values of limit-refs, the current depth will evaluate
+	intra mode (in inter slices), only if intra mode was chosen as the best
+	mode for atleast one of the 4 sub-blocks.
+
 	You can often increase the number of references you are using
 	(within your decoder level limits) if you enable one or
 	both of these flags.
diff -r 1c6de5ac3883 -r 83a7d8244424 source/common/param.cpp
--- a/source/common/param.cpp	Thu Jun 18 15:29:11 2015 -0500
+++ b/source/common/param.cpp	Mon Jun 22 15:15:33 2015 +0530
@@ -471,16 +471,6 @@ static int x265_atobool(const char* str,
     return 0;
 }
 
-static double x265_atof(const char* str, bool& bError)
-{
-    char *end;
-    double v = strtod(str, &end);
-
-    if (end == str || *end != '\0')
-        bError = true;
-    return v;
-}
-
 static int parseName(const char* arg, const char* const* names, bool& bError)
 {
     for (int i = 0; names[i]; i++)
@@ -890,6 +880,16 @@ int x265_atoi(const char* str, bool& bEr
     return v;
 }
 
+double x265_atof(const char* str, bool& bError)
+{
+    char *end;
+    double v = strtod(str, &end);
+
+    if (end == str || *end != '\0')
+        bError = true;
+    return v;
+}
+
 /* cpu name can be:
  *   auto || true - x265::cpu_detect()
  *   false || no  - disabled
diff -r 1c6de5ac3883 -r 83a7d8244424 source/common/param.h
--- a/source/common/param.h	Thu Jun 18 15:29:11 2015 -0500
+++ b/source/common/param.h	Mon Jun 22 15:15:33 2015 +0530
@@ -2,6 +2,7 @@
  * Copyright (C) 2013 x265 project
  *
  * Authors: Deepthi Nandakumar <deepthi at multicorewareinc.com>
+ *          Praveen Kumar Tiwari <praveen at multicorewareinc.com>
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -33,6 +34,7 @@ void  x265_print_reconfigured_params(x26
 void  x265_param_apply_fastfirstpass(x265_param *p);
 char* x265_param2string(x265_param *param);
 int   x265_atoi(const char *str, bool& bError);
+double x265_atof(const char *str, bool& bError);
 int   parseCpuName(const char *value, bool& bError);
 void  setParamAspectRatio(x265_param *p, int width, int height);
 void  getParamAspectRatio(x265_param *p, int& width, int& height);
diff -r 1c6de5ac3883 -r 83a7d8244424 source/common/winxp.h
--- a/source/common/winxp.h	Thu Jun 18 15:29:11 2015 -0500
+++ b/source/common/winxp.h	Mon Jun 22 15:15:33 2015 +0530
@@ -49,12 +49,12 @@ BOOL WINAPI cond_wait(ConditionVariable 
 void cond_destroy(ConditionVariable *cond);
 
 /* map missing API symbols to our structure and functions */
-#define CONDITION_VARIABLE          x265::ConditionVariable
-#define InitializeConditionVariable x265::cond_init
-#define SleepConditionVariableCS    x265::cond_wait
-#define WakeConditionVariable       x265::cond_signal
-#define WakeAllConditionVariable    x265::cond_broadcast
-#define XP_CONDITION_VAR_FREE       x265::cond_destroy
+#define CONDITION_VARIABLE          X265_NS::ConditionVariable
+#define InitializeConditionVariable X265_NS::cond_init
+#define SleepConditionVariableCS    X265_NS::cond_wait
+#define WakeConditionVariable       X265_NS::cond_signal
+#define WakeAllConditionVariable    X265_NS::cond_broadcast
+#define XP_CONDITION_VAR_FREE       X265_NS::cond_destroy
 
 } // namespace X265_NS
 
diff -r 1c6de5ac3883 -r 83a7d8244424 source/common/x86/asm-primitives.cpp
--- a/source/common/x86/asm-primitives.cpp	Thu Jun 18 15:29:11 2015 -0500
+++ b/source/common/x86/asm-primitives.cpp	Mon Jun 22 15:15:33 2015 +0530
@@ -977,21 +977,21 @@ void setupAssemblyPrimitives(EncoderPrim
         p.cu[BLOCK_4x4].intra_pred[16] = PFX(intra_pred_ang4_16_sse2);
         p.cu[BLOCK_4x4].intra_pred[17] = PFX(intra_pred_ang4_17_sse2);
         p.cu[BLOCK_4x4].intra_pred[18] = PFX(intra_pred_ang4_18_sse2);
-        p.cu[BLOCK_4x4].intra_pred[19] = PFX(intra_pred_ang4_17_sse2);
-        p.cu[BLOCK_4x4].intra_pred[20] = PFX(intra_pred_ang4_16_sse2);
-        p.cu[BLOCK_4x4].intra_pred[21] = PFX(intra_pred_ang4_15_sse2);
-        p.cu[BLOCK_4x4].intra_pred[22] = PFX(intra_pred_ang4_14_sse2);
-        p.cu[BLOCK_4x4].intra_pred[23] = PFX(intra_pred_ang4_13_sse2);
-        p.cu[BLOCK_4x4].intra_pred[24] = PFX(intra_pred_ang4_12_sse2);
-        p.cu[BLOCK_4x4].intra_pred[25] = PFX(intra_pred_ang4_11_sse2);
+        p.cu[BLOCK_4x4].intra_pred[19] = PFX(intra_pred_ang4_19_sse2);
+        p.cu[BLOCK_4x4].intra_pred[20] = PFX(intra_pred_ang4_20_sse2);
+        p.cu[BLOCK_4x4].intra_pred[21] = PFX(intra_pred_ang4_21_sse2);
+        p.cu[BLOCK_4x4].intra_pred[22] = PFX(intra_pred_ang4_22_sse2);
+        p.cu[BLOCK_4x4].intra_pred[23] = PFX(intra_pred_ang4_23_sse2);
+        p.cu[BLOCK_4x4].intra_pred[24] = PFX(intra_pred_ang4_24_sse2);
+        p.cu[BLOCK_4x4].intra_pred[25] = PFX(intra_pred_ang4_25_sse2);
         p.cu[BLOCK_4x4].intra_pred[26] = PFX(intra_pred_ang4_26_sse2);
-        p.cu[BLOCK_4x4].intra_pred[27] = PFX(intra_pred_ang4_9_sse2);
-        p.cu[BLOCK_4x4].intra_pred[28] = PFX(intra_pred_ang4_8_sse2);
-        p.cu[BLOCK_4x4].intra_pred[29] = PFX(intra_pred_ang4_7_sse2);
-        p.cu[BLOCK_4x4].intra_pred[30] = PFX(intra_pred_ang4_6_sse2);
-        p.cu[BLOCK_4x4].intra_pred[31] = PFX(intra_pred_ang4_5_sse2);
-        p.cu[BLOCK_4x4].intra_pred[32] = PFX(intra_pred_ang4_4_sse2);
-        p.cu[BLOCK_4x4].intra_pred[33] = PFX(intra_pred_ang4_3_sse2);
+        p.cu[BLOCK_4x4].intra_pred[27] = PFX(intra_pred_ang4_27_sse2);
+        p.cu[BLOCK_4x4].intra_pred[28] = PFX(intra_pred_ang4_28_sse2);
+        p.cu[BLOCK_4x4].intra_pred[29] = PFX(intra_pred_ang4_29_sse2);
+        p.cu[BLOCK_4x4].intra_pred[30] = PFX(intra_pred_ang4_30_sse2);
+        p.cu[BLOCK_4x4].intra_pred[31] = PFX(intra_pred_ang4_31_sse2);
+        p.cu[BLOCK_4x4].intra_pred[32] = PFX(intra_pred_ang4_32_sse2);
+        p.cu[BLOCK_4x4].intra_pred[33] = PFX(intra_pred_ang4_33_sse2);
 
         p.cu[BLOCK_4x4].sse_ss = PFX(pixel_ssd_ss_4x4_mmx2);
         ALL_LUMA_CU(sse_ss, pixel_ssd_ss, sse2);
@@ -2208,21 +2208,21 @@ void setupAssemblyPrimitives(EncoderPrim
         p.cu[BLOCK_4x4].intra_pred[16] = PFX(intra_pred_ang4_16_sse2);
         p.cu[BLOCK_4x4].intra_pred[17] = PFX(intra_pred_ang4_17_sse2);
         p.cu[BLOCK_4x4].intra_pred[18] = PFX(intra_pred_ang4_18_sse2);
-        p.cu[BLOCK_4x4].intra_pred[19] = PFX(intra_pred_ang4_17_sse2);
-        p.cu[BLOCK_4x4].intra_pred[20] = PFX(intra_pred_ang4_16_sse2);
-        p.cu[BLOCK_4x4].intra_pred[21] = PFX(intra_pred_ang4_15_sse2);
-        p.cu[BLOCK_4x4].intra_pred[22] = PFX(intra_pred_ang4_14_sse2);
-        p.cu[BLOCK_4x4].intra_pred[23] = PFX(intra_pred_ang4_13_sse2);
-        p.cu[BLOCK_4x4].intra_pred[24] = PFX(intra_pred_ang4_12_sse2);
-        p.cu[BLOCK_4x4].intra_pred[25] = PFX(intra_pred_ang4_11_sse2);
+        p.cu[BLOCK_4x4].intra_pred[19] = PFX(intra_pred_ang4_19_sse2);
+        p.cu[BLOCK_4x4].intra_pred[20] = PFX(intra_pred_ang4_20_sse2);
+        p.cu[BLOCK_4x4].intra_pred[21] = PFX(intra_pred_ang4_21_sse2);
+        p.cu[BLOCK_4x4].intra_pred[22] = PFX(intra_pred_ang4_22_sse2);
+        p.cu[BLOCK_4x4].intra_pred[23] = PFX(intra_pred_ang4_23_sse2);
+        p.cu[BLOCK_4x4].intra_pred[24] = PFX(intra_pred_ang4_24_sse2);
+        p.cu[BLOCK_4x4].intra_pred[25] = PFX(intra_pred_ang4_25_sse2);
         p.cu[BLOCK_4x4].intra_pred[26] = PFX(intra_pred_ang4_26_sse2);
-        p.cu[BLOCK_4x4].intra_pred[27] = PFX(intra_pred_ang4_9_sse2);
-        p.cu[BLOCK_4x4].intra_pred[28] = PFX(intra_pred_ang4_8_sse2);
-        p.cu[BLOCK_4x4].intra_pred[29] = PFX(intra_pred_ang4_7_sse2);
-        p.cu[BLOCK_4x4].intra_pred[30] = PFX(intra_pred_ang4_6_sse2);
-        p.cu[BLOCK_4x4].intra_pred[31] = PFX(intra_pred_ang4_5_sse2);
-        p.cu[BLOCK_4x4].intra_pred[32] = PFX(intra_pred_ang4_4_sse2);
-        p.cu[BLOCK_4x4].intra_pred[33] = PFX(intra_pred_ang4_3_sse2);
+        p.cu[BLOCK_4x4].intra_pred[27] = PFX(intra_pred_ang4_27_sse2);
+        p.cu[BLOCK_4x4].intra_pred[28] = PFX(intra_pred_ang4_28_sse2);
+        p.cu[BLOCK_4x4].intra_pred[29] = PFX(intra_pred_ang4_29_sse2);
+        p.cu[BLOCK_4x4].intra_pred[30] = PFX(intra_pred_ang4_30_sse2);
+        p.cu[BLOCK_4x4].intra_pred[31] = PFX(intra_pred_ang4_31_sse2);
+        p.cu[BLOCK_4x4].intra_pred[32] = PFX(intra_pred_ang4_32_sse2);
+        p.cu[BLOCK_4x4].intra_pred[33] = PFX(intra_pred_ang4_33_sse2);
 
         p.cu[BLOCK_4x4].intra_pred_allangs = PFX(all_angs_pred_4x4_sse2);
 
@@ -2451,7 +2451,7 @@ void setupAssemblyPrimitives(EncoderPrim
         ALL_LUMA_CU(psy_cost_ss, psyCost_ss, sse4);
 
         // TODO: it is passed smoke test, but we need testbench, so temporary disable
-        //p.costCoeffNxN = PFX(costCoeffNxN_sse4);
+        p.costCoeffNxN = PFX(costCoeffNxN_sse4);
 #endif
         // TODO: it is passed smoke test, but we need testbench to active it, so temporary disable
         //p.costCoeffRemain = x265_costCoeffRemain_sse4;
diff -r 1c6de5ac3883 -r 83a7d8244424 source/common/x86/intrapred16.asm
--- a/source/common/x86/intrapred16.asm	Thu Jun 18 15:29:11 2015 -0500
+++ b/source/common/x86/intrapred16.asm	Mon Jun 22 15:15:33 2015 +0530
@@ -1030,6 +1030,43 @@ cglobal intra_pred_planar16, 3,3,4
 %undef INTRA_PRED_PLANAR16_AVX2
     RET
 
+%macro TRANSPOSE_4x4 0
+    punpckhwd    m0, m1, m3
+    punpcklwd    m1, m3
+    punpckhwd    m3, m1, m0
+    punpcklwd    m1, m0
+%endmacro
+
+%macro STORE_4x4 0
+    add         r1, r1
+    movh        [r0], m1
+    movhps      [r0 + r1], m1
+    movh        [r0 + r1 * 2], m3
+    lea         r1, [r1 * 3]
+    movhps      [r0 + r1], m3
+%endmacro
+
+%macro CALC_4x4 4
+    mova    m0, [pd_16]
+    pmaddwd m1, [ang_table + %1 * 16]
+    paddd   m1, m0
+    psrld   m1, 5
+
+    pmaddwd m2, [ang_table + %2 * 16]
+    paddd   m2, m0
+    psrld   m2, 5
+    packssdw m1, m2
+
+    pmaddwd m3, [ang_table + %3 * 16]
+    paddd   m3, m0
+    psrld   m3, 5
+
+    pmaddwd m4, [ang_table + %4 * 16]
+    paddd   m4, m0
+    psrld   m4, 5
+    packssdw m3, m4
+%endmacro
+
 ;-----------------------------------------------------------------------------------------
 ; void intraPredAng4(pixel* dst, intptr_t dstStride, pixel* src, int dirMode, int bFilter)
 ;-----------------------------------------------------------------------------------------
@@ -1052,216 +1089,140 @@ cglobal intra_pred_ang4_2, 3,5,4
     movh        [r0 + r1],     m0
     RET
 
-cglobal intra_pred_ang4_3, 3,5,8
-    mov         r4d, 2
-    cmp         r3m, byte 33
-    mov         r3d, 18
-    cmove       r3d, r4d
-
-    movu        m0, [r2 + r3]   ; [8 7 6 5 4 3 2 1]
-
+cglobal intra_pred_ang4_3, 3,3,5
+    movu        m0, [r2 + 18]           ;[8 7 6 5 4 3 2 1]
+    mova        m1, m0
+    psrldq      m0, 2
+    punpcklwd   m1, m0                  ;[5 4 4 3 3 2 2 1]
     mova        m2, m0
     psrldq      m0, 2
-    punpcklwd   m2, m0      ; [5 4 4 3 3 2 2 1]
+    punpcklwd   m2, m0                  ;[6 5 5 4 4 3 3 2]
     mova        m3, m0
     psrldq      m0, 2
-    punpcklwd   m3, m0      ; [6 5 5 4 4 3 3 2]
+    punpcklwd   m3, m0                  ;[7 6 6 5 5 4 4 3]
     mova        m4, m0
     psrldq      m0, 2
-    punpcklwd   m4, m0      ; [7 6 6 5 5 4 4 3]
-    mova        m5, m0
+    punpcklwd   m4, m0                  ;[8 7 7 6 6 5 5 4]
+
+    CALC_4x4 26, 20, 14, 8
+
+    TRANSPOSE_4x4
+
+    STORE_4x4
+    RET
+
+cglobal intra_pred_ang4_33, 3,3,5
+    movu        m0, [r2 + 2]            ;[8 7 6 5 4 3 2 1]
+    mova        m1, m0
     psrldq      m0, 2
-    punpcklwd   m5, m0      ; [8 7 7 6 6 5 5 4]
-
-
-    lea         r3, [ang_table + 20 * 16]
-    mova        m0, [r3 + 6 * 16]   ; [26]
-    mova        m1, [r3]            ; [20]
-    mova        m6, [r3 - 6 * 16]   ; [14]
-    mova        m7, [r3 - 12 * 16]  ; [ 8]
-    jmp        .do_filter4x4
-
-
-ALIGN 16
-.do_filter4x4:
-    lea     r4, [pd_16]
-    pmaddwd m2, m0
-    paddd   m2, [r4]
-    psrld   m2, 5
-
-    pmaddwd m3, m1
-    paddd   m3, [r4]
-    psrld   m3, 5
-    packssdw m2, m3
-
-    pmaddwd m4, m6
-    paddd   m4, [r4]
-    psrld   m4, 5
-
-    pmaddwd m5, m7
-    paddd   m5, [r4]
-    psrld   m5, 5
-    packssdw m4, m5
-
-    jz         .store
-
-    ; transpose 4x4
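
A scalar sketch of what each lane of the CALC_4x4 macro in the hunk above computes (pmaddwd, then paddd with pd_16, then psrld by 5). It assumes the ang_table row for fraction f holds the word pair (32 - f, f); that table is not shown in the truncated diff, so treat this as an illustration. The names `interpolate` and `predictLine4` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// One output sample: weighted average of two neighbouring reference
// samples with 5-bit fractional position f, rounded to nearest.
inline uint16_t interpolate(uint16_t a, uint16_t b, int f)
{
    return static_cast<uint16_t>((a * (32 - f) + b * f + 16) >> 5);
}

// Four outputs sharing one fraction, built from five neighbouring
// reference samples. This mirrors one register of interleaved pairs,
// e.g. [5 4 4 3 3 2 2 1], multiplied by a single ang_table row.
void predictLine4(const uint16_t ref[5], int f, uint16_t out[4])
{
    for (int x = 0; x < 4; x++)
        out[x] = interpolate(ref[x], ref[x + 1], f);
}
```

In the assembly, `CALC_4x4 26, 20, 14, 8` applies four different fractions to four such registers, each shifted by one sample, before the transpose and store.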

