[x265-commits] [x265] asm: avx2 code for sad[64x16] (3620 -> 1279) for 10 bpp

Tue May 19 02:34:04 CEST 2015

details:   http://hg.videolan.org/x265/rev/4db250a49f1e
branches:  
changeset: 10438:4db250a49f1e
user:      Sumalatha Polureddy
date:      Thu May 14 10:31:29 2015 +0530
description:
asm: avx2 code for sad[64x16] (3620 -> 1279) for 10 bpp

sse2
sad[64x16]  2.45x    3620.17         8851.77

avx2
sad[64x16]  7.00x    1279.94         8961.61
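
A note on reading these benchmark rows, assuming the TestBench columns are speedup, optimized cycles, and C reference cycles (the 3620 -> 1279 in the commit title matches the middle column). Under that assumption the first number is simply the ratio of the last two. The helper name `speedup` below is illustrative and is not part of x265:

```c
/* Ratio of reference (C) cycles to optimized (asm) cycles.
 * Assumed to correspond to the "N.NNx" column in the rows above. */
static double speedup(double ref_cycles, double opt_cycles)
{
    return ref_cycles / opt_cycles;
}
```

For the sse2 row, `speedup(8851.77, 3620.17)` is about 2.45, and for the avx2 row, `speedup(8961.61, 1279.94)` is about 7.00. Dividing the two asm timings (3620.17 / 1279.94) gives roughly 2.8x for avx2 over sse2.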
Subject: [x265] asm: filter_vpp, filter_vps for 4x32 in avx2

details:   http://hg.videolan.org/x265/rev/07503e14e7ce
branches:  
changeset: 10439:07503e14e7ce
user:      Divya Manivannan <divya at multicorewareinc.com>
date:      Thu May 14 10:34:17 2015 +0530
description:
asm: filter_vpp, filter_vps for 4x32 in avx2

filter_vpp[4x32]: 1564c->1172c
filter_vps[4x32]: 1283c->1035c
Subject: [x265] asm: avx2 code for sad[64x32] (7156 -> 2625) for 10 bpp

details:   http://hg.videolan.org/x265/rev/6237beac41d2
branches:  
changeset: 10440:6237beac41d2
user:      Sumalatha Polureddy
date:      Thu May 14 10:49:41 2015 +0530
description:
asm: avx2 code for sad[64x32] (7156 -> 2625) for 10 bpp

sse2
sad[64x32]  2.42x    7156.98         17353.95

avx2
sad[64x32]  6.59x    2625.69         17314.50
Subject: [x265] asm: filter_vpp, filter_vps for 24x64 in avx2

details:   http://hg.videolan.org/x265/rev/080a2924ccc0
branches:  
changeset: 10441:080a2924ccc0
user:      Divya Manivannan <divya at multicorewareinc.com>
date:      Thu May 14 10:43:07 2015 +0530
description:
asm: filter_vpp, filter_vps for 24x64 in avx2

filter_vpp[24x64]: 5661c->4150c
filter_vps[24x64]: 6059c->4784c
Subject: [x265] asm: avx2 code for sad[64x48] (10791 -> 4053) for 10 bpp

details:   http://hg.videolan.org/x265/rev/e16ad34e7ee8
branches:  
changeset: 10442:e16ad34e7ee8
user:      Sumalatha Polureddy
date:      Thu May 14 11:07:36 2015 +0530
description:
asm: avx2 code for sad[64x48] (10791 -> 4053) for 10 bpp

sse2
sad[64x48]  2.34x    10791.39        25291.58

avx2
sad[64x48]  6.45x    4053.05         26139.05
Subject: [x265] asm: addAvg avx2 code for high_bit_depth sizes >= 8, more than ~45% improvement over previous code

details:   http://hg.videolan.org/x265/rev/af5bb7d20e55
branches:  
changeset: 10443:af5bb7d20e55
user:      Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date:      Thu May 14 11:14:31 2015 +0530
description:
asm: addAvg avx2 code for high_bit_depth sizes >= 8, more than ~45% improvement over previous code
Subject: [x265] asm: filter_vpp, filter_vps for 48x64 in avx2

details:   http://hg.videolan.org/x265/rev/2a8aad20a016
branches:  
changeset: 10444:2a8aad20a016
user:      Divya Manivannan <divya at multicorewareinc.com>
date:      Thu May 14 10:53:49 2015 +0530
description:
asm: filter_vpp, filter_vps for 48x64 in avx2

filter_vpp[48x64]: 11492c->7586c
filter_vps[48x64]: 11784c->8684c
Subject: [x265] asm: avx2 code for sad[64x64] (13997 -> 5214) for 10 bpp

details:   http://hg.videolan.org/x265/rev/cce04a88e5e8
branches:  
changeset: 10445:cce04a88e5e8
user:      Sumalatha Polureddy
date:      Thu May 14 11:38:48 2015 +0530
description:
asm: avx2 code for sad[64x64] (13997 -> 5214) for 10 bpp

sse2
sad[64x64]  2.31x    13997.47        32364.27

avx2
sad[64x64]  6.68x    5214.84         34847.11
Subject: [x265] asm: addAvg high_bit_depth avx2 asm for chroma sizes width >= 8, reused code from luma

details:   http://hg.videolan.org/x265/rev/8592bf81d084
branches:  
changeset: 10446:8592bf81d084
user:      Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date:      Thu May 14 17:12:14 2015 +0530
description:
asm: addAvg high_bit_depth avx2 asm for chroma sizes width >= 8, reused code from luma
Subject: [x265] api: do not log warnings from x265_api_get() on typical failures

details:   http://hg.videolan.org/x265/rev/a7b2a1cfd10e
branches:  stable
changeset: 10447:a7b2a1cfd10e
user:      Steve Borho <steve at borho.org>
date:      Fri May 15 12:57:19 2015 -0500
description:
api: do not log warnings from x265_api_get() on typical failures

Applications might use x265_api_get() to probe which bit depths are available,
and we do not want to spew warnings during these operations. Also plumb in main12
here even though we don't expect to have one for some time; it is fine if current
libs allow queries for it.
Subject: [x265] lowres: cache the lowres maxNumBlocks and reuse this in other places

details:   http://hg.videolan.org/x265/rev/3b30284c9912
branches:  
changeset: 10448:3b30284c9912
user:      Gopu Govindaswamy <gopu at multicorewareinc.com>
date:      Fri May 15 10:16:02 2015 +0530
description:
lowres: cache the lowres maxNumBlocks and reuse this in other places
Subject: [x265] asm: interp_4tap_vert_ps_2xN sse2

details:   http://hg.videolan.org/x265/rev/7dad7c16c5d8
branches:  
changeset: 10449:7dad7c16c5d8
user:      David T Yuen <dtyx265 at gmail.com>
date:      Sun May 17 18:03:29 2015 -0700
description:
asm: interp_4tap_vert_ps_2xN sse2

Updated vert_pp_2xN macro to also create ps.  This replaces c code for ps with minimal impact on pp.

64-bit

./test/TestBench --testbench interp | grep vp | grep " 2x"
chroma_vpp[  2x4]	1.80x 	 644.93   	 1159.97
chroma_vps[  2x4]	1.42x 	 630.00   	 894.95
chroma_vpp[  2x8]	1.72x 	 1204.99  	 2067.47
chroma_vps[  2x8]	1.49x 	 1152.50  	 1712.50
chroma_vpp[ 2x16]	1.94x 	 2314.96  	 4482.00
chroma_vps[ 2x16]	1.91x 	 2222.45  	 4252.86

32-bit

./test/TestBench --testbench interp | grep vp | grep " 2x"
chroma_vpp[  2x4]	1.74x 	 849.94   	 1479.88
chroma_vps[  2x4]	1.64x 	 762.49   	 1247.46
chroma_vpp[  2x8]	1.89x 	 1482.47  	 2807.46
chroma_vps[  2x8]	1.93x 	 1392.49  	 2682.46
chroma_vpp[ 2x16]	2.26x 	 2769.98  	 6249.80
chroma_vps[ 2x16]	1.91x 	 2632.49  	 5028.81
Subject: [x265] asm: interp_4tap_vert_ps_4x2 sse2

details:   http://hg.videolan.org/x265/rev/d12ce0a926a3
branches:  
changeset: 10450:d12ce0a926a3
user:      David T Yuen <dtyx265 at gmail.com>
date:      Sun May 17 18:13:35 2015 -0700
description:
asm: interp_4tap_vert_ps_4x2 sse2

Converted vert_pp_4x2 primitive to macro that also creates ps.  This replaces c code for ps with minimal impact on pp.

64-bit

./test/TestBench --testbench interp | grep vp | grep " 4x2"
chroma_vpp[  4x2]	2.13x 	 524.99   	 1117.40
chroma_vps[  4x2]	1.87x 	 457.54   	 854.98

32-bit

./test/TestBench --testbench interp | grep vp | grep " 4x2"
chroma_vpp[  4x2]	2.34x 	 592.50   	 1387.29
chroma_vps[  4x2]	2.41x 	 542.48   	 1304.96
Subject: [x265] asm: interp_4tap_vert_ps_4xN sse2

details:   http://hg.videolan.org/x265/rev/3d5f0ce3dcd4
branches:  
changeset: 10451:3d5f0ce3dcd4
user:      David T Yuen <dtyx265 at gmail.com>
date:      Sun May 17 18:24:12 2015 -0700
description:
asm: interp_4tap_vert_ps_4xN sse2

Converted vert_pp_4xN macro to also create ps primitives.  This replaces c code for ps with minimal impact on pp.

64-bit

./test/TestBench --testbench interp | grep vp | grep " 4x"
chroma_vpp[  4x4]	2.10x 	 1005.00  	 2110.31
chroma_vps[  4x4]	1.76x 	 927.49   	 1634.98
chroma_vpp[  4x2]	2.17x 	 515.01   	 1117.32
chroma_vps[  4x2]	1.72x 	 497.49   	 854.98
chroma_vpp[  4x8]	2.28x 	 1928.24  	 4402.00
chroma_vps[  4x8]	1.83x 	 1803.01  	 3294.14
chroma_vpp[ 4x16]	2.30x 	 3782.50  	 8710.95
chroma_vpp[  4x8]	2.28x 	 1927.50  	 4400.15
chroma_vps[  4x8]	1.84x 	 1787.48  	 3294.45
chroma_vpp[  4x4]	2.11x 	 1000.00  	 2109.98
chroma_vps[  4x4]	1.77x 	 924.99   	 1634.97
chroma_vpp[ 4x16]	2.30x 	 3782.50  	 8709.96
chroma_vps[ 4x16]	1.90x 	 3519.99  	 6698.32
chroma_vpp[ 4x32]	2.27x 	 7477.50  	 16995.56
chroma_vpp[  4x4]	2.10x 	 1005.00  	 2109.98
chroma_vps[  4x4]	1.76x 	 927.49   	 1634.98
chroma_vpp[  4x8]	2.28x 	 1927.50  	 4400.14
chroma_vps[  4x8]	1.83x 	 1787.48  	 3270.89
chroma_vpp[ 4x16]	2.30x 	 3782.51  	 8709.96
chroma_vps[ 4x16]	1.90x 	 3517.48  	 6697.70

32-bit

./test/TestBench --testbench interp | grep vp | grep " 4x"
chroma_vpp[  4x4]	2.33x 	 1177.48  	 2747.42
chroma_vps[  4x4]	2.47x 	 1092.47  	 2702.46
chroma_vpp[  4x2]	2.44x 	 579.98   	 1414.90
chroma_vps[  4x2]	2.41x 	 545.91   	 1314.92
chroma_vpp[  4x8]	2.58x 	 2183.79  	 5640.24
chroma_vps[  4x8]	2.08x 	 2020.00  	 4200.24
chroma_vpp[ 4x16]	2.64x 	 4202.49  	 11097.51
chroma_vpp[  4x8]	2.58x 	 2187.48  	 5640.61
chroma_vps[  4x8]	2.06x 	 2019.08  	 4163.00
chroma_vpp[  4x4]	2.37x 	 1159.99  	 2747.38
chroma_vps[  4x4]	2.47x 	 1092.49  	 2702.42
chroma_vpp[ 4x16]	2.64x 	 4207.48  	 11097.51
chroma_vps[ 4x16]	2.07x 	 3887.50  	 8034.82
chroma_vpp[ 4x32]	2.65x 	 8247.49  	 21867.51
chroma_vpp[  4x4]	2.37x 	 1159.99  	 2747.42
chroma_vps[  4x4]	2.48x 	 1088.74  	 2702.42
chroma_vpp[  4x8]	2.58x 	 2187.48  	 5640.30
chroma_vps[  4x8]	2.06x 	 2017.49  	 4162.46
chroma_vpp[ 4x16]	2.64x 	 4202.49  	 11097.51
chroma_vps[ 4x16]	2.07x 	 3889.43  	 8034.94
Subject: [x265] asm: interp_4tap_vert_ps_6xN sse2

details:   http://hg.videolan.org/x265/rev/468cc3b6cc4e
branches:  
changeset: 10452:468cc3b6cc4e
user:      David T Yuen <dtyx265 at gmail.com>
date:      Sun May 17 18:33:34 2015 -0700
description:
asm: interp_4tap_vert_ps_6xN sse2

Converted vert_pp_6xN macro to also create ps primitives.  This replaces c code for ps with minimal impact on pp.

64-bit

./test/TestBench --testbench interp | grep vp | grep " 6x"
chroma_vpp[  6x8]	2.98x 	 2125.00  	 6330.15
chroma_vps[  6x8]	2.45x 	 2032.49  	 4980.56
chroma_vpp[ 6x16]	2.98x 	 4198.94  	 12520.75
chroma_vps[ 6x16]	2.47x 	 4004.99  	 9897.92
Subject: [x265] asm: interp_4tap_vert_ps_8xN sse2

details:   http://hg.videolan.org/x265/rev/1d6e87563c04
branches:  
changeset: 10453:1d6e87563c04
user:      David T Yuen <dtyx265 at gmail.com>
date:      Sun May 17 18:41:31 2015 -0700
description:
asm: interp_4tap_vert_ps_8xN sse2

Converted vert_pp_8xN macro to also create ps primitives.  This replaces c code for ps with minimal impact on pp.

64-bit

./test/TestBench --testbench interp | grep vp | grep " 8x"
chroma_vpp[  8x4]	3.87x 	 1075.00  	 4161.23
chroma_vps[  8x4]	3.25x 	 995.00   	 3232.74
chroma_vpp[  8x6]	3.97x 	 1549.98  	 6160.32
chroma_vps[  8x6]	3.22x 	 1457.48  	 4693.99
chroma_vpp[  8x2]	3.71x 	 560.00   	 2077.48
chroma_vps[  8x2]	3.01x 	 525.00   	 1582.45
chroma_vpp[  8x4]	3.93x 	 1057.50  	 4160.76
chroma_vpp[  8x4]	3.87x 	 1075.00  	 4160.94
chroma_vps[  8x4]	3.21x 	 1007.50  	 3232.90
Subject: [x265] asm: interp_4tap_vert_ps_8xN sse2

details:   http://hg.videolan.org/x265/rev/4d7d23bed21f
branches:  
changeset: 10454:4d7d23bed21f
user:      David T Yuen <dtyx265 at gmail.com>
date:      Sun May 17 18:53:44 2015 -0700
description:
asm: interp_4tap_vert_ps_8xN sse2

Converted vert_pp_8xN macro to also create ps primitives.  This replaces c code for ps with minimal impact on pp.

64-bit

./test/TestBench --testbench interp | grep vp | grep " 8x"
chroma_vpp[  8x8]	4.08x 	 2004.98  	 8188.26
chroma_vps[  8x8]	3.30x 	 1877.49  	 6197.96
chroma_vpp[ 8x16]	4.08x 	 3974.99  	 16231.35
chroma_vps[ 8x16]	3.30x 	 3729.98  	 12308.11
chroma_vpp[ 8x32]	4.07x 	 7885.22  	 32072.63
chroma_vps[ 8x32]	3.36x 	 7284.99  	 24442.68
chroma_vpp[ 8x16]	4.09x 	 3964.98  	 16230.44
chroma_vps[ 8x16]	3.35x 	 3677.49  	 12308.64
chroma_vpp[  8x8]	4.08x 	 2005.00  	 8187.65
chroma_vps[  8x8]	3.30x 	 1877.52  	 6199.94
chroma_vpp[ 8x32]	4.07x 	 7886.48  	 32099.09
chroma_vps[ 8x32]	3.28x 	 7417.49  	 24307.48
chroma_vpp[ 8x12]	4.10x 	 2994.99  	 12269.99
chroma_vps[ 8x12]	3.31x 	 2809.98  	 9307.72
chroma_vpp[ 8x64]	4.05x 	 15735.15 	 63743.21
chroma_vps[ 8x64]	3.30x 	 14640.09 	 48369.12
chroma_vpp[  8x8]	4.08x 	 2005.00  	 8187.79
chroma_vps[  8x8]	3.28x 	 1889.99  	 6198.50
chroma_vpp[ 8x16]	4.04x 	 4013.35  	 16231.04
chroma_vps[ 8x16]	3.33x 	 3692.69  	 12307.46
chroma_vpp[ 8x32]	4.06x 	 7894.98  	 32070.79
chroma_vps[ 8x32]	3.28x 	 7417.48  	 24307.55
Subject: [x265] asm: interp_4tap_vert_ps_12xN sse2

details:   http://hg.videolan.org/x265/rev/fd0904c7bb53
branches:  
changeset: 10455:fd0904c7bb53
user:      David T Yuen <dtyx265 at gmail.com>
date:      Sun May 17 19:00:41 2015 -0700
description:
asm: interp_4tap_vert_ps_12xN sse2

Converted vert_pp_12xN macro to also create ps primitives.  This replaces c code for ps with minimal impact on pp.

64-bit

./test/TestBench --testbench interp | grep vp | grep "12x"
chroma_vpp[12x16]	2.83x 	 8555.04  	 24230.19
chroma_vps[12x16]	2.29x 	 7875.15  	 18027.46
chroma_vpp[12x32]	2.87x 	 17085.12 	 49025.72
chroma_vps[12x32]	2.29x 	 15661.67 	 35787.46
chroma_vpp[12x16]	2.86x 	 8479.97  	 24229.99
chroma_vps[12x16]	2.32x 	 7757.38  	 18027.42
Subject: [x265] asm: interp_4tap_vert_ps_16xN sse2

details:   http://hg.videolan.org/x265/rev/db92414d2771
branches:  
changeset: 10456:db92414d2771
user:      David T Yuen <dtyx265 at gmail.com>
date:      Sun May 17 19:06:51 2015 -0700
description:
asm: interp_4tap_vert_ps_16xN sse2

Converted vert_pp_16xN macro to also create ps primitives.  This replaces c code for ps with minimal impact on pp.

64-bit

./test/TestBench --testbench interp | grep vp | grep "16x"
chroma_vpp[16x16]	3.90x 	 8256.30  	 32230.59
chroma_vps[16x16]	3.17x 	 7599.99  	 24104.14
chroma_vpp[ 16x8]	3.88x 	 4175.83  	 16187.80
chroma_vps[ 16x8]	3.11x 	 3840.00  	 11957.77
chroma_vpp[16x32]	3.93x 	 16435.21 	 64556.77
chroma_vps[16x32]	3.15x 	 15120.00 	 47622.26
chroma_vpp[16x12]	3.92x 	 6195.00  	 24270.27
chroma_vps[16x12]	3.14x 	 5720.05  	 17947.46
chroma_vpp[ 16x4]	3.86x 	 2115.00  	 8163.19
chroma_vps[ 16x4]	3.14x 	 1960.00  	 6160.84
chroma_vpp[16x32]	3.94x 	 16394.99 	 64530.13
chroma_vps[16x32]	3.13x 	 15120.04 	 47347.74
chroma_vpp[16x16]	3.91x 	 8235.00  	 32230.49
chroma_vps[16x16]	2.98x 	 7984.99  	 23827.91
chroma_vpp[16x64]	3.87x 	 33080.13 	 128135.33
chroma_vps[16x64]	3.09x 	 30613.33 	 94704.50
chroma_vpp[16x24]	3.94x 	 12315.02 	 48524.37
chroma_vps[16x24]	3.01x 	 11925.05 	 35897.88
chroma_vpp[ 16x8]	3.90x 	 4155.05  	 16187.66
chroma_vps[ 16x8]	2.96x 	 4044.99  	 11957.97
chroma_vpp[16x16]	3.80x 	 8475.00  	 32230.59
chroma_vps[16x16]	2.98x 	 7984.99  	 23827.48
chroma_vpp[ 16x8]	3.85x 	 4203.07  	 16187.50
chroma_vps[ 16x8]	3.11x 	 3840.00  	 11957.77
chroma_vpp[16x32]	3.81x 	 16922.52 	 64452.15
chroma_vps[16x32]	2.99x 	 15834.39 	 47348.57
chroma_vpp[16x12]	3.92x 	 6195.00  	 24269.99
chroma_vps[16x12]	3.06x 	 5858.04  	 17947.48
chroma_vpp[ 16x4]	3.86x 	 2115.00  	 8163.27
chroma_vps[ 16x4]	3.14x 	 1960.00  	 6152.63
chroma_vpp[16x64]	3.80x 	 33771.68 	 128367.91
chroma_vps[16x64]	2.99x 	 31614.99 	 94669.76
Subject: [x265] asm: interp_4tap_vert_ps_24xN sse2

details:   http://hg.videolan.org/x265/rev/b3b4924d0263
branches:  
changeset: 10457:b3b4924d0263
user:      David T Yuen <dtyx265 at gmail.com>
date:      Sun May 17 19:13:36 2015 -0700
description:
asm: interp_4tap_vert_ps_24xN sse2

Converted vert_pp_24xN macro to also create ps primitives.  This replaces c code for ps with minimal impact on pp.

64-bit

./test/TestBench --testbench interp | grep vp | grep "24x"
chroma_vpp[24x32]	7.42x 	 24657.66 	 182977.44
chroma_vps[24x32]	6.76x 	 22923.31 	 154889.81
chroma_vpp[24x64]	7.43x 	 49350.79 	 366451.75
chroma_vps[24x64]	6.87x 	 44989.23 	 309009.28
chroma_vpp[24x32]	7.42x 	 24602.54 	 182471.03
chroma_vps[24x32]	6.93x 	 22564.47 	 156441.62
Subject: [x265] asm: interp_4tap_vert_ps_32xN sse2

details:   http://hg.videolan.org/x265/rev/933512ac8ba3
branches:  
changeset: 10458:933512ac8ba3
user:      David T Yuen <dtyx265 at gmail.com>
date:      Sun May 17 19:22:41 2015 -0700
description:
asm: interp_4tap_vert_ps_32xN sse2

Converted vert_pp_32xN macro to also create ps primitives.  This replaces c code for ps with minimal impact on pp.

64-bit

./test/TestBench --testbench interp | grep vp | grep "32x"
chroma_vpp[32x32]	8.02x 	 33660.57 	 269893.41
chroma_vps[32x32]	7.34x 	 30918.89 	 227002.17
chroma_vpp[32x16]	8.09x 	 16937.68 	 136942.98
chroma_vps[32x16]	7.35x 	 15547.56 	 114342.84
chroma_vpp[32x24]	8.12x 	 25324.71 	 205517.50
chroma_vps[32x24]	7.36x 	 23167.54 	 170409.39
chroma_vpp[ 32x8]	8.05x 	 8412.51  	 67683.04
chroma_vps[ 32x8]	7.53x 	 7555.30  	 56923.70
chroma_vpp[32x64]	8.06x 	 66996.57 	 539788.81
chroma_vps[32x64]	7.39x 	 61492.46 	 454333.72
chroma_vpp[32x32]	8.06x 	 33655.25 	 271176.75
chroma_vps[32x32]	7.36x 	 30832.21 	 226821.72
chroma_vpp[32x48]	8.02x 	 50441.13 	 404571.69
chroma_vps[32x48]	7.37x 	 46230.04 	 340583.22
chroma_vpp[32x16]	8.09x 	 16937.61 	 137064.12
chroma_vps[32x16]	7.32x 	 15547.49 	 113817.55
chroma_vpp[32x32]	8.04x 	 33663.30 	 270794.66
chroma_vps[32x32]	7.37x 	 30873.11 	 227544.72
chroma_vpp[32x16]	8.07x 	 16937.51 	 136649.12
chroma_vps[32x16]	7.33x 	 15547.57 	 113930.27
chroma_vpp[32x64]	8.08x 	 67008.20 	 541583.00
chroma_vps[32x64]	7.40x 	 61431.15 	 454445.69
chroma_vpp[32x24]	8.03x 	 25322.75 	 203277.30
chroma_vps[32x24]	7.36x 	 23167.54 	 170494.78
chroma_vpp[ 32x8]	8.19x 	 8412.74  	 68903.22
chroma_vps[ 32x8]	7.59x 	 7515.08  	 57067.58
Subject: [x265] asm: interp_4tap_vert_ps_64xN and interp_4tap_vert_ps_48x64 sse2

details:   http://hg.videolan.org/x265/rev/ef127c0b4a02
branches:  
changeset: 10459:ef127c0b4a02
user:      David T Yuen <dtyx265 at gmail.com>
date:      Sun May 17 19:29:32 2015 -0700
description:
asm: interp_4tap_vert_ps_64xN and interp_4tap_vert_ps_48x64 sse2

Converted vert_pp_64xN macro to also create ps primitives.  This replaces c code for ps with minimal impact on pp.

64-bit

./test/TestBench --testbench interp | grep vp | grep "48x"
chroma_vpp[48x64]	8.03x 	 100353.77 	 805653.62
chroma_vps[48x64]	7.51x 	 93710.29 	 704220.94

./test/TestBench --testbench interp | grep vp | grep "64x"
chroma_vpp[64x64]	8.12x 	 133651.45 	 1085310.25
chroma_vps[64x64]	7.41x 	 124089.83 	 919231.12
chroma_vpp[64x32]	8.11x 	 66970.80 	 543079.94
chroma_vps[64x32]	7.38x 	 63249.14 	 466864.12
chroma_vpp[64x48]	8.12x 	 100298.12 	 814796.12
chroma_vps[64x48]	7.18x 	 96270.98 	 690954.25
chroma_vpp[64x16]	8.15x 	 33627.38 	 274119.84
chroma_vps[64x16]	7.28x 	 31222.30 	 227424.44
Subject: [x265] Call macros to reduce code size of primitive setup

details:   http://hg.videolan.org/x265/rev/98553d52d844
branches:  
changeset: 10460:98553d52d844
user:      David T Yuen <dtyx265 at gmail.com>
date:      Sun May 17 19:41:24 2015 -0700
description:
Call macros to reduce code size of primitive setup
Subject: [x265] asm: avx2 code for sad_x3[16xN] for 10 bpp

details:   http://hg.videolan.org/x265/rev/b2c7b95ed9a9
branches:  
changeset: 10461:b2c7b95ed9a9
user:      Sumalatha Polureddy
date:      Mon May 18 12:03:30 2015 +0530
description:
asm: avx2 code for sad_x3[16xN] for 10 bpp

sse2:
sad_x3[ 16x4]  2.93x    680.82          1996.91
sad_x3[ 16x8]  3.03x    1266.26         3834.18
sad_x3[16x12]  3.07x    1834.17         5631.97
sad_x3[16x16]  3.06x    2413.24         7380.88
sad_x3[16x32]  2.82x    5554.36         15654.50
sad_x3[16x64]  2.80x    10161.18        28493.52

avx2:
sad_x3[ 16x4]  4.82x    404.45          1948.78
sad_x3[ 16x8]  5.85x    634.65          3714.40
sad_x3[16x12]  6.17x    885.30          5465.97
sad_x3[16x16]  6.28x    1170.04         7350.87
sad_x3[16x32]  5.34x    2909.76         15547.79
sad_x3[16x64]  6.12x    5071.22         31043.80
Subject: [x265] asm: avx2 code for sad_x3[32xN] for 10 bpp

details:   http://hg.videolan.org/x265/rev/dac9417715e5
branches:  
changeset: 10462:dac9417715e5
user:      Sumalatha Polureddy
date:      Mon May 18 12:13:37 2015 +0530
description:
asm: avx2 code for sad_x3[32xN] for 10 bpp

sse2
sad_x3[ 32x8]  2.87x    2260.13         6491.12
sad_x3[32x16]  2.95x    4262.20         12583.53
sad_x3[32x24]  2.66x    7356.49         19539.50
sad_x3[32x32]  2.81x    9852.31         27732.33
sad_x3[32x64]  2.80x    17682.81        49470.11

avx2
sad_x3[ 32x8]  5.83x    1100.14         6409.04
sad_x3[32x16]  6.33x    2072.92         13121.29
sad_x3[32x24]  5.16x    3801.07         19625.10
sad_x3[32x32]  5.43x    4781.53         25952.90
sad_x3[32x64]  5.74x    8709.35         49988.55
Subject: [x265] asm: avx2 code for sad_x3[64xN] for 10 bpp

details:   http://hg.videolan.org/x265/rev/618f2ecb7b21
branches:  
changeset: 10463:618f2ecb7b21
user:      Sumalatha Polureddy
date:      Mon May 18 12:32:04 2015 +0530
description:
asm: avx2 code for sad_x3[64xN] for 10 bpp

sse2
sad_x3[64x16]  2.78x    8370.09         23242.03
sad_x3[64x32]  2.67x    17362.56        46289.12
sad_x3[64x48]  2.72x    25053.33        68260.15
sad_x3[64x64]  2.47x    35227.60        87136.18

avx2
sad_x3[64x16]  6.45x    3664.96         23624.50
sad_x3[64x32]  5.74x    8741.48         50144.03
sad_x3[64x48]  6.07x    11401.75        69182.98
sad_x3[64x64]  6.38x    16092.67        102696.92
Subject: [x265] asm: modify API on findPosFirstLast to support all zeros block

details:   http://hg.videolan.org/x265/rev/469f98bcf6a2
branches:  
changeset: 10464:469f98bcf6a2
user:      Min Chen <chenm003 at 163.com>
date:      Mon May 18 09:58:08 2015 -0500
description:
asm: modify API on findPosFirstLast to support all zeros block
Subject: [x265] improve Quant::signBitHidingHDQ by scanPosLast and findPosFirstLast

details:   http://hg.videolan.org/x265/rev/02f1521c89fd
branches:  
changeset: 10465:02f1521c89fd
user:      Min Chen <chenm003 at 163.com>
date:      Mon May 18 09:58:15 2015 -0500
description:
improve Quant::signBitHidingHDQ by scanPosLast and findPosFirstLast
Subject: [x265] faster algorithm to find firstNZPosInCG & lastNZPosInCG in Quant::signBitHidingHDQ()

details:   http://hg.videolan.org/x265/rev/9fb61f65eb91
branches:  
changeset: 10466:9fb61f65eb91
user:      Min Chen <chenm003 at 163.com>
date:      Fri May 15 17:30:19 2015 -0700
description:
faster algorithm to find firstNZPosInCG & lastNZPosInCG in Quant::signBitHidingHDQ()
Subject: [x265] modify logic to remove lastCG in Quant::signBitHidingHDQ()

details:   http://hg.videolan.org/x265/rev/9f8853df8d7d
branches:  
changeset: 10467:9f8853df8d7d
user:      Min Chen <chenm003 at 163.com>
date:      Fri May 15 17:30:22 2015 -0700
description:
modify logic to remove lastCG in Quant::signBitHidingHDQ()
Subject: [x265] reuse coeffFlag to reduce memory operations on coeff[]

details:   http://hg.videolan.org/x265/rev/3ebe4c09ca82
branches:  
changeset: 10468:3ebe4c09ca82
user:      Min Chen <chenm003 at 163.com>
date:      Fri May 15 17:30:24 2015 -0700
description:
reuse coeffFlag to reduce memory operations on coeff[]
Subject: [x265] improve by replacing the conditional operator with mask-based logic

details:   http://hg.videolan.org/x265/rev/6548dd65da87
branches:  
changeset: 10469:6548dd65da87
user:      Min Chen <chenm003 at 163.com>
date:      Fri May 15 19:29:09 2015 -0700
description:
improve by replacing the conditional operator with mask-based logic
Subject: [x265] api: fix x265.h documentation for x265_max_bit_depth

details:   http://hg.videolan.org/x265/rev/8425278def1e
branches:  stable
changeset: 10470:8425278def1e
user:      Steve Borho <steve at borho.org>
date:      Mon May 18 13:39:44 2015 -0500
description:
api: fix x265.h documentation for x265_max_bit_depth
Subject: [x265] Added tag 1.7 for changeset 8425278def1e

details:   http://hg.videolan.org/x265/rev/ddb5868a4bcd
branches:  stable
changeset: 10471:ddb5868a4bcd
user:      Steve Borho <steve at borho.org>
date:      Mon May 18 18:18:40 2015 -0500
description:
Added tag 1.7 for changeset 8425278def1e
Subject: [x265] api: introduce a less version strict API query

details:   http://hg.videolan.org/x265/rev/b24870cc916f
branches:  
changeset: 10472:b24870cc916f
user:      Steve Borho <steve at borho.org>
date:      Thu May 14 12:52:34 2015 -0500
description:
api: introduce a less version strict API query

The intention is to allow applications to use libx265 libraries with different
X265_BUILD numbers than the x265.h header they were compiled with by keeping
a little more information about the nature of each API bump.

This should have no effect on existing applications, X265_BUILD will still be
incremented each time the public API is changed, but applications which use
this new x265_api_query() method will be able to dlopen() and use any
version of libx265 that returns an API pointer that passes validation checks:

  1. check api->api_major_version == X265_MAJOR_VERSION
  2. check api->sizeof_param == sizeof(x265_param) if param is dereferenced
  3. check api->sizeof_picture == sizeof(x265_picture)
  4. check api->sizeof_analysis_data ..
    etc.

apps that use param_alloc()/param_free()/param_parse() can skip step 2 and
thus ignore the primary cause of most X265_BUILD bumps.

The only additional work for x265 developers is to increment X265_MAJOR_VERSION
when warranted (which should hopefully be very rarely).

Since this commit is modifying x265_api, we take the opportunity to rename
max_bit_depth to the more accurate bit_depth, since each API will only be
capable of encoding at a single bit depth.
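
The validation steps listed above could be wrapped in a single helper along the following lines. This is only a sketch: the `demo_*` types and `DEMO_MAJOR_VERSION` are hypothetical stand-ins so that the snippet compiles without x265.h; a real application would use `x265_api`, `x265_param`, `x265_picture` and `X265_MAJOR_VERSION`, and would also check the other sizeof_* fields it relies on.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the definitions in x265.h. */
#define DEMO_MAJOR_VERSION 1
typedef struct { int fields[4]; } demo_param;
typedef struct { int fields[8]; } demo_picture;

/* Only the fields named in the commit message are modeled here. */
typedef struct {
    int api_major_version;
    int sizeof_param;
    int sizeof_picture;
} demo_api;

/* Returns 1 if the API table looks safe to use, 0 otherwise.
 * uses_param_fields: nonzero when the application dereferences the
 * param structure itself instead of going through param_alloc(),
 * param_parse() and param_free(). */
static int demo_api_is_usable(const demo_api *api, int uses_param_fields)
{
    if (api == NULL)
        return 0;
    if (api->api_major_version != DEMO_MAJOR_VERSION)
        return 0;                       /* step 1 */
    if (uses_param_fields &&
        api->sizeof_param != (int)sizeof(demo_param))
        return 0;                       /* step 2, skippable if opaque */
    if (api->sizeof_picture != (int)sizeof(demo_picture))
        return 0;                       /* step 3 */
    return 1;
}
```

An application that treats the param structure as opaque would pass 0 for `uses_param_fields` and so tolerate a library whose param structure has grown.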
Subject: [x265] Merge with stable

details:   http://hg.videolan.org/x265/rev/d7b100e51e82
branches:  
changeset: 10473:d7b100e51e82
user:      Steve Borho <steve at borho.org>
date:      Mon May 18 18:24:08 2015 -0500
description:
Merge with stable

diffstat:

 .hgtags                              |     1 +
 doc/reST/api.rst                     |    54 +
 source/CMakeLists.txt                |     2 +-
 source/common/dct.cpp                |     5 +-
 source/common/lowres.cpp             |    12 +-
 source/common/lowres.h               |     2 +
 source/common/quant.cpp              |    81 +-
 source/common/quant.h                |     2 +-
 source/common/x86/asm-primitives.cpp |   147 ++-
 source/common/x86/const-a.asm        |     2 +-
 source/common/x86/ipfilter8.asm      |  1101 +++++++++++++++++++++++++++------
 source/common/x86/ipfilter8.h        |    40 +
 source/common/x86/mc-a.asm           |   525 ++++++++++++++++
 source/common/x86/pixel-util8.asm    |     5 +-
 source/common/x86/sad16-a.asm        |   242 +++++++-
 source/encoder/api.cpp               |    96 ++-
 source/encoder/slicetype.cpp         |     4 +-
 source/test/pixelharness.cpp         |    32 +-
 source/x265.cpp                      |     4 +-
 source/x265.def.in                   |     1 +
 source/x265.h                        |    72 +-
 21 files changed, 2082 insertions(+), 348 deletions(-)

diffs (truncated from 3530 to 300 lines):

diff -r 479087422e29 -r d7b100e51e82 .hgtags

--- a/.hgtags	Wed May 13 16:52:59 2015 -0700
+++ b/.hgtags	Mon May 18 18:24:08 2015 -0500
@@ -15,3 +15,4 @@ c1e4fc0162c14fdb84f5c3bd404fb28cfe10a17f
 5e604833c5aa605d0b6efbe5234492b5e7d8ac61 1.4
 9f0324125f53a12f766f6ed6f98f16e2f42337f4 1.5
 cbeb7d8a4880e4020c4545dd8e498432c3c6cad3 1.6
+8425278def1edf0931dc33fc518e1950063e76b0 1.7
diff -r 479087422e29 -r d7b100e51e82 doc/reST/api.rst
--- a/doc/reST/api.rst	Wed May 13 16:52:59 2015 -0700
+++ b/doc/reST/api.rst	Mon May 18 18:24:08 2015 -0500
@@ -419,3 +419,57 @@ under the name libx265 (so all applicati
 and then also install libx265_main10.so (symlinked to its numbered solib).
 Thus applications which use x265_api_get() will be able to generate main
 or main10 bitstreams.
+
+There is a second bit-depth introspection method that is designed for
+applications which need more flexibility in API versioning.  If you use
+the public API described at the top of this page or x265_api_get() then
+your application must be recompiled each time x265 changes its public
+API and bumps its build number (X265_BUILD, which is also the SONAME on
+POSIX systems).  But if you use **x265_api_query** and dynamically link to
+libx265 (use dlopen() on POSIX or LoadLibrary() on Windows) your
+application is no longer directly tied to the API version of x265.h that
+it was compiled against.
+
+	/* x265_api_query:
+	 *   Retrieve the programming interface for a linked x265 library, like
+	 *   x265_api_get(), except this function accepts X265_BUILD as the second
+	 *   argument rather than using the build number as part of the function name.
+	 *   Applications which dynamically link to libx265 can use this interface to
+	 *   query the library API and achieve a relative amount of version skew
+	 *   flexibility. The function may return NULL if the library determines that
+	 *   the apiVersion that your application was compiled against is not compatible
+	 *   with the library you have linked with.
+	 *
+	 *   api_major_version will be incremented any time non-backward compatible
+	 *   changes are made to any public structures or functions. If
+	 *   api_major_version does not match X265_MAJOR_VERSION from the x265.h your
+	 *   application compiled against, your application must not use the returned
+	 *   x265_api pointer.
+	 *
+	 *   Users of this API *must* also validate the sizes of any structures which
+	 *   are not treated as opaque in application code. For instance, if your
+	 *   application dereferences a x265_param pointer, then it must check that
+	 *   api->sizeof_param matches the sizeof(x265_param) that your application
+	 *   compiled with. */
+	const x265_api* x265_api_query(int bitDepth, int apiVersion, int* err);
+
+A number of validations must be performed on the returned API structure
+in order to determine if it is safe for use by your application. If you
+do not perform these checks, your application is liable to crash.
+
+	if (api->api_major_version != X265_MAJOR_VERSION) /* do not use */
+	if (api->sizeof_param != sizeof(x265_param))      /* do not use */
+	if (api->sizeof_picture != sizeof(x265_picture))  /* do not use */
+	if (api->sizeof_stats != sizeof(x265_stats))      /* do not use */
+	if (api->sizeof_zone != sizeof(x265_zone))        /* do not use */
+	etc.
+
+Note that if your application does not directly allocate or dereference
+one of these structures, if it treats the structure as opaque or does
+not use it at all, then it can skip the size check for that structure.
+
+In particular, if your application uses api->param_alloc(),
+api->param_free(), api->param_parse(), etc and never directly accesses
+any x265_param fields, then it can skip the check on the
+sizeof(x265_parm) and thereby ignore changes to that structure (which
+account for a large percentage of X265_BUILD bumps).
diff -r 479087422e29 -r d7b100e51e82 source/CMakeLists.txt
--- a/source/CMakeLists.txt	Wed May 13 16:52:59 2015 -0700
+++ b/source/CMakeLists.txt	Mon May 18 18:24:08 2015 -0500
@@ -30,7 +30,7 @@ option(STATIC_LINK_CRT "Statically link 
 mark_as_advanced(FPROFILE_USE FPROFILE_GENERATE NATIVE_BUILD)
 
 # X265_BUILD must be incremented each time the public API is changed
-set(X265_BUILD 59)
+set(X265_BUILD 60)
 configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
                "${PROJECT_BINARY_DIR}/x265.def")
 configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
diff -r 479087422e29 -r d7b100e51e82 source/common/dct.cpp
--- a/source/common/dct.cpp	Wed May 13 16:52:59 2015 -0700
+++ b/source/common/dct.cpp	Mon May 18 18:24:08 2015 -0500
@@ -798,11 +798,11 @@ uint32_t findPosFirstLast_c(const int16_
             break;
     }
 
-    X265_CHECK(n >= 0, "non-zero coeff scan failuare!\n");
+    X265_CHECK(n >= -1, "non-zero coeff scan failuare!\n");
 
     uint32_t lastNZPosInCG = (uint32_t)n;
 
-    for (n = 0;; n++)
+    for (n = 0; n < SCAN_SET_SIZE; n++)
     {
         const uint32_t idx = scanTbl[n];
         const uint32_t idxY = idx / MLS_CG_SIZE;
@@ -813,6 +813,7 @@ uint32_t findPosFirstLast_c(const int16_
 
     uint32_t firstNZPosInCG = (uint32_t)n;
 
+    // NOTE: when coeff block all ZERO, the lastNZPosInCG is undefined and firstNZPosInCG is 16
     return ((lastNZPosInCG << 16) | firstNZPosInCG);
 }
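
For readers following the hunk above, here is a simplified sketch of the all-zero behaviour it introduces. It is not the x265 function: it scans a flat array of 16 coefficients that is assumed to be already in scan order (no scan table or stride), and the `demo_` names are hypothetical.

```c
#include <stdint.h>

#define DEMO_SCAN_SET_SIZE 16   /* coefficients in one coefficient group */

/* Packs (last << 16) | first, where first/last are the positions of the
 * first and last non-zero coefficients. For an all-zero block the low
 * 16 bits are 16 and the high bits carry no meaningful position. */
static uint32_t demo_find_pos_first_last(const int16_t coeff[DEMO_SCAN_SET_SIZE])
{
    int n;

    /* Backward scan: n ends at -1 when every coefficient is zero. */
    for (n = DEMO_SCAN_SET_SIZE - 1; n >= 0; n--)
        if (coeff[n])
            break;
    uint32_t last = (uint32_t)n;

    /* Forward scan, bounded so an all-zero block stops at 16. */
    for (n = 0; n < DEMO_SCAN_SET_SIZE; n++)
        if (coeff[n])
            break;
    uint32_t first = (uint32_t)n;

    return (last << 16) | first;
}
```

A caller can detect the all-zero case by testing whether the low 16 bits of the result equal 16 before trusting the upper half.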
 
diff -r 479087422e29 -r d7b100e51e82 source/common/lowres.cpp
--- a/source/common/lowres.cpp	Wed May 13 16:52:59 2015 -0700
+++ b/source/common/lowres.cpp	Mon May 18 18:24:08 2015 -0500
@@ -36,13 +36,13 @@ bool Lowres::create(PicYuv *origPic, int
     lumaStride = width + 2 * origPic->m_lumaMarginX;
     if (lumaStride & 31)
         lumaStride += 32 - (lumaStride & 31);
-    int cuWidth = (width + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
-    int cuHeight = (lines + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
-    int cuCount = cuWidth * cuHeight;
+    maxBlocksInRow = (width + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
+    maxBlocksInCol = (lines + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
+    int cuCount = maxBlocksInRow * maxBlocksInCol;
 
     /* rounding the width to multiple of lowres CU size */
-    width = cuWidth * X265_LOWRES_CU_SIZE;
-    lines = cuHeight * X265_LOWRES_CU_SIZE;
+    width = maxBlocksInRow * X265_LOWRES_CU_SIZE;
+    lines = maxBlocksInCol * X265_LOWRES_CU_SIZE;
 
     size_t planesize = lumaStride * (lines + 2 * origPic->m_lumaMarginY);
     size_t padoffset = lumaStride * origPic->m_lumaMarginY + origPic->m_lumaMarginX;
@@ -74,7 +74,7 @@ bool Lowres::create(PicYuv *origPic, int
     {
         for (int j = 0; j < bframes + 2; j++)
         {
-            CHECKED_MALLOC(rowSatds[i][j], int32_t, cuHeight);
+            CHECKED_MALLOC(rowSatds[i][j], int32_t, maxBlocksInCol);
             CHECKED_MALLOC(lowresCosts[i][j], uint16_t, cuCount);
         }
     }
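The block counts stored in `maxBlocksInRow` and `maxBlocksInCol` come from a round-up division by the lowres CU size. A small sketch of that arithmetic is below. The constants `kCuBits` and `kCuSize` are assumed values standing in for `X265_LOWRES_CU_BITS` and `X265_LOWRES_CU_SIZE`, and `blocksToCover` is a hypothetical helper, not part of x265.

```cpp
#include <cstdint>

constexpr int kCuBits = 3;             // assumed value for X265_LOWRES_CU_BITS
constexpr int kCuSize = 1 << kCuBits;  // assumed value for X265_LOWRES_CU_SIZE

// Number of CU-sized blocks needed to cover `pixels`, rounding up.
inline uint32_t blocksToCover(int pixels)
{
    return (uint32_t)((pixels + kCuSize - 1) >> kCuBits);
}

// Dimension padded up to a whole number of CU-sized blocks.
inline int paddedSize(int pixels)
{
    return (int)blocksToCover(pixels) * kCuSize;
}
```

Under these assumed constants, a width of 961 needs 121 blocks and is padded to 968, while a width of 960 is already a multiple of the block size and stays unchanged.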
diff -r 479087422e29 -r d7b100e51e82 source/common/lowres.h
--- a/source/common/lowres.h	Wed May 13 16:52:59 2015 -0700
+++ b/source/common/lowres.h	Mon May 18 18:24:08 2015 -0500
@@ -130,6 +130,8 @@ struct Lowres : public ReferencePlanes
     uint16_t(*lowresCosts[X265_BFRAME_MAX + 2][X265_BFRAME_MAX + 2]);
     int32_t*  lowresMvCosts[2][X265_BFRAME_MAX + 1];
     MV*       lowresMvs[2][X265_BFRAME_MAX + 1];
+    uint32_t  maxBlocksInRow;
+    uint32_t  maxBlocksInCol;
 
     /* used for vbvLookahead */
     int       plannedType[X265_LOOKAHEAD_MAX + 1];
diff -r 479087422e29 -r d7b100e51e82 source/common/quant.cpp
--- a/source/common/quant.cpp	Wed May 13 16:52:59 2015 -0700
+++ b/source/common/quant.cpp	Mon May 18 18:24:08 2015 -0500
@@ -251,30 +251,63 @@ void Quant::setChromaQP(int qpin, TextTy
 }
 
 /* To minimize the distortion only. No rate is considered */
-uint32_t Quant::signBitHidingHDQ(int16_t* coeff, int32_t* deltaU, uint32_t numSig, const TUEntropyCodingParameters &codeParams)
+uint32_t Quant::signBitHidingHDQ(int16_t* coeff, int32_t* deltaU, uint32_t numSig, const TUEntropyCodingParameters &codeParams, uint32_t log2TrSize)
 {
-    const uint32_t log2TrSizeCG = codeParams.log2TrSizeCG;
+    uint32_t trSize = 1 << log2TrSize;
     const uint16_t* scan = codeParams.scan;
-    bool lastCG = true;
 
-    for (int cg = (1 << (log2TrSizeCG * 2)) - 1; cg >= 0; cg--)
+    uint8_t coeffNum[MLS_GRP_NUM];      // value range[0, 16]
+    uint16_t coeffSign[MLS_GRP_NUM];    // bit mask map for non-zero coeff sign
+    uint16_t coeffFlag[MLS_GRP_NUM];    // bit mask map for non-zero coeff
+
+#if CHECKED_BUILD || _DEBUG
+    // clear the output buffers; the asm version of scanPosLast never writes anything after the last non-zero coeff group
+    memset(coeffNum, 0, sizeof(coeffNum));
+    memset(coeffSign, 0, sizeof(coeffSign));
+    memset(coeffFlag, 0, sizeof(coeffFlag));
+#endif
+    const int lastScanPos = primitives.scanPosLast(codeParams.scan, coeff, coeffSign, coeffFlag, coeffNum, numSig, g_scan4x4[codeParams.scanType], trSize);
+    const int cgLastScanPos = (lastScanPos >> LOG2_SCAN_SET_SIZE);
+    unsigned long tmp;
+
+    // the first CG needs special processing
+    const uint32_t correctOffset = 0x0F & (lastScanPos ^ 0xF);
+    coeffFlag[cgLastScanPos] <<= correctOffset;
+
+    for (int cg = cgLastScanPos; cg >= 0; cg--)
     {
         int cgStartPos = cg << LOG2_SCAN_SET_SIZE;
         int n;
 
+#if CHECKED_BUILD || _DEBUG
         for (n = SCAN_SET_SIZE - 1; n >= 0; --n)
             if (coeff[scan[n + cgStartPos]])
                 break;
-        if (n < 0)
+        int lastNZPosInCG0 = n;
+#endif
+
+        if (coeffNum[cg] == 0)
+        {
+            X265_CHECK(lastNZPosInCG0 < 0, "all zero block check failure\n");
             continue;
+        }
 
-        int lastNZPosInCG = n;
-
+#if CHECKED_BUILD || _DEBUG
         for (n = 0;; n++)
             if (coeff[scan[n + cgStartPos]])
                 break;
 
-        int firstNZPosInCG = n;
+        int firstNZPosInCG0 = n;
+#endif
+
+        CLZ(tmp, coeffFlag[cg]);
+        const int firstNZPosInCG = (15 ^ tmp);
+
+        CTZ(tmp, coeffFlag[cg]);
+        const int lastNZPosInCG = (15 ^ tmp);
+
+        X265_CHECK(firstNZPosInCG0 == firstNZPosInCG, "firstNZPosInCG0 check failure\n");
+        X265_CHECK(lastNZPosInCG0 == lastNZPosInCG, "lastNZPosInCG0 check failure\n");
 
         if (lastNZPosInCG - firstNZPosInCG >= SBH_THRESHOLD)
         {
@@ -287,12 +320,17 @@ uint32_t Quant::signBitHidingHDQ(int16_t
             if (signbit != (absSum & 0x1)) // compare signbit with sum_parity
             {
                 int minCostInc = MAX_INT,  minPos = -1, curCost = MAX_INT;
-                int16_t finalChange = 0, curChange = 0;
+                int32_t finalChange = 0, curChange = 0;
+                uint32_t cgFlags = coeffFlag[cg];
+                if (cg == cgLastScanPos)
+                    cgFlags >>= correctOffset;
 
-                for (n = (lastCG ? lastNZPosInCG : SCAN_SET_SIZE - 1); n >= 0; --n)
+                for (n = (cg == cgLastScanPos ? lastNZPosInCG : SCAN_SET_SIZE - 1); n >= 0; --n)
                 {
                     uint32_t blkPos = scan[n + cgStartPos];
-                    if (coeff[blkPos])
+                    X265_CHECK(!!coeff[blkPos] == !!(cgFlags & 1), "non zero coeff check failure\n");
+
+                    if (cgFlags & 1)
                     {
                         if (deltaU[blkPos] > 0)
                         {
@@ -301,8 +339,11 @@ uint32_t Quant::signBitHidingHDQ(int16_t
                         }
                         else
                         {
-                            if (n == firstNZPosInCG && abs(coeff[blkPos]) == 1)
+                            if ((cgFlags == 1) && (abs(coeff[blkPos]) == 1))
+                            {
+                                X265_CHECK(n == firstNZPosInCG, "firstNZPosInCG position check failure\n");
                                 curCost = MAX_INT;
+                            }
                             else
                             {
                                 curCost = deltaU[blkPos];
@@ -312,8 +353,9 @@ uint32_t Quant::signBitHidingHDQ(int16_t
                     }
                     else
                     {
-                        if (n < firstNZPosInCG)
+                        if (cgFlags == 0)
                         {
+                            X265_CHECK(n < firstNZPosInCG, "firstNZPosInCG position check failure\n");
                             uint32_t thisSignBit = m_resiDctCoeff[blkPos] >= 0 ? 0 : 1;
                             if (thisSignBit != signbit)
                                 curCost = MAX_INT;
@@ -336,6 +378,7 @@ uint32_t Quant::signBitHidingHDQ(int16_t
                         finalChange = curChange;
                         minPos = blkPos;
                     }
+                    cgFlags >>= 1;
                 }
 
                 /* do not allow change to violate coeff clamp */
@@ -347,14 +390,12 @@ uint32_t Quant::signBitHidingHDQ(int16_t
                 else if (finalChange == -1 && abs(coeff[minPos]) == 1)
                     numSig--;
 
-                if (m_resiDctCoeff[minPos] >= 0)
-                    coeff[minPos] += finalChange;
-                else
-                    coeff[minPos] -= finalChange;
+                {
+                    const int16_t sigMask = ((int16_t)m_resiDctCoeff[minPos]) >> 15;
+                    coeff[minPos] += ((int16_t)finalChange ^ sigMask) - sigMask;
+                }
             }
         }
-
-        lastCG = false;
     }
 
     return numSig;
@@ -437,7 +478,7 @@ uint32_t Quant::transformNxN(const CUDat
         {
             TUEntropyCodingParameters codeParams;
             cu.getTUEntropyCodingParameters(codeParams, absPartIdx, log2TrSize, isLuma);