[x265-commits] [x265] asm: avx2 code for sad[64x16] (3620 -> 1279) for 10 bpp
Sumalatha at videolan.org
Tue May 19 02:34:04 CEST 2015
details: http://hg.videolan.org/x265/rev/4db250a49f1e
branches:
changeset: 10438:4db250a49f1e
user: Sumalatha Polureddy
date: Thu May 14 10:31:29 2015 +0530
description:
asm: avx2 code for sad[64x16] (3620 -> 1279) for 10 bpp
sse2
sad[64x16] 2.45x 3620.17 8851.77
avx2
sad[64x16] 7.00x 1279.94 8961.61
Subject: [x265] asm: filter_vpp, filter_vps for 4x32 in avx2
details: http://hg.videolan.org/x265/rev/07503e14e7ce
branches:
changeset: 10439:07503e14e7ce
user: Divya Manivannan <divya at multicorewareinc.com>
date: Thu May 14 10:34:17 2015 +0530
description:
asm: filter_vpp, filter_vps for 4x32 in avx2
filter_vpp[4x32]: 1564c->1172c
filter_vps[4x32]: 1283c->1035c
Subject: [x265] asm: avx2 code for sad[64x32] (7156 -> 2625) for 10 bpp
details: http://hg.videolan.org/x265/rev/6237beac41d2
branches:
changeset: 10440:6237beac41d2
user: Sumalatha Polureddy
date: Thu May 14 10:49:41 2015 +0530
description:
asm: avx2 code for sad[64x32] (7156 -> 2625) for 10 bpp
sse2
sad[64x32] 2.42x 7156.98 17353.95
avx2
sad[64x32] 6.59x 2625.69 17314.50
Subject: [x265] asm: filter_vpp, filter_vps for 24x64 in avx2
details: http://hg.videolan.org/x265/rev/080a2924ccc0
branches:
changeset: 10441:080a2924ccc0
user: Divya Manivannan <divya at multicorewareinc.com>
date: Thu May 14 10:43:07 2015 +0530
description:
asm: filter_vpp, filter_vps for 24x64 in avx2
filter_vpp[24x64]: 5661c->4150c
filter_vps[24x64]: 6059c->4784c
Subject: [x265] asm: avx2 code for sad[64x48] (10791 -> 4053) for 10 bpp
details: http://hg.videolan.org/x265/rev/e16ad34e7ee8
branches:
changeset: 10442:e16ad34e7ee8
user: Sumalatha Polureddy
date: Thu May 14 11:07:36 2015 +0530
description:
asm: avx2 code for sad[64x48] (10791 -> 4053) for 10 bpp
sse2
sad[64x48] 2.34x 10791.39 25291.58
avx2
sad[64x48] 6.45x 4053.05 26139.05
Subject: [x265] asm: addAvg avx2 code for high_bit_depth sizes >= 8, improved by over ~45% compared to previous code
details: http://hg.videolan.org/x265/rev/af5bb7d20e55
branches:
changeset: 10443:af5bb7d20e55
user: Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date: Thu May 14 11:14:31 2015 +0530
description:
asm: addAvg avx2 code for high_bit_depth sizes >= 8, improved by over ~45% compared to previous code
Subject: [x265] asm: filter_vpp, filter_vps for 48x64 in avx2
details: http://hg.videolan.org/x265/rev/2a8aad20a016
branches:
changeset: 10444:2a8aad20a016
user: Divya Manivannan <divya at multicorewareinc.com>
date: Thu May 14 10:53:49 2015 +0530
description:
asm: filter_vpp, filter_vps for 48x64 in avx2
filter_vpp[48x64]: 11492c->7586c
filter_vps[48x64]: 11784c->8684c
Subject: [x265] asm: avx2 code for sad[64x64] (13997 -> 5214) for 10 bpp
details: http://hg.videolan.org/x265/rev/cce04a88e5e8
branches:
changeset: 10445:cce04a88e5e8
user: Sumalatha Polureddy
date: Thu May 14 11:38:48 2015 +0530
description:
asm: avx2 code for sad[64x64] (13997 -> 5214) for 10 bpp
sse2
sad[64x64] 2.31x 13997.47 32364.27
avx2
sad[64x64] 6.68x 5214.84 34847.11
Subject: [x265] asm: addAvg high_bit_depth avx2 asm for chroma sizes width >= 8, reused code from luma
details: http://hg.videolan.org/x265/rev/8592bf81d084
branches:
changeset: 10446:8592bf81d084
user: Dnyaneshwar G <dnyaneshwar at multicorewareinc.com>
date: Thu May 14 17:12:14 2015 +0530
description:
asm: addAvg high_bit_depth avx2 asm for chroma sizes width >= 8, reused code from luma
Subject: [x265] api: do not log warnings from x265_api_get() on typical failures
details: http://hg.videolan.org/x265/rev/a7b2a1cfd10e
branches: stable
changeset: 10447:a7b2a1cfd10e
user: Steve Borho <steve at borho.org>
date: Fri May 15 12:57:19 2015 -0500
description:
api: do not log warnings from x265_api_get() on typical failures
Applications might use x265_api_get() to probe which bit depths are available, and
we do not want to spew warnings during these operations. Also plumb in main12
here even though we do not expect to have one for some time; it is fine if current
libs allow queries for it.
Subject: [x265] lowres: cache the lowres maxNumBlocks and reuse this in other places
details: http://hg.videolan.org/x265/rev/3b30284c9912
branches:
changeset: 10448:3b30284c9912
user: Gopu Govindaswamy <gopu at multicorewareinc.com>
date: Fri May 15 10:16:02 2015 +0530
description:
lowres: cache the lowres maxNumBlocks and reuse this in other places
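The cached counts are the lowres frame dimensions divided by the lowres CU size, rounded up. The following is a minimal sketch of that rounding, not the encoder's code: it assumes a CU size of 8 pixels (3 bits) for X265_LOWRES_CU_SIZE / X265_LOWRES_CU_BITS, and `blocksCovering` and `LowresDims` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed values for illustration; check the x265 headers for the real ones.
constexpr int kLowresCuBits = 3;
constexpr int kLowresCuSize = 1 << kLowresCuBits;

// Number of CU-sized blocks needed to cover `pixels`, rounding up.
constexpr uint32_t blocksCovering(int pixels)
{
    return static_cast<uint32_t>((pixels + kLowresCuSize - 1) >> kLowresCuBits);
}

// Hypothetical holder that computes both counts once so later code can reuse them.
struct LowresDims
{
    uint32_t maxBlocksInRow;
    uint32_t maxBlocksInCol;

    LowresDims(int width, int lines)
        : maxBlocksInRow(blocksCovering(width))
        , maxBlocksInCol(blocksCovering(lines))
    {}

    uint32_t blockCount() const { return maxBlocksInRow * maxBlocksInCol; }
};
```

Computing the two values once at creation time avoids repeating the add-and-shift wherever the block grid is walked.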
Subject: [x265] asm: interp_4tap_vert_ps_2xN sse2
details: http://hg.videolan.org/x265/rev/7dad7c16c5d8
branches:
changeset: 10449:7dad7c16c5d8
user: David T Yuen <dtyx265 at gmail.com>
date: Sun May 17 18:03:29 2015 -0700
description:
asm: interp_4tap_vert_ps_2xN sse2
Updated vert_pp_2xN macro to also create ps. This replaces c code for ps with minimal impact on pp.
64-bit
./test/TestBench --testbench interp | grep vp | grep " 2x"
chroma_vpp[ 2x4] 1.80x 644.93 1159.97
chroma_vps[ 2x4] 1.42x 630.00 894.95
chroma_vpp[ 2x8] 1.72x 1204.99 2067.47
chroma_vps[ 2x8] 1.49x 1152.50 1712.50
chroma_vpp[ 2x16] 1.94x 2314.96 4482.00
chroma_vps[ 2x16] 1.91x 2222.45 4252.86
32-bit
./test/TestBench --testbench interp | grep vp | grep " 2x"
chroma_vpp[ 2x4] 1.74x 849.94 1479.88
chroma_vps[ 2x4] 1.64x 762.49 1247.46
chroma_vpp[ 2x8] 1.89x 1482.47 2807.46
chroma_vps[ 2x8] 1.93x 1392.49 2682.46
chroma_vpp[ 2x16] 2.26x 2769.98 6249.80
chroma_vps[ 2x16] 1.91x 2632.49 5028.81
Subject: [x265] asm: interp_4tap_vert_ps_4x2 sse2
details: http://hg.videolan.org/x265/rev/d12ce0a926a3
branches:
changeset: 10450:d12ce0a926a3
user: David T Yuen <dtyx265 at gmail.com>
date: Sun May 17 18:13:35 2015 -0700
description:
asm: interp_4tap_vert_ps_4x2 sse2
Converted vert_pp_4x2 primitive to macro that also creates ps. This replaces c code for ps with minimal impact on pp.
64-bit
./test/TestBench --testbench interp | grep vp | grep " 4x2"
chroma_vpp[ 4x2] 2.13x 524.99 1117.40
chroma_vps[ 4x2] 1.87x 457.54 854.98
32-bit
./test/TestBench --testbench interp | grep vp | grep " 4x2"
chroma_vpp[ 4x2] 2.34x 592.50 1387.29
chroma_vps[ 4x2] 2.41x 542.48 1304.96
Subject: [x265] asm: interp_4tap_vert_ps_4xN sse2
details: http://hg.videolan.org/x265/rev/3d5f0ce3dcd4
branches:
changeset: 10451:3d5f0ce3dcd4
user: David T Yuen <dtyx265 at gmail.com>
date: Sun May 17 18:24:12 2015 -0700
description:
asm: interp_4tap_vert_ps_4xN sse2
Converted vert_pp_4xN macro to also create ps primitives. This replaces c code for ps with minimal impact on pp.
64-bit
./test/TestBench --testbench interp | grep vp | grep " 4x"
chroma_vpp[ 4x4] 2.10x 1005.00 2110.31
chroma_vps[ 4x4] 1.76x 927.49 1634.98
chroma_vpp[ 4x2] 2.17x 515.01 1117.32
chroma_vps[ 4x2] 1.72x 497.49 854.98
chroma_vpp[ 4x8] 2.28x 1928.24 4402.00
chroma_vps[ 4x8] 1.83x 1803.01 3294.14
chroma_vpp[ 4x16] 2.30x 3782.50 8710.95
chroma_vpp[ 4x8] 2.28x 1927.50 4400.15
chroma_vps[ 4x8] 1.84x 1787.48 3294.45
chroma_vpp[ 4x4] 2.11x 1000.00 2109.98
chroma_vps[ 4x4] 1.77x 924.99 1634.97
chroma_vpp[ 4x16] 2.30x 3782.50 8709.96
chroma_vps[ 4x16] 1.90x 3519.99 6698.32
chroma_vpp[ 4x32] 2.27x 7477.50 16995.56
chroma_vpp[ 4x4] 2.10x 1005.00 2109.98
chroma_vps[ 4x4] 1.76x 927.49 1634.98
chroma_vpp[ 4x8] 2.28x 1927.50 4400.14
chroma_vps[ 4x8] 1.83x 1787.48 3270.89
chroma_vpp[ 4x16] 2.30x 3782.51 8709.96
chroma_vps[ 4x16] 1.90x 3517.48 6697.70
32-bit
./test/TestBench --testbench interp | grep vp | grep " 4x"
chroma_vpp[ 4x4] 2.33x 1177.48 2747.42
chroma_vps[ 4x4] 2.47x 1092.47 2702.46
chroma_vpp[ 4x2] 2.44x 579.98 1414.90
chroma_vps[ 4x2] 2.41x 545.91 1314.92
chroma_vpp[ 4x8] 2.58x 2183.79 5640.24
chroma_vps[ 4x8] 2.08x 2020.00 4200.24
chroma_vpp[ 4x16] 2.64x 4202.49 11097.51
chroma_vpp[ 4x8] 2.58x 2187.48 5640.61
chroma_vps[ 4x8] 2.06x 2019.08 4163.00
chroma_vpp[ 4x4] 2.37x 1159.99 2747.38
chroma_vps[ 4x4] 2.47x 1092.49 2702.42
chroma_vpp[ 4x16] 2.64x 4207.48 11097.51
chroma_vps[ 4x16] 2.07x 3887.50 8034.82
chroma_vpp[ 4x32] 2.65x 8247.49 21867.51
chroma_vpp[ 4x4] 2.37x 1159.99 2747.42
chroma_vps[ 4x4] 2.48x 1088.74 2702.42
chroma_vpp[ 4x8] 2.58x 2187.48 5640.30
chroma_vps[ 4x8] 2.06x 2017.49 4162.46
chroma_vpp[ 4x16] 2.64x 4202.49 11097.51
chroma_vps[ 4x16] 2.07x 3889.43 8034.94
Subject: [x265] asm: interp_4tap_vert_ps_6xN sse2
details: http://hg.videolan.org/x265/rev/468cc3b6cc4e
branches:
changeset: 10452:468cc3b6cc4e
user: David T Yuen <dtyx265 at gmail.com>
date: Sun May 17 18:33:34 2015 -0700
description:
asm: interp_4tap_vert_ps_6xN sse2
Converted vert_pp_6xN macro to also create ps primitives. This replaces c code for ps with minimal impact on pp.
64-bit
./test/TestBench --testbench interp | grep vp | grep " 6x"
chroma_vpp[ 6x8] 2.98x 2125.00 6330.15
chroma_vps[ 6x8] 2.45x 2032.49 4980.56
chroma_vpp[ 6x16] 2.98x 4198.94 12520.75
chroma_vps[ 6x16] 2.47x 4004.99 9897.92
Subject: [x265] asm: interp_4tap_vert_ps_8xN sse2
details: http://hg.videolan.org/x265/rev/1d6e87563c04
branches:
changeset: 10453:1d6e87563c04
user: David T Yuen <dtyx265 at gmail.com>
date: Sun May 17 18:41:31 2015 -0700
description:
asm: interp_4tap_vert_ps_8xN sse2
Converted vert_pp_8xN macro to also create ps primitives. This replaces c code for ps with minimal impact on pp.
64-bit
./test/TestBench --testbench interp | grep vp | grep " 8x"
chroma_vpp[ 8x4] 3.87x 1075.00 4161.23
chroma_vps[ 8x4] 3.25x 995.00 3232.74
chroma_vpp[ 8x6] 3.97x 1549.98 6160.32
chroma_vps[ 8x6] 3.22x 1457.48 4693.99
chroma_vpp[ 8x2] 3.71x 560.00 2077.48
chroma_vps[ 8x2] 3.01x 525.00 1582.45
chroma_vpp[ 8x4] 3.93x 1057.50 4160.76
chroma_vpp[ 8x4] 3.87x 1075.00 4160.94
chroma_vps[ 8x4] 3.21x 1007.50 3232.90
Subject: [x265] asm: interp_4tap_vert_ps_8xN sse2
details: http://hg.videolan.org/x265/rev/4d7d23bed21f
branches:
changeset: 10454:4d7d23bed21f
user: David T Yuen <dtyx265 at gmail.com>
date: Sun May 17 18:53:44 2015 -0700
description:
asm: interp_4tap_vert_ps_8xN sse2
Converted vert_pp_8xN macro to also create ps primitives. This replaces c code for ps with minimal impact on pp.
64-bit
./test/TestBench --testbench interp | grep vp | grep " 8x"
chroma_vpp[ 8x8] 4.08x 2004.98 8188.26
chroma_vps[ 8x8] 3.30x 1877.49 6197.96
chroma_vpp[ 8x16] 4.08x 3974.99 16231.35
chroma_vps[ 8x16] 3.30x 3729.98 12308.11
chroma_vpp[ 8x32] 4.07x 7885.22 32072.63
chroma_vps[ 8x32] 3.36x 7284.99 24442.68
chroma_vpp[ 8x16] 4.09x 3964.98 16230.44
chroma_vps[ 8x16] 3.35x 3677.49 12308.64
chroma_vpp[ 8x8] 4.08x 2005.00 8187.65
chroma_vps[ 8x8] 3.30x 1877.52 6199.94
chroma_vpp[ 8x32] 4.07x 7886.48 32099.09
chroma_vps[ 8x32] 3.28x 7417.49 24307.48
chroma_vpp[ 8x12] 4.10x 2994.99 12269.99
chroma_vps[ 8x12] 3.31x 2809.98 9307.72
chroma_vpp[ 8x64] 4.05x 15735.15 63743.21
chroma_vps[ 8x64] 3.30x 14640.09 48369.12
chroma_vpp[ 8x8] 4.08x 2005.00 8187.79
chroma_vps[ 8x8] 3.28x 1889.99 6198.50
chroma_vpp[ 8x16] 4.04x 4013.35 16231.04
chroma_vps[ 8x16] 3.33x 3692.69 12307.46
chroma_vpp[ 8x32] 4.06x 7894.98 32070.79
chroma_vps[ 8x32] 3.28x 7417.48 24307.55
Subject: [x265] asm: interp_4tap_vert_ps_12xN sse2
details: http://hg.videolan.org/x265/rev/fd0904c7bb53
branches:
changeset: 10455:fd0904c7bb53
user: David T Yuen <dtyx265 at gmail.com>
date: Sun May 17 19:00:41 2015 -0700
description:
asm: interp_4tap_vert_ps_12xN sse2
Converted vert_pp_12xN macro to also create ps primitives. This replaces c code for ps with minimal impact on pp.
64-bit
./test/TestBench --testbench interp | grep vp | grep "12x"
chroma_vpp[12x16] 2.83x 8555.04 24230.19
chroma_vps[12x16] 2.29x 7875.15 18027.46
chroma_vpp[12x32] 2.87x 17085.12 49025.72
chroma_vps[12x32] 2.29x 15661.67 35787.46
chroma_vpp[12x16] 2.86x 8479.97 24229.99
chroma_vps[12x16] 2.32x 7757.38 18027.42
Subject: [x265] asm: interp_4tap_vert_ps_16xN sse2
details: http://hg.videolan.org/x265/rev/db92414d2771
branches:
changeset: 10456:db92414d2771
user: David T Yuen <dtyx265 at gmail.com>
date: Sun May 17 19:06:51 2015 -0700
description:
asm: interp_4tap_vert_ps_16xN sse2
Converted vert_pp_16xN macro to also create ps primitives. This replaces c code for ps with minimal impact on pp.
64-bit
./test/TestBench --testbench interp | grep vp | grep "16x"
chroma_vpp[16x16] 3.90x 8256.30 32230.59
chroma_vps[16x16] 3.17x 7599.99 24104.14
chroma_vpp[ 16x8] 3.88x 4175.83 16187.80
chroma_vps[ 16x8] 3.11x 3840.00 11957.77
chroma_vpp[16x32] 3.93x 16435.21 64556.77
chroma_vps[16x32] 3.15x 15120.00 47622.26
chroma_vpp[16x12] 3.92x 6195.00 24270.27
chroma_vps[16x12] 3.14x 5720.05 17947.46
chroma_vpp[ 16x4] 3.86x 2115.00 8163.19
chroma_vps[ 16x4] 3.14x 1960.00 6160.84
chroma_vpp[16x32] 3.94x 16394.99 64530.13
chroma_vps[16x32] 3.13x 15120.04 47347.74
chroma_vpp[16x16] 3.91x 8235.00 32230.49
chroma_vps[16x16] 2.98x 7984.99 23827.91
chroma_vpp[16x64] 3.87x 33080.13 128135.33
chroma_vps[16x64] 3.09x 30613.33 94704.50
chroma_vpp[16x24] 3.94x 12315.02 48524.37
chroma_vps[16x24] 3.01x 11925.05 35897.88
chroma_vpp[ 16x8] 3.90x 4155.05 16187.66
chroma_vps[ 16x8] 2.96x 4044.99 11957.97
chroma_vpp[16x16] 3.80x 8475.00 32230.59
chroma_vps[16x16] 2.98x 7984.99 23827.48
chroma_vpp[ 16x8] 3.85x 4203.07 16187.50
chroma_vps[ 16x8] 3.11x 3840.00 11957.77
chroma_vpp[16x32] 3.81x 16922.52 64452.15
chroma_vps[16x32] 2.99x 15834.39 47348.57
chroma_vpp[16x12] 3.92x 6195.00 24269.99
chroma_vps[16x12] 3.06x 5858.04 17947.48
chroma_vpp[ 16x4] 3.86x 2115.00 8163.27
chroma_vps[ 16x4] 3.14x 1960.00 6152.63
chroma_vpp[16x64] 3.80x 33771.68 128367.91
chroma_vps[16x64] 2.99x 31614.99 94669.76
Subject: [x265] asm: interp_4tap_vert_ps_24xN sse2
details: http://hg.videolan.org/x265/rev/b3b4924d0263
branches:
changeset: 10457:b3b4924d0263
user: David T Yuen <dtyx265 at gmail.com>
date: Sun May 17 19:13:36 2015 -0700
description:
asm: interp_4tap_vert_ps_24xN sse2
Converted vert_pp_24xN macro to also create ps primitives. This replaces c code for ps with minimal impact on pp.
64-bit
./test/TestBench --testbench interp | grep vp | grep "24x"
chroma_vpp[24x32] 7.42x 24657.66 182977.44
chroma_vps[24x32] 6.76x 22923.31 154889.81
chroma_vpp[24x64] 7.43x 49350.79 366451.75
chroma_vps[24x64] 6.87x 44989.23 309009.28
chroma_vpp[24x32] 7.42x 24602.54 182471.03
chroma_vps[24x32] 6.93x 22564.47 156441.62
Subject: [x265] asm: interp_4tap_vert_ps_32xN sse2
details: http://hg.videolan.org/x265/rev/933512ac8ba3
branches:
changeset: 10458:933512ac8ba3
user: David T Yuen <dtyx265 at gmail.com>
date: Sun May 17 19:22:41 2015 -0700
description:
asm: interp_4tap_vert_ps_32xN sse2
Converted vert_pp_32xN macro to also create ps primitives. This replaces c code for ps with minimal impact on pp.
64-bit
./test/TestBench --testbench interp | grep vp | grep "32x"
chroma_vpp[32x32] 8.02x 33660.57 269893.41
chroma_vps[32x32] 7.34x 30918.89 227002.17
chroma_vpp[32x16] 8.09x 16937.68 136942.98
chroma_vps[32x16] 7.35x 15547.56 114342.84
chroma_vpp[32x24] 8.12x 25324.71 205517.50
chroma_vps[32x24] 7.36x 23167.54 170409.39
chroma_vpp[ 32x8] 8.05x 8412.51 67683.04
chroma_vps[ 32x8] 7.53x 7555.30 56923.70
chroma_vpp[32x64] 8.06x 66996.57 539788.81
chroma_vps[32x64] 7.39x 61492.46 454333.72
chroma_vpp[32x32] 8.06x 33655.25 271176.75
chroma_vps[32x32] 7.36x 30832.21 226821.72
chroma_vpp[32x48] 8.02x 50441.13 404571.69
chroma_vps[32x48] 7.37x 46230.04 340583.22
chroma_vpp[32x16] 8.09x 16937.61 137064.12
chroma_vps[32x16] 7.32x 15547.49 113817.55
chroma_vpp[32x32] 8.04x 33663.30 270794.66
chroma_vps[32x32] 7.37x 30873.11 227544.72
chroma_vpp[32x16] 8.07x 16937.51 136649.12
chroma_vps[32x16] 7.33x 15547.57 113930.27
chroma_vpp[32x64] 8.08x 67008.20 541583.00
chroma_vps[32x64] 7.40x 61431.15 454445.69
chroma_vpp[32x24] 8.03x 25322.75 203277.30
chroma_vps[32x24] 7.36x 23167.54 170494.78
chroma_vpp[ 32x8] 8.19x 8412.74 68903.22
chroma_vps[ 32x8] 7.59x 7515.08 57067.58
Subject: [x265] asm: interp_4tap_vert_ps_64xN and interp_4tap_vert_ps_48x64 sse2
details: http://hg.videolan.org/x265/rev/ef127c0b4a02
branches:
changeset: 10459:ef127c0b4a02
user: David T Yuen <dtyx265 at gmail.com>
date: Sun May 17 19:29:32 2015 -0700
description:
asm: interp_4tap_vert_ps_64xN and interp_4tap_vert_ps_48x64 sse2
Converted vert_pp_64xN macro to also create ps primitives. This replaces c code for ps with minimal impact on pp.
64-bit
./test/TestBench --testbench interp | grep vp | grep "48x"
chroma_vpp[48x64] 8.03x 100353.77 805653.62
chroma_vps[48x64] 7.51x 93710.29 704220.94
./test/TestBench --testbench interp | grep vp | grep "64x"
chroma_vpp[64x64] 8.12x 133651.45 1085310.25
chroma_vps[64x64] 7.41x 124089.83 919231.12
chroma_vpp[64x32] 8.11x 66970.80 543079.94
chroma_vps[64x32] 7.38x 63249.14 466864.12
chroma_vpp[64x48] 8.12x 100298.12 814796.12
chroma_vps[64x48] 7.18x 96270.98 690954.25
chroma_vpp[64x16] 8.15x 33627.38 274119.84
chroma_vps[64x16] 7.28x 31222.30 227424.44
Subject: [x265] Call macros to reduce code size of primitive setup
details: http://hg.videolan.org/x265/rev/98553d52d844
branches:
changeset: 10460:98553d52d844
user: David T Yuen <dtyx265 at gmail.com>
date: Sun May 17 19:41:24 2015 -0700
description:
Call macros to reduce code size of primitive setup
Subject: [x265] asm: avx2 code for sad_x3[16xN] for 10 bpp
details: http://hg.videolan.org/x265/rev/b2c7b95ed9a9
branches:
changeset: 10461:b2c7b95ed9a9
user: Sumalatha Polureddy
date: Mon May 18 12:03:30 2015 +0530
description:
asm: avx2 code for sad_x3[16xN] for 10 bpp
sse2:
sad_x3[ 16x4] 2.93x 680.82 1996.91
sad_x3[ 16x8] 3.03x 1266.26 3834.18
sad_x3[16x12] 3.07x 1834.17 5631.97
sad_x3[16x16] 3.06x 2413.24 7380.88
sad_x3[16x32] 2.82x 5554.36 15654.50
sad_x3[16x64] 2.80x 10161.18 28493.52
avx2:
sad_x3[ 16x4] 4.82x 404.45 1948.78
sad_x3[ 16x8] 5.85x 634.65 3714.40
sad_x3[16x12] 6.17x 885.30 5465.97
sad_x3[16x16] 6.28x 1170.04 7350.87
sad_x3[16x32] 5.34x 2909.76 15547.79
sad_x3[16x64] 6.12x 5071.22 31043.80
Subject: [x265] asm: avx2 code for sad_x3[32xN] for 10 bpp
details: http://hg.videolan.org/x265/rev/dac9417715e5
branches:
changeset: 10462:dac9417715e5
user: Sumalatha Polureddy
date: Mon May 18 12:13:37 2015 +0530
description:
asm: avx2 code for sad_x3[32xN] for 10 bpp
sse2
sad_x3[ 32x8] 2.87x 2260.13 6491.12
sad_x3[32x16] 2.95x 4262.20 12583.53
sad_x3[32x24] 2.66x 7356.49 19539.50
sad_x3[32x32] 2.81x 9852.31 27732.33
sad_x3[32x64] 2.80x 17682.81 49470.11
avx2
sad_x3[ 32x8] 5.83x 1100.14 6409.04
sad_x3[32x16] 6.33x 2072.92 13121.29
sad_x3[32x24] 5.16x 3801.07 19625.10
sad_x3[32x32] 5.43x 4781.53 25952.90
sad_x3[32x64] 5.74x 8709.35 49988.55
Subject: [x265] asm: avx2 code for sad_x3[64xN] for 10 bpp
details: http://hg.videolan.org/x265/rev/618f2ecb7b21
branches:
changeset: 10463:618f2ecb7b21
user: Sumalatha Polureddy
date: Mon May 18 12:32:04 2015 +0530
description:
asm: avx2 code for sad_x3[64xN] for 10 bpp
sse2
sad_x3[64x16] 2.78x 8370.09 23242.03
sad_x3[64x32] 2.67x 17362.56 46289.12
sad_x3[64x48] 2.72x 25053.33 68260.15
sad_x3[64x64] 2.47x 35227.60 87136.18
avx2
sad_x3[64x16] 6.45x 3664.96 23624.50
sad_x3[64x32] 5.74x 8741.48 50144.03
sad_x3[64x48] 6.07x 11401.75 69182.98
sad_x3[64x64] 6.38x 16092.67 102696.92
Subject: [x265] asm: modify API on findPosFirstLast to support all zeros block
details: http://hg.videolan.org/x265/rev/469f98bcf6a2
branches:
changeset: 10464:469f98bcf6a2
user: Min Chen <chenm003 at 163.com>
date: Mon May 18 09:58:08 2015 -0500
description:
asm: modify API on findPosFirstLast to support all zeros block
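As a rough illustration of the changed contract, the sketch below packs the last and first non-zero positions of one coefficient group into a single 32-bit value and bounds the forward search, so an all-zero group yields a first position of 16. It is a simplified stand-in, not the x265 primitive: it assumes the 16 coefficients are already in scan order, and `findPosFirstLastSketch`, `firstPos`, `lastPos` and `isAllZero` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kScanSetSize = 16; // coefficients per coefficient group

// Simplified stand-in: `cg` holds one coefficient group already in scan order.
inline uint32_t findPosFirstLastSketch(const int16_t (&cg)[kScanSetSize])
{
    int n;
    for (n = kScanSetSize - 1; n >= 0; --n)   // backward search for the last non-zero
        if (cg[n])
            break;
    const uint32_t last = static_cast<uint32_t>(n);  // n is -1 for an all-zero group

    for (n = 0; n < kScanSetSize; ++n)        // bounded forward search
        if (cg[n])
            break;
    const uint32_t first = static_cast<uint32_t>(n); // 16 for an all-zero group

    return (last << 16) | first;
}

inline uint32_t firstPos(uint32_t packed) { return packed & 0xFFFF; }
inline uint32_t lastPos(uint32_t packed)  { return packed >> 16; }

// The last position is meaningless for an all-zero group, so test the first one.
inline bool isAllZero(uint32_t packed)    { return firstPos(packed) == kScanSetSize; }
```

Callers would check `isAllZero()` before trusting the upper half of the packed value.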
Subject: [x265] improve Quant::signBitHidingHDQ by scanPosLast and findPosFirstLast
details: http://hg.videolan.org/x265/rev/02f1521c89fd
branches:
changeset: 10465:02f1521c89fd
user: Min Chen <chenm003 at 163.com>
date: Mon May 18 09:58:15 2015 -0500
description:
improve Quant::signBitHidingHDQ by scanPosLast and findPosFirstLast
Subject: [x265] faster algorithm to find firstNZPosInCG & lastNZPosInCG in Quant::signBitHidingHDQ()
details: http://hg.videolan.org/x265/rev/9fb61f65eb91
branches:
changeset: 10466:9fb61f65eb91
user: Min Chen <chenm003 at 163.com>
date: Fri May 15 17:30:19 2015 -0700
description:
faster algorithm to find firstNZPosInCG & lastNZPosInCG in Quant::signBitHidingHDQ()
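The idea can be sketched as follows: once a 16-bit mask of non-zero coefficients exists, the first and last non-zero scan positions fall out of a count-leading-zeros and a count-trailing-zeros operation instead of two loops. This sketch makes several assumptions: it uses the GCC/Clang builtins `__builtin_clz` and `__builtin_ctz` directly rather than x265's own CLZ/CTZ macros, it assumes a 32-bit `unsigned`, and it assumes bit 15 of the mask corresponds to scan position 0 (a layout inferred from the diff in this message). `NzRange` and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

struct NzRange { int first; int last; };

// Requires flags != 0; the builtins are undefined for a zero argument.
inline NzRange nzRangeFromFlags(uint16_t flags)
{
    const unsigned v = flags;
    const int highestBit = 31 - __builtin_clz(v); // index of the most significant set bit
    const int lowestBit  = __builtin_ctz(v);      // index of the least significant set bit
    return { 15 - highestBit, 15 - lowestBit };
}

// Loop-based reference with the same bit layout, useful for cross-checking.
inline NzRange nzRangeByLoop(uint16_t flags)
{
    NzRange r{ -1, -1 };
    for (int pos = 0; pos < 16; ++pos)
    {
        if (flags & (1u << (15 - pos)))
        {
            if (r.first < 0)
                r.first = pos;
            r.last = pos;
        }
    }
    return r;
}
```

Keeping a slow reference next to the bit-scan version mirrors the checked-build cross-checks visible in the diff.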
Subject: [x265] modify logic to remove lastCG in Quant::signBitHidingHDQ()
details: http://hg.videolan.org/x265/rev/9f8853df8d7d
branches:
changeset: 10467:9f8853df8d7d
user: Min Chen <chenm003 at 163.com>
date: Fri May 15 17:30:22 2015 -0700
description:
modify logic to remove lastCG in Quant::signBitHidingHDQ()
Subject: [x265] reuse coeffFlag to reduce memory operations on coeff[]
details: http://hg.videolan.org/x265/rev/3ebe4c09ca82
branches:
changeset: 10468:3ebe4c09ca82
user: Min Chen <chenm003 at 163.com>
date: Fri May 15 17:30:24 2015 -0700
description:
reuse coeffFlag to reduce memory operations on coeff[]
Subject: [x265] improve by replacing the conditional operator with a mask-based computation
details: http://hg.videolan.org/x265/rev/6548dd65da87
branches:
changeset: 10469:6548dd65da87
user: Min Chen <chenm003 at 163.com>
date: Fri May 15 19:29:09 2015 -0700
description:
improve by replacing the conditional operator with a mask-based computation
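A minimal sketch of the mask-based idiom, assuming the goal is to add a change with the sign of a reference value without branching. The mask is all ones when the reference is negative and zero otherwise, and `(x ^ mask) - mask` then negates `x` only in the negative case. The function names are hypothetical, and the sketch relies on right shift of a negative signed value being arithmetic (guaranteed from C++20, and the behaviour of mainstream compilers before that).

```cpp
#include <cassert>
#include <cstdint>

// Branchless: returns `change` when reference >= 0, and -change when reference < 0.
inline int16_t applySignOf(int16_t change, int16_t reference)
{
    // 0 for non-negative reference, -1 (all bits set) for negative reference.
    const int16_t mask = static_cast<int16_t>(reference >> 15);
    // XOR with -1 flips every bit; subtracting -1 adds one: two's-complement negation.
    return static_cast<int16_t>((change ^ mask) - mask);
}

// Branching equivalent, kept for comparison.
inline int16_t applySignOfBranchy(int16_t change, int16_t reference)
{
    return reference >= 0 ? change : static_cast<int16_t>(-change);
}
```

Both forms give the same result; the masked one trades a data-dependent branch for two cheap arithmetic operations.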
Subject: [x265] api: fix x265.h documentation for x265_max_bit_depth
details: http://hg.videolan.org/x265/rev/8425278def1e
branches: stable
changeset: 10470:8425278def1e
user: Steve Borho <steve at borho.org>
date: Mon May 18 13:39:44 2015 -0500
description:
api: fix x265.h documentation for x265_max_bit_depth
Subject: [x265] Added tag 1.7 for changeset 8425278def1e
details: http://hg.videolan.org/x265/rev/ddb5868a4bcd
branches: stable
changeset: 10471:ddb5868a4bcd
user: Steve Borho <steve at borho.org>
date: Mon May 18 18:18:40 2015 -0500
description:
Added tag 1.7 for changeset 8425278def1e
Subject: [x265] api: introduce a less version strict API query
details: http://hg.videolan.org/x265/rev/b24870cc916f
branches:
changeset: 10472:b24870cc916f
user: Steve Borho <steve at borho.org>
date: Thu May 14 12:52:34 2015 -0500
description:
api: introduce a less version strict API query
The intention is to allow applications to use libx265 libraries with different
X265_BUILD numbers than the x265.h header they were compiled with by keeping
a little more information about the nature of each API bump.
This should have no effect on existing applications, X265_BUILD will still be
incremented each time the public API is changed, but applications which use
this new x265_api_query() method will be able to dlopen() and use any
version of libx265 that returns an API pointer that passes validation checks:
1. check api->api_major_version == X265_MAJOR_VERSION
2. check api->sizeof_param == sizeof(x265_param) if param is dereferenced
3. check api->sizeof_picture == sizeof(x265_picture)
4. check api->sizeof_analysis_data ..
etc.
apps that use param_alloc()/param_free()/param_parse() can skip step 2 and
thus ignore the primary cause of most X265_BUILD bumps.
The only additional work for x265 developers is to increment X265_MAJOR_VERSION
when warranted (which should hopefully be very rarely).
Since this commit is modifying x265_api, we take the opportunity to rename
max_bit_depth to the more accurate bit_depth, since each API will only be
capable of encoding at a single bit depth.
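The validation steps listed above can be sketched as a single predicate. This is an illustration under stated assumptions, not application code for the real library: so that it compiles on its own, `ApiStub`, `ParamStub`, `PictureStub` and `kMajorVersion` are hypothetical stand-ins for x265_api, x265_param, x265_picture and X265_MAJOR_VERSION, and `AppUsage` and `isApiUsable` are invented names. Only the field names api_major_version, sizeof_param and sizeof_picture come from the commit message.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the x265.h declarations.
struct ParamStub   { int a; int b; };
struct PictureStub { void* planes[3]; };
constexpr int kMajorVersion = 1;   // stands in for X265_MAJOR_VERSION

struct ApiStub                      // stands in for x265_api
{
    int api_major_version;
    int sizeof_param;
    int sizeof_picture;
};

// Which structures the application touches directly (not opaquely).
struct AppUsage
{
    bool dereferencesParam;    // false if only param_alloc()/param_parse()/param_free() are used
    bool dereferencesPicture;
};

// True only when every check relevant to this application passes.
inline bool isApiUsable(const ApiStub* api, const AppUsage& use)
{
    if (!api)
        return false;                                   // query failed
    if (api->api_major_version != kMajorVersion)
        return false;                                   // incompatible API generation
    if (use.dereferencesParam &&
        api->sizeof_param != static_cast<int>(sizeof(ParamStub)))
        return false;                                   // param layout differs
    if (use.dereferencesPicture &&
        api->sizeof_picture != static_cast<int>(sizeof(PictureStub)))
        return false;                                   // picture layout differs
    return true;
}
```

In a real application the pointer would come from x265_api_query(), typically resolved through dlopen() or LoadLibrary(), and the remaining sizeof_* fields would be checked the same way for any structure the application dereferences.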
Subject: [x265] Merge with stable
details: http://hg.videolan.org/x265/rev/d7b100e51e82
branches:
changeset: 10473:d7b100e51e82
user: Steve Borho <steve at borho.org>
date: Mon May 18 18:24:08 2015 -0500
description:
Merge with stable
diffstat:
.hgtags | 1 +
doc/reST/api.rst | 54 +
source/CMakeLists.txt | 2 +-
source/common/dct.cpp | 5 +-
source/common/lowres.cpp | 12 +-
source/common/lowres.h | 2 +
source/common/quant.cpp | 81 +-
source/common/quant.h | 2 +-
source/common/x86/asm-primitives.cpp | 147 ++-
source/common/x86/const-a.asm | 2 +-
source/common/x86/ipfilter8.asm | 1101 +++++++++++++++++++++++++++------
source/common/x86/ipfilter8.h | 40 +
source/common/x86/mc-a.asm | 525 ++++++++++++++++
source/common/x86/pixel-util8.asm | 5 +-
source/common/x86/sad16-a.asm | 242 +++++++-
source/encoder/api.cpp | 96 ++-
source/encoder/slicetype.cpp | 4 +-
source/test/pixelharness.cpp | 32 +-
source/x265.cpp | 4 +-
source/x265.def.in | 1 +
source/x265.h | 72 +-
21 files changed, 2082 insertions(+), 348 deletions(-)
diffs (truncated from 3530 to 300 lines):
diff -r 479087422e29 -r d7b100e51e82 .hgtags
--- a/.hgtags Wed May 13 16:52:59 2015 -0700
+++ b/.hgtags Mon May 18 18:24:08 2015 -0500
@@ -15,3 +15,4 @@ c1e4fc0162c14fdb84f5c3bd404fb28cfe10a17f
5e604833c5aa605d0b6efbe5234492b5e7d8ac61 1.4
9f0324125f53a12f766f6ed6f98f16e2f42337f4 1.5
cbeb7d8a4880e4020c4545dd8e498432c3c6cad3 1.6
+8425278def1edf0931dc33fc518e1950063e76b0 1.7
diff -r 479087422e29 -r d7b100e51e82 doc/reST/api.rst
--- a/doc/reST/api.rst Wed May 13 16:52:59 2015 -0700
+++ b/doc/reST/api.rst Mon May 18 18:24:08 2015 -0500
@@ -419,3 +419,57 @@ under the name libx265 (so all applicati
and then also install libx265_main10.so (symlinked to its numbered solib).
Thus applications which use x265_api_get() will be able to generate main
or main10 bitstreams.
+
+There is a second bit-depth introspection method that is designed for
+applications which need more flexibility in API versioning. If you use
+the public API described at the top of this page or x265_api_get() then
+your application must be recompiled each time x265 changes its public
+API and bumps its build number (X265_BUILD, which is also the SONAME on
+POSIX systems). But if you use **x265_api_query** and dynamically link to
+libx265 (use dlopen() on POSIX or LoadLibrary() on Windows) your
+application is no longer directly tied to the API version of x265.h that
+it was compiled against.
+
+ /* x265_api_query:
+ * Retrieve the programming interface for a linked x265 library, like
+ * x265_api_get(), except this function accepts X265_BUILD as the second
+ * argument rather than using the build number as part of the function name.
+ * Applications which dynamically link to libx265 can use this interface to
+ * query the library API and achieve a relative amount of version skew
+ * flexibility. The function may return NULL if the library determines that
+ * the apiVersion that your application was compiled against is not compatible
+ * with the library you have linked with.
+ *
+ * api_major_version will be incremented any time non-backward compatible
+ * changes are made to any public structures or functions. If
+ * api_major_version does not match X265_MAJOR_VERSION from the x265.h your
+ * application compiled against, your application must not use the returned
+ * x265_api pointer.
+ *
+ * Users of this API *must* also validate the sizes of any structures which
+ * are not treated as opaque in application code. For instance, if your
+ * application dereferences a x265_param pointer, then it must check that
+ * api->sizeof_param matches the sizeof(x265_param) that your application
+ * compiled with. */
+ const x265_api* x265_api_query(int bitDepth, int apiVersion, int* err);
+
+A number of validations must be performed on the returned API structure
+in order to determine if it is safe for use by your application. If you
+do not perform these checks, your application is liable to crash.
+
+ if (api->api_major_version != X265_MAJOR_VERSION) /* do not use */
+ if (api->sizeof_param != sizeof(x265_param)) /* do not use */
+ if (api->sizeof_picture != sizeof(x265_picture)) /* do not use */
+ if (api->sizeof_stats != sizeof(x265_stats)) /* do not use */
+ if (api->sizeof_zone != sizeof(x265_zone)) /* do not use */
+ etc.
+
+Note that if your application does not directly allocate or dereference
+one of these structures, if it treats the structure as opaque or does
+not use it at all, then it can skip the size check for that structure.
+
+In particular, if your application uses api->param_alloc(),
+api->param_free(), api->param_parse(), etc and never directly accesses
+any x265_param fields, then it can skip the check on the
+sizeof(x265_parm) and thereby ignore changes to that structure (which
+account for a large percentage of X265_BUILD bumps).
diff -r 479087422e29 -r d7b100e51e82 source/CMakeLists.txt
--- a/source/CMakeLists.txt Wed May 13 16:52:59 2015 -0700
+++ b/source/CMakeLists.txt Mon May 18 18:24:08 2015 -0500
@@ -30,7 +30,7 @@ option(STATIC_LINK_CRT "Statically link
mark_as_advanced(FPROFILE_USE FPROFILE_GENERATE NATIVE_BUILD)
# X265_BUILD must be incremented each time the public API is changed
-set(X265_BUILD 59)
+set(X265_BUILD 60)
configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
"${PROJECT_BINARY_DIR}/x265.def")
configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
diff -r 479087422e29 -r d7b100e51e82 source/common/dct.cpp
--- a/source/common/dct.cpp Wed May 13 16:52:59 2015 -0700
+++ b/source/common/dct.cpp Mon May 18 18:24:08 2015 -0500
@@ -798,11 +798,11 @@ uint32_t findPosFirstLast_c(const int16_
break;
}
- X265_CHECK(n >= 0, "non-zero coeff scan failuare!\n");
+ X265_CHECK(n >= -1, "non-zero coeff scan failuare!\n");
uint32_t lastNZPosInCG = (uint32_t)n;
- for (n = 0;; n++)
+ for (n = 0; n < SCAN_SET_SIZE; n++)
{
const uint32_t idx = scanTbl[n];
const uint32_t idxY = idx / MLS_CG_SIZE;
@@ -813,6 +813,7 @@ uint32_t findPosFirstLast_c(const int16_
uint32_t firstNZPosInCG = (uint32_t)n;
+ // NOTE: when coeff block all ZERO, the lastNZPosInCG is undefined and firstNZPosInCG is 16
return ((lastNZPosInCG << 16) | firstNZPosInCG);
}
diff -r 479087422e29 -r d7b100e51e82 source/common/lowres.cpp
--- a/source/common/lowres.cpp Wed May 13 16:52:59 2015 -0700
+++ b/source/common/lowres.cpp Mon May 18 18:24:08 2015 -0500
@@ -36,13 +36,13 @@ bool Lowres::create(PicYuv *origPic, int
lumaStride = width + 2 * origPic->m_lumaMarginX;
if (lumaStride & 31)
lumaStride += 32 - (lumaStride & 31);
- int cuWidth = (width + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
- int cuHeight = (lines + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
- int cuCount = cuWidth * cuHeight;
+ maxBlocksInRow = (width + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
+ maxBlocksInCol = (lines + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
+ int cuCount = maxBlocksInRow * maxBlocksInCol;
/* rounding the width to multiple of lowres CU size */
- width = cuWidth * X265_LOWRES_CU_SIZE;
- lines = cuHeight * X265_LOWRES_CU_SIZE;
+ width = maxBlocksInRow * X265_LOWRES_CU_SIZE;
+ lines = maxBlocksInCol * X265_LOWRES_CU_SIZE;
size_t planesize = lumaStride * (lines + 2 * origPic->m_lumaMarginY);
size_t padoffset = lumaStride * origPic->m_lumaMarginY + origPic->m_lumaMarginX;
@@ -74,7 +74,7 @@ bool Lowres::create(PicYuv *origPic, int
{
for (int j = 0; j < bframes + 2; j++)
{
- CHECKED_MALLOC(rowSatds[i][j], int32_t, cuHeight);
+ CHECKED_MALLOC(rowSatds[i][j], int32_t, maxBlocksInCol);
CHECKED_MALLOC(lowresCosts[i][j], uint16_t, cuCount);
}
}
diff -r 479087422e29 -r d7b100e51e82 source/common/lowres.h
--- a/source/common/lowres.h Wed May 13 16:52:59 2015 -0700
+++ b/source/common/lowres.h Mon May 18 18:24:08 2015 -0500
@@ -130,6 +130,8 @@ struct Lowres : public ReferencePlanes
uint16_t(*lowresCosts[X265_BFRAME_MAX + 2][X265_BFRAME_MAX + 2]);
int32_t* lowresMvCosts[2][X265_BFRAME_MAX + 1];
MV* lowresMvs[2][X265_BFRAME_MAX + 1];
+ uint32_t maxBlocksInRow;
+ uint32_t maxBlocksInCol;
/* used for vbvLookahead */
int plannedType[X265_LOOKAHEAD_MAX + 1];
diff -r 479087422e29 -r d7b100e51e82 source/common/quant.cpp
--- a/source/common/quant.cpp Wed May 13 16:52:59 2015 -0700
+++ b/source/common/quant.cpp Mon May 18 18:24:08 2015 -0500
@@ -251,30 +251,63 @@ void Quant::setChromaQP(int qpin, TextTy
}
/* To minimize the distortion only. No rate is considered */
-uint32_t Quant::signBitHidingHDQ(int16_t* coeff, int32_t* deltaU, uint32_t numSig, const TUEntropyCodingParameters &codeParams)
+uint32_t Quant::signBitHidingHDQ(int16_t* coeff, int32_t* deltaU, uint32_t numSig, const TUEntropyCodingParameters &codeParams, uint32_t log2TrSize)
{
- const uint32_t log2TrSizeCG = codeParams.log2TrSizeCG;
+ uint32_t trSize = 1 << log2TrSize;
const uint16_t* scan = codeParams.scan;
- bool lastCG = true;
- for (int cg = (1 << (log2TrSizeCG * 2)) - 1; cg >= 0; cg--)
+ uint8_t coeffNum[MLS_GRP_NUM]; // value range[0, 16]
+ uint16_t coeffSign[MLS_GRP_NUM]; // bit mask map for non-zero coeff sign
+ uint16_t coeffFlag[MLS_GRP_NUM]; // bit mask map for non-zero coeff
+
+#if CHECKED_BUILD || _DEBUG
+ // clean output buffer, the asm version of scanPosLast Never output anything after latest non-zero coeff group
+ memset(coeffNum, 0, sizeof(coeffNum));
+ memset(coeffSign, 0, sizeof(coeffNum));
+ memset(coeffFlag, 0, sizeof(coeffNum));
+#endif
+ const int lastScanPos = primitives.scanPosLast(codeParams.scan, coeff, coeffSign, coeffFlag, coeffNum, numSig, g_scan4x4[codeParams.scanType], trSize);
+ const int cgLastScanPos = (lastScanPos >> LOG2_SCAN_SET_SIZE);
+ unsigned long tmp;
+
+ // first CG need specially processing
+ const uint32_t correctOffset = 0x0F & (lastScanPos ^ 0xF);
+ coeffFlag[cgLastScanPos] <<= correctOffset;
+
+ for (int cg = cgLastScanPos; cg >= 0; cg--)
{
int cgStartPos = cg << LOG2_SCAN_SET_SIZE;
int n;
+#if CHECKED_BUILD || _DEBUG
for (n = SCAN_SET_SIZE - 1; n >= 0; --n)
if (coeff[scan[n + cgStartPos]])
break;
- if (n < 0)
+ int lastNZPosInCG0 = n;
+#endif
+
+ if (coeffNum[cg] == 0)
+ {
+ X265_CHECK(lastNZPosInCG0 < 0, "all zero block check failure\n");
continue;
+ }
- int lastNZPosInCG = n;
-
+#if CHECKED_BUILD || _DEBUG
for (n = 0;; n++)
if (coeff[scan[n + cgStartPos]])
break;
- int firstNZPosInCG = n;
+ int firstNZPosInCG0 = n;
+#endif
+
+ CLZ(tmp, coeffFlag[cg]);
+ const int firstNZPosInCG = (15 ^ tmp);
+
+ CTZ(tmp, coeffFlag[cg]);
+ const int lastNZPosInCG = (15 ^ tmp);
+
+ X265_CHECK(firstNZPosInCG0 == firstNZPosInCG, "firstNZPosInCG0 check failure\n");
+ X265_CHECK(lastNZPosInCG0 == lastNZPosInCG, "lastNZPosInCG0 check failure\n");
if (lastNZPosInCG - firstNZPosInCG >= SBH_THRESHOLD)
{
@@ -287,12 +320,17 @@ uint32_t Quant::signBitHidingHDQ(int16_t
if (signbit != (absSum & 0x1)) // compare signbit with sum_parity
{
int minCostInc = MAX_INT, minPos = -1, curCost = MAX_INT;
- int16_t finalChange = 0, curChange = 0;
+ int32_t finalChange = 0, curChange = 0;
+ uint32_t cgFlags = coeffFlag[cg];
+ if (cg == cgLastScanPos)
+ cgFlags >>= correctOffset;
- for (n = (lastCG ? lastNZPosInCG : SCAN_SET_SIZE - 1); n >= 0; --n)
+ for (n = (cg == cgLastScanPos ? lastNZPosInCG : SCAN_SET_SIZE - 1); n >= 0; --n)
{
uint32_t blkPos = scan[n + cgStartPos];
- if (coeff[blkPos])
+ X265_CHECK(!!coeff[blkPos] == !!(cgFlags & 1), "non zero coeff check failure\n");
+
+ if (cgFlags & 1)
{
if (deltaU[blkPos] > 0)
{
@@ -301,8 +339,11 @@ uint32_t Quant::signBitHidingHDQ(int16_t
}
else
{
- if (n == firstNZPosInCG && abs(coeff[blkPos]) == 1)
+ if ((cgFlags == 1) && (abs(coeff[blkPos]) == 1))
+ {
+ X265_CHECK(n == firstNZPosInCG, "firstNZPosInCG position check failure\n");
curCost = MAX_INT;
+ }
else
{
curCost = deltaU[blkPos];
@@ -312,8 +353,9 @@ uint32_t Quant::signBitHidingHDQ(int16_t
}
else
{
- if (n < firstNZPosInCG)
+ if (cgFlags == 0)
{
+ X265_CHECK(n < firstNZPosInCG, "firstNZPosInCG position check failure\n");
uint32_t thisSignBit = m_resiDctCoeff[blkPos] >= 0 ? 0 : 1;
if (thisSignBit != signbit)
curCost = MAX_INT;
@@ -336,6 +378,7 @@ uint32_t Quant::signBitHidingHDQ(int16_t
finalChange = curChange;
minPos = blkPos;
}
+ cgFlags>>=1;
}
/* do not allow change to violate coeff clamp */
@@ -347,14 +390,12 @@ uint32_t Quant::signBitHidingHDQ(int16_t
else if (finalChange == -1 && abs(coeff[minPos]) == 1)
numSig--;
- if (m_resiDctCoeff[minPos] >= 0)
- coeff[minPos] += finalChange;
- else
- coeff[minPos] -= finalChange;
+ {
+ const int16_t sigMask = ((int16_t)m_resiDctCoeff[minPos]) >> 15;
+ coeff[minPos] += ((int16_t)finalChange ^ sigMask) - sigMask;
+ }
}
}
-
- lastCG = false;
}
return numSig;
@@ -437,7 +478,7 @@ uint32_t Quant::transformNxN(const CUDat
{
TUEntropyCodingParameters codeParams;
cu.getTUEntropyCodingParameters(codeParams, absPartIdx, log2TrSize, isLuma);