[x265] [PATCH] dct: unroll for 30% speedup on Apple Silicon
chen
chenm003 at 163.com
Sun Nov 24 17:50:38 UTC 2024
In this case, I would prefer to keep the current code unchanged.
The performance depends strongly on the compiler and the benefit is small; we may optimize this in asm in the future.
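As a rough illustration of why the result depends on the compiler: an unroll hint is spelled differently per compiler and is only a request that the optimizer may ignore. The sketch below is not taken from the x265 tree. The names SKETCH_PRAGMA, SKETCH_UNROLL and weighted_sum4 are hypothetical, and a scalar loop stands in for the NEON intrinsics so that it builds on any target.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper macros (assumption: not the x265 definitions).
// _Pragma lets a pragma be produced from a macro expansion.
#define SKETCH_PRAGMA(text) _Pragma(#text)

#if defined(__clang__)
// Clang accepts "#pragma unroll(N)" directly before a loop.
#define SKETCH_UNROLL(n) SKETCH_PRAGMA(unroll(n))
#elif defined(__GNUC__) && __GNUC__ >= 8
// GCC 8 and later accept "#pragma GCC unroll N" directly before a loop.
#define SKETCH_UNROLL(n) SKETCH_PRAGMA(GCC unroll n)
#else
// Other compilers: expand to nothing and leave the decision to the optimizer.
#define SKETCH_UNROLL(n)
#endif

// Scalar stand-in for the inner j-loop of the patch: a fixed trip count of 4
// with a multiply-accumulate body. The hint asks for full unrolling.
static int32_t weighted_sum4(const int16_t *coef, const int16_t *src)
{
    int32_t acc = 0;
    SKETCH_UNROLL(4)
    for (int j = 0; j < 4; ++j)
        acc += static_cast<int32_t>(coef[j]) * src[j];
    return acc;
}
```

The hint never changes the result, only the generated code, so whether it helps has to be measured per compiler and per target.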
At 2024-11-24 23:54:48, "Ganesh Ajjanagadde" <gajjanag2 at gmail.com> wrote:
Right, this only affects the dct32 case. The others are unaffected; the change is within noise for them.
On Sat, Nov 23, 2024 at 9:57 PM chen <chenm003 at 163.com> wrote:
It looks like this helps dct32x32 and idct4x4 only; are the other sizes similar or worse?
At 2024-11-24 12:34:48, gajjanag2 at gmail.com wrote:
>From: Ganesh Ajjanagadde <gajjanag at alum.mit.edu>
>
>Apple Silicon has four 128-bit NEON execution units and benefits from unrolling.
>
>From ./TestBench on an M4 Mac Mini,
>
>before (columns: primitive | speedup | optimized | C reference):
>dct8x8 | 2.32x | 205.12 | 476.62
>dct16x16 | 2.02x | 801.20 | 1619.62
>dct32x32 | 3.47x | 7566.39 | 26275.65
>idct4x4 | 0.90x | 175.80 | 157.90
>idct16x16 | 2.05x | 863.30 | 1771.80
>idct32x32 | 1.79x | 6344.33 | 11351.99
>
>after:
>dct8x8 | 2.33x | 204.72 | 476.53
>dct16x16 | 2.04x | 802.16 | 1637.39
>dct32x32 | 4.96x | 5181.02 | 25700.34
>idct4x4 | 1.08x | 162.09 | 174.40
>idct16x16 | 1.95x | 910.01 | 1771.61
>idct32x32 | 1.75x | 6350.72 | 11143.71
>
>~2% end-to-end encoding speedup
>---
> source/common/aarch64/dct-prim.cpp | 2 ++
> 1 file changed, 2 insertions(+)
>
>diff --git a/source/common/aarch64/dct-prim.cpp b/source/common/aarch64/dct-prim.cpp
>index 8b523ceb0..e6ee7005b 100644
>--- a/source/common/aarch64/dct-prim.cpp
>+++ b/source/common/aarch64/dct-prim.cpp
>@@ -435,6 +435,7 @@ static inline void partialButterfly32_neon(const int16_t *src, int16_t *dst)
> for (int i = 0; i < line; i += 4)
> {
> int32x4_t t[4];
>+X265_PRAGMA_UNROLL(4)
> for (int j = 0; j < 4; ++j) {
> t[j] = vmull_s16(c0, vget_low_s16(O[i + j][0]));
> t[j] = vmlal_s16(t[j], c1, vget_high_s16(O[i + j][0]));
>@@ -461,6 +462,7 @@ static inline void partialButterfly32_neon(const int16_t *src, int16_t *dst)
> for (int i = 0; i < line; i += 4)
> {
> int32x4_t t[4];
>+X265_PRAGMA_UNROLL(4)
> for (int j = 0; j < 4; ++j) {
> t[j] = vmulq_s32(c0, EO[i + j][0]);
> t[j] = vmlaq_s32(t[j], c1, EO[i + j][1]);
>--
>2.39.5 (Apple Git-154)
>
>_______________________________________________
>x265-devel mailing list
>x265-devel at videolan.org
>https://mailman.videolan.org/listinfo/x265-devel
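The "30%" in the subject line can be checked against the dct32x32 rows of the quoted tables. The sketch below assumes that the second numeric column is the time of the optimized primitive and that lower is better; this fits the first column, which equals the third column divided by the second. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helpers for reading the tables.
// Assumption: both arguments are timings in the same unit, lower is better.

// How many times faster the "after" measurement is than the "before" one.
static double speedup_factor(double before, double after)
{
    return before / after;
}

// Percentage of the original time that was saved (negative means slower).
static double time_reduction_percent(double before, double after)
{
    return 100.0 * (before - after) / before;
}
```

For dct32x32, going from 7566.39 to 5181.02 gives a factor of about 1.46 and a time reduction of about 31.5%. For idct16x16, going from 863.30 to 910.01 gives a negative reduction, about 5.4% slower.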