[x265] [PATCH 00/14] AArch64: Add Armv8.4 Neon DotProd and Armv8.6 Neon I8MM implementations of ipfilter primitives

Mon Sep 9 08:27:51 UTC 2024

Hi Chen,

Thank you for reviewing the patches.

Regarding the patch that you highlighted:
    [PATCH 04/14] AArch64: Add Armv8.4 Neon DotProd implementations of filter_hpp

> performance result looks not good enough,
The key result for this patch is the performance uplift on Neoverse N1 (1.123x), since that machine does not support the Neon I8MM instructions.
The results for the other machines are included for completeness; those machines will instead run the Neon I8MM implementation:

  https://mailman.videolan.org/pipermail/x265-devel/2024-September/013907.html

The uplift reported there is copied here:

  Geomean uplift across all block sizes for chroma filters, relative to
  Armv8.4 Neon DotProd implementations:

      Neoverse N2: 1.402x
      Neoverse V1: 1.214x
      Neoverse V2: 1.289x
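
To illustrate why only the Neoverse N1 number matters for this patch, here is a minimal sketch of the selection order, assuming the most capable supported implementation is always preferred. The struct, function and string names are hypothetical and are not x265 identifiers:

```cpp
#include <string>

// Hypothetical feature flags for illustration only; x265's real CPU
// detection uses its own flag masks.
struct CpuFeatures {
    bool neon_dotprod; // Armv8.4 Neon DotProd available
    bool neon_i8mm;    // Armv8.6 Neon I8MM available
};

// Return the name of the implementation that would be selected,
// preferring the most capable extension the CPU supports.
std::string select_ipfilter_impl(const CpuFeatures &cpu)
{
    if (cpu.neon_i8mm)
        return "neon_i8mm";
    if (cpu.neon_dotprod)
        return "neon_dotprod";
    return "neon"; // Armv8.0 Neon baseline
}
```

Under this assumption, a machine with DotProd but without I8MM (such as Neoverse N1) gets "neon_dotprod", while a machine with both gets "neon_i8mm".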

> and why shortcut branch in case (coeffIdx == 4)?
The Armv8.0 Neon implementation can be highly specialized for a coeffIdx of 4, so the Armv8.4 Neon DotProd implementation is not faster for this filter. We therefore dispatch to the Armv8.0 Neon implementation in this case.
For the other values of coeffIdx, the uplift from the Armv8.4 Neon DotProd implementation (on Neoverse N1) is significant.
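
As a rough sketch of the shortcut (not the code from the patch), the dispatch could look like the following. The kernel names, signatures and the Kernel return value are hypothetical placeholders, used only to make the branch visible:

```cpp
#include <cstdint>

using pixel = uint8_t;

// Identifies which kernel ran; returned only so the sketch can be checked.
enum class Kernel { Neon, NeonDotProd };

// Hypothetical stand-in for the specialized Armv8.0 Neon kernel.
static Kernel filter_hpp_neon_stub(const pixel *, pixel *, int)
{
    return Kernel::Neon;
}

// Hypothetical stand-in for the Armv8.4 Neon DotProd kernel.
static Kernel filter_hpp_dotprod_stub(const pixel *, pixel *, int)
{
    return Kernel::NeonDotProd;
}

// Entry point: coeffIdx == 4 takes the shortcut to the Armv8.0 Neon
// kernel; every other coeffIdx uses the DotProd kernel.
Kernel filter_hpp_dispatch(const pixel *src, pixel *dst, int coeffIdx)
{
    if (coeffIdx == 4)
        return filter_hpp_neon_stub(src, dst, coeffIdx);

    return filter_hpp_dotprod_stub(src, dst, coeffIdx);
}
```

The branch costs one comparison per call and keeps the specialized path for the one filter where DotProd brings no gain.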

Many thanks,
Hari