Hi Chen,

Thank you for clarifying.

From the Arm CPU Software Optimisation Guides, LD1R requires an extra micro-op for the broadcast compared to the regular load (LDR). Benchmarking shows that using LD1R in the sad functions of width 4 is ~20% slower than using the LDR, ADD sequence.

Many thanks,
Hari