[x264-devel] [Git][videolan/x264][stable] 114 commits: ppc: Add x264_cpu_detect() for NetBSD/macppc
Anton Mitrofanov (@BugMaster)
gitlab at videolan.org
Mon Sep 1 19:01:52 UTC 2025
Anton Mitrofanov pushed to branch stable at VideoLAN / x264
Commits:
834c5c92 by Martin Husemann at 2023-10-01T17:35:48+03:00
ppc: Add x264_cpu_detect() for NetBSD/macppc
The altivec instruction set detection is very similar to FreeBSD
and OpenBSD, but uses slightly different sysctl selectors.
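As a rough illustration of the sysctl-based detection described above, here is a minimal sketch. The helper name detect_altivec_sketch is hypothetical, and the selector string "machdep.altivec" is an assumption about NetBSD/macppc rather than something stated in this commit message. On other systems the helper simply reports no AltiVec.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__NetBSD__)
#include <sys/param.h>
#include <sys/sysctl.h>
#endif

/* Hypothetical helper, not x264's x264_cpu_detect(): returns 1 if the
 * kernel reports AltiVec, 0 otherwise (or when not built for NetBSD). */
static int detect_altivec_sketch(void)
{
    int has_altivec = 0;
#if defined(__NetBSD__)
    size_t length = sizeof(has_altivec);
    /* ASSUMPTION: "machdep.altivec" is the NetBSD/macppc selector name. */
    if (sysctlbyname("machdep.altivec", &has_altivec, &length, NULL, 0) != 0)
        has_altivec = 0; /* treat a failed query as "no AltiVec" */
    assert(length == sizeof(has_altivec));
#endif
    return has_altivec != 0;
}
```

Treating a failed query as "not available" keeps the caller on the plain C code paths, which is the conservative choice for a CPU feature probe.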
- - - - -
249924ea by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add initial support for 10 bit neon
Add if/else clauses in the files to control which code is used.
Move the generic function out of the 8-bit depth scope into a common one
for both modes.
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
ba45eba3 by Hubert Mazur at 2023-10-01T15:13:40+00:00
aarch64/mc-c: Unify pixel/uint8_t usage
Previously, some functions from the motion compensation family used uint8_t,
while others used the pixel definition. Unify this and change every uint8_t
usage to pixel.
This commit is a prerequisite to 10 bit depth support.
Signed-off-by: Hubert Mazur <hum at semihalf.com>
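To illustrate why a single pixel type is a prerequisite for 10 bit support, here is a minimal sketch. It assumes a build-time switch named HIGH_BIT_DEPTH that selects the storage type. The helper avg_row_sketch is hypothetical, and the rounded average is chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: HIGH_BIT_DEPTH is a build-time switch (0 for 8-bit builds). */
#ifndef HIGH_BIT_DEPTH
#define HIGH_BIT_DEPTH 1
#endif

#if HIGH_BIT_DEPTH
typedef uint16_t pixel;   /* 10-bit samples need 16-bit storage */
#else
typedef uint8_t pixel;    /* 8-bit samples fit in one byte */
#endif

/* Hypothetical helper: rounded average of two rows. Because it is written
 * against `pixel`, the same source compiles for both bit depths. */
static void avg_row_sketch(pixel *dst, const pixel *a, const pixel *b, int width)
{
    assert(width > 0);
    for (int x = 0; x < width; x++)
        dst[x] = (pixel)((a[x] + b[x] + 1) >> 1);
}
```

A function declared with uint8_t pointers would silently truncate 10-bit samples, which is why the signatures have to change before any high bit depth code can be added.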
- - - - -
13a24888 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for pixel_avg
Provide neon optimized implementation for pixel_avg functions from
motion compensation family for 10 bit depth.
Checkasm benchmarks are shown below.
avg_4x2_c: 703
avg_4x2_neon: 222
avg_4x4_c: 1405
avg_4x4_neon: 516
avg_4x8_c: 2759
avg_4x8_neon: 898
avg_4x16_c: 5808
avg_4x16_neon: 1776
avg_8x4_c: 2767
avg_8x4_neon: 412
avg_8x8_c: 5559
avg_8x8_neon: 841
avg_8x16_c: 11176
avg_8x16_neon: 1668
avg_16x8_c: 10493
avg_16x8_neon: 1504
avg_16x16_c: 21116
avg_16x16_neon: 2985
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
bb3d83dd by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for pixel_avg2
Provide neon optimized implementation for pixel_avg2 functions from
motion compensation family for 10 bit depth.
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
f0b0489f by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for mc_copy
Provide neon optimized implementation for mc_copy functions from
motion compensation family for 10 bit depth.
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
25d5baf4 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for mc_weight
Provide neon optimized implementation for mc_weight functions from
motion compensation family for 10 bit depth.
Benchmark results are shown below.
weight_w4_c: 4734
weight_w4_neon: 4165
weight_w8_c: 8930
weight_w8_neon: 1620
weight_w16_c: 16939
weight_w16_neon: 2729
weight_w20_c: 20721
weight_w20_neon: 3470
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
08761208 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Move mc_luma and get_ref wrappers
The mc_luma and get_ref wrappers were only defined for 8 bit depth.
As all required 10 bit depth helper functions now exist, move them out of
the if scope so that they are always defined regardless of the bit depth.
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
7ff0f978 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for mc_chroma
Provide neon optimized implementation for mc_chroma functions from
motion compensation family for 10 bit depth.
Benchmark results are shown below.
mc_chroma_2x2_c: 700
mc_chroma_2x2_neon: 478
mc_chroma_2x4_c: 1300
mc_chroma_2x4_neon: 765
mc_chroma_4x2_c: 1229
mc_chroma_4x2_neon: 483
mc_chroma_4x4_c: 2383
mc_chroma_4x4_neon: 773
mc_chroma_4x8_c: 4662
mc_chroma_4x8_neon: 1319
mc_chroma_8x4_c: 4450
mc_chroma_8x4_neon: 940
mc_chroma_8x8_c: 8797
mc_chroma_8x8_neon: 1638
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
25ef8832 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for mc_integral
Provide neon optimized implementation for mc_integral functions from
motion compensation family for 10 bit depth.
Benchmark results are shown below.
integral_init4h_c: 2651
integral_init4h_neon: 550
integral_init4v_c: 4247
integral_init4v_neon: 612
integral_init8h_c: 2544
integral_init8h_neon: 1027
integral_init8v_c: 1996
integral_init8v_neon: 245
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
0a810f4f by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for mc_lowres
Provide neon optimized implementation for mc_lowres function from
motion compensation family for 10 bit depth.
Benchmark results are shown below.
lowres_init_c: 149446
lowres_init_neon: 13172
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
68d71206 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for mc_load func
Provide neon optimized implementation for mc_load_deinterleave function
from motion compensation family for 10 bit depth.
Benchmark results are shown below.
load_deinterleave_chroma_fdec_c: 2936
load_deinterleave_chroma_fdec_neon: 422
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
df179744 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for store func
Provide neon optimized implementation for mc_store_interleave function
from motion compensation family for 10 bit depth.
Benchmark results are shown below.
load_deinterleave_chroma_fenc_c: 2910
load_deinterleave_chroma_fenc_neon: 430
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
e47bede8 by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for copy funcs
Provide neon optimized implementation for mc_plane_copy function
from motion compensation family for 10 bit depth.
Benchmark results are shown below.
plane_copy_c: 2955
plane_copy_neon: 2910
plane_copy_deinterleave_c: 24056
plane_copy_deinterleave_neon: 3625
plane_copy_deinterleave_rgb_c: 19928
plane_copy_deinterleave_rgb_neon: 3941
plane_copy_interleave_c: 24399
plane_copy_interleave_neon: 4723
plane_copy_swap_c: 32269
plane_copy_swap_neon: 3211
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
cc5c343f by Hubert Mazur at 2023-10-01T15:13:40+00:00
mc: Add arm64 neon implementation for hpel filter
Provide neon optimized implementation for hpel_filter function
from motion compensation family for 10 bit depth.
Benchmark results are shown below.
hpel_filter_c: 111495
hpel_filter_neon: 37849
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
b8ea87e0 by Hubert Mazur at 2023-10-01T15:31:51+00:00
quant: Add neon implementation of quant functions
Provide arm64 neon implementations of quant functions for high
bit depth. Benchmarks are shown below.
quant_2x2_dc_c: 217
quant_2x2_dc_neon: 275
quant_4x4_c: 482
quant_4x4_neon: 326
quant_4x4_dc_c: 428
quant_4x4_dc_neon: 348
quant_4x4x4_c: 2508
quant_4x4x4_neon: 1027
quant_8x8_c: 2439
quant_8x8_neon: 936
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
986dd1f3 by Hubert Mazur at 2023-10-01T15:31:51+00:00
quant: Add implementation for dequant
Provide neon arm64 implementations for dequant functions for high bit
depth. Benchmarks are shown below.
dequant_4x4_cqm_c: 359
dequant_4x4_cqm_neon: 225
dequant_4x4_dc_cqm_c: 344
dequant_4x4_dc_cqm_neon: 208
dequant_4x4_dc_flat_c: 348
dequant_4x4_dc_flat_neon: 210
dequant_4x4_flat_c: 362
dequant_4x4_flat_neon: 227
dequant_8x8_cqm_c: 1526
dequant_8x8_cqm_neon: 517
dequant_8x8_flat_c: 1547
dequant_8x8_flat_neon: 520
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
66d000d2 by Hubert Mazur at 2023-10-01T15:31:51+00:00
quant: Add implementation for decimate functions
Provide neon arm64 implementations for decimate score functions
for high bit depth. Benchmarks are shown below.
decimate_score15_c: 273
decimate_score15_neon: 205
decimate_score16_c: 284
decimate_score16_neon: 208
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
7c62a144 by Hubert Mazur at 2023-10-01T15:31:51+00:00
quant: Add implementation for decimate64
Provide neon arm64 implementation for decimate_score64 for high bit
depth. Benchmarks are shown below.
decimate_score64_c: 894
decimate_score64_neon: 431
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
03c0e9a9 by Hubert Mazur at 2023-10-01T15:31:51+00:00
quant: Add neon implementations of coeff_last
Provide arm64 neon implementations for coeff_last functions for high bit
depth. Benchmarks are shown below.
coeff_last4_c: 79
coeff_last4_neon: 107
coeff_last8_c: 109
coeff_last8_neon: 154
coeff_last15_c: 161
coeff_last15_neon: 135
coeff_last16_c: 160
coeff_last16_neon: 132
coeff_last64_c: 782
coeff_last64_neon: 400
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
01e05671 by Hubert Mazur at 2023-10-01T15:31:51+00:00
quant: Add neon implementations of coeff_level_run
Provide arm64 neon implementations for coeff_level_run functions for high bit
depth. Benchmarks are shown below.
coeff_level_run4_c: 135
coeff_level_run4_neon: 155
coeff_level_run8_c: 181
coeff_level_run8_neon: 182
coeff_level_run15_c: 296
coeff_level_run15_neon: 275
coeff_level_run16_c: 305
coeff_level_run16_neon: 264
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
7882a368 by Hubert Mazur at 2023-10-01T15:31:51+00:00
quant: Add implementation for denoise_dct function
Provide arm64 neon implementation for denoise_dct function for high bit
depth. Benchmarks are shown below.
denoise_dct_c: 2149
denoise_dct_neon: 585
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
3afe3c82 by Hubert Mazur at 2023-10-01T15:45:18+00:00
pixel: Add neon sad_x3 implementations for 10 bit
Provide arm64 neon implementations for sad_x3 functions for 10 bit
depth. Benchmarks are shown below.
sad_x3_4x4_c: 710
sad_x3_4x4_neon: 286
sad_x3_4x8_c: 1422
sad_x3_4x8_neon: 430
sad_x3_8x4_c: 1350
sad_x3_8x4_neon: 269
sad_x3_8x8_c: 2851
sad_x3_8x8_neon: 440
sad_x3_8x16_c: 5597
sad_x3_8x16_neon: 734
sad_x3_16x8_c: 5414
sad_x3_16x8_neon: 722
sad_x3_16x16_c: 10729
sad_x3_16x16_neon: 1288
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
8a90ffa7 by Hubert Mazur at 2023-10-01T15:45:18+00:00
pixel: Add neon vsad implementations for 10 bit
Provide arm64 neon implementation for vsad function for 10 bit
depth. Benchmarks are shown below.
vsad_c: 3599
vsad_neon: 392
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
90b3391e by Hubert Mazur at 2023-10-01T15:45:18+00:00
pixel: Add neon asd8 implementations for 10 bit
Provide arm64 neon implementation for asd8 function for 10 bit
depth. Benchmarks are shown below.
asd8_c: 4400
asd8_neon: 857
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
8fd1e5f2 by Hubert Mazur at 2023-10-01T15:45:18+00:00
pixel: Add neon ssd implementations for 10 bit
Provide arm64 neon implementation for ssd functions for 10 bit
depth. Benchmarks are shown below.
ssd_4x4_c: 1466
ssd_4x4_neon: 240
ssd_4x8_c: 1918
ssd_4x8_neon: 482
ssd_4x16_c: 5258
ssd_4x16_neon: 1025
ssd_8x4_c: 1291
ssd_8x4_neon: 235
ssd_8x8_c: 2431
ssd_8x8_neon: 425
ssd_8x16_c: 4635
ssd_8x16_neon: 910
ssd_16x8_c: 4198
ssd_16x8_neon: 897
ssd_16x16_c: 8549
ssd_16x16_neon: 1907
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
1754f6b2 by Grzegorz Bernacki at 2023-10-01T15:45:18+00:00
pixel: Add neon satd implementations for 10 bit
Provide arm64 neon implementation for satd functions for 10 bit
depth. Benchmarks are shown below.
satd_4x4_c: 858
satd_4x4_neon: 712
satd_4x8_c: 1834
satd_4x8_neon: 812
satd_4x16_c: 3677
satd_4x16_neon: 1149
satd_8x4_c: 1290
satd_8x4_neon: 427
Signed-off-by: Grzegorz Bernacki <gjb at semihalf.com>
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
1b59a1f3 by Hubert Mazur at 2023-10-01T15:45:18+00:00
pixel: Add neon satd implementations for 10 bit
Provide arm64 neon implementation for satd 8x8 and 8x16 functions
for 10 bit depth. Benchmarks are shown below.
satd_8x8_c: 2143
satd_8x8_neon: 812
satd_8x16_c: 4228
satd_8x16_neon: 1504
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
a87a9f89 by Hubert Mazur at 2023-10-01T15:45:18+00:00
pixel: Add neon ssd_nv12 implementation for 10 bit
Provide arm64 neon implementation for ssd_nv12 function
for 10 bit depth. Benchmarks are shown below.
ssd_nv12_c: 181441
ssd_nv12_neon: 29037
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
7ae00538 by Hubert Mazur at 2023-10-01T15:45:18+00:00
Add neon pixel_var implementation for 10 bit
Provide arm64 neon implementation for pixel_var function
for 10 bit depth. Benchmarks are shown below.
var_8x8_c: 757
var_8x8_neon: 342
var_8x16_c: 1431
var_8x16_neon: 582
var_16x16_c: 2721
var_16x16_neon: 767
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
9927ac9a by Hubert Mazur at 2023-10-01T15:45:18+00:00
Add neon pixel_var2 implementation for 10 bit
Provide arm64 neon implementation for pixel_var2 function
for 10 bit depth. Benchmarks are shown below.
var2_8x8_c: 1988
var2_8x8_neon: 505
var2_8x16_c: 3800
var2_8x16_neon: 862
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
820fb5a7 by Hubert Mazur at 2023-10-01T15:45:18+00:00
pixel: Add neon satd implementations for 10 bit
Provide arm64 neon implementation for satd 16x8 and 16x16 functions
for 10 bit depth. Benchmarks are shown below.
satd_16x8_c: 4268
satd_16x8_neon: 1493
satd_16x16_c: 8382
satd_16x16_neon: 2908
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
8743a46d by Hubert Mazur at 2023-10-01T15:45:18+00:00
pixel: Add neon sa8d implementations for 10 bit
Provide arm64 neon implementation for sa8d 8x8 and 16x16 functions
for 10 bit depth. Benchmarks are shown below.
sa8d_8x8_c: 2914
sa8d_8x8_neon: 608
sa8d_16x16_c: 11469
sa8d_16x16_neon: 2030
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
0e6165de by Hubert Mazur at 2023-10-01T15:45:18+00:00
pixel: Add neon hadamard implementations for 10 bit
Provide arm64 neon implementation for hadamard_ac functions
for 10 bit depth. Benchmarks are shown below.
hadamard_ac_8x8_c: 2995
hadamard_ac_8x8_neon: 682
hadamard_ac_8x16_c: 5959
hadamard_ac_8x16_neon: 1207
hadamard_ac_16x8_c: 5963
hadamard_ac_16x8_neon: 1212
hadamard_ac_16x16_c: 11851
hadamard_ac_16x16_neon: 2260
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
67ad1cb6 by Hubert Mazur at 2023-10-01T15:45:18+00:00
pixel: Add neon ssim_core implementation for 10 bit
Provide arm64 neon implementation for ssim_core function
for 10 bit depth. Benchmarks are shown below.
ssim_core_c: 1315
ssim_core_neon: 470
Signed-off-by: Hubert Mazur <hum at semihalf.com>
- - - - -
5a9dfdde by Hubert Mazur at 2023-10-01T15:45:18+00:00
pixel: Add neon ssim_end implementation for 10 bit
Provide arm64 neon implementation for ssim_end function
for 10 bit depth. The implementation is based on the
previous one for 8 bit depth with a few differences like
IEEE-754 constant values and scheduling. The conversion
to floating point number must be done at the beginning
to prevent range overflows.
Benchmarks are shown below.
ssim_end_c: 715
ssim_end_neon: 380
Signed-off-by: Hubert Mazur <hum at semihalf.com>
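The remark about converting to floating point first can be illustrated with a small sketch. It assumes an 8x8 window (64 samples), a 10 bit maximum sample value of 1023, and the usual SSIM constants 0.01 and 0.03. The function names are hypothetical and the code is a plain C model, not the neon implementation.

```c
#include <assert.h>
#include <stdint.h>

#define PIXEL_MAX_10BIT 1023   /* assumed maximum sample value at 10 bit */

/* Shows why integer math is unsafe here: the product of two window sums
 * can exceed the 32-bit signed range at 10 bit depth. */
static int product_overflows_int32(int s1, int s2)
{
    return (int64_t)s1 * s2 > INT32_MAX;
}

/* Hypothetical model of the final SSIM step. s1/s2 are the sample sums of
 * the two blocks, ss the sum of squares of both, s12 the cross product sum. */
static float ssim_end1_sketch(int s1, int s2, int ss, int s12)
{
    const float c1 = 0.01f * 0.01f * PIXEL_MAX_10BIT * PIXEL_MAX_10BIT * 64;
    const float c2 = 0.03f * 0.03f * PIXEL_MAX_10BIT * PIXEL_MAX_10BIT * 64 * 63;
    assert(s1 >= 0 && s2 >= 0);
    /* Convert before multiplying so that no int product is ever formed. */
    float fs1 = (float)s1, fs2 = (float)s2;
    float fss = (float)ss, fs12 = (float)s12;
    float vars  = fss * 64 - fs1 * fs1 - fs2 * fs2;
    float covar = fs12 * 64 - fs1 * fs2;
    return (2 * fs1 * fs2 + c1) * (2 * covar + c2)
         / ((fs1 * fs1 + fs2 * fs2 + c1) * (vars + c2));
}
```

With 64 samples of value 1023 the sum is 65472, and its square is about 4.29e9, which no longer fits in a signed 32-bit integer. Two identical blocks should yield an SSIM of approximately 1.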
- - - - -
1ecc51ee by Loongson Technology Corporation Limited at 2023-10-10T09:00:09+08:00
loongarch: Init LSX/LASX support
LSX/LASX is the LOONGARCH 128-bit/256-bit SIMD Architecture.
Signed-off-by: Shiyou Yin <yinshiyou-hf at loongson.cn>
Signed-off-by: Xiwei Gu <guxiwei-hf at loongson.cn>
- - - - -
25ffd616 by Loongson Technology Corporation Limited at 2023-10-10T09:00:47+08:00
loongarch: Add loongson_asm.S and loongson_utils.S
Common macros and functions for loongson optimization.
Signed-off-by: Shiyou Yin <yinshiyou-hf at loongson.cn>
- - - - -
d7d283f6 by Loongson Technology Corporation Limited at 2023-10-10T09:04:49+08:00
loongarch: Improve the performance of deblock series functions.
Performance has improved from 4.76fps to 4.92fps.
Tested with following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
functions    performance (c)    performance (asm)
deblock_luma[0] 79 39
deblock_luma[1] 91 18
deblock_luma_intra[0] 63 44
deblock_luma_intra[1] 71 18
deblock_strength 104 33
Signed-off-by: Hao Chen <chenhao at loongson.cn>
- - - - -
00b8e3b9 by Loongson Technology Corporation Limited at 2023-10-10T09:09:52+08:00
loongarch: Improve the performance of sad/sad_x3/sad_x4 series functions
Performance has improved from 4.92fps to 6.32fps.
Tested with following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
functions    performance (c)    performance (asm)
sad_4x4 13 3
sad_4x8 26 7
sad_4x16 57 13
sad_8x4 24 3
sad_8x8 54 8
sad_8x16 108 13
sad_16x8 95 8
sad_16x16 189 13
sad_x3_4x4 37 6
sad_x3_4x8 71 13
sad_x3_8x4 70 8
sad_x3_8x8 162 14
sad_x3_8x16 323 25
sad_x3_16x8 279 15
sad_x3_16x16 555 27
sad_x4_4x4 49 8
sad_x4_4x8 95 17
sad_x4_8x4 94 8
sad_x4_8x8 214 16
sad_x4_8x16 429 33
sad_x4_16x8 372 18
sad_x4_16x16 740 34
Signed-off-by: wanglu <wanglu at loongson.cn>
- - - - -
d8ed272a by Loongson Technology Corporation Limited at 2023-10-10T09:13:58+08:00
loongarch: Improve the performance of predict series functions
Performance has improved from 6.32fps to 6.34fps.
Tested with following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
functions    performance (c)    performance (asm)
intra_predict_4x4_dc 3 2
intra_predict_4x4_dc8 1 1
intra_predict_4x4_dcl 2 1
intra_predict_4x4_dct 2 1
intra_predict_4x4_ddl 7 2
intra_predict_4x4_h 2 1
intra_predict_4x4_v 1 1
intra_predict_8x8_dc 8 2
intra_predict_8x8_dc8 1 1
intra_predict_8x8_dcl 5 2
intra_predict_8x8_dct 5 2
intra_predict_8x8_ddl 27 3
intra_predict_8x8_ddr 26 3
intra_predict_8x8_h 4 2
intra_predict_8x8_v 3 1
intra_predict_8x8_vl 29 3
intra_predict_8x8_vr 31 4
intra_predict_8x8c_dc 8 5
intra_predict_8x8c_dc8 1 1
intra_predict_8x8c_dcl 5 3
intra_predict_8x8c_dct 5 3
intra_predict_8x8c_h 4 2
intra_predict_8x8c_p 58 30
intra_predict_8x8c_v 4 1
intra_predict_16x16_dc 32 8
intra_predict_16x16_dc8 9 4
intra_predict_16x16_dcl 26 6
intra_predict_16x16_dct 26 6
intra_predict_16x16_h 23 7
intra_predict_16x16_p 182 44
intra_predict_16x16_v 22 4
Signed-off-by: Xiwei Gu <guxiwei-hf at loongson.cn>
- - - - -
65e7bac5 by Loongson Technology Corporation Limited at 2023-10-10T09:15:32+08:00
loongarch: Improve the performance of quant series functions
Performance has improved from 6.34fps to 6.78fps.
Tested with following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
functions    performance (c)    performance (asm)
coeff_last15 3 2
coeff_last16 3 1
coeff_last64 42 6
decimate_score15 8 12
decimate_score16 8 11
decimate_score64 61 43
dequant_4x4_cqm 16 5
dequant_4x4_dc_cqm 13 5
dequant_4x4_dc_flat 13 5
dequant_4x4_flat 16 5
dequant_8x8_cqm 71 9
dequant_8x8_flat 71 9
Signed-off-by: Shiyou Yin <yinshiyou-hf at loongson.cn>
- - - - -
981c8f25 by Loongson Technology Corporation Limited at 2023-10-12T17:27:40+08:00
loongarch: Improve the performance of mc series functions
Performance has improved from 6.78fps to 10.53fps.
Tested with following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
functions    performance (c)    performance (asm)
avg_4x2 16 5
avg_4x4 30 6
avg_4x8 63 10
avg_4x16 124 19
avg_8x4 60 6
avg_8x8 119 10
avg_8x16 233 19
avg_16x8 229 21
avg_16x16 451 41
get_ref_4x4 30 9
get_ref_4x8 52 11
get_ref_8x4 45 9
get_ref_8x8 80 11
get_ref_8x16 156 16
get_ref_12x10 137 13
get_ref_16x8 147 11
get_ref_16x16 282 16
get_ref_20x18 278 22
hpel_filter 5163 686
lowres_init 5440 286
mc_chroma_2x2 24 7
mc_chroma_2x4 42 10
mc_chroma_4x2 41 7
mc_chroma_4x4 75 10
mc_chroma_4x8 144 19
mc_chroma_8x4 137 15
mc_chroma_8x8 269 28
mc_luma_4x4 30 10
mc_luma_4x8 52 12
mc_luma_8x4 44 10
mc_luma_8x8 80 13
mc_luma_8x16 156 19
mc_luma_16x8 147 13
mc_luma_16x16 281 19
memcpy_aligned 14 9
memzero_aligned 24 4
offsetadd_w4 79 18
offsetadd_w8 142 18
offsetadd_w16 277 25
offsetadd_w20 1118 38
offsetsub_w4 75 18
offsetsub_w8 140 18
offsetsub_w16 265 25
offsetsub_w20 989 39
weight_w4 111 19
weight_w8 205 19
weight_w16 396 29
weight_w20 1143 45
deinterleave_chroma_fdec 76 9
deinterleave_chroma_fenc 86 9
plane_copy_deinterleave 733 90
plane_copy_interleave 791 245
store_interleave_chroma 82 12
Signed-off-by: Xiwei Gu <guxiwei-hf at loongson.cn>
- - - - -
fa7f1fce by Loongson Technology Corporation Limited at 2023-10-12T17:28:15+08:00
loongarch: Improve the performance of dct series functions
Performance has improved from 10.53fps to 11.27fps.
Tested with following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
functions    performance (c)    performance (asm)
add4x4_idct 34 9
add8x8_idct 139 31
add8x8_idct8 269 39
add8x8_idct_dc 67 7
add16x16_idct 564 123
add16x16_idct_dc 260 22
dct4x4dc 18 10
idct4x4dc 16 9
sub4x4_dct 25 7
sub8x8_dct 101 12
sub8x8_dct8 160 25
sub16x16_dct 403 52
sub16x16_dct8 646 68
zigzag_scan_4x4_frame 4 1
Signed-off-by: zhoupeng <zhoupeng at loongson.cn>
- - - - -
5f84d403 by Loongson Technology Corporation Limited at 2023-10-12T17:28:23+08:00
loongarch: Improve the performance of pixel series functions
Performance has improved from 11.27fps to 20.50fps, measured with the
following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv
functions    performance (c)    performance (asm)
hadamard_ac_8x8 117 21
hadamard_ac_8x16 236 42
hadamard_ac_16x8 235 31
hadamard_ac_16x16 473 60
intra_sad_x3_4x4 50 21
intra_sad_x3_8x8 183 34
intra_sad_x3_8x8c 181 36
intra_sad_x3_16x16 643 68
intra_satd_x3_4x4 83 61
intra_satd_x3_8x8c 344 81
intra_satd_x3_16x16 1389 136
sa8d_8x8 97 19
sa8d_16x16 394 68
satd_4x4 24 8
satd_4x8 51 11
satd_4x16 103 24
satd_8x4 52 9
satd_8x8 108 12
satd_8x16 218 24
satd_16x8 218 19
satd_16x16 437 38
ssd_4x4 10 5
ssd_4x8 24 8
ssd_4x16 42 15
ssd_8x4 23 5
ssd_8x8 37 9
ssd_8x16 74 17
ssd_16x8 72 11
ssd_16x16 140 23
var2_8x8 91 37
var2_8x16 176 66
var_8x8 50 15
var_8x16 65 29
var_16x16 132 56
Signed-off-by: Hecai Yuan <yuanhecai at loongson.cn>
- - - - -
db9bc75b by Martin Storsjö at 2023-10-18T11:23:47+03:00
configure: Check for support for AArch64 SVE and SVE2
We don't expect the user to build the whole x264 codebase with
SVE/SVE2 enabled, as we only enable this feature for the assembly
files that use it, in order to have binaries that are portable
and enable the SVE codepaths at runtime if supported.
- - - - -
9c3c7168 by Martin Storsjö at 2023-10-19T22:58:11+03:00
Add cpu flags and runtime detection of SVE and SVE2
We could also use HWCAP_SVE and HWCAP2_SVE2 for detecting this,
but these might not be available in all userland headers, while
HWCAP_CPUID is available much earlier.
The register ID_AA64ZFR0_EL1, which indicates if SVE2 is available,
can only be accessed if SVE is available. If not building all the
C code with SVE enabled (which could make it impossible to run on
HW without SVE), binutils refuses to assemble an instruction
reading ID_AA64ZFR0_EL1 - but if referring to it with the technical
name S3_0_C0_C4_4, it can be assembled even without any extra
extensions enabled.
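The detection order described above can be sketched as follows. This is a minimal model under stated assumptions, not the code from the commit: the function names are hypothetical, the fallback value for HWCAP_CPUID (bit 11) is only used if the userland headers lack it, and the field positions (ID_AA64PFR0_EL1 bits 35:32 for SVE, ID_AA64ZFR0_EL1 bits 3:0 for the SVE version) are taken from the Arm architecture definition. On non-AArch64 or non-Linux builds the probe reports nothing.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>
#ifndef HWCAP_CPUID
#define HWCAP_CPUID (1 << 11)  /* fallback for older userland headers */
#endif
#endif

/* ID_AA64PFR0_EL1 bits [35:32] are nonzero when SVE is implemented. */
static int pfr0_has_sve(uint64_t pfr0)  { return ((pfr0 >> 32) & 0xf) != 0; }
/* ID_AA64ZFR0_EL1 bits [3:0] are nonzero when SVE2 is implemented. */
static int zfr0_has_sve2(uint64_t zfr0) { return (zfr0 & 0xf) != 0; }

/* Hypothetical probe: bit 0 of the result means SVE, bit 1 means SVE2. */
static unsigned detect_sve_sketch(void)
{
    unsigned flags = 0;
#if defined(__aarch64__) && defined(__linux__)
    if (getauxval(AT_HWCAP) & HWCAP_CPUID) {
        uint64_t pfr0, zfr0;
        __asm__ volatile("mrs %0, ID_AA64PFR0_EL1" : "=r"(pfr0));
        if (pfr0_has_sve(pfr0)) {
            flags |= 1;
            /* Only read ZFR0 once SVE is known to exist; the numeric name
             * assembles without enabling the SVE extension. */
            __asm__ volatile("mrs %0, S3_0_C0_C4_4" : "=r"(zfr0));
            if (zfr0_has_sve2(zfr0))
                flags |= 2;
        }
    }
#endif
    assert(!(flags & 2) || (flags & 1)); /* SVE2 implies SVE */
    return flags;
}
```

Keeping the bit-field decoding in separate helpers makes that part checkable on any host, while the register reads only compile on AArch64 Linux.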
- - - - -
d46938de by Anton Mitrofanov at 2023-10-24T22:07:14+03:00
Fix VBV with sliced threads
- - - - -
4664f5aa by Martin Storsjö at 2023-11-02T13:27:08+02:00
aarch64: Improve scheduling in sad_x3/sad_x4
Cortex A53 A72 A73
8 bpc:
Before:
sad_x3_4x4_neon: 580 303 204
sad_x3_4x8_neon: 1065 516 323
sad_x3_8x4_neon: 668 262 282
sad_x3_8x8_neon: 1238 454 471
sad_x3_8x16_neon: 2378 842 847
sad_x3_16x8_neon: 2136 738 776
sad_x3_16x16_neon: 4162 1378 1463
After:
sad_x3_4x4_neon: 477 298 206
sad_x3_4x8_neon: 842 515 327
sad_x3_8x4_neon: 603 260 279
sad_x3_8x8_neon: 1110 451 464
sad_x3_8x16_neon: 2125 841 843
sad_x3_16x8_neon: 2124 730 766
sad_x3_16x16_neon: 4145 1370 1434
10 bpc:
Before:
sad_x3_4x4_neon: 632 247 254
sad_x3_4x8_neon: 1162 419 443
sad_x3_8x4_neon: 890 358 416
sad_x3_8x8_neon: 1670 632 759
sad_x3_8x16_neon: 3230 1179 1458
sad_x3_16x8_neon: 3070 1209 1403
sad_x3_16x16_neon: 6030 2333 2699
After:
sad_x3_4x4_neon: 522 253 255
sad_x3_4x8_neon: 932 443 431
sad_x3_8x4_neon: 880 354 406
sad_x3_8x8_neon: 1660 626 736
sad_x3_8x16_neon: 3220 1170 1397
sad_x3_16x8_neon: 3060 1184 1362
sad_x3_16x16_neon: 6020 2272 2579
Thus, this is around a 20-25% speedup on Cortex A53 for the small
sizes (much smaller difference for bigger sizes though), while it
doesn't make much of a difference at all (mostly within measurement
noise) for the out-of-order cores (A72 and A73).
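The percentages quoted above can be reproduced from the tables with a small helper. This sketch assumes the checkasm figures are elapsed timer units where lower is better. The function names are hypothetical.

```c
#include <assert.h>

/* Ratio of old to new run time; 1.25 means "25% faster". */
static double speedup_ratio(int before, int after)
{
    assert(after > 0);
    return (double)before / after;
}

/* Share of the original run time that was saved, in percent. */
static double time_saved_percent(int before, int after)
{
    assert(before > 0);
    return 100.0 * (before - after) / before;
}
```

For the Cortex A53 at 8 bpc, sad_x3_4x4 goes from 580 to 477 and sad_x3_4x8 from 1065 to 842, giving ratios of roughly 1.22 and 1.26. Note that the same change expressed as time saved is smaller (about 18% for the 4x4 case), so it matters which of the two measures a quoted percentage refers to.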
- - - - -
dc755eab by Martin Storsjö at 2023-11-02T21:26:03+00:00
aarch64: Use rounded right shifts in dequant
Don't manually add in the rounding constant (via a fused multiply-add
instruction) when we can just do a plain rounded right shift.
Cortex A53 A72 A73
8bpc:
Before:
dequant_4x4_cqm_neon: 515 246 267
dequant_4x4_dc_cqm_neon: 410 265 266
dequant_4x4_dc_flat_neon: 413 271 271
dequant_4x4_flat_neon: 519 254 274
dequant_8x8_cqm_neon: 1555 980 1002
dequant_8x8_flat_neon: 1562 994 1014
After:
dequant_4x4_cqm_neon: 499 246 255
dequant_4x4_dc_cqm_neon: 376 265 255
dequant_4x4_dc_flat_neon: 378 271 260
dequant_4x4_flat_neon: 500 254 262
dequant_8x8_cqm_neon: 1489 900 925
dequant_8x8_flat_neon: 1493 915 938
10bpc:
Before:
dequant_4x4_cqm_neon: 483 275 275
dequant_4x4_dc_cqm_neon: 429 256 261
dequant_4x4_dc_flat_neon: 435 267 267
dequant_4x4_flat_neon: 487 283 288
dequant_8x8_cqm_neon: 1511 1112 1076
dequant_8x8_flat_neon: 1518 1139 1089
After:
dequant_4x4_cqm_neon: 472 255 239
dequant_4x4_dc_cqm_neon: 404 256 232
dequant_4x4_dc_flat_neon: 406 267 234
dequant_4x4_flat_neon: 472 255 239
dequant_8x8_cqm_neon: 1462 922 978
dequant_8x8_flat_neon: 1462 922 978
This makes it around 3% faster on the Cortex A53, around 8% faster
for 8bpc on Cortex A72/A73, and around 10-20% faster for 10bpc
on A72/A73.
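The equivalence this change relies on can be sketched in scalar C. The function names are hypothetical, the first form models adding the rounding constant before a plain shift, and the second models a rounding right shift. The sketch assumes that right-shifting a negative int is an arithmetic shift, which is implementation-defined in C but holds on common compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Old style: add half of the divisor, then shift right. */
static int32_t dequant_manual_rounding(int32_t coef, int32_t mf, int shift)
{
    assert(shift >= 1 && shift < 31);
    return (coef * mf + (1 << (shift - 1))) >> shift;
}

/* New style: a rounding right shift, modelled as a truncating shift plus
 * the last bit that was shifted out. */
static int32_t dequant_rounded_shift(int32_t coef, int32_t mf, int shift)
{
    assert(shift >= 1 && shift < 31);
    int32_t v = coef * mf;
    return (v >> shift) + ((v >> (shift - 1)) & 1);
}
```

Both forms return the same value for every input that does not overflow, so the rounding constant no longer has to be materialised and added separately.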
- - - - -
3bc7c362 by Martin Storsjö at 2023-11-02T23:31:40+02:00
arm: Make the assembly indentation slightly more consistent
The assembly currently uses a mixture of different styles. Don't
make all of it entirely consistent now, but try to make functions
more consistent within themselves at least.
In particular, get rid of the convention to have braces hanging
outside of the alignment line.
- - - - -
ef572b9f by Martin Storsjö at 2023-11-02T23:34:22+02:00
aarch64: Make the assembly indentation slightly more consistent
The assembly currently uses a mixture of different styles. Don't
make all of it entirely consistent now, but try to make functions
more consistent within themselves at least.
In particular, get rid of the convention to have braces hanging
outside of the alignment line.
Some functions have the whole content indented off by one char
compared to other functions; adjust those (but retain the functions
that are self-consistent and match either of the common styles).
- - - - -
a354f11f by Martin Storsjö at 2023-11-02T23:34:23+02:00
aarch64: Consistently use lowercase vector element specifiers
- - - - -
611b87b7 by Martin Storsjö at 2023-11-14T12:38:47+02:00
checkasm: Print the actual SVE vector length
- - - - -
9b3e653b by Martin Storsjö at 2023-11-14T12:44:15+00:00
ci: Update the build-debian-amd64 job to a new base image
In the new version, there's no longer any "wine64" executable,
but both i386 and x86_64 are handled with the same "wine" frontend.
- - - - -
c1962404 by Martin Storsjö at 2023-11-14T12:44:15+00:00
ci: Test the aarch64 build in QEMU with varying SVE sizes
The sve-default-vector-length property sets the maximum vector
length in bytes; the default is 64, i.e. handling up to 512
bit vectors. In order to be able to test 1024 and 2048 bit vectors,
this has to be raised separately from setting the sve<n>=on
property.
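The byte and bit figures above relate as in this small sketch, which also shows one way a Linux process could query its current SVE vector length. The prctl(PR_SVE_GET_VL) query is a technique chosen for illustration and is not necessarily what checkasm uses. The function names are hypothetical.

```c
#include <assert.h>
#if defined(__linux__)
#include <sys/prctl.h>
#endif

/* The QEMU property is given in bytes; vector widths are quoted in bits. */
static int sve_bytes_to_bits(int bytes)
{
    assert(bytes > 0);
    return bytes * 8;
}

/* Returns the current SVE vector length in bytes, or -1 if the kernel or
 * CPU does not support SVE (or the headers lack the prctl constants). */
static int query_sve_vector_bytes(void)
{
#if defined(__linux__) && defined(PR_SVE_GET_VL) && defined(PR_SVE_VL_LEN_MASK)
    int ret = prctl(PR_SVE_GET_VL);
    if (ret >= 0)
        return ret & PR_SVE_VL_LEN_MASK; /* low bits hold the length */
#endif
    return -1;
}
```

With these helpers, the default of 64 bytes corresponds to 512 bit vectors, and testing 2048 bit vectors requires raising the limit to 256 bytes.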
- - - - -
b6190c6f by David Chen at 2023-11-18T08:42:48+02:00
Create Common NEON dct-a Macros
Place NEON dct-a macros that are intended to be
used by SVE/SVE2 functions as well in a common file.
- - - - -
5c382660 by David Chen at 2023-11-20T08:03:51+02:00
Improve dct-a.S Performance by Using SVE/SVE2
Improve the performance of NEON functions of aarch64/dct-a.S
by using the SVE/SVE2 instruction set. Below, the specific functions
are listed together with the improved performance results.
Command executed: ./checkasm8 --bench=sub
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
sub4x4_dct_c: 528
sub4x4_dct_neon: 322
sub4x4_dct_sve: 247
Command executed: ./checkasm8 --bench=sub
Testbed: AWS Graviton3
Results:
sub4x4_dct_c: 562
sub4x4_dct_neon: 376
sub4x4_dct_sve: 255
Command executed: ./checkasm8 --bench=add
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
add4x4_idct_c: 698
add4x4_idct_neon: 386
add4x4_idct_sve2: 345
Command executed: ./checkasm8 --bench=zigzag
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
zigzag_interleave_8x8_cavlc_frame_c: 582
zigzag_interleave_8x8_cavlc_frame_neon: 273
zigzag_interleave_8x8_cavlc_frame_sve: 257
Command executed: ./checkasm8 --bench=zigzag
Testbed: AWS Graviton3
Results:
zigzag_interleave_8x8_cavlc_frame_c: 587
zigzag_interleave_8x8_cavlc_frame_neon: 257
zigzag_interleave_8x8_cavlc_frame_sve: 249
- - - - -
37949a99 by David Chen at 2023-11-20T08:03:53+02:00
Create Common NEON deblock-a Macros
Place NEON deblock-a macros that are intended to be
used by SVE/SVE2 functions as well in a common file.
- - - - -
5ad5e5d8 by David Chen at 2023-11-20T08:03:54+02:00
Improve deblock-a.S Performance by Using SVE/SVE2
Improve the performance of NEON functions of aarch64/deblock-a.S
by using the SVE/SVE2 instruction set. Below, the specific functions
are listed together with the improved performance results.
Command executed: ./checkasm8 --bench=deblock
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
deblock_chroma[1]_c: 735
deblock_chroma[1]_neon: 427
deblock_chroma[1]_sve: 353
Command executed: ./checkasm8 --bench=deblock
Testbed: AWS Graviton3
Results:
deblock_chroma[1]_c: 719
deblock_chroma[1]_neon: 442
deblock_chroma[1]_sve: 345
- - - - -
21a788f1 by David Chen at 2023-11-23T08:24:13+02:00
Create Common NEON mc-a Macros and Functions
Place NEON mc-a macros and functions that are intended
to be used by SVE/SVE2 functions as well in a common file.
- - - - -
06dcf3f9 by David Chen at 2023-11-23T08:24:16+02:00
Improve mc-a.S Performance by Using SVE/SVE2
Improve the performance of NEON functions of aarch64/mc-a.S
by using the SVE/SVE2 instruction set. Below, the specific functions
are listed together with the improved performance results.
Command executed: ./checkasm8 --bench=avg
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
avg_4x2_c: 274
avg_4x2_neon: 215
avg_4x2_sve: 171
avg_4x4_c: 461
avg_4x4_neon: 343
avg_4x4_sve: 225
avg_4x8_c: 806
avg_4x8_neon: 619
avg_4x8_sve: 334
avg_4x16_c: 1523
avg_4x16_neon: 1168
avg_4x16_sve: 558
Command executed: ./checkasm8 --bench=avg
Testbed: AWS Graviton3
Results:
avg_4x2_c: 267
avg_4x2_neon: 213
avg_4x2_sve: 167
avg_4x4_c: 467
avg_4x4_neon: 350
avg_4x4_sve: 221
avg_4x8_c: 784
avg_4x8_neon: 624
avg_4x8_sve: 302
avg_4x16_c: 1445
avg_4x16_neon: 1182
avg_4x16_sve: 485
- - - - -
0ac52d29 by David Chen at 2023-11-23T08:26:53+02:00
Create Common NEON pixel-a Macros and Constants
Place NEON pixel-a macros and constants that are intended
to be used by SVE/SVE2 functions as well in a common file.
- - - - -
c1c9931d by David Chen at 2023-11-23T19:01:29+02:00
Improve pixel-a.S Performance by Using SVE/SVE2
Improve the performance of NEON functions of aarch64/pixel-a.S
by using the SVE/SVE2 instruction set. Below, the specific functions
are listed together with the improved performance results.
Command executed: ./checkasm8 --bench=ssd
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
ssd_4x4_c: 235
ssd_4x4_neon: 226
ssd_4x4_sve: 151
ssd_4x8_c: 409
ssd_4x8_neon: 363
ssd_4x8_sve: 201
ssd_4x16_c: 781
ssd_4x16_neon: 653
ssd_4x16_sve: 313
ssd_8x4_c: 402
ssd_8x4_neon: 192
ssd_8x4_sve: 192
ssd_8x8_c: 728
ssd_8x8_neon: 275
ssd_8x8_sve: 275
Command executed: ./checkasm10 --bench=ssd
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
ssd_4x4_c: 256
ssd_4x4_neon: 226
ssd_4x4_sve: 153
ssd_4x8_c: 460
ssd_4x8_neon: 369
ssd_4x8_sve: 215
ssd_4x16_c: 852
ssd_4x16_neon: 651
ssd_4x16_sve: 340
Command executed: ./checkasm8 --bench=ssd
Testbed: AWS Graviton3
Results:
ssd_4x4_c: 295
ssd_4x4_neon: 288
ssd_4x4_sve: 228
ssd_4x8_c: 454
ssd_4x8_neon: 431
ssd_4x8_sve: 294
ssd_4x16_c: 779
ssd_4x16_neon: 631
ssd_4x16_sve: 438
ssd_8x4_c: 463
ssd_8x4_neon: 247
ssd_8x4_sve: 246
ssd_8x8_c: 781
ssd_8x8_neon: 413
ssd_8x8_sve: 353
Command executed: ./checkasm10 --bench=ssd
Testbed: AWS Graviton3
Results:
ssd_4x4_c: 322
ssd_4x4_neon: 335
ssd_4x4_sve: 240
ssd_4x8_c: 522
ssd_4x8_neon: 448
ssd_4x8_sve: 294
ssd_4x16_c: 832
ssd_4x16_neon: 603
ssd_4x16_sve: 440
Command executed: ./checkasm8 --bench=sa8d
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
sa8d_8x8_c: 2103
sa8d_8x8_neon: 619
sa8d_8x8_sve: 617
Command executed: ./checkasm8 --bench=sa8d
Testbed: AWS Graviton3
Results:
sa8d_8x8_c: 2021
sa8d_8x8_neon: 597
sa8d_8x8_sve: 580
Command executed: ./checkasm8 --bench=var
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
var_8x8_c: 595
var_8x8_neon: 262
var_8x8_sve: 262
var_8x16_c: 1193
var_8x16_neon: 435
var_8x16_sve: 419
Command executed: ./checkasm8 --bench=var
Testbed: AWS Graviton3
Results:
var_8x8_c: 616
var_8x8_neon: 229
var_8x8_sve: 222
var_8x16_c: 1207
var_8x16_neon: 399
var_8x16_sve: 389
Command executed: ./checkasm8 --bench=hadamard_ac
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
hadamard_ac_8x8_c: 2330
hadamard_ac_8x8_neon: 635
hadamard_ac_8x8_sve: 635
hadamard_ac_8x16_c: 4500
hadamard_ac_8x16_neon: 1152
hadamard_ac_8x16_sve: 1151
hadamard_ac_16x8_c: 4499
hadamard_ac_16x8_neon: 1151
hadamard_ac_16x8_sve: 1150
hadamard_ac_16x16_c: 8812
hadamard_ac_16x16_neon: 2187
hadamard_ac_16x16_sve: 2186
Command executed: ./checkasm8 --bench=hadamard_ac
Testbed: AWS Graviton3
Results:
hadamard_ac_8x8_c: 2266
hadamard_ac_8x8_neon: 517
hadamard_ac_8x8_sve: 513
hadamard_ac_8x16_c: 4444
hadamard_ac_8x16_neon: 867
hadamard_ac_8x16_sve: 849
hadamard_ac_16x8_c: 4443
hadamard_ac_16x8_neon: 880
hadamard_ac_16x8_sve: 868
hadamard_ac_16x16_c: 8595
hadamard_ac_16x16_neon: 1656
hadamard_ac_16x16_sve: 1622
- - - - -
4815ccad by Anton Mitrofanov at 2024-01-13T14:45:39+03:00
Bump dates to 2024
- - - - -
436be41f by Henrik Gramner at 2024-02-19T23:49:36+01:00
x86inc: Properly sort instructions in alphabetical order
- - - - -
5207a74e by Henrik Gramner at 2024-02-20T00:02:59+01:00
x86inc: Add template defines for EVEX broadcasts
Broadcasting a memory operand is a binary flag, you either broadcast
or you don't, and there's only a single possible element size for
any given instruction.
The instruction syntax however requires the broadcast semantics
to be explicitly defined, which is an issue when using macros to
template code for multiple register widths.
Add some helper defines to alleviate the issue.
- - - - -
a6b56179 by Henrik Gramner at 2024-02-20T00:03:09+01:00
x86inc: Add CLMUL cpu flag
Also make the GFNI cpu flag imply the presence of both AESNI and CLMUL.
- - - - -
87476b4c by Henrik Gramner at 2024-02-20T00:03:09+01:00
x86inc: Add a cpu flag for the Ice Lake AVX-512 subset
- - - - -
6fc4480c by Henrik Gramner at 2024-02-20T00:03:09+01:00
x86inc.asm: Add the crc32 SSE4.2 GPR instruction
- - - - -
12426f5f by Henrik Gramner at 2024-02-20T00:03:09+01:00
x86inc: Add support for ELF CET properties
Automatically flag x86-64 asm object files as SHSTK-compatible.
Shadow Stack (SHSTK) is a part of Control-flow Enforcement Technology
(CET) which is a feature aimed at defending against ROP attacks by
verifying that 'call' and 'ret' instructions are correctly matched.
For well-written code this works transparently without any code changes,
as return addresses popped from the shadow stack should match return
addresses popped from the normal stack for performance reasons anyway.
- - - - -
ea08f586 by Anton Mitrofanov at 2024-02-28T23:19:23+03:00
CI: Add config.log to job artifacts
- - - - -
7241d020 by Anton Mitrofanov at 2024-02-28T23:23:15+03:00
CI: Switch 32/64-bit windows builds to LLVM
Use same Docker images as VLC for contrib compilation.
- - - - -
be4f0200 by Martin Storsjö at 2024-02-28T22:26:17+00:00
aarch64: Use regular hwcaps flags instead of HWCAP_CPUID for CPU feature detection on Linux
This makes the code much simpler (especially for adding support
for other instruction set extensions), avoids needing inline
assembly for this feature, and generally is more of the canonical
way to do this.
The CPU feature detection was added in
9c3c71688226fbb23f4d36399fab08f018e760b0, using HWCAP_CPUID.
The argument for using that was that HWCAP_CPUID was added much
earlier in the kernel (in Linux v4.11), while the HWCAP flags for
individual features always come later. This allows detecting support
for new CPU extensions before the kernel exposes information about
them via hwcap flags.
However, in practice there's probably very little advantage in this.
E.g. HWCAP_SVE was added in Linux v4.15, and HWCAP2_SVE2 was added in
v5.10 - later than HWCAP_CPUID, but there are probably very few
practical cases where one would run a kernel older than that on a CPU
that supports those instructions.
Additionally, we provide our own definitions of the flag values to
check (as they are fixed constants anyway), with names not conflicting
with the ones from system headers. This reduces the number of ifdefs
needed, and allows detecting those features even if building with
userland headers that are lacking the definitions of those flags.
Also, slightly older versions of QEMU, e.g. 6.2 in Ubuntu 22.04,
do expose support for these features via HWCAP flags, but the
emulated cpuid registers are missing the bits for exposing e.g. SVE2.
(This issue is fixed in later versions of QEMU though.)
Also drop the ifdef check for whether AT_HWCAP is defined; it was
added to glibc in 1997. AT_HWCAP2 was added in 2013, in glibc 2.18,
which also precedes when aarch64 was commonly used anyway, so
don't guard the use of that with an ifdef.
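
As an illustration of the approach described above, here is a minimal
sketch of hwcap-based detection with locally named flag constants. The
MY_* names and the two helper functions are hypothetical and are not
taken from the x264 sources; only getauxval(), AT_HWCAP, AT_HWCAP2 and
the bit positions of the SVE/SVE2 hwcap flags come from the Linux
userspace ABI.

```c
#include <stdint.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP, AT_HWCAP2 */
#endif

/* Locally named copies of the kernel's aarch64 hwcap bits (hypothetical
 * names). Defining them ourselves avoids #ifdefs for old system headers. */
#define MY_HWCAP_SVE   (1UL << 22)  /* reported in AT_HWCAP  */
#define MY_HWCAP2_SVE2 (1UL << 1)   /* reported in AT_HWCAP2 */

/* Hypothetical feature flags as the application would store them. */
enum {
    MY_CPU_SVE  = 1 << 0,
    MY_CPU_SVE2 = 1 << 1
};

/* Pure mapping from raw hwcap words to application flags. */
static unsigned flags_from_hwcaps(unsigned long hwcap, unsigned long hwcap2)
{
    unsigned flags = 0;
    if (hwcap & MY_HWCAP_SVE)
        flags |= MY_CPU_SVE;
    if (hwcap2 & MY_HWCAP2_SVE2)
        flags |= MY_CPU_SVE2;
    return flags;
}

/* Query the kernel on aarch64 Linux; report nothing elsewhere. */
static unsigned detect_cpu_flags(void)
{
#if defined(__linux__) && defined(__aarch64__)
    return flags_from_hwcaps(getauxval(AT_HWCAP), getauxval(AT_HWCAP2));
#else
    return 0;
#endif
}
```

Keeping the bit-to-flag mapping in a pure function means it can be
exercised on any host, while only detect_cpu_flags() depends on the
platform.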
- - - - -
de1bea53 by Anton Mitrofanov at 2024-03-12T23:10:12+03:00
ppc: Fix incompatible pointer type errors
Use correct return type for pixel_sad_x3/x4 functions.
Bug report by Dominik 'Rathann' Mierzejewski.
- - - - -
3d8aff7e by Henrik Gramner at 2024-03-14T23:29:26+00:00
x86inc: Fix warnings with old nasm versions
- - - - -
4df71a75 by Henrik Gramner at 2024-03-14T23:29:26+00:00
x86inc: Restore the stack state between stack allocations
Allows the use of multiple independent stack allocations within
a function without having to manually fiddle with stack offsets.
- - - - -
585e0199 by Henrik Gramner at 2024-03-14T23:29:26+00:00
x86inc: Improve XMM-spilling functionality on 64-bit Windows
Prior to this change dealing with the scenario where the number of
XMM registers spilled depends on if a branch is taken or not was
complicated to handle well. There were essentially three options:
1) Always spill the largest number of XMM registers. Results in
unnecessary spills.
2) Do the spilling after the branch. Results in code duplication
for the shared subset of spills.
3) Do the spilling manually. Optimal, but overly complex and vexing.
This adds an additional optional argument to the WIN64_SPILL_XMM
and WIN64_PUSH_XMM macros to make it possible to allocate space
for a certain number of registers but initially only push a subset
of those, with the option of pushing additional registers later.
- - - - -
982d3240 by Xiwei Gu at 2024-03-21T09:17:09+08:00
loongarch: Update loongson_asm.S version to 0.4.0
- - - - -
5a61afdb by Xiwei Gu at 2024-03-21T09:18:00+08:00
loongarch: Add checkasm_call
- - - - -
16262286 by Xiwei Gu at 2024-03-21T09:18:32+08:00
loongarch: Fixed pixel_sa8d_16x16_lasx
Save and restore FPR
- - - - -
7ed753b1 by Xiwei Gu at 2024-03-21T09:18:50+08:00
loongarch: Enhance ultrafast encoding performance
Using the following command, ultrafast encoding
has improved from 182fps to 189fps:
./x264 --preset ultrafast -o out.mkv yuv_1920x1080.yuv
- - - - -
4613ac3c by Henrik Gramner at 2024-05-13T17:54:15+02:00
x86inc: Improve ELF PIC support for external function calls
PLT/GOT indirections are required in some cases. Most commonly when
calling functions from other shared libraries, but also in some
scenarios when calling functions with default symbol visibility
even within the same component on certain elf64 platforms.
On elf64 we can simply use PLT relocations for all calls to external
functions. Since the linker is able to eliminate unnecessary PLT
indirections with the final output binary being identical to non-PLT
relocations there isn't really any downside to doing so. This mimics
what regular compilers normally do for calls to external functions.
On elf32 with PIC we can use a function pointer from the GOT when
calling external functions, similar to what regular compilers do when
using -fno-plt. Since this both introduces overhead and clobbers one
register, which could potentially have been used for custom calling
conventions when calling other asm functions within the same library,
it's only performed for functions declared using 'cextern_naked'.
- - - - -
c24e06c2 by Martin Storsjö at 2024-09-17T14:07:10+03:00
configure: Check for SVE support in MS armasm64 via as_check
This is mostly supported in armasm64 since MSVC 2022 17.10.
- - - - -
3a8b5be2 by Brad Smith at 2024-10-07T15:58:28-04:00
aarch64: Use elf_aux_info() for CPU feature detection on FreeBSD/OpenBSD
- - - - -
1243d9ff by Brad Smith at 2024-10-17T06:23:19-04:00
Provide x264_getauxval() wrapper for getauxval() and elf_aux_info()
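
A wrapper of this kind could look roughly like the sketch below. The
name my_getauxval() is hypothetical, and selecting the backend by
platform macro is a simplification made for this example; a real build
would more likely key off whichever function configure detected.

```c
#if defined(__linux__) || defined(__FreeBSD__) || defined(__OpenBSD__)
#include <sys/auxv.h>   /* getauxval() on Linux, elf_aux_info() on the BSDs */
#endif

/* Hypothetical wrapper: return the auxiliary vector entry `type`,
 * or 0 if it is missing or the platform has no such interface. */
static unsigned long my_getauxval(unsigned long type)
{
#if defined(__linux__)
    return getauxval(type);
#elif defined(__FreeBSD__) || defined(__OpenBSD__)
    unsigned long value = 0;
    /* elf_aux_info() returns 0 on success and an error number otherwise. */
    if (elf_aux_info((int)type, &value, sizeof(value)) != 0)
        return 0;
    return value;
#else
    (void)type;
    return 0;
#endif
}
```

Callers can then use one code path, e.g. my_getauxval(AT_HWCAP), on
every platform that provides either function.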
- - - - -
80c1c47c by Brad Smith at 2024-10-20T08:50:55+00:00
configure: Add DragonFly support
- - - - -
3a21e97b by Anton Mitrofanov at 2024-10-22T22:59:00+03:00
Fix build with Android NDK and API < 24 for 32-bit targets
fseeko() is not available before API 24 with _FILE_OFFSET_BITS=64.
x264.c: x264cli.h must be first as it contains _FILE_OFFSET_BITS define.
- - - - -
b1d2de88 by Brad Smith at 2024-10-26T06:34:32+00:00
Use getauxval() on Linux and elf_aux_info() on FreeBSD/OpenBSD on arm/ppc
- - - - -
da14df55 by Brad Smith at 2024-10-27T12:28:19-04:00
Make use of sysconf(3) _SC_NPROCESSORS_ONLN and _SC_NPROCESSORS_CONF
Make use of _SC_NPROCESSORS_ONLN if it exists and fall back to
_SC_NPROCESSORS_CONF for really old operating systems. This adds
support for retrieving the number of CPUs on a few OSes such as
NetBSD, DragonFly and a few others.
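
The fallback order described above can be sketched as follows. The
function name cpu_count() and the choice to return 1 on failure are
assumptions made for this example, not the actual x264 code.

```c
#include <unistd.h>

/* Hypothetical helper: number of CPUs, preferring the online count
 * and falling back to the configured count where that is all we have. */
static int cpu_count(void)
{
    long n = -1;
#if defined(_SC_NPROCESSORS_ONLN)
    n = sysconf(_SC_NPROCESSORS_ONLN);
#elif defined(_SC_NPROCESSORS_CONF)
    n = sysconf(_SC_NPROCESSORS_CONF);
#endif
    /* sysconf() returns -1 on error; never report fewer than one CPU. */
    return n > 0 ? (int)n : 1;
}
```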
- - - - -
023112c6 by Brad Smith at 2024-11-03T23:44:35-05:00
aarch64: defines involving bit shifts should be unsigned
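
The commit message does not state its motivation. One common reason
for such a change, shown here only as a sketch with hypothetical names,
is that shifting a signed 1 into the sign bit (1 << 31 with a 32-bit
int) is undefined behaviour in C, whereas the unsigned form is well
defined.

```c
#include <stdint.h>

/* Hypothetical flag definitions. Writing (1 << 31) instead would shift
 * into the sign bit of a 32-bit int, which is undefined behaviour. */
#define MY_FEATURE_LOW (1U << 0)
#define MY_FEATURE_TOP (1U << 31)

/* Test a single flag bit in an unsigned flag word. */
static int has_flag(uint32_t flags, uint32_t flag)
{
    return (flags & flag) != 0;
}
```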
- - - - -
938601b9 by Brad Smith at 2024-12-29T15:52:24+00:00
Use sysctlbyname(3) hw.logicalcpu on macOS
Use of hw.ncpu has long been deprecated.
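
For reference, a minimal sketch of querying hw.logicalcpu follows. The
function name and the fallback value of 1 are assumptions made for this
example; on platforms other than macOS the sketch returns the fallback.

```c
#include <stddef.h>

#ifdef __APPLE__
#include <sys/types.h>
#include <sys/sysctl.h>   /* sysctlbyname() */
#endif

/* Hypothetical helper: logical CPU count on macOS, 1 if unavailable. */
static int macos_logical_cpus(void)
{
#ifdef __APPLE__
    int n = 0;
    size_t len = sizeof(n);
    if (sysctlbyname("hw.logicalcpu", &n, &len, NULL, 0) == 0 && n > 0)
        return n;
#endif
    return 1;
}
```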
- - - - -
a64111b1 by Brad Smith at 2024-12-29T12:13:33-05:00
Enable use of __sync_fetch_and_add() wherever detected instead of just X86
Use __sync_fetch_and_add() wherever detected instead of being limited to
just X86.
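
For readers unfamiliar with the builtin: __sync_fetch_and_add() is a
GCC/Clang atomic builtin that adds to a variable and returns the value
it held before the addition. The wrapper name below is hypothetical
and only illustrates the semantics.

```c
/* Hypothetical wrapper: atomically increment *counter and return the
 * value it had before the increment (GCC/Clang builtin). */
static int counter_fetch_and_increment(int *counter)
{
    return __sync_fetch_and_add(counter, 1);
}
```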
- - - - -
450946f9 by Martin Storsjö at 2024-12-29T17:48:58+00:00
ci: Test compiling for Android
- - - - -
52f7694d by Brad Smith at 2024-12-29T17:54:57+00:00
Use sched_getaffinity on Android
https://android.googlesource.com/platform/bionic/+/72e6fd42421dca80fb2776a9185c186d4a04e5f7
Android has had sched_getaffinity since Android 3.0. Builds need
to use _GNU_SOURCE.
- - - - -
373697b4 by Anton Mitrofanov at 2025-01-03T16:48:30+03:00
Bump dates to 2025
- - - - -
c80f8a28 by Martin Storsjö at 2025-03-04T11:15:49+02:00
msvsdepend: Allow using the script for .S sources too
Previously, MSVC would warn that the .S source is unrecognized,
and the script would only produce a dependency on the main source
file itself.
- - - - -
27d83708 by Martin Storsjö at 2025-03-11T22:22:24+02:00
Makefile: Generate dependency information implicitly while compiling
This updates the dependency information on each successive recompile.
When building with MSVC, dependency information is generated with
a separate command just like before, but done together with
compiling each object file. (This is quite similar to how ffmpeg does
the same.)
This avoids the serial dependency generation step. In slow
environments (in particular if using MSVC) it could take a notable
amount of time; this can now all be done in parallel.
In one example, this reduces the time for a full build from clean
with MSVC (wrapped in wine) from 23 seconds down to 9 seconds,
thanks to parallelism. (For non-parallel builds, it doesn't make
much of a difference.)
- - - - -
a0191bd8 by Martin Storsjö at 2025-03-12T13:23:40+02:00
configure: Use as_check for checking for aarch64 features
This is more correct than using cc_check; we're going to assemble
standalone external assembly - thus check for whether we can
build it in that form, not using inline assembly.
This allows sharing checks with the MSVC codepath (where inline
assembly isn't supported, and where assembly is built using
a tool different from the regular compiler).
- - - - -
72ce1cde by Martin Storsjö at 2025-03-12T13:23:40+02:00
configure: Use as_check for the main check for whether NEON is supported
This requires adding the "-c" flag to ASFLAGS before doing the
check.
This also makes sure to validate the gas-preprocessor is functional
for MSVC configurations, by testing whether the "cmeq" instruction
can be assembled at this point.
- - - - -
f87ca183 by Martin Storsjö at 2025-03-12T13:23:40+02:00
configure: Check for .arch and .arch_extension for enabling aarch64 extensions
This hasn't been needed for SVE/SVE2, as all toolchains have
supported just enabling it via ".arch armv8.2-a+sve". For other
arch extensions, like dotprod/i8mm, there's more combinations of
toolchain bugs in slightly older toolchains; try to detect what is
supported.
Additionally, when involving more than one architecture extension,
we may want to enable/disable individual extensions one at a time,
without needing to specify the full list in one single .arch
statement.
This is a preparatory commit for adding support for the dotprod/i8mm
extensions.
We intentionally don't add AS_ARCH_LEVEL to the CONFIG_HAVE list,
as this define isn't prefixed with "HAVE_", and we don't use the
define except in the case where we actually do set it. (It's not
a regular 0/1 define like the others.)
- - - - -
87044b21 by Martin Storsjö at 2025-03-12T13:23:40+02:00
aarch64: Use configure detected directives for enabling SVE/SVE2
By using .arch_extension (if supported) to enable the relevant
extensions, we can also disable them afterwards, so we can e.g.
cleanly enable one extension only for one subsection of a file.
This also makes it easier to enable various combinations of
supported architecture extensions.
- - - - -
fc4012fb by Martin Storsjö at 2025-03-12T13:23:40+02:00
configure: Check for the dotprod and i8mm aarch64 extensions
- - - - -
0e48d072 by Martin Storsjö at 2025-03-12T13:23:40+02:00
aarch64: Add flags for runtime detection of dotprod and i8mm
Also add code for detecting them on Linux.
- - - - -
570f6c70 by Martin Storsjö at 2025-03-12T13:23:40+02:00
aarch64: Add runtime detection of extensions on Windows and macOS
- - - - -
fe9e4a7f by Konstantinos Margaritis at 2025-03-12T12:35:10+00:00
Provide implementations for functions using the instructions SDOT/UDOT in the DotProd Armv8 extension.
Functions implemented:
sad_16x8, sad_16x16,
sad_x3_16x8_neon, sad_x3_16x16_neon,
sad_x4_16x8_neon, sad_x4_16x16_neon,
ssd_8x4, ssd_8x8, ssd_8x16, ssd_16x8, ssd_16x16,
pixel_vsad
Performance improvement against Neon ranges from 5% to 188%.
Following is the output of ./checkasm8 --bench (run on a Graviton4 system):
sad_16x8_c: 1323
sad_16x8_neon: 224
sad_16x8_dotprod: 211
sad_16x16_c: 2619
sad_16x16_neon: 365
sad_16x16_dotprod: 320
sad_x3_16x8_c: 3836
sad_x3_16x8_neon: 403
sad_x3_16x8_dotprod: 317
sad_x3_16x16_c: 7725
sad_x3_16x16_neon: 714
sad_x3_16x16_dotprod: 532
sad_x4_16x8_c: 5080
sad_x4_16x8_neon: 438
sad_x4_16x8_dotprod: 375
sad_x4_16x16_c: 10260
sad_x4_16x16_neon: 794
sad_x4_16x16_dotprod: 655
ssd_8x4_c: 381
ssd_8x4_neon: 157
ssd_8x4_dotprod: 115
ssd_8x4_sve: 150
ssd_8x8_c: 695
ssd_8x8_neon: 238
ssd_8x8_dotprod: 161
ssd_8x8_sve: 228
ssd_8x16_c: 1335
ssd_8x16_neon: 388
ssd_8x16_dotprod: 267
ssd_16x8_c: 1342
ssd_16x8_neon: 285
ssd_16x8_dotprod: 166
ssd_16x16_c: 2623
ssd_16x16_neon: 503
ssd_16x16_dotprod: 277
vsad_c: 2786
vsad_neon: 311
vsad_dotprod: 235
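
To make the role of the dot-product instructions concrete, the sketch
below models UDOT in scalar C and uses it for SAD and SSD over a
16-pixel-wide block. This is only a model under stated assumptions: the
function names are hypothetical, and computing absolute differences
first and then feeding them to UDOT (against a vector of ones for SAD,
against themselves for SSD) is one plausible formulation, not
necessarily the one used in the commit.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of UDOT on one 128-bit vector: each of the four 32-bit
 * lanes accumulates the sum of four unsigned 8-bit products. */
static void udot_model(uint32_t acc[4], const uint8_t a[16], const uint8_t b[16])
{
    for (int lane = 0; lane < 4; lane++)
        for (int j = 0; j < 4; j++)
            acc[lane] += (uint32_t)a[4 * lane + j] * b[4 * lane + j];
}

/* Absolute differences of one 16-pixel row. */
static void absdiff_row(uint8_t d[16], const uint8_t *p1, const uint8_t *p2)
{
    for (int x = 0; x < 16; x++)
        d[x] = (uint8_t)abs(p1[x] - p2[x]);
}

/* SAD: dot product of |p1 - p2| with a vector of ones. */
static uint32_t sad_16xh_model(const uint8_t *p1, int stride1,
                               const uint8_t *p2, int stride2, int height)
{
    uint32_t acc[4] = { 0, 0, 0, 0 };
    uint8_t ones[16], d[16];
    for (int x = 0; x < 16; x++)
        ones[x] = 1;
    for (int y = 0; y < height; y++, p1 += stride1, p2 += stride2) {
        absdiff_row(d, p1, p2);
        udot_model(acc, d, ones);
    }
    return acc[0] + acc[1] + acc[2] + acc[3];
}

/* SSD: dot product of |p1 - p2| with itself. */
static uint32_t ssd_16xh_model(const uint8_t *p1, int stride1,
                               const uint8_t *p2, int stride2, int height)
{
    uint32_t acc[4] = { 0, 0, 0, 0 };
    uint8_t d[16];
    for (int y = 0; y < height; y++, p1 += stride1, p2 += stride2) {
        absdiff_row(d, p1, p2);
        udot_model(acc, d, d);
    }
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

In both cases the multiply and the widening accumulation happen in a
single instruction per vector, which is where a gain over separate
widening and add steps would come from.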
- - - - -
32c3b801 by Martin Storsjö at 2025-04-04T17:50:00+03:00
lavf: Update the code to work with the latest libavutil API
- - - - -
4360ac37 by Anton Mitrofanov at 2025-05-19T01:28:04+03:00
ci: Fix ffmpeg build
libpostproc has been removed from the ffmpeg repository.
- - - - -
40617ddb by Anton Mitrofanov at 2025-05-19T02:42:45+03:00
ci: Remove vlc-contrib dependency
- - - - -
85b5ccea by Martin Storsjö at 2025-05-21T18:48:51+00:00
Update gas-preprocessor.pl to the latest upstream version
This updates to the version from commit
7380ac24e1cd23a5e6d76c6af083d8fc5ab9e943 from
https://github.com/ffmpeg/gas-preprocessor.
The previous version was from 2017, from commit
ee12830747ff0b97ec6b41f4263fec63d1711365.
This includes support for assembling aarch64 code with
register ranges, such as {v0.8b-v3.8b} with armasm64 (rewritten
into an explicit list of registers), and fixes deprecated Perl
syntax broken by more modern versions of Perl.
- - - - -
ff620d0c by Coia Prant at 2025-05-27T19:44:39+00:00
configure: Use MSYSTEM_CARCH for default arch on msys2
- - - - -
714e07b4 by Martin Storsjö at 2025-06-06T14:25:43+03:00
arm: Don't test x264_cpu_fast_neon_mrc_test on Windows
The performance counters themselves are accessible, but the PMNC
(control register) that we try to read to see if the performance
counters are accessible, is not readable, causing illegal
instructions in cpu_enable_armv7_counter.
As an alternative, we could also modify cpu_fast_neon_mrc_test to
not inspect the PMNC at all (skip calling cpu_enable_armv7_counter)
but just assume that the counters are available, in high resolution
mode. However just not calling this codepath is the simplest,
as Windows on 32 bit ARM isn't very relevant these days.
- - - - -
291476d7 by Anton Mitrofanov at 2025-06-06T19:34:05+00:00
windows: Fix named pipes detection
The _wstati64 call succeeds on named pipes, so to check correctly you
must first check the result of WaitNamedPipeW.
- - - - -
b35605ac by Konstantinos Margaritis at 2025-06-08T16:24:23+00:00
i8mm & neon hpel_filter optimization
hpel_filter_c: 47995
hpel_filter_neon: 9670
hpel_filter_i8mm: 9643
previously:
hpel_filter_neon: 10222
In the Neon implementation, replaced SSHR+SUB+ADD with a single SSRA
- - - - -
21 changed files:
- .gitignore
- .gitlab-ci.yml
- Makefile
- autocomplete.c
- common/aarch64/asm-offsets.c
- common/aarch64/asm-offsets.h
- common/aarch64/asm.S
- common/aarch64/bitstream-a.S
- common/aarch64/bitstream.h
- common/aarch64/cabac-a.S
- + common/aarch64/dct-a-common.S
- + common/aarch64/dct-a-sve.S
- + common/aarch64/dct-a-sve2.S
- common/aarch64/dct-a.S
- common/aarch64/dct.h
- + common/aarch64/deblock-a-common.S
- + common/aarch64/deblock-a-sve.S
- common/aarch64/deblock-a.S
- common/aarch64/deblock.h
- + common/aarch64/mc-a-common.S
- + common/aarch64/mc-a-sve.S
The diff was not included because it is too large.
View it on GitLab: https://code.videolan.org/videolan/x264/-/compare/31e19f92f00c7003fa115047ce50978bc98c3a0d...b35605ace3ddf7c1a5d67a2eb553f034aef41d55