[x265] [PATCH] Use atomic bit test and set/reset operations on x86

Wed Jan 10 20:20:09 CET 2018

On 01/10/18 21:03, chen wrote:
> 
> At 2018-01-11 00:06:29, "Andrey Semashev" <andrey.semashev at gmail.com> wrote:
>>On 01/10/18 18:53, chen wrote:
>>
>>> the "lock" prefix will lock the CPU bus, so it will incur a greater
>>> penalty on multi-core systems.
>>
>>Just for the record, the lock prefix is implemented much more
>>efficiently nowadays and involves CPU cache management rather than bus
>>locking. It used to lock the memory bus on early CPUs (I want to say
>>before Pentium, but I'm not sure which exact architecture changed this).
>>In any case, the patch does not introduce new lock instructions; it
>>replaces the "lock; cmpxchg" loops that are normally generated for the
>>atomic AND and OR operations with a single instruction.
> 
> https://htor.inf.ethz.ch/publications/img/atomic-bench.pdf
> 
> In this paper, the authors explain that lock (SWP) causes only a small
> performance drop on modern CPUs, but they only tested systems with few
> cores (Xeon Phi loses more, and it is a single-socket CPU). On
> multi-socket systems, cache coherency maintenance will be very expensive.

I don't dispute that cache coherency protocols are more expensive on
massively parallel systems. They are just as expensive with the current
code. If anything, replacing a CAS loop with a single instruction has the
potential to *reduce* the number of executed atomic instructions, since a
CAS loop has to retry whenever another thread modifies the word between
the load and the cmpxchg. More so under heavy contention.
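
To illustrate the difference, here is a minimal sketch (not the code from the
patch) of an atomic bit test-and-set written two ways with std::atomic. The
function names are made up for this example. The first form spells out the
CAS loop explicitly. The second leaves it to the compiler, but since x86 has
no "fetch-or" instruction that returns the old value, the compiler may still
emit a "lock; cmpxchg" loop for it unless it recognizes the single-bit
pattern.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper: atomic OR written as an explicit CAS loop.
// Returns the value the word held before the OR was applied.
inline uint32_t fetch_or_cas_loop(std::atomic<uint32_t>& word, uint32_t mask)
{
    uint32_t expected = word.load(std::memory_order_relaxed);
    // On failure compare_exchange_weak reloads 'expected' with the current
    // value, so every retry costs one more atomic instruction.
    while (!word.compare_exchange_weak(expected, expected | mask,
                                       std::memory_order_seq_cst,
                                       std::memory_order_relaxed))
    {
    }
    return expected;
}

// Hypothetical helper: test-and-set built on the explicit CAS loop.
// Returns true if the bit was already set.
inline bool test_and_set_bit_cas(std::atomic<uint32_t>& word, unsigned bit)
{
    const uint32_t mask = uint32_t(1) << bit;
    return (fetch_or_cas_loop(word, mask) & mask) != 0;
}

// Hypothetical helper: the same operation via fetch_or. Whether this becomes
// a single "lock; bts" or a cmpxchg loop is up to the compiler.
inline bool test_and_set_bit_rmw(std::atomic<uint32_t>& word, unsigned bit)
{
    const uint32_t mask = uint32_t(1) << bit;
    return (word.fetch_or(mask) & mask) != 0;
}
```

Both functions give the same result; the difference is only in how many
atomic instructions may end up being executed under contention.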

> However, the intrinsic may get more benefit from the compiler; it may
> decide which method is the best choice on the target platform.

Well, on x86 there really is not much choice of atomic instructions: all
of them have the lock prefix (xchg has an implicit one) and presumably
rely on the same cache coherency protocols. There are TSX extensions,
but given that the operations always modify the same memory, I don't see
those as beneficial. So basically, you can only hope that the compiler
performs the optimization that this patch implements, but inspecting the
generated code shows that current compilers are not capable of it.
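
For reference, a rough sketch of what a hand-written single-instruction
version can look like. This is not the patch itself; the function names are
hypothetical, and it assumes GCC or Clang (extended inline asm and the
__atomic builtins). On x86 it uses "lock; bts" / "lock; btr"; on other
targets it falls back to the builtins. The bit index is assumed to be below
32, because with a register operand bts/btr can address bits outside the
given word.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: atomically set 'bit' (assumed < 32) in *word.
// Returns true if the bit was already set.
inline bool atomic_bit_test_and_set(uint32_t* word, unsigned bit)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    bool was_set;
    __asm__ __volatile__(
        "lock; btsl %2, %0\n\t"   // set the bit; CF receives its old value
        "setc %1"                 // copy CF into the result
        : "+m"(*word), "=q"(was_set)
        : "Ir"(bit)
        : "cc", "memory");
    return was_set;
#else
    // Fallback for non-x86 targets (GCC/Clang builtin).
    const uint32_t mask = uint32_t(1) << bit;
    return (__atomic_fetch_or(word, mask, __ATOMIC_SEQ_CST) & mask) != 0;
#endif
}

// Hypothetical helper: atomically clear 'bit' (assumed < 32) in *word.
// Returns true if the bit was set before the call.
inline bool atomic_bit_test_and_reset(uint32_t* word, unsigned bit)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    bool was_set;
    __asm__ __volatile__(
        "lock; btrl %2, %0\n\t"   // clear the bit; CF receives its old value
        "setc %1"
        : "+m"(*word), "=q"(was_set)
        : "Ir"(bit)
        : "cc", "memory");
    return was_set;
#else
    const uint32_t mask = uint32_t(1) << bit;
    return (__atomic_fetch_and(word, ~mask, __ATOMIC_SEQ_CST) & mask) != 0;
#endif
}
```

There is no retry loop here, so each call executes exactly one locked
instruction regardless of contention.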