[vlc-devel] [patch] avx2 acceleration for i420_yuy2/i422_yuy2/i420_rgb

Sat Jan 26 20:52:56 CET 2019

OK, I had to look this up to catch up, not having done much asm to
date. My understanding is that nasm/yasm are now preferred over gas;
nasm requires Intel syntax, while yasm supports both Intel and AT&T;
so Intel syntax is preferred because it works with both nasm and yasm.

I only used AT&T syntax because that was what was already in place. I
guess I can change it to Intel.

I'll leave the existing SSE2 code as AT&T, unless you'd like to see that
converted too, to unify everything on a single assembler... Actually,
maybe I'll just end up converting the SSE2 code as well while I'm at
it, since that seems a sensible goal. I'll see.
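
On the aside in my original mail (quoted below), about getting a 128-bit
load with the upper half zeroed: here is a minimal sketch of one way it
could be done with intrinsics, assuming GCC or clang on x86. It is not
taken from the patch, and the helper names are made up for illustration.
It inserts the 128-bit load into an all-zero register, so the upper lane
ends up zero rather than holding leftovers. As far as I understand it,
`_mm_` intrinsics get VEX encodings anyway when the function is built for
AVX, so this should not run into the VEX/non-VEX mixing penalty.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not taken from the patch: load 16 aligned bytes
 * into the low lane of a YMM register with the high lane explicitly
 * zeroed, then store all 32 bytes to `out` so they can be inspected. */
__attribute__((target("avx2")))
static void load_low128_zero_high(const uint8_t *src, uint8_t out[32])
{
    /* _mm_load_si128 requires a 16-byte aligned source. */
    assert(((uintptr_t)src & 15) == 0);
    __m128i lo = _mm_load_si128((const __m128i *)src);
    /* Insert into lane 0 of an all-zero register: lane 1 stays zero. */
    __m256i v = _mm256_inserti128_si256(_mm256_setzero_si256(), lo, 0);
    _mm256_storeu_si256((__m256i *)out, v);
}

/* Returns 1 if the low 16 bytes match the source and the high 16 bytes
 * are zero, 0 if not, and -1 if the CPU has no AVX2 (in which case the
 * AVX2 code is never executed). */
int check_low128_load(void)
{
    static const uint8_t zeros[16] = {0};
    _Alignas(16) uint8_t src[16];
    uint8_t out[32];

    if (!__builtin_cpu_supports("avx2"))
        return -1;
    for (int i = 0; i < 16; i++)
        src[i] = (uint8_t)(i + 1);
    memset(out, 0xff, sizeof out);   /* poison so stale bytes would show */
    load_low128_zero_high(src, out);
    return memcmp(out, src, 16) == 0 && memcmp(out + 16, zeros, 16) == 0;
}
```

The `target("avx2")` attribute and `__builtin_cpu_supports` are GCC/clang
extensions, used here only so the sketch builds without `-mavx2`. When the
upper half really is never read, `_mm256_castsi128_si256` is the cheaper
choice, since it leaves those bits undefined instead of zeroing them.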

On Sat, 2019-01-26 at 00:44 +0100, Jean-Baptiste Kempf wrote:
> Shouldn't that be moved to nasm/yasm syntax?
> 
> On Tue, 22 Jan 2019, at 22:58, jnqnfe at gmail.com wrote:
> > the attached patch adds AVX2 acceleration for
> > i420_yuy2/i422_yuy2/i420_rgb chroma converters
> > 
> > it is built on top of two other submissions sent in today, one to
> > add an AVX2 module to configure, and the other was a set of various
> > patches to these plugins
> > 
> > it is designed based upon the SSE2 implementation
> > 
> > i've not yet been in any position to compile it, but I've put a lot
> > of work into it over the past few days perfecting it
> > 
> > benefits:
> >  - twice as much data at a time
> >  - VEX-encoded instructions are more compact, I believe, so the
> >    machine code is smaller
> >  - use of non-destructive instructions enabled eliminating many of
> >    the copies done in the SSE2 version
> > 
> > ---
> > an aside:
> > one small thing I'll mention that I don't like is that a
> > `_mm256_loadl_epi128` function does not exist, so I had to use
> > `_mm256_inserti128_si256`. With the assembly, `vmovdqa` is used for
> > both 256-bit and 128-bit aligned loads (`vmovdqu` for unaligned),
> > with how much data is loaded depending on whether you reference a
> > YMM or XMM register; and if an XMM register, it zeros out the top
> > portion. Use of `_mm256_inserti128_si256` in the function based
> > implementation does not zero out the top portion (unless already
> > zero), but should not cause any problem since we don't use that
> > data. Note that an older non-VEX (`_mm_` instead of `_mm256_`)
> > intrinsic could be used, but A) this would have the same effect,
> > and B) I have a copy of a 2014 paper which suggests mixing
> > VEX/non-VEX YMM+XMM instructions brings a big performance penalty.
> > I don't know yet whether or not there's a better solution than
> > `_mm256_inserti128_si256` (to properly load 1x128 with the upper
> > half zeroed, as with asm).
> > Email had 1 attachment:
> > + chroma_avx2.patch
> >   174k (text/x-patch)
> 
>