[vlc-devel] commit: NEON float to fixed point vectorized conversion ( Rémi Denis-Courmont )

Sun Sep 6 20:46:46 CEST 2009

On Sunday 6 September 2009 21:02:37 Måns Rullgård, you wrote:
> git at videolan.org (git version control) writes:
> > vlc | branch: master | Rémi Denis-Courmont <remi at remlab.net> | Sat Sep  5
> > 17:14:09 2009 +0300| [7b51769579f4b5a83641c5e93e957f902467f71e] |
> > committer: Rémi Denis-Courmont
> >
> > NEON float to fixed point vectorized conversion
> >
> > +/**
> > + * Half-precision floating point to signed fixed point conversion.
> > + */
>
> I think you mean single-precision.  Half-precision is a rather
> uncommon 16-bit floating point format.

Hmm, right.

>
> > +    while (inp != endp)
> > +        asm volatile (
> > +            "vld4.f32 {q0-q1}, [%[inp]]!\n"
> > +            "vcvt.s32.f32 q2, q0, #28\n"
> > +            "vcvt.s32.f32 q3, q1, #28\n"
> > +            "vst4.s32 {q2-q3}, [%[outp]]!\n"
> > +            : [outp] "+r" (outp), [inp] "+r" (inp)
> > +            :
> > +            : "q0", "q1", "q2", "q3", "memory");
>
> This is very inefficient for a couple of reasons:
>
> - VLD1 is faster and works just as well here.

This was already fixed in b1aa778c9337b0a9d, although the "faulty" code still
seemed to be 6 times faster than the portable C implementation.
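
For context, a portable C conversion to the same format (signed fixed point
with 28 fractional bits, matching the #28 operand above) might look roughly
like the sketch below. This is not the actual VLC code: float_to_fixed28 and
convert_c are hypothetical names, and the rounding, saturation and NaN
handling are an assumption about how VCVT behaves (truncate toward zero,
saturate, NaN becomes 0).

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: one float to signed fixed point, 28 fractional bits.
 * Assumed to mimic VCVT: truncation toward zero, saturation, NaN -> 0. */
static int32_t float_to_fixed28(float x)
{
    float scaled = x * 268435456.0f;   /* multiply by 2^28 */

    if (isnan(scaled))
        return 0;
    if (scaled >= 2147483648.0f)       /* >= 2^31: saturate high */
        return INT32_MAX;
    if (scaled <= -2147483648.0f)      /* <= -2^31: saturate low */
        return INT32_MIN;
    return (int32_t)scaled;            /* C casts truncate toward zero */
}

/* Hypothetical scalar loop over a buffer of n samples. */
static void convert_c(int32_t *outp, const float *inp, size_t n)
{
    for (size_t i = 0; i < n; i++)
        outp[i] = float_to_fixed28(inp[i]);
}
```

With 28 fractional bits, the representable range is [-8, 8), so anything
outside it saturates.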

> - The VST4 instruction will stall four cycles waiting for the result
>   of VCVT.

I haven't had time to read the spec at that level of detail yet :/

> Simply switching to VLD1/VST1 will save one cycle in each of these,
> and will also stall for one cycle less, saving a total of three cycles
> per iteration.  More could be saved by unrolling the loop a few times.

So there should be 4 independent instructions before the corresponding store? 
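
One possible reading of the unrolling suggestion, sketched with NEON
intrinsics rather than inline asm: do four independent VLD1 loads, then four
conversions, then four VST1 stores, so that each store is separated from the
conversion it depends on. convert_unrolled is a hypothetical name, the sample
count is assumed to be a multiple of 16, and the compiler remains free to
reorder the instructions, so this only illustrates the idea. The scalar
fallback exists only so that the sketch builds on non-ARM hosts, and it
assumes inputs within [-8, 8).

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical sketch: convert n floats (n assumed to be a multiple of 16)
 * to signed fixed point with 28 fractional bits. */
static void convert_unrolled(int32_t *outp, const float *inp, size_t n)
{
    const float *endp = inp + n;

#if defined(__ARM_NEON)
    while (inp != endp) {
        /* Four independent loads (VLD1), 16 samples per iteration. */
        float32x4_t f0 = vld1q_f32(inp);
        float32x4_t f1 = vld1q_f32(inp + 4);
        float32x4_t f2 = vld1q_f32(inp + 8);
        float32x4_t f3 = vld1q_f32(inp + 12);

        /* Four independent conversions (VCVT with #28). */
        int32x4_t i0 = vcvtq_n_s32_f32(f0, 28);
        int32x4_t i1 = vcvtq_n_s32_f32(f1, 28);
        int32x4_t i2 = vcvtq_n_s32_f32(f2, 28);
        int32x4_t i3 = vcvtq_n_s32_f32(f3, 28);

        /* Stores (VST1) come last, each well after its own conversion. */
        vst1q_s32(outp, i0);
        vst1q_s32(outp + 4, i1);
        vst1q_s32(outp + 8, i2);
        vst1q_s32(outp + 12, i3);

        inp += 16;
        outp += 16;
    }
#else
    /* Scalar fallback for non-ARM hosts; inputs assumed within [-8, 8). */
    while (inp != endp)
        *outp++ = (int32_t)(*inp++ * 268435456.0f);
#endif
}
```

Whether four instructions is the right distance depends on the actual
pipeline latencies, which would need checking against the core's timing
documentation.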

-- 
Rémi Denis-Courmont
http://www.remlab.net/