[vlc-devel] commit: NEON float to fixed point vectorized conversion ( Rémi Denis-Courmont )
Rémi Denis-Courmont
remi at remlab.net
Sun Sep 6 20:46:46 CEST 2009
On Sunday 6 September 2009 21:02:37 Måns Rullgård, you wrote:
> git at videolan.org (git version control) writes:
> > vlc | branch: master | Rémi Denis-Courmont <remi at remlab.net> | Sat Sep 5
> > 17:14:09 2009 +0300| [7b51769579f4b5a83641c5e93e957f902467f71e] |
> > committer: Rémi Denis-Courmont
> >
> > NEON float to fixed point vectorized conversion
> >
> > +/**
> > + * Half-precision floating point to signed fixed point conversion.
> > + */
>
> I think you mean single-precision. Half-precision is a rather
> uncommon 16-bit floating point format.
Hmm right.
>
> > + while (inp != endp)
> > + asm volatile (
> > + "vld4.f32 {q0-q1}, [%[inp]]!\n"
> > + "vcvt.s32.f32 q2, q0, #28\n"
> > + "vcvt.s32.f32 q3, q1, #28\n"
> > + "vst4.s32 {q2-q3}, [%[outp]]!\n"
> > + : [outp] "+r" (outp), [inp] "+r" (inp)
> > + :
> > + : "q0", "q1", "q2", "q3", "memory");
>
> This is very inefficient for a couple of reasons:
> - VLD1 is faster and works just as well here.
This was already fixed in b1aa778c9337b0a9d, although the "faulty" code still
seemed 6 times faster than the portable C implementation.
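
For reference, the per-sample operation being vectorized can be sketched in
portable C. This is only an illustration, not the code from the VLC tree: the
names float_to_fi32 and convert_buffer are made up here, and the sketch assumes
the intended behaviour is that of VCVT.S32.F32 with #28 applied to each lane,
i.e. scale by 2^28, truncate toward zero and saturate to the 32-bit range.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the VLC tree): convert one float to signed
 * fixed point with 28 fractional bits, truncating toward zero and
 * saturating, as assumed above for one lane of VCVT.S32.F32 with #28. */
static int32_t float_to_fi32(float f)
{
    float scaled = f * 268435456.0f; /* 2^28, exact for a power of two */

    if (scaled != scaled)            /* NaN: treated as zero here */
        return 0;
    if (scaled >= 2147483648.0f)     /* would overflow: clamp to maximum */
        return INT32_MAX;
    if (scaled <= -2147483648.0f)    /* would underflow: clamp to minimum */
        return INT32_MIN;
    return (int32_t)scaled;          /* C casts truncate toward zero */
}

/* Hypothetical scalar loop over a buffer of n samples. */
static void convert_buffer(int32_t *outp, const float *inp, size_t n)
{
    for (size_t i = 0; i < n; i++)
        outp[i] = float_to_fi32(inp[i]);
}
```

With 28 fractional bits, 1.0 maps to 1 << 28, and anything at or beyond +/-8.0
saturates.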
> - The VST4 instruction will stall four cycles waiting for the result
> of VCVT.
I haven't (had time to) read the spec to such a level of detail yet :/
> Simply switching to VLD1/VST1 will save one cycle in each of these,
> and will also stall for one cycle less, saving a total of three cycles
> per iteration. More could be saved by unrolling the loop a few times.
So there should be 4 independent instructions before the corresponding store?
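
To make the question concrete, here is a rough sketch of the unrolling idea
using NEON intrinsics rather than inline assembly: all loads are issued first,
then all conversions, then all stores, so that each store has independent work
between it and the conversion producing its data. The function name
convert_unrolled is hypothetical, the unroll factor of four vectors is only a
guess, and the compiler is free to reschedule the instructions, so this shows
the intent rather than a guaranteed schedule. The scalar branch is only there
so the sketch also builds on targets without NEON.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
# include <arm_neon.h>
#endif

/* Hypothetical sketch: n must be a multiple of 16 samples. */
static void convert_unrolled(int32_t *outp, const float *inp, size_t n)
{
    for (size_t i = 0; i < n; i += 16)
    {
#if defined(__ARM_NEON)
        /* Four independent loads (VLD1-style, no de-interleaving). */
        float32x4_t f0 = vld1q_f32(inp + i);
        float32x4_t f1 = vld1q_f32(inp + i + 4);
        float32x4_t f2 = vld1q_f32(inp + i + 8);
        float32x4_t f3 = vld1q_f32(inp + i + 12);

        /* Four independent conversions with 28 fractional bits. */
        int32x4_t s0 = vcvtq_n_s32_f32(f0, 28);
        int32x4_t s1 = vcvtq_n_s32_f32(f1, 28);
        int32x4_t s2 = vcvtq_n_s32_f32(f2, 28);
        int32x4_t s3 = vcvtq_n_s32_f32(f3, 28);

        /* Stores come last, in the same order as the conversions. */
        vst1q_s32(outp + i, s0);
        vst1q_s32(outp + i + 4, s1);
        vst1q_s32(outp + i + 8, s2);
        vst1q_s32(outp + i + 12, s3);
#else
        /* Scalar fallback: scale by 2^28, saturate, truncate toward zero. */
        for (size_t j = i; j < i + 16; j++)
        {
            float scaled = inp[j] * 268435456.0f;

            if (scaled != scaled)
                outp[j] = 0;
            else if (scaled >= 2147483648.0f)
                outp[j] = INT32_MAX;
            else if (scaled <= -2147483648.0f)
                outp[j] = INT32_MIN;
            else
                outp[j] = (int32_t)scaled;
        }
#endif
    }
}
```

Whether four vectors is enough to hide the latency of the conversion depends on
the core, so the unroll factor would have to be checked against the timing
tables or measured.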
--
Rémi Denis-Courmont
http://www.remlab.net/