[vlc-devel] commit: NEON float to fixed point vectorized conversion ( Rémi Denis-Courmont )

Sun Sep 6 21:30:30 CEST 2009

"Rémi Denis-Courmont" <remi at remlab.net> writes:

> On Sunday 6 September 2009 21:02:37 Måns Rullgård, you wrote:
>> git at videolan.org (git version control) writes:
>> > vlc | branch: master | Rémi Denis-Courmont <remi at remlab.net> | Sat Sep  5
>> > 17:14:09 2009 +0300| [7b51769579f4b5a83641c5e93e957f902467f71e] |
>> > committer: Rémi Denis-Courmont
>> >
>> > NEON float to fixed point vectorized conversion
>> >
>> > +/**
>> > + * Half-precision floating point to signed fixed point conversion.
>> > + */
>>
>> I think you mean single-precision.  Half-precision is a rather
>> uncommon 16-bit floating point format.
>
> Hmm right.
>
>>
>> > +    while (inp != endp)
>> > +        asm volatile (
>> > +            "vld4.f32 {q0-q1}, [%[inp]]!\n"
>> > +            "vcvt.s32.f32 q2, q0, #28\n"
>> > +            "vcvt.s32.f32 q3, q1, #28\n"
>> > +            "vst4.s32 {q2-q3}, [%[outp]]!\n"
>> > +            : [outp] "+r" (outp), [inp] "+r" (inp)
>> > +            :
>> > +            : "q0", "q1", "q2", "q3", "memory");
>>
>> This is very inefficient for a couple of reasons:
>
>> - VLD1 is faster and works just as well here.
>
> This was already fixed in b1aa778c9337b0a9d,

I noticed after sending the email.

> although the "faulty" code still seemed 6 times faster than the
> portable C implementation.

Yes, it would be.  GCC won't use NEON at all here, but rather the
non-pipelined VFP unit.  Beating GCC is easy, but with careful
scheduling you can usually get at least twice the speed of a trivial
asm implementation.
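
For reference, the portable C conversion being compared against could
look roughly like the sketch below.  This is not the VLC code: the
names float_to_q28 and convert_q28 are hypothetical, and the saturation
and NaN handling are written to mirror what "vcvt.s32.f32 ..., #28" is
understood to do (28 fractional bits, truncation toward zero,
saturation at the int32 limits, NaN converted to 0).

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from VLC: convert one float to signed
 * fixed point with 28 fractional bits. */
static int32_t float_to_q28(float x)
{
    /* Scale by 2^28 in double so the product is exact. */
    double scaled = (double)x * 268435456.0;

    if (isnan(scaled))
        return 0;               /* assumed to match NEON: NaN gives 0 */
    if (scaled >= 2147483647.0)
        return INT32_MAX;       /* saturate on positive overflow */
    if (scaled <= -2147483648.0)
        return INT32_MIN;       /* saturate on negative overflow */
    return (int32_t)scaled;     /* C conversion truncates toward zero */
}

/* Hypothetical scalar loop over a buffer of n samples. */
static void convert_q28(int32_t *outp, const float *inp, size_t n)
{
    for (size_t i = 0; i < n; i++)
        outp[i] = float_to_q28(inp[i]);
}
```

A compiler that does not vectorize this loop handles one sample per
iteration, whereas the NEON version converts four samples per VCVT.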

>> - The VST4 instruction will stall four cycles waiting for the result
>>   of VCVT.
>
> I haven't (had time to) read the spec to such a level of detail yet :/

If you intend to write a lot of this stuff, you will need to do that.
Proper instruction scheduling is essential for writing fast NEON code.

>> Simply switching to VLD1/VST1 will save one cycle in each of these,
>> and will also stall for one cycle less, saving a total of three cycles
>> per iteration.  More could be saved by unrolling the loop a few times.
>
> So there should be 4 independent instructions before the
> corresponding store?

There are four cycles to fill, which may take fewer than four
instructions or more.  Some instructions need multiple issue cycles,
and some pairs can dual-issue.  You should of course try to dual-issue
as much as possible.  NEON load/store/permute instructions can
dual-issue with NEON data-processing instructions.  This means you
should try to interleave the two classes of instructions whenever
possible.  Permute instructions are things like VREV, VTRN, and VZIP
that shuffle the contents of registers without doing any
arithmetic/logic operations.
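
As a rough illustration of the advice above, the loop could be
unrolled once so that each store is separated from the VCVT that
produces its data by other work.  This is only a sketch, not the VLC
code and not cycle-counted: the function name convert_q28_unrolled is
hypothetical, it assumes n is a multiple of 8 samples, and the scalar
branch exists only so the example builds on targets without NEON.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: convert n floats (n a multiple of 8) to signed
 * fixed point with 28 fractional bits. */
static void convert_q28_unrolled(int32_t *outp, const float *inp, size_t n)
{
    const float *endp = inp + n;

#if defined(__arm__) && defined(__ARM_NEON__)
    while (inp != endp)
        __asm__ volatile (
            /* Two VLD1 loads up front (load/store pipeline). */
            "vld1.32 {q0-q1}, [%[inp]]!\n"
            "vld1.32 {q2-q3}, [%[inp]]!\n"
            /* Four independent conversions (data-processing pipeline). */
            "vcvt.s32.f32 q0, q0, #28\n"
            "vcvt.s32.f32 q1, q1, #28\n"
            "vcvt.s32.f32 q2, q2, #28\n"
            "vcvt.s32.f32 q3, q3, #28\n"
            /* The first store follows the q2/q3 conversions, so it does
             * not come directly after the VCVT that wrote q1. */
            "vst1.32 {q0-q1}, [%[outp]]!\n"
            "vst1.32 {q2-q3}, [%[outp]]!\n"
            : [outp] "+r" (outp), [inp] "+r" (inp)
            :
            : "q0", "q1", "q2", "q3", "memory");
#else
    /* Scalar fallback for non-NEON builds; assumes every input lies
     * inside [-8, 8) so the result fits in an int32_t. */
    while (inp != endp)
        *outp++ = (int32_t)(*inp++ * 268435456.0f);
#endif
}
```

A tuned version would go further, for example by loading the next
block while the current one is being converted, and would be checked
against the cycle timings for the target core.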

-- 
Måns Rullgård
mans at mansr.com