[libdvbpsi-devel] [Videolan-devel] libdvbcsa - NEON acceleration

Nikolay Nikolaev nicknickolaev at gmail.com
Tue Jul 19 12:01:56 CEST 2011


Remi,

I am using Linaro's QEMU and have had no trouble with NEON.


> Unfortunately, the structure of libdvbcsa in its current form prevents
> good ARM NEON optimizations. It cannot deal with unrolling, at least not
> with block size above 64 bits. I do not know if this is a problem for MMX,
> SSE and Altivec, but it definitely is for ARM NEON. The ARM processors do
> not reorder instructions and most NEON instructions have latencies. So you
> typically should not use the result of an instruction in the next
> instruction, otherwise the coprocessor will stall due to data dependency.
>

Does this mean that there's no point in trying to do NEON in this code?
I would still like to see this run on real hardware to find out whether it is
worth it at all.
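
Remi's point about latency can be illustrated with a small sketch. The code below XOR-reduces an array of 128-bit words using two independent accumulators, so an XOR does not have to wait for the result of the one issued just before it. The type `word128` and the `w_*` helpers are hypothetical names for this example, not part of libdvbcsa. On targets without NEON a plain-C fallback is compiled instead, so the logic can be checked anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
typedef uint64x2_t word128;                   /* one 128-bit Q register */
static inline word128 w_zero(void) { return vdupq_n_u64(0); }
static inline word128 w_load(const uint64_t *p) { return vld1q_u64(p); }
static inline word128 w_xor(word128 a, word128 b) { return veorq_u64(a, b); }
static inline void w_store(uint64_t *p, word128 a) { vst1q_u64(p, a); }
#else
typedef struct { uint64_t lo, hi; } word128;  /* portable fallback */
static inline word128 w_zero(void) { word128 w = { 0, 0 }; return w; }
static inline word128 w_load(const uint64_t *p) { word128 w = { p[0], p[1] }; return w; }
static inline word128 w_xor(word128 a, word128 b)
{
    word128 w = { a.lo ^ b.lo, a.hi ^ b.hi };
    return w;
}
static inline void w_store(uint64_t *p, word128 a) { p[0] = a.lo; p[1] = a.hi; }
#endif

/* XOR together n 128-bit words (stored as 2*n uint64_t values) into out[0..1].
   The loop is unrolled by two with independent accumulators, so the two
   XORs in one iteration have no data dependency on each other. */
void xor_reduce_unrolled(const uint64_t *src, size_t n, uint64_t out[2])
{
    assert(out != NULL);
    word128 acc0 = w_zero(), acc1 = w_zero();
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        acc0 = w_xor(acc0, w_load(src + 2 * i));      /* word i     */
        acc1 = w_xor(acc1, w_load(src + 2 * i + 2));  /* word i + 1 */
    }
    if (i < n)                                        /* odd leftover word */
        acc0 = w_xor(acc0, w_load(src + 2 * i));

    w_store(out, w_xor(acc0, acc1));                  /* combine once at the end */
}
```

Whether this helps on a given core would still have to be measured on real hardware; the sketch only shows the shape of the transformation.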


> Also, without explicit vector load/store (VLD and VST) instructions,
> memory transfers to/from the coprocessor may be painfully slow.
>

This probably means that BS_VAL can be done as a VLD; I can't see where VST
could be used, though.


> It might also be worth looking at aligned memory accesses, which are much
> faster in NEON than unaligned access. But that optimization is only usable
> with assembly code (inline or out of line), not with intrinsics.
>
>
As I said, this is my first experience with NEON, so I am not completely sure
how to do that.
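
One common way to get aligned accesses, sketched here under the assumption that the batch buffers are allocated by the caller: allocate them on a 16-byte boundary (one Q register) with C11 `aligned_alloc`, and, when a pointer may be misaligned, handle the first few bytes with scalar code until the address reaches a boundary. `BATCH_ALIGN`, `alloc_batch_buffer` and `bytes_until_aligned` are hypothetical names. In hand-written assembly, VLD1/VST1 then accept an alignment qualifier on the address operand (`:128` for 16 bytes), which is the part Remi notes is not expressible with intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BATCH_ALIGN 16   /* bytes in one 128-bit NEON Q register */

/* Non-zero if p is a multiple of align (align must be a power of two). */
int is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Allocate at least size bytes on a BATCH_ALIGN boundary; release with free().
   The size is rounded up because C11 aligned_alloc expects a multiple of
   the alignment. */
void *alloc_batch_buffer(size_t size)
{
    size_t rounded = (size + BATCH_ALIGN - 1) / BATCH_ALIGN * BATCH_ALIGN;
    return aligned_alloc(BATCH_ALIGN, rounded);
}

/* Bytes to process with scalar code before p reaches a BATCH_ALIGN boundary
   (0 if p is already aligned). */
size_t bytes_until_aligned(const void *p)
{
    size_t mis = (size_t)((uintptr_t)p & (BATCH_ALIGN - 1));
    return mis ? BATCH_ALIGN - mis : 0;
}
```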
From my standpoint, there are macros that provide some basic primitives for
the different SIMD platforms. I have just added the NEON versions (using the
corresponding intrinsics), following the assembly manuals.
Now, if the code calls BS_SHL on unaligned data, what can be done?
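
A hedged sketch on the BS_SHL question: in SIMD code a shift normally operates on a value already held in a register, so alignment only comes into play when the word is loaded from or stored to memory. The example assumes BS_SHL-style semantics of shifting each 64-bit lane left by a constant (which is what `vshlq_n_u64` does) and loads from a deliberately misaligned address with `vld1q_u8`, which only needs byte alignment for its 8-bit elements. The `EX_*` macros and `shl4_unaligned` are hypothetical, not libdvbcsa's definitions, and the lane layout assumes a little-endian target.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
typedef uint64x2_t ex_word_t;
/* vld1q_u8/vst1q_u8 need only byte alignment, so any address works. */
#define EX_LOAD(p)     vreinterpretq_u64_u8(vld1q_u8((const uint8_t *)(p)))
#define EX_STORE(p, a) vst1q_u8((uint8_t *)(p), vreinterpretq_u8_u64(a))
#define EX_SHL(a, n)   vshlq_n_u64((a), (n))   /* n must be a constant 0..63 */
#else
typedef struct { uint64_t v[2]; } ex_word_t;   /* portable fallback */
static inline ex_word_t ex_load(const void *p)
{
    ex_word_t w;
    memcpy(w.v, p, sizeof w.v);                /* memcpy is alignment-safe */
    return w;
}
static inline void ex_store(void *p, ex_word_t a) { memcpy(p, a.v, sizeof a.v); }
static inline ex_word_t ex_shl(ex_word_t a, int n)
{
    a.v[0] <<= n;
    a.v[1] <<= n;
    return a;
}
#define EX_LOAD(p)     ex_load(p)
#define EX_STORE(p, a) ex_store((p), (a))
#define EX_SHL(a, n)   ex_shl((a), (n))
#endif

/* Shift two 64-bit lanes read from a possibly unaligned address left by 4. */
void shl4_unaligned(void *dst, const void *src)
{
    assert(dst != NULL && src != NULL);
    ex_word_t w = EX_LOAD(src);   /* alignment matters only for this load */
    w = EX_SHL(w, 4);             /* the shift itself works on the register */
    EX_STORE(dst, w);             /* and for this store */
}
```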

Thanks,
Nikolay Nikolaev