Altivec IDCT

Sun Sep 9 17:18:02 CEST 2001

On 9 Sep, this message from Michel LESPINASSE echoed through cyberspace:
> On Sun, Sep 09, 2001 at 10:13:59AM +0200, Michel Lanners wrote:
>> Concerning precision, I just noted that Moto's IDCT is not precise
>> enough... You knew that already ;-). Also, for the record, the Altivec
>> IDCT that exists in vlc 0.2.82 misses a final matrix transpose to be of
>> any use (but that's history, since it's been replaced by a better
>> implementation).
> 
> I think it could work if we just transpose the idct input (which we
> can do by just changing the zigzag order tables).

Avoiding the transpose where possible is always good, since the two
transposes involved in the IDCT use about 50% of the CPU cycles...
Also, the final transpose could be replaced by code in the block
add/copy routines. Not sure which (zigzag or block) is easier to do.
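
For the zigzag variant, a minimal sketch of what "changing the zigzag
order tables" could look like. The table is the standard MPEG zigzag
scan; build_transposed_scan() is a made-up helper name, not something
that exists in vlc. It only swaps row and column of every entry, so the
coefficients land in the block already transposed:

```c
#include <stdint.h>

/* Standard MPEG zigzag scan: entry i is the raster position
 * (row * 8 + col) of the i-th coefficient read from the bitstream. */
static const uint8_t zigzag_scan[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

/* Hypothetical helper: derive a scan table that stores every
 * coefficient at the transposed position (col * 8 + row). */
static void build_transposed_scan(const uint8_t in[64], uint8_t out[64])
{
    for (int i = 0; i < 64; i++) {
        unsigned row = in[i] >> 3;
        unsigned col = in[i] & 7;
        out[i] = (uint8_t)(col * 8 + row);
    }
}
```

The table would be built once at init time, so the per-block cost of
the input transpose drops to zero. Whether the quantizer matrices need
the same treatment depends on where dequantization happens, which I
have not checked.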

>> >>I've been looking at motorola's integer implementation (16-bit), but
>> >>its not precise enough (far from it) and I dont think there is any
>> >>hope to bring it to compliance. Its blindingly fast though.
> 
>> Is scaling to augment precision (while still staying 16-bit integer) any
>> possibility?
> 
> Well actually I've been looking a bit more at motorola's routine and
> it looks like precision can be improved a lot by scaling - so maybe
> that'll be enough. I have to try :) If it doesnt work I'll do the
> first half in 32-bit or float mode but that'll be quite slower.
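
To illustrate why scaling can help while staying in 16 bits, here is a
scalar sketch. It is not Motorola's routine: mul_q15() is a made-up
name that mimics a rounded Q15 multiply (the kind of arithmetic
Altivec's multiply-high-and-round instructions perform), and
PRESCALE_BITS is an assumed value. Headroom for the intermediate sums
of a real IDCT would still have to be checked.

```c
#include <stdint.h>

#define PRESCALE_BITS 4   /* assumption: 12-bit inputs leave 4 spare bits */

/* Rounded Q15 multiply, (a * b + 2^14) >> 15, clamped to int16_t.
 * Relies on >> being an arithmetic shift for negative values, which
 * holds on the usual compilers but is implementation-defined in C. */
static int16_t mul_q15(int16_t a, int16_t b)
{
    int32_t p = ((int32_t)a * b + 0x4000) >> 15;
    if (p > INT16_MAX) p = INT16_MAX;
    if (p < INT16_MIN) p = INT16_MIN;
    return (int16_t)p;
}

/* Shift the input left first, so the product keeps PRESCALE_BITS
 * fractional bits instead of rounding them away immediately. */
static int16_t mul_q15_prescaled(int16_t a, int16_t b)
{
    return mul_q15((int16_t)(a * (1 << PRESCALE_BITS)), b);
}
```

With a = 3 and b = 23170 (cos(pi/4) in Q15), mul_q15() gives 2, while
mul_q15_prescaled() gives 34, i.e. 2.125 in units of 1/16. The exact
value is about 2.1213, so the error is much smaller as long as the
scale is only removed at the very end.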

Also, you might want to look at Apple's example code, available publicly
on their site (look under developper support, example code, Altivec
technology somewhere); IIRC, they have a comparative implementation of
three or four IDCT's producing (quoting) 'identical results', one of
which is in floating point. I.e. they seem to have an integer
implementation that has the same precision as the fp one.

Not sure about their license re. integrating this into vlc...

>> Therefore, either your Altivec code needs to do some magic to re-align
>> misaligned accesses (expensive CPU-wise), or you need to guarantee that
>> the buffers you're using are indeed aligned. Alignment needs to be on
>> 16-bytes; however the malloc() out of glibc 2.2 guarantees only 8 bytes
>> alignment.
> 
> We need to do a memalign() or to have static idct buffers...

How about something along these lines:

(I was thinking about a #define ALIGN <something> and using that to name
the function; but I need to check cpp docs on how to do that ;-)
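
For the naming part, the usual cpp trick is token pasting through one
extra macro level, so that ALIGN is expanded before ## is applied. A
sketch (PASTE, XPASTE, MALIGN, STR, XSTR and malign_name are names
invented here for illustration):

```c
#include <stddef.h>

#define ALIGN 16

#define PASTE(a, b)  a##b
#define XPASTE(a, b) PASTE(a, b)        /* extra level: expands ALIGN first */
#define MALIGN       XPASTE(malign_, ALIGN)

#define STR(x)  #x
#define XSTR(x) STR(x)

/* Declares malign_16() when ALIGN is 16, malign_32() when it is 32, ... */
void *MALIGN(size_t count);

/* The generated name as a string, handy for debug messages. */
static const char malign_name[] = XSTR(MALIGN);
```

Without the XPASTE indirection, PASTE(malign_, ALIGN) would produce the
literal name malign_ALIGN, because the operands of ## are not
macro-expanded.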

diff -uNr vlc-snapshot-20010829/include/malloc-aligned.h vlc-snapshot-20010829-altivec/include/malloc-aligned.h
--- vlc-snapshot-20010829/include/malloc-aligned.h      Thu Jan  1 01:00:00 1970
+++ vlc-snapshot-20010829-altivec/include/malloc-aligned.h      Wed Sep  5 18:17:38 2001
@@ -0,0 +1,22 @@
+#define ALIGN 16
+
+#ifndef __USE_XOPEN2K
+#define __USE_XOPEN2K
+#include <stdlib.h>
+#undef __USE_XOPEN2K
+#else
+#include <stdlib.h>
+#endif
+
+#ifndef __USE_ISOC99
+#define __USE_ISOC99
+#include <math.h>
+#undef __USE_ISOC99
+#else
+#include <math.h>
+#endif
+
+#define malloc malign_16
+
+void * malign_16 ( size_t );
+
diff -uNr vlc-snapshot-20010829/src/misc/malloc.c vlc-snapshot-20010829-altivec/src/misc/malloc.c
--- vlc-snapshot-20010829/src/misc/malloc.c     Thu Jan  1 01:00:00 1970
+++ vlc-snapshot-20010829-altivec/src/misc/malloc.c     Wed Sep  5 18:28:36 2001
@@ -0,0 +1,18 @@
+#include "malloc-aligned.h"
+#include <stdio.h>
+
+void * malign_16( size_t count) {
+       void * aligned = NULL;
+       size_t size;
+
+       size = (count / sizeof(void *) + 1) * sizeof(void *);
+       size = exp2( ((int)log2(size) + 1) );
+       if (posix_memalign(&aligned, 16, size) != 0)
+               return NULL;
+       else {
+               if ((unsigned long)aligned & 0xf)
+                       printf ("Alarm: malloc not aligned: %p.\n", aligned);
+               return aligned;
+       }
+}
+

(Warning: needs to be cleaned up.) Right now I'm including
"malloc-aligned.h" in some generic header file, so that all malloc()
calls get #define'd to malign_16(), and voilà... But the long-term fix
is to call malign_16() directly in all critical places. Maybe you could
help identify the critical places...
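
As a sketch of what a direct call in one such critical place could look
like: the helper names below (is_vec_aligned, alloc_dct_block) are
hypothetical, and it uses the documented _POSIX_C_SOURCE feature-test
macro to get posix_memalign() instead of poking at glibc's internal
__USE_XOPEN2K. The macro has to be defined before the first system
header is included.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* request posix_memalign() */
#endif
#include <stdint.h>
#include <stdlib.h>

#define VEC_ALIGN 16              /* Altivec loads/stores want 16 bytes */

/* Returns nonzero if p sits on a VEC_ALIGN-byte boundary. */
static int is_vec_aligned(const void *p)
{
    return ((uintptr_t)p & (VEC_ALIGN - 1)) == 0;
}

/* Hypothetical helper: one 8x8 block of 16-bit coefficients, aligned
 * for vector access. Returns NULL on failure; release with free(). */
static int16_t *alloc_dct_block(void)
{
    void *p = NULL;

    if (posix_memalign(&p, VEC_ALIGN, 64 * sizeof(int16_t)) != 0)
        return NULL;
    return (int16_t *)p;
}
```

posix_memalign() does not need the size rounded up to a power of two,
so the log2/exp2 step from the patch is left out here.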

> I got myself an account on a g4 but I dont have an altivec-enabled gcc
> on it... so I may wait a few more days before I try my idct ideas :)

Emmm... may I humbly suggest avoiding the Altivec-enabled gcc if at all
possible? Right now the only available Altivec-enabled compiler (on
Linux, anyway) is an old gcc-2.95 hacked on by Motorola, and it seems
the gcc community doesn't like that implementation too much.

On the other hand, the current gas does understand all (well, most
anyway) Altivec assembler instructions; so that one can be used right
away and lets us avoid the problematic gcc.

I'm currently hacking on the Altivec-capable-tools detection in
configure.in (separated the Altivec modules in ALTIVEC_MODULES_C and
ALTIVEC_MODULES_ASM), so for me the motion compensation is not
available. If time permits, I'll translate it into asm.

Also, the runtime test for Altivec availability needs to be changed for
Linux. What is the best way to make the detection code depend on the OS,
so that I can conditionally compile in Linux detection code? In my case,
I need to check the output of /proc/cpuinfo.
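
A rough sketch of what I have in mind, keyed on the compiler's
predefined macros. The function names are invented here, and I am
assuming that the kernel mentions "altivec" somewhere in /proc/cpuinfo
on capable CPUs (as in "cpu : 7400, altivec supported"); that needs to
be verified on real machines.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if one line of cpuinfo text mentions altivec. */
static int line_mentions_altivec(const char *line)
{
    return strstr(line, "altivec") != NULL;
}

/* Hypothetical runtime probe: only the Linux/PowerPC branch is filled
 * in; other OSes would add their own #elif with their own method. */
int has_altivec(void)
{
#if defined(__linux__) && defined(__powerpc__)
    char line[256];
    int found = 0;
    FILE *f = fopen("/proc/cpuinfo", "r");

    if (f == NULL)
        return 0;               /* cannot tell: assume no Altivec */
    while (!found && fgets(line, sizeof line, f) != NULL)
        found = line_mentions_altivec(line);
    fclose(f);
    return found;
#else
    return 0;                   /* no probe for this OS/CPU yet */
#endif
}
```

The alternative would be a configure-time define per OS instead of the
compiler macros, which might fit better with the rest of configure.in.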

Cheers

Michel

PS tried the current CVS; Altivec IDCT doesn't work :-( Probably
alignment problems.

-------------------------------------------------------------------------
Michel Lanners                 |  " Read Philosophy.  Study Art.
23, Rue Paul Henkes            |    Ask Questions.  Make Mistakes.
L-1710 Luxembourg              |
email   mlan at cpu.lu            |
http://www.cpu.lu/~mlan        |                     Learn Always. "