[x264-devel] [PATCH] Add all remaining 16x16 predict Altivec routines

Wed Jan 14 00:03:57 CET 2009

On 13 janv. 09, at 22:27, Alexander Strange wrote:

> It's accurate enough - PPC isn't severely out-of-order like x86, and the
> units aren't CPU ticks so won't change meaning if it speedsteps.
> But it doesn't necessarily count in nanoseconds; you have to call
> mach_timebase_info():
> http://developer.apple.com/qa/qa2004/qa1398.html#LISTMACHTIMEBASEINT

Thanks. Since checkasm is used for relative comparison, I guess I'll leave it
like this (it seems the iPhone is the only platform that doesn't use a
non-trivial timebase anyway).
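
For reference, the conversion Alexander mentions could be sketched roughly as
below. This is only an illustration, not part of the patch: ticks_to_ns() and
elapsed_ns() are hypothetical names, and the sketch assumes the numer/denom
pair reported by mach_timebase_info() is applied as ns = ticks * numer / denom.

```c
#include <assert.h>
#include <stdint.h>
#ifdef __APPLE__
#include <mach/mach_time.h>
#endif

/* Hypothetical helper: scale raw timebase ticks to nanoseconds.
 * The multiplication can overflow for very large tick counts, which is
 * acceptable for short benchmark intervals. */
static inline uint64_t ticks_to_ns( uint64_t ticks, uint32_t numer, uint32_t denom )
{
    assert( denom != 0 );
    return ticks * numer / denom;
}

#ifdef __APPLE__
/* Hypothetical helper: nanoseconds between two mach_absolute_time() readings. */
static inline uint64_t elapsed_ns( uint64_t start, uint64_t end )
{
    mach_timebase_info_data_t tb;
    mach_timebase_info( &tb );   /* fills in tb.numer and tb.denom */
    return ticks_to_ns( end - start, tb.numer, tb.denom );
}
#endif
```

When numer and denom are both 1 the ticks are already nanoseconds, and for
relative comparisons the scaling cancels out in any case.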

> The 0xFFFFFFFF looks unnecessary to me.

You're right. I just learned what the default behavior of a uint64_t to a
uint32_t cast is :-).
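
To spell that out: converting a uint64_t to a uint32_t reduces the value
modulo 2^32, so the explicit mask changes nothing. A minimal sketch (the
function names are made up for illustration and do not appear in checkasm):

```c
#include <assert.h>
#include <stdint.h>

/* Narrowing with an explicit mask, as in the first version of the patch. */
static inline uint32_t narrow_explicit( uint64_t t )
{
    return (uint32_t)( t & 0xFFFFFFFF );
}

/* Narrowing by plain conversion: C defines this as reduction modulo 2^32,
 * i.e. the low 32 bits are kept. */
static inline uint32_t narrow_implicit( uint64_t t )
{
    return (uint32_t)t;
}

/* Unsigned subtraction also wraps modulo 2^32, so the difference of two
 * truncated timestamps stays correct across a single counter wrap. */
static inline uint32_t elapsed_ticks( uint32_t start, uint32_t end )
{
    return end - start;
}

/* Small self-check tying the three helpers together. */
static inline void check_narrowing( void )
{
    const uint64_t t = 0x1234567890ABCDEFULL;
    assert( narrow_explicit( t ) == narrow_implicit( t ) );
    assert( elapsed_ticks( 0xFFFFFFF0u, 0x00000010u ) == 0x20u );
}
```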


On 13 janv. 09, at 23:50, Guillaume POIRIER wrote:

> I don't exactly have the same numbers over here (PPC970MP with GCC4.2 on
> Leopard), but it's close enough.

My own runs exhibit slight variations, on the order of +/- 1 for the shorter
functions and +/- 5 for the longest (intra_predict_16x16_p_c). Still, the
conclusions seem clear enough...

> I guess I'll have to drop intra_predict_16x16_h_altivec since I don't
> know how to make it faster with Altivec, even after some unrolling.
>
> However, it looks like doing some pseudo-64bits SIMD with general
> purpose registers allows this code to go faster on that machine.
>
> I'll experiment more with that later on.

Good luck !
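
In case it helps, here is roughly what I understand by "pseudo-64-bit SIMD"
for the horizontal predictor: replicate the left neighbour into a 64-bit
general-purpose register with a multiply, then store it twice per row. This
is only a sketch under my own assumptions (the function name and the explicit
stride parameter are hypothetical, and it is not benchmarked):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of a 16x16 horizontal predictor using 64-bit stores.
 * src points at the top-left pixel of the block, src[-1] is the left
 * neighbour of the current row, and stride is the distance in bytes
 * between rows. */
static inline void predict_16x16_h_gpr64( uint8_t *src, int stride )
{
    /* Rows must not overlap the neighbour column of the next row. */
    assert( src != NULL && stride > 16 );
    for( int i = 0; i < 16; i++ )
    {
        /* Replicate the left neighbour into all 8 bytes of a 64-bit word. */
        const uint64_t v = 0x0101010101010101ULL * src[-1];
        /* memcpy avoids alignment and aliasing assumptions; compilers
         * typically turn these into two plain 64-bit stores. */
        memcpy( src,     &v, 8 );
        memcpy( src + 8, &v, 8 );
        src += stride;
    }
}
```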

Antoine Gerschenfeld
gerschen at gmail.com

P.S.: New checkasm patch:

diff --git a/tools/checkasm.c b/tools/checkasm.c
index aeaf5fb..7825b97 100644
--- a/tools/checkasm.c
+++ b/tools/checkasm.c
@@ -30,6 +30,10 @@
#include "common/common.h"
#include "common/cpu.h"

+#ifdef SYS_MACOSX
+#include <mach/mach_time.h>
+#endif
+
/* buf1, buf2: initialised to random data and shouldn't write into them */
uint8_t * buf1, * buf2;
/* buf3, buf4: used to store output */
@@ -80,6 +84,8 @@ static inline uint32_t read_time(void)
     uint32_t a;
     asm volatile( "rdtsc" :"=a"(a) ::"edx" );
     return a;
+#elif defined(SYS_MACOSX)
+   return mach_absolute_time();
#else
     return 0;
#endif
@@ -153,7 +159,8 @@ static void print_bench(void)
                     /* print sse2slow only if there's also a sse2fast version of the same func */
                     b->cpu&X264_CPU_SSE2_IS_SLOW && j<MAX_CPUS && b[1].cpu&X264_CPU_SSE2_IS_FAST && !(b[1].cpu&X264_CPU_SSE3) ? "sse2slow" :
                     b->cpu&X264_CPU_SSE2 ? "sse2" :
-                    b->cpu&X264_CPU_MMX ? "mmx" : "c",
+                    b->cpu&X264_CPU_MMX ? "mmx" :
+                    b->cpu&X264_CPU_ALTIVEC ? "altivec" : "c",
                     b->cpu&X264_CPU_CACHELINE_32 ? "_c32" :
                     b->cpu&X264_CPU_CACHELINE_64 ? "_c64" :
                     b->cpu&X264_CPU_SSE_MISALIGN ? "_misalign" :
@@ -1448,7 +1455,7 @@ int main(int argc, char *argv[])

     if( argc > 1 && !strncmp( argv[1], "--bench", 7 ) )
     {
-#if !defined(ARCH_X86) && !defined(ARCH_X86_64)
+#if !defined(ARCH_X86) && !defined(ARCH_X86_64) && !defined(SYS_MACOSX)
         fprintf( stderr, "no --bench for your cpu until you port rdtsc\n" );
         return 1;
#endif