[vlc-devel] [PATCH 0/6] COPY (YV12/NV12) and VAAPI performance improvements

Fri Jun 13 14:02:35 CEST 2014

From: Jean-Yves Avenard <jyavenard at mythtv.org>

Hello.

Please find some modifications on video_chroma/copy.c and vaapi.

This is my first attempt at submitting patches for VLC, so apologies in advance if I didn't do things properly.
I certainly followed the wiki and instructions provided on IRC!

The core change in copy.c removes the need for an external memory buffer.
That buffer was used as an intermediate location, like so:
source -> buffer -> destination

I rewrote it so that the buffer is no longer required. The code automatically selects the most appropriate method to deal with the alignment of both the destination and the source memory locations.
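
As a rough illustration of what selecting a method from the alignment can look like, here is a minimal sketch. It is not the code from the patch: `copy_path_t`, `is_aligned16` and `select_copy_path` are hypothetical names, and the 16-byte boundary is an assumption based on what SSE aligned loads and stores require.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration only; not taken from copy.c. */
typedef enum {
    COPY_BOTH_ALIGNED,  /* aligned loads and aligned stores */
    COPY_SRC_ALIGNED,   /* aligned loads, unaligned stores */
    COPY_DST_ALIGNED,   /* unaligned loads, aligned stores */
    COPY_UNALIGNED      /* unaligned loads and unaligned stores */
} copy_path_t;

/* True if the base pointer and the pitch are both multiples of 16 bytes,
 * which means every row of the plane starts on a 16-byte boundary. */
static int is_aligned16(const void *p, size_t pitch)
{
    return (((uintptr_t)p | pitch) & 15) == 0;
}

/* Pick a copy path from the alignment of both planes. */
static copy_path_t select_copy_path(const void *dst, size_t dst_pitch,
                                    const void *src, size_t src_pitch)
{
    int dst_ok = is_aligned16(dst, dst_pitch);
    int src_ok = is_aligned16(src, src_pitch);

    if (dst_ok && src_ok)
        return COPY_BOTH_ALIGNED;
    if (src_ok)
        return COPY_SRC_ALIGNED;
    if (dst_ok)
        return COPY_DST_ALIGNED;
    return COPY_UNALIGNED;
}
```

The pointers are only inspected, never dereferenced, so the check can be done once per plane before the row loop starts.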

In most cases this yields a 100% speed improvement.
Unit tests were written to test all corner cases:
- Various stride sizes
- Various alignments (on both source and destination)
- Various resolutions

Let me know if you want to have a look at those.
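
For readers who want an idea of the shape of such tests, here is a small sketch under stated assumptions. It is not the actual unit tests: `plane_copy_fn`, `plane_copy_bytes`, `check_plane_copy` and `run_corner_cases` are hypothetical names, and the byte-by-byte routine is only a stand-in so the sketch runs on its own.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Signature of the routine under test (hypothetical, for illustration). */
typedef void (*plane_copy_fn)(uint8_t *dst, size_t dst_pitch,
                              const uint8_t *src, size_t src_pitch,
                              unsigned width, unsigned height);

/* Stand-in routine: copies one byte at a time. */
static void plane_copy_bytes(uint8_t *dst, size_t dst_pitch,
                             const uint8_t *src, size_t src_pitch,
                             unsigned width, unsigned height)
{
    for (unsigned y = 0; y < height; y++)
        for (unsigned x = 0; x < width; x++)
            dst[y * dst_pitch + x] = src[y * src_pitch + x];
}

/* Returns 1 if the visible pixels were copied exactly and the destination
 * padding was left untouched, 0 otherwise. The offsets shift the planes
 * away from the malloc'ed base address to vary their alignment. */
static int check_plane_copy(plane_copy_fn copy, unsigned width, unsigned height,
                            size_t src_pitch, size_t dst_pitch,
                            size_t src_off, size_t dst_off)
{
    size_t src_size = src_off + src_pitch * height;
    size_t dst_size = dst_off + dst_pitch * height;
    uint8_t *src_buf = malloc(src_size);
    uint8_t *dst_buf = malloc(dst_size);
    int ok = src_buf != NULL && dst_buf != NULL;

    if (ok) {
        for (size_t i = 0; i < src_size; i++)
            src_buf[i] = (uint8_t)(i * 31 + 7);   /* recognizable pattern */
        memset(dst_buf, 0xAA, dst_size);          /* sentinel for padding */
        copy(dst_buf + dst_off, dst_pitch, src_buf + src_off, src_pitch,
             width, height);
        for (unsigned y = 0; ok && y < height; y++) {
            const uint8_t *s = src_buf + src_off + y * src_pitch;
            const uint8_t *d = dst_buf + dst_off + y * dst_pitch;
            ok = memcmp(d, s, width) == 0;
            for (size_t x = width; ok && x < dst_pitch; x++)
                ok = d[x] == 0xAA;
        }
    }
    free(src_buf);
    free(dst_buf);
    return ok;
}

/* Sweep resolutions, stride paddings and alignment offsets. */
static int run_corner_cases(plane_copy_fn copy)
{
    static const unsigned res[][2] = { {16, 2}, {33, 5}, {100, 36} };
    static const size_t pad[] = { 0, 1, 16, 47 };
    static const size_t off[] = { 0, 1, 8, 15 };

    for (size_t r = 0; r < sizeof(res) / sizeof(res[0]); r++)
        for (size_t sp = 0; sp < 4; sp++)
            for (size_t dp = 0; dp < 4; dp++)
                for (size_t so = 0; so < 4; so++)
                    for (size_t d = 0; d < 4; d++)
                        if (!check_plane_copy(copy, res[r][0], res[r][1],
                                              res[r][0] + pad[sp],
                                              res[r][0] + pad[dp],
                                              off[so], off[d]))
                            return 0;
    return 1;
}
```

Calling `run_corner_cases(plane_copy_bytes)` exercises the stand-in; the routine actually being validated would be passed in its place.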
To simply test the speed, you can use this:

----- CUT BEGIN
#include "config.h"
#include <vlc_common.h>
#include <vlc_picture.h>
#include "copy.h"

#define OLD_API 0

unsigned vlc_CPU(void)
{
    return 0x6fe8;
}

int main(int argc, char **argv)
{
#if OLD_API
    copy_cache_t cache;
#endif
    picture_t dst;

    if (argc != 4)
    {
        fprintf(stderr, "bad number of arguments. Usage: test n w h\n");
        exit(1);
    }

    int NUM = atoi(argv[1]);
    int WIDTH = atoi(argv[2]);
    int HEIGHT = atoi(argv[3]);
    int STRIDESRC = (WIDTH + 63) & ~63;
    int STRIDEDST = WIDTH;

    printf("testing %dx%d (%dx%d) into %dx%d\n", WIDTH, HEIGHT, STRIDESRC, HEIGHT, WIDTH, HEIGHT);

#if OLD_API
    CopyInitCache(&cache, WIDTH);
#endif

    /* NV12 source: Y plane plus a half-height interleaved UV plane */
    size_t srcbufsize = STRIDESRC * HEIGHT + STRIDESRC * HEIGHT / 2;
    /* YV12 destination: Y plane plus two quarter-size chroma planes */
    size_t dstbufsize = STRIDEDST * HEIGHT + 2 * (STRIDEDST / 2 * HEIGHT / 2);

    uint8_t *srcbuf = (uint8_t*)malloc(srcbufsize);
    uint8_t *dstbuf = (uint8_t*)malloc(dstbufsize);

    dst.p[0].i_pitch = STRIDEDST;
    dst.p[1].i_pitch = STRIDEDST / 2;
    dst.p[2].i_pitch = STRIDEDST / 2;

    /* The destination planes must live in dstbuf, not in the source buffer */
    dst.p[0].p_pixels = dstbuf;
    dst.p[1].p_pixels = dst.p[0].p_pixels + dst.p[0].i_pitch * HEIGHT;
    dst.p[2].p_pixels = dst.p[1].p_pixels + dst.p[1].i_pitch * HEIGHT / 2;

    uint8_t *src[2] = { srcbuf, srcbuf + STRIDESRC * HEIGHT };
    size_t src_pitch[2] = { STRIDESRC, STRIDESRC };

    for (int i = 0; i < NUM; i++)
    {
        CopyFromNv12(&dst, src, src_pitch,
                     WIDTH, HEIGHT
#if OLD_API
                     , &cache
#endif
                     );
    }

#if OLD_API
    CopyCleanCache(&cache);
#endif
    free(srcbuf);
    free(dstbuf);
    printf("done\n");
}
----- CUT END

To compile it, save it as test.c in modules/video_chroma, cd into that directory, and run:
gcc -g -std=gnu99 -o test.o -I../../ -I../../include -DHAVE_CONFIG_H -c test.c
gcc -g -std=gnu99 -o copy.o -I../.. -I../../include -DHAVE_CONFIG_H -c copy.c
gcc -o test copy.o test.o

test takes 3 arguments: the number of conversions to perform, the width and the height.
The frames used have strides that are 64-byte aligned (the test rounds the width up with (WIDTH + 63) & ~63), e.g.
for 720x576 it creates a 720x576 image with strides of 768 and 384. The reason for this choice is that this is what VAAPI vaDeriveImage or vaGetImage creates.
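
The stride and buffer-size arithmetic used by the test program can be sketched as follows. The helper names `align_up`, `nv12_buffer_size` and `yv12_buffer_size` are hypothetical and only meant to spell out the calculation; the rounding itself is the `(WIDTH + 63) & ~63` expression from the test.

```c
#include <stddef.h>

/* Round width up to the next multiple of align.
 * align must be a power of two for the mask trick to work. */
static size_t align_up(size_t width, size_t align)
{
    return (width + align - 1) & ~(align - 1);
}

/* NV12: a full-height Y plane followed by a half-height plane of
 * interleaved U and V samples, both using the same stride. */
static size_t nv12_buffer_size(size_t stride, size_t height)
{
    return stride * height + stride * height / 2;
}

/* YV12: a full-height Y plane followed by two chroma planes, each with
 * half the stride and half the height. */
static size_t yv12_buffer_size(size_t stride, size_t height)
{
    return stride * height + 2 * ((stride / 2) * (height / 2));
}
```

With a width of 720, `align_up(720, 64)` gives 768, which matches the luma stride quoted above; 1920 is already a multiple of 64 and is left unchanged.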

benchmarks:
time ./test 10000 720 576
Original:
real	0m3.314s
user	0m3.300s
sys	0m0.012s

New:
real	0m1.448s
user	0m1.439s
sys	0m0.007s

time ./test 10000 1920 1080
Original:
real	0m9.414s
user	0m9.371s
sys	0m0.036s

New:
real	0m4.734s
user	0m4.711s
sys	0m0.017s

Now, I have left the SSE-accelerated CopyPlane routine in place. In my various tests, on various platforms, it provides no benefit; quite the opposite:
using the C version yields much better performance. For example, in the test above,
replacing the call to SSE_CopyPlane with the C CopyPlane gives me:
real	0m0.986s
user	0m0.980s
sys	0m0.005s
(that's on an i7-4650 Haswell, with HyperThreading disabled)
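
For reference, a plain C plane copy built on memcpy can be as small as the sketch below. This is an illustration, not a copy of the routine in copy.c, and `copy_plane_c` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a plane row by row. Only the visible width is copied, so the
 * source and destination pitches may differ. */
static void copy_plane_c(uint8_t *dst, size_t dst_pitch,
                         const uint8_t *src, size_t src_pitch,
                         unsigned width, unsigned height)
{
    for (unsigned y = 0; y < height; y++) {
        memcpy(dst, src, width);
        src += src_pitch;
        dst += dst_pitch;
    }
}
```

Leaving the inner loop to memcpy lets the C library pick whatever copy strategy it considers best for the CPU and the row length.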

Now there may be religious reasons for believing we can do a better job than memcpy; I won't get involved.

Regarding the other changes:

For the VAAPI codec module, I've made vaGetImage the preferred method. vaDeriveImage returns (at least with Intel and AMD drivers) an NV12 image, which is then converted into a YV12 frame.
The cost of performing that conversion outweighs any of the benefits provided by vaDeriveImage.
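
To make that cost concrete, here is a minimal plain C sketch of what an NV12 to YV12 conversion has to do. It is not the code from copy.c: `nv12_to_planar` is a hypothetical name, and even width and height are assumed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* NV12 stores chroma as one plane of interleaved U,V byte pairs, while
 * YV12 stores V and U in two separate planes. The caller passes the
 * destination chroma pointers in whatever order its layout requires.
 * Assumes width and height are even. */
static void nv12_to_planar(uint8_t *dst_y, size_t dst_y_pitch,
                           uint8_t *dst_u, uint8_t *dst_v, size_t dst_c_pitch,
                           const uint8_t *src_y, size_t src_y_pitch,
                           const uint8_t *src_uv, size_t src_uv_pitch,
                           unsigned width, unsigned height)
{
    /* The luma plane is a straight row-by-row copy. */
    for (unsigned y = 0; y < height; y++)
        memcpy(dst_y + y * dst_y_pitch, src_y + y * src_y_pitch, width);

    /* The chroma plane has to be split: even bytes are U, odd bytes are V. */
    for (unsigned y = 0; y < height / 2; y++) {
        const uint8_t *uv = src_uv + y * src_uv_pitch;
        uint8_t *u = dst_u + y * dst_c_pitch;
        uint8_t *v = dst_v + y * dst_c_pitch;

        for (unsigned x = 0; x < width / 2; x++) {
            u[x] = uv[2 * x];
            v[x] = uv[2 * x + 1];
        }
    }
}
```

Every chroma byte is read and written individually here, which is the part an SSE implementation speeds up and the part that is avoided entirely when the image is already planar.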

Just for the beauty of things, I changed it so that vaDeriveImage will be used if it returns a YV12 image. In practice this never happens (and I'm not sure it ever will). The Intel VAAPI backend certainly doesn't.

As there's no need for the copy_cache_t object in the NV12->YV12 conversion, I've removed the type and updated all modules making use of it.
Not having access to the hardware, I've been unable to test the dxva and omx codecs.
I am nonetheless fairly confident that they will be fine.

That's it for me.

all the best.
Flame On.

Jean-Yves

Jean-Yves Avenard (6):
  vaapi: remove unused variable
  vaapi: use proper official fourcc constants
  copy: remove need for cache memory in SSE routines
  vaapi: prefer vaGetImage over vaDeriveImage under most circumstances.
  copy: drop requirement for a memory cache for NV12/YV12 frames copies
  copy: remove compilation warnings on some systems

 modules/codec/avcodec/dxva2.c |  19 +--
 modules/codec/avcodec/vaapi.c |  58 +++++---
 modules/codec/avcodec/vda.c   |  19 +--
 modules/codec/omxil/utils.c   |  17 +--
 modules/video_chroma/copy.c   | 302 +++++++++++++++++-------------------------
 modules/video_chroma/copy.h   |  16 +--
 6 files changed, 163 insertions(+), 268 deletions(-)

-- 
1.8.5.2 (Apple Git-48)