[vlc-devel] [PATCH 0/6] COPY (YV12/NV12) and VAAPI performance improvements
Jean-Yves Avenard
jyavenard at gmail.com
Fri Jun 13 14:02:35 CEST 2014
From: Jean-Yves Avenard <jyavenard at mythtv.org>
Hello.
Please find some modifications on video_chroma/copy.c and vaapi.
This is my first attempt at submitting patches for VLC, so apologies in advance if I didn't do things properly.
I certainly followed the wiki and instructions provided on IRC!
Core change on copy.c is to remove the need for an external memory buffer.
That buffer was used as intermediary location like so:
source -> buffer -> destination
I rewrote it so buffer is no longer required. The code will automatically select the most appropriate method to deal with the alignment on both the destination and the source memory locations.
This yield under most cases in a 100% speed improvements.
Unit tests were written to test all corner cases:
- Various stride sizes
- Various alignments (on both source and destination)
- Various resolutions
Let me know if you want to have a look at those.
To simply test the speed, you can use this:
----- CUT BEGIN
#include "config.h"
#include <vlc_common.h>
#include <vlc_picture.h>
#include "copy.h"
#define OLD_API 0
unsigned vlc_CPU(void)
{
return 0x6fe8;
}
int main(int argc, char **argv)
{
#if OLD_API
copy_cache_t cache;
#endif
picture_t dst;
if (argc != 4)
{
fprintf(stderr, "bad number of argument. Usage test n w h\n");
exit(0);
}
int NUM = atoi(argv[1]);
int WIDTH = atoi(argv[2]);
int HEIGHT = atoi(argv[3]);
int STRIDESRC = (WIDTH + 63) & ~63;
int STRIDEDST = WIDTH;
printf("testing %dx%d (%dx%d) into %dx%d\n", WIDTH, HEIGHT, STRIDESRC, HEIGHT, WIDTH, HEIGHT);
#if OLD_API
CopyInitCache(&cache, WIDTH);
#endif
size_t srcbufsize = STRIDESRC * HEIGHT + STRIDESRC * HEIGHT / 2;
size_t dstbufsize = STRIDEDST * HEIGHT + STRIDEDST / 2 * HEIGHT / 2;
uint8_t *srcbuf = (uint8_t*)malloc(srcbufsize);
uint8_t *dstbuf = (uint8_t*)malloc(dstbufsize);
dst.p[0].i_pitch = STRIDEDST;
dst.p[1].i_pitch = STRIDEDST / 2;
dst.p[2].i_pitch = STRIDEDST / 2;
dst.p[0].p_pixels = srcbuf;
dst.p[1].p_pixels = dst.p[0].p_pixels + dst.p[0].i_pitch * HEIGHT;
dst.p[2].p_pixels = dst.p[1].p_pixels + dst.p[1].i_pitch * HEIGHT / 2;
uint8_t *src[2] = { srcbuf, srcbuf + STRIDESRC * HEIGHT };
size_t src_pitch[2] = { STRIDESRC, STRIDESRC };
for (int i = 0; i < NUM; i++)
{
CopyFromNv12(&dst, src, src_pitch,
WIDTH, HEIGHT
#if OLD_API
, &cache
#endif
);
}
#if OLD_API
CopyCleanCache(&cache);
#endif
free(srcbuf);
free(dstbuf);
printf("done\n");
}
----- CUT END
compile it like so:
place it in modules/video_chroma, cd into that directory
gcc -g -std=gnu99 -o test.o -I../../ -I../../include -DHAVE_CONFIG_H -c test.c
gcc -g -std=gnu99 -o copy.o -I../.. -I../../include -DHAVE_CONFIG_H -c copy.c
gcc -o test copy.o test.o
test takes 3 arguments, how many conversions to make, width and height
frames used have strides that are 32 bytes aligned: e.g.
720x576, create a 720x576 images with 768 and 384. Reason for this choice is that's what VAAPI vaDeriveImage or vaGetImage creates
benchmarks:
time ./test 10000 720 576
Original:
real 0m3.314s
user 0m3.300s
sys 0m0.012s
New:
real 0m1.448s
user 0m1.439s
sys 0m0.007s
time ./test 10000 1920 1080
Original:
real 0m9.414s
user 0m9.371s
sys 0m0.036s
New:
real 0m4.734s
user 0m4.711s
sys 0m0.017s
Now, I have left the SSE accelerated CopyPlane routine. In my various tests, on various platforms, it provides no benefit, quite the opposite
using the C version yields much better performance, for example in the test above,
replacing the call to SSE_CopyPlane with the C CopyPlane, gives me:
real 0m0.986s
user 0m0.980s
sys 0m0.005s
(that's on an i7-4650 Haswell, with HyperThreading disabled)
Now there may be religious reasons for believing we can do a better job than memcpy.. I won't get involved.
In regards to the other changes.
For the VAAPI codec module, I've made vaGetImage be use in priority. vaDeriveImage returns (at least with Intel and AMD drivers) a NV12 image, which is then converted into a YV12 frame.
The cost to perform that conversion outweigh any of the benefits provided by vaDeriveImage.
Just for the beauty of things, I changed so vaDeriveImage will be used if it returns a YV12 image. In practive this never happens (and not sure it ever will). Intel VAAPI backend certainly doesn't
As there's no need for the copy_cache_t object in the NV12->YV12 conversion, I've removed the type and upgraded all modules making use of it.
Not having access to the hardware, I've been unable to test the dxva and omx codecs.
Though the confidence that it will be okay is high.
That's it for me.
all the best.
Flame On.
Jean-Yves
Jean-Yves Avenard (6):
vaapi: remove unused variable
vaapi: use proper official fourcc constants
copy: remove need for cache memory in SSE routines
vaapi: prefer vaGetImage over vaDeriveImage under most circumstances.
copy: drop requirement for a memory cache for NV12/YV12 frames copies
copy: remove compilation warnings on some systems
modules/codec/avcodec/dxva2.c | 19 +--
modules/codec/avcodec/vaapi.c | 58 +++++---
modules/codec/avcodec/vda.c | 19 +--
modules/codec/omxil/utils.c | 17 +--
modules/video_chroma/copy.c | 302 +++++++++++++++++-------------------------
modules/video_chroma/copy.h | 16 +--
6 files changed, 163 insertions(+), 268 deletions(-)
--
1.8.5.2 (Apple Git-48)
More information about the vlc-devel
mailing list