[vlc-devel] [PATCH 3/6] copy: remove need for cache memory in SSE routines

Sun Jun 15 04:20:47 CEST 2014

Hi

[replying again as I didn't use reply-to-all earlier on]

On Saturday, June 14, 2014, Jean-Baptiste Kempf <jb at videolan.org> wrote:
> So, you remove the CopyFromUswc part, right?

The copy methods are the same as before, i.e. the same set of
instructions that used to copy from the source to the 4kB buffer now
copies from the source straight to the destination.

The difference is that it no longer uses an intermediate buffer as a
go-between.

If the destination is 16-byte aligned, it's really identical to the
earlier version; if not, it copies the first few bytes (15 max) in C
and then reverts to the fast copy (using MOVNTDQA or MOVNTDQ).
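
To make the head/body split above concrete, here is a minimal sketch.
It is not the code from the patch: copy_align_dst is a hypothetical
name, and I assume an unaligned source load followed by a streaming
store (MOVNTDQ) on SSE2 builds, with a plain memcpy fallback elsewhere.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper, not the routine from the patch. */
static void copy_align_dst(uint8_t *dst, const uint8_t *src, size_t len)
{
    /* Bytes needed to bring dst up to a 16-byte boundary (0..15). */
    size_t head = (size_t)(-(uintptr_t)dst & 15);
    if (head > len)
        head = len;

    /* Unaligned head: plain C copy. */
    for (size_t i = 0; i < head; i++)
        dst[i] = src[i];
    dst += head;
    src += head;
    len -= head;

    /* Body: dst is now 16-byte aligned, copy 16 bytes at a time. */
    size_t body = len & ~(size_t)15;
#if defined(__SSE2__)
    for (size_t i = 0; i < body; i += 16) {
        /* Assumption: src may be unaligned, so use an unaligned load. */
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_stream_si128((__m128i *)(dst + i), v); /* MOVNTDQ */
    }
    _mm_sfence(); /* make the non-temporal stores visible */
#else
    memcpy(dst, src, body);
#endif

    /* Tail: whatever is left after the last full 16-byte block. */
    memcpy(dst + body, src + body, len - body);
}
```

The head loop runs at most 15 times, so its cost should be negligible
next to the 16-byte body loop on any realistic line width.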

>
> Is that not too slow when using DxVA?
> https://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers

It's a very easy routine to time and benchmark.
While I haven't tried it with frames coming from a DxVA decode, I did
use it via vaapi, which I believe is extremely similar.

I can't imagine how caching would help quite frankly. The data you are
copying changes every time, so there's nothing worth caching.

And even then, assuming caching had some use, the time to copy from
the source to the cache would be the same as copying directly from the
source to the destination.

Reading the article you posted is very interesting. I should point
out, however, that the white paper says the 4kB memory buffer has to
be 64-byte aligned to get the benefit. It wasn't in the current code:
it is only 16-byte aligned, which would defeat the theory described
in the white paper.
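
For illustration only, this is roughly what a 64-byte aligned 4kB
buffer could look like. alloc_cache and is_aligned are hypothetical
helpers, not VLC functions, and I assume a C11 libc that provides
aligned_alloc.

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_ALIGN 64   /* alignment discussed above */
#define CACHE_SIZE  4096 /* 4kB buffer */

/* Returns non-zero if p is a multiple of align. */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p % align) == 0;
}

/* Hypothetical allocator for the intermediate buffer.
 * aligned_alloc requires the size to be a multiple of the alignment,
 * which holds for 4096 and 64. Release the result with free(). */
static void *alloc_cache(void)
{
    return aligned_alloc(CACHE_ALIGN, CACHE_SIZE);
}
```

A 64-byte aligned pointer is always 16-byte aligned as well, but not
the other way round, which is the gap described above.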

http://git.videolan.org/?p=vlc.git;a=blob;f=modules/video_chroma/copy.c;h=d29843c037e494170f0d6bc976bea8439dd6115b;hb=HEAD#l39

That probably could explain some of the performance difference I was seeing.

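Here are timings as used within vaapi.c.
I measured the execution time of the entire decoding routine using the
attached patch. (Note that the totals are accumulated in nanoseconds,
even though the patch's format string labels them "us".)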
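
As a side note, the same kind of measurement can be sketched as a
small helper. This is not what the attached patch does: the function
names are hypothetical, and I use CLOCK_MONOTONIC here on the
assumption that it is less sensitive to wall-clock adjustments than
CLOCK_REALTIME.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for clock_gettime */
#endif
#include <stdint.h>
#include <time.h>

/* Elapsed time between two timestamps, in nanoseconds. */
static int64_t timespec_diff_ns(struct timespec start, struct timespec end)
{
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (int64_t)(end.tv_nsec - start.tv_nsec);
}

/* Convert nanoseconds to whole microseconds. */
static int64_t ns_to_us(int64_t ns)
{
    return ns / 1000;
}

/* Hypothetical wrapper: time one call of fn(arg), result in ns. */
static int64_t measure_ns(void (*fn)(void *), void *arg)
{
    struct timespec tstart, tend;

    clock_gettime(CLOCK_MONOTONIC, &tstart);
    fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &tend);
    return timespec_diff_ns(tstart, tend);
}
```

Summing measure_ns() over many calls and printing ns_to_us() of the
total would give a figure that really is in microseconds.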

Here are my results, and boy was that surprising...
Each test was run 3 times, and I took the worst result for the new
code and the best result for the old one.


intel i7-4650U with Intel HD5000

NV12->YV12 conversion
1080p h264 video, with stride == width
original:
4142468125 us after 2000 runs
new:
21932062106 us after 2000 runs
gain: -81%

720x576 mpeg video, with 768px strides:
original:
5623063156 us after 1400 runs

new:
8670008721 us after 1400 runs
gain: -35%

With YV12->YV12 plain copy (patch 4/6 applied)
1080p h264 video, with stride == width
original:
1875732509 us after 2000 runs
new:
1387730501 us after 2000 runs
gain: 35.1%

720x576 mpeg video, with 768px strides:
original:
565602914 us after 1400 runs

new:
284509124 us after 1400 runs
gain: 98% (average was 118% faster)
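
For anyone wanting to redo the arithmetic: the gain figures above
appear to be consistent with original/new - 1, expressed as a
percentage. The helper below is only a sketch of that calculation and
its name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: relative gain of the new total over the
 * original one, in percent. Positive means the new code is faster,
 * negative means it is slower. */
static double gain_percent(int64_t orig_total, int64_t new_total)
{
    return ((double)orig_total / (double)new_total - 1.0) * 100.0;
}
```

For example, gain_percent(4142468125, 21932062106) comes out at
roughly -81, matching the first NV12->YV12 figure.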

So back to the drawing board...
While the YV12->YV12 copy improved significantly, NV12->YV12
certainly didn't...
Patch 4/6 is definitely one to apply, as it speeds up vaapi decoding
by a factor of around 10 (and I see a rather significant drop in CPU
usage on my machine).

I'll split YV12 and NV12 patches for the time being, as it does
improve plain copy, and rework the deinterleaving part.
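
For reference, the deinterleaving step amounts to the following in
plain C. This is a scalar sketch, not the SSE routine from copy.c;
split_uv and its parameters are hypothetical names, and pitches are
in bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: split the interleaved UV plane of an
 * NV12 picture into the separate U and V planes a YV12 picture needs.
 * chroma_width is the number of U (or V) samples per line. */
static void split_uv(uint8_t *dst_u, size_t dst_u_pitch,
                     uint8_t *dst_v, size_t dst_v_pitch,
                     const uint8_t *src_uv, size_t src_pitch,
                     size_t chroma_width, size_t chroma_height)
{
    for (size_t y = 0; y < chroma_height; y++) {
        const uint8_t *s = src_uv + y * src_pitch;
        uint8_t *u = dst_u + y * dst_u_pitch;
        uint8_t *v = dst_v + y * dst_v_pitch;

        for (size_t x = 0; x < chroma_width; x++) {
            u[x] = s[2 * x];     /* even bytes carry U (Cb) */
            v[x] = s[2 * x + 1]; /* odd bytes carry V (Cr) */
        }
    }
}
```

Unlike the plain copy, every source byte has to be routed to one of
two destinations, so this path is not a straight block transfer.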

Luckily, NV12->YV12 is no longer used in practice if patch 4/6 is
applied (except on the OMX platform, but that's only with SSE, and
which OMX platform uses SSE anyway? Aren't they all ARM-based?)

Thanks
Very interesting indeed.
-------------- next part --------------

diff --git a/modules/codec/avcodec/vaapi.c b/modules/codec/avcodec/vaapi.c
index 204e8da..99e8375 100644
--- a/modules/codec/avcodec/vaapi.c
+++ b/modules/codec/avcodec/vaapi.c
@@ -49,6 +49,8 @@
 #include "va.h"
 #include "../../video_chroma/copy.h"
 
+#include <time.h>
+
 #ifndef VA_SURFACE_ATTRIB_SETTABLE
 #define vaCreateSurfaces(d, f, w, h, s, ns, a, na) \
     vaCreateSurfaces(d, w, h, f, ns, s)
@@ -491,6 +493,12 @@ static int Extract( vlc_va_t *va, picture_t *p_picture, void *opaque,
     if( vaMapBuffer( sys->p_display, sys->image.buf, &p_base ) )
         return VLC_EGENERIC;
 
+    static long long totaltime = 0LL;
+    static int runs = 0;
+    struct timespec tstart, tend;
+
+    clock_gettime(CLOCK_REALTIME, &tstart);
+
     const uint32_t i_fourcc = sys->image.format.fourcc;
     if( i_fourcc == VA_FOURCC_YV12 ||
         i_fourcc == VA_FOURCC_IYUV )
@@ -526,6 +534,14 @@ static int Extract( vlc_va_t *va, picture_t *p_picture, void *opaque,
                       sys->i_surface_height,
                       &sys->image_cache );
     }
+    clock_gettime(CLOCK_REALTIME, &tend);
+    totaltime += (1000000000LL * tend.tv_sec + tend.tv_nsec) -
+        (1000000000LL * tstart.tv_sec + tstart.tv_nsec);
+    runs++;
+    if (runs % 100 == 0)
+    {
+        fprintf(stderr, "%lld us after %d runs\n", totaltime, runs);
+    }
 
     if( vaUnmapBuffer( sys->p_display, sys->image.buf ) )
         return VLC_EGENERIC;