[vlc-devel] [PATCH] Added IVTC deinterlacer (NTSC film mode)

Fri Jan 7 03:08:33 CET 2011

Hi,

On 01/06/2011 11:56 PM, Laurent Aimar wrote:
> Hi,
>
> On Wed, Jan 05, 2011 at 11:04:40PM +0200, Juha Jeronen wrote:
>    
>> On which note, what do you think, should I make the lengthy explanation
>> of the filter doxygen compatible
>>      
>   Would be great.
>    
Will do.

>> , or should it be moved into a separate
>> file?
>>      
>   I don't think it is really needed. One day, I think it might be better to
> split the deinterlace filter into multiples files.
>    
Ok.

>> As far as the code is concerned, I wrote the filter from scratch. Parts
>> of VLC's own deinterlace.c were useful as a template (RenderBob,
>> RenderX, RenderYadif, and some MMX code), but that's it.
>>      
>   Great, just wanted to be sure.
>    
Ok. Of course :)

>> Anyway - maybe "This is based on..." is too strong, and "This uses ideas
>> from..." would be better?
>>      
>   I think so but It's up to you I think.
>    
Changed. I ended up rewriting some of my comments in that part.

>   I have looked a bit at the mmx code and it can be vastly accelerated if you
> suppose that T is lower or equal to 127 (and positive), and that you work on
> less than 2^32/255 pixels.
>    
Great. I think those assumptions are reasonable.

>   Basically what you can do is to compute P - C, N - C, and then saturates the
> result to -128..127. This simplifies the pmull part (only one instead of 2),
> and then using pcmpgtb, it allows to use psadbw to count the number of pixel.
> As a side effect, it allows to handle 8 pixels at once.
>
>   Here is what I tried:
>
>      static const mmx_t b0   = { .uq = 0x0000000000000000ULL };
>      static const mmx_t b128 = { .uq = 0x8080808080808080ULL };
>      static const mmx_t bT   = { .ub = { T, T, T, T, T, T, T, T } };
>
>      pxor_r2r( mm7, mm7 );
>      for( int x = 0; x<  w; x += 8 )
>      {
>          movq_m2r( *((int64_t*)src_c), mm0 );
>          movq_m2r( *((int64_t*)src_p), mm1 );
>          movq_m2r( *((int64_t*)src_n), mm2 );
>
>          psubb_m2r( b128, mm0 );
>          psubb_m2r( b128, mm1 );
>          psubb_m2r( b128, mm2 );
>
>          psubsb_r2r( mm0, mm1 );
>          psubsb_r2r( mm0, mm2 );
>
>          pxor_r2r( mm3, mm3 );
>          pxor_r2r( mm4, mm4 );
>          pxor_r2r( mm5, mm5 );
>          pxor_r2r( mm6, mm6 );
>
>          punpcklbw_r2r( mm1, mm3 );
>          punpcklbw_r2r( mm2, mm4 );
>          punpckhbw_r2r( mm1, mm5 );
>          punpckhbw_r2r( mm2, mm6 );
>
>          pmulhw_r2r( mm3, mm4 );
>          pmulhw_r2r( mm5, mm6 );
>
>          packsswb_r2r(mm4, mm6);
>          pcmpgtb_m2r( bT, mm6 );
>          psadbw_m2r( b0, mm6 );
>          paddd_r2r( mm6, mm7 );
>
>
>          src_c += 8;
>          src_p += 8;
>          src_n += 8;
>      }
>      uint32_t score;
>      movd_r2m( mm7, score );
>      return score / 255;
>
>   This function is about 4.6 times faster than the C version, while the
> proposed version has about the same speed than the C one (in my test).
>   It can be made a bit faster by interleaving a bit the instructions,
> and by using sse2 (but only if the data is 16 bytes aligned). I haven't
> tried a SSSE3 version. It is probably not optimal either, my MMX/SSE2
> skills are a bit rusty.
>    
Thanks! Sounds great. I'm really more of a C/C++/MATLAB guy than asm ;)

For comparison, here's what I did. This processes one row. The score is 
an int32_t:

---8<---8<---8<---

assert( w % 4 == 0 );

const uint64_t logical_one_d = INT64_C(0x0000000100000001);
const uint64_t thres = INT64_C(0x0000006400000064);

movd_m2r( i_score, mm4 );
movq_m2r( logical_one_d, mm5 );
pxor_r2r( mm6, mm6 );
movq_m2r( thres, mm7 );

int32_t *p32_c = (int32_t *)src_c;
int32_t *p32_p = (int32_t *)src_p;
int32_t *p32_n = (int32_t *)src_n;
for( int x = 0; x < w; x+=4 )
{
     pxor_r2r( mm0, mm0 );
     pxor_r2r( mm1, mm1 );
     pxor_r2r( mm2, mm2 );

     movd_m2r( *(p32_c++), mm0 );
     movd_m2r( *(p32_p++), mm1 );
     movd_m2r( *(p32_n++), mm2 );

     punpcklbw_r2r( mm6, mm0 );
     punpcklbw_r2r( mm6, mm1 );
     punpcklbw_r2r( mm6, mm2 );

     movq_r2r( mm1, mm3 );
     psubsw_r2r( mm0, mm3 ); /* mm3 = P - C  (signed) */
     movq_r2r( mm2, mm1 );
     psubsw_r2r( mm0, mm1 ); /* mm1 = N - C  (signed) */

     /* (P - C) * (N - C) */
     movq_r2r( mm1, mm0 );
     pmullw_r2r( mm3, mm1 );
     pmulhw_r2r( mm3, mm0 );

     /* Two low-order dw results */
     movq_r2r( mm1, mm2 );
     punpcklwd_r2r( mm0, mm2 );

     pcmpgtd_r2r( mm7, mm2 );
     pand_r2r( mm5, mm2 );

     movq_r2r( mm2, mm3 );
     punpckldq_r2r( mm6, mm3 );
     paddd_r2r( mm3, mm4 );     /* result = 0 or 1, add it to score. */

     movq_r2r( mm2, mm3 );
     punpckhdq_r2r( mm6, mm3 );
     paddd_r2r( mm3, mm4 );

     /* Two high-order dw results */
     movq_r2r( mm1, mm2 );
     punpckhwd_r2r( mm0, mm2 );

     pcmpgtd_r2r( mm7, mm2 );
     pand_r2r( mm5, mm2 );

     movq_r2r( mm2, mm3 );
     punpckldq_r2r( mm6, mm3 );
     paddd_r2r( mm3, mm4 );

     movq_r2r( mm2, mm3 );
     punpckhdq_r2r( mm6, mm3 );
     paddd_r2r( mm3, mm4 );
}

movd_r2m( mm4, i_score );

---8<---8<---8<---

I haven't yet benchmarked this one, but based on the 4 vs. 8 difference, 
I think you win at least by a factor of 2 ;)

I'm curious about this, so I'll try both against the C version.

One thing that still bothers me about this part is that ARM processors 
are getting more common, and MMX solves the speed issue only for x86. In 
the long run, maybe we'll need versions for other vectorization 
instruction sets, like the Merge routine currently has.

But maybe MMX is a good enough start at this point :)

>> On this I have a question... can we be sure it doesn't change even if
>> the user switches to another video source? Say, from a DVD to a file,
>> which didn't come from an NTSC source. Does the core always restart the
>> filter chain in such a case? If so, can we rely on it doing so in the
>> future, too?
>>      
>   Yes. The video filters are recreated on resolution changes. The exceptions
> are scaling filters and blending ones.
>    
Ok, great.

>> (My main future concern is that there have been posts asking about
>> seamless playback on the forums. If there are plans for that in 1.2.x
>> (as I understood), will it affect the filter chain restart?)
>>      
>   Filter chain will still be recreated.
>    
Thanks. I'll remove the resolution change detection code, then.

>> True. Will fix.
>>
>> Or maybe even actually check the case, and error out in a final else
>> block if none of the conditions match. It's not like it costs too many
>> CPU cycles ;)
>>      
>   Adding code that cannot be executed is usually a bad practice IMHO.
> It can slow down things, and will confuse someone reading the code
> later. Using asserts fix both issue.
>    
True. I thought about it after posting and I agree.

In this case, the code approach would also cause another problem: the 
function would need another return value for "cannot happen" cases, 
which is bad.

I decided to go for an assert, as you suggested.

Thanks for your comments. I've mostly done the fixes already - I'll 
finish it up and then post the new version.

Regarding output quality, I think the aggressive lock-on code may be 
needed to get reasonable results with some anime (notably at least 
Azumanga). I'll still have to update it for this version, and do some 
quality tests with it enabled now that the interlace detector is fixed.

  -J