[vlc-devel] [PATCH] Added IVTC deinterlacer (NTSC film mode)

Thu Jan 6 22:56:23 CET 2011

Hi,

On Wed, Jan 05, 2011 at 11:04:40PM +0200, Juha Jeronen wrote:
> On which note, what do you think, should I make the lengthy explanation  
> of the filter doxygen compatible
 Would be great.

>, or should it be moved into a separate  
> file?
 I don't think it is really needed. One day, I think it might be better to
split the deinterlace filter into multiples files.

>>> +static int CalculateInterlaceScore(filter_t *p_filter, picture_t* p_pic)
>>> +{
>>> +    assert( p_filter != NULL );
>>> +    assert( p_pic != NULL );
>>> +
>>> +    /* This is based on the comb detector used in the IVTC filter of
>>> Transcode 1.1.5. */
>>>      
>>   Is it your own reimplementation of the algorithm?
>>    
> The comb detector? Yes, the implementation is my own, I just used the  
> existing comb metric. The important part is the (P-C)*(N-C) and the  
> threshold of 100.
>
> There are some implementation differences: Transcode never explicitly  
> constructs the temporary frames, but performs the comb detection  
> directly on the original data. I implemented it this way to keep the  
> function reusable, and because I thought it was conceptually cleaner  
> (although it does add some runtime cost).
>
> For frame construction, Transcode pairs TCBP, TCBC and TCBN (using a  
> unique top field from the input stream for each output frame - this is  
> ingenious), while my code pairs TNBN, TNBC and TCBN. My code also keeps  
> track of the scores of all seven possible combinations in the stencil,  
> because it's required for the cadence detector.
>
> As far as the code is concerned, I wrote the filter from scratch. Parts  
> of VLC's own deinterlace.c were useful as a template (RenderBob,  
> RenderX, RenderYadif, and some MMX code), but that's it.
 Great, just wanted to be sure.

> Anyway - maybe "This is based on..." is too strong, and "This uses ideas  
> from..." would be better?
 I think so but It's up to you I think.

>>> +    for( i_plane = 0 ; i_plane<  p_pic->i_planes ; i_plane++ )
>>> +                    /* Low-order part: */
>>> +                    movq_r2r( mm2, mm0 );
>>> +                    punpckldq_r2r( mm7, mm0 );
>>> +                    int32_t result3;
>>> +                    movq_r2m( mm0, result3 );
>>> +                    /* High-order part: */
>>> +                    movq_r2r( mm2, mm0 );
>>> +                    punpckhdq_r2r( mm7, mm0 );
>>> +                    int32_t result4;
>>> +                    movq_r2m( mm0, result4 );
>>>      
>>   movq (for the 4 assignments) cannot do what you want, making the whole
>> function always returning 0 (at least for me)... If you also end up with
>> a non valid value, I fear that some of the (quality) tests you did are not
>> meaningful.
>>    
> It worked for me, so I didn't notice this until now. Thanks.
>
> What I wanted to do there was to move the doubleword result into memory.  
> So indeed, the instruction movq is wrong - it should be movd instead.
>
> When fixing this, I found one other serious bug in that function -  
> punpcklbw instead of punpcklwd (and same for hbw/hwd) when unpacking the  
> multiplication results. Oops. I wonder how that worked at all. I guess I  
> got lucky.
>
> So, you're right, I need to redo my output quality tests. I'll also need  
> to re-tune the residual interlace threshold :)
>
> While at it, I converted the whole inner loop into MMX, as I had  
> originally planned. I had misread the MMX reference - didn't notice at  
> first that MMX actually does provide a signed packed comparison (I  
> usually work in high-level languages... excuses, excuses ;) ). With  
> that, this is easy.
>
> The score is kept in a register. We need five fewer memory accesses per  
> iteration of the inner loop, since i_score and result1-4 no longer need  
> to be accessed. We only need three 32-bit fetches, and three pointer  
> updates, which cannot be avoided.

 I have looked a bit at the mmx code and it can be vastly accelerated if you
suppose that T is lower or equal to 127 (and positive), and that you work on
less than 2^32/255 pixels.

 Basically what you can do is to compute P - C, N - C, and then saturates the
result to -128..127. This simplifies the pmull part (only one instead of 2),
and then using pcmpgtb, it allows to use psadbw to count the number of pixel.
As a side effect, it allows to handle 8 pixels at once.

 Here is what I tried:

    static const mmx_t b0   = { .uq = 0x0000000000000000ULL };
    static const mmx_t b128 = { .uq = 0x8080808080808080ULL };
    static const mmx_t bT   = { .ub = { T, T, T, T, T, T, T, T } };

    pxor_r2r( mm7, mm7 );
    for( int x = 0; x < w; x += 8 )
    {
        movq_m2r( *((int64_t*)src_c), mm0 );
        movq_m2r( *((int64_t*)src_p), mm1 );
        movq_m2r( *((int64_t*)src_n), mm2 );

        psubb_m2r( b128, mm0 );
        psubb_m2r( b128, mm1 );
        psubb_m2r( b128, mm2 );

        psubsb_r2r( mm0, mm1 );
        psubsb_r2r( mm0, mm2 );

        pxor_r2r( mm3, mm3 );
        pxor_r2r( mm4, mm4 );
        pxor_r2r( mm5, mm5 );
        pxor_r2r( mm6, mm6 );

        punpcklbw_r2r( mm1, mm3 );
        punpcklbw_r2r( mm2, mm4 );
        punpckhbw_r2r( mm1, mm5 );
        punpckhbw_r2r( mm2, mm6 );

        pmulhw_r2r( mm3, mm4 );
        pmulhw_r2r( mm5, mm6 );

        packsswb_r2r(mm4, mm6);
        pcmpgtb_m2r( bT, mm6 );
        psadbw_m2r( b0, mm6 );
        paddd_r2r( mm6, mm7 );

        src_c += 8;
        src_p += 8;
        src_n += 8;
    }
    uint32_t score;
    movd_r2m( mm7, score );
    return score / 255;

 This function is about 4.6 times faster than the C version, while the
proposed version has about the same speed than the C one (in my test).
 It can be made a bit faster by interleaving a bit the instructions,
and by using sse2 (but only if the data is 16 bytes aligned). I haven't
tried a SSSE3 version. It is probably not optimal either, my MMX/SSE2
skills are a bit rusty.

> On this I have a question... can we be sure it doesn't change even if  
> the user switches to another video source? Say, from a DVD to a file,  
> which didn't come from an NTSC source. Does the core always restart the  
> filter chain in such a case? If so, can we rely on it doing so in the  
> future, too?
 Yes. The video filters are recreated on resolution changes. The exceptions
are scaling filters and blending ones.

> (My main future concern is that there have been posts asking about  
> seamless playback on the forums. If there are plans for that in 1.2.x  
> (as I understood), will it affect the filter chain restart?)
 Filter chain will still be recreated.

> True. Will fix.
>
> Or maybe even actually check the case, and error out in a final else  
> block if none of the conditions match. It's not like it costs too many  
> CPU cycles ;)
 Adding code that cannot be executed is usually a bad practice IMHO.
It can slow down things, and will confuse someone reading the code
later. Using asserts fix both issue.

-- 
fenrir