[vlc-devel] [PATCH] Added IVTC deinterlacer (NTSC film mode)
Laurent Aimar
fenrir at elivagar.org
Thu Jan 6 22:56:23 CET 2011
Hi,
On Wed, Jan 05, 2011 at 11:04:40PM +0200, Juha Jeronen wrote:
> On which note, what do you think, should I make the lengthy explanation
> of the filter doxygen compatible
Would be great.
>, or should it be moved into a separate
> file?
I don't think it is really needed. One day, I think it might be better to
split the deinterlace filter into multiples files.
>>> +static int CalculateInterlaceScore(filter_t *p_filter, picture_t* p_pic)
>>> +{
>>> + assert( p_filter != NULL );
>>> + assert( p_pic != NULL );
>>> +
>>> + /* This is based on the comb detector used in the IVTC filter of
>>> Transcode 1.1.5. */
>>>
>> Is it your own reimplementation of the algorithm?
>>
> The comb detector? Yes, the implementation is my own, I just used the
> existing comb metric. The important part is the (P-C)*(N-C) and the
> threshold of 100.
>
> There are some implementation differences: Transcode never explicitly
> constructs the temporary frames, but performs the comb detection
> directly on the original data. I implemented it this way to keep the
> function reusable, and because I thought it was conceptually cleaner
> (although it does add some runtime cost).
>
> For frame construction, Transcode pairs TCBP, TCBC and TCBN (using a
> unique top field from the input stream for each output frame - this is
> ingenious), while my code pairs TNBN, TNBC and TCBN. My code also keeps
> track of the scores of all seven possible combinations in the stencil,
> because it's required for the cadence detector.
>
> As far as the code is concerned, I wrote the filter from scratch. Parts
> of VLC's own deinterlace.c were useful as a template (RenderBob,
> RenderX, RenderYadif, and some MMX code), but that's it.
Great, just wanted to be sure.
> Anyway - maybe "This is based on..." is too strong, and "This uses ideas
> from..." would be better?
I think so but It's up to you I think.
>>> + for( i_plane = 0 ; i_plane< p_pic->i_planes ; i_plane++ )
>>> + /* Low-order part: */
>>> + movq_r2r( mm2, mm0 );
>>> + punpckldq_r2r( mm7, mm0 );
>>> + int32_t result3;
>>> + movq_r2m( mm0, result3 );
>>> + /* High-order part: */
>>> + movq_r2r( mm2, mm0 );
>>> + punpckhdq_r2r( mm7, mm0 );
>>> + int32_t result4;
>>> + movq_r2m( mm0, result4 );
>>>
>> movq (for the 4 assignments) cannot do what you want, making the whole
>> function always returning 0 (at least for me)... If you also end up with
>> a non valid value, I fear that some of the (quality) tests you did are not
>> meaningful.
>>
> It worked for me, so I didn't notice this until now. Thanks.
>
> What I wanted to do there was to move the doubleword result into memory.
> So indeed, the instruction movq is wrong - it should be movd instead.
>
> When fixing this, I found one other serious bug in that function -
> punpcklbw instead of punpcklwd (and same for hbw/hwd) when unpacking the
> multiplication results. Oops. I wonder how that worked at all. I guess I
> got lucky.
>
> So, you're right, I need to redo my output quality tests. I'll also need
> to re-tune the residual interlace threshold :)
>
> While at it, I converted the whole inner loop into MMX, as I had
> originally planned. I had misread the MMX reference - didn't notice at
> first that MMX actually does provide a signed packed comparison (I
> usually work in high-level languages... excuses, excuses ;) ). With
> that, this is easy.
>
> The score is kept in a register. We need five fewer memory accesses per
> iteration of the inner loop, since i_score and result1-4 no longer need
> to be accessed. We only need three 32-bit fetches, and three pointer
> updates, which cannot be avoided.
I have looked a bit at the mmx code and it can be vastly accelerated if you
suppose that T is lower or equal to 127 (and positive), and that you work on
less than 2^32/255 pixels.
Basically what you can do is to compute P - C, N - C, and then saturates the
result to -128..127. This simplifies the pmull part (only one instead of 2),
and then using pcmpgtb, it allows to use psadbw to count the number of pixel.
As a side effect, it allows to handle 8 pixels at once.
Here is what I tried:
static const mmx_t b0 = { .uq = 0x0000000000000000ULL };
static const mmx_t b128 = { .uq = 0x8080808080808080ULL };
static const mmx_t bT = { .ub = { T, T, T, T, T, T, T, T } };
pxor_r2r( mm7, mm7 );
for( int x = 0; x < w; x += 8 )
{
movq_m2r( *((int64_t*)src_c), mm0 );
movq_m2r( *((int64_t*)src_p), mm1 );
movq_m2r( *((int64_t*)src_n), mm2 );
psubb_m2r( b128, mm0 );
psubb_m2r( b128, mm1 );
psubb_m2r( b128, mm2 );
psubsb_r2r( mm0, mm1 );
psubsb_r2r( mm0, mm2 );
pxor_r2r( mm3, mm3 );
pxor_r2r( mm4, mm4 );
pxor_r2r( mm5, mm5 );
pxor_r2r( mm6, mm6 );
punpcklbw_r2r( mm1, mm3 );
punpcklbw_r2r( mm2, mm4 );
punpckhbw_r2r( mm1, mm5 );
punpckhbw_r2r( mm2, mm6 );
pmulhw_r2r( mm3, mm4 );
pmulhw_r2r( mm5, mm6 );
packsswb_r2r(mm4, mm6);
pcmpgtb_m2r( bT, mm6 );
psadbw_m2r( b0, mm6 );
paddd_r2r( mm6, mm7 );
src_c += 8;
src_p += 8;
src_n += 8;
}
uint32_t score;
movd_r2m( mm7, score );
return score / 255;
This function is about 4.6 times faster than the C version, while the
proposed version has about the same speed than the C one (in my test).
It can be made a bit faster by interleaving a bit the instructions,
and by using sse2 (but only if the data is 16 bytes aligned). I haven't
tried a SSSE3 version. It is probably not optimal either, my MMX/SSE2
skills are a bit rusty.
> On this I have a question... can we be sure it doesn't change even if
> the user switches to another video source? Say, from a DVD to a file,
> which didn't come from an NTSC source. Does the core always restart the
> filter chain in such a case? If so, can we rely on it doing so in the
> future, too?
Yes. The video filters are recreated on resolution changes. The exceptions
are scaling filters and blending ones.
> (My main future concern is that there have been posts asking about
> seamless playback on the forums. If there are plans for that in 1.2.x
> (as I understood), will it affect the filter chain restart?)
Filter chain will still be recreated.
> True. Will fix.
>
> Or maybe even actually check the case, and error out in a final else
> block if none of the conditions match. It's not like it costs too many
> CPU cycles ;)
Adding code that cannot be executed is usually a bad practice IMHO.
It can slow down things, and will confuse someone reading the code
later. Using asserts fix both issue.
--
fenrir
More information about the vlc-devel
mailing list