[vlc-devel] Updated patch for ogg/theora seeking

salsaman salsaman at gmail.com
Wed Jun 16 12:01:33 CEST 2010

On Wed, Jun 16, 2010 at 3:03 AM, Gregory Maxwell <gmaxwell at gmail.com> wrote:

> On Wed, Jun 9, 2010 at 7:04 PM, salsaman <salsaman at gmail.com> wrote:
> > While I am waiting for a reply, I just thought I would explain the search
> > algorithm used, so you verify it is very efficient.
> >
> > We create an index on the fly, this index is basically:
> > offset in file -> maximum frame number
> >
> > When asked to seek to a particular frame, we first check the index to see
> if
> > we have approximate boundaries for the seek, otherwise we will seek in
> the
> > whole file, from data_start until the end.
> >
> > The area to be searched is divided into two halves. We first check the
> upper
> > half, and get the highest and lowest granulepos. (The granulepos is
> > basically keyframe * keyframe_offset + frame offset). If our target frame
> > lies in this region we subdivide it into two halves, otherwise we check
> the
> > lower half from earlier. We stop when we have found the keyframe for our
> > target frame, or the search region is < minimum_page_size.
> This sounds like a plain bisection search. You can get faster results
> using an interpolated bisection.
> If the current range is 0........10,000 seconds. And 0....100Mbytes
> and you want 2500 seconds, you seek to ~25mbytes (minus the average or
> maximum page size) because  2500 seconds is 25% of 10000 and 25 is 25%
> of 100.
> This is guaranteed to converge just like a bisection search.  You can
> optimize this further, becoming closer to bisection as the ranges get
> smaller and the variability becomes greater, and biasing more downward
> because backward motion is more expensive that forward motion... but
> simply using an interpolation in the outer most step can cut several
> seeks out of a large file. If the bitrate is very constant (like
> audio) it can get your average down to something like three no matter
> what the file size is.  (libvorbisfile does this)

OK, I will test this, although I think it will improve only the very first
search. The way it works currently is that we first seed the index with
first and last granulepos (we set a fake keyframe at time==0 and we discover
the last keyframe when calculating the file duration). After this we search
using the index to get approximate bounds, then using bi-sections. Now
imagine if we were seeking to 25%. It would save one iteration on the first
search, but we would not discover the information at 50%. So assuming we
were seeking randomly, we would have to check 50% later on anyway. But I can
see the point - that as we get closer to the predicted point, the
probability thet we need that information later decreases.

Anyway I will look into your suggestion. I am not sure how to implement it
though, since the current algorithm requires two equally sized areas (upper
is checked first - we get first granulepos, if the lower keyframe is too
large, we check the lower section; if the lower keyframe is OK we check the
upper keyframe and then subdivide and repeat the process). The idea being,
we check upper first, since lower is more likely to have been played
already, and thus have its data in the index.

> Regarding keyframes.  Your text above is concerning me.  You do not
> know the granpos of the keyframe covering a particular frame until you
> have visited that frame.    The granshift implies a _maximum_ keyframe
> interval, but the keyframes can be placed sooner than that as the
> encoder desires, and they often are.    It wasn't quite clear to me
> that your approach was dealing with this.  I should probably go read
> the code.
You are correct. In fact the code makes no assumptions, it only maps frame
-> keyframe as it discovers this. Then later if we discover a higher frame
that maps to the same keyframe it updates that entry.

In fact, the above is not strictly true, what is mapped is:

granulepos of highest_frame -> file offset in bytes

where the file offset is a suitable point to decode the keyframe for
highest_frame, since we can get both frame and keyframe from the granulepos.
This is ordered by granulepos in a doubly-linked list. When we are asked to
find a new frame, we first get approximate bounds from the index, then
bisect the remaining search domain.

During playback, we also update any existing entries if we find a higher
frame that maps to a keyframe already in the index.

> Normally, In order to do a fully synchronized seek you must first seek
> to the desired frame, extract the granpos of its keyframe, seek back
> to that point, then decode forward.  If you cache your seeking
> requests along the way and use them to hot-start the second search
> then you should be able to hit the key frame very quickly in one seek
> or so.

> I'd recommend adding both keyframes and non-keyframes to your index
> (or keep two indexes in memory, if you're minimizing memory usage and
> don't want non-keyframes polluting a finite index) since both can help
> constrain the bounds of your future searches, not just ones belonging
> to a common keyframe offset. Every seek provides information which
> might be useful in the future.
That is precisely what the program does. It stores frame and keyframe in the
i_granulpos field, and byte offset for a sync point in the i_pagepos field.

> The indexing scheme in ogg index isn't precisely compatible with your
> approach, so you won't just be able to load its values into your
> index. This is because instead of mapping time->bytes it maps time to
> a byte location which is equal to or less than the byte location of
> the requested time,  such that the expressed byte-location is
> sufficient to perfectly reconstruct the requested time.    It is done
> this way so that it will work uniformly across formats with
> drastically different reference structures (dirac, theora, vorbis)
> without requiring too more per-codec smarts in the applications.
> But this also means that 1:00-> byte 1, 2:00->byte 2, 3:00->byte 3 is
> a perfectly valid, if useless, index. (since all those byte locations
> are sufficient starting points for finding those times.  In the common
> (theora keyframe) case the byte location will exactly match the page
> which starts the keyframe that you want, but the promise is that it
> will only be equal to or less than.
> So instead of copying the ogg skeleton index into your indexes, you
> should use it to hot start (choose the position of your first seek)
> your existing algorithm.  When everything works well the combination
> will give you the absolute minimum number of seeks possible, and when
> the index is missing or corrupt you would still always produce correct
> results.
Yes, that should work I think.

> (the skeleton index also isn't guaranteed to store all (or even most)
> possible seek points, otherwise it would grow to enormous sizes, so
> you must still have regular seeking logic to get acceptable
> granularity)
Actually the memory required is very small. For my index, we have:

granulepos -> page_offset per keyframe. Since both are int64_t that 16 bytes
per keyframe.

Assuming an average of 8 frames per keyframe that works out at an additional
2 bytes per frame.

Actually if you are generating an index for skeleton, then I would suggest
using my code to do so.

As a side note I am now looking into seeking in dirac. The method is
somewhat similar except that instead of keyframes we have sync frames. These
are frames which follow immediately from a sequence header start. Since the
granulepos does not provide this information the algorithm is to first seek
for a packet containing a sequence header (byte 4 of the packet == 0x00).
Then we rewind the stream to this page and begin decoding until we get a
frame number. I added a third field to the index for dirac, i_pagepos end.
So we have:

i_pagepos -> sync_frame ->i_pagepos_end

each region searched is added to the index. If we search a region and no
sync frame is found then sync_frame is set to -1, so we do not waste time by
searching these regions again. i_pagepos is the byte offset of the start of
the page containing the start of the sequence header packet, i_pagepos_end
is aligned to the end of the page where we stopped searching. The idea is
then, given a target frame, we find a lower bound at the sync_frame <=
target_frame. For dirac there is no upper bound (i.e. a frame can be decoded
any distance after its sync frame). In practice this distance is generally
no more than 3 or 4 frames. Given a region to search we find the first and
last sync frames, skipping over any regions which have been negatively
searched (sync_frame==-1) and using any positive information
(sync_frame>-1). If we get two contiguous regions with -1 sync_frame then
these are joined together.

I am also interested to know what you think about my suggestion of adding


> Cheers!
> _______________________________________________
> vlc-devel mailing list
> To unsubscribe or modify your subscription options:
> http://mailman.videolan.org/listinfo/vlc-devel
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.videolan.org/pipermail/vlc-devel/attachments/20100616/37ad8f50/attachment.html>

More information about the vlc-devel mailing list