[vlc-devel] [PATCH 4/5] stream_ReadLine: support arbitrary length limit

Alexandre Janniaux ajanni at videolabs.io
Tue Sep 8 10:56:59 CEST 2020


Hi,

What's bothering me isn't really the line size by itself, or the fact
that the caller must be able to decide what it is doing or not.

To be perfectly clear, let me take another example which is not related
to VLC, but fits both the `text format` and `line based` criteria: CSV.

Everybody on this ML probably knows it, but just to set the context,
I'll describe the format and read operations made on it.

CSV files are basically tabular data where rows are lines and columns
are delimited by a special character.

	cell11;cell12;cell13
	cell21;cell22;cell23

There are two particular ways you can use CSV for this example:

 - read line-by-line, which usually and conventionally lets you
   process `one record` at a time, for example to transform the record
   into another one or build a complete object, because it's convenient
   to split on `;`, directly count the number of items and invoke the
   deserializing methods.

 - read with a fixed buffer size and process the field separators and
   line feeds one by one until the required field is found, then
   extract it (a sketch in C follows this list). It's much more
   complex, but it lets you skip the higher memory requirements +
   allocation. Basically you write your own stateful parser instead of
   relying on a line parser first.
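
To make the second approach concrete, here is a minimal sketch in C.
This is not VLC code: plain stdio stands in for whatever stream
primitive would really be used, and CHUNK, FIELD_MAX and
csv_extract_column are names I just made up. It extracts one bounded
column per record while only ever holding a fixed-size chunk:

    /* Extract column `wanted` (0-based) of every ';'-separated record,
     * reading the input in fixed-size chunks and never holding a whole
     * line in memory; quoting rules are ignored for brevity. */
    #include <stdio.h>

    #define CHUNK     4096
    #define FIELD_MAX 256   /* bound on the single field we care about */

    static void csv_extract_column(FILE *fp, unsigned wanted,
                                   void (*emit)(const char *field))
    {
        char chunk[CHUNK], field[FIELD_MAX + 1];
        size_t used = 0;      /* bytes stored for the wanted field */
        unsigned column = 0;  /* current column of the current record */
        size_t n;

        while ((n = fread(chunk, 1, sizeof chunk, fp)) > 0)
            for (size_t i = 0; i < n; i++) {
                char c = chunk[i];
                if (c == ';')
                    column++;
                else if (c == '\n') {        /* end of record */
                    field[used] = '\0';
                    emit(field);
                    used = 0;
                    column = 0;
                } else if (column == wanted && used < FIELD_MAX)
                    field[used++] = c;       /* truncated beyond FIELD_MAX */
            }

        if (used > 0 || column > 0) {        /* record without trailing '\n' */
            field[used] = '\0';
            emit(field);
        }
    }

The point is that memory usage stays bounded by CHUNK + FIELD_MAX
whatever the record length, at the cost of a more involved, stateful
parser.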

The CSV format is particularly handy for this example because there is
actually no limit to the number of characters on a line either. You
could for example have different feature extractions on the row index
and different moments in time on the column index, or you could have a
database with a million variables for each record. This case, where the
lines are insanely long, is comparable in my opinion to the kind of
files you get when downloading the javascript from youtube.

So here, you might have different requirements: either you will run a
complex algorithm (like, to continue with the variable example, a
dimensionality reduction algorithm) which needs the whole record each
time, in which case it makes sense to read the whole line and have a
simpler parser afterwards; or you just need the value of one column
(like the name associated with the record), in which case you don't
need to store the million variables and are likely to only need a
bounded string size for the name.

My example can be counter-argued as far as ReadLine() is concerned by
saying that in any case it will always be more memory-efficient to
parse field by field, but you could have a field accounting for 100% of
the document, and in a real-world case that's more a reflection of the
current situation than a case we really want to solve, so please don't
account for that in this particular example.

In this case, you're trying to fix the code so that the first method
works, but from what I've seen you don't need that much of a record.

I don't really know the lua scripts well, but from what I see:

 - anevia_streams matches xml elements
 - anevia_xml's readline is a bit peculiar, with both random
   value reads and vlc.readline, and then matches xml elements
 - appletrailers concatenates lines to buffer a json element,
   and could be done with read too, or a smarter parser
 - bbc_co_uk matches something like a headers format?
 - break matches the inside of html elements in a hardcoded way
 - cue is a line-based format (command-based afair)
 - youtube does regex/json extraction to find the different
   parameters

Thus, in most cases, you don't actually need the lines, and in some
of the cases (like anevia_streams), reading line by line could break
the parsing because an item could be written like

   <input
      name="..">

or

   <INPUT NAME="...">

or

   < input name="" >

etc., while a more specialized parser would not fail on such an issue.
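
As an illustration, here is a minimal, hypothetical C sketch of such a
specialized scanner (none of this comes from the actual lua scripts):
it collects each markup element between '<' and '>' into a bounded
buffer, collapsing any whitespace, line feeds included, so the three
spellings above are all seen the same way:

    #include <ctype.h>
    #include <stdio.h>
    #include <strings.h>   /* strncasecmp() */

    #define ELEM_MAX 1024

    /* Collect everything between '<' and '>' into `elem`, collapsing
     * runs of whitespace (line feeds included) into a single space and
     * dropping leading whitespace, then hand the element to `emit`. */
    static void scan_elements(FILE *fp, void (*emit)(const char *elem))
    {
        char elem[ELEM_MAX + 1];
        size_t used = 0;
        int in_elem = 0, c;

        while ((c = fgetc(fp)) != EOF) {
            if (!in_elem) {
                if (c == '<') { in_elem = 1; used = 0; }
            } else if (c == '>') {
                elem[used] = '\0';
                emit(elem);
                in_elem = 0;
            } else if (used < ELEM_MAX) {
                if (isspace(c))
                    c = ' ';
                if (c != ' ' || (used > 0 && elem[used - 1] != ' '))
                    elem[used++] = (char)c;
            }
        }
    }

    /* Example consumer: print the <input ...> elements, however they
     * were capitalized or split across lines. */
    static void print_inputs(const char *elem)
    {
        if (strncasecmp(elem, "input", 5) == 0)
            puts(elem);
    }

Called as scan_elements(f, print_inputs), it reports all three variants
above, whereas a naive per-line pattern match would miss at least the
first one, which spans two lines.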

In your case, in comparison with CSV, you have a text-based format
which is not line-based, and I think that Remi's point is very
legitimate here when he says that it's not supposed to use readline;
in addition, I also think he's right about the efficiency of the
parsing if you just need a small portion of the final document (which,
in comparison with CSV, would be the case where you need to extract a
fixed-size column from the whole record).

Thus, we should probably design better parsing primitives for the lua
modules, like Thomas suggested. To do that, we need to define the
different needs of the parsers, including for example (a rough sketch
of the last one follows the list):

 - XML element-based parsing
 - JSON properties in a document
 - regex in which you can stream content and trigger when it matches
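
To make the last item concrete, here is a deliberately simplified,
hypothetical C sketch (all names are invented for this mail): instead
of a full regex engine it only matches a literal needle, but it shows
the property that matters for such a primitive, namely that only a
needle-sized carry is kept across chunks, never the whole document:

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        const char *needle;   /* literal pattern; a real primitive would
                                 take a regular expression instead */
        size_t nlen;          /* must stay smaller than sizeof carry */
        char carry[64];       /* last nlen-1 bytes of previous chunks */
        size_t clen;
        void (*on_match)(void *opaque);
        void *opaque;
    } stream_matcher_t;

    /* Feed one chunk of the stream; matches straddling a chunk boundary
     * are caught thanks to the carry buffer. Returns 0, or -1 on OOM. */
    static int matcher_push(stream_matcher_t *m, const char *buf, size_t n)
    {
        size_t total = m->clen + n;
        char *tmp;

        if (total == 0)
            return 0;
        tmp = malloc(total);
        if (tmp == NULL)
            return -1;
        memcpy(tmp, m->carry, m->clen);
        memcpy(tmp + m->clen, buf, n);

        for (size_t i = 0; total >= m->nlen && i + m->nlen <= total; i++)
            if (memcmp(tmp + i, m->needle, m->nlen) == 0)
                m->on_match(m->opaque);

        /* keep only the bytes that could still start a match next time */
        m->clen = (total < m->nlen) ? total : m->nlen - 1;
        memcpy(m->carry, tmp + total - m->clen, m->clen);
        free(tmp);
        return 0;
    }

A real primitive would of course want actual regular expressions (or
the XML/JSON cases above), but the bounded-state property is the whole
point compared with slurping the file or an unbounded ReadLine().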

Maybe we can refocus a discussion thread on that if you agree with
me (us?) to get back to a productive and constructive result?

I can provide some work for the implementation and some time for
the discussion if you'd like.

Regards,
--
Alexandre Janniaux
Videolabs


On Mon, Sep 07, 2020 at 01:45:56PM +0200, Pierre Ynard via vlc-devel wrote:
> > No way. The caller is hardly in better position to know a sane limit
> > than the callee. This just ends up replacing an arbitrary limit with
> > dozens of different inconsistent arbitrary limits.
>
> I'm going to assume that you're referring again to what you mentioned
> last time:
>
> > Unfortunately, the practical limit depends on a lot of factors, most
> > notably the final "consumer" rather than the caller, e.g. the GUI
> > showing a property of an input item extracted from a playlist file.
>
> But that doesn't work. This is backwards and broken to put that
> responsibility on an upstream middle piece of the data supply that
> doesn't know the constraints of the consumer, and doesn't even
> understand what's flowing through it. This sounds a lot like an
> anti-pattern. The consumer should take responsibility for validating or
> formatting its input before using or presenting it.
>
> Then you don't actually name what could go wrong, so this sounds a bit
> like FUD. In fact this idea of practical limit is already broken, or
> is a non-issue (probably both). The current arbitrary limit isn't any
> consistent one on the consumed data or metadata, it's only a vague
> blanket upper bound that completely fails at consistently limiting
> fields as diverse as a genre or year tag (a few characters), a URL (some
> say up to 2 kB), or a description (potentially much longer).
>
> Also with the current blanket 200 kB limit, it can already pass
> 200-kB-long title strings or URLs to the GUI, video output or playlist,
> so it is already either broken or a non-issue; the security implications
> make me assume that this part is already properly handled by consumers,
> and thus a non-issue.
>
> Then stream_ReadLine() can't possibly be a proper checkpoint to control
> this as there are plenty of other channels from which consumers get this
> data: binary file formats, with specifications controlling these field
> lengths; users of the XML parser module, itself calling upon the
> external, general-purpose libxml2 to parse fields; lua playlist scripts,
> some of which (out of tree) even start by slurping the whole file before
> beginning to parse it, and have no limit on what metadata they get and
> return from that; and also out-of-tree modules calling core APIs in an
> uncontrolled way.
>
> Your arguments are even self-defeating here: you say the hardcoded
> limit in stream_ReadLine() is crucial for consistently keeping things
> practical, yet you're pushing for lua scripts NOT using it and doing
> whatever the hell they want instead!
>
> So this really makes no sense to me at all. This point of view you take
> makes no sense to me, but even considering things from there, your
> reasoning is invalid, and the caller is in a much better position to
> wield limits, as contrary to the callee, it actually understands what
> data it's dealing with, and performs the field extraction, so if it was
> its responsibility it would be able to enforce limits on the actual
> fields, and in an actually adapted, proportionate and less arbitrary
> manner.
>
> It makes even less sense to me when instead of looking towards the
> consumer, you look towards the source of data. There it is very obvious
> that the caller, knowing what kind of input data and format to expect,
> is in a much better position to know not just sane, but appropriate and
> practical limits.
>
> You argued yourself that Kate limits lines at 4096 characters; I don't
> know if you mean that it's an implementation-specific limit or part of the
> specs, but indeed as I agreed above, there can be such specification
> or formats constraints, in which case the caller implementing that
> particular format or use case knows *exactly* what the limit has to be
> - and then this known specified limit isn't arbitrary at all for the
> caller.
>
> The core stream API is general-purpose and many data formats flow
> through it: the various kinds of data of course aren't consistent,
> it isn't supposed to be consistent, and so the current "consistent"
> one-size-fits-all is inopportune. On the contrary, the point of the
> patch is to make it configurable, which is indeed the very opposite of
> consistent, in order to adapt it to the knowledge of the format, and
> thus make it less arbitrary.
>
> --
> Pierre Ynard
> "Une âme dans un corps, c'est comme un dessin sur une feuille de papier."
