[vlc-devel] [PATCH 2/2] taglib: detect charset when ID3v2 Latin-1 parser finds invalid character

Sat Oct 24 11:41:39 CEST 2020

On Saturday, 24 October 2020, 12:17:33 EEST, Souju TANAKA wrote:
> > Oh come on. Eliminating C1 codes is questionable, but you obviously can't
> > just blindly reject all C0 codes.
> 
> ISO/IEC 8859-1 (ISO 8859-1) doesn't assign C0 and C1. Its superset,
> ISO-8859-1 (note the extra hyphen after "ISO") does. I think the ID3v2
> specification refers to the former, since ISO/IEC 8859-1 is cited in its
> references section.

Fair enough. The ISO specification does not have C0 and C1 codes, but the IANA 
assignment for the "ISO-8859-1" character set encoding name does have them.

The ID3 specification actually allows C1 but not C0 codes (it restricts 
strings to the range $20 - $FF), which contradicts the referenced ISO 
specification and makes little sense. C1 codes are hardly ever used in real 
life, but some of the C0 ones obviously are. I can only guess that whoever 
wrote the ID3v2 spec didn't understand what they were talking about.

Now in practice, this is all moot. Misencoded non-Unicode ID3 tags are written 
in the Windows code page of the encoder system. As such:
- The C0 range will have the same meaning as in (IANA) ISO-8859-1, and will 
not work as a discriminant for most code pages, notably all of the 
single-byte ones.
- The C1 range will rarely, if ever, be used, again notably by the 
single-byte code pages.
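
To illustrate the first point, here is a minimal sketch. It assumes that 
Python's built-in codecs are a fair stand-in for the Windows code pages, and 
the list of code pages is an arbitrary sample, not something taken from the 
patch:

```python
# Arbitrary sample of single-byte Windows code pages (an assumption,
# chosen only for illustration).
CODE_PAGES = ["cp1250", "cp1251", "cp1252", "cp1253", "cp1254"]


def c0_matches_latin1(codec: str) -> bool:
    """Return True if every C0 byte (0x00-0x1F) decodes as in ISO-8859-1."""
    c0 = bytes(range(0x20))
    return c0.decode(codec) == c0.decode("latin-1")


for cp in CODE_PAGES:
    print(cp, c0_matches_latin1(cp))
```

If the C0 bytes decode identically everywhere, their presence says nothing 
about which code page the tag was written in.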

As far as I can see, the most common user complaints relate to Cyrillic 
(Windows-1251), and it looks like that case will hardly be handled by this 
patch. Maybe Shift-JIS works well, but I don't think that's the most common 
issue.
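
A rough sketch of why that is, assuming a detector that only looks for bytes 
in the C1 range (0x80-0x9F). The function name and sample strings are 
hypothetical, and this is not the patch's actual code:

```python
def has_c1_bytes(data: bytes) -> bool:
    """Return True if any byte falls in the C1 range 0x80-0x9F."""
    return any(0x80 <= b <= 0x9F for b in data)


# Hypothetical tag contents as they would be stored by a misbehaving encoder.
russian = "Привет".encode("cp1251")         # Cyrillic letters sit in 0xC0-0xFF
japanese = "こんにちは".encode("shift_jis")  # lead bytes such as 0x82 are in C1

print(has_c1_bytes(russian))
print(has_c1_bytes(japanese))
```

Ordinary Windows-1251 text never trips such a check, whereas Shift-JIS lead 
bytes usually do.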

-- 
Rémi Denis-Courmont
http://www.remlab.net/