[vlc-devel] [PATCH 2/2] taglib: detect charset when ID3v2 Latin-1 parser finds invalid character

Souju TANAKA sojulibra at gmail.com
Sat Oct 24 11:17:33 CEST 2020


On 2020/10/23 23:59, Rémi Denis-Courmont wrote:
> On Friday, 23 October 2020, 13:46:56 EEST, sojulibra at gmail.com wrote:
>> From: Souju TANAKA <sojulibra at gmail.com>
>>
>> Changed TagLib Latin-1 parser to check whether an ISO 8859-1 encoded ID3v2
>> tag is a valid byte sequence. If an invalid Latin-1 character is found,
> 
> Well, considering that any octet sequence is valid ISO 8859-1, that sentence
> makes no sense.
> 
>> try to
>> detect charset and convert the tag into UTF-8 to avoid Mojibake.
>>
>> Some encoders embed ID3v2 in an unexpected charset, though it is against the
>> spec. TagLib allows overriding TagLib::ID3v2::Latin1StringHandler::parse()
>> to deal with this practical situation.
>> ---
>>   include/vlc_charset.h           | 20 +++++++++++
>>   modules/meta_engine/Makefile.am |  2 +-
>>   modules/meta_engine/taglib.cpp  | 63 +++++++++++++++++++++++++++++++++
>>   3 files changed, 84 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/vlc_charset.h b/include/vlc_charset.h
>> index 0ec1734dc9..311856913e 100644
>> --- a/include/vlc_charset.h
>> +++ b/include/vlc_charset.h
>> @@ -93,6 +93,26 @@ VLC_USED static inline const char *IsASCII(const char *str)
>>       return str;
>>   }
>>
>> +/**
>> + * Checks ISO/IEC 8859-1 validity.
>> + *
>> + * Checks whether a null-terminated string is a valid ISO/IEC 8859-1 bytes sequence
>> + *
>> + * \param str string to check
>> + *
>> + * \retval str the string is a valid null-terminated ISO/IEC 8859-1 sequence
>> + * \retval NULL the string is not an ISO/IEC 8859-1 sequence
>> + */
>> +VLC_USED static inline const char *IsLatin1(const char *str)
>> +{
>> +    unsigned char c;
>> +
>> +    for (const char *p = str; (c = *p) != '\0'; p++)
>> +        if (unlikely(c < 0x20 || (c > 0x7e && c < 0xa0)))
>> +            return NULL;
>> +    return str;
> 
> Oh come on. Eliminating C1 codes is questionable, but you obviously can't just
> blindly reject all C0 codes.
> 

ISO/IEC 8859-1 (ISO 8859-1) doesn't assign the C0 and C1 code ranges. Its
superset, ISO-8859-1 (note the extra hyphen after "ISO"), does. I think the
ID3v2 specification refers to the former, since ISO/IEC 8859-1 is cited in its
references section.
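
For illustration only, here is a rough sketch of a check that sits between the
two positions in this thread: it still rejects the C1 range and most C0 codes,
but lets through the whitespace controls that commonly appear in tag text. The
function name IsPrintableLatin1 and the choice of TAB, LF and CR as the allowed
set are assumptions made for this example; neither is part of the submitted
patch or of vlc_charset.h.

```c
#include <stddef.h>

/* Hypothetical variant of the IsLatin1() check from the patch (assumed name).
 * Accepts TAB, LF and CR; rejects every other C0 code, DEL (0x7f) and the
 * C1 range (0x80..0x9f). Returns str if accepted, NULL otherwise. */
static inline const char *IsPrintableLatin1(const char *str)
{
    for (const unsigned char *p = (const unsigned char *)str; *p != '\0'; p++)
    {
        unsigned char c = *p;

        /* Assumption: these three controls are treated as harmless. */
        if (c == '\t' || c == '\n' || c == '\r')
            continue;
        if (c < 0x20 || (c >= 0x7f && c < 0xa0))
            return NULL;
    }
    return str;
}
```

With such a check, a multi-line comment field would still be treated as
Latin-1, while a tag containing bytes in 0x80..0x9f (common in Shift_JIS or
Windows-1252 text) would fall through to charset detection.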

--
Souju TANAKA

