[vlc-devel] [PATCH 2/2] taglib: detect charset when ID3v2 Latin-1 parser finds invalid character
Rémi Denis-Courmont
remi at remlab.net
Fri Oct 23 16:59:46 CEST 2020
On Friday, 23 October 2020, 13:46:56 EEST, sojulibra at gmail.com wrote:
> From: Souju TANAKA <sojulibra at gmail.com>
>
> Changed the TagLib Latin-1 parser to check whether an ISO 8859-1 encoded ID3v2
> tag is a valid byte sequence. If an invalid Latin-1 character is found,
Well, considering that any octet sequence is valid ISO 8859-1, that sentence
makes no sense.
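
To make that premise concrete, here is a minimal sketch, assuming the usual reading of ISO 8859-1 in which byte value N corresponds to code point U+00NN (the first 256 Unicode code points). The helper name latin1_to_utf8 is hypothetical and is not part of VLC or TagLib. Note that the conversion has no failure path: there is no input byte it can reject.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not VLC or TagLib API): converts `len` bytes of
 * ISO 8859-1 to UTF-8. `dst` must have room for 2 * len + 1 bytes.
 * Returns the number of bytes written, excluding the terminating NUL. */
static size_t latin1_to_utf8(const unsigned char *src, size_t len, char *dst)
{
    assert(src != NULL && dst != NULL);

    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];

        if (c < 0x80) {
            /* U+0000..U+007F: one UTF-8 byte, identical to the input. */
            dst[n++] = (char)c;
        } else {
            /* U+0080..U+00FF: always exactly two UTF-8 bytes. */
            dst[n++] = (char)(0xC0 | (c >> 6));
            dst[n++] = (char)(0x80 | (c & 0x3F));
        }
    }
    dst[n] = '\0';
    return n;
}
```

Since every octet converts, any "validity" test on Latin-1 input is really a plausibility heuristic about the text, not a check of the encoding.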
> try to
> detect the charset and convert the tag into UTF-8 to avoid mojibake.
>
> Some encoders embed ID3v2 tags in an unexpected charset, though this is against
> the spec. TagLib allows overriding TagLib::ID3v2::Latin1StringHandler::parse()
> to deal with this practical situation.
> ---
> include/vlc_charset.h | 20 +++++++++++
> modules/meta_engine/Makefile.am | 2 +-
> modules/meta_engine/taglib.cpp | 63 +++++++++++++++++++++++++++++++++
> 3 files changed, 84 insertions(+), 1 deletion(-)
>
> diff --git a/include/vlc_charset.h b/include/vlc_charset.h
> index 0ec1734dc9..311856913e 100644
> --- a/include/vlc_charset.h
> +++ b/include/vlc_charset.h
> @@ -93,6 +93,26 @@ VLC_USED static inline const char *IsASCII(const char *str)
>      return str;
> }
>
> +/**
> + * Checks ISO/IEC 8859-1 validity.
> + *
> + * Checks whether a null-terminated string is a valid ISO/IEC 8859-1 bytes sequence
> + *
> + * \param str string to check
> + *
> + * \retval str the string is a valid null-terminated ISO/IEC 8859-1 sequence
> + * \retval NULL the string is not an ISO/IEC 8859-1 sequence
> + */
> +VLC_USED static inline const char *IsLatin1(const char *str)
> +{
> +    unsigned char c;
> +
> +    for (const char *p = str; (c = *p) != '\0'; p++)
> +        if (unlikely(c < 0x20 || (c > 0x7e && c < 0xa0)))
> +            return NULL;
> +    return str;
Oh come on. Eliminating C1 codes is questionable, but you obviously can't just
blindly reject all C0 codes.
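
As an illustration of the C0 problem, the sketch below shows one possible shape for a less aggressive heuristic. It is not the patch's code and not a proposed fix: the function names are hypothetical, the choice of tab, line feed and carriage return as tolerated controls is an assumption, and the DEL/C1 rejection is simply carried over from the quoted hunk even though it is questionable as noted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: these are the C0 controls that legitimately occur in tag
 * text (multi-line comments, lyrics). The set is illustrative only. */
static bool is_common_text_control(unsigned char c)
{
    return c == '\t' || c == '\n' || c == '\r';
}

/* Hypothetical heuristic (not VLC API): returns str if it looks like
 * plain Latin-1 text, NULL if it contains bytes that suggest another
 * charset or binary data. */
static const char *check_latin1_text(const char *str)
{
    assert(str != NULL);

    for (const unsigned char *p = (const unsigned char *)str; *p != '\0'; p++) {
        unsigned char c = *p;
        /* C0 controls other than the tolerated whitespace ones. */
        bool odd_c0 = c < 0x20 && !is_common_text_control(c);
        /* DEL and the C1 range, same bounds as the quoted hunk. */
        bool del_or_c1 = c >= 0x7f && c < 0xa0;

        if (odd_c0 || del_or_c1)
            return NULL;
    }
    return str;
}
```

With this variant a comment such as "line one\n\tline two" is accepted, whereas the quoted IsLatin1() would reject it at the first newline.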
--
レミ・デニ-クールモン
http://www.remlab.net/