Wrestling with character encoding is never fun. Or is it?
Being from México, I have been wrestling with character encoding issues for a long time, using several programming languages…
Now, it’s Elixir’s time.
While working my way through The Little Elixir & OTP Guidebook (a highly recommended one, BTW), I got stuck at the ID3 parser example program:
defmodule ID3Parser do
  def parse(file_name) do
    case File.read(file_name) do
      {:ok, mp3} ->
        mp3_byte_size = byte_size(mp3) - 128
        <<_::binary-size(mp3_byte_size), id3_tag::binary>> = mp3
        <<"TAG",
          title::binary-size(30),
          artist::binary-size(30),
          album::binary-size(30),
          year::binary-size(4),
          _rest::binary>> = id3_tag
        IO.puts "#{artist} - #{title} (#{album} #{year})"
      _ ->
        IO.puts "Couldn't open #{file_name}"
    end
  end
end
Using Clementine, I edited the ID3 tags of an MP3 file named some-song.mp3
and changed the title to Éso.
I wanted to know if the program would handle accents in words just fine.
It did not.
Everything was all right when the ID3 tags contained only ASCII characters, but as soon as I used an accented character in the title, artist or album, this is what I saw:
iex(1)> ID3Parser.parse "some-song.mp3"
** (ArgumentError) argument error
(stdlib) :io.put_chars(:standard_io, :unicode, [
<<89, 111, 112, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 32, 45, 32, 201, 115, 111,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...>>, 10])
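The bytes in that error tell the story: the title starts with 201, which is É in Latin-1 but is not a valid UTF-8 sequence on its own. Here is a small sketch you can paste into iex to confirm it, using only the standard `String.valid?/1`. The variable name is mine, and the binaries are just the title and artist bytes copied from the error output.

```elixir
# Title bytes copied from the error above: 201, 115, 111 ("Éso" in Latin-1).
title_bytes = <<201, 115, 111>>

# 201 (0xC9) announces a two-byte UTF-8 sequence, but 115 is not a
# continuation byte, so this binary is not valid UTF-8.
IO.inspect(String.valid?(title_bytes))
# false

# The artist bytes are plain ASCII, which is always valid UTF-8.
IO.inspect(String.valid?(<<89, 111, 112>>))
# true
```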
After some research here and there, and then some error reporting,
I found out that ID3v1 tags should be, in theory, encoded as ISO-8859-1, AKA Latin-1.
So what I needed was a way to convert those bytes from ISO-8859-1 to UTF-8, so that IO.puts
gets something it can print without problems.
There is something for doing exactly that in Erlang:
:unicode.characters_to_binary(your_string, :latin1)
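To see what that call does, here is a tiny sketch that feeds it the three title bytes from the error above. The variable names are mine; the function and the `:latin1` argument are the ones just shown.

```elixir
# "Éso" as Latin-1 bytes, copied from the error output.
latin1_title = <<201, 115, 111>>

# Treat the input as Latin-1 and get a UTF-8 binary back.
utf8_title = :unicode.characters_to_binary(latin1_title, :latin1)

# É (U+00C9) becomes the two bytes 195, 137 in UTF-8; "s" and "o" stay the same.
IO.inspect(utf8_title, binaries: :as_binaries)
# <<195, 137, 115, 111>>

# Now IO.puts has something it can print.
IO.puts(utf8_title)
```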
Here is an implementation that can parse ID3v1 tags that include accented characters.
Just be careful: the ISO-8859-1 encoding is expected, but in no way guaranteed.
defmodule ID3Parser do
  def parse(file_name) do
    case File.read(file_name) do
      {:ok, mp3} ->
        # The ID3v1 tag lives in the last 128 bytes of the file.
        mp3_byte_size = byte_size(mp3) - 128
        <<_::binary-size(mp3_byte_size), id3_tag::binary>> = mp3
        <<"TAG",
          title::binary-size(30),
          artist::binary-size(30),
          album::binary-size(30),
          year::binary-size(4),
          _rest::binary>> = id3_tag
        # Convert every field from Latin-1 to UTF-8 before printing.
        to_convert = [title, artist, album, year]
        [title, artist, album, year] =
          Enum.map(to_convert, fn tag -> from_latin1(tag) end)
        IO.puts "#{artist} - #{title} (#{album} #{year})"
      _ ->
        IO.puts "Couldn't open #{file_name}"
    end
  end

  defp from_latin1(string) do
    :unicode.characters_to_binary(string, :latin1)
  end
end
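Since Latin-1 is expected but not guaranteed, and since the fixed-width fields are padded with zero bytes (you can see them in the error output above), here is a rough sketch of a more defensive helper. `TagField` and `clean/1` are names I made up for illustration; they are not from the book. The heuristic is also my own assumption: if the bytes are already valid UTF-8 they are kept as they are, otherwise they are treated as Latin-1. It can guess wrong, so take it as a starting point only.

```elixir
defmodule TagField do
  # Hypothetical helper, not part of the book's example.
  def clean(field) when is_binary(field) do
    field
    |> strip_padding()
    |> to_utf8()
  end

  # Keep only what comes before the first zero byte.
  defp strip_padding(field) do
    field |> :binary.split(<<0>>) |> hd()
  end

  # Assumption: valid UTF-8 is left alone, anything else is read as Latin-1.
  defp to_utf8(bytes) do
    if String.valid?(bytes) do
      bytes
    else
      :unicode.characters_to_binary(bytes, :latin1)
    end
  end
end

IO.inspect(TagField.clean(<<201, 115, 111, 0, 0, 0>>))
IO.inspect(TagField.clean("Yop" <> <<0, 0>>))
```

In the parser, `Enum.map(to_convert, &TagField.clean/1)` could stand in for the `from_latin1/1` call.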
Hope this helps someone else! 🤓