2018 / 04 / 25
My Elixir ISO-8859-1 character encoding issue
Wrestling with character encoding is never fun. Or is it?
Being from México, I have been exposed to, and have wrestled with, character encoding issues every now and then, in several programming languages…
Now, it’s Elixir’s time.
The issue
While working my way through The Little Elixir & OTP Guidebook (highly recommended, BTW), I got stuck at the ID3 parser example program:
defmodule ID3Parser do
  def parse(file_name) do
    case File.read(file_name) do
      {:ok, mp3} ->
        mp3_byte_size = byte_size(mp3) - 128
        <<_::binary-size(mp3_byte_size), id3_tag::binary>> = mp3

        <<"TAG",
          title::binary-size(30),
          artist::binary-size(30),
          album::binary-size(30),
          year::binary-size(4),
          _rest::binary>> = id3_tag

        IO.puts "#{artist} - #{title} (#{album} #{year})"

      _ ->
        IO.puts "Couldn't open #{file_name}"
    end
  end
end
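To try the same pattern match without an MP3 at hand, a 128-byte tag can be built in memory. This is only a sketch: `FakeTag`, `pad/2` and `build/4` are hypothetical names that are not part of the book's example, and the 1000 filler bytes simply stand in for audio data.

```elixir
defmodule FakeTag do
  # Hypothetical helper: pads `value` with null bytes up to `size` bytes.
  # Assumes `value` is not longer than `size`.
  def pad(value, size) do
    value <> :binary.copy(<<0>>, size - byte_size(value))
  end

  # Builds a 128-byte tag: "TAG" (3) + title (30) + artist (30) +
  # album (30) + year (4) + 31 remaining bytes.
  def build(title, artist, album, year) do
    "TAG" <>
      pad(title, 30) <>
      pad(artist, 30) <>
      pad(album, 30) <>
      pad(year, 4) <>
      pad("", 31)
  end
end

# 1000 filler bytes standing in for the audio, followed by the tag.
fake_mp3 = :binary.copy(<<255>>, 1000) <> FakeTag.build("Adios", "Yop", "Demo", "2018")

# Same slicing as in the parser: skip everything except the last 128 bytes.
offset = byte_size(fake_mp3) - 128
<<_::binary-size(offset), id3_tag::binary>> = fake_mp3

<<"TAG", title::binary-size(30), _rest::binary>> = id3_tag

IO.inspect(byte_size(id3_tag), label: "tag size")
IO.inspect(title, label: "raw title field")
```

The raw title field still carries its null-byte padding, which is why the failing output further down is mostly zeros.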
I edited the ID3 tags of an MP3 file using Clementine, and modified the title to Adiós.
I wanted to know if the program would handle accented words just fine. It did not.
It was all right when the ID3 tags contained only valid ASCII characters, but as soon as I used an accented character in the title, artist or album, I was presented with this:
iex(1)> ID3Parser.parse "some-song.mp3"
** (ArgumentError) argument error
(stdlib) :io.put_chars(:standard_io, :unicode, [
<<89, 111, 112, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 32, 45, 32, 201, 115, 111,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...>>, 10])
The research
After some research here and there, and then some error reporting, I found out that ID3v1 tags should be, in theory, encoded as ISO-8859-1, AKA Latin-1.
The byte 201 in the error above is É in ISO-8859-1, but on its own it is not a valid UTF-8 sequence, which is why printing it as Unicode failed.
So, what I needed was a way to convert those bytes from ISO-8859-1 to UTF-8, so that IO.puts would get something it can print without problems.
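The mismatch is easy to check in isolation. The snippet below is a small sketch that spells out the word Adiós by hand in both encodings; the variable names are only for illustration.

```elixir
# "Adiós" as ISO-8859-1 bytes: ó is the single byte 243 (0xF3).
latin1_title = <<65, 100, 105, 243, 115>>

# The same word as a regular Elixir string, which is UTF-8:
# ó takes two bytes, 195 and 179.
utf8_title = "Adiós"

IO.inspect(String.valid?(latin1_title), label: "latin1 bytes are valid UTF-8")
IO.inspect(String.valid?(utf8_title), label: "utf8 bytes are valid UTF-8")
IO.inspect(utf8_title, binaries: :as_binaries, label: "utf8 bytes")
```

`String.valid?/1` returns false for the first binary and true for the second, so the tag bytes cannot be handed over as-is to anything that expects UTF-8.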
The solution
There is something for doing exactly that in Erlang:
:unicode.characters_to_binary(your_string, :latin1)
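As a quick check of that call, here is a minimal sketch using hand-written ISO-8859-1 bytes for Adiós instead of a real tag (the variable names are just for illustration):

```elixir
# "Adiós" as ISO-8859-1 bytes (ó is the single byte 243).
latin1_title = <<65, 100, 105, 243, 115>>

# Re-encode the bytes as UTF-8, telling Erlang the input is Latin-1.
utf8_title = :unicode.characters_to_binary(latin1_title, :latin1)

IO.inspect(utf8_title, binaries: :as_binaries, label: "converted bytes")
IO.puts(utf8_title)
```

After the conversion the single byte 243 becomes the two bytes 195 and 179, and the result is a regular Elixir string.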
This is an implementation that can parse ID3v1 tags that include accented characters.
Just be careful: the ISO-8859-1 encoding is expected, but in no way guaranteed.
defmodule ID3Parser do
  def parse(file_name) do
    case File.read(file_name) do
      {:ok, mp3} ->
        mp3_byte_size = byte_size(mp3) - 128
        <<_::binary-size(mp3_byte_size), id3_tag::binary>> = mp3

        <<"TAG",
          title::binary-size(30),
          artist::binary-size(30),
          album::binary-size(30),
          year::binary-size(4),
          _rest::binary>> = id3_tag

        to_convert = [title, artist, album, year]

        [title, artist, album, year] =
          Enum.map(to_convert, fn tag -> from_latin1(tag) end)

        IO.puts "#{artist} - #{title} (#{album} #{year})"

      _ ->
        IO.puts "Couldn't open #{file_name}"
    end
  end

  defp from_latin1(string) do
    :unicode.characters_to_binary(string, :latin1)
  end
end
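Since the encoding is not guaranteed, one possible way to be a bit more defensive is to convert only the fields that are not already valid UTF-8. This is a heuristic sketch, not part of the original program: `TagText` and `to_utf8/1` are hypothetical names, and the assumption is that a field that happens to be valid UTF-8 was written as UTF-8.

```elixir
defmodule TagText do
  # Hypothetical helper: returns a UTF-8 binary for a raw tag field.
  # Assumption: a field that is already valid UTF-8 was written as UTF-8;
  # anything else is treated as ISO-8859-1.
  def to_utf8(field) do
    if String.valid?(field) do
      field
    else
      :unicode.characters_to_binary(field, :latin1)
    end
  end
end

latin1_field = <<65, 100, 105, 243, 115>>  # "Adiós" in ISO-8859-1
utf8_field = "Adiós"                       # already UTF-8

IO.puts(TagText.to_utf8(latin1_field))
IO.puts(TagText.to_utf8(utf8_field))
```

The check cannot tell the two encodings apart in every case: plain ASCII is identical in both, and some ISO-8859-1 byte pairs (for example the bytes for Ã³) also happen to be valid UTF-8, so they would be left untouched.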
Hope this helps someone else! 🤓