My Elixir ISO-8859-1 character encoding issue

Wrestling with character encoding is never fun. Or is it?

2018-04-25
elixir
programming

Being from México, I have been exposed to, and have wrestled with, character encoding issues every now and then, across several programming languages…

Now, it’s Elixir’s time.

The issue

While working my way through The Little Elixir & OTP Guidebook (highly recommended, BTW), I got stuck at the ID3 parser example program:

defmodule ID3Parser do
  def parse(file_name) do
    case File.read(file_name) do
      {:ok, mp3} ->
        mp3_byte_size = byte_size(mp3) - 128

        <<_::binary-size(mp3_byte_size), id3_tag::binary>> = mp3

        <<"TAG",
          title::binary-size(30),
          artist::binary-size(30),
          album::binary-size(30),
          year::binary-size(4),
          _rest::binary>> = id3_tag

        IO.puts "#{artist} - #{title} (#{album} #{year})"

      _ ->
        IO.puts "Couldn't open #{file_name}"
    end
  end
end

I edited the ID3 tags of an MP3 file using Clementine, and modified the title to Adiós.

I wanted to know if the program would handle accented words just fine. It did not.

It was all right when the ID3 tags contained only valid ASCII characters, but as soon as I used an accented character in the title, artist or album, I was presented with this:

iex(1)> ID3Parser.parse "some-song.mp3"

** (ArgumentError) argument error
    (stdlib) :io.put_chars(:standard_io, :unicode, [
      <<89, 111, 112, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 32, 45, 32, 201, 115, 111,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...>>, 10])

The research

After some research and a bit of error reporting, I found out that ID3v1 tags should be (in theory) encoded as ISO-8859-1, AKA Latin-1.

So, what I needed was a way to convert those bytes from ISO-8859-1 to UTF-8, so I could give IO.puts something it can print without problems.
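To see why IO.puts rejects the tag, it helps to look at the bytes directly. The sketch below takes the three bytes that follow " - " in the error above (201, 115, 111); the variable names are only for illustration. Byte 201 is É in ISO-8859-1, but in UTF-8 it announces a two-byte sequence, and 115 is not a valid continuation byte, so the binary is not a valid Elixir string:

```elixir
# Title bytes as stored in the tag (ISO-8859-1): "É", "s", "o"
latin1_title = <<201, 115, 111>>

# The same text encoded as UTF-8: "É" takes two bytes (195, 137)
utf8_title = <<195, 137, 115, 111>>

IO.inspect(String.valid?(latin1_title)) # false
IO.inspect(String.valid?(utf8_title))   # true
```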

The solution

There is something for doing exactly that in Erlang:

:unicode.characters_to_binary(your_string, :latin1)
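As a quick check of what that call does, here is a small sketch using the same three title bytes from the error message (201, 115, 111); the variable names are mine. With :latin1 as the input encoding, the result is a UTF-8 binary:

```elixir
# "Éso" as ISO-8859-1 bytes, as they appear in the ID3v1 tag
latin1 = <<201, 115, 111>>

# Convert from Latin-1 to UTF-8 (the default output encoding)
utf8 = :unicode.characters_to_binary(latin1, :latin1)

IO.inspect(utf8 == <<195, 137, 115, 111>>) # true
IO.inspect(String.valid?(utf8))            # true
```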

This is an implementation that can parse ID3v1 tags that include accented characters.

Just be careful: ISO-8859-1 is the expected encoding, but it is in no way guaranteed.

defmodule ID3Parser do
  def parse(file_name) do
    case File.read(file_name) do
      {:ok, mp3} ->
        mp3_byte_size = byte_size(mp3) - 128

        <<_::binary-size(mp3_byte_size), id3_tag::binary>> = mp3

        <<"TAG",
          title::binary-size(30),
          artist::binary-size(30),
          album::binary-size(30),
          year::binary-size(4),
          _rest::binary>> = id3_tag

        to_convert = [title, artist, album, year]
        [title, artist, album, year] =
          Enum.map(to_convert, fn tag -> from_latin1(tag) end)

        IO.puts "#{artist} - #{title} (#{album} #{year})"

      _ ->
        IO.puts "Couldn't open #{file_name}"
    end
  end

  defp from_latin1(string) do
    :unicode.characters_to_binary(string, :latin1)
  end
end
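Since the encoding is not guaranteed, one defensive option is to convert only when the bytes are not already valid UTF-8. The sketch below is a heuristic, not a reliable detector: some Latin-1 byte sequences also happen to be valid UTF-8. The TagText module and its decode/1 function are hypothetical names of mine, not part of the book's example. It also trims the trailing null bytes visible in the error output above, assuming the field is null-padded:

```elixir
defmodule TagText do
  # Hypothetical helper: decode one fixed-width ID3v1 text field.
  def decode(field) when is_binary(field) do
    text =
      if String.valid?(field) do
        # Already valid UTF-8 (plain ASCII included): keep as-is
        field
      else
        # Otherwise assume ISO-8859-1 and convert to UTF-8
        :unicode.characters_to_binary(field, :latin1)
      end

    # Drop the null-byte padding at the end of the field
    String.trim_trailing(text, <<0>>)
  end
end

IO.inspect(TagText.decode(<<201, 115, 111, 0, 0>>) == <<195, 137, 115, 111>>) # true
```

It could stand in for from_latin1/1 in the parser above if tags written as UTF-8 by some editors turn out to be a problem.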

Hope this helps someone else! 🤓

Some links