Elixir and my ISO-8859-1 character encoding problem

Published on 2018/04/25

Being from México, I have been wrestling with character encoding issues for a long time, in several languages…

Now, it’s Elixir’s time.

The problem

While working my way through The Little Elixir & OTP Guidebook (a highly recommended one), I got stuck at the ID3 parser example program:

defmodule ID3Parser do
  def parse(file_name) do
    case File.read(file_name) do
      {:ok, mp3} ->
        mp3_byte_size = byte_size(mp3) - 128

        <<_::binary-size(mp3_byte_size), id3_tag::binary>> = mp3

        <<"TAG",
          title::binary-size(30),
          artist::binary-size(30),
          album::binary-size(30),
          year::binary-size(4),
          _rest::binary>> = id3_tag

        IO.puts "#{artist} - #{title} (#{album} #{year})"

      _ ->
        IO.puts "Couldn't open #{file_name}"
    end
  end
end

Using Clementine, I edited the ID3 tags of a file named some-song.mp3 and set its title to Éso.

I wanted to know whether the program would handle accented characters like that one. It did not.

Everything was fine as long as the ID3 tags contained only ASCII characters. As soon as I put an accented character in the title, artist, or album, I got an error like this:

iex(1)> ID3Parser.parse "some-song.mp3"

** (ArgumentError) argument error
    (stdlib) :io.put_chars(:standard_io, :unicode, [
      <<89, 111, 112, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 32, 45, 32, 201, 115, 111,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...>>, 10])
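
The byte list in the error hints at the cause: 201 is how ISO-8859-1 stores É, but on its own it is not a valid UTF-8 sequence, and IO.puts writes to standard output as :unicode. A minimal sketch to check this in iex (the variable name latin1_title is only for illustration):

```elixir
# The title bytes as they appear in the error above: 201, 115, 111.
# In ISO-8859-1 these spell "Éso".
latin1_title = <<201, 115, 111>>

# 201 (0xC9) starts a two-byte UTF-8 sequence, but 115 is not a
# continuation byte, so the binary is not valid UTF-8.
IO.inspect(String.valid?(latin1_title))
# prints false

# The same text in UTF-8 needs two bytes for "É" (U+00C9).
IO.inspect("\u00C9so" == <<195, 137, 115, 111>>)
# prints true
```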

The solution

After some research here and there, then some error reporting… I found out that ID3v1 tags —the ones the program is trying to parse— should in theory be encoded as ISO-8859-1, also known as Latin 1.

What I needed was a way to convert those bytes from ISO-8859-1 (Latin 1) to UTF-8, the Unicode encoding Elixir strings use, and give IO.puts something it could print without problems.

I found exactly that in this Erlang facility:

:unicode.characters_to_binary(your_string, :latin1)
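
As a quick sketch of what that call does, here it is applied to the title bytes from the error message above. The variable names are made up for the example:

```elixir
# "Éso" as stored in the ID3v1 tag (ISO-8859-1 bytes).
latin1_title = <<201, 115, 111>>

# Reinterpret each byte as a Latin 1 code point and re-encode as UTF-8.
utf8_title = :unicode.characters_to_binary(latin1_title, :latin1)

# The single byte 201 became the two-byte sequence 195, 137.
IO.inspect(utf8_title == <<195, 137, 115, 111>>)
# prints true

# The result is now a valid Elixir string that IO.puts can print.
IO.inspect(String.valid?(utf8_title))
# prints true
IO.puts(utf8_title)
```

Every byte value is a valid Latin 1 character, so with :latin1 as the input encoding the conversion of a plain binary should always succeed, even if the tag was actually written in some other encoding. In that case the output will be valid UTF-8 but may not be the intended text.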

This is the final program that correctly parses ID3v1 tags in their expected encoding —careful, the encoding is expected, but in no way guaranteed:

defmodule ID3Parser do
  def parse(file_name) do
    case File.read(file_name) do
      {:ok, mp3} ->
        mp3_byte_size = byte_size(mp3) - 128

        <<_::binary-size(mp3_byte_size), id3_tag::binary>> = mp3

        <<"TAG",
          title::binary-size(30),
          artist::binary-size(30),
          album::binary-size(30),
          year::binary-size(4),
          _rest::binary>> = id3_tag

        # Convert every field from Latin 1 to UTF-8 before printing.
        [title, artist, album, year] =
          Enum.map([title, artist, album, year], &from_latin1/1)

        IO.puts "#{artist} - #{title} (#{album} #{year})"

      _ ->
        IO.puts "Couldn't open #{file_name}"
    end
  end

  # Reinterprets the bytes as ISO-8859-1 and returns a UTF-8 binary.
  defp from_latin1(string) do
    :unicode.characters_to_binary(string, :latin1)
  end
end
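
One detail the error output also shows: each field is padded with zero bytes up to its fixed width, and the program above prints that padding along with the text. A possible refinement, sketched here under the assumption that everything after the first zero byte is padding (ID3Field and clean/1 are hypothetical names, not part of the book's program):

```elixir
defmodule ID3Field do
  # Hypothetical helper: cleans one fixed-width ID3v1 field.
  def clean(field) do
    field
    # Keep only the bytes before the first zero byte, if there is one.
    |> :binary.split(<<0>>)
    |> hd()
    # Convert from ISO-8859-1 to UTF-8.
    |> :unicode.characters_to_binary(:latin1)
    # Some taggers pad with spaces instead of zeros (an assumption here).
    |> String.trim_trailing()
  end
end

# A 30-byte title field: "Éso" in Latin 1 followed by 27 zero bytes.
padded = <<201, 115, 111>> <> :binary.copy(<<0>>, 27)

IO.inspect(ID3Field.clean(padded) == "\u00C9so")
# prints true
```

In the parser, this could stand in for from_latin1/1 so that the printed line carries no padding.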

Hopefully this will help someone else in the same predicament.

— lt

Feedback & comments

Get in touch on Twitter

Or by good ol' email at adriandcs@gmail.com