lobo_tuerto's notes

My Elixir ISO-8859-1 character encoding issue

Wrestling with character encoding is never fun. Or is it?

📅 Published 25 April 2018
🏷️ elixir, programming

Being from México, I have been wrestling with character encoding issues for a long time, using several programming languages…

Now, it’s Elixir’s time.

The issue

While working my way through The Little Elixir & OTP Guidebook (a highly recommended one, BTW), I got stuck at the ID3 parser example program:

defmodule ID3Parser do
  def parse(file_name) do
    case File.read(file_name) do
      {:ok, mp3} ->
        mp3_byte_size = byte_size(mp3) - 128

        <<_::binary-size(mp3_byte_size), id3_tag::binary>> = mp3

        <<"TAG",
          title::binary-size(30),
          artist::binary-size(30),
          album::binary-size(30),
          year::binary-size(4),
          _rest::binary>> = id3_tag

        IO.puts "#{artist} - #{title} (#{album} #{year})"

      _ ->
        IO.puts "Couldn't open #{file_name}"
    end
  end
end

Using Clementine, I edited the ID3 tags of an MP3 file named some-song.mp3 and changed the title to Éso.

I wanted to know if the program would handle accents in words just fine.
It did not.

Everything was all right when the ID3 tags contained only valid ASCII characters. As soon as I used an accented character in the title, artist or album, what I saw was this:

iex(1)> ID3Parser.parse "some-song.mp3"

** (ArgumentError) argument error
    (stdlib) :io.put_chars(:standard_io, :unicode, [
      <<89, 111, 112, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 32, 45, 32, 201, 115, 111,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...>>, 10])

The research

After some research here and there, and then some error reporting, I found out that ID3v1 tags should, in theory, be encoded as ISO-8859-1, AKA Latin-1.

So, what I needed was a way to convert those bytes from ISO-8859-1 to UTF-8, so that IO.puts would get something it can print without problems.
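To see why IO.puts chokes, here is a small sketch using the three title bytes from the error output above (201, 115, 111). The variable name is just for illustration. It checks whether those bytes form valid UTF-8 and compares them with how Elixir itself stores "Éso":

```elixir
# Bytes taken from the error output: "Éso" as stored in the ID3v1 tag.
latin1_title = <<201, 115, 111>>

# 201 (0xC9) would start a two-byte sequence in UTF-8, but 115 is not a
# continuation byte, so this binary is not valid UTF-8.
IO.inspect(String.valid?(latin1_title))
# prints false

# In UTF-8, "É" takes two bytes (195, 137) instead of one.
IO.inspect("Éso" == <<195, 137, 115, 111>>)
# prints true
```

In Latin-1 every character is exactly one byte, which is why the accented letter shows up as a single 201 in the tag.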

The solution

There is something for doing exactly that in Erlang:

:unicode.characters_to_binary(your_string, :latin1)
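As a quick check, this is roughly what that call does with the title bytes from the error output. The variable names are mine; when no output encoding is given, :unicode.characters_to_binary/2 produces UTF-8:

```elixir
# The Latin-1 bytes for "Éso", as they appeared in the failing call.
latin1_title = <<201, 115, 111>>

# Read the input as Latin-1 and get a UTF-8 binary back.
utf8_title = :unicode.characters_to_binary(latin1_title, :latin1)

# IO.puts is happy now, because the binary is valid UTF-8.
IO.puts(utf8_title)

# Show the raw bytes: the single 201 became the pair 195, 137.
IO.inspect(utf8_title, binaries: :as_binaries)
```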

This is an implementation that can parse ID3v1 tags that include accented characters.

Just be careful: the ISO-8859-1 encoding is expected, but in no way guaranteed.

defmodule ID3Parser do
  def parse(file_name) do
    case File.read(file_name) do
      {:ok, mp3} ->
        mp3_byte_size = byte_size(mp3) - 128

        <<_::binary-size(mp3_byte_size), id3_tag::binary>> = mp3

        <<"TAG",
          title::binary-size(30),
          artist::binary-size(30),
          album::binary-size(30),
          year::binary-size(4),
          _rest::binary>> = id3_tag

        to_convert = [title, artist, album, year]
        [title, artist, album, year] =
          Enum.map(to_convert, fn tag -> from_latin1(tag) end)

        IO.puts "#{artist} - #{title} (#{album} #{year})"

      _ ->
        IO.puts "Couldn't open #{file_name}"
    end
  end

  defp from_latin1(string) do
    :unicode.characters_to_binary(string, :latin1)
  end
end
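Since Latin-1 is expected but not guaranteed, one possible way to hedge is sketched below. ID3Field and its functions are hypothetical names of mine, not something from the book. The sketch assumes that a field which is already valid UTF-8 was written as UTF-8 by some tagger and leaves it alone, and treats anything else as Latin-1. It also drops the NUL padding visible in the error output, plus any trailing spaces, which some taggers use as padding instead.

```elixir
defmodule ID3Field do
  # Hypothetical helper, not part of the book's example.

  # Turn one fixed-width ID3v1 field into a printable UTF-8 string.
  def decode(field) when is_binary(field) do
    field
    |> strip_padding()
    |> to_utf8()
    |> String.trim_trailing()
  end

  # Keep only the bytes before the first NUL byte.
  defp strip_padding(field) do
    field |> :binary.split(<<0>>) |> hd()
  end

  # Assumption: valid UTF-8 is kept as is, everything else is read as Latin-1.
  defp to_utf8(bytes) do
    if String.valid?(bytes) do
      bytes
    else
      :unicode.characters_to_binary(bytes, :latin1)
    end
  end
end

# A 30-byte title field like the one in the error output.
title_field = <<201, 115, 111>> <> :binary.copy(<<0>>, 27)
IO.puts(ID3Field.decode(title_field))
```

This is only a heuristic: a Latin-1 field whose bytes happen to form valid UTF-8 would be misread, so it is worth testing against your own files.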

Hope this helps someone else! 🤓

