Strings and Runes in Elixir

# An Elixir string is a UTF-8 encoded binary. The language
# and standard library treat strings specially, as
# containers of text encoded in UTF-8.
# What a reader perceives as a single character is called a "grapheme"
# in Elixir. A grapheme consists of one or more Unicode code points, and
# each code point is encoded in UTF-8 as one to four bytes.

defmodule StringsAndGraphemes do
  def run do
    # `s` is a string assigned a literal value
    # representing the word "hello" in the Thai language.
    # Elixir string literals are UTF-8 encoded text.
    s = "สวัสดี"

    # byte_size/1 returns the number of raw bytes stored in the string.
    IO.puts("Len: #{byte_size(s)}")

    # This loop generates the hex values of all
    # the bytes that constitute the code points in `s`.
    for <<byte <- s>> do
      IO.write("#{Integer.to_string(byte, 16)} ")
    end
    IO.puts("")

    # To count how many graphemes are in a string, we can use
    # the String.length/1 function. Note that a grapheme may combine
    # several code points: here the two Thai vowel signs attach to the
    # consonants before them, so the result is lower than the number
    # of code points.
    IO.puts("Grapheme count: #{String.length(s)}")

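    # String.codepoints/1 returns the code points in a string, each as its
    # own single-code-point string. The index printed here is the position
    # in that list, not a byte offset.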
    for {codepoint, index} <- Enum.with_index(String.codepoints(s)) do
      IO.puts("#{inspect(codepoint)} starts at #{index}")
    end

    # We can also walk the string one grapheme at a time with
    # String.next_grapheme/1, printing the byte offset where each grapheme starts.
    IO.puts("\nUsing String.next_grapheme/1")
    print_graphemes(s)

    # This demonstrates passing a grapheme to a function.
    Enum.each(String.graphemes(s), &examine_grapheme/1)
  end

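  # Walks the string one grapheme at a time, carrying along the byte
  # offset at which the current grapheme starts.
  defp print_graphemes(string, offset \\ 0)
  defp print_graphemes(<<>>, _offset), do: :ok

  defp print_graphemes(string, offset) do
    {grapheme, rest} = String.next_grapheme(string)
    IO.puts("#{inspect(grapheme)} starts at #{offset}")
    # The next grapheme starts right after the bytes of this one.
    print_graphemes(rest, offset + byte_size(grapheme))
  end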

  # Values enclosed in double quotes are string literals in Elixir.
  # Because a grapheme is itself a string, we can pattern match on it
  # with a string literal directly in the function head.
  defp examine_grapheme("t"), do: IO.puts("found tee")
  defp examine_grapheme("ส"), do: IO.puts("found so sua")
  defp examine_grapheme(_), do: :ok
end

StringsAndGraphemes.run()

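This Elixir code demonstrates working with strings and graphemes. Graphemes play the role that runes play in some other languages, with one difference: a rune is a single code point, while a grapheme may be made up of several. Here’s a breakdown of the key points: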

  1. Elixir strings are UTF-8 encoded binaries.
  2. We use byte_size/1 to get the raw byte length of a string.
  3. We can iterate over the raw bytes of a string using a comprehension with the <<>> syntax.
  4. String.length/1 counts the number of graphemes in a string.
  5. String.codepoints/1 returns a list of Unicode code points.
  6. String.next_grapheme/1 can be used to iterate over graphemes in a string.
  7. String.graphemes/1 returns a list of graphemes in a string.
  8. Pattern matching can be used to examine specific graphemes.
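
The difference between points 2, 4 and 5 is easiest to see side by side. The following is a minimal sketch, not part of the original example, that measures the same Thai word in bytes, code points and graphemes. The variable names are chosen here for illustration only.

```elixir
# Three different "lengths" of the same Thai word.
s = "สวัสดี"

bytes = byte_size(s)                       # raw UTF-8 bytes
codepoints = length(String.codepoints(s))  # Unicode code points
graphemes = String.length(s)               # user-perceived characters

IO.puts("bytes: #{bytes}, code points: #{codepoints}, graphemes: #{graphemes}")

# Each vowel sign is grouped with the consonant before it,
# so the list below has fewer entries than String.codepoints(s).
IO.inspect(String.graphemes(s))
```

For this word the three counts are 18, 6 and 4: every code point takes three bytes, and two of the six code points are vowel signs that combine with the preceding consonant.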

When you run this code, you’ll see output similar to the following:

Len: 18
E0 B8 AA E0 B8 A7 E0 B8 B1 E0 B8 AA E0 B8 94 E0 B8 B5 
Grapheme count: 4
"ส" starts at 0
"ว" starts at 1
"ั" starts at 2
"ส" starts at 3
"ด" starts at 4
"ี" starts at 5

Using String.next_grapheme/1
"ส" starts at 0
"วั" starts at 3
"ส" starts at 9
"ดี" starts at 12
found so sua
found so sua

This example demonstrates how Elixir handles UTF-8 encoded strings and graphemes, providing a similar level of functionality to other languages’ string and character handling capabilities.
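
For readers coming from a language with runes, the closest counterpart is an integer code point together with the byte offset where it starts. The sketch below shows one way to get that with binary pattern matching on `::utf8`. The module `CodepointOffsets` and its `offsets/1` function are hypothetical helpers introduced here, not part of the standard library, and the sketch assumes the input is valid UTF-8 (invalid input raises a `FunctionClauseError`).

```elixir
defmodule CodepointOffsets do
  # Hypothetical helper: returns a list of {byte_offset, codepoint}
  # tuples, where codepoint is an integer. Assumes valid UTF-8 input.
  def offsets(string), do: walk(string, 0, [])

  defp walk(<<>>, _offset, acc), do: Enum.reverse(acc)

  defp walk(<<cp::utf8, rest::binary>>, offset, acc) do
    # Re-encode the code point to find out how many bytes it occupies.
    size = byte_size(<<cp::utf8>>)
    walk(rest, offset + size, [{offset, cp} | acc])
  end
end

# Print each code point in U+XXXX form with its starting byte offset.
for {offset, cp} <- CodepointOffsets.offsets("สวัสดี") do
  hex = String.pad_leading(Integer.to_string(cp, 16), 4, "0")
  IO.puts("U+#{hex} starts at #{offset}")
end
```

Unlike the `String.next_grapheme/1` walk, this visits the two vowel signs as separate entries, since it works at the code point level rather than the grapheme level.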