# An Elixir string is a UTF-8 encoded binary. The language
# and standard library treat strings specially, as
# containers of text encoded in UTF-8.
# In Elixir, the closest concept to a character is the "grapheme":
# one or more Unicode code points that are displayed as a single
# character. Each code point is UTF-8 encoded as one to four bytes.
defmodule StringsAndGraphemes do
  def run do
    # `s` is a string assigned a literal value
    # representing the word "hello" in the Thai language.
    # Elixir string literals are UTF-8 encoded text.
    s = "สวัสดี"

    # This will produce the length of the raw bytes stored within.
    IO.puts("Len: #{byte_size(s)}")

    # This loop prints the hex values of all
    # the bytes that constitute the code points in `s`.
    for <<byte <- s>> do
      IO.write("#{Integer.to_string(byte, 16)} ")
    end

    IO.puts("")

    # To count how many graphemes are in a string, we can use
    # the String.length/1 function. Note that each Thai code point
    # spans multiple bytes in UTF-8, and that a combining vowel mark
    # joins the preceding consonant to form a single grapheme.
    IO.puts("Grapheme count: #{String.length(s)}")

    # String.codepoints/1 returns a list of Unicode code points in a string.
    # The index printed here is the position in that list, not a byte offset.
    for {codepoint, index} <- Enum.with_index(String.codepoints(s)) do
      IO.puts("#{inspect(codepoint)} starts at #{index}")
    end

    # We can walk the string one grapheme at a time with
    # String.next_grapheme/1, tracking the byte offset at which
    # each grapheme starts.
    IO.puts("\nUsing String.next_grapheme/1")
    print_graphemes(s)

    # This demonstrates passing a grapheme to a function.
    Enum.each(String.graphemes(s), &examine_grapheme/1)
  end

  # `offset` is the byte offset of `string` within the original string.
  defp print_graphemes(string, offset \\ 0)
  defp print_graphemes(<<>>, _offset), do: :ok

  defp print_graphemes(string, offset) do
    {grapheme, rest} = String.next_grapheme(string)
    IO.puts("#{inspect(grapheme)} starts at #{offset}")
    print_graphemes(rest, offset + byte_size(grapheme))
  end

  # Values enclosed in double quotes are string literals in Elixir.
  # We can match a grapheme against a string literal directly.
  defp examine_grapheme("t"), do: IO.puts("found tee")
  defp examine_grapheme("ส"), do: IO.puts("found so sua")
  defp examine_grapheme(_), do: :ok
end

StringsAndGraphemes.run()
This Elixir code demonstrates working with strings and graphemes, which play the role that strings and characters (or runes) play in other languages. Here's a breakdown of the key points:
We use byte_size/1 to get the raw byte length of a string.
We iterate over the bytes of a string with a binary generator, using the <<>> syntax.
String.length/1 counts the number of graphemes in a string.
String.codepoints/1 returns a list of Unicode code points.
String.next_grapheme/1 can be used to iterate over graphemes in a string.
String.graphemes/1 returns a list of graphemes in a string.
When you run this code, you'll see output similar to the following:
Len: 18
E0 B8 AA E0 B8 A7 E0 B8 B1 E0 B8 AA E0 B8 94 E0 B8 B5
Grapheme count: 4
"ส" starts at 0
"ว" starts at 1
"ั" starts at 2
"ส" starts at 3
"ด" starts at 4
"ี" starts at 5
Using String.next_grapheme/1
"ส" starts at 0
"วั" starts at 3
"ส" starts at 9
"ดี" starts at 12
found so sua
found so sua
This example demonstrates how Elixir handles UTF-8 encoded strings at three levels: raw bytes, Unicode code points, and graphemes.
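As a cross-check, here is a minimal sketch that measures the same word in bytes, code points, and graphemes, and computes the byte offset at which each grapheme starts. The `TextUnits` module and its `grapheme_offsets/1` function are hypothetical names introduced only for this illustration. The calls they rely on (`String.graphemes/1`, `Enum.map_reduce/3`, `byte_size/1`) are standard library functions.

```elixir
# Hypothetical helper module, used only for this illustration.
defmodule TextUnits do
  # Returns a list of {grapheme, byte_offset} pairs for `string`.
  def grapheme_offsets(string) do
    {pairs, _total_bytes} =
      string
      |> String.graphemes()
      |> Enum.map_reduce(0, fn grapheme, offset ->
        # Emit the current offset, then advance by this grapheme's size.
        {{grapheme, offset}, offset + byte_size(grapheme)}
      end)

    pairs
  end
end

s = "สวัสดี"

# Each Thai code point here takes 3 bytes in UTF-8.
IO.inspect(byte_size(s), label: "bytes")
# prints "bytes: 18"

IO.inspect(length(String.codepoints(s)), label: "code points")
# prints "code points: 6"

# Two of the code points are combining vowel marks, so they merge
# with the preceding consonant into a single grapheme.
IO.inspect(String.length(s), label: "graphemes")
# prints "graphemes: 4"

IO.inspect(TextUnits.grapheme_offsets(s), label: "byte offsets")
```

Using `Enum.map_reduce/3` carries the running byte offset as an accumulator, which gives the same offsets as the recursive `print_graphemes` helper without writing the recursion by hand.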