Strings and Runes in Ruby

Ruby strings are mutable sequences of characters. Every string carries an encoding, and string literals are UTF-8 by default, so the language and standard library treat strings as encoded text rather than raw bytes. A character is represented by a single-character string; there is no separate “rune” type as in some other languages.
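
As a minimal sketch of these two points (in-place mutation, and characters being ordinary one-character strings), consider the snippet below. The helper name `describe_first_char` is hypothetical and exists only for illustration.

```ruby
# Hypothetical helper for illustration: returns the first character of
# str together with its class and its Unicode code point.
def describe_first_char(str)
  first = str[0]                 # indexing yields a String, not a separate char type
  [first, first.class, first.ord]
end

greeting = String.new("hello")   # String.new returns an unfrozen copy
greeting << ", world"            # << appends in place on the same object
puts greeting

p describe_first_char(greeting)
puts greeting.encoding           # each string carries its own encoding
```

`String.new` is used here so the sketch also works in files that enable frozen string literals.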

# s is a String assigned a literal value
# representing the word "hello" in the Thai language.
# Ruby string literals are UTF-8 encoded text by default.
s = "สวัสดี"

# This prints the length of the string in characters, not bytes;
# String#bytesize would give the byte count (18 for this string).
puts "Len: #{s.length}"

# This loop generates the hex values of all
# the bytes that constitute the characters in s.
s.each_byte do |byte|
  print "#{byte.to_s(16)} "
end
puts

# To count how many characters are in a string, we can use
# the length method. Note that Ruby treats Thai characters
# as single characters, even if they span multiple bytes in UTF-8.
puts "Character count: #{s.length}"

# Ruby's each_char method allows us to iterate over each character
# in the string. The index supplied by with_index is the character
# position, not a byte offset.
s.each_char.with_index do |char, idx|
  puts "U+#{char.ord.to_s(16).upcase.rjust(4, '0')} '#{char}' starts at #{idx}"
end

# examine_char must be defined before the loop below calls it.
def examine_char(char)
  # A character is just a one-character String, so we can
  # compare it to a string literal directly.
  if char == 't'
    puts "found tee"
  elsif char == 'ส'
    puts "found so sua"
  end
end

puts "\nUsing String#unpack"
# We can achieve a similar iteration by unpacking the string
# into its Unicode code points.
s.unpack("U*").each_with_index do |codepoint, idx|
  char = [codepoint].pack("U")
  puts "U+#{codepoint.to_s(16).upcase.rjust(4, '0')} '#{char}' starts at #{idx}"
  examine_char(char)
end

When you run this program, you’ll see output similar to this:

Len: 6
e0 b8 aa e0 b8 a7 e0 b8 b1 e0 b8 aa e0 b8 94 e0 b8 b5 
Character count: 6
U+0E2A 'ส' starts at 0
U+0E27 'ว' starts at 1
U+0E31 'ั' starts at 2
U+0E2A 'ส' starts at 3
U+0E14 'ด' starts at 4
U+0E35 'ี' starts at 5

Using String#unpack
U+0E2A 'ส' starts at 0
found so sua
U+0E27 'ว' starts at 1
U+0E31 'ั' starts at 2
U+0E2A 'ส' starts at 3
found so sua
U+0E14 'ด' starts at 4
U+0E35 'ี' starts at 5
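
The indices printed above are character positions. If byte offsets are needed instead (for example, to line up with the hex dump of the bytes), one possible approach is to keep a running total of each character's `bytesize`. The helper name `char_byte_offsets` is an assumption made for this sketch and is not part of the program above.

```ruby
# Hypothetical helper: pairs each character of str with the byte offset
# at which its encoded form begins.
def char_byte_offsets(str)
  offset = 0
  str.each_char.map do |char|
    pair = [char, offset]
    offset += char.bytesize      # advance by the encoded width of this character
    pair
  end
end

char_byte_offsets("สวัสดี").each do |char, offset|
  puts "U+#{char.ord.to_s(16).upcase.rjust(4, '0')} starts at byte #{offset}"
end
```

Every character in this string is three bytes wide in UTF-8, so the offsets advance in steps of three.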

This example demonstrates various ways to work with strings and characters in Ruby, including iterating over bytes and characters, and examining individual characters. Methods such as String#length and String#each_char operate on characters (Unicode code points) rather than bytes, so multi-byte UTF-8 characters are treated as single units by default.
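
One caveat worth sketching: a "character" here means a Unicode code point, and the Thai vowel marks in this string are separate code points that combine with the preceding consonant when displayed. The hypothetical helper `text_sizes` below compares three ways of measuring the same string. It assumes Ruby 2.5 or later, where `String#each_grapheme_cluster` is available.

```ruby
# Hypothetical helper: reports the size of str in bytes, in code points,
# and in grapheme clusters (user-perceived characters).
def text_sizes(str)
  {
    bytes: str.bytesize,
    codepoints: str.length,
    graphemes: str.each_grapheme_cluster.count
  }
end

p text_sizes("สวัสดี")
```

For the greeting used in this example, this should report 18 bytes, 6 code points, and 4 grapheme clusters, since each of the two vowel marks joins the consonant before it.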