Strings and Runes in Lua

Lua strings are immutable sequences of bytes. Lua is eight-bit clean: a string can hold any 8-bit value, including embedded zeros, and every string operation works on the whole string regardless of its contents. String literals can be delimited by matching single or double quotes. Lua attaches no encoding to a string, so the script below stores its text as UTF-8 and decodes it with the utf8 library, which is available in Lua 5.3 and later.
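As a small sketch of what "eight-bit clean" means in practice, the snippet below builds a string with an embedded zero byte and inspects it. The variable name raw is only illustrative.

```lua
-- A three-byte string: "a", a zero byte, then "b".
local raw = "a\0b"

-- The length operator counts every byte, including the embedded zero.
print(#raw)                 --> 3

-- string.byte reads the bytes back; the zero is an ordinary value.
print(raw:byte(1, -1))      --> 97  0  98

-- Single and double quotes produce the same string value.
print('hello' == "hello")   --> true
```

The zero byte does not end the string, so the length and the byte values cover all three positions.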

-- s is a string assigned a literal value
-- representing the word "hello" in the Thai language.
local s = "สวัสดี"

-- The length operator # returns the number of bytes, not characters.
print("Len:", #s)

-- This loop generates the hex values of all
-- the bytes that constitute the string s.
for i = 1, #s do
    io.write(string.format("%x ", s:byte(i)))
end
print()

-- To count how many characters are in a string, we can use
-- the UTF-8 library. Note that the run-time of
-- utf8.len depends on the size of the string,
-- because it has to decode each UTF-8 character sequentially.
print("Character count:", utf8.len(s))

-- utf8.codes iterates over a UTF-8 string, yielding the byte
-- position and the decoded code point of each character.
for i, c in utf8.codes(s) do
    print(string.format("U+%04X starts at %d", c, i))
end

-- This demonstrates passing a character value to a function.
-- It is defined before the loop below, which calls it.
local function examineCharacter(c)
    if c == string.byte('t') then
        print("found tee")
    elseif c == utf8.codepoint('ส') then
        print("found so sua")
    end
end

print("\nUsing manual iteration")
local i = 1
while i <= #s do
    -- utf8.codepoint returns only the code point, not its size,
    -- so utf8.offset supplies the start of the next character.
    local c = utf8.codepoint(s, i)
    print(string.format("U+%04X starts at %d", c, i))
    examineCharacter(c)
    i = utf8.offset(s, 2, i)
end

To run this Lua script, save it to a file (e.g., strings_and_chars.lua) and execute it using the Lua interpreter:

$ lua strings_and_chars.lua
Len: 18
e0 b8 aa e0 b8 a7 e0 b8 b1 e0 b8 aa e0 b8 94 e0 b8 b5 
Character count: 6
U+0E2A starts at 1
U+0E27 starts at 4
U+0E31 starts at 7
U+0E2A starts at 10
U+0E14 starts at 13
U+0E35 starts at 16

Using manual iteration
U+0E2A starts at 1
found so sua
U+0E27 starts at 4
U+0E31 starts at 7
U+0E2A starts at 10
found so sua
U+0E14 starts at 13
U+0E35 starts at 16

This script uses the utf8 library, available in Lua 5.3 and later, to handle Unicode characters. The utf8.codes function provides an iterator that decodes a UTF-8 encoded string one character at a time. The utf8.codepoint function decodes the character at a given byte position, which makes it possible to walk the string manually.
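When the characters themselves are wanted as substrings rather than as code points, one option is to match against utf8.charpattern. This is a minimal sketch that assumes the input is valid UTF-8 and that the file is saved as UTF-8. The names word and chars are illustrative.

```lua
local word = "สวัสดี"

-- utf8.charpattern matches exactly one UTF-8 byte sequence,
-- so gmatch yields each character as its own substring.
local chars = {}
for ch in word:gmatch(utf8.charpattern) do
    chars[#chars + 1] = ch
end

print(#chars)       --> 6
print(#chars[1])    --> 3
```

Each element of chars is a three-byte string here, because every Thai character in this word takes three bytes in UTF-8.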

Note that Lua doesn’t have a built-in “rune” type. Instead, we work with Unicode code points directly, and a code point is just an integer. The examineCharacter function demonstrates how to compare characters, including non-ASCII ones, by comparing those integers.
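To illustrate that code points are plain integers, the following sketch converts in both directions with utf8.codepoint and utf8.char. It assumes Lua 5.3 or later and a source file saved as UTF-8. The names cp and encoded are illustrative.

```lua
-- Decode the first character of a string into its code point.
local cp = utf8.codepoint("ส")
print(math.type(cp))                  --> integer
print(string.format("U+%04X", cp))    --> U+0E2A

-- Encode code points back into a UTF-8 string.
local encoded = utf8.char(0x0E2A, 0x0E27)
print(encoded == "สว")                --> true
print(#encoded)                       --> 6
```

Because a code point is an integer, it can be compared with ==, stored in tables, or passed to functions such as examineCharacter without any special type.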