Strings and Runes in Rust

use std::str;

fn main() {
    // `s` is a `&str` assigned a literal value
    // representing the word "hello" in the Thai
    // language. Rust string literals are UTF-8
    // encoded text.
    let s = "สวัสดี";

    // Since strings are stored as UTF-8 bytes, `len()` returns
    // the number of bytes, not the number of characters.
    println!("Len: {}", s.len());

    // A `&str` cannot be indexed by integer position (`s[0]` does
    // not compile). The `bytes()` iterator yields the raw bytes
    // instead. This loop prints the hex values of all the bytes
    // that encode the code points in `s`.
    for b in s.bytes() {
        print!("{:x} ", b);
    }
    println!();

    // To count how many characters are in a string, we can use
    // the `chars()` iterator, which yields one `char` per Unicode
    // scalar value. Each Thai code point here takes three bytes in
    // UTF-8, so the count differs from `len()`.
    println!("Char count: {}", s.chars().count());

    // A `&str` is not directly iterable in a `for` loop. The
    // `char_indices()` iterator decodes each `char` along with the
    // byte offset at which it starts.
    for (idx, c) in s.char_indices() {
        println!("{:?} starts at {}", c, idx);
    }

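    // We can achieve the same iteration by decoding one `char` at
    // a time with `str::from_utf8`. The offset `i` always lands on
    // a char boundary, so the `unwrap` calls cannot fail here.
    // Re-validating the remaining bytes on every step is wasteful;
    // this loop is for illustration only.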
    println!("\nUsing str::from_utf8");
    let mut i = 0;
    while i < s.len() {
        let ch = str::from_utf8(&s.as_bytes()[i..])
            .unwrap()
            .chars()
            .next()
            .unwrap();
        println!("{:?} starts at {}", ch, i);
        i += ch.len_utf8();

        // This demonstrates passing a `char` value to a function.
        examine_char(ch);
    }
}

fn examine_char(c: char) {
    // We can compare a `char` value to a character literal directly.
    if c == 't' {
        println!("found tee");
    } else if c == 'ส' {
        println!("found so sua");
    }
}

When you run this program, you’ll see the output below. The {:?} format prints each char in its debug form, which escapes the combining vowel marks U+0E31 and U+0E35 as \u{e31} and \u{e35}:

Len: 18
e0 b8 aa e0 b8 a7 e0 b8 b1 e0 b8 aa e0 b8 94 e0 b8 b5 
Char count: 6
'ส' starts at 0
'ว' starts at 3
'\u{e31}' starts at 6
'ส' starts at 9
'ด' starts at 12
'\u{e35}' starts at 15

Using str::from_utf8
'ส' starts at 0
found so sua
'ว' starts at 3
'\u{e31}' starts at 6
'ส' starts at 9
found so sua
'ด' starts at 12
'\u{e35}' starts at 15

This Rust program demonstrates the same concepts as the original Go example:

  1. It shows how strings are stored as UTF-8 encoded bytes.
  2. It demonstrates iterating over a string’s bytes and characters.
  3. It shows how to get the byte length and character count of a string.
  4. It demonstrates how to work with individual characters (chars) in a string.
  5. It includes an example of passing a char to a function and comparing it with literals.
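
The points about byte storage can be probed a little further. The following is a minimal sketch, not part of the original example: it uses the standard methods is_char_boundary and get to show which byte offsets are valid slice points, plus take_chars, a hypothetical helper introduced here for illustration, which returns the first n chars of a string as a sub-slice.

```rust
/// Returns the first `n` chars of `s` as a sub-slice, without copying.
/// `take_chars` is a hypothetical helper, not a standard library function.
fn take_chars(s: &str, n: usize) -> &str {
    // Find the byte offset at which the char with index `n` starts.
    // If the string has `n` or fewer chars, return the whole string.
    match s.char_indices().nth(n) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s,
    }
}

fn main() {
    let s = "สวัสดี";

    // Each char in `s` is three bytes long, so offsets 0, 3, 6, ...
    // are char boundaries, while 1 and 2 fall inside a char.
    println!("{}", s.is_char_boundary(3)); // true
    println!("{}", s.is_char_boundary(1)); // false

    // `get` returns `None` for a range that splits a char, whereas
    // `&s[0..1]` would panic at runtime.
    println!("{:?}", s.get(0..3)); // Some("ส")
    println!("{:?}", s.get(0..1)); // None

    // Slice by char count instead of byte count.
    println!("{}", take_chars(s, 2)); // สว
}
```

Counting by char still splits a base consonant from a following vowel mark if the cut lands between them; slicing by user-perceived characters would need grapheme segmentation, which the standard library does not provide.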

The main differences from the Go version are in the syntax and the specific methods used. For example, Rust uses s.chars().count() instead of utf8.RuneCountInString(s), and char_indices() instead of ranging over the string directly, because a &str can be neither iterated nor indexed by integer. Go’s “rune” corresponds to Rust’s char type, which represents a Unicode scalar value: any code point except the surrogates U+D800 to U+DFFF.
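
To make the char type more concrete, here is a small sketch under stated assumptions: describe is a hypothetical helper written for this illustration, while encode_utf8, len_utf8 and char::from_u32 are standard library calls. It prints a char's code point and UTF-8 encoding, and shows that a surrogate code point cannot be turned into a char.

```rust
/// Formats a `char` as its code point, UTF-8 length, and encoded bytes.
/// `describe` is a hypothetical helper for illustration.
fn describe(c: char) -> String {
    // A `char` needs at most four bytes when encoded as UTF-8.
    let mut buf = [0u8; 4];
    let encoded = c.encode_utf8(&mut buf);
    let hex: Vec<String> = encoded.bytes().map(|b| format!("{:02x}", b)).collect();
    format!("U+{:04X} len={} bytes=[{}]", c as u32, c.len_utf8(), hex.join(" "))
}

fn main() {
    println!("{}", describe('ส')); // U+0E2A len=3 bytes=[e0 b8 aa]
    println!("{}", describe('t')); // U+0074 len=1 bytes=[74]

    // Surrogates are not Unicode scalar values, so no `char` exists for them.
    println!("{:?}", char::from_u32(0xD800)); // None
    // 0x0E2A is a valid scalar value.
    println!("{:?}", char::from_u32(0x0E2A)); // Some('ส')
}
```

The bytes reported for 'ส' match the first three bytes printed by the bytes() loop in the main program.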