Strings and Runes in C++

#include <iostream>
#include <string>
#include <codecvt>
#include <locale>

void examineChar(char32_t c) {
    if (c == U't') {
        std::cout << "found tee" << std::endl;
    } else if (c == U'ส') {
        std::cout << "found so sua" << std::endl;
    }
}

int main() {
    // s is a string assigned a literal value
    // representing the word "hello" in the Thai language.
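    // Its bytes are UTF-8 as long as the source file is saved as UTF-8
    // and the compiler's execution character set is UTF-8
    // (the default for GCC and Clang).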
    const std::string s = "สวัสดี";

    // std::string stores a sequence of char, so length() returns
    // the number of raw bytes, not the number of characters.
    std::cout << "Len: " << s.length() << std::endl;

    // Iterating over a std::string yields the raw byte at each
    // index. This loop prints the hex value of every byte that
    // makes up the code points in s. Reading each byte as
    // unsigned char keeps values above 0x7F from going negative.
    for (unsigned char c : s) {
        std::cout << std::hex << static_cast<int>(c) << " ";
    }
    // std::hex is sticky, so switch back to decimal for later output.
    std::cout << std::dec << std::endl;

    // To count how many code points are in a string, we can use
    // a UTF-8 to UTF-32 converter. The run-time of this operation
    // grows with the size of the string, because each UTF-8
    // sequence has to be decoded in order. Note that
    // std::wstring_convert and std::codecvt_utf8 are deprecated
    // since C++17, so newer compilers may warn about them.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    std::u32string utf32 = converter.from_bytes(s);
    std::cout << "Character count: " << utf32.length() << std::endl;

    // We can iterate over the UTF-32 string to get each character
    // along with its position in the original UTF-8 string.
    size_t bytePos = 0;
    for (char32_t c : utf32) {
        std::cout << "U+" << std::hex << static_cast<int>(c) << " starts at " << std::dec << bytePos << std::endl;
        bytePos += converter.to_bytes(c).length();
        examineChar(c);
    }

    return 0;
}

This C++ code demonstrates working with UTF-8 encoded strings and Unicode characters. Here’s a breakdown of the code and its functionality:

  1. We include necessary headers for input/output, string manipulation, and Unicode conversions.

  2. The examineChar function demonstrates passing a Unicode character (char32_t) to a function and comparing it with character literals.

  3. In the main function, we define a UTF-8 encoded string s containing Thai characters.

  4. We print the length of the string, which gives the number of bytes in the UTF-8 representation.

  5. We iterate over the string to print the hexadecimal values of each byte.

  6. To count the actual number of characters (code points), we convert the UTF-8 string to UTF-32 using std::wstring_convert and std::codecvt_utf8.

  7. We iterate over the UTF-32 string to print each character’s Unicode code point and its starting byte position in the original UTF-8 string.

  8. For each character, we call the examineChar function to demonstrate character comparison.
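Step 6 converts the whole string just to obtain a count. If only the count is needed, one alternative is to skip UTF-8 continuation bytes, which always have the bit pattern 10xxxxxx. The sketch below assumes the input is well-formed UTF-8; countCodePoints is a hypothetical helper name, not part of the program above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the example program): counts code points
// by counting every byte that is NOT a continuation byte (10xxxxxx).
// Assumes s holds well-formed UTF-8; no validation is performed.
std::size_t countCodePoints(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or ASCII byte: starts a new code point
        }
    }
    return count;
}
```

Calling countCodePoints(s) on the Thai string from the example should agree with utf32.length(), while s.length() still reports the byte count.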

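Note that C++ doesn’t have a built-in rune type like Go, so we use char32_t, which can represent any Unicode code point. The std::wstring_convert and std::codecvt_utf8 classes perform the UTF-8 to UTF-32 conversion, playing the role of the decoding Go does when a range loop walks over a string. Both classes are deprecated since C++17, so compilers may emit warnings when building against newer standards.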
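Since std::wstring_convert and std::codecvt_utf8 are deprecated since C++17, it can be useful to see what the conversion does by hand. The following is a minimal sketch, not a production decoder: decodeUtf8 is a hypothetical helper that returns each code point with its starting byte offset. It assumes mostly well-formed input, replaces malformed or truncated sequences with U+FFFD, and does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: decodes UTF-8 into (byte offset, code point) pairs.
// Malformed or truncated sequences produce U+FFFD and advance one byte.
// Overlong forms and surrogate code points are NOT rejected in this sketch.
std::vector<std::pair<std::size_t, char32_t>> decodeUtf8(const std::string& s) {
    std::vector<std::pair<std::size_t, char32_t>> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        char32_t cp = 0xFFFD;  // default for stray or invalid lead bytes

        // The lead byte determines the sequence length and the first payload bits.
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }

        // Each continuation byte contributes six more payload bits.
        bool ok = (i + len <= s.size());
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char cont = static_cast<unsigned char>(s[i + k]);
            if ((cont & 0xC0) != 0x80) {
                ok = false;
            } else {
                cp = (cp << 6) | (cont & 0x3F);
            }
        }
        if (!ok) { cp = 0xFFFD; len = 1; }

        out.emplace_back(i, cp);
        i += len;
    }
    return out;
}
```

Looping over the result of decodeUtf8(s) gives the same information as the UTF-32 loop in main, with the byte offset recorded directly instead of being rebuilt by re-encoding each character.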

To compile and run this program, assuming the source is saved as UTF-8 in unicode_example.cpp, you would typically use:

$ g++ -std=c++11 unicode_example.cpp -o unicode_example
$ ./unicode_example

This example demonstrates how to work with Unicode strings in C++, including iterating over characters, counting characters vs bytes, and examining individual Unicode code points.