Strings and Runes in C++
#include <iostream>
#include <string>
#include <codecvt>
#include <locale>
void examineChar(char32_t c) {
    if (c == U't') {
        std::cout << "found tee" << std::endl;
    } else if (c == U'ส') {
        std::cout << "found so sua" << std::endl;
    }
}
int main() {
    // s is a string assigned a literal value
    // representing the word "hello" in the Thai language.
    // The literal is stored as UTF-8 provided the source file is saved
    // as UTF-8 and the compiler's execution character set is UTF-8
    // (the default for GCC and Clang).
    const std::string s = "สวัสดี";
    // std::string stores a sequence of char, so length()
    // returns the number of raw bytes, not the number of characters.
    std::cout << "Len: " << s.length() << std::endl;
    // Iterating over a string produces the raw byte value at
    // each index. This loop prints the hex values of all
    // the bytes that constitute the code points in s.
    for (unsigned char c : s) {
        std::cout << std::hex << static_cast<int>(c) << " ";
    }
    // std::hex is sticky, so switch back to decimal for later output.
    std::cout << std::dec << std::endl;
    // To count how many code points are in a string, we can use
    // a UTF-8 to UTF-32 converter. The run time of this operation
    // is linear in the size of the string, because each UTF-8
    // sequence has to be decoded in turn.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    std::u32string utf32 = converter.from_bytes(s);
    std::cout << "Character count: " << utf32.length() << std::endl;
    // We can iterate over the UTF-32 string to get each code point
    // along with its byte offset in the original UTF-8 string.
    size_t bytePos = 0;
    for (char32_t c : utf32) {
        std::cout << "U+" << std::hex << static_cast<int>(c)
                  << " starts at " << std::dec << bytePos << std::endl;
        bytePos += converter.to_bytes(c).length();
        examineChar(c);
    }
    return 0;
}
This C++ program demonstrates working with UTF-8 encoded strings and Unicode code points. Here’s a breakdown of what it does:
We include the headers needed for input/output, strings, and Unicode conversion.
The examineChar function takes a Unicode code point (char32_t) and compares it with character literals.
In the main function, we define a UTF-8 encoded string s containing Thai characters.
We print the length of the string, which is the number of bytes in its UTF-8 representation (18 here), not the number of characters.
We iterate over the string to print the hexadecimal value of each byte.
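As a small variation on the byte loop, the sketch below collects the hex dump into a string instead of writing to std::cout, so the sticky std::hex flag never leaks into later output. The helper name toHexBytes is hypothetical and not part of the program above.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper (not from the original program): returns the bytes
// of s as space-separated, two-digit lowercase hex values.
std::string toHexBytes(const std::string& s) {
    std::ostringstream out;
    // The flags are set on a local stream, so std::cout is left untouched.
    out << std::hex << std::setfill('0');
    bool first = true;
    for (unsigned char b : s) {
        if (!first) {
            out << ' ';
        }
        // std::setw applies only to the next value written.
        out << std::setw(2) << static_cast<int>(b);
        first = false;
    }
    return out.str();
}
```

Calling toHexBytes(s) on the Thai string and printing the result gives the same bytes as the loop in main, with each byte padded to two digits.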
To count the actual number of characters (code points), we convert the UTF-8 string to UTF-32 using std::wstring_convert and std::codecvt_utf8.
We iterate over the UTF-32 string to print each character’s Unicode code point and its starting byte position in the original UTF-8 string.
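If only the count is needed, a full conversion to UTF-32 is not strictly required. The sketch below counts code points by skipping UTF-8 continuation bytes (those of the form 10xxxxxx). It assumes the input is already valid UTF-8 and does no validation; the name countCodePoints is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original program): counts the code
// points in s, assuming s holds valid UTF-8.
std::size_t countCodePoints(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char b : s) {
        // Continuation bytes have the bit pattern 10xxxxxx; every other
        // byte starts a new code point.
        if ((b & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the Thai string in main, this returns 6 for the 18 bytes, matching utf32.length().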
For each character, we call the examineChar function to demonstrate character comparison.
Note that C++ doesn’t have a built-in rune type like Go, so we use char32_t, which can represent any Unicode code point. The std::wstring_convert and std::codecvt_utf8 classes perform the UTF-8 to UTF-32 conversion explicitly, whereas Go decodes UTF-8 implicitly when ranging over a string. Both classes are deprecated since C++17, so newer compilers may warn about them.
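Because the conversion classes are deprecated, one alternative is to decode UTF-8 by hand. The following is a minimal sketch, not a full validator: it yields (byte offset, code point) pairs, much like Go's range over a string, and replaces malformed sequences with U+FFFD one byte at a time. It does not reject overlong encodings or surrogate values. The name decodeUtf8 is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not from the original program): decodes s into
// (byte offset, code point) pairs. Simplified: malformed input yields
// U+FFFD and advances by one byte; overlong forms are not rejected.
std::vector<std::pair<std::size_t, char32_t>> decodeUtf8(const std::string& s) {
    std::vector<std::pair<std::size_t, char32_t>> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t width = 0;  // 0 marks an invalid lead byte
        char32_t cp = 0;
        if (lead < 0x80) {
            width = 1; cp = lead;            // 0xxxxxxx
        } else if ((lead & 0xE0) == 0xC0) {
            width = 2; cp = lead & 0x1F;     // 110xxxxx
        } else if ((lead & 0xF0) == 0xE0) {
            width = 3; cp = lead & 0x0F;     // 1110xxxx
        } else if ((lead & 0xF8) == 0xF0) {
            width = 4; cp = lead & 0x07;     // 11110xxx
        }

        // The sequence must fit in the string and every trailing byte
        // must be a continuation byte (10xxxxxx).
        bool ok = width != 0 && i + width <= s.size();
        for (std::size_t k = 1; ok && k < width; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) {
                ok = false;
            } else {
                cp = (cp << 6) | (b & 0x3F);
            }
        }
        if (!ok) {
            cp = 0xFFFD;  // replacement character
            width = 1;
        }
        out.emplace_back(i, cp);
        i += width;
    }
    return out;
}
```

Looping over decodeUtf8(s) and passing each pair's second member to examineChar would reproduce the last loop in main, with the first member taking the place of bytePos.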
To compile and run this program, you would typically use:
$ g++ -std=c++11 unicode_example.cpp -o unicode_example
$ ./unicode_example
This example demonstrates how to work with Unicode strings in C++, including iterating over characters, counting characters vs bytes, and examining individual Unicode code points.