import std.stdio;
import std.utf;

void main()
{
    // `s` is a `string` assigned a literal value
    // representing the word "hello" in the Thai
    // language. D string literals are UTF-8
    // encoded text.
    const string s = "สวัสดี";

    // A `string` is an array of immutable `char` values, and each
    // `char` is one UTF-8 code unit (one byte), so `length`
    // reports the number of raw bytes, not the number of characters.
    writeln("Len: ", s.length);

    // Indexing into a string produces the raw byte value at
    // each index. This loop prints the hex values of all
    // the bytes that constitute the code points in `s`.
    foreach (i; 0 .. s.length)
    {
        writef("%x ", s[i]);
    }
    writeln();

    // To count how many *code points* are in a string, we can use
    // the `std.utf.count` function. Note that the run-time of
    // `count` depends on the size of the string,
    // because it has to decode each UTF-8 sequence in turn.
    // Thai characters are encoded in UTF-8 as multi-byte
    // sequences, so the result of this count may be surprising.
    writeln("Code point count: ", s.count);

    // A `foreach` loop with a `dchar` loop variable decodes
    // each code point and also yields its byte offset in the string.
    foreach (i, dchar c; s)
    {
        // `%04X` prints the code point value as at least four
        // upper-case hex digits.
        writefln("U+%04X starts at %d", c, i);
    }

    // We can achieve the same iteration by using the
    // `std.utf.decode` and `std.utf.stride` functions explicitly.
    writeln("\nUsing decode");
    for (size_t i = 0; i < s.length; i += stride(s, i))
    {
        // `decode` takes its index by reference and advances it past
        // the decoded code point, so we hand it a copy and keep `i`
        // pointing at the start of the current code point.
        size_t next = i;
        dchar c = decode(s, next);
        writefln("U+%04X starts at %d", c, i);
        examineCodePoint(c);
    }
}

void examineCodePoint(dchar c)
{
    // Values enclosed in single quotes are character literals. We
    // can compare a `dchar` value to a character literal directly.
    if (c == 't')
    {
        writeln("found tee");
    }
    else if (c == 'ส')
    {
        writeln("found so sua");
    }
}
This D program demonstrates working with strings and Unicode code points, which are similar to runes in other languages. Here’s a breakdown of the concepts:
- The `length` property of a string gives the number of bytes, not the number of code points.
- Use `std.utf.count` to count the number of Unicode code points in a string.
- A `foreach` loop can iterate over a string’s code points directly.
- `std.utf.decode` can be used to manually decode UTF-8 encoded strings.
- `std.utf.stride` is used to determine the number of bytes in the current code point.
- D uses `dchar` to represent a full Unicode code point.
When you run this program, you’ll see output similar to:
Len: 18
e0 b8 aa e0 b8 a7 e0 b8 b1 e0 b8 aa e0 b8 94 e0 b8 b5
Code point count: 6
U+0E2A starts at 0
U+0E27 starts at 3
U+0E31 starts at 6
U+0E2A starts at 9
U+0E14 starts at 12
U+0E35 starts at 15
Using decode
U+0E2A starts at 0
found so sua
U+0E27 starts at 3
U+0E31 starts at 6
U+0E2A starts at 9
found so sua
U+0E14 starts at 12
U+0E35 starts at 15
This example illustrates how D stores strings as UTF-8 and which tools it provides for working with individual code points, which matters for any program that processes non-ASCII text.
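As a further illustration of how `std.utf.decode` and `std.utf.stride` relate, here is a minimal sketch that collects each code point together with its byte offset. The `CodePointAt` struct and the `codePoints` helper are hypothetical names introduced only for this example; they are not part of Phobos. The sketch relies on `decode` advancing the index it receives by reference, and it checks that the advance equals the width reported by `stride`.

```d
import std.stdio;
import std.utf : count, decode, stride;

// Hypothetical record type for this sketch: a code point and the
// byte offset at which its UTF-8 sequence starts.
struct CodePointAt
{
    size_t offset;
    dchar value;
}

// Hypothetical helper: walks `s` one code point at a time.
CodePointAt[] codePoints(string s)
{
    CodePointAt[] result;
    size_t i = 0;
    while (i < s.length)
    {
        // Number of bytes in the UTF-8 sequence starting at `i`.
        immutable width = stride(s, i);

        // `decode` moves `next` past the sequence it decodes.
        size_t next = i;
        immutable dchar c = decode(s, next);
        assert(next == i + width);

        result ~= CodePointAt(i, c);
        i = next;
    }
    return result;
}

void main()
{
    string s = "สวัสดี";
    writeln(s.length); // number of UTF-8 bytes: 18
    writeln(s.count);  // number of code points: 6
    foreach (cp; codePoints(s))
    {
        writefln("U+%04X starts at %d", cp.value, cp.offset);
    }
}
```

Keeping the start offset in a separate variable avoids a common mistake: printing or reusing the index after `decode` has already moved it to the next code point.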