Strings and Runes in the D Programming Language

import std.stdio;
import std.utf;

void main()
{
    // `s` is a `string` assigned a literal value
    // representing the word "hello" in the Thai
    // language. D string literals are UTF-8
    // encoded text.
    const string s = "สวัสดี";

    // A `string` in D is an array of immutable `char` values
    // (UTF-8 code units), so `length` is the number of raw bytes
    // stored in it, not the number of characters.
    writeln("Len: ", s.length);

    // Indexing into a string produces the raw byte values at
    // each index. This loop generates the hex values of all
    // the bytes that constitute the code points in `s`.
    foreach (i; 0 .. s.length)
    {
        writef("%x ", s[i]);
    }
    writeln();

    // To count how many *code points* are in a string, we can use
    // the `std.utf.count` function. Note that the run-time of
    // `count` depends on the size of the string, because it has
    // to walk the UTF-8 sequences one after another.
    // Each Thai code point in `s` is encoded as three bytes, so
    // the count is much smaller than `s.length`.
    writeln("Code point count: ", s.count);

    // When the loop variable is declared as `dchar`, a `foreach`
    // loop decodes the string and yields each code point along
    // with the byte offset at which it starts.
    foreach (i, dchar c; s)
    {
        writefln("U+%04X starts at %d", cast(uint) c, i);
    }

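    // We can achieve the same iteration by using the
    // `std.utf.decode` function explicitly. `decode` advances the
    // index it is given, so we pass it a copy and let `stride`
    // move `i` to the start of the next code point.
    writeln("\nUsing decode");
    for (size_t i = 0; i < s.length; i += stride(s, i))
    {
        size_t next = i;
        dchar c = decode(s, next);
        writefln("U+%04X starts at %d", cast(uint) c, i);
        examineCodePoint(c);
    }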
}

void examineCodePoint(dchar c)
{
    // Values enclosed in single quotes are character literals. We
    // can compare a `dchar` value to a character literal directly.
    if (c == 't')
    {
        writeln("found tee");
    }
    else if (c == 'ส')
    {
        writeln("found so sua");
    }
}

This D program demonstrates working with strings and Unicode code points, which Go and some other languages call runes. Here’s a breakdown of the concepts:

  1. D's string type holds UTF-8 encoded text, and string literals are UTF-8 by default.
  2. The length property of a string gives the number of UTF-8 code units (bytes), not the number of code points.
  3. We use std.utf.count to count the number of Unicode code points in a string.
  4. A foreach loop whose loop variable is a dchar iterates over a string’s code points directly.
  5. std.utf.decode can be used to manually decode a code point at a given index; it advances that index past the decoded sequence.
  6. std.utf.stride returns the number of bytes in the code point that starts at a given index.
  7. D uses dchar, a 32-bit character type, to represent a full Unicode code point.
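
The difference between code units and code points in the list above can be checked with a small sketch. It is only an illustration under stated assumptions: the helper name codePointOffsets is hypothetical, and the wstring and dstring variables are added here for comparison, they are not part of the program above.

```d
import std.stdio : writefln;
import std.utf : count, stride;

// Hypothetical helper (not part of the original program): returns the
// byte offset at which each code point of `s` begins, using only `stride`.
size_t[] codePointOffsets(string s)
{
    size_t[] offsets;
    for (size_t i = 0; i < s.length; i += stride(s, i))
    {
        offsets ~= i;
    }
    return offsets;
}

void main()
{
    string  s8  = "สวัสดี";   // UTF-8 code units (char)
    wstring s16 = "สวัสดี"w;  // UTF-16 code units (wchar)
    dstring s32 = "สวัสดี"d;  // UTF-32 code units (dchar)

    // `length` counts code units, so it depends on the encoding;
    // `count` reports code points and is the same for all three.
    writefln("string:  length %d, code points %d", s8.length, s8.count);   // 18 and 6
    writefln("wstring: length %d, code points %d", s16.length, s16.count); // 6 and 6
    writefln("dstring: length %d, code points %d", s32.length, s32.count); // 6 and 6

    writefln("offsets: %s", codePointOffsets(s8)); // [0, 3, 6, 9, 12, 15]
}
```

Only the UTF-8 string reports a length different from its code point count here, because every code point in this word needs three bytes in UTF-8 but a single code unit in UTF-16 and UTF-32.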

When you run this program, you’ll see output similar to:

Len: 18
e0 b8 aa e0 b8 a7 e0 b8 b1 e0 b8 aa e0 b8 94 e0 b8 b5 
Code point count: 6
U+0E2A starts at 0
U+0E27 starts at 3
U+0E31 starts at 6
U+0E2A starts at 9
U+0E14 starts at 12
U+0E35 starts at 15

Using decode
U+0E2A starts at 0
found so sua
U+0E27 starts at 3
U+0E31 starts at 6
U+0E2A starts at 9
found so sua
U+0E14 starts at 12
U+0E35 starts at 15
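
Whether foreach decodes a string depends on the declared type of the loop variable. The following sketch, with hypothetical helper names codeUnitIterations and codePointIterations, counts how many iterations each form performs over the same string:

```d
import std.stdio : writeln;

// Hypothetical helper: iterates with a `char` loop variable, so the
// loop visits every UTF-8 code unit (byte) without decoding.
size_t codeUnitIterations(string s)
{
    size_t n;
    foreach (char c; s)
    {
        n++;
    }
    return n;
}

// Hypothetical helper: iterates with a `dchar` loop variable, so the
// loop decodes the string and visits one code point per iteration.
size_t codePointIterations(string s)
{
    size_t n;
    foreach (dchar c; s)
    {
        n++;
    }
    return n;
}

void main()
{
    string s = "สวัสดี";
    writeln(codeUnitIterations(s));  // 18
    writeln(codePointIterations(s)); // 6
}
```

If the loop variable's type is left out, as in foreach (c; s), it is inferred from the element type of the string, so the loop visits bytes rather than code points.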

This example illustrates how D exposes both the raw UTF-8 bytes of a string and its individual code points, a distinction that matters whenever text is not limited to ASCII.