Strings and Runes in Java

Our Java program will demonstrate the concept of strings and characters. Here’s the full source code:

import java.nio.charset.StandardCharsets;

public class StringsAndChars {
    public static void main(String[] args) {
        // s is a String assigned a literal value
        // representing the word "hello" in the Thai language.
        // Java string literals are UTF-16 encoded text.
        final String s = "สวัสดี";

        // length() returns the number of UTF-16 code units in the string,
        // which is not always the number of characters.
        System.out.println("Length: " + s.length());

        // This loop prints the hex values of all the bytes
        // in the UTF-8 encoding of s.
        System.out.print("Bytes: ");
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) {
            System.out.printf("%x ", b);
        }
        System.out.println();

        // To count code points rather than UTF-16 code units, we can use
        // codePointCount(). Every Thai character here lies in the Basic
        // Multilingual Plane and takes a single code unit, so the result
        // matches length(); the two differ only when surrogate pairs occur.
        System.out.println("Character count: " + s.codePointCount(0, s.length()));

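        // We can iterate over each code point in the string. indexOf() would
        // report only the first occurrence of a repeated character, so we
        // track the UTF-16 index of each code point ourselves.
        System.out.println("\nUsing codePoints():");
        int[] index = {0};
        s.codePoints().forEach(cp -> {
            System.out.printf("%#X starts at %d%n", cp, index[0]);
            examineChar(cp);
            index[0] += Character.charCount(cp);
        });
    }

    // Takes an int code point rather than a char, so characters outside
    // the BMP are not truncated by a cast.
    private static void examineChar(int c) {
        // We can compare a code point to a char literal directly.
        if (c == 't') {
            System.out.println("found tee");
        } else if (c == 'ส') {
            System.out.println("found so sua");
        }
    }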
}

To run the program, compile and execute it using javac and java:

$ javac StringsAndChars.java
$ java StringsAndChars
Length: 6
Bytes: e0 b8 aa e0 b8 a7 e0 b8 b1 e0 b8 aa e0 b8 94 e0 b8 b5 
Character count: 6

Using codePoints():
0XE2A starts at 0
found so sua
0XE27 starts at 1
0XE31 starts at 2
0XE2A starts at 3
found so sua
0XE14 starts at 4
0XE35 starts at 5

This example demonstrates several important concepts:

  1. In Java, strings are sequences of UTF-16 code units. Some characters (like those in the Thai example) are represented by a single code unit, while others may require two code units (surrogate pairs).

  2. The length() method returns the number of UTF-16 code units, not necessarily the number of visible characters.

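  3. To count Unicode code points rather than code units, we use codePointCount(). A code point is still not always one visible character, since combining marks such as the Thai vowel signs are separate code points.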

  4. We can iterate over the code points in a string using the codePoints() method, which returns an IntStream of Unicode code points.

  5. Java chars are 16-bit Unicode code units. For characters outside the Basic Multilingual Plane (BMP), we need to use surrogate pairs.

  6. The getBytes() method with UTF-8 encoding is used to get the raw bytes of the string, similar to how Go treats strings as byte slices.
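
The Thai string never needs a surrogate pair, so points 1 to 5 are easier to see with a character outside the BMP. The following is a minimal sketch, not part of the original program: the class name SurrogatePairDemo and the helper sample() are illustrative, and U+1F600 was chosen only because it lies outside the BMP.

```java
final class SurrogatePairDemo {
    // Hypothetical helper: builds "a" followed by U+1F600, a code point
    // outside the BMP that Java stores as two UTF-16 code units.
    static String sample() {
        return "a" + new String(Character.toChars(0x1F600));
    }

    public static void main(String[] args) {
        String s = sample();

        // length() counts UTF-16 code units: 1 for 'a' plus 2 for U+1F600.
        System.out.println("Code units: " + s.length());

        // codePointCount() counts code points: 'a' and U+1F600.
        System.out.println("Code points: " + s.codePointCount(0, s.length()));

        // The two chars after 'a' form a high/low surrogate pair.
        System.out.println("High surrogate at 1: " + Character.isHighSurrogate(s.charAt(1)));
        System.out.println("Low surrogate at 2: " + Character.isLowSurrogate(s.charAt(2)));

        // codePointAt() reassembles the pair into one code point.
        System.out.printf("Code point at 1: %#X%n", s.codePointAt(1));
    }
}
```

Here length() reports 3 while codePointCount() reports 2, which is the gap between code units and code points that the list describes.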

This Java code provides similar functionality to the Go example, demonstrating how to work with strings and characters in a Unicode-aware manner.
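
To make the getBytes() point more concrete, the sketch below decodes the UTF-8 bytes back into a String and reports how many bytes individual code points occupy. It is an illustration under stated assumptions rather than part of the original program: the class name Utf8RoundTrip and the helper utf8Length() are hypothetical names introduced here.

```java
import java.nio.charset.StandardCharsets;

final class Utf8RoundTrip {
    // Hypothetical helper: the number of bytes one code point occupies
    // when encoded as UTF-8.
    static int utf8Length(int codePoint) {
        String single = new String(Character.toChars(codePoint));
        return single.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "สวัสดี";

        // Encode to UTF-8: each of the six Thai code points takes three bytes.
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8 bytes: " + bytes.length);

        // Decode with the same charset to recover the original string.
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println("Round trip equal: " + s.equals(decoded));

        // UTF-8 width depends on the code point's value.
        System.out.println("Bytes for 'a': " + utf8Length('a'));
        System.out.println("Bytes for U+0E2A: " + utf8Length(0x0E2A));
        System.out.println("Bytes for U+1F600: " + utf8Length(0x1F600));
    }
}
```

Passing the charset explicitly on both the encoding and decoding side avoids depending on the platform default encoding.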