Strings and Runes in Java

Our first program will demonstrate working with strings and characters in Java. Here’s the full source code:

import java.nio.charset.StandardCharsets;

public class StringsAndChars {
    public static void main(String[] args) {
        // s is a String assigned a literal value
        // representing the word "hello" in the Thai language.
        // Java exposes every String as a sequence of UTF-16 code units.
        final String s = "สวัสดี";

        // length() returns the number of UTF-16 code units (char values),
        // which is not always the number of characters.
        System.out.println("Len: " + s.length());

        // Get the raw bytes in UTF-8 encoding
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        System.out.print("UTF-8 bytes: ");
        for (byte b : bytes) {
            System.out.printf("%x ", b);
        }
        System.out.println();

        // Count the number of Unicode code points in the string
        System.out.println("Code point count: " + s.codePointCount(0, s.length()));

        // Iterate over each code point, tracking the UTF-16 offset where it starts
        System.out.println("\nUsing codePoints():");
        int offset = 0;
        for (int codePoint : s.codePoints().toArray()) {
            System.out.printf("U+%04X starts at %d%n", codePoint, offset);
            examineCodePoint(codePoint);
            // Advance by 1 or 2 code units, depending on the code point
            offset += Character.charCount(codePoint);
        }
    }

    private static void examineCodePoint(int codePoint) {
        // We can compare a code point value to a character literal directly.
        if (codePoint == 't') {
            System.out.println("found tee");
        } else if (codePoint == 'ส') {
            System.out.println("found so sua");
        }
    }
}

This program demonstrates several concepts related to strings and characters in Java:

  1. Java strings are sequences of UTF-16 code units.
  2. We can get the raw UTF-8 bytes of a string using getBytes(StandardCharsets.UTF_8).
  3. The length() method returns the number of UTF-16 code units, which may not be the same as the number of Unicode code points for some strings.
  4. We can count and iterate over Unicode code points using codePointCount() and codePoints().
  5. Java uses the int type to represent Unicode code points, similar to Go’s rune type.
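
For strings that contain characters outside the Basic Multilingual Plane, the two counts from point 3 diverge. The following is a minimal sketch, not part of the original program: the class name CodeUnitsVsCodePoints and its two helper methods are hypothetical, chosen for illustration, and the emoji is written as an escaped surrogate pair so the file compiles regardless of source encoding.

```java
public class CodeUnitsVsCodePoints {
    // Number of UTF-16 code units (what String.length() reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "a" followed by U+1F600, which needs a surrogate pair in UTF-16.
        String s = "a\uD83D\uDE00";
        System.out.println("Code units: " + codeUnits(s));   // prints "Code units: 3"
        System.out.println("Code points: " + codePoints(s)); // prints "Code points: 2"
    }
}
```

The Thai string in the main program never shows this difference, because all of its code points fit in a single UTF-16 code unit.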

To run this program:

$ javac StringsAndChars.java
$ java StringsAndChars
Len: 6
UTF-8 bytes: e0 b8 aa e0 b8 a7 e0 b8 b1 e0 b8 aa e0 b8 94 e0 b8 b5 
Code point count: 6

Using codePoints():
U+0E2A starts at 0
found so sua
U+0E27 starts at 1
U+0E31 starts at 2
U+0E2A starts at 3
found so sua
U+0E14 starts at 4
U+0E35 starts at 5

This output shows that the Thai string “สวัสดี” consists of 6 Unicode code points. Each one fits in a single UTF-16 code unit, which is why length() also reports 6, and each takes 3 bytes in UTF-8, for 18 bytes in total.
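
The relationship between code points and UTF-8 bytes can be checked one code point at a time. This is a rough sketch under stated assumptions: the class Utf8Widths and the helper utf8Length are hypothetical names, not part of the program above, and the Thai string is spelled with Unicode escapes only so that the example does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Widths {
    // Returns how many bytes a single code point occupies when encoded as UTF-8.
    // Assumes a valid code point that is not an unpaired surrogate.
    static int utf8Length(int codePoint) {
        String single = new String(Character.toChars(codePoint));
        return single.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Same string as in the main program: "สวัสดี"
        String s = "\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35";
        // Print the UTF-8 width of each code point in turn.
        s.codePoints().forEach(cp ->
            System.out.printf("U+%04X -> %d UTF-8 bytes%n", cp, utf8Length(cp)));
    }
}
```

Encoding each code point separately is wasteful for long strings, but it keeps the sketch short and leaves the encoding rules to the standard library.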