Strings and Characters in Java

A Java string is an immutable sequence of UTF-16 code units. The language and the standard library represent strings as objects of the String class. An individual code unit is represented by the char data type, which is a 16-bit value. A Unicode code point occupies either one char or, for characters outside the Basic Multilingual Plane, a pair of chars.
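
As a small illustration of the two properties just described, the sketch below checks that a String operation returns a new object instead of modifying the original, and that a char is 16 bits wide. The class name CharBasics and the helper hexCodeUnit are hypothetical names chosen for this example only.

```java
public class CharBasics {
    // Hypothetical helper: formats the UTF-16 code unit at the given
    // index as a four-digit hex string.
    static String hexCodeUnit(String text, int index) {
        return String.format("%04x", (int) text.charAt(index));
    }

    public static void main(String[] args) {
        String original = "hello";

        // toUpperCase returns a new String; the original is unchanged.
        String upper = original.toUpperCase(java.util.Locale.ROOT);
        System.out.println(original); // hello
        System.out.println(upper);    // HELLO

        // Character.SIZE is the number of bits used to represent a char.
        System.out.println(Character.SIZE); // 16

        // 'h' is U+0068, which fits in a single code unit.
        System.out.println(hexCodeUnit(original, 0)); // 0068
    }
}
```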

public class StringsAndChars {
    public static void main(String[] args) {
        // s is a String assigned a literal value
        // representing the word "hello" in the Thai language.
        // At run time, Java exposes string contents as UTF-16 code units.
        final String s = "สวัสดี";

        // length() returns the number of char values (UTF-16 code
        // units), which is not always the number of code points.
        System.out.println("Len: " + s.length());

        // Indexing into a string produces the char values at
        // each index. This loop generates the hex values of all
        // the chars that constitute the code points in s.
        for (int i = 0; i < s.length(); i++) {
            System.out.printf("%04x ", (int) s.charAt(i));
        }
        System.out.println();

        // To count how many Unicode code points are in a string, we can use
        // the codePointCount method of the String class.
        System.out.println("Code point count: " + s.codePointCount(0, s.length()));

        // We can use the String.codePoints() method to get an IntStream
        // of code points in the string. The stream does not report
        // positions, so we track the char offset ourselves.
        final int[] offset = {0};
        s.codePoints().forEach(codePoint -> {
            System.out.printf("%#X starts at %d%n", codePoint, offset[0]);
            offset[0] += Character.charCount(codePoint);
        });

        // We can achieve the same iteration by using the
        // String.codePointAt and String.offsetByCodePoints methods explicitly.
        System.out.println("\nUsing codePointAt and offsetByCodePoints");
        for (int i = 0; i < s.length();) {
            int codePoint = s.codePointAt(i);
            System.out.printf("%#X starts at %d\n", codePoint, i);
            i = s.offsetByCodePoints(i, 1);
            examineCodePoint(codePoint);
        }
    }

    private static void examineCodePoint(int codePoint) {
        // We can compare a code point value to a char literal directly.
        if (codePoint == 't') {
            System.out.println("found tee");
        } else if (codePoint == 'ส') {
            System.out.println("found so sua");
        }
    }
}

When you run this program, you’ll get output similar to this:

Len: 6
0e2a 0e27 0e31 0e2a 0e14 0e35 
Code point count: 6
0XE2A starts at 0
0XE27 starts at 1
0XE31 starts at 2
0XE2A starts at 3
0XE14 starts at 4
0XE35 starts at 5

Using codePointAt and offsetByCodePoints
0XE2A starts at 0
found so sua
0XE27 starts at 1
0XE31 starts at 2
0XE2A starts at 3
found so sua
0XE14 starts at 4
0XE35 starts at 5

This example demonstrates how Java handles Unicode strings and characters. It distinguishes the length of a string, which counts UTF-16 code units, from the number of Unicode code points in the string. For this Thai text both values are 6, because every character lies in the Basic Multilingual Plane and fits in a single char. The example also shows how to iterate over the code points in a string and how to examine individual code points.

Note that Java's String API is based on UTF-16, which means that Unicode characters outside the Basic Multilingual Plane are represented by surrogate pairs and take up two char positions in the string. The codePointCount method and the codePoints() stream take this into account, so they count and iterate over whole code points rather than individual char values.
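
To make the surrogate-pair case concrete, here is a minimal sketch using a string that contains U+1F600, a code point outside the Basic Multilingual Plane. The class name SurrogatePairDemo and its two helper methods are hypothetical and exist only for this illustration. The character is written with \u escapes so the example does not depend on the source file encoding.

```java
public class SurrogatePairDemo {
    // Hypothetical helper: number of UTF-16 code units (char values).
    static int codeUnits(String text) {
        return text.length();
    }

    // Hypothetical helper: number of Unicode code points.
    static int codePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        // "a" followed by U+1F600, written as its surrogate pair.
        String s = "a\uD83D\uDE00";

        System.out.println("Code units: " + codeUnits(s));   // Code units: 3
        System.out.println("Code points: " + codePoints(s)); // Code points: 2

        // The two chars after 'a' are the high and low surrogates.
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(2)));  // true

        // codePointAt combines the pair into a single code point.
        System.out.printf("%#X%n", s.codePointAt(1)); // 0X1F600
    }
}
```

Here length() reports 3 while codePointCount reports 2, which is the difference the Thai example cannot show because all of its characters fit in one char each.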