Strings and Runes in Java

A Java String is an immutable sequence of char values that represents text. The char type is a 16-bit UTF-16 code unit, not a full Unicode code point. Java has no rune type; the closest equivalent is a Unicode code point, which is held in an int.

public class StringsAndChars {
    public static void main(String[] args) {
        // s is a String assigned a literal value
        // representing the word "hello" in the Thai language.
        // At run time a String is a sequence of UTF-16 code units (char values).
        final String s = "สวัสดี";

        // length() returns the number of char values (UTF-16 code units).
        System.out.println("Len: " + s.length());

        // This loop prints the hex value of each
        // char (UTF-16 code unit) in s.
        for (int i = 0; i < s.length(); i++) {
            System.out.printf("%04x ", (int) s.charAt(i));
        }
        System.out.println();

        // length() counts char values, not characters. To count Unicode
        // code points, use codePointCount(). Every Thai character in s is
        // in the Basic Multilingual Plane, so both counts are 6 here; a
        // character outside it is stored as a surrogate pair (two chars).
        System.out.println("Character count: " + s.codePointCount(0, s.length()));

        // A for-each loop over toCharArray() visits each char value
        // (UTF-16 code unit), so a surrogate pair would show up as two
        // separate iterations. Each char occupies exactly one index.
        int i = 0;
        for (char c : s.toCharArray()) {
            System.out.printf("%#06x starts at %d%n", (int) c, i);
            i++;
        }

        // String.codePoints() iterates over Unicode code points instead,
        // so a surrogate pair is delivered as a single int value.
        System.out.println("\nUsing codePoints()");
        s.codePoints().forEach(codePoint -> {
            System.out.printf("%#06x%n", codePoint);
            examineCodePoint(codePoint);
        });
    }

    static void examineCodePoint(int codePoint) {
        // We can compare a codePoint value to a character literal directly.
        if (codePoint == 't') {
            System.out.println("found tee");
        } else if (codePoint == 'ส') {
            System.out.println("found so sua");
        }
    }
}

When you run this program, you’ll see output similar to this:

Len: 6
0e2a 0e27 0e31 0e2a 0e14 0e35 
Character count: 6
0x0e2a starts at 0
0x0e27 starts at 1
0x0e31 starts at 2
0x0e2a starts at 3
0x0e14 starts at 4
0x0e35 starts at 5

Using codePoints()
0x0e2a
found so sua
0x0e27
0x0e31
0x0e2a
found so sua
0x0e14
0x0e35

This example demonstrates how Java handles strings and Unicode characters. The String API exposes text as a sequence of UTF-16 code units, so characters outside the Basic Multilingual Plane (BMP) are represented by surrogate pairs and occupy two char positions in the string. Every Thai character in this example lies within the BMP, which is why length() reports 6 and codePoints() yields six values.
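
To make the surrogate pair case concrete, here is a minimal sketch that compares the two counts on a string containing one character outside the BMP. The class name SurrogatePairDemo and its helper methods are hypothetical names chosen for illustration; only length(), codePointCount(), and the Character surrogate checks come from the standard library.

```java
class SurrogatePairDemo {
    // Number of char values (UTF-16 code units) in s.
    static int charLength(String s) {
        return s.length();
    }

    // Number of Unicode code points in s.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 (an emoji) written as its surrogate pair,
        // followed by the Thai letter U+0E2A.
        String s = "\uD83D\uDE00\u0E2A";

        System.out.println("chars: " + charLength(s));             // chars: 3
        System.out.println("code points: " + codePointLength(s));  // code points: 2

        // The first two char values form the surrogate pair.
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
    }
}
```

The emoji is written with \u escapes here only so that the example does not depend on the source file encoding.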

The codePoints() method iterates over the Unicode code points in the string, combining each surrogate pair into a single int value. This matters for text that contains characters outside the BMP, such as some rare Chinese characters or emoji.
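
As a rough illustration of codePoints() on mixed input, the sketch below labels each code point in U+XXXX notation. CodePointLister and describe are hypothetical names, not part of the original example.

```java
import java.util.List;
import java.util.stream.Collectors;

class CodePointLister {
    // Returns one "U+XXXX" label per Unicode code point in s.
    static List<String> describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // 'a', then U+1F600 as a surrogate pair, then the Thai letter U+0E2A.
        String s = "a\uD83D\uDE00\u0E2A";

        // Four char values, but only three code points are reported.
        System.out.println(describe(s)); // [U+0061, U+1F600, U+0E2A]
    }
}
```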

Remember that in Java, char is a 16-bit type, while Unicode code points can require up to 21 bits. The int type is used to represent full Unicode code points when necessary.
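
Going the other way, from an int code point back to char values, might look like the following sketch. CodePointToChars and fromCodePoint are hypothetical names; Character.toChars, Character.charCount, Character.isBmpCodePoint, and Character.isSupplementaryCodePoint are standard library methods.

```java
class CodePointToChars {
    // Builds a String holding the single code point given.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int thai = 0x0E2A;   // inside the BMP: fits in one char
        int emoji = 0x1F600; // outside the BMP: needs a surrogate pair

        System.out.println(Character.charCount(thai));                 // 1
        System.out.println(Character.charCount(emoji));                // 2
        System.out.println(Character.isBmpCodePoint(thai));            // true
        System.out.println(Character.isSupplementaryCodePoint(emoji)); // true

        // The resulting String has two char values but one code point.
        String s = fromCodePoint(emoji);
        System.out.println(s.length());                      // 2
        System.out.println(s.codePointCount(0, s.length())); // 1
    }
}
```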