Strings and Runes in Java

Our Java program will demonstrate working with strings and characters. Here’s the full source code:

import java.nio.charset.StandardCharsets;

public class StringsAndChars {
    public static void main(String[] args) {
        // s is a String assigned a literal value
        // representing the word "hello" in the Thai language.
        // At runtime, a Java String is a sequence of UTF-16 code units.
        final String s = "สวัสดี";

        // This will produce the length of the string in UTF-16 code units.
        System.out.println("Len: " + s.length());

        // This loop generates the hex values of all
        // the bytes that constitute the UTF-8 representation of s.
        byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8Bytes) {
            System.out.printf("%x ", b);
        }
        System.out.println();

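        // To count how many Unicode characters are in a string, we can use
        // the codePointCount() method. Every Thai character here lies in the
        // Basic Multilingual Plane, so each takes one UTF-16 code unit and
        // the count matches length(). A character outside the BMP occupies a
        // surrogate pair, and the two results would then differ.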
        System.out.println("Character count: " + s.codePointCount(0, s.length()));

        // The codePoints() stream decodes each code point in order. It does
        // not report offsets, so we track the UTF-16 index ourselves.
        int[] offset = {0};
        s.codePoints().forEach(codePoint -> {
            System.out.printf("U+%04X starts at %d%n", codePoint, offset[0]);
            offset[0] += Character.charCount(codePoint);
        });

        // We can achieve the same iteration by using the
        // Character.charCount and String.codePointAt methods explicitly.
        System.out.println("\nUsing codePointAt");
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            System.out.printf("U+%04X starts at %d%n", codePoint, i);
            examineCodePoint(codePoint);
            i += Character.charCount(codePoint);
        }
    }

    private static void examineCodePoint(int codePoint) {
        // Values enclosed in single quotes are character literals.
        // We can compare an int code point value to a character literal directly.
        if (codePoint == 't') {
            System.out.println("found tee");
        } else if (codePoint == 'ส') {
            System.out.println("found so sua");
        }
    }
}

To run the program, compile and execute it using the javac and java commands:

$ javac StringsAndChars.java
$ java StringsAndChars
Len: 6
e0 b8 aa e0 b8 a7 e0 b8 b1 e0 b8 aa e0 b8 94 e0 b8 b5 
Character count: 6
U+0E2A starts at 0
U+0E27 starts at 1
U+0E31 starts at 2
U+0E2A starts at 3
U+0E14 starts at 4
U+0E35 starts at 5

Using codePointAt
U+0E2A starts at 0
found so sua
U+0E27 starts at 1
U+0E31 starts at 2
U+0E2A starts at 3
found so sua
U+0E14 starts at 4
U+0E35 starts at 5

This Java program demonstrates various aspects of working with strings and characters:

  1. We define a string containing Thai characters.
  2. We print the length of the string, which gives the number of UTF-16 code units.
  3. We print the UTF-8 byte representation of the string.
  4. We count the number of Unicode code points in the string.
  5. We iterate over the string’s code points, printing each one along with its starting index.
  6. We demonstrate an alternative method of iterating over code points.
  7. We show how to compare code points with character literals.
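
Steps 5 and 6 can also be packaged as a small reusable helper. The sketch below is illustrative only: the class name CodePointOffsets and the method describe are hypothetical names, not part of the original program. It tracks the offset during iteration, because a string can contain the same code point more than once (ส appears at offsets 0 and 3 here), so looking a character up afterwards with indexOf would report only its first position.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointOffsets {
    // Hypothetical helper: returns one "U+XXXX starts at N" line per code point,
    // where N is the UTF-16 index at which that code point begins.
    static List<String> describe(String s) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            lines.add(String.format("U+%04X starts at %d", codePoint, i));
            // Advance by 1 for BMP characters, 2 for surrogate pairs.
            i += Character.charCount(codePoint);
        }
        return lines;
    }

    public static void main(String[] args) {
        // The Thai greeting from the program above, written with Unicode escapes.
        String s = "\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35";
        describe(s).forEach(System.out::println);
        // indexOf finds only the first occurrence of the repeated character.
        System.out.println(s.indexOf("\u0E2A")); // prints 0, not 3
    }
}
```

Returning the lines instead of printing them directly keeps the helper easy to check in a test; whether that is worth the extra list depends on how the output is used.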

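Note that Java's String API exposes text as a sequence of UTF-16 code units, unlike some other languages that treat strings as UTF-8 bytes. Every character in this example lies in the Basic Multilingual Plane, so length() and codePointCount() both report 6. A character outside the BMP occupies two code units (a surrogate pair), so for such text the two values differ, and index-based methods such as charAt() can return half of a character.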
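
To make the point about characters outside the Basic Multilingual Plane concrete, here is a minimal sketch. The class name SurrogatePairDemo and the helper sizes are hypothetical, chosen for illustration. It uses U+1F600, a supplementary code point, written as the surrogate pair \uD83D\uDE00 so that the source file stays ASCII-only.

```java
import java.nio.charset.StandardCharsets;

public class SurrogatePairDemo {
    // Hypothetical helper: returns { UTF-16 code units, code points, UTF-8 bytes }.
    static int[] sizes(String s) {
        return new int[] {
            s.length(),
            s.codePointCount(0, s.length()),
            s.getBytes(StandardCharsets.UTF_8).length
        };
    }

    public static void main(String[] args) {
        // "a" followed by U+1F600, which needs two UTF-16 code units.
        String s = "a\uD83D\uDE00";
        int[] n = sizes(s);
        System.out.println("Code units: " + n[0]);  // 3
        System.out.println("Code points: " + n[1]); // 2
        System.out.println("UTF-8 bytes: " + n[2]); // 5

        // Same iteration pattern as the main program: the second
        // code point starts at index 1 and spans indexes 1 and 2.
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            System.out.printf("U+%04X starts at %d%n", codePoint, i);
            i += Character.charCount(codePoint);
        }
    }
}
```

Here length() reports 3 while codePointCount() reports 2, which is the divergence that the all-BMP Thai example cannot show.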