Our first program will demonstrate working with strings and characters in Java. Here’s the full source code:
import java.nio.charset.StandardCharsets;

public class StringsAndChars {

    public static void main(String[] args) {
        // s is a String assigned a literal value
        // representing the word "hello" in the Thai language.
        // Java strings are exposed as sequences of UTF-16 code units.
        final String s = "สวัสดี";

        // length() returns the number of UTF-16 code units (chars),
        // not the number of bytes or code points.
        System.out.println("Len: " + s.length());

        // Get the raw bytes in UTF-8 encoding and print each one in hex.
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        System.out.print("UTF-8 bytes: ");
        for (byte b : bytes) {
            System.out.printf("%x ", b);
        }
        System.out.println();

        // Count the number of Unicode code points in the string.
        System.out.println("Code point count: " + s.codePointCount(0, s.length()));

        // Iterate over each code point in the string. A one-element array
        // tracks the UTF-16 index where each code point starts, because a
        // lambda can only capture effectively final variables.
        System.out.println("\nUsing codePoints():");
        final int[] index = {0};
        s.codePoints().forEach(codePoint -> {
            System.out.printf("U+%04X starts at %d%n", codePoint, index[0]);
            examineCodePoint(codePoint);
            index[0] += Character.charCount(codePoint);
        });
    }

    private static void examineCodePoint(int codePoint) {
        // We can compare a code point value to a character literal directly.
        if (codePoint == 't') {
            System.out.println("found tee");
        } else if (codePoint == 'ส') {
            System.out.println("found so sua");
        }
    }
}
This program demonstrates several concepts related to strings and characters in Java:
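- The UTF-8 bytes of a string can be obtained with getBytes(StandardCharsets.UTF_8).
- The length() method returns the number of UTF-16 code units, which differs from the number of Unicode code points when the string contains characters outside the Basic Multilingual Plane.
- Code points can be counted and iterated with codePointCount() and codePoints().
- Java uses the int type to represent Unicode code points, similar to Go’s rune type.
To run this program: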
$ javac StringsAndChars.java
$ java StringsAndChars
Len: 6
UTF-8 bytes: e0 b8 aa e0 b8 a7 e0 b8 b1 e0 b8 aa e0 b8 94 e0 b8 b5
Code point count: 6
Using codePoints():
U+0E2A starts at 0
found so sua
U+0E27 starts at 1
U+0E31 starts at 2
U+0E2A starts at 3
found so sua
U+0E14 starts at 4
U+0E35 starts at 5
This output shows that the Thai string “สวัสดี” consists of 6 Unicode code points. Each one fits in a single UTF-16 code unit, so length() also reports 6, and each is encoded as 3 bytes in UTF-8, for 18 bytes in total.
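The Thai string never makes length() and the code point count disagree, because all of its characters lie in the Basic Multilingual Plane. As a minimal sketch of a case where they differ, the program below adds one supplementary character (U+1F600, picked here only for illustration) written as a surrogate pair escape. The class name CodeUnitsVsCodePoints and the helper utf8Length are hypothetical names introduced for this example and are not part of the original program.

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitsVsCodePoints {

    // Hypothetical helper: number of UTF-8 bytes needed to encode one code point.
    public static int utf8Length(int codePoint) {
        String single = new String(Character.toChars(codePoint));
        return single.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "a", Thai so sua (U+0E2A), and U+1F600 as a surrogate pair.
        String s = "a\u0E2A\uD83D\uDE00";

        System.out.println("UTF-16 code units: " + s.length());
        System.out.println("Code points: " + s.codePointCount(0, s.length()));
        System.out.println("UTF-8 bytes: " + s.getBytes(StandardCharsets.UTF_8).length);

        // Walk the string by code point, advancing by 1 or 2 code units each step.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("U+%04X at index %d: %d code unit(s), %d UTF-8 byte(s)%n",
                    cp, i, Character.charCount(cp), utf8Length(cp));
            i += Character.charCount(cp);
        }
    }
}
```

With this input, length() reports 4 while codePointCount() reports 3, since the last code point occupies two UTF-16 code units. Stepping by Character.charCount() rather than by 1 is what keeps the index aligned with code point boundaries.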