Strings and Runes in JavaScript
A JavaScript string is an immutable sequence of UTF-16 code units. Code points in the Basic Multilingual Plane (U+0000 to U+FFFF) take one code unit each, while code points above U+FFFF are stored as two code units, known as a surrogate pair.
```javascript
// 's' is a string assigned a literal value
// representing the word "hello" in the Thai language.
// JavaScript string literals are UTF-16 encoded text.
const s = "สวัสดี";
// This will produce the length of the string in code units.
// Note that this might not be the same as the number of
// visual characters due to surrogate pairs.
console.log("Length:", s.length);
// This loop collects the hex values of all the code units
// that constitute the string and prints them on one line.
const hexUnits = [];
for (let i = 0; i < s.length; i++) {
    hexUnits.push(s.charCodeAt(i).toString(16).padStart(4, "0"));
}
console.log(hexUnits.join(" "));
// To count code points rather than code units, we can use
// the spread operator or Array.from().
console.log("Character count:", [...s].length);
// A for...of loop iterates a string by code point, so a
// surrogate pair arrives as a single two-unit string. We track
// the code unit offset ourselves, since the iterator doesn't.
let offset = 0;
for (const char of s) {
    console.log(`${char} starts at ${offset}`);
    offset += char.length;
}
// We can achieve the same iteration by using String.prototype.codePointAt()
// and String.fromCodePoint() functions explicitly.
console.log("\nUsing codePointAt");
for (let i = 0; i < s.length;) {
    const codePoint = s.codePointAt(i);
    console.log(`${String.fromCodePoint(codePoint)} starts at ${i}`);
    i += codePoint > 0xFFFF ? 2 : 1;
    examineCodePoint(codePoint);
}
function examineCodePoint(cp) {
    // JavaScript has no character literals, so we compare against
    // the code point of a one-character string.
    if (cp === 't'.codePointAt(0)) {
        console.log("found tee");
    } else if (cp === 'ส'.codePointAt(0)) {
        console.log("found so sua");
    }
}
```
To run this program, save it as strings-and-characters.js and use Node.js:
```
$ node strings-and-characters.js
Length: 6
0e2a 0e27 0e31 0e2a 0e14 0e35
Character count: 6
ส starts at 0
ว starts at 1
ั starts at 2
ส starts at 3
ด starts at 4
ี starts at 5

Using codePointAt
ส starts at 0
found so sua
ว starts at 1
ั starts at 2
ส starts at 3
found so sua
ด starts at 4
ี starts at 5
```
Unlike some other languages, JavaScript doesn’t have a separate “character” type. A single character is itself a string: one code unit long for code points in the Basic Multilingual Plane, and two code units long for code points stored as a surrogate pair.
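To make the "length 1 or 2" point concrete, here is a minimal sketch that compares a Basic Multilingual Plane character with one stored as a surrogate pair. The variable names `bmp`, `astral`, and `units` are illustrative and not part of the program above.

```javascript
// A BMP character occupies one code unit; U+1F600 needs a surrogate pair.
const bmp = "\u0E2A";       // ส
const astral = "\u{1F600}"; // 😀

console.log(bmp.length);    // 1
console.log(astral.length); // 2

// charCodeAt() exposes the two halves of the surrogate pair.
const units = [astral.charCodeAt(0), astral.charCodeAt(1)]
    .map((u) => u.toString(16));
console.log(units.join(" ")); // d83d de00

// Indexing by code unit yields a lone surrogate, not the whole character.
console.log(astral[0] === astral); // false
```

Neither half of the pair is a valid character on its own, which is why code that slices or indexes by code unit can split such characters in two.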
The String.prototype.codePointAt() method returns the Unicode code point that starts at a given code unit index. This value corresponds to what some other languages, such as Go, call a “rune”. The String.fromCodePoint() method does the reverse and creates a string from one or more code point values.
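The following sketch shows how the two methods behave around a surrogate pair, including what comes back when the index points into the middle of one. The string `text` and the other variable names are made up for illustration.

```javascript
const text = "a\u{1F600}b"; // "a😀b": 4 code units, 3 code points

// Index 1 is the start of the surrogate pair, so the full code point is returned.
const whole = text.codePointAt(1);
console.log(whole.toString(16)); // 1f600

// Index 2 lands in the middle of the pair, so only the trailing surrogate is returned.
const half = text.codePointAt(2);
console.log(half.toString(16)); // de00

// String.fromCodePoint() is the inverse and accepts several code points at once.
const rebuilt = String.fromCodePoint(0x61, whole, 0x62);
console.log(rebuilt === text); // true

// Past the end of the string, codePointAt() returns undefined.
console.log(text.codePointAt(10)); // undefined
```

This is why the codePointAt loop in the program advances its index by 2 whenever the code point is above 0xFFFF.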
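JavaScript’s for...of loop and the spread syntax [...s] both use the string iterator, which steps through code points rather than code units. A surrogate pair is therefore yielded as a single two-unit string, which makes it easier to work with Unicode characters outside the Basic Multilingual Plane.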
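As a sketch of the difference between code unit and code point iteration, the helper below pairs each code point with the code unit offset where it starts. `codePointOffsets` is a hypothetical helper written for this example, not a built-in.

```javascript
// Hypothetical helper: list [offset, character] pairs, one per code point.
function codePointOffsets(str) {
    const result = [];
    let offset = 0;
    for (const char of str) {   // the string iterator yields whole code points
        result.push([offset, char]);
        offset += char.length;  // 1 for a BMP character, 2 for a surrogate pair
    }
    return result;
}

const sample = "a\u{1F600}b"; // "a😀b"
console.log(sample.length);      // 4 (code units)
console.log([...sample].length); // 3 (code points)
console.log(codePointOffsets(sample).map(([i]) => i)); // [ 0, 1, 3 ]
```

The offsets skip from 1 to 3 because the emoji occupies code units 1 and 2.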
Remember that a string's length counts code units, and that even the number of code points can differ from the number of characters a reader sees. In the Thai example, the marks ั and ี attach to the preceding consonant, and many emoji are built from several code points.
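One way to count user-perceived characters is to segment the string into grapheme clusters. The sketch below assumes a runtime that provides Intl.Segmenter (recent Node.js releases and current browsers do). `countGraphemes` is a hypothetical helper name.

```javascript
// Hypothetical helper: count grapheme clusters (user-perceived characters).
// Assumes Intl.Segmenter is available in the runtime.
function countGraphemes(str, locale = "th") {
    const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
    return [...segmenter.segment(str)].length;
}

const word = "สวัสดี";
console.log(word.length);          // 6 code units
console.log([...word].length);     // 6 code points
console.log(countGraphemes(word)); // 4 grapheme clusters: ส, วั, ส, ดี
```

Which count is appropriate depends on the task: code units for indexing and slicing, code points for Unicode-aware processing, and grapheme clusters for anything that concerns what the reader sees, such as cursor movement or truncating text.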