Strings and Runes in JavaScript

A JavaScript string is an immutable sequence of UTF-16 code units. Most characters fit in a single code unit, but characters outside the Basic Multilingual Plane (code points above U+FFFF) are represented by two code units, called a surrogate pair.

```javascript
// 's' is a string assigned a literal value
// representing the word "hello" in the Thai language.
// JavaScript string literals are UTF-16 encoded text.
const s = "สวัสดี";

// This prints the length of the string in UTF-16 code units.
// It can differ from the number of visible characters because
// of surrogate pairs and combining marks.
console.log("Length:", s.length);

// This loop collects the hex value of each UTF-16 code unit
// in the string, zero-padded to four digits, and prints
// them all on one line.
let hex = "";
for (let i = 0; i < s.length; i++) {
    hex += s.charCodeAt(i).toString(16).padStart(4, "0") + " ";
}
console.log(hex);
console.log();

// To count code points instead of code units, we can use
// the spread operator or Array.from().
console.log("Character count:", [...s].length);

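// A for...of loop handles strings specially and yields whole
// code points, so a surrogate pair comes out as one character.
// The iterator does not report positions, so we track the
// code unit offset ourselves.
let offset = 0;
for (const char of s) {
    console.log(`${char} starts at ${offset}`);
    offset += char.length;
}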

// We can achieve the same iteration by using String.prototype.codePointAt()
// and String.fromCodePoint() functions explicitly.
console.log("\nUsing codePointAt");
for (let i = 0; i < s.length;) {
    const codePoint = s.codePointAt(i);
    console.log(`${String.fromCodePoint(codePoint)} starts at ${i}`);
    i += codePoint > 0xFFFF ? 2 : 1;
    examineCodePoint(codePoint);
}

function examineCodePoint(cp) {
    // JavaScript has no character literals, so we compare against
    // the code point of a one-character string.
    if (cp === 't'.codePointAt(0)) {
        console.log("found tee");
    } else if (cp === 'ส'.codePointAt(0)) {
        console.log("found so sua");
    }
}
```

To run this program, save it as strings-and-characters.js and use Node.js:

```shell
$ node strings-and-characters.js
Length: 6
0e2a 0e27 0e31 0e2a 0e14 0e35 

Character count: 6
ส starts at 0
ว starts at 1
ั starts at 2
ส starts at 3
ด starts at 4
ี starts at 5

Using codePointAt
ส starts at 0
found so sua
ว starts at 1
ั starts at 2
ส starts at 3
found so sua
ด starts at 4
ี starts at 5
```

In JavaScript, strings are sequences of UTF-16 code units. Unlike some other languages, JavaScript doesn’t have a separate “character” type. Instead, a single code point is represented by a string of length 1, or of length 2 when it needs a surrogate pair.
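
As a small illustration of the length-1-or-2 point, the sketch below compares a basic Latin letter with an emoji. The emoji and the variable names are chosen here for illustration and are not part of the program above.

```javascript
// "a" fits in one UTF-16 code unit; "😀" (U+1F600) needs a surrogate pair.
const bmpChar = "a";
const astralChar = "😀";

console.log(bmpChar.length);    // 1
console.log(astralChar.length); // 2

// Indexing and charCodeAt work on code units, so position 0 of the
// emoji holds only the leading (high) surrogate, not the whole character.
const firstUnit = astralChar.charCodeAt(0);
console.log(firstUnit.toString(16)); // d83d
```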

The String.prototype.codePointAt() method returns the Unicode code point that starts at a given code unit index. This value corresponds to what languages such as Go call a “rune”. The String.fromCodePoint() method creates a string from one or more code point values.
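
The following sketch shows both methods on a short string that mixes a one-unit and a two-unit character. The sample string and variable names are assumptions made for this example.

```javascript
const text = "a😀";

// codePointAt takes a code unit index and reads a whole surrogate
// pair when one starts at that index.
const first = text.codePointAt(0);  // 0x61
const second = text.codePointAt(1); // 0x1f600

// If the index lands on the second half of a pair, only that
// lone trailing surrogate is returned.
const trailing = text.codePointAt(2); // 0xde00

// String.fromCodePoint goes the other way and accepts several values.
const rebuilt = String.fromCodePoint(first, second);
console.log(rebuilt === text); // true
```

This is why the codePointAt loop in the program advances the index by 2 whenever the code point is above 0xFFFF.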

JavaScript’s for...of loop and the spread operator [...s] automatically handle surrogate pairs, making it easier to work with Unicode characters that are represented by two code units.
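
To make the difference visible, this sketch contrasts split(""), which cuts between code units, with the spread operator, which uses the string iterator. The sample string is an assumption for illustration.

```javascript
const mixed = "hi😀";

// split("") cuts between code units and breaks the surrogate pair apart.
const units = mixed.split("");

// Spread uses the string iterator, which keeps the pair together.
const points = [...mixed];

console.log(units.length);  // 4
console.log(points.length); // 3
console.log(points[2]);     // 😀
```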

Remember that when working with Unicode in JavaScript, the length of a string might not always correspond to the number of visual characters, especially for complex scripts or emoji.
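
One way to count user-perceived characters is Intl.Segmenter with grapheme granularity. The sketch below assumes a runtime that ships it (recent Node.js releases and modern browsers do). The helper name countGraphemes is hypothetical and not part of the program above.

```javascript
// Hypothetical helper: counts grapheme clusters (user-perceived characters).
// Assumes Intl.Segmenter is available in the runtime.
function countGraphemes(str) {
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    return [...segmenter.segment(str)].length;
}

const thai = "สวัสดี";
console.log(thai.length);          // 6 code units
console.log([...thai].length);     // 6 code points
console.log(countGraphemes(thai)); // 4 grapheme clusters
```

The Thai greeting has six code points, but two of them are vowel marks that attach to the preceding consonant, so a reader sees four characters.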
