Strings and Runes in PHP
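A PHP string is a sequence of bytes; the language itself attaches no encoding to it. In practice most PHP code treats strings as UTF-8: `default_charset` has defaulted to UTF-8 since PHP 5.6, and the mbstring functions follow that setting unless told otherwise. Multi-byte characters, such as those found in many non-Latin alphabets, therefore need encoding-aware functions rather than the byte-oriented ones. Unlike some other languages, PHP has no separate 'character' type; a single character is simply a string that holds that one character.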
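As a quick illustration of the point about there being no character type, the sketch below checks what indexing a string returns. The variable names are only illustrative.

```php
<?php
// Indexing a string yields another string, not a distinct character type.
$word = "hello";
$first = $word[0];

var_dump(is_string($first)); // bool(true)
var_dump(strlen($first));    // int(1)

// Indexing and strlen() are byte-based, so one Thai character
// occupies several bytes.
$thai = "ส";
var_dump(strlen($thai));     // int(3)
```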
```php
<?php

// $s is a string assigned a literal value
// representing the word "hello" in the Thai language.
// PHP strings are plain byte sequences; this literal is UTF-8
// because the source file is saved as UTF-8.
$s = "สวัสดี";

// strlen() returns the number of bytes in the string.
echo "Len: " . strlen($s) . "\n";

// This loop prints the hex values of all
// the bytes that constitute the string $s.
for ($i = 0; $i < strlen($s); $i++) {
    printf("%x ", ord($s[$i]));
}
echo "\n";

// To count how many characters are in a string, we can use
// the mb_strlen() function from the mbstring extension.
// It counts code points rather than bytes.
echo "Character count: " . mb_strlen($s, 'UTF-8') . "\n";

// mb_str_split() (PHP 7.4+) splits the string into an array of characters.
// The array key is the character index, not a byte offset.
foreach (mb_str_split($s, 1, 'UTF-8') as $idx => $char) {
    printf("U+%X '%s' starts at %d\n", mb_ord($char, 'UTF-8'), $char, $idx);
}

echo "\nUsing mb_strlen and mb_substr\n";
$i = 0;
while ($i < mb_strlen($s, 'UTF-8')) {
    $char = mb_substr($s, $i, 1, 'UTF-8');
    printf("U+%X '%s' starts at %d\n", mb_ord($char, 'UTF-8'), $char, $i);
    examineChar($char);
    $i++;
}

// Top-level functions may be called before their definition in the file.
function examineChar($c) {
    // We can compare a character directly to a string literal.
    if ($c === 't') {
        echo "found tee\n";
    } elseif ($c === 'ส') {
        echo "found so sua\n";
    }
}
```

When you run this script, you’ll see output similar to this:

```
Len: 18
e0 b8 aa e0 b8 a7 e0 b8 b1 e0 b8 aa e0 b8 94 e0 b8 b5
Character count: 6
U+E2A 'ส' starts at 0
U+E27 'ว' starts at 1
U+E31 'ั' starts at 2
U+E2A 'ส' starts at 3
U+E14 'ด' starts at 4
U+E35 'ี' starts at 5

Using mb_strlen and mb_substr
U+E2A 'ส' starts at 0
found so sua
U+E27 'ว' starts at 1
U+E31 'ั' starts at 2
U+E2A 'ส' starts at 3
found so sua
U+E14 'ด' starts at 4
U+E35 'ี' starts at 5
```
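This PHP script demonstrates how to work with UTF-8 encoded strings, which matters whenever the text contains non-ASCII characters. It shows how to get the byte length of a string, iterate over its bytes, count its characters rather than its bytes, and iterate over multi-byte characters. Note that `mb_strlen()` counts Unicode code points, so combining marks such as the Thai vowel signs 'ั' and 'ี' each count as one character, and the "starts at" values are character indexes rather than byte offsets.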
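If byte offsets are needed instead of character indexes, one possible approach is to accumulate the byte length of each character while iterating. The following is a minimal sketch; `byteOffsets` is a hypothetical helper name, and it assumes the input is valid UTF-8 and that `mb_str_split()` (PHP 7.4+) is available.

```php
<?php
// Hypothetical helper: pairs each character of a UTF-8 string with the
// byte offset at which it begins.
function byteOffsets(string $s): array
{
    $offsets = [];
    $offset = 0;
    foreach (mb_str_split($s, 1, 'UTF-8') as $char) {
        $offsets[] = [$char, $offset];
        // strlen() counts bytes, so this advances by the encoded width.
        $offset += strlen($char);
    }
    return $offsets;
}

foreach (byteOffsets("สวัสดี") as [$char, $offset]) {
    printf("U+%04X '%s' starts at byte %d\n", mb_ord($char, 'UTF-8'), $char, $offset);
}
```

Each Thai character in this word is three bytes wide, so the offsets advance in steps of three.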
The `mb_*` functions from the mbstring extension are used to handle multi-byte characters correctly. They are crucial when working with strings that contain characters from non-Latin alphabets or emoji.
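To make the difference concrete, here is a small sketch that compares a byte-oriented function with its `mb_*` counterpart. The variable names are illustrative only.

```php
<?php
$s = "สวัสดี";

// substr() counts bytes: 4 bytes is one whole character plus one stray byte.
$byteCut = substr($s, 0, 4);
// mb_substr() counts characters: this is the first four characters.
$charCut = mb_substr($s, 0, 4, 'UTF-8');

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)
var_dump(strlen($charCut));                     // int(12)

// The same gap shows up with emoji: one character, four bytes.
$thumb = "\u{1F44D}";
var_dump(strlen($thumb));             // int(4)
var_dump(mb_strlen($thumb, 'UTF-8')); // int(1)
```

Cutting at a byte boundary inside a character leaves an invalid UTF-8 sequence, which is why the first check fails.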
The `examineChar` function demonstrates how to compare individual characters in a string, which can be useful for more complex string processing tasks.
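The same strict comparison can drive simple processing tasks. As one example, the sketch below counts how often a given character occurs; `countChar` is a hypothetical helper, not part of the script above, and it assumes valid UTF-8 input and PHP 7.4+.

```php
<?php
// Hypothetical helper: counts occurrences of one character in a UTF-8 string
// using the same strict comparison as examineChar().
function countChar(string $s, string $target): int
{
    $count = 0;
    foreach (mb_str_split($s, 1, 'UTF-8') as $char) {
        if ($char === $target) {
            $count++;
        }
    }
    return $count;
}

echo countChar("สวัสดี", 'ส'), "\n"; // 2
echo countChar("สวัสดี", 't'), "\n"; // 0
```

For plain counting, the built-in `mb_substr_count()` is likely a simpler choice; the explicit loop is shown because it is easy to extend with further per-character checks.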
Remember to ensure that your PHP installation has the mbstring extension enabled to use these multi-byte string functions.
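One way to check for the extension at runtime is `extension_loaded('mbstring')`; on the command line, `php -m` lists the loaded modules. The sketch below uses a hypothetical `splitChars` helper that falls back to a PCRE-based split when mbstring is missing. It assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: splits a UTF-8 string into characters, using
// mbstring when it is available and a PCRE fallback otherwise.
function splitChars(string $s): array
{
    if (extension_loaded('mbstring') && function_exists('mb_str_split')) {
        return mb_str_split($s, 1, 'UTF-8');
    }
    // The /u modifier makes PCRE treat the pattern and subject as UTF-8.
    $parts = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false on failure, e.g. for invalid UTF-8.
    return $parts === false ? [] : $parts;
}

var_dump(extension_loaded('mbstring'));
echo count(splitChars("สวัสดี")), "\n"; // 6
```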