Strings and Runes in Python

A Python string is an immutable sequence of Unicode code points. Other languages handle text in broadly similar ways, but Python's model is more straightforward: it has no separate concept, such as Go's 'runes', for individual characters.
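To make the immutability point concrete, here is a minimal sketch. The variable names and the sample word are illustrative assumptions, not part of the original example. It shows that assigning to an index fails, and that "changing" a string means building a new one.

```python
# Hypothetical sample value; any string behaves the same way.
greeting = "hello"

# Strings are immutable, so item assignment raises TypeError.
try:
    greeting[0] = "J"
    mutated = True
except TypeError:
    mutated = False

# Building a modified copy leaves the original string untouched.
jello = "J" + greeting[1:]

print("mutated in place:", mutated)
print(greeting, jello)
```

Slicing and concatenation always return new string objects, which is why the original value is still available afterwards.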

```python
import unicodedata


def examine_char(c):
    # A character is just a string of length 1, so it can be
    # compared to a string literal directly.
    if c == 't':
        print("found tee")
    elif c == 'ส':
        print("found so sua")


# 's' is a string assigned a literal value
# representing the word "hello" in the Thai language.
# Python 3 string literals are Unicode by default.
s = "สวัสดี"

# len() returns the number of Unicode code points in the string,
# not the number of bytes in any particular encoding.
print("Len:", len(s))

# Indexing into a string produces the one-character string at
# that index. This loop prints the hex value of every code
# point in 's'.
for i in range(len(s)):
    print(f"{ord(s[i]):x}", end=" ")
print()

# Because Python strings are already sequences of code points,
# counting characters needs nothing beyond len().
print("Character count:", len(s))

# A for loop iterates over a string one code point at a time.
for idx, char in enumerate(s):
    print(f"{char!r} (U+{ord(char):04X}) starts at {idx}")

print("\nUsing unicodedata.name")
# unicodedata.name() looks up the official Unicode name of each
# character, which gives more detail than the code point alone.
for i, char in enumerate(s):
    print(f"{char!r} (U+{ord(char):04X}, {unicodedata.name(char)}) starts at {i}")
    examine_char(char)
```

When you run this script, you’ll see output similar to this:

Len: 6
e2a e27 e31 e2a e14 e35 
Character count: 6
'ส' (U+0E2A) starts at 0
'ว' (U+0E27) starts at 1
'ั' (U+0E31) starts at 2
'ส' (U+0E2A) starts at 3
'ด' (U+0E14) starts at 4
'ี' (U+0E35) starts at 5

Using unicodedata.name
'ส' (U+0E2A, THAI CHARACTER SO SUA) starts at 0
found so sua
'ว' (U+0E27, THAI CHARACTER WO WAEN) starts at 1
'ั' (U+0E31, THAI CHARACTER MAI HAN-AKAT) starts at 2
'ส' (U+0E2A, THAI CHARACTER SO SUA) starts at 3
found so sua
'ด' (U+0E14, THAI CHARACTER DO DEK) starts at 4
'ี' (U+0E35, THAI CHARACTER SARA II) starts at 5

This Python code demonstrates that strings are handled as sequences of Unicode code points. Unlike some other languages, Python doesn’t have a separate type for individual characters: each one is simply a string of length 1.
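The "string of length 1" point can be checked directly. The following is a small sketch with illustrative variable names. It takes the first character of the Thai greeting and converts it to its code point and back with the built-in `ord()` and `chr()`.

```python
# Take the first character of the Thai greeting used above.
ch = "สวัสดี"[0]

# Indexing returns another str, not a distinct character type.
is_str = isinstance(ch, str)

# ord() maps a one-character string to its integer code point,
# and chr() maps the integer back to a one-character string.
code_point = ord(ch)
round_trip = chr(code_point)

print(type(ch).__name__, len(ch))
print(hex(code_point), round_trip == ch)
```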

The unicodedata module in the standard library provides more detailed information about each Unicode character, such as its official name. It stands in for the manual UTF-8 decoding step in the original example.
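As a rough illustration of what the module offers, the sketch below collects the name and general category of each code point. The helper `describe` is a hypothetical name introduced here for illustration; `unicodedata.name` and `unicodedata.category` are the standard-library functions it relies on.

```python
import unicodedata


def describe(text):
    """Return (char, name, category) for every code point in text.

    'describe' is an illustrative helper, not part of unicodedata.
    """
    return [
        # The second argument to name() is a fallback for code
        # points that have no assigned name.
        (ch, unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for ch in text
    ]


for ch, name, category in describe("สวัสดี"):
    print(f"U+{ord(ch):04X} {category} {name}")
```

The category distinguishes base letters (`Lo`) from combining marks (`Mn`), which matters when a single visible character is built from more than one code point.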

Note that Python’s string handling is generally simpler than in some other languages: a str is already a sequence of Unicode code points, so explicit UTF-8 decoding is needed only when converting to or from bytes, for example when reading binary data.
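The cases where encoding does matter can be sketched as follows, using the same Thai greeting. The variable names are illustrative. Encoding with `str.encode` yields a `bytes` object whose length is counted in bytes, and `bytes.decode` turns it back into a string.

```python
s = "สวัสดี"

# Encoding produces bytes; each of these Thai characters
# occupies three bytes in UTF-8.
data = s.encode("utf-8")
print("code points:", len(s))
print("UTF-8 bytes:", len(data))

# Decoding is the explicit step needed when text arrives as bytes,
# for example from a socket or a file opened in binary mode.
decoded = data.decode("utf-8")
print("round trip ok:", decoded == s)
```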
