
On Python 2.2 through 3.2, CPython could be compiled with "narrow" or "wide" Unicode strings. On a "narrow" build, Unicode strings were represented internally using two-byte code units, with surrogate pairs for codepoints outside the BMP (effectively UTF-16). On a "wide" build, Unicode strings were represented internally using four-byte code units, one per codepoint.
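To make the surrogate-pair point concrete, here is a rough sketch that runs on a current Python and only emulates what a narrow build would have seen. The helper names `surrogate_pair` and `utf16_code_units` are made up for illustration; the arithmetic is the standard UTF-16 split.

```python
def surrogate_pair(codepoint: int) -> tuple[int, int]:
    """Split a codepoint above U+FFFF into its UTF-16 high/low surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only codepoints outside the BMP need a surrogate pair")
    offset = codepoint - 0x10000            # 20 bits of payload
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low


def utf16_code_units(s: str) -> int:
    """Count two-byte code units, i.e. roughly what len() reported on a narrow build."""
    return len(s.encode("utf-16-le")) // 2


high, low = surrogate_pair(0x1F600)
print(hex(high), hex(low))                  # 0xd83d 0xde00
print(utf16_code_units("\U0001F600"))       # 2 code units for one codepoint
print(len("\U0001F600"))                    # 1 on Python 3.3+
```

That mismatch between code units and codepoints is why `len()` and indexing behaved differently on narrow and wide builds for anything outside the BMP.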

On Python 3.3+, the distinction is gone, and Python uses either latin-1, UCS-2 or UCS-4 depending on the highest codepoint in the string (i.e., a string containing only codepoints in latin-1 will be stored internally as latin-1; a string containing at least one codepoint outside the BMP will be stored internally as UCS-4, etc.).
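You can observe this from Python itself, at least on CPython 3.3+. The sketch below is a measurement trick, not an official API: `bytes_per_codepoint` is a hypothetical helper that compares `sys.getsizeof` for two strings differing only in length, so the fixed object header cancels out.

```python
import sys


def bytes_per_codepoint(sample: str, n: int = 100) -> int:
    """Estimate storage per codepoint for strings built from `sample` (CPython-specific)."""
    short = sample * n
    long = sample * (2 * n)
    extra_bytes = sys.getsizeof(long) - sys.getsizeof(short)
    extra_codepoints = len(long) - len(short)
    return extra_bytes // extra_codepoints


print(bytes_per_codepoint("a"))             # 1: fits in latin-1
print(bytes_per_codepoint("\xe9"))          # 1: still latin-1
print(bytes_per_codepoint("\u20ac"))        # 2: BMP but outside latin-1, so UCS-2
print(bytes_per_codepoint("\U0001F600"))    # 4: outside the BMP, so UCS-4
print(bytes_per_codepoint("a\U0001F600"))   # 4: one astral codepoint widens the whole string
```

The last line is the relevant one for the width discussion: every codepoint in a given string takes the same number of bytes, so indexing stays O(1).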



It's still a fixed-width encoding, with every codepoint in a given string taking the same number of bytes, even if the specific width depends on the string's content.



