.. _python-unicode: ################## Python and unicode ################## See :ref:`introducing-unicode` for an introduction to Unicode. See also: * http://docs.python.org/howto/unicode.html * http://effbot.org/zone/unicode-objects.htm * http://wiki.python.org/moin/Unicode Python 2 supports unicode with unicode strings:: s = u'Hello world' Python 3 strings are always unicode. Python 3 as of Python 3.3 allows (but ignores) the ``u`` prefix for strings, so I will use that convention for unicode strings for compatibility with Python 2 and Python 3. There are various ways of inputing characters you cannot type at your prompt. The most basic is to give the unicode code point in hexadecimal:: question = u'\u00bfHabla espa\u00f1ol?' # ¿Habla español? where ``00bf`` is the hexadecimal unicode code point for the inverted question mark; see http://www.unicode.org/Public/UNIDATA/UnicodeData.txt. Use the ``\u0000`` format, i.e. ``\u`` followed by 4 hexadecimal digits. For code points outside the 16 bit range (outside the BMP |--| see :ref:`introducing-unicode`) |--| use capital ``U`` and eight hexadecimal digits, like this:: complicated = u'\U0001D11A is musical symbol 5 line staff' See below for some complications of using these 32 bit unicode characters in some builds of python 2. You can also use the standard unicode name (see http://www.unicode.org/Public/UNIDATA/UnicodeData.txt ):: less_opaque = u'\N{MUSICAL SYMBOL FIVE-LINE STAFF} is more obviously a five line staff' To create an UTF-8 encoded version of a string - for example to write to a text file:: question = u'\u00bfHabla espa\u00f1ol?' # ¿Habla español? raw_str = question.encode('utf-8') Similarly for UTF-16, or other encodings: http://docs.python.org/lib/standard-encodings.html :: raw_str = question.encode('utf-16') To get a unicode string from text that has been encoded:: question = raw_str.decode('utf-8') In Python 3, ``raw_str`` will be a *byte string* rather than a standard (unicode) string. ******************************************* Python internal encoding of unicode strings ******************************************* The internal encoding of unicode strings depends on the version of Python. ******************************* Python versions 2.2 through 3.2 ******************************* The internal representation of unicode stirngs Pythons 2.2 through 3.2 depends on flags with which the Python program binary was built. Pythons built with the build flag ``--enable-unicode=ucs2`` `use UTF-16 as the internal representation `_. Yes, it is confusing that the flag value is ``ucs2`` and the actual result is UTF-16. Pythons built with build flag ``--enable-unicode=ucs4`` use UCS-4 (or equivalently, UTF-32) as their internal representation. To tell which format your Python uses:: import sys utf_16 = sys.maxunicode == 65535 If ``utf_16`` is True, you have a UTF-16 build, otherwise you have UCS4. See also: * http://www.python.org/dev/peps/pep-0100/ * http://www.python.org/dev/peps/pep-0261/ UTF-16 (ucs2) builds of Python and 32 bit unicode code points ============================================================= If you have a UTF-16 build of python, and want to use a 32 bit code point, then some oddness occurs:: complicated = u'\U0001D11A ' print ord(complicated[0]) print ord(complicated[1]) On a UTF-16 build the above gives you:: 55348 56602 In this case, the 32 bit value has been represented by two 16 bit values - a UTF-16 **surrogate pair** - see :ref:`introducing-unicode`. On a UCS-4 build you get:: 119066 32 which might have been more what you were expecting - 119066 is the decimal representation of hexadecimal 1D11A. The difference between the two builds can mean some oddness in slicing strings... (as noted in http://www.python.org/dev/peps/pep-0261/). Some discussion about UTF-16 / UCS-2, UCS-4 and Python 3 here: http://mail.python.org/pipermail/python-dev/2008-July/080886.html ************************ Python versions from 3.3 ************************ Python versions 3.3 and above use a flexible internal representation of the string that depends on the string contents |--| see http://www.python.org/dev/peps/pep-0393. ************************************ Relevant python modules and commands ************************************ Modules ======= * `codecs `_; * `unicodedata `_; * `locale `_ (``locale.getdefaultlocale``); * `regular expressions `_ - ``(?u)`` flag, ``re.UNICODE``; * `standard encodings `_; * encodings - e.g. ``encodings.getaliases()`` String methods ============== * encode * decode Builtins ======== * unichr - unicode equivalent of ``chr`` in Python 2. * unicode - constructor for unicode strings in Python 2. Exceptions: =========== * UnicodeEncodeError .. include:: links_names.inc