Tuesday, January 8, 2013

codecs – String encoding and decoding

Purpose:Encoders and decoders for converting text between different representations.
Available In:2.1 and later

The codecs module provides stream and file interfaces for transcoding data in your program. It is most commonly used to work with Unicode text, but other encodings are also available for other purposes.

Unicode Primer

CPython 2.x supports two types of strings for working with text data. Old-style str instances use a single 8-bit byte to represent each character of the string using its ASCII code. In contrast, unicode strings are managed internally as a sequence of Unicode code points. The code point values are saved as a sequence of 2 or 4 bytes each, depending on the options given when Python was compiled. Both unicodeand str are derived from a common base class, and support a similar API.
When unicode strings are output, they are encoded using one of several standard schemes so that the sequence of bytes can be reconstructed as the same string later. The bytes of the encoded value are not necessarily the same as the code point values, and the encoding defines a way to translate between the two sets of values. Reading Unicode data also requires knowing the encoding so that the incoming bytes can be converted to the internal representation used by the unicode class.
The most common encodings for Western languages are UTF-8 and UTF-16, which use sequences of one and two byte values respectively to represent each character. Other encodings can be more efficient for storing languages where most of the characters are represented by code points that do not fit into two bytes.
See also
 
For more introductory information about Unicode, refer to the list of references at the end of this section. The Python Unicode HOWTO is especially helpful.

source: codecs – String encoding and decoding