In Python 3, there are two types that represent sequences of characters: bytes and str.
Instances of bytes contain raw 8-bit values. Instances of str contain Unicode
characters.
In Python 2, there are two types that represent sequences of characters: str and
unicode. In contrast to Python 3, instances of str contain raw 8-bit values. Instances of
unicode contain Unicode characters.
There are many ways to represent Unicode characters as binary data (raw 8-bit values). The most common encoding is UTF-8. Importantly, str instances in Python 3 and unicode instances in Python 2 do not have an associated binary encoding. To convert Unicode characters to binary data, you must use the encode method. To convert binary data to Unicode characters, you must use the decode method.
When you’re writing Python programs, it’s important to do encoding and decoding of Unicode at the furthest boundary of your interfaces. The core of your program should use Unicode character types (str in Python 3, unicode in Python 2) and should not assume anything about character encodings. This approach allows you to be very accepting of alternative text encodings (such as Latin-1, Shift JIS, and Big5) while being strict about your output text encoding (ideally, UTF-8).
The split between character types leads to two common situations in Python code: You want to operate on raw 8-bit values that are UTF-8-encoded characters (or some other encoding). You want to operate on Unicode characters that have no specific encoding. You’ll often need two helper functions to convert between these two cases and to ensure that the type of input values matches your code’s expectations. In Python 3, you’ll need one method that takes a str or bytes and always returns a str.
def to_str( bytes_or_str) :
if isinstance( bytes_or_str, bytes) :
value = bytes_or_str. decode( ‘ utf- 8’ )
else:
value = bytes_or_str
return value # Instance of str
You’ll need another method that takes a str or bytes and always returns a bytes.
def to_bytes( bytes_or_str) :
if isinstance( bytes_or_str, str) :
value = bytes_or_str. encode( ‘ utf- 8’ )
else:
value = bytes_or_str
return value # Instance of bytes
There are many ways to represent Unicode characters as binary data (raw 8-bit values). The most common encoding is UTF-8. Importantly, str instances in Python 3 and unicode instances in Python 2 do not have an associated binary encoding. To convert Unicode characters to binary data, you must use the encode method. To convert binary data to Unicode characters, you must use the decode method.
When you’re writing Python programs, it’s important to do encoding and decoding of Unicode at the furthest boundary of your interfaces. The core of your program should use Unicode character types (str in Python 3, unicode in Python 2) and should not assume anything about character encodings. This approach allows you to be very accepting of alternative text encodings (such as Latin-1, Shift JIS, and Big5) while being strict about your output text encoding (ideally, UTF-8).
The split between character types leads to two common situations in Python code: You want to operate on raw 8-bit values that are UTF-8-encoded characters (or some other encoding). You want to operate on Unicode characters that have no specific encoding. You’ll often need two helper functions to convert between these two cases and to ensure that the type of input values matches your code’s expectations. In Python 3, you’ll need one method that takes a str or bytes and always returns a str.
def to_str( bytes_or_str) :
if isinstance( bytes_or_str, bytes) :
value = bytes_or_str. decode( ‘ utf- 8’ )
else:
value = bytes_or_str
return value # Instance of str
You’ll need another method that takes a str or bytes and always returns a bytes.
def to_bytes( bytes_or_str) :
if isinstance( bytes_or_str, str) :
value = bytes_or_str. encode( ‘ utf- 8’ )
else:
value = bytes_or_str
return value # Instance of bytes
In Python 2, you’ll need one method that takes a str or unicode and always returns a
unicode.
# Python 2
def to_unicode( unicode_or_str) :
if isinstance( unicode_or_str, str) :
value = unicode_or_str. decode( ‘ utf- 8’ )
else:
value = unicode_or_str
return value # Instance of unicode
You’ll need another method that takes str or unicode and always returns a str.
def to_unicode( unicode_or_str) :
if isinstance( unicode_or_str, str) :
value = unicode_or_str. decode( ‘ utf- 8’ )
else:
value = unicode_or_str
return value # Instance of unicode
You’ll need another method that takes str or unicode and always returns a str.
# Python 2
def to_str( unicode_or_str) :
if isinstance( unicode_or_str, unicode) :
value = unicode_or_str. encode( ‘ utf- 8’ )
else:
value = unicode_or_str
return value # Instance of str
def to_str( unicode_or_str) :
if isinstance( unicode_or_str, unicode) :
value = unicode_or_str. encode( ‘ utf- 8’ )
else:
value = unicode_or_str
return value # Instance of str
There are two big gotchas when dealing with raw 8-bit values and Unicode characters in
Python.
The first issue is that in Python 2, unicode and str instances seem to be the same type
when a str only contains 7-bit ASCII characters.
1.You can combine such a str and unicode together using the + operator.
2.You can compare such str and unicode instances using equality and inequality
operators.
3.You can use unicode instances for format strings like ' %s' .
All of this behavior means that you can often pass a str or unicode instance to a
function expecting one or the other and things will just work (as long as you’re only
dealing with 7-bit ASCII). In Python 3, bytes and str instances are never equivalent—
not even the empty string—so you must be more deliberate about the types of character
sequences that you’re passing around.
The second issue is that in Python 3, operations involving file handles (returned by the
open built-in function) default to UTF-8 encoding. In Python 2, file operations default to
binary encoding. This causes surprising failures, especially for programmers accustomed
to Python 2.
For example, say you want to write some random binary data to a file. In Python 2, this
works. In Python 3, this breaks.
with open( ‘ /tmp/random. bin’ , ‘ w’ ) as f:
f. write( os. urandom( 10) )
>>>
f. write( os. urandom( 10) )
>>>
TypeError: must be str, not bytes
The cause of this exception is the new encoding argument for open that was added in
Python 3. This parameter defaults to ' utf- 8' . That makes read and write operations
on file handles expect str instances containing Unicode characters instead of bytes
instances containing binary data.
To make this work properly, you must indicate that the data is being opened in write
binary mode (' wb' ) instead of write character mode (' w' ). Here, I use open in a way
that works correctly in Python 2 and Python 3:
with open( ‘ /tmp/random. bin’ , ‘ wb’ ) as f:
f. write( os. urandom( 10) )
f. write( os. urandom( 10) )
This problem also exists for reading data from files. The solution is the same: Indicate
binary mode by using ' rb' instead of ' r' when opening a file.
Things to Remember
In Python 3, bytes contains sequences of 8-bit values, str contains sequences of
Unicode characters. bytes and str instances can’t be used together with operators
(like > or +).
In Python 2, str contains sequences of 8-bit values, unicode contains sequences
of Unicode characters. str and unicode can be used together with operators if
the str only contains 7-bit ASCII characters.
Use helper functions to ensure that the inputs you operate on are the type of
character sequence you expect (8-bit values, UTF-8 encoded characters, Unicode
characters, etc.).
If you want to read or write binary data to/from a file, always open the file using a
binary mode (like ' rb' or ' wb' )
Comments
Post a Comment