The Differences between bytes, str, and unicode in Python

In Python 3, there are two types that represent sequences of characters: bytes and str. Instances of bytes contain raw 8-bit values. Instances of str contain Unicode characters. In Python 2, there are two types that represent sequences of characters: str and unicode. In contrast to Python 3, instances of str contain raw 8-bit values. Instances of unicode contain Unicode characters.
There are many ways to represent Unicode characters as binary data (raw 8-bit values). The most common encoding is UTF-8. Importantly, str instances in Python 3 and unicode instances in Python 2 do not have an associated binary encoding. To convert Unicode characters to binary data, you must use the encode method. To convert binary data to Unicode characters, you must use the decode method.
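
As a quick illustration of the encode and decode round trip in Python 3, here is a minimal sketch. The sample string and the variable names text and data are chosen only for this example.

```python
# Python 3
text = 'café'                      # str: four Unicode characters
data = text.encode('utf-8')        # bytes: five raw 8-bit values
assert data == b'caf\xc3\xa9'
assert len(text) == 4 and len(data) == 5

# The same characters produce different bytes under another encoding.
assert text.encode('latin-1') == b'caf\xe9'

# Decoding with the matching encoding recovers the original characters.
assert data.decode('utf-8') == text
```

The str itself carries no encoding; the encoding only appears at the moment you call encode or decode.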
When you’re writing Python programs, it’s important to do encoding and decoding of Unicode at the furthest boundary of your interfaces. The core of your program should use Unicode character types (str in Python 3, unicode in Python 2) and should not assume anything about character encodings. This approach allows you to be very accepting of alternative text encodings (such as Latin-1, Shift JIS, and Big5) while being strict about your output text encoding (ideally, UTF-8).
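
One way to picture decoding and encoding at the boundary is the sketch below. The function normalize_line and its source_encoding parameter are hypothetical names invented for this example, and the strip call merely stands in for whatever your core logic does.

```python
# Python 3
def normalize_line(raw, source_encoding):
    """Decode on the way in, work on str, encode UTF-8 on the way out."""
    text = raw.decode(source_encoding)   # input boundary: bytes -> str
    text = text.strip()                  # core logic only ever sees str
    return text.encode('utf-8')          # output boundary: str -> bytes

# Inputs arrive in different encodings...
latin1_input = 'café\n'.encode('latin-1')
sjis_input = 'テスト\n'.encode('shift_jis')

# ...but the output is always UTF-8.
assert normalize_line(latin1_input, 'latin-1') == 'café'.encode('utf-8')
assert normalize_line(sjis_input, 'shift_jis') == 'テスト'.encode('utf-8')
```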
The split between character types leads to two common situations in Python code:

1. You want to operate on raw 8-bit values that are UTF-8-encoded characters (or some other encoding).

2. You want to operate on Unicode characters that have no specific encoding.

You’ll often need two helper functions to convert between these two cases and to ensure that the type of input values matches your code’s expectations.

In Python 3, you’ll need one function that takes a str or bytes and always returns a str.

def to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes):
        value = bytes_or_str.decode('utf-8')
    else:
        value = bytes_or_str
    return value  # Instance of str

You’ll need another function that takes a str or bytes and always returns a bytes.

def to_bytes(bytes_or_str):
    if isinstance(bytes_or_str, str):
        value = bytes_or_str.encode('utf-8')
    else:
        value = bytes_or_str
    return value  # Instance of bytes
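
A short usage check for the two Python 3 helpers follows. The definitions are repeated in compact form so the snippet runs on its own, and the sample values are illustrative. Note that both helpers assume UTF-8, so bytes in another encoding will fail to decode.

```python
# Python 3
def to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes):
        return bytes_or_str.decode('utf-8')
    return bytes_or_str

def to_bytes(bytes_or_str):
    if isinstance(bytes_or_str, str):
        return bytes_or_str.encode('utf-8')
    return bytes_or_str

# Either input type comes out as the type you asked for.
assert to_str(b'caf\xc3\xa9') == 'café'
assert to_str('café') == 'café'
assert to_bytes('café') == b'caf\xc3\xa9'
assert to_bytes(b'caf\xc3\xa9') == b'caf\xc3\xa9'

# Bytes that are not valid UTF-8 (here, Latin-1) raise an error.
decode_failed = False
try:
    to_str(b'caf\xe9')
except UnicodeDecodeError:
    decode_failed = True
assert decode_failed
```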

In Python 2, you’ll need one function that takes a str or unicode and always returns a unicode.

# Python 2
def to_unicode(unicode_or_str):
    if isinstance(unicode_or_str, str):
        value = unicode_or_str.decode('utf-8')
    else:
        value = unicode_or_str
    return value  # Instance of unicode

You’ll need another function that takes str or unicode and always returns a str.

# Python 2
def to_str(unicode_or_str):
    if isinstance(unicode_or_str, unicode):
        value = unicode_or_str.encode('utf-8')
    else:
        value = unicode_or_str
    return value  # Instance of str

There are two big gotchas when dealing with raw 8-bit values and Unicode characters in Python.

The first issue is that in Python 2, unicode and str instances seem to be the same type when a str only contains 7-bit ASCII characters.

1. You can combine such a str and unicode together using the + operator.

2. You can compare such str and unicode instances using equality and inequality operators.

3. You can use unicode instances for format strings like '%s'.

All of this behavior means that you can often pass a str or unicode instance to a function expecting one or the other and things will just work (as long as you’re only dealing with 7-bit ASCII). In Python 3, bytes and str instances are never equivalent, not even the empty string, so you must be more deliberate about the types of character sequences that you’re passing around.
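
The Python 3 strictness can be checked directly. This is a small sketch with illustrative values; the flag variable concat_failed exists only for this example.

```python
# Python 3
# bytes and str never compare equal, even when both are empty.
assert (b'' == '') is False
assert (b'foo' == 'foo') is False

# Mixing the two types with + raises instead of silently converting.
concat_failed = False
try:
    b'one' + 'two'
except TypeError:
    concat_failed = True
assert concat_failed

# Formatting bytes into a str does not decode them; you get the repr.
assert '%s' % b'foo' == "b'foo'"
```

The last assertion is the easiest one to miss: no exception is raised, but the result is almost never what you wanted.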

The second issue is that in Python 3, operations involving file handles (returned by the open built-in function) default to text mode, which encodes and decodes str data using a platform-dependent default encoding (often UTF-8). In Python 2, file operations work with raw bytes by default. This causes surprising failures, especially for programmers accustomed to Python 2.

For example, say you want to write some random binary data to a file. In Python 2, this works. In Python 3, this breaks.

import os

with open('/tmp/random.bin', 'w') as f:
    f.write(os.urandom(10))
>>>
TypeError: must be str, not bytes

The cause of this exception is that open in Python 3 opens files in text mode unless told otherwise, and text mode relies on the new encoding argument that was added in Python 3. When this parameter is left unset, a platform-dependent default (often UTF-8) is used. That makes read and write operations on file handles expect str instances containing Unicode characters instead of bytes instances containing binary data.
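
To see what text mode does with str data, the following sketch writes with an explicit encoding argument and then inspects the stored bytes. Passing encoding explicitly keeps the result independent of whatever default is in effect. The temporary path and file name are assumptions made for this example.

```python
# Python 3
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')  # throwaway location

# Text mode with an explicit encoding: write() takes str.
with open(path, 'w', encoding='utf-8') as f:
    f.write('café')

# Binary mode shows the bytes that were actually stored.
with open(path, 'rb') as f:
    raw = f.read()
assert raw == b'caf\xc3\xa9'

# Text mode with the same encoding decodes them back to str.
with open(path, 'r', encoding='utf-8') as f:
    assert f.read() == 'café'
```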

To make this work properly, you must indicate that the data is being opened in write binary mode ('wb') instead of write character mode ('w'). Here, I use open in a way that works correctly in Python 2 and Python 3:

import os

with open('/tmp/random.bin', 'wb') as f:
    f.write(os.urandom(10))

This problem also exists for reading data from files. The solution is the same: indicate binary mode by using 'rb' instead of 'r' when opening a file.
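
The read side can be sketched the same way. The payload below is a fixed byte sequence chosen because it is not valid UTF-8, and the temporary path and the flag text_mode_failed are assumptions for this example. The text-mode read passes encoding='utf-8' explicitly so the outcome does not depend on the platform default.

```python
# Python 3
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'random.bin')  # throwaway location
payload = b'\xf1\xf2\xf3\xf4\xf5'                      # not valid UTF-8

with open(path, 'wb') as f:
    f.write(payload)

# Binary mode returns the bytes unchanged.
with open(path, 'rb') as f:
    assert f.read() == payload

# Text mode tries to decode the bytes and fails.
text_mode_failed = False
try:
    with open(path, 'r', encoding='utf-8') as f:
        f.read()
except UnicodeDecodeError:
    text_mode_failed = True
assert text_mode_failed
```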

Things to Remember

In Python 3, bytes contains sequences of 8-bit values and str contains sequences of Unicode characters. bytes and str instances can’t be used together with operators (like > or +).

In Python 2, str contains sequences of 8-bit values and unicode contains sequences of Unicode characters. str and unicode can be used together with operators if the str only contains 7-bit ASCII characters.

Use helper functions to ensure that the inputs you operate on are the type of character sequence you expect (8-bit values, UTF-8 encoded characters, Unicode characters, etc.).

If you want to read or write binary data to/from a file, always open the file using a binary mode (like 'rb' or 'wb').
