The Differences between bytes, str, and unicode in Python

     In Python 3, there are two types that represent sequences of characters: bytes and str. Instances of bytes contain raw 8-bit values. Instances of str contain Unicode characters. In Python 2, there are two types that represent sequences of characters: str and unicode. In contrast to Python 3, instances of str contain raw 8-bit values. Instances of unicode contain Unicode characters.
     There are many ways to represent Unicode characters as binary data (raw 8-bit values). The most common encoding is UTF-8. Importantly, str instances in Python 3 and unicode instances in Python 2 do not have an associated binary encoding. To convert Unicode characters to binary data, you must use the encode method. To convert binary data to Unicode characters, you must use the decode method.
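As a minimal sketch of that round trip in Python 3 (the sample text and variable names are illustrative, not from the original):

```python
# Python 3: str holds Unicode characters; bytes holds raw 8-bit values.
text = 'caf\u00e9'                      # 4 Unicode characters ("cafe" with an acute accent)

# The same characters produce different bytes under different encodings.
utf8_data = text.encode('utf-8')        # str -> bytes
latin1_data = text.encode('latin-1')

print(utf8_data)      # b'caf\xc3\xa9'
print(latin1_data)    # b'caf\xe9'

# Decoding must name the encoding that produced the bytes.
round_trip = utf8_data.decode('utf-8')  # bytes -> str
print(round_trip == text)               # True
```

The str value itself carries no encoding; the encoding only appears at the moment you call encode or decode.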
     When you’re writing Python programs, it’s important to do encoding and decoding of Unicode at the furthest boundary of your interfaces. The core of your program should use Unicode character types (str in Python 3, unicode in Python 2) and should not assume anything about character encodings. This approach allows you to be very accepting of alternative text encodings (such as Latin-1, Shift JIS, and Big5) while being strict about your output text encoding (ideally, UTF-8).
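One way to picture this boundary approach is the sketch below. The three function names are hypothetical and exist only for illustration; the assumption is that the caller knows which encoding each input arrived in.

```python
# Hypothetical helpers illustrating "decode on the way in, encode on the way out".

def read_input(raw, source_encoding):
    """Boundary in: decode incoming bytes using the encoding the source declares."""
    return raw.decode(source_encoding)


def shout(text):
    """Core logic: works only on str and knows nothing about encodings."""
    return text.upper()


def write_output(text):
    """Boundary out: always emit UTF-8."""
    return text.encode('utf-8')


# Inputs arrive in different encodings; the output is always UTF-8.
latin1_bytes = 'caf\u00e9'.encode('latin-1')
sjis_bytes = '\u65e5\u672c'.encode('shift_jis')

for raw, encoding in [(latin1_bytes, 'latin-1'), (sjis_bytes, 'shift_jis')]:
    print(write_output(shout(read_input(raw, encoding))))
```

Only read_input and write_output mention an encoding, so supporting another input encoding does not touch the core logic.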
     The split between character types leads to two common situations in Python code:
     1. You want to operate on raw 8-bit values that are UTF-8-encoded characters (or some other encoding).
     2. You want to operate on Unicode characters that have no specific encoding.
     You’ll often need two helper functions to convert between these two cases and to ensure that the type of input values matches your code’s expectations. In Python 3, you’ll need one function that takes a str or bytes and always returns a str.
def to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes):
        value = bytes_or_str.decode('utf-8')
    else:
        value = bytes_or_str
    return value  # Instance of str

You’ll need another function that takes a str or bytes and always returns a bytes.
def to_bytes(bytes_or_str):
    if isinstance(bytes_or_str, str):
        value = bytes_or_str.encode('utf-8')
    else:
        value = bytes_or_str
    return value  # Instance of bytes
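A short usage sketch for Python 3 follows. The two helpers are restated in condensed form only so that the snippet runs on its own; the sample values are illustrative.

```python
# Condensed restatements of the two Python 3 helpers, so this snippet is self-contained.
def to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes):
        return bytes_or_str.decode('utf-8')
    return bytes_or_str


def to_bytes(bytes_or_str):
    if isinstance(bytes_or_str, str):
        return bytes_or_str.encode('utf-8')
    return bytes_or_str


# Whatever type comes in, each helper returns exactly one type.
print(repr(to_str(b'foo')))     # 'foo'
print(repr(to_str('bar')))      # 'bar'
print(repr(to_bytes(b'foo')))   # b'foo'
print(repr(to_bytes('bar')))    # b'bar'
```

Calling the matching helper at the top of a function lets the rest of that function assume a single type.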


In Python 2, you’ll need one function that takes a str or unicode and always returns a unicode.
# Python 2
def to_unicode(unicode_or_str):
    if isinstance(unicode_or_str, str):
        value = unicode_or_str.decode('utf-8')
    else:
        value = unicode_or_str
    return value  # Instance of unicode

You’ll need another function that takes a str or unicode and always returns a str.
# Python 2
def to_str(unicode_or_str):
    if isinstance(unicode_or_str, unicode):
        value = unicode_or_str.encode('utf-8')
    else:
        value = unicode_or_str
    return value  # Instance of str

There are two big gotchas when dealing with raw 8-bit values and Unicode characters in
Python.
The first issue is that in Python 2, unicode and str instances seem to be the same type
when a str only contains 7-bit ASCII characters.
     1. You can combine such a str and unicode together using the + operator.
     2. You can compare such str and unicode instances using equality and inequality operators.
     3. You can use unicode instances for format strings like '%s'.
All of this behavior means that you can often pass a str or unicode instance to a
function expecting one or the other and things will just work (as long as you’re only
dealing with 7-bit ASCII). In Python 3, bytes and str instances are never equivalent—
not even the empty string—so you must be more deliberate about the types of character
sequences that you’re passing around.
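The Python 3 side of this first gotcha can be checked with a small sketch. The helper raises_type_error is a hypothetical name introduced here for illustration.

```python
# Python 3: bytes and str never compare equal, even when both are empty.
empty_equal = (b'' == '')
ascii_equal = (b'foo' == 'foo')
print(empty_equal, ascii_equal)    # False False


def raises_type_error(operation):
    """Return True if calling operation() raises TypeError."""
    try:
        operation()
    except TypeError:
        return True
    return False


# Mixing the two types with + or > is an error rather than a silent conversion.
concat_fails = raises_type_error(lambda: b'one' + 'two')
order_fails = raises_type_error(lambda: b'one' > 'two')
print(concat_fails, order_fails)   # True True

# Converting explicitly makes the operation legal.
combined = b'one'.decode('utf-8') + 'two'
print(combined)                    # onetwo
```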
The second issue is that in Python 3, operations involving file handles (returned by the open built-in function) default to text mode, which expects str data and encodes it with a text encoding (the platform’s preferred encoding, commonly UTF-8). In Python 2, file operations default to binary encoding. This causes surprising failures, especially for programmers accustomed to Python 2.
For example, say you want to write some random binary data to a file. In Python 2, this
works. In Python 3, this breaks.
import os

with open('/tmp/random.bin', 'w') as f:
    f.write(os.urandom(10))
>>>
TypeError: must be str, not bytes
The cause of this exception is the new encoding argument for open that was added in Python 3. In a text mode such as 'w', this argument selects the text encoding; if you leave it out, Python falls back to the platform’s preferred encoding (commonly UTF-8). That makes read and write operations on file handles expect str instances containing Unicode characters instead of bytes instances containing binary data.
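A minimal sketch of text-mode behavior in Python 3, using a temporary directory rather than the path from the example above (the file name and variable names are illustrative):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.txt')

    # Text mode with an explicit encoding, so the result does not
    # depend on the platform default.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('caf\u00e9')          # str is accepted
        try:
            f.write(b'raw bytes')     # bytes are rejected in text mode
        except TypeError:
            rejected = True
        else:
            rejected = False

    # Binary mode shows what was actually stored on disk.
    with open(path, 'rb') as f:
        stored = f.read()

    # With no encoding argument, the encoding in use is platform-dependent.
    with open(path, 'r') as f:
        default_encoding = f.encoding

print(rejected)           # True
print(stored)             # b'caf\xc3\xa9'
print(default_encoding)   # varies by platform and configuration
```

Passing encoding explicitly in text mode avoids depending on the platform default.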
To make this work properly, you must indicate that the data is being opened in write binary mode ('wb') instead of write character mode ('w'). Here, I use open in a way that works correctly in Python 2 and Python 3:
import os

with open('/tmp/random.bin', 'wb') as f:
    f.write(os.urandom(10))
This problem also exists for reading data from files. The solution is the same: Indicate binary mode by using 'rb' instead of 'r' when opening a file.
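The reading case can be sketched the same way in Python 3. The payload is an arbitrary byte sequence chosen because it is not valid UTF-8, and the temporary file is an assumption made so the snippet runs anywhere.

```python
import os
import tempfile

payload = b'\xf1\xf2\xf3\xf4\xf5'   # arbitrary bytes that are not valid UTF-8

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.bin')
    with open(path, 'wb') as f:
        f.write(payload)

    # Text mode tries to decode the bytes as UTF-8 and fails.
    try:
        with open(path, 'r', encoding='utf-8') as f:
            f.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Binary mode returns the bytes untouched.
    with open(path, 'rb') as f:
        data = f.read()

print(text_mode_failed)   # True
print(data == payload)    # True
```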
Things to Remember

In Python 3, bytes contains sequences of 8-bit values, and str contains sequences of Unicode characters. bytes and str instances can’t be used together with operators (like > or +).

In Python 2, str contains sequences of 8-bit values, and unicode contains sequences of Unicode characters. str and unicode can be used together with operators if the str only contains 7-bit ASCII characters.

Use helper functions to ensure that the inputs you operate on are the type of character sequence you expect (8-bit values, UTF-8 encoded characters, Unicode characters, etc.).

If you want to read or write binary data to/from a file, always open the file using a binary mode (like 'rb' or 'wb').
