Python, like most programming languages, has certain behaviors that can confuse anyone who is new to the language. This appendix contains an overview of the Python features that are most important to understand for anyone who wants to create Django applications and who is already familiar with another programming language (e.g. Ruby, PHP).
In this appendix you'll learn about: Python strings, Unicode and other annoying text behaviors; Python methods and how to use them with default, optional, *args and **kwargs arguments; Python classes and subclasses; Python loops, iterators and generators; Python list comprehensions, generator expressions, maps and filters; as well as how to use the Python lambda keyword for anonymous methods.
Strings, Unicode and other annoying text behaviors
Working with text is so common in web applications that you'll eventually be caught by some of the not-so-straightforward ways Python interprets it. First off, beware that there are considerable differences in how Python 3 and Python 2 work with strings.
Python 3 provides an improvement over Python 2, in the sense there are just two instead of three ways to interpret strings. But still, it's important to know what's going on behind the scenes in both versions so you don't get caught off-guard working with text. Listing A-1 illustrates a series of string statements run in Python 2 to showcase this Python version's text behavior.
Listing A-1. Python 2 literal unicode and strings
Python 2.7.3 (default, Apr 10 2013, 06:20:15)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> 'café & pâtisserie'
'caf\xc3\xa9 & p\xc3\xa2tisserie'
>>> print('\xc3\xa9')
é
>>> print('\xc3\xa2')
â
The first action in listing A-1 shows the default Python encoding, ascii, which is the default for all Python 2.x versions. In theory, this means Python is limited to representing 128 characters, which are the basic letters and characters used by all computers -- see any ASCII table for details[1]. This is just in theory though, because you won't get an error when attempting to input a non-ASCII character in Python.
If you create a string statement with non-ASCII characters like 'café & pâtisserie', you can see in listing A-1 that the é character is output as \xc3\xa9 and the â character is output as \xc3\xa2. These outputs, which appear to be gibberish, are actually the UTF-8 byte sequences that encode the é and â characters, respectively. So take note that even though the default Python 2 encoding is ASCII, a plain Python 2 string holds non-ASCII characters as the raw bytes it receives, which in this terminal session are UTF-8 encoded bytes.
Next in listing A-1 you can see that using the print() statement on either of these byte sequences outputs the expected é or â characters. Behind the scenes, Python 2 offers the convenience of inputting non-ASCII characters in an ASCII encoding environment by storing plain strings as their encoded bytes, which here are UTF-8. To confirm this behavior, you can use the decode() method, as illustrated in listing A-2.
Listing A-2. Python 2 decode unicode and u'' prefixed strings
Python 2.7.3 (default, Apr 10 2013, 06:20:15)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 'café & pâtisserie'.decode('utf-8')
# Outputs: u'caf\xe9 & p\xe2tisserie'
>>> print(u'\xe9')
# Outputs: é
>>> print(u'\xe2')
# Outputs: â
In listing A-2 you can see the statement 'café & pâtisserie'.decode('utf-8') outputs u'caf\xe9 & p\xe2tisserie'. So now the same string decoded from UTF-8 converts the é character, or \xc3\xa9 byte sequence, to \xe9 and the â character, or \xc3\xa2 byte sequence, to \xe2. More importantly, notice the output string in listing A-2 is now preceded by a u to indicate a Unicode string.
Therefore the é character can really be represented by both \xc3\xa9 and \xe9. It's just that \xc3\xa9 is the UTF-8 encoded byte representation and \xe9 is the Unicode character (code point) representation. The same case applies for the â character or any other non-ASCII character. The way Python 2 distinguishes between the two representations is by prefixing the string with a u. In listing A-2 you can see calling print(u'\xe9') -- note the preceding u -- outputs the expected é and calling print(u'\xe2') outputs the expected â.
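The same distinction between a character and its UTF-8 bytes can be checked in Python 3, where the two are separate types (str and bytes). The following is a minimal sketch, not one of the book's listings; the variable names are illustrative only.

```python
# Python 3 sketch: a character (str) versus its UTF-8 encoding (bytes).
text = 'é'                          # one Unicode character, code point U+00E9
utf8_bytes = text.encode('utf-8')   # the two bytes 0xC3 0xA9

print(hex(ord(text)))               # the character's code point
print(utf8_bytes)                   # the UTF-8 byte sequence
print(utf8_bytes.decode('utf-8') == '\xe9')  # decoding recovers the same character
```

Here \xe9 inside a str names the character itself, while \xc3\xa9 inside a bytes value is how that character is stored when encoded as UTF-8.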
This Python 2 convenience of allowing non-ASCII characters in an ASCII encoding environment works so long as you don't try to forcibly convert a non-ASCII string that's already loaded into Python into ASCII, a scenario that's presented in listing A-3.
Listing A-3. Python 2 UnicodeEncodeError: 'ascii' codec can't encode character
Python 2.7.3 (default, Apr 10 2013, 06:20:15)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 'café & pâtisserie'.decode('utf-8').encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
In listing A-3 you can see the call 'café & pâtisserie'.decode('utf-8').encode('ascii') throws the UnicodeEncodeError error. Here you're not getting any convenience behavior -- like when you input non-ASCII characters -- because you're trying to encode an already decoded Unicode character (i.e. \xe9 or \xe2) into ASCII, so Python rightfully tells you it doesn't know how to treat characters that are outside of ASCII's 128 character range.
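Python 3 raises the same error when a string with non-ASCII characters is encoded to ASCII. The sketch below (illustrative variable names, not part of the book's listings) catches the exception and keeps the details it reports.

```python
# Python 3 sketch: encoding non-ASCII text to ASCII raises UnicodeEncodeError.
text = 'café & pâtisserie'

try:
    text.encode('ascii')
    error_message = None
    failed_position = None
except UnicodeEncodeError as exc:
    error_message = str(exc)      # human-readable reason
    failed_position = exc.start   # index of the first character that can't be encoded

print(error_message)
print(failed_position)
```

The reported position points at the é, the first character outside ASCII's range.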
You can of course force ASCII output on non-ASCII characters, but you'll need to pass an additional argument to the encode() method, as illustrated in listing A-4.
Listing A-4. Python 2 encode arguments to process Unicode to ASCII
Python 2.7.3 (default, Apr 10 2013, 06:20:15)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 'café & pâtisserie'.decode('utf-8').encode('ascii','replace')
# Outputs: 'caf? & p?tisserie'
>>> 'café & pâtisserie'.decode('utf-8').encode('ascii','ignore')
# Outputs: 'caf & ptisserie'
>>> 'café & pâtisserie'.decode('utf-8').encode('ascii','xmlcharrefreplace')
# Outputs: 'caf&#233; & p&#226;tisserie'
>>> 'café & pâtisserie'.decode('utf-8').encode('ascii','backslashreplace')
# Outputs: 'caf\\xe9 & p\\xe2tisserie'
As you can see in listing A-4, you can pass a second argument to the encode() method to handle non-ASCII characters: the replace argument so the output uses ? for non-ASCII characters; the ignore argument to simply drop any non-ASCII characters; the xmlcharrefreplace argument to output the XML character reference of the non-ASCII characters; or the backslashreplace argument to output a backslash-escaped reference for each non-ASCII character.
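The same four error handlers exist in Python 3, where encode() returns bytes. The following sketch, with an illustrative results dictionary that is not part of the book's listings, collects the output of each handler so they can be compared side by side.

```python
# Python 3 sketch: the encode() error handlers applied to the same text.
text = 'café & pâtisserie'
handlers = ('replace', 'ignore', 'xmlcharrefreplace', 'backslashreplace')

# Map each handler name to the ASCII bytes it produces.
results = {handler: text.encode('ascii', handler) for handler in handlers}

for handler, encoded in results.items():
    print(handler, encoded)
```

Note that only xmlcharrefreplace and backslashreplace keep enough information to recover the original characters; replace and ignore lose them.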
Finally, listing A-5 illustrates how you can create Unicode strings in Python 2 by prefixing them with the letter u.
Listing A-5. Python 2 Unicode strings prefixed with u''
Python 2.7.3 (default, Apr 10 2013, 06:20:15)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'café & pâtisserie'
u'caf\xe9 & p\xe2tisserie'
>>> print(u'caf\xe9 & p\xe2tisserie')
café & pâtisserie
In listing A-5 you can see the u'café & pâtisserie' statement. By prefixing the string with u you're telling Python it's a Unicode string, so the output for the characters é and â is \xe9 and \xe2, respectively. And by calling the print statement on the output for this type of string preceded by u, the output contains the expected é and â letters.
Now let's explore how Python 3 works with Unicode and strings in listing A-6.
Listing A-6. Python 3 unicode and string
Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> 'café & pâtisserie'
'café & pâtisserie'
As you can see in listing A-6, the encoding is UTF-8, which is the default for all Python 3.x versions. Using UTF-8 as the default, with every string treated as Unicode text, makes working with text much simpler. There's no need to worry about how special characters are handled, because everything is handled as Unicode. In addition, because strings are Unicode by default, the leading u on strings is redundant in Python 3: it was removed in Python 3.0 and has been accepted again since Python 3.3 only for compatibility with Python 2 code.
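To see what this means in practice, the sketch below (illustrative names, not from the book's listings) compares a Python 3 string with its encoded form. A str is measured in characters, while the bytes produced by encode() are measured in bytes.

```python
# Python 3 sketch: str holds Unicode characters, bytes holds encoded data.
text = 'café'
encoded = text.encode('utf-8')       # explicit conversion from text to bytes
round_trip = encoded.decode('utf-8') # and back from bytes to text

print(type(text).__name__, len(text))        # length in characters
print(type(encoded).__name__, len(encoded))  # length in bytes (é takes two)
print(round_trip == text)
```

Because the conversion between the two types is always explicit in Python 3, the implicit mixing of bytes and text seen in the Python 2 listings doesn't occur.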
Next, let's move on to explore the use of Python's escape character and strings. In Python, the backslash \ character is the escape character and is used to escape the special meaning of a character and declare it as a literal value.
For example, to use an apostrophe in a string delimited by apostrophes (single quotes), you would need to escape the apostrophe so Python doesn't confuse where the string ends (e.g. 'This is Python\'s "syntax"'). A more particular case of using Python's backslash is on those special characters that use a backslash themselves. Listing A-7 illustrates various strings that use characters composed of a backslash so you can see this behavior.
Listing A-7. Python backslash escape character and raw strings
>>> print("In Python this is a tab \t and a line feed is \n")
In Python this is a tab 	 and a line feed is 

>>> print("In Python this is a tab \\t and a line feed is \\n")
In Python this is a tab \t and a line feed is \n
>>> print(r"In Python this is a tab \t and a line feed is \n")
In Python this is a tab \t and a line feed is \n
In the first example in listing A-7 you can see the \t sequence is converted to a tab and the \n sequence to a line feed (i.e. new line). This is how these characters are written in Python source code: a tab as a backslash followed by the letter t, and a line feed as a backslash followed by the letter n. As you can see in the second example in listing A-7, in order for Python to output the literal value \t or \n you need to add another backslash -- which is after all Python's escape character.
The third example in listing A-7 is the same string as the previous ones, but it's preceded by r to make it a Python raw string. Notice that even though the special characters \t and \n are not escaped, the output is like the second example with escaped characters.
This is what's special about Python raw strings. By preceding a string with r, you tell Python to interpret backslashes literally, so there's no need to add another backslash like the second example in listing A-7.
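A quick way to confirm that a raw string and a doubly escaped string hold the same characters is to compare them directly. This is a minimal sketch with illustrative variable names.

```python
# Sketch: escaped, raw and interpreted versions of the same literal text.
escaped = "tab: \\t line feed: \\n"      # backslashes escaped by hand
raw = r"tab: \t line feed: \n"           # raw string, backslashes kept as-is
interpreted = "tab: \t line feed: \n"    # \t and \n become single control characters

print(escaped == raw)              # same characters either way
print(len("\t"), len(r"\t"))       # one tab character versus backslash plus t
print(repr(interpreted))           # repr() shows the control characters as escapes
```

The raw string and the escaped string are the same value; the r prefix only changes how the source code is read, not the type of the resulting string.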
Python raw strings can be particularly helpful when manipulating strings with a lot of backslashes. And one particular case of strings that rely a lot on backslashes are regular expressions. Regular expressions are a facility in almost all programming languages to find, match or compare strings to patterns, which makes them useful in a wide array of situations.
The crux of using Python and regular expressions together is that they both give special meaning to backslashes, a problem that even the Python documentation calls The Backslash Plague[2]. Listing A-8 illustrates this concept of the backslash plague and raw strings in the context of Python regular expressions.
Listing A-8. Python backslash plague and raw strings with regular expressions
>>> import re
# Attempt to match literal '\n', (equal statement: re.match("\\n","\\n") )
>>> re.match("\\n",r"\n")
# Attempt to match literal '\n', (equal statement: re.match("\\\\n","\\n") )
>>> re.match("\\\\n",r"\n")
<_sre.SRE_Match object at 0x7fedfb2c7988>
# Attempt to match literal '\n', (equal statement: re.match(r"\\n","\\n") )
>>> re.match(r"\\n",r"\n")
<_sre.SRE_Match object at 0x7fedfb27c238>
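In listing A-8, we're trying to find a regular expression to match a literal \n -- in Python syntax this would be r"\n" or "\\n". Since regular expressions also use \ as their escape character, the first logical attempt at a matching regular expression is "\\n", but notice this first attempt in listing A-8 fails. Python turns "\\n" into the two characters \n before the regular expression engine sees them, and the engine then interprets \n as a line feed rather than as a literal backslash followed by n.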
Because we're defining a regular expression inside a Python string, the regular expression needs two backslashes to match one literal backslash, and each of those must in turn be escaped for Python, bringing the total to four backslashes! As you can see in listing A-8, the regular expression that matches a literal \n is the second attempt "\\\\n".
As you can see in this example, dealing with backslashes in Python and in the context of regular expressions can lead to very confusing syntax. To simplify this, the recommended approach to define regular expressions in Python is to use raw strings so backslashes are interpreted literally.
In the last example in listing A-8, you can see the regular expression r"\\n" matches a literal \n and is equivalent to the more confusing regular expression "\\\\n".
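The three attempts can be reproduced in Python 3 with the results stored in variables, as in the sketch below. The variable names are illustrative, and the last line uses re.escape(), a standard library helper that isn't covered in the listing, as one more way to build a pattern that matches text literally.

```python
import re

target = r"\n"   # two characters: a backslash followed by n

# Pattern is backslash + n, which the regex engine reads as a line feed.
no_match = re.match("\\n", target)
# Four backslashes in a regular string: the engine sees an escaped backslash, then n.
four_backslashes = re.match("\\\\n", target)
# The same pattern written as a raw string.
raw_pattern = re.match(r"\\n", target)
# re.escape() adds the regex-level escaping for you.
escaped_pattern = re.match(re.escape(target), target)

print(no_match)
print(four_backslashes is not None, raw_pattern is not None, escaped_pattern is not None)
```

The first pattern isn't wrong as a regular expression. It simply matches a different thing: an actual line feed character, not the two-character text \n.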
Note: Python's escape character and raw string behavior is the same in both Python 2 and Python 3.