Exocet: A Second Look

So what I didn't really talk about last time is that more than just letting you directly express module loading, Exocet also implements:

Parameterized Modules



In Python, classes and functions take parameters, but modules don't. When you load a module, you don't have any opportunity to tell it what you want, or what context it's being loaded in. A lot of times, this matters.

For example, some code has optional dependencies. This idiom can be seen a lot in some parts of Twisted:

try:
from OpenSSL import SSL
except ImportError:
SSL = None


Code following this checks if 'SSL' is None to decide whether to define methods and classes that provide support for SSL connections in Twisted.


Other code can depend on one of multiple providers of an interface. Here's the famous example from the docs for the magnificent lxml library:


try:
from lxml import etree
print("running with lxml.etree")
except ImportError:
try:
# Python 2.5
import xml.etree.cElementTree as etree
print("running with cElementTree on Python 2.5+")
except ImportError:
try:
# Python 2.5
import xml.etree.ElementTree as etree
print("running with ElementTree on Python 2.5+")
except ImportError:
try:
# normal cElementTree install
import cElementTree as etree
print("running with cElementTree")
except ImportError:
try:
# normal ElementTree install
import elementtree.ElementTree as etree
print("running with ElementTree")
except ImportError:
print("Failed to import ElementTree from any known place")


The effect of this code is to try to import, in order, one of:

  1. lxml
  2. cElementTree from the stdlib
  3. ElementTree from the stdlib
  4. ElementTree installed separately
  5. cElementTree installed separately


I don't think it's a stretch to say this is rather silly. How would you feel if you saw a function that tried to access five different global variables in a row in order to decide what to do?

And though all of these modules implement the same interface, they're still different code, and you might hit some edge case where their behaviour differs. How do you test your code using each possible ElementTree implementation? As Glyph points out, it's always sunny in Python, and every piece of bad code can be worked around by writing worse code; you could have your unit tests fool around with sys.modules. But can't there be something better?

Here's a different way of thinking about it entirely.

In languages that aren't as good as Python, dependency injection is a technique that gets used to deal with this. Dependency injection has many forms and can be rather complicated, but the general idea is code declares the thing it needs, and something else (unfortunately, often an XML file!) describes what objects to provide to satisfy those dependencies.


With Exocet, we hijack the import statement to describe named parameters, indicating the dependencies our code has.
from exocet.parameters import etree

x = etree.parse(open("mydata.xml"))


Now, if you look in the Exocet tarball, you won't find a parameters.py file. This name doesn't correspond to anything on the filesystem.

So how does this help us? Well, it means you can load your ElementTree-using module like this:
m = exocet.pep302Mapper.withOverrides({"exocet.parameters.etree",
xml.etree.cElementTree})
my_etree_using_module = exocet.loadNamed("my_etree_using_module", m)


Or this:
m = exocet.pep302Mapper.withOverrides({"exocet.parameters.etree",
lxml.etree})
my_etree_using_module = exocet.loadNamed("my_etree_using_module", m)


This way, you can test your code without having to resort to sys.modules hackery, and you can better factor your applications by separating configuration and environment concerns from the rest of your code.

As you may recall from last time, pep302Mapper provides the default Python implementation of module loading. In this case, we've just added another name that can be imported, exocet.parameters.etree.

Being able to provide parameters to modules opens up the possibility to eliminate all use of global objects from your code, and pass objects only to the code that needs them. I believe that with experimentation and refinement of these tools, this technique is going to enable a lot of simpler methods for organizing complex applications and reduce a lot of complications people have to deal with in Python now.

Introducing Exocet

Last time I talked about the deficiencies of Python's module system. Now I'd like to talk about a solution to them.

There are two questions related to Python's module system that come up repeatedly on #python, and don't have obviously good answers.


  1. How do I reload a module?

  2. How do I create a plugin system?

Exocet was written primarily to answer these questions.

Download Exocet 0.5

What is Exocet?


Exocet is a new way to load Python modules. It separates the act of naming a dependency from the act of creating a module object. As a result, more than one instance of a module can be created from the same source file. Also, when creating a module object, precise control can be exerted over what it's allowed to import.

Let's start with a code example.

>>> import exocet, httplib 
>>> urllib = exocet.loadNamed("urllib", exocet.pep302Mapper)

>>> def HTTP_Ex(host):
... print "making HTTP connection to", host
... return httplib.HTTP(host)
...

>>> class _ModuleProxy(object):
... def __getattribute__(self, name):
... if name == 'HTTP':
... return HTTP_Ex
... else:
... return getattr(httplib, name)

>>> httplib_plus = _ModuleProxy()
>>> overriddenMap = exocet.pep302Mapper.withOverrides({"httplib": httplib_plus})
>>> urllib_plus = exocet.loadNamed("urllib", overriddenMap)

>>> print urllib.urlopen("http://python.org").read(121)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

>>> print urllib_plus.urlopen("http://python.org").read(121)
making HTTP connection to python.org
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Quick breakdown of what's going on here: loadNamed takes a module name and a mapper object, and returns a new module object. Note that this is very different from what import does; each invocation of loadNamed returns a new module object, unrelated to any previous ones.

The mapper object is responsible for intercepting all import calls made by the module being loaded. The pep302Mapper object used here just invokes Python's normal importing behaviour, caching loaded modules in sys.modules. But its withOverrides method creates a new mapper, one that looks in a dict first. The urllib_plus module is almost like the urllib module in this example, with one difference: the latter got the real httplib module when it executed the statement import httplib; urllib_plus got the httplib_plus module when it ran that statement. The other mapper that Exocet includes by default is emptyMapper, in which all import statements will fail. You can construct your own, providing any object under any name to be available to import statements in modules you load.

So after constructing these two modules, they're usable as normal. The only difference is that urllib_plus invokes our wrapper function when it calls httplib.HTTP(), producing the printed line before proceeding with its work.

So, how does this behaviour answer the questions I mentioned earlier?

Module reloading


As we saw last time, reload isn't suitable for real-world use; it changes things around too much in some places, not enough in others. With Exocet, you don't have to reload modules because you can just load them. When you want a new module with updated code, you just load it and have a new module object. All the old objects still exist as long as you want them to; there's no orphaned-instance problem. If you need new versions of the module's dependencies you can load them first and put them in the mapper so that they're visible to your new module. Old instances, classes, and modules can be removed in the normal way — by Python's garbage collection, when nothing uses them any longer.

Plugin systems


The normal use of Python packages and modules involves importing specific ones by name. However, sometimes you want to load a bunch of modules and call something in each. Since Exocet borrows code from twisted.python.modules, it's possible to iterate over a package and load each module in it:

>>> import exocet 
>>> [exocet.load(x, exocet.pep302Mapper) for x in
... exocet.getModule("twisted.plugins").iterModules()]
[<module 'twisted.plugins.cred_anonymous' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/cred_anonymous.py'>,
<module 'twisted.plugins.cred_file' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/cred_file.py'>,
<module 'twisted.plugins.cred_memory' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/cred_memory.py'>,
<module 'twisted.plugins.cred_unix' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/cred_unix.py'>,
<module 'twisted.plugins.twisted_conch' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_conch.py'>,
<module 'twisted.plugins.twisted_ftp' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_ftp.py'>,
<module 'twisted.plugins.twisted_inet' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_inet.py'>,
<module 'twisted.plugins.twisted_lore' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_lore.py'>,
<module 'twisted.plugins.twisted_mail' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_mail.py'>,
<module 'twisted.plugins.twisted_manhole' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_manhole.py'>,
<module 'twisted.plugins.twisted_names' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_names.py'>,
<module 'twisted.plugins.twisted_news' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_news.py'>,
<module 'twisted.plugins.twisted_portforward' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_portforward.py'>,
<module 'twisted.plugins.twisted_qtstub' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_qtstub.py'>,
<module 'twisted.plugins.twisted_reactors' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_reactors.py'>,
<module 'twisted.plugins.twisted_runner' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_runner.py'>,
<module 'twisted.plugins.twisted_socks' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_socks.py'>,
<module 'twisted.plugins.twisted_telnet' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_telnet.py'>,
<module 'twisted.plugins.twisted_trial' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_trial.py'>,
<module 'twisted.plugins.twisted_web' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_web.py'>,
<module 'twisted.plugins.twisted_web2' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_web2.py'>,
<module 'twisted.plugins.twisted_words' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_words.py'>]


In this case we used the default pep302Mapper since we wanted plugin modules to be able to import anything they wanted. However, a different mapper could be provided which provided different modules, limited them to only importing certain ones, or none at all.

Unit testing



Hardly anybody asks about this on #python, but they should. By loading the modules you test with Exocet, you can replace their dependencies with stubs without monkeypatching. Just provide overrides to the mapper object when loading the module, and you have a stubbed or mocked module to test without altering the behaviour of other tests.

Next steps


Exocet lets you load most existing Python modules, but in a way that gives much better control over what happens. This makes room for radically different ways of thinking about how to organize Python programs. Since this approach is so different, practice is needed to come up with new idioms for how to use it effectively. I've tried carefully to provide mechanism here with very little policy. There are a few kinks to still work out in how Exocet works in more complicated scenarios, but give it a try! It might be just the thing for your next IRC bot. ;-)

If you want to help with Exocet development, it happens here, on Launchpad. I look forward to your feedback.

Modules in Python: the Good, the Bad, and the Ugly

I've spent a lot of time living with Python's module system, both in my own
work and in helping people on Freenode's #python channel. A lot of Python's
power comes from its module system; however, it could be better. It can be
hard to think about how modules could be done differently, since it's so
central to the design of Python software, but it's worth the effort. Here's
some stuff I've been thinking about.



The Good




No global namespace for objects

This is the benefit of module systems in general and it's a big one. It's what I miss the most when writing in C, for example (And until very recently — Javascript!) Being able to treat a source file as an actual container for code instead of an arbitrary pile of functions and variables with no structure makes code much easier to comprehend and navigate.

Python modules are easy to write and to import

Particularly I'm comparing this to Scheme 48 and ML, which both have very well-designed and powerful module systems, but they're rather confusing to the newcomer because they require a good bit of up-front knowledge to construct a module that's useful to anyone else. In Python, you just stick some code in a file and then all the names in it are importable. My earliest Python memory was joining #python and asking "how do I import some code I wrote in one file into another"? I was told 'for a file named foo.py, use "import foo"'. My reaction was "Really? That's all?" Providing a low barrier to entry for creating and using modules is an extremely powerful advantage of Python.



The Bad



Modules are in a global namespace

Although modules contain classes, functions, etc., there's no containment hierarchy for module names themselves. Different modules can have functions and classes with the same name in them, but there's nothing that can contain multiple modules with the same name. This shows up as a problem when you want to write unit tests that use fake versions of some modules, for example. When faking a function or a class, one creates a new version. Modules generally have to be modified rather than replaced, since import looks up modules in the global module namespace.

PYTHONPATH is a rather inflexible way to organize modules

Organizing modules by location in the filesystem is a great way to get started, but it's not the only possible thing one might want. This deficiency has been addressed in recent Pythons via the PEP 302 import hooks. However...

PEP302 hooks help, but aren't enough by themselves

The canonical example of alternate module organization is putting them in a zip file, which Python supports via the standard import hooks now. Now you have extra problems, though. Python packages are a good way to organize modules, but they don't provide a way to enumerate their contents. To work around this, everybody looks at the filesystem layout to determine what's in a package. But if your modules aren't being loaded directly from the filesystem, this approach won't work.



The Really Bad



Modules are singletons (i.e., global mutable state)

This is the dark secret at the heart of any large-scale Python project. One can be very careful about organizing one's state into instances and so forth, but all modules are still visible and modifiable by any code at any time.

Still easy to write unreadable code via monkey-patching

It's easy and convenient to assign to module attributes any time you feel like it. The result is that any time you see "from foo import someObject", you can't every be sure about where that object was defined unless you read all the source code in the application. Even when it's desirable to change module contents (such as for tests), it's easy to fail to do so in a way that doesn't introduce dependencies or conflicts between tests. The classic example is calling some function that initializes module globals from a config file; if one test does it, it can cause tests run after it to fail or incorrectly succeed.

reload()

The reload function is a symptom of all the above problems. Its inspiration is obvious: loading code that's changed since the current Python process has started is an entirely sensible idea. However, Python's assumptions about how modules work makes this rather difficult to do in a sensible manner. It's common to create new lists rather than modify old ones when a new version of some data is wanted. This convention is reinforced by the ease by which list comprehensions can be used to do this job. The convention encouraged by the existence of reload is exactly opposite, though — instead of creating a new module object, the old one is emptied and refilled with fresh objects. The result is that instances of classes in that module are orphaned; the class they were instantiated from can't be reached by its name. Also, it only reloads a single module; no help is provided in updating modules that depend on it, or updating its own dependencies. Figuring out which modules to reload or not reload at any given time is often very tricky. Plenty of other corner cases exist, such as reinitialization of function default arguments, and so forth. Because of all this, the standard advice on #python is that "reload will not make you happy".



What Now?


So with these problems identified in how Python handles modules, can anything be done?
Well, that's why I wrote Exocet. More about that next time.

Unicode in Python, and how to prevent it



[UPDATE 16 Aug 2011] Armin Ronacher has written a nice module called unicode-nazi that provides the Unicode warnings I discuss at the end of this article.

Though I can't use Python 3 for any of my projects, it does have a few nice things. One particular behaviour where it improves on Python 2 is forbidding implicit conversions between byte strings and Unicode strings. For example:
Python 3.1.2 (release31-maint, Sep 17 2010, 20:34:23)

[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 'foo' + b'baz'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: Can't convert 'bytes' object to str implicitly

If you do this in Python 2, it invokes the default encoding to convert between bytes and unicode, leading to manifold unhappinesses. So in Python 2, the above example looks like:
Python 2.6.6 (r266:84292, Sep 15 2010, 15:52:39)

[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'foo' + b'baz'
u'foobaz'

This looks OK, but problems lurk just below the surface. The default encoding used in nearly all circumstances is ASCII, and this conversion will blow up if non-ASCII bytes are involved.
>>> u'foo' + b'\xe2\x98\x83'

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

This is only a 3, at best 3.5 on the Russell scale of API misusability. Even worse, this conversion is done for any function implemented in C that expects unicode or bytes and receives the other type. For example, unicode() converts byte strings to unicode strings, and if not given an explicit encoding argument, uses the default encoding.
>>> unicode('bob')

u'bob'
>>> unicode('\xe2\x98\x83')
Traceback (most recent call last):
File "<stdin>, line 1, in
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

The right way to do it is call .decode() on the byte string.


>>> '\xe2\x98\x83'.decode('utf8')
u'\u2603'

Just be sure you don't call it on a unicode string by mistake, or you'll get very confused:
>>> u'foo'.decode('utf8')

u'foo'
>>> u'\u2603'.decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/usr/lib/python2.4/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2603' in position 0: ordinal not in range(128)

Hey! That error says "encode", but we asked for it to decode!

It turns out that since the utf8 decoder expects bytes as input, but we gave it unicode... Python helpfully tries to convert the unicode characters into bytes, using the default encoding! This is why the first decode call succeeds - all the characters in it can be converted to ASCII bytes.

Is there any hope?


Since Unicode was added to Python in 2.1, there's been a way to get Python 3's behavior. The site module in the Python standard library sets Python's default encoding on startup. If edited to use "undefined" instead of "ascii", the above examples all fail instead of converting:



>>> u'foo' + baz

Traceback (most recent call last):
File "<stdin>", line 1, in
File "/usr/lib/python2.6/encodings/undefined.py", line 22, in decode
raise UnicodeError("undefined encoding")
UnicodeError: undefined encoding


So why can't we switch to this today? Well, for one thing, lots and lots of code depends on this implicit encoding/decoding. Even code in the standard library:


Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/re.py", line 190, in compile
return _compile(pattern, flags)
File "/usr/lib/python2.6/re.py", line 243, in _compile
p = sre_compile.compile(pattern, flags)
File "/usr/lib/python2.6/sre_compile.py", line 506, in compile
p = sre_parse.parse(p, flags)
File "/usr/lib/python2.6/sre_parse.py", line 672, in parse
source = Tokenizer(str)
File "/usr/lib/python2.6/sre_parse.py", line 187, in __init__
self.__next()
File "/usr/lib/python2.6/sre_parse.py", line 193, in __next
if char[0] == "\\":
File "/usr/lib/python2.6/encodings/undefined.py", line 22, in decode
raise UnicodeError("undefined encoding")
UnicodeError: undefined encoding


Not to mention loads of existing third-party apps and libraries. So what can we do? Fortunately, Python allows registering new codecs. I've written a slight variation on the ASCII codec, ascii_with_complaints, which preserves the default Python behaviour, but also produces warnings.


>>> import re
>>> re.compile(u'\N{SNOWMAN}+')
/usr/lib/python2.6/sre_parse.py:193: UnicodeWarning: Implicit conversion of str to unicode
if char[0] == "\\":
/usr/lib/python2.6/sre_parse.py:418: UnicodeWarning: Implicit conversion of str to unicode
if this and this[0] not in SPECIAL_CHARS:
/usr/lib/python2.6/sre_parse.py:421: UnicodeWarning: Implicit conversion of str to unicode
elif this == "[":
/usr/lib/python2.6/sre_parse.py:478: UnicodeWarning: Implicit conversion of str to unicode
elif this and this[0] in REPEAT_CHARS:
/usr/lib/python2.6/sre_parse.py:480: UnicodeWarning: Implicit conversion of str to unicode
if this == "?":
/usr/lib/python2.6/sre_parse.py:482: UnicodeWarning: Implicit conversion of str to unicode
elif this == "*":
/usr/lib/python2.6/sre_parse.py:485: UnicodeWarning: Implicit conversion of str to unicode
elif this == "+":
<_sre.SRE_Pattern object at 0xb76d6f80>


Hopefully this will be useful as a tool for ferreting out those Unicode bugs waiting to break your application, as well.

The true blue fate with an artificial calendar

In case you were wondering, here's what I've been busy with lately.

I haven't given up on Python yet. Stay tuned, big stuff on the way.

Two-thirds slow, one-third amazing



My evident neglect of this site was not intentional. Moving across the country and starting a new job tend to reduce one's available time for open source work, and mine hasn't resulted in anything really worth announcing for the past year (or more). But today that changes!

Download PyMeta 0.4.0



Since the last release of Ecru I have been trying to get rid of its dependency on Python, by porting the E parser to E. In the process of doing so, I realized it was probably a bad idea to try to use a parser whose only form of error reporting was the string "Parse error". Since I'm still more familiar with Python than E, I started implementing error reporting in PyMeta. (Also, Python has a debugger.) This resulted in some significant rewrites of the internals.

So what's different in PyMeta 0.4?

Comments!


With its new space-age technology, PyMeta 0.4 will treat '#' as a comment character, just like Python. (Yes, this was rather overdue.)

Reorganized code generator


Previously, code for grammars was generated by the grammar parser calling methods on a builder object that directly emitted Python code as strings. Now the grammar parser builds a tree, which is then consumed by a code generator. If you want to generate something other than Python (or change how Python code gets generated), it should be a lot simpler now. Look at pymeta.builder.PythonWriter for specifics, specifically the generate_ methods.

Error tracking and reporting


This is the big one. Previously, PyMeta expressions returned a value or raised an error. Now, each expression evaluated by the parser now returns a value and an error, even if a successful parse was found. If parse failure occurs, an error still gets raised. Combining expressions with "|" returns the error from furthest into the input and combines ties.

So the result of this is that you can now get nicely formatted output telling you where stuff went wrong and how. Here's an example, based on the old TinyHTML parser:


When you mismatch tags, the parser notices:


Notice here that the information in ParseError is structured so that future tools can figure out stuff about what failed and how.

If there's more than one possible valid input at the point of failure, the parser will tell you:


Plans for the Future


There are a few directions I'd like to take PyMeta in the future. A really nice thing would be to have a way to generate grammars ahead of time easily, writing out a Python module. This release includes bin/generate_parser which does a rather naive version of this. The problem is figuring out how to make grammars that are subclasses of something other than just OMetaBase.
Also, with the new code generator setup, it'd be fairly easy to generate Cython instead of Python, resulting in grammars that can be compiled as extension modules, hopefully resulting in much faster parse times. PyMeta isn't meant to be blazingly fast -- PEG parsers aren't known to be the most efficient -- but it'd be nice to squeeze all we can out of it.

Other people have asked for event based parsing and incremental output. Seems like a neat idea... I'd just have to figure out what that means. :-)

Special thanks to Marien Zwart and Cory Dodt for their contributions and encouragement for this release.

Ecru 0.2.0

It's been a bit longer for this release.



The major new additions are message buffering on eventual references and the Ref.whenResolved and Ref.whenBroken methods, as well as the __whenMoreResolved Miranda method. Along with eventual sends and the vat support from the previous release, this is enough to provide support for the when syntax sugar. The Python interface now can control the E event loop: ecru.api.iterate runs a single item in the queue of the vat exposed to Python, and ecru.api.incrementalEval queues an E expression to be run, returning a promise to Python immediately. There's an example IRC-bot REPL in doc/examples/ircbot.py.

Special shout-out to Eric Mangold (aka teratorn) for bugfixes and testing. Thanks Eric!