For the last few months I’ve been working on a kind of CMS for proteomics results using pyramid (in reality I started it with repoze.bfg, which became pyramid after joining the pylons project).
My experience with pyramid has been really smooth until I had to write a form to parse huge input files in order to build experiment sites.
In a typical add-form view I would return a response redirecting to the newly created
model resource. But in this case the proteomics files can be quite large, so I needed a way for the view to return a response while the files were still being parsed. Here is a simplified representation of how I made it work:
```python
from multiprocessing import Process, Value

# Shared byte flag: 0 = idle, 1 = parsing (visible from both processes)
is_parsing = Value('B', 0)

def parse_experiment_files():
    print "Parsing started"
    import time; time.sleep(10)  # stand-in for the real, slow parsing
    print "Parsing is done"
    is_parsing.value = False

def add_experiment(request):
    if not is_parsing.value:
        is_parsing.value = True
        Process(target=parse_experiment_files).start()
    # return the response right away, without waiting for the worker
```
Here I’m launching another process to do the parsing while the web app returns the response without the parsing being done. The is_parsing variable is shared between both processes (no need for a global statement). In my case I only want one parsing process running at a time, so I had no need for locks or queues.
In the template of this view I can either offer a form to create an experiment or inform the user that an experiment site is already being built. While one is building, I can have the browser poll to check whether the parsing is done.
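The polling endpoint can be a tiny status view. A minimal sketch, assuming a pyramid view with a JSON renderer (the view name and JSON shape are my own, not from the real app):

```python
import json
from multiprocessing import Value

is_parsing = Value('B', 0)  # the same shared flag the add view flips

def parsing_status(request):
    # The browser polls this view; with pyramid's 'json' renderer the
    # returned dict would be serialized for the client automatically.
    return {'parsing': bool(is_parsing.value)}
```

The client-side JavaScript then just keeps requesting this URL until `parsing` comes back false and re-enables the form.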
That’s enough in my case: I don’t need to scale to thousands of users parsing multiple files at the same time, but I was curious about how to deal with that problem if I had to. I played a bit with different ideas I was given in the always supportive #repoze channel on freenode.
First, instead of multiple processes I tried OS threads. I know about the infamous GIL, but I wanted to see it with my own eyes. However, I got some intimidating random errors from paster/pyramid that were enough to drive me off that path.
I also monkeypatched the standard library with eventlet so that the OS threads would become green threads. The dummy example above seemed to run fine, but when I tried it in my real application I ran into more cryptic thread errors from the monkeypatched ZODB, which is what I use in my real app. I also tried gevent, with similar results.
If you want to use eventlet or gevent you have to find another storage mechanism that works with green threads. Update: I was monkeypatching incorrectly; Andrey Popp’s comment explains how to do it properly.
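I won’t reproduce Andrey’s comment here, but the usual eventlet pattern is to patch at the very top of the entry point, before anything else gets a chance to import the blocking stdlib modules. A rough sketch of my own (with a guarded import so it also runs where eventlet is absent):

```python
# Monkeypatching must happen before anything else imports socket,
# threading, etc., so it belongs at the top of the entry script,
# not inside a view module.
try:
    import eventlet
    eventlet.monkey_patch()  # swap stdlib blocking modules for green ones
    patched = True
except ImportError:
    patched = False  # eventlet not installed; running unpatched

# Only after patching do we import the rest of the app:
import time  # when patched, time.sleep cooperatively yields

time.sleep(0)  # harmless either way; yields to other green threads if patched
```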
Long polling is another potential source of problems when scaling, especially if you would like to add a nice responsive progress bar and some kind of log showing what is being done while parsing.
WebSockets are being regarded as the ultimate solution to this kind of problem. For the sake of argument, let’s pretend websockets will be supported by all major browsers soon.
How can websockets be handled in pyramid? It turns out that dealing with websockets within the WSGI protocol is messy. However, eventlet and gevent have ways to get websockets working within WSGI. Theoretically you could run a monkeypatched pyramid application behind gunicorn, which can make the websocket accessible in request.environ in pyramid. There are still some websocket protocol tasks (i.e. handshaking, closing the socket, etc.) that would make it hard to write something looking like a normal pyramid view.
But it happens that Ben Ford has already written a wrapper to take care of that problem: stargate (as he says, communication for pyramids). With stargate, you create a websocket view in your pyramid app by subclassing a base class that deals with the minutiae of the websocket protocol (up to v76, the latest version at the time of writing). In a websocket view, instead of returning anything, you just write a handler to catch what comes in from the websocket. The great advantage of stargate is that you don’t need to run another process to deal with websockets; you can handle them from within pyramid. Additionally, stargate has 100% unit test coverage and some documentation.
While websockets look like the way forward, I think it’s going to take some time for them to become mainstream in all browsers, especially after Mozilla announced it won’t support websockets in the next release of Firefox because of security issues.
However, with node.js it seems event-driven web frameworks are finally becoming mainstream, bringing along projects like Socket.IO. Socket.IO provides an abstraction layer for writing event-driven web applications: the developer writes the app the same way regardless of what the browser supports, be it websockets, long polling or Flash.
Although Socket.IO is primarily meant to be used with node.js, there is something available for the server side in Python: SocketTornad.IO, built on top of the Tornado web framework. In spite of Tornado having some WSGI support, I’m afraid it won’t be easy to keep the async features when running in WSGI mode.
If I had to support many concurrent users in a highly responsive application right now, I would probably ditch pyramid and go directly with SocketTornad.IO. Perhaps I would still use pyramid for the non-async part and have a front web server dispatching requests accordingly.
But it turns out that this is just a fun thought experiment: the multiprocess solution is fine for me because, like most web developers or bioinformaticians, I don’t need (for now) to write highly responsive applications for thousands of users.
Update: Marius Gedminas pointed out a better way to do this with locks. I will leave the code snippet using Value because it is quite illustrative, but you shouldn’t use Value if you want to do something similar; instead, check the better example I wrote in another post.
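I won’t reproduce the example from that other post, but the core of the lock-based idea is a non-blocking acquire, which gives you an atomic test-and-set instead of the racy check-then-assign on Value. A minimal sketch (the function names are mine):

```python
from multiprocessing import Lock

parse_lock = Lock()

def try_start_parsing():
    # acquire(block=False) returns immediately: True if we won the
    # lock (and may start parsing), False if a parse is already running.
    return parse_lock.acquire(block=False)

def parsing_finished():
    # The worker calls this when done so the next parse can start.
    parse_lock.release()
```

Unlike `if not is_parsing.value: is_parsing.value = True`, two requests can never both slip past the check here, because the test and the set happen as one atomic operation inside the lock.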
Update: check out pyramid_socketio for a newer take on async apps.