Danny Navarro's Blog

Stochastic Trains of Thought

Migrating to Octopress

I finally found some time to migrate my blog from Wordpress.com to Octopress. The critical reason to migrate has been the support for nice code syntax highlighting, something I couldn’t have on wordpress.com, at least for free. I know there are very nice wordpress plugins for syntax highlighting, but in order to use them I would have to host the blog myself. I don’t want to go through the hassle of maintaining a typical PHP/MySQL stack or to worry about being slashdotted.

Having worked with an excellent documentation tool like Sphinx, I started looking into static blog generators. It turned out that Manu Viera, a colleague working at Yaco with me, shared the same itch and had already looked at several static web generators in Python, which is our main language at Yaco. Manu found pelican the best candidate, but I still found it a bit immature, not quite something like Jekyll.

Then I found Octopress, a framework built on top of Jekyll with several plugins, including syntax highlighting and automatic support for disqus comments.

The migration from wordpress was not too painful. I used the default Jekyll script to import the wordpress posts and the disqus importer for the comments. After some sed commands I got nicely formatted markdown posts.

I had some trouble in the beginning configuring an isolated Ruby runtime in Arch Linux just for Octopress, but after discovering rbenv everything went smoothly. (I prefer rbenv over RVM: with rbenv I know at any moment what it’s doing.)

Deploying an Octopress generated site to github pages is as easy as pie.

Aside from nice Python syntax highlighting, I now have some extra advantages I didn’t have with wordpress.com:

  • Markdown syntax when writing my posts.

  • I can use the best text editor known to mankind: vim :P

  • My blog data becomes more manageable. If at some point I don’t want to host it on github, I could just push it somewhere else with no modification.

  • I got a very nice default theme for free that, aside from looking good, is also very easy to tweak and maintain.

  • Now I have a good excuse to learn Ruby outside of RoR influence. Ruby is one of those languages I wish I were better at, even if Python remains my main working language.

In any case, I must say the service provided by wordpress.com has been quite good, but this is one of those cases where you have to say: “Sorry, it’s not you, it’s just me”.

Using Custom Events in Pyramid

Pyramid is a WSGI application framework that primarily follows a request-response mechanism. However, if you need to work with events you can still use them. It comes with some default event types that are emitted implicitly by Pyramid as long as you have a subscriber for them. For most applications the default event types are enough, but what if you want to write your custom event type and emit it explicitly from your code? It turns out that the application registry that Pyramid uses by default comes with a handy notify method. Pyramid uses this method internally for its default events. Here is how you would take advantage of it:

from pyramid.events import subscriber

class MyCustomEventType(object):
    def __init__(self, msg):
        self.msg = msg

@subscriber(MyCustomEventType)
def my_subscriber(event):
    print(event.msg)

def my_view(request):
    request.registry.notify(MyCustomEventType("Here it comes"))
    return {}

When running the application, every time a request goes through my_view, an event with a message is emitted, in this case, “Here it comes”. The subscriber then handles the event by printing the message, but it could do anything you want.

Notice that I’m using a decorator to hook my_subscriber. In order for the decorator to work you have to make sure you call the scan method when configuring the application.

Be aware, though, that all these events are synchronous. Because Pyramid is primarily a request-response framework, every event emitted blocks until its subscribers are done. If you want non-blocking events in Pyramid you could spawn a process from the subscriber or come up with some other solution.
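To make the blocking behaviour concrete, here is a minimal sketch that runs without Pyramid. `TinyRegistry` is a hypothetical stand-in for the application registry (it matches subscribers by exact event class only), and `ReportRequested`, `slow_job`, `blocking_subscriber`, `background_subscriber` and `timed_notify` are names made up for the illustration. It uses a thread as the "some other solution"; a process could be started from the subscriber in the same way.

```python
import threading
import time


class TinyRegistry:
    """Hypothetical stand-in for the application registry (illustration only)."""

    def __init__(self):
        self._subscribers = {}

    def add_subscriber(self, event_type, func):
        self._subscribers.setdefault(event_type, []).append(func)

    def notify(self, event):
        # Subscribers run one after another in the caller's thread,
        # so notify() returns only when the last one has finished.
        for func in self._subscribers.get(type(event), []):
            func(event)


class ReportRequested:
    def __init__(self, msg):
        self.msg = msg


handled = []  # messages processed by the slow job
workers = []  # background threads, kept so callers can join them


def slow_job(msg):
    time.sleep(0.3)  # pretend this is expensive work
    handled.append(msg)


def blocking_subscriber(event):
    # Does the work inline: the emitter waits for it.
    slow_job(event.msg)


def background_subscriber(event):
    # Hands the work to a thread and returns immediately.
    worker = threading.Thread(target=slow_job, args=(event.msg,))
    worker.start()
    workers.append(worker)


def timed_notify(subscriber, msg):
    """Return how long notify() took with the given subscriber registered."""
    registry = TinyRegistry()
    registry.add_subscriber(ReportRequested, subscriber)
    start = time.perf_counter()
    registry.notify(ReportRequested(msg))
    return time.perf_counter() - start
```

Calling `timed_notify(blocking_subscriber, "sync")` takes about as long as the job itself, while `timed_notify(background_subscriber, "async")` returns almost at once and the job finishes later in its own thread.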

But events are just one more piece of functionality that Pyramid offers. Pyramid is not an event-oriented framework; if you want to go all the way with async events you should look into Twisted or Tornado.

Why Arch Linux

I have been using Arch Linux for 3 years now. I still use Debian and Ubuntu for the servers I administer, but I acknowledge Arch Linux has taught me many valuable lessons.

With Arch Linux there is very little in your system that you are not aware of. You have to configure everything yourself by editing config files. The process is not that hard because all those configuration files are meant to be tweaked. You can also count on an excellent wiki to help you.

The Arch Linux philosophy doesn’t try to shield the user from complexity with extra layers. Instead it focuses on making direct configuration as simple as possible. For example, writing a proper boot script is much more straightforward than in other distros. At the same time, if you are not careful you have a better chance of really screwing everything up.

Arch Linux aggressively updates from upstream sources. This has the advantages and disadvantages of always being on the bleeding edge. I also like the idea of putting more responsibility for the stability of software on developers than on packagers, as long as you are aware of this as a user. As a user you have to assume the responsibility of being at the cutting edge. Things may not always go smoothly, but you can count on excellent tools to manage chaos.

That brings me to the real killer feature that makes Arch Linux shine over the rest: the packaging system. Pacman, ABS, AUR, makepkg and the PKGBUILD format are just great. You usually don’t have to mess with packaging that much; everything installs nicely and dependencies are correctly handled, especially if you stick to the official repository.

But if you don’t like something about a package or need another version, you have all the tools in place for the creation and introspection of packages without disrupting pacman bookkeeping (pacman is the equivalent of dpkg/apt-get in Debian). Let me illustrate all this with something I had to deal with this week.

I decide to use Compass to make my stormy relationship with CSS smoother. Compass is a Ruby gem, and the usual way to install gems is through Ruby’s packaging system, but I don’t want to mess with the Ruby libraries already installed in the system with pacman. If I install those gems as root, pacman will not be able to keep track of them and everything could break in the future, most importantly without an easy solution.

A way to deal with this issue is to install the Compass gem in some directory and handle the runtime somehow. You usually end up with a new runtime environment for each project you start. There are excellent tools to manage runtimes in Ruby like Rake, but boy, I already have enough managing my Python virtualenvs.

I see that Compass is already in AUR. AUR is a very liberal package repository where anyone can upload source packages. When you install from AUR you usually have to review the PKGBUILD, read the comments of other users and check how many users have voted for the package to be included in the official repositories. With tools like yaourt the whole process is very smooth.

Alright, the ruby-compass PKGBUILD looks good to me so I install it. Now compass is a good system citizen and can be updated, installed and uninstalled through pacman. Compass works as expected, but it turns out that the most interesting feature I wanted to use is only available in the latest version of Compass, and the version in AUR is not the latest one.

No problem, it’ll probably be some version bumps and I’ll be done. I download the PKGBUILD, bump the versions and build the package again but then I realize that the new version depends on new Ruby gems that are not in AUR.

At this point I would normally avoid getting into dependency hell and go for Rake, but wait, I’m using Arch Linux, so let’s see what happens if I continue with the Arch flow.

I take the PKGBUILD of Compass, which is generic enough for any Ruby gem, and use it as a template for the Ruby dependencies. I update licences, versions and checksums, build them and done, everything works. They all come from rubyforge and follow the same building conventions, making my life easy as a packager.

I upload the PKGBUILDs to AUR with just one burp command. Now I can install the latest version of compass through pacman without any issue. I then send my modified version of the PKGBUILD to the original Compass packager, who updates it. That’s it, now anyone can install the latest version of Compass with all its dependencies from AUR. I can now install Compass at home with just one command: yaourt -Sy ruby-compass.

Now I just have to keep an eye on new updates to the dependencies I’m now maintaining in AUR, but rubyforge offers an excellent notification system for gem updates.

That’s it. The whole thing took less than 30 minutes.

I don’t know if nowadays writing a DEB package spec is that hard, I acknowledge I never tried. The tutorials I found about them drove me away when I considered it some years ago.

It’s not only the packaging format itself, there are also the community and policy aspects. Editing your PKGBUILDs is something that every Arch Linux user does. For AUR there is very little regulation, which makes packaging a smoother process at the expense of shifting the trust in the packages to the user. In general, most packages in AUR are good enough, but for production machines I still place more trust in the Debian and Ubuntu package maintainers.

That’s where the open source community shines: you have many choices.

Commenting Out in Chameleon Templates

If you want to prevent Chameleon from rendering some portions of an HTML template you might be tempted to do something like this:


<!-- <div>${context.name}</div> -->

However Chameleon will still evaluate what’s inside the ${…} block even if it’s within an HTML comment. Chameleon must do this because you might want to insert conditional comments.

This dummy tal:condition block will do the job:


<span tal:condition="None"> <div>${context.name}</div> </span>

Chameleon ignores anything inside the condition block.

Moving to Spain

After almost 3 years in The Netherlands working as a proteomics informatician at Albert Heck’s lab, I’m moving to Seville, Spain, to work as a web developer for Yaco Sistemas, a fresh and dynamic open source friendly company.

This is an important shift in my career since I won’t be working on proteomics informatics and academic research anymore. I have mixed feelings about leaving proteomics. On one hand I like the area because there are plenty of tough challenges to be solved. But on the other hand I’m glad I can dedicate all my time to developing web applications that might not be as sophisticated as proteomics software, but that will be immediately useful for the masses. I love web development and the Python community, but within proteomics I could only intersect with the Python web development community quite sporadically. Now I’ll have the chance to be part of it full time.

Personally, The Netherlands is the most comfortable and easy-going country I have ever lived in. Here I had the chance to work with very smart people and made friends that I will never forget. What I have learned during these years is priceless.

But I can’t deny my origins: Spain is where I feel at home, even if sometimes I don’t find it too exciting because I’m too familiar with the culture. However, Seville is quite far from my hometown in the North of Spain. The culture in the South is very different from the North, so in a way I’ll be another foreigner excited about the peculiarities I discover in Andalusian culture.

Async Pyramid Example Done Right

After speaking with Marius Gedminas on freenode, he gave me enough hints to rewrite my previous async view example with locks instead of Value, which is prone to race conditions. I also added a queue to allow jobs to wait to be processed.



import time
from multiprocessing import Process, Lock, Queue

job = 0               # counter used to number submitted jobs
q = Queue(maxsize=3)  # at most 3 jobs can wait at once
lock = Lock()         # held while a worker process is running

def work():
    time.sleep(8)     # simulate a long-running task
    job = q.get()
    print("Job done: {0}".format(job))
    print("Queue size: {0}\n".format(q.qsize()))
    if not q.empty():
        # More jobs are waiting: keep working in this same process
        work()
    else:
        # Nothing left: let the next request start a new worker
        lock.release()

def my_view(request):
    global job
    if not q.full():
        job += 1
        q.put(job)
        # Non-blocking acquire succeeds only if no worker is running
        if lock.acquire(False):
            Process(target=work).start()
            print("Job {0} submitted and working on it".format(job))
        else:
            print("Job {0} submitted while working".format(job))
    else:
        print("Queue is full")
    print("Queue size: {0}\n".format(q.qsize()))
    return {'project':'asyncapp'}

With every request a job is sent. Here the queue accepts 3 jobs. The recursion in work makes sure there is only 1 process working at a time.

I will leave my previous example with Value because it’s easier to understand but this version is much safer.

Update: You can avoid the use of locks by using 2 queues.
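One possible reading of the two-queue idea, sketched under assumptions rather than taken from a tested Pyramid app: a single long-lived worker pulls from a bounded queue of pending jobs and reports on a second queue, so nothing has to track whether a worker is already running. The names `jobs`, `done`, `STOP`, `worker` and `submit` are hypothetical, and a thread with `queue.Queue` stands in for a `Process` with `multiprocessing.Queue`, which offers the same `put`, `get` and `put_nowait` calls.

```python
import queue
import threading

jobs = queue.Queue(maxsize=3)  # pending jobs, bounded like the example above
done = queue.Queue()           # finished jobs reported back by the worker
STOP = None                    # sentinel telling the worker to exit


def worker():
    # A single long-lived consumer: only one job runs at a time,
    # so no lock is needed to guard "is someone already working?".
    while True:
        job = jobs.get()  # blocks until a job (or the sentinel) arrives
        if job is STOP:
            break
        done.put("Job done: {0}".format(job))


def submit(job):
    """Return True if the job was queued, False if the queue is full."""
    try:
        jobs.put_nowait(job)
        return True
    except queue.Full:
        return False
```

In a view, `submit(job)` would replace both the `q.full()` check and the lock, and the worker would be started once when the application starts instead of on demand.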

Async Web Apps With Pyramid?

For the last few months I’ve been working on a kind of CMS for proteomics results using pyramid (in reality I started it with repoze.bfg, which became pyramid after joining the pylons project).

My experience with pyramid has been really smooth until I had to write a form to parse huge input files in order to build experiment sites.

In an average view for an add form I would return a response redirecting to the newly created model resource. But in this case, the proteomics files can be quite large. I needed some way of having the view return a response while the files were being parsed. Here is a simplified representation of how I made it work:

import time
from multiprocessing import Value, Process

# Shared flag (unsigned char): non-zero while a parsing process is running
is_parsing = Value('B', 0)

def parse():
    print("Parsing started")
    time.sleep(10)  # simulate parsing a large file
    print("Parsing is done")
    is_parsing.value = False

def my_view(request):
    if not is_parsing.value:
        is_parsing.value = True
        # Parse in a separate process so the view can return immediately
        Process(target=parse).start()
    else:
        print("Still parsing...")
    return {'project':'asyncapp',
            'is_parsing': is_parsing.value}


Here I’m launching another process to do the parsing while the web app returns the response without the parsing being done. The is_parsing variable is shared between both processes (no need for a global statement). In my case I only want to run one parsing process at a time, so I didn’t have the need for locks or queues.
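One caveat with the flag approach: checking the flag and setting it are two separate steps, so two requests arriving together could both see it as false. A minimal sketch of a check that cannot be split that way, assuming a non-blocking lock is an acceptable replacement for the flag. `parse_lock`, `try_start_parsing` and `finish_parsing` are hypothetical names, and `threading.Lock` stands in for `multiprocessing.Lock` (which accepts the same `acquire(False)` call) to keep the snippet self-contained.

```python
import threading

# Held while a parse is in progress; free otherwise.
parse_lock = threading.Lock()


def try_start_parsing():
    """Return True if this caller won the right to start parsing."""
    # acquire(False) never waits: it takes the lock and returns True,
    # or returns False at once if the lock is already held.
    return parse_lock.acquire(False)


def finish_parsing():
    # Called when parsing ends, so the next request can start a new parse.
    parse_lock.release()
```

The view would call `try_start_parsing()` where it currently tests the flag, and the parsing function would end with `finish_parsing()`.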

In the template of this view I can either offer a form to create an experiment or inform that there is already an experiment site being built. When building I can have the browser polling to check if the parsing is done.

That’s enough in my case: I don’t need to scale to thousands of users parsing multiple files at the same time, but I was curious about how to deal with that problem if I had to. I played a bit with different ideas I was given in the always supportive #repoze channel on freenode.

First, instead of multiple processes I tried OS threads. I know about the infamous GIL but wanted to see it with my own eyes. However, I got some intimidating random errors from paster/pyramid that were enough to drive me off that path.

I also monkeypatched the standard library with eventlet so that the OS threads would become green threads. The dummy example I show above seemed to run fine, but when trying it in my real application I ran into more cryptic thread errors from monkeypatched ZODB, which is what I use in my real app. I also tried gevent with similar results. If you want to use eventlet or gevent you have to find another storage mechanism that works with green threads. Update: I was monkeypatching incorrectly, Andrey Popp’s comment explains how to do it.

Another potential source of problems when scaling is long polling, especially if you would like to add a nice responsive progress bar and a kind of log showing what is being done while parsing.

WebSockets are being regarded as the ultimate solution to this kind of problem. First of all, let’s pretend websockets will be supported by all major browsers soon.

How can websockets be handled in pyramid? It turns out that dealing with websockets within the WSGI protocol is messy. However, eventlet and gevent have ways to have websockets working within WSGI. Theoretically you could run a monkeypatched pyramid application behind gunicorn, which can make the websockets accessible in the request.environ in pyramid. There are still some websocket protocol tasks (e.g. handshaking, closing the socket, etc.) which would make it hard to write something that looks like a normal pyramid view.

But it happens that Ben Ford has already written a wrapper to take care of that problem: stargate (as he says, communication for pyramids). With stargate, you create a websocket view in your pyramid app by subclassing a base class that deals with the minutiae of the websocket protocol (up to v76, the latest version at the time of writing). In a websocket view, instead of returning anything from the view, you just write a handler to catch what is coming from the websocket. The great advantage of stargate is that you don’t need to run another process to deal with websockets; you can handle websockets from within pyramid. Additionally, stargate has 100% unit test coverage and some documentation.

While websockets look like the way forward, I think it’s going to take some time for websockets to become mainstream in all browsers, especially after mozilla announced it won’t support websockets in the next release of firefox because of security issues.

However, with node.js, it seems event driven web frameworks are finally becoming mainstream, bringing projects like Socket.IO. Socket.IO provides an abstraction layer for writing event driven web applications: the developer writes the app the same way regardless of what the browser supports, be it websockets, long polling or Flash.

Although Socket.IO was initially meant to be used with node.js, there is something available for the server side in Python: SocketTornad.IO. It’s built on top of the Tornado web framework. In spite of Tornado having some WSGI support, I’m afraid it won’t be easy to have the async features when in WSGI mode.

If I had to support many concurrent users in a highly responsive application right now, I would probably ditch pyramid and go directly with SocketTornad.IO. Perhaps I would still use pyramid for the non async part and have a front web server dispatching requests accordingly.

But it turns out that this is just a fun thought experiment. The multiprocess solution is fine for me because, like most web developers or bioinformaticians, for now I don’t need to write highly responsive applications for thousands of users.

Update: Marius Gedminas pointed out a better way to do this with locks. I will leave the code snippet using Value because it is quite illustrative, but you shouldn’t use Value if you want to do something similar. Instead, check the better example I wrote in another post.

Update: Check pyramid_socketio for a newer version of async apps.

The Bioinformatics Curse

As a bioinformatician you will be considered a programmer by biologists and a biologist by programmers. When talking with programmers you will suck at programming, when talking with biologists you will suck at biology. Biologists don’t want to know much about computing; understandably, they want to get their job done. Programmers might show some curiosity about biology but tend to shield themselves from its complexity in order to get work done. As a bioinformatician you have to know enough biology to be at the cutting edge, so that what you research continues to be relevant, and keep improving your programming skills, so that you are still productive by the standards expected of a programmer nowadays.

Some influential groups of bioinformaticians try to define the bioinformatics field as if it were precisely the research they are doing, frowning upon bioinformatics research not similar to theirs (or similar but superior to theirs). The followers of these groups try to imitate them so that they can some day become experts in the field. I also see other bioinformaticians gathering together just because they work in biology using a computer, regardless of how little overlap there is in the things they do. It’s like group therapy, sharing experiences with people marginalized for the same reason.

The bioinformatics field is still at its very beginning. The field is very broad and will eventually be fragmented into multiple official fields. Working in an emerging field can be very exciting because you don’t have the constraining rules of an established field. But if social recognition is important to you, think twice before getting into bioinformatics. You’d likely feel out of place wherever you go.

ZSH Prompt for Virtualenv, Git and Bzr

Some of my colleagues are surprised I don’t use NetBeans or Eclipse or some other fat IDE. In fact, I believe the most powerful IDE is UNIX.

I also acknowledge that fine-tuning all your UNIX tools to your exact requirements can be cumbersome. But you don’t need to customize everything from scratch. Open source is great because you can steal what other people are sharing and put it all together as you like.

Lately I have been reading some great posts about customizing ZSH. I consider the shell prompt a fundamental part of the UNIX IDE. In the screenshot below I show how my customized ZSH prompt plays nicely with bzr, git and virtualenvwrapper.

I have barely done anything from scratch. I just stitched together configurations and tips from these sources:

  1. Prompt decorations: http://aperiodic.net/phil/prompt/, http://git.sysphere.org/dotfiles/tree/zshrc
  2. Zenburn theme: http://git.sysphere.org/dotfiles/tree/Xdefaults
  3. git prompt hack: http://briancarper.net/blog/570/git-info-in-your-zsh-prompt
  4. virtualenv prompt hack: http://www.doughellmann.com/docs/virtualenvwrapper/tips.html#zsh-prompt
  5. bzr prompt hack: from scratch imitating git hack.

End results:

  1. .zshrc
  2. virtualenvwrapper postactivate and postdeactivate

Isn’t open source great? I would be flattered if you stole my configuration files too ;)

In another blog post I’ll write about my customizations for Vim, urxvt and awesomewm to reach the ultimate UNIX IDE.

The Value of ‘Wasting’ Your Time

I’m the kind of guy who likes to learn just about everything just for fun. I have the feeling I don’t really grasp something until I have real experience with it. That’s why redoing something other people have done is the best activity I can think of for learning what the matter is really about. But it seems some people have difficulties tolerating my views about learning. I sometimes hear things like:

Why do you reinvent the wheel? Why do you want to waste your time?

Well, if I hadn’t ‘wasted’ so much time during all these years I would still be doing terrible things like using regex to parse HTML with crippled Perl scripts. The skills I’m getting my salary for come mostly from wasted time. I wouldn’t be anywhere with only my formal training.

Information technology nowadays is mostly about knowledge investment. Doing stuff is not as hard as learning how to do stuff orders of magnitude more efficiently. Learning is cumulative: the more you learn, the better you are at learning and the faster your efficiency grows. With computers it is really hard to hit a physical limit where it’s not possible to improve significantly more by learning.

It’s well known that a few highly productive people can beat large corporations made of people who work linearly with the skills they once learned in order to get a job.