I’ve been a busy developer for the last little while. I’ve put out a game analytics stack that (AFAIK) rivals the features of every commercially available solution in the gaming space. Along the way I’ve been trying to follow an agile development approach of rapid development and deployment, and make sure that the features get out in front of the stakeholders as they are completed.
Of course, that means that the path to get here hasn’t necessarily been terribly smooth, and it’s been filled with a great many late nights. A lot of those late nights and weekends have been centered around making development deadlines, but almost all of the really late nights have been for deployments or devops purposes. Which brings me to the focus of why I’m writing this blog post.
One of the things I do for a living is throw data around. Not just data, but lots of data, and lots of kinds of data too. The data warehouse part of the analytics stack is complicated and there are lots of runners pushing data all over the place. Believe it or not, cron has actually been sufficient so far for our job scheduling needs. At some point I expect that I’ll have to move to something like Oozie, or maybe just skip it entirely and head straight for Storm (this seems more my speed anyway).
Over time, I’ve added features like parallel importing, parallel summaries, more summaries, and so much more. One of the many ongoing battles I’ve been facing is the memory footprint of unique and percentile calculations. Combining breakneck feature development with billions of events and millions in row cardinality has turned deployments into multi-day affairs, and devops takes up an increasingly large amount of my time.
With that in mind, I’d like to impart to you a cool, quick-and-dirty job queue manager. For my particular purposes it lets my batch processing platform operate quite a bit like a data stream or message passing processor, without overloading the (meager) processing resources available. First, let me state that I have long been a fan of xargs and it makes a daily appearance in my shell. However, it has several critical failings for this purpose:
- Untrapped application deaths can “permanently” lower your processing throughput rate
- You can’t add tasks to the input list once things are underway
- You can’t remove tasks from the input list once things are underway
- It doesn’t realistically scale into crontab
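For reference, the xargs pattern being compared against looks roughly like the sketch below. The run_batch function and the task names are hypothetical stand-ins, not part of my actual pipeline; the point is only that the whole batch is handed to xargs up front.

```shell
#!/bin/sh
# Baseline sketch: a fixed batch fed to xargs, at most two tasks at a time.
# run_batch and the task names are hypothetical stand-ins.
run_batch() {
  # -P 2: up to two concurrent workers (GNU and BSD xargs)
  # -n 1: one task name per worker invocation, passed to sh as $1
  xargs -P 2 -n 1 sh -c 'echo "processed $1"' _
}

# The whole batch is handed over up front; output order may vary between runs.
printf '%s\n' import-a import-b import-c | run_batch
```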
With these limitations in mind, I set out to find a way to improve my current crontab based solution in some key areas:
- We must not overload the processing resources by firing off too many processes
- The processes must restart quickly when data is available to be processed
- I don’t want to hear about it when a process fails because there’s nothing to do (as happens with flock based solutions)
- I do want to hear about it when there’s error output to be had
- Ideally, this would scale across machines on the cloud
A crontab styled on the following was the outcome of my search, and it fulfills all the requirements. The magic happens in several parts. First, the command “sem” is an alias for (GNU) parallel --semaphore. It’s not available on Ubuntu (coreutils/moreutils parallel is different), so you’ll need to install it manually (see below). Let’s examine this part of the command: “sem --id proc -j2 ImportProcess”. This checks the “proc” counting semaphore and fires off a non-blocking ImportProcess if there are fewer than two objects using that semaphore. If there are two or more, it will block.
At a glance, that’s exactly what I want. However, while it won’t start another job if N of them are already running, the blocked request just sits there. The requests pile up and slow everything down. Naturally I looked at the arguments available in parallel and sem, but none of them seemed to do what I want. sem --timeout claims to simply force-fire the process after a time, and parallel --timeout kills the process if it’s still running after a certain amount of time. What I wanted was to have the process wait for the mutex only so long and then give up.
My first thought was that I could use timeout to accomplish this, but as it turns out parallel ignores SIGTERM and continues to wait. However, timelimit -qs9 sends a kill -9 to the blocking sem request. It’s ugly, but it works. The final piece of the puzzle is to swallow the death of timelimit. That’s where “|| true” comes in. As with all things, there’s a limit to how cool this particular piece of code is: I also lose notifications of the OS killing my application (for example, when it runs out of memory). I’ll work on that later, probably by adding a patch to parallel’s many, many, many, many options.
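To make that trade-off concrete, here is a small sketch of what “|| true” does to exit statuses. The killed_job and failed_job functions are hypothetical stand-ins (they are not timelimit, sem, or any of my real processes); they just simulate a job that gets a kill -9 and a job that exits with an error.

```shell
#!/bin/sh
# Hypothetical stand-ins: one job that is SIGKILLed, one that exits non-zero.
killed_job() { sh -c 'kill -9 $$'; }
failed_job() { sh -c 'exit 3'; }

# Print the exit status of a command (written so it also works under "set -e").
status_of() { rc=0; "$@" || rc=$?; echo "$rc"; }

# What appending "|| true" turns any command into.
swallowed() { "$@" || true; }

echo "killed:              $(status_of killed_job)"            # 137 in bash/dash (128 + 9)
echo "failed:              $(status_of failed_job)"            # 3
echo "killed with || true: $(status_of swallowed killed_job)"  # 0
echo "failed with || true: $(status_of swallowed failed_job)"  # 0
```

Once “|| true” is in place, a kill -9 of the waiting request and a kill -9 of the real job both come back as 0, which is exactly why the out-of-memory case goes unnoticed.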
MAILTO=your_email@your_domain.com
*/1 * * * * timelimit -qs9 -t1 /usr/local/bin/sem --id proc -j2 ImportProcess || true
*/1 * * * * timelimit -qs9 -t1 /usr/local/bin/sem --id proc -j5 TransformProcess || true
*/1 * * * * timelimit -qs9 -t1 /usr/local/bin/sem --id proc -j7 SummaryProcess || true
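For comparison, the “wait for a slot only so long, then give up quietly” behavior can also be sketched with flock(1) from util-linux instead of sem. This is not the setup above and not something I run in production; run_with_slot, LOCK_DIR, and the slot file naming are all made up for illustration. The one thing it does differently is hand back the job’s own exit status, so a job that dies is not hidden the way it is with “|| true”.

```shell
#!/bin/sh
# Sketch only: a slot limiter built on flock(1) from util-linux, NOT GNU sem.
# run_with_slot, LOCK_DIR and the slot file names are hypothetical.
LOCK_DIR="${LOCK_DIR:-/tmp/jobslots}"

# run_with_slot NAME SLOTS WAIT_SECONDS COMMAND [ARGS...]
# Runs COMMAND while holding one of SLOTS lock files for NAME.
# If no slot frees up within roughly WAIT_SECONDS, returns 0 without running
# anything. Otherwise returns COMMAND's own exit status.
# Note: plain sh has no "local", so these variables are global.
run_with_slot() {
  name=$1; slots=$2; wait_s=$3
  shift 3
  mkdir -p "$LOCK_DIR" || return 1
  waited=0
  while :; do
    i=1
    while [ "$i" -le "$slots" ]; do
      # Open the slot file on fd 9; the lock lasts as long as fd 9 stays open.
      exec 9>"$LOCK_DIR/$name.$i"
      if flock -n 9; then
        rc=0
        "$@" || rc=$?     # run the job, remember its real exit status
        exec 9>&-         # closing fd 9 releases the slot
        return "$rc"
      fi
      exec 9>&-           # slot busy: close it and try the next one
      i=$((i + 1))
    done
    # Every slot was busy. Give up quietly once the wait budget is spent.
    [ "$waited" -ge "$wait_s" ] && return 0
    sleep 1
    waited=$((waited + 1))
  done
}

# Example call (placeholder job): at most 2 "import" jobs, wait up to 1 second.
# run_with_slot import 2 1 ImportProcess
```

Because the kernel drops a flock when the file descriptor is closed, a job that gets a kill -9 frees its slot on its own, which speaks to the “untrapped application deaths” complaint about xargs as well. It only limits jobs on a single machine, though.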
Installing GNU Parallel:
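# Download parallel-20130222.tar.bz2 from a GNU mirror first
tar jxf parallel-20130222.tar.bz2
cd parallel-20130222
./configure && make
sudo make install
which parallel # Make sure this says /usr/local/bin instead of /usr/bin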