Liang Nuren – Failing In So Many Ways

Python and Pypy, Psycopg2 vs Psycopg2cffi

I’ve been using Python with psycopg2 [init.d] pretty much full time for a couple of years now.  However, today I was browsing through the documentation and realized that I didn’t know the performance penalty of various operations and cursor factories.  I’ve traditionally assumed that the default tuple-based cursor must have a pretty significant performance benefit to make up for its lack of features, but the performance penalty for DictCursor didn’t feel all that bad, so I’ve just rolled with it.

Another habit I’ve taken up recently was attempting to get processing underway sooner while minimizing the memory footprint.  This means in a lot of applications I’ve taken to avoiding cursor.fetchall() in favor of iteratively fetching from the cursor.  I also didn’t have a quantitative measurement for the performance impact of this.  For the curious, the approach looks something like this (check below for the gist):

with conn.cursor() as cur:
    cur.execute(query)
    for row in cur:
        process(row)  # handle each row as it arrives

So today at work I resolved that I’d spend my BART ride home writing a quick benchmark to test various interesting cursor factories as well as fetchall() vs [ x for x in cursor ].  Once the testing got underway, I realized that I could run the same code to test the new psycopg2cffi module as well as pypy-2.0.2 (with psycopg2cffi).  These are the results for fetching 10,000 rows with 8 columns 1,000 times on my computer:

# 1k calls, cume duration
# 10k rows fetched
# +--------------------+----------------+--------------------+-------------------------+
# |   Default Cursor   | psycopg2/2.7.3 | psycopg2cffi/2.7.3 | psycopg2cffi/pypy-2.0.2 |
# +====================+================+====================+=========================+
# | fetch_results      | 18.072         | 18.076             | 32.817                  |
# +--------------------+----------------+--------------------+-------------------------+
# | fetch_iter_results | 20.560         | 20.691             | 33.817                  |
# +--------------------+----------------+--------------------+-------------------------+
# +--------------------+----------------+--------------------+-------------------------+
# |     DictCursor     | psycopg2/2.7.3 | psycopg2cffi/2.7.3 | psycopg2cffi/pypy-2.0.2 |
# +====================+================+====================+=========================+
# | fetch_results      | 18.405         | 18.377             | 32.434                  |
# +--------------------+----------------+--------------------+-------------------------+
# | fetch_iter_results | 19.563         | 19.674             | 33.265                  |
# +--------------------+----------------+--------------------+-------------------------+
# +--------------------+----------------+--------------------+-------------------------+
# |  NamedTupleCursor  | psycopg2/2.7.3 | psycopg2cffi/2.7.3 | psycopg2cffi/pypy-2.0.2 |
# +====================+================+====================+=========================+
# | fetch_results      | 18.296         | 18.291             | 32.158                  |
# +--------------------+----------------+--------------------+-------------------------+
# | fetch_iter_results | 19.599         | 19.650             | 32.999                  |
# +--------------------+----------------+--------------------+-------------------------+
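For reference, the harness behind these numbers boils down to something like this sketch. The table name and the `make_cursor` wiring are made up for illustration; in the real script the cursor factory comes from `psycopg2.extras`:

```python
import time

def fetch_results(cur):
    """Materialize the entire result set in one go."""
    cur.execute("SELECT * FROM bench_rows")  # hypothetical table
    return cur.fetchall()

def fetch_iter_results(cur):
    """Pull rows one at a time by iterating the cursor."""
    cur.execute("SELECT * FROM bench_rows")
    return [x for x in cur]

def bench(make_cursor, fn, calls=1000):
    """Cumulative seconds spent in fn across `calls` fresh cursors."""
    total = 0.0
    for _ in range(calls):
        with make_cursor() as cur:
            start = time.perf_counter()
            fn(cur)
            total += time.perf_counter() - start
    return total

# With a real connection this looks something like:
#   import psycopg2, psycopg2.extras
#   conn = psycopg2.connect("dbname=bench")
#   cume = bench(lambda: conn.cursor(cursor_factory=psycopg2.extras.DictCursor),
#                fetch_iter_results)
```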

The thing that surprised me most about these results was that iterating across the cursor wasn’t really that much more expensive than fetchall().  I suspect that the cost increases with network latency, but at least some of that cost will be paid with fetchall() as well.  I think it’s a good idea to set up a more rigorous benchmark before calling it “low cost”, but either way I really appreciate the ability to start operating on a smaller dataset while keeping the memory footprint low.

I was also pretty surprised by how small a performance penalty DictCursor and NamedTupleCursor had.  It probably shouldn’t surprise me much, considering that network latency and data transfer should absolutely dwarf object creation costs.  I guess the takeaway from this is: if you’re going to go through all the effort of going to the database to get a piece of data, make sure you return it in a way that makes sense and is convenient to manipulate.
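To make the convenience trade-off concrete, here is roughly what the same row looks like from each factory (plain Python stand-ins, not actual psycopg2 row objects):

```python
from collections import namedtuple

# Default cursor: a bare tuple -- cheap, but positional access only.
row_tuple = (42, 'hero_sword', 3)

# DictCursor: rows behave like dicts keyed by column name.
row_dict = {'user_id': 42, 'item': 'hero_sword', 'qty': 3}

# NamedTupleCursor: rows are namedtuples, so attribute access works.
Row = namedtuple('Row', ['user_id', 'item', 'qty'])
row_named = Row(42, 'hero_sword', 3)

# Downstream code gets to say what it means instead of counting columns:
assert row_tuple[1] == row_dict['item'] == row_named.item
```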

I was, unfortunately, not terribly surprised by Pypy’s poor performance here.  While this application is ostensibly a tight loop pulling from the database and creating objects, it doesn’t feel like it plays to what I think are Pypy’s best strengths.  I’ve had my best luck using Pypy to run specialized applications that spend most of their time in tight computation loops.

For the curious, the benchmarking file is here (along with the above chart).


Filed under: Data Warehousing, Software Development

Quick And Dirty Job Queue

I’ve been a busy developer for the last little while. I’ve put out a game analytics stack that (AFAIK) rivals the features of every commercially available solution in the gaming space. Along the way I’ve been trying to follow an agile development approach of rapid development and deployment, and make sure that the features get out in front of the stakeholders as they are completed.

Of course, that means that the path to get here hasn’t necessarily been smooth, and it’s been filled with a great many late nights. A lot of those late nights and weekends have been spent making development deadlines, but almost all of the really late nights have been for deployments or devops purposes.  Which brings me to why I’m writing this blog post.

One of the things I do for a living is throw data around.  Not just data, but lots of data – and lots of kinds of data too.  The data warehouse part of the analytics stack is complicated, and there are lots of runners pushing data all over the place.  Believe it or not, cron has actually been sufficient so far for our job scheduling needs.  At some point I expect that I’ll have to move to something like Oozie – or maybe just skip it entirely and head straight for Storm (this seems more my speed anyway).

Over time, I’ve added features like parallel importing, parallel summaries, more summaries, and so, so much more.  One of the (many) ongoing battles I’ve been facing is the memory footprint of unique and percentile calculations.  Combining breakneck feature development with billions of events and millions in row cardinality has turned deployments into multi-day affairs and driven devops to take up an increasingly large amount of my time.

With that in mind, I’d like to impart to you a cool quick and dirty job queue manager.  For my particular purposes it lets my batch processing platform operate quite a bit like a data stream or message passing processor – without overloading the (meager) processing resources available.  First, let me state that I have long been a fan of xargs and it makes a daily appearance in my shell.  However, it has several critical failings for this purpose:

  • Untrapped application deaths can “permanently” lower your processing throughput rate
  • You can’t add tasks to the input list once things are underway
  • You can’t remove tasks from the input list once things are underway
  • It doesn’t realistically scale into crontab

With these limitations in mind, I set out to find a way to improve my current crontab based solution in some key areas:

  • We must not overload the processing resources by firing off too many processes
  • The processes must restart quickly when data is available to be processed
  • I don’t want to hear about it when a process fails because there’s nothing to do (flock based solutions)
  • I do want to hear about it when there’s error output to be had
  • Ideally, this would scale across machines on the cloud

A crontab styled on the following was the outcome of my search – and it fulfills all the requirements.  The magic happens in several parts.  First, the command “sem” is an alias for (GNU) parallel --semaphore.  It’s not available on ubuntu (coreutils/moreutils parallel is different), so you’ll need to install it manually (see below).  Let’s examine this part of the command: “sem --id proc -j2 ImportProcess”.  This checks the “proc” counting semaphore and fires off a non-blocking ImportProcess if there are fewer than two processes holding that semaphore.  If there are 2+, it will block.

At a glance, that’s exactly what I want: it won’t fire if there are already N of them running.  The problem is that the blocked invocations don’t exit – they just sit there, pile up, and slow everything down.  I looked at the arguments available in parallel and sem naturally, but none of them seemed to do what I want.  sem --timeout claims to simply force-fire the process after a time, and parallel --timeout kills the process if it’s still running after a certain amount of time.  What I wanted was to have the process wait for the mutex for only so long.

My first thought was that I could use timeout to accomplish this, but as it turns out parallel ignores SIGTERM and continues to wait.  However, timelimit -qs9 sends a kill -9 to the blocking sem request.  It’s ugly, but it works.  The final piece of the puzzle is to swallow the death of timelimit – that’s where “|| true” comes in.  As with all things, there’s a limit to how cool this particular piece of code is: I also lose notifications of the OS killing my application (for example, when it runs out of memory).  I’ll work on that later, probably by adding a patch to parallel’s many, many, many, many options.
*/1 * * * * timelimit -qs9 -t1 /usr/local/bin/sem --id proc -j2 ImportProcess || true
*/1 * * * * timelimit -qs9 -t1 /usr/local/bin/sem --id proc -j5 TransformProcess || true
*/1 * * * * timelimit -qs9 -t1 /usr/local/bin/sem --id proc -j7 SummaryProcess || true

Installing GNU Parallel:

wget http://ftp.gnu.org/gnu/parallel/parallel-20130222.tar.bz2
tar jxf parallel-20130222.tar.bz2
cd parallel-20130222/
./configure && make
sudo make install
which parallel # Make sure this says /usr/local/bin instead of /usr/bin

Filed under: Data Warehousing, Software Development

Python JSON Performance

So I’ve been pretty open about the fact that I’ve moved from data warehousing in the television and online ad industries to data warehousing in the gaming industry. The problem domains are so incredibly different. In the television and ad industries, there’s a relatively small amount of data that people are actually concerned about. Generally speaking, those industries are most interested in how many people saw something (viewed the ad), how many people interacted with it (clicked on it), and whether they went on to perform some other action (like buying a product).

However, in the gaming industry we’re interested in literally everything that a user does – and not in the creepy way. The primary goals are to monitor and improve user engagement, user enjoyment, and core business KPIs.  There are a lot of specific points to focus on when trying to gather this information, and right now the industry standard appears to be a highly generalized event/payload system.

When looking at highly successful games like Temple Run (7M DAU [gamesbrief]), it takes only about 150 events per user per day to reach a billion events per day.  Between user segmentation and calculating different metrics, it’s pretty easy to see why you’d have to process parts of the data enough times that you’re processing trillions of events and hundreds of GB of facts per day.
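The arithmetic behind that billion is straightforward:

```python
dau = 7_000_000           # Temple Run's reported daily active users
events_per_user = 150     # events per user per day
events_per_day = dau * events_per_user
print(events_per_day)     # -> 1050000000, i.e. ~1.05 billion events a day
```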

When I see something that looks that outrageous, I tend to ask myself whether that’s really the problem to be solving. The obvious answer is to gather less data, but that’s exactly the opposite of what’s really needed. So is there a way to get the needed answers without processing trillions of events per day? I’d say that there is – but perhaps not with the highly generic, uncorrelated event/payload system.  Any move in that direction would be moving off into technically uncharted territory – though not wholly uncharted for me. I’ve built a similar system before in another industry, albeit with much simpler data.

If you aren’t familiar with data warehousing at all, a ten-thousand-foot overview (slightly adapted for use in gaming) looks something like this.  First, the gaming client determines what facts about user behavior and game performance are interesting to collect. Then it transmits JSON events back to a server for logging and processing.  From there the data is generally batch processed and uploaded to a database for viewing.
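In the generic event/payload style, a single fact on the wire is just a small JSON document. The field names here are illustrative, not any particular vendor’s schema:

```python
import json

event = {
    "event": "item_purchased",
    "ts": "2013-06-28T18:04:05Z",
    "user_id": "u-12345",
    "payload": {"item": "hero_sword", "qty": 1, "price_gold": 250},
}

# One event per line is the usual logging format for later batch processing.
line = json.dumps(event)
assert json.loads(line) == event
```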

So as a basic sanity check, I’m doing some load testing to determine whether it is feasible to gather and process much higher resolution information about a massively successful game and its users than seems to be currently available in the industry.  Without going into proprietary details, I’ve manufactured analytics for a fake totalhelldeath game.  It marries Temple Run’s peak performance with a complicated economy resembling Eve Online’s.

From there, I’m compressing days of playtime into minutes and expanding the user base to be everyone with a registered credit card in the app store (~400M people as of 2012) [wikipedia].  The goal here is to see how far it’s possible to reasonably push an analytics platform in terms of metrics collection, processing, and reporting.  My best estimate for the amount of data to be processed per day in this load test is ~365 GB/day of uncompressed JSON.  While there’s still a lot that’s up in the air about this, I can share how dramatically the design requirements differ between my previous warehouse and this load test:

Previous Data Warehouse:

  • Reporting Platform: Custom reporting layer querying 12TB PostgreSQL reporting databases
  • Hardware: Bare metal processing cluster with bare metal databases
  • Input Data: ~51GB/day uncompressed binary (~150TB total uncompressed data store)
  • Processing throughput: 86.4 billion facts/day across 40 cores (1M facts/sec)

Analytics Load Test:

  • Reporting Platform: Reporting databases with generic reporting tool
  • Hardware: Amazon Instances
  • Input Data: ~365 GB/day uncompressed JSON (~40k per “hell fact” – detailed below)
  • Processing throughput: duplication factor * 8.5M facts/game day (100 * duplication facts/sec)

I’ve traditionally worked in a small team on products that have been established for years.  I have to admit that it’s a very different experience to be tasked with building literally everything from the ground up – from largely deciding what analytics points are reasonable to collect to building the system to extract and process it all. Furthermore, I don’t have years to put a perfect system into place, and I’m only one guy trying to one-up the work of an entire industry.  Development speed is paramount: maintaining Agile practices [wikipedia], successful iterations [wikipedia], and even the language I choose to develop in are all of critical importance.

The primary motivator for my language choice was a combination of how quickly I can crank out high quality code and how well that code will perform.  Thus, my earlier blog post [blog] on language performance played a pretty significant role in which languages saw a prototype.  Python (and pypy specifically) seems well suited for the job and it’s the direction I’m moving forward with.  For now I’m building the simplest thing that could possibly work and hoping that the Pypy JIT will alleviate any immediate performance shortfalls.  And while I know that a JIT is basically a black box and you can’t guarantee performance, the problem space showed high suitability to JIT in the prototyping phase.  I foresee absolutely no problems handling the analytics for a 1M DAU game with Python – certainly not at the data resolution the industry is currently collecting.

But I’m always on the lookout for obvious performance bottlenecks.  That’s why I noticed something peculiar while building out some sample data a couple of days ago. On the previous project I worked on, I found that gzipping the output files in memory before writing to disk actually provided a large performance benefit, because it wrote 10x less data to disk.  This shifted our application from being IO bound to being CPU bound and increased throughput by several hundred percent.  I expected this to be even more true in a system attempting to process ~365GB of JSON per day, so I was quite surprised to find that enabling in-memory gzip cut overall application performance in half.  The implication here is that the application is already CPU bound.
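That previous-project trick is easy to sketch: compress in memory, then hit the disk with one write of the compressed bytes. This is a minimal illustration, not the production code:

```python
import gzip
import io

def write_gzipped(path, lines):
    """Gzip the output in memory, then write the (much smaller) blob once."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for line in lines:
            gz.write(line.encode("utf-8") + b"\n")
    with open(path, "wb") as fp:
        fp.write(buf.getvalue())
```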

It didn’t take much time before I’d narrowed down the primary culprit: json serialization in pypy was just painfully slow. It was a little bit surprising considering this page [] cites pypy’s superior json performance over cpython.  Pypy is still a net win despite the poor JSON serialization performance, but the win isn’t nearly as big as I’d like it to be. So after a bit of research I found several json libraries to test and had several ideas for how the project was going to fall out from here:

  • Use a different json library. Ideally it JITs better than built in and I can just keep going.
  • Accept pypy’s slow json serialization as a cost of (much) faster aggregation.
  • Accept cpython’s slower aggregation and optimize aggregation with Cython or a C extension later
  • Abandon JSON altogether and go with a different object serialization method (protobuf? xdr?)

After some consideration, I ruled out the idea of abandoning JSON altogether. By using JSON, I’m (potentially) able to import individual records at any level into a Mongo cluster and perform ad hoc queries. This is a very non-trivial benefit to just throw away! I looked at trying many JSON libraries, but ultimately settled on these three for various reasons (mostly relating to them working):

  • json – the builtin library, as the baseline
  • simplejson
  • ujson

To test each of these libraries, I devised a simple test with the goal of having the modules serialize mock event data.  This is important because many benchmarks I’ve seen are built around very small, contrived json structures.  I came up with the following devious plan in order to make sure that my code couldn’t really muck up the benchmark results:

  • create JSON encodable dummy totalhelldeath fact list
  • foreach module: dump list to file (module.dump(facts, fp))
  • foreach module: read list from file (facts = module.load(fp))
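That plan translates to a small harness. The fact shape below is a made-up stand-in for the real totalhelldeath events; any module exposing the stdlib-style `dump`/`load` interface (json, simplejson, ujson) can be swapped in:

```python
import json
import time

def make_facts(n=10000):
    """Dummy facts standing in for the real event data."""
    return [{"event": "hell_event", "user_id": i,
             "payload": {"isk_delta": i * 1000.0, "step": i % 7}}
            for i in range(n)]

def bench_module(module, facts, path):
    """Return (dump_facts_per_sec, load_facts_per_sec) for one module."""
    start = time.perf_counter()
    with open(path, "w") as fp:
        module.dump(facts, fp)
    dump_rate = len(facts) / (time.perf_counter() - start)

    start = time.perf_counter()
    with open(path) as fp:
        loaded = module.load(fp)
    load_rate = len(loaded) / (time.perf_counter() - start)
    return dump_rate, load_rate

# e.g. bench_module(json, make_facts(), "/tmp/facts.json"),
# then the same call with simplejson or ujson in place of json.
```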

Just so that everything is immediately obvious: this was run on one core of an Amazon XL instance, and the charts are measuring facts serialized per second.  That means that bigger bars are better here.

Read Performance

There’s really no standout winner here, but it’s clear that the builtin json library is lacking in both cpython and pypy. It runs a bit faster under cpython, but not by enough to really write home about. However, simplejson and ujson really show that their performance is worth it. In my not-so-expert opinion, I’d say that ujson walks away with a slight victory here.

Write Performance

Here, however, there is an obvious standout winner. In fact, the margin of victory is so large that I’d feel remiss if I didn’t say I checked file sizes to ensure it was actually serializing what I thought it was! There was a smallish file size difference (~8%), primarily coming from the fact that ujson serializes compactly by default.

So now I’m left with a conundrum: ujson performance is mighty swell, and that can directly translate to dollars saved.  In this totalhelldeath situation, I could be sacrificing as much as 71k + 44k extra core-seconds per day by choosing Pypy over CPython.  In relative money terms, that effectively increases the cost of an Amazon XL instance by a third.  In absolute terms, it costs somewhere between $5.50 USD/day and $16 USD/day – depending on whether it’s necessary to spin up an extra instance.
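The back-of-the-envelope math, assuming a 4-core XL instance (the core count is my assumption, not a measured spec):

```python
extra_core_seconds = 71_000 + 44_000             # write + read penalty per day
extra_core_hours = extra_core_seconds / 3600.0   # ~31.9 core-hours

cores, hours = 4, 24                             # assumed 4-core instance, all day
fraction_of_instance = extra_core_hours / (cores * hours)
print(round(fraction_of_instance, 2))            # -> 0.33, about a third
```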

Food for thought, obviously. But this load test isn’t going to finish by itself, so I’m putting Python’s (lack of) JSON performance behind me.  Still, the standout performance of ujson’s write speed does mean that I’m going to be paying a lot closer attention to whether I should be pushing towards CPython, Cython, and Numpy instead of Pypy.  In the end I may have no choice but to ditch Pypy altogether – something that would make me a sad panda indeed.

Filed under: Data Warehousing, Game Design, Personal Life, Software Development

Pair Programming, Code Reviews, and Data Warehousing

Code Reviewing

Code reviewing [Wikipedia] is the concept of having some form of peer review of finished code in order to ensure that it does what it’s supposed to do and that the approach taken to solve the problem was a good one.  There are two really common forms of code review – the formal code review and the lightweight code review.  A formal code review involves a thorough review and understanding of every line of code, frequently by everyone on a team.  Obviously, this is a very heavy process; formal code reviews are reserved for only the most sensitive code and are considered almost antiquated these days.  Lightweight code reviews tend to be more informal and involve shorter looks at smaller blocks of code – but the danger is that the review can become meaningless “rubber stamping”.  Both formal and informal code reviews have been shown to decrease defect rates and improve knowledge transfer within a team.

Pair Programming

Pair programming [Wikipedia] is the concept of having two (or more) developers work on the same piece of code at the same time at the same workstation.  In a very real way, pair programming is “on the fly” code reviewing – as such, it also lowers the defect rate and improves knowledge transfer.  It’s generally accepted that two programmers get a single piece of work done faster than one, but not twice as fast.  There is a net productivity loss when pair programming, and it’s hoped that the benefits make up for it.  I’ve personally seen it work a variety of ways, from Driver/Navigator to Test Ping Pong.  In all cases, both parties are expected to fully understand the overall design and code being written.

Data Warehousing

Data Warehousing [Wikipedia] is a branch of computing which involves the creation and care of large stores of data for the purpose of answering questions.  For instance, it is useful to know how many people clicked on a particular ad banner, or how many RC Helicopters were sold at Best Buys in Ohio.  For the Eve readers, killboards are examples of either Data Marts or Data Warehouses – depending on who you ask.  The discipline is closely related to data mining [Wikipedia], which often makes use of a data warehouse.

Most data warehousing is done via Extract, Transform, Load [Wikipedia] processes in databases like PostgreSQL, Oracle, and MySQL, though certainly most serious data warehousing is done with a combination of technologies involving Distributed File Systems [Wikipedia] and Map/Reduce.  To give you some idea of the scales involved: the largest single-instance databases in the world weigh in at over 2 PB, and reliable information about larger data stores is amazingly scarce.  I’d estimate some of the larger data warehouses in the world weigh in at hundreds of PB now.  Personally, I’ve worked with data warehouses ranging from 500GB and millions of facts per day to 150TB+ and up to trillions of facts per day.  I’d say your average data warehousing company isn’t likely to see more than 75GB of data per day and will store something on the order of 500GB-2TB.

The Dilemma

The internet debates over pair programming vs code reviews seem to be endless, but most of the teams I’ve encountered practicing some form of XP have a fairly strong preference for pair programming over code reviews.  The argument tends to go that what is really important is the second set of eyes on the code.  Furthermore, pair programming naturally avoids the danger of “rubber stamp” code reviews, because it’s much harder to rubber stamp when your reviewer is helping write the code.  These are absolutely valid observations, and I’m a big fan of pair programming.

However, I feel like the right answer for a data warehousing team is not to pick one – but to pick both when possible. While this does make the process slightly heavier, I want to point out that the cost of failure is much higher.  A friend of mine points out that when most SAAS developers make a mistake, they fix it and bounce a web server – but when I make a mistake, we spend three weeks (re)migrating data.  Ultimately what everyone involved – from product managers to developers – wants is for the team to deliver results in a timely manner… and really, three weeks is a hell of a delay to eat because you didn’t spend 20 minutes doing a code review.

So please consider the cost of failure when you’re considering whether you should do pair programming or code reviews.

Filed under: Data Warehousing, Databases, Software Development