Liang Nuren – Failing In So Many Ways

Python JSON Performance

So I’ve been pretty open about the fact that I’ve moved from data warehousing in the television and online ad industries to data warehousing in the gaming industry. The problem domains are so incredibly different. In the television and ad industries, there’s a relatively small amount of data that people are actually concerned about. Generally speaking, those industries are most interested in how many people saw something (viewed the ad), how many people interacted with it (clicked on it), and whether they went on to perform some other action (like buying a product).

However, in the gaming industry we’re interested in literally everything that a user does – and not in the creepy way. The primary goals are to monitor and improve user engagement, user enjoyment, and core business KPIs. There are a lot of specific points to focus on when gathering this information, and right now the industry standard appears to be a highly generalized event/payload system.

Looking at a highly successful game like Temple Run (7M DAU [gamesbrief]), it only takes ~150 events per user per day to reach a billion events per day (7M × 150 ≈ 1.05B). Between user segmentation and calculating different metrics, it’s pretty easy to see why you’d have to process parts of the data enough times that you’re processing trillions of events and hundreds of GB of facts per day.

When I see something that looks that outrageous, I tend to ask myself whether that’s really the problem to be solving. The obvious answer is to gather less data, but that’s exactly the opposite of what’s really needed. So is there a way to get the needed answers without processing trillions of events per day? Yes, I’d say that there is – but perhaps not with the highly generic uncorrelated event/payload system. Any move in that direction would be moving off into technically uncharted territory – though not wholly uncharted for me. I’ve built a similar system before in another industry, albeit with much simpler data.

If you aren’t familiar at all with data warehousing, a ten-thousand-foot overview (slightly adapted for use in gaming) would look something like this. First, the gaming client determines what facts about user behavior and game performance are interesting to collect. Then it transmits JSON events back to a server for logging and processing. From there the data is generally batch processed and uploaded to a database for viewing.
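To make that concrete, a single event in such a system might look something like this (a purely hypothetical example – the field names are invented for illustration):

    {
        "event": "item_purchased",
        "timestamp": "2012-09-14T03:22:17Z",
        "user_id": "8f3c9a62",
        "payload": {"item": "sword", "currency": "gold", "price": 1250}
    }

The free-form payload is what makes the event system so generic – and also part of why the events end up uncorrelated and expensive to process downstream.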

So as a basic sanity check, I’m doing some load testing to determine whether it is feasible to gather and process much higher resolution information about a massively successful game and its users than seems to be currently available in the industry. Without going into proprietary details, I’ve manufactured analytics for a fake totalhelldeath game. It marries Temple Run’s peak performance with a complicated economy resembling Eve Online’s.

From there, I’m compressing days of playtime into minutes and expanding the user base to everyone with a registered credit card in the app store (~400M people as of 2012) [wikipedia]. The goal here is to see how far it’s possible to reasonably push an analytics platform in terms of metrics collection, processing, and reporting. My best estimate for the amount of data to be processed per day in this load test is ~365 GB/day of uncompressed JSON. While there’s still a lot that’s up in the air about this, I can share how dramatically the design requirements differ:

Previously:

  • Reporting Platform: Custom reporting layer querying 12TB PostgreSQL reporting databases
  • Hardware: Bare metal processing cluster with bare metal databases
  • Input Data: ~51GB/day uncompressed binary (~150TB total uncompressed data store)
  • Processing throughput: 86.4 billion facts/day across 40 cores (1M facts/sec)

Analytics Load Test:

  • Reporting Platform: Reporting databases with generic reporting tool
  • Hardware: Amazon Instances
  • Input Data: ~365 GB/day uncompressed JSON (~40k per “hell fact” – detailed below)
  • Processing throughput: duplication factor * 8.5M facts/game day (100 * duplication facts/sec)

I’ve traditionally worked in a small team on products that had been established for years. I have to admit that it’s a very different experience to be tasked with building literally everything from the ground up – from largely deciding what analytics points are reasonable to collect to building the system to extract and process it all. Furthermore, I don’t have years to put a perfect system into place, and I’m only one guy trying to one-up the work of an entire industry. The speed at which I can develop is critical: maintaining Agile practices [wikipedia], successful iterations [wikipedia], and even the language I choose to develop in are all of critical importance.

The primary motivator for my language choice was a combination of how quickly I can crank out high quality code and how well that code will perform. Thus, my earlier blog post [blog] on language performance played a pretty significant role in which languages saw a prototype. Python (and PyPy specifically) seems well suited for the job, and it’s the direction I’m moving forward with. For now I’m building the simplest thing that could possibly work and hoping that the PyPy JIT will alleviate any immediate performance shortfalls. And while I know that a JIT is basically a black box and you can’t guarantee performance, the problem space showed high suitability for JITting in the prototyping phase. I foresee absolutely no problems handling the analytics for a 1M DAU game with Python – certainly not at the data resolution the industry is currently collecting.

But I’m always on the lookout for obvious performance bottlenecks. That’s why I noticed something peculiar when I was building out some sample data a couple of days ago. On the previous project I worked on, I found that gzipping the output files in memory before writing to disk actually provided a large performance benefit because it wrote 10x less data to disk. This shifted our application from being IO bound to being CPU bound and increased throughput by several hundred percent. I expected this to be even more true in a system attempting to process ~365GB of JSON per day, so I was quite surprised to find that enabling in-memory gzip cut overall application performance in half. The implication here is that the application is already CPU bound.
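For reference, the in-memory gzip trick is nothing fancy. A minimal sketch of the idea (the function and file handling here are illustrative, not the actual code):

    import gzip
    import io

    def write_compressed(path, lines):
        # Compress the serialized records in memory first...
        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
            for line in lines:
                gz.write(line.encode("utf-8") + b"\n")
        # ...then do one (much smaller) write to disk.
        with open(path, "wb") as fp:
            fp.write(buf.getvalue())

Trading CPU cycles for a ~10x smaller write is a big win while the application is IO bound; once it’s CPU bound, the extra compression work just slows everything down.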

It didn’t take much time before I’d narrowed down the primary culprit: JSON serialization in PyPy was just painfully slow. That was a little surprising considering this page [pypy.org] cites PyPy’s superior JSON performance over CPython. PyPy is still a net win despite the poor JSON serialization performance, but the win isn’t nearly as big as I’d like it to be. So after a bit of research I found several JSON libraries to test, and had several ideas for how the project might fall out from here:

  • Use a different JSON library. Ideally it JITs better than the builtin and I can just keep going.
  • Accept PyPy’s slow JSON serialization as a cost of (much) faster aggregation.
  • Accept CPython’s slower aggregation and optimize the aggregation with Cython or a C extension later.
  • Abandon JSON altogether and go with a different object serialization method (protobuf? XDR?)

After some consideration, I ruled out the idea of abandoning JSON altogether. By using JSON, I’m (potentially) able to import individual records at any level into a Mongo cluster and perform ad hoc queries. This is a very non-trivial benefit to just throw away! I looked at trying many JSON libraries, but ultimately settled on these three for various reasons (mostly relating to them working):

  • json (the builtin library)
  • simplejson
  • ujson

To test each of these libraries, I devised a simple test with the goal of having the modules serialize mock event data. This is important because many benchmarks I’ve seen are built around very small, contrived JSON structures. I came up with the following devious plan to make sure that my code couldn’t really muck up the benchmark results (a sketch of the harness follows the list):

  • create JSON encodable dummy totalhelldeath fact list
  • foreach module: dump list to file (module.dump(facts, fp))
  • foreach module: read list from file (facts = module.load(fp))
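In code, the harness boils down to something like this (a sketch – the real totalhelldeath facts are considerably bigger than this stand-in):

    import time
    import json
    import simplejson
    import ujson

    # Hypothetical stand-in for the totalhelldeath fact list.
    facts = [{"event": "hell_fact", "user_id": i, "payload": {"gold": i * 3}}
             for i in range(100000)]

    def benchmark(module, path="/tmp/facts.json"):
        # Time dumping the whole fact list to disk...
        start = time.time()
        with open(path, "w") as fp:
            module.dump(facts, fp)
        write_secs = time.time() - start

        # ...then time loading it back.
        start = time.time()
        with open(path) as fp:
            loaded = module.load(fp)
        read_secs = time.time() - start

        assert len(loaded) == len(facts)
        return len(facts) / write_secs, len(facts) / read_secs

    for module in (json, simplejson, ujson):
        writes, reads = benchmark(module)
        print("%s: %.0f dumps/sec, %.0f loads/sec" % (module.__name__, writes, reads))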

Just so that everything is immediately obvious: this was run on one core of an Amazon XL instance, and the charts are measuring facts serialized per second.  That means that bigger bars are better here.

Read Performance

There’s no obvious standout winner here, but the builtin json library is clearly lacking in both CPython and PyPy. It runs a bit faster under CPython, but not enough to really write home about. However, simplejson and ujson really show that their performance is worth it. In my not-so-expert opinion, I’d say that ujson walks away with a slight victory here.

Write Performance

Here, however, there is an obvious standout winner. In fact, the margin of victory is so large that I’d feel remiss if I didn’t say I checked file sizes to ensure it was actually serializing what I thought it was! There was a smallish file size difference (~8%), primarily coming from the fact that ujson serializes compactly by default.
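The compactness difference is easy to see by putting the builtin library next to ujson (an illustrative snippet – key order may vary):

    import json
    import ujson

    fact = {"event": "coin_spend", "amount": 100}

    print(json.dumps(fact))   # {"event": "coin_spend", "amount": 100}
    print(ujson.dumps(fact))  # {"event":"coin_spend","amount":100}

    # The builtin can be made just as compact with explicit separators:
    print(json.dumps(fact, separators=(",", ":")))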

So now I’m left with a conundrum: ujson performance is mighty swell, and that can directly translate to dollars saved. In this totalhelldeath situation, I could be sacrificing as much as 71k + 44k extra core-seconds per day by choosing PyPy over CPython. That’s ~115k core-seconds, or about 1.3 core-days – roughly a third of the cores on an Amazon XL instance, which is why in relative money terms it effectively increases the cost of that instance by a third. In absolute terms, it costs somewhere between $5.50 USD/day and $16 USD/day, depending on whether it’s necessary to spin up an extra instance.

Obviously food for thought. But this load test isn’t going to finish by itself, so I’m putting Python’s (lack of) JSON performance behind me. Still, the standout performance of ujson’s write speed does mean that I’m going to be paying a lot closer attention to whether I should be pushing towards CPython, Cython, and Numpy instead of PyPy. In the end I may have no choice but to ditch PyPy altogether – something that would make me a sad panda indeed.

Filed under: Data Warehousing, Game Design, Personal Life, Software Development

RL Politics

This is a political blog post. I don’t really care if you don’t agree – or if you read it. It’s mostly here so that I can grapple with these thoughts in my own way. A few people (including some coworkers) will probably read it, and I hope nobody gets offended. I’m pretty sure that nobody I know in person is going to be offended at any rate. If you do get offended, please remember that I’m a programmer, not an accountant. I’m absolutely positive that my back-of-the-envelope analysis of the numbers and political situation is complete hogwash and bullshit. And I’m a total moron, ignoramus, and idiot. IANAL, YMMV, etc. 🙂

So, according to the IRS [irs.gov], $1,175,422,000,000 in income tax was paid across 144,103,375 tax returns. This means that the average (mean) tax return paid $8,156.80 in taxes. According to the same data source, the average (median) American makes $33,048 per year. This isn’t apples to apples (mixing medians and means), but one could make the statement that your “average” American paid ~24.7% in taxes ($8,156.80 / $33,048) – a little bit more than I have seen quoted as the “average” tax paid per person.

Mitt Romney made $21,700,000 in 2010 and paid $3,000,000 in taxes [politico.com] for an effective tax rate of 13.8%. The number cited in popular media is 14%, so I suspect one (or more likely, both) of the estimates is off a bit. Of course, the national media (and Newt Gingrich, of course) are having a field day with these numbers. I’m not going to go into that, and in fact I’ve only passingly read the analysis in the news. I’ll do my own analysis, which is undoubtedly wrong and biased in some way or another. So if we neglect the fancy accounting and diverse sources of income, I would have expected Mitt Romney’s effective income tax rate to be ~35% [wikipedia]. That would mean his taxes should have weighed in at $7,595,000 – more than twice what he actually paid.

According to the same site, in 2009 there were 8,274 tax returns reporting over $10,000,000, and they totaled $240,133,885,000 in gross income. I’m not savvy enough in the reading of these spreadsheets to make a really detailed analysis – and frankly I don’t want to take the time to become so. So I’ll just make sweeping generalizations that are patently incorrect. For example, despite the fact that I know the news has been making noise about many of these people paying no taxes at all, that’s an average of $29,022,708 per return. While that’s significantly more than Romney made, it also seems pretty close, so I’ll use him as an average representative of this tax bracket.

So let’s first examine some numbers here (a quick Python back-of-envelope follows the list):

  • Romney reported as much money as 657 “average” Americans.
  • Romney paid as much money as 368 “average” Americans.
  • Romney should have paid as much money as 931 “average” Americans.
  • The “missing taxes” is equivalent to an extra 563 “average” Americans paying taxes.
  • The group as a whole should have been taxed $84,046,859,750 – which means that they would account for the taxes of 10,303,901 “average” Americans.
  • Using Romney as a representative sample (massive statistical error there – and hell, the actual numbers for this are probably available somewhere on the internet), they actually paid $33,138,476,130 – as much as 4,062,681 “average” Americans.
  • Again, using Romney as the sample, the group avoided $50,908,383,620 in taxes – as much as 6,241,220 “average” Americans. Putting that in perspective: fancy finances made it as though the entire populations of Los Angeles and Chicago didn’t pay taxes. Assume they were composed entirely of “average” hard-working Americans – for anyone out there who wants to think they’re nothing but big ghettos anyway.
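For anyone who wants to check my arithmetic (and remember: programmer, not accountant), the whole back-of-envelope fits in a few lines of Python using the IRS and Politico figures above:

    # All figures come from the IRS/Politico numbers cited above.
    mean_tax = 1175422000000.0 / 144103375   # ~$8,156.80 per return
    median_income = 33048.0

    romney_income, romney_paid = 21700000.0, 3000000.0
    print(romney_income / median_income)     # ~657 "average" incomes
    print(romney_paid / mean_tax)            # ~368 "average" tax bills
    print(romney_income * 0.35 / mean_tax)   # ~931 if he'd paid ~35%

    group_income = 240133885000.0
    should_pay = group_income * 0.35         # ~$84.0B at 35%
    actually_paid = group_income * 0.138     # ~$33.1B at Romney's 13.8%
    print((should_pay - actually_paid) / mean_tax)  # ~6.24M "average" Americans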

So really, I don’t begrudge him making all that money – he’s obviously a pretty shrewd guy and is ruthlessly taking advantage of the current tax rules.  There’s almost certainly nothing illegal in his taxes, because I’m sure he has a fantastic accountant.  In one sense, that’s exactly the kind of guy that you want to be in the Oval Office: someone that’s ruthless and knows how to set his finances straight.  Unfortunately, I don’t see him campaigning to fix this kind of behavior… and in fact I see quite a bit of campaigning to preserve and expand it.  That makes me think that once in the Oval Office he’d be excellent at setting his finances straight… and not so much the finances of the country itself.

And yet, to me Romney is the best the GOP has to offer these days. He’s the only GOP candidate that doesn’t seem to be bowing down to the whim of the tea party and their efforts to institute a Talibanesque “Christian Fundamentalist” Theocracy here in the United States. He’s the only GOP candidate that seems to be willing to stand up to those who are willing to hold the entire nation hostage with their shady back room politicking. But he is still a Republican, and “his party” is drifting farther and farther from what I hope is mainstream thinking. He’s contaminated if for no other reason than his roots and simple proximity.

I don’t know whether it’s really true anymore, but I have traditionally thought of myself as a religiously, socially, and financially conservative person. This makes tons of sense too, because I grew up in Texas – the very belt buckle of the Bible Belt. I grew up Republican, and proudly so. But somewhere along the line, something changed – either with me or with the party of my youth. Probably both, really… but from my perspective they broke faith with me first. And so the really sad thing is that I’m watching the political party of my youth divide and destroy itself.

And the most galling thing of all?  I’m glad to see it.

Filed under: Personal Life

An Introduction

First, let me introduce myself.  I am Liang Nuren from Eve Online and from Rift.  Eve-Search tells me that I was the 6th most prolific poster on the Eve Online forums with 16,517 posts between 2007-02-26 and 2011-06-24.  It also tells me that there were 17,955,165 typed characters, but this includes quoted text.  I’ve currently got 1,526 posts on the Rift forums, starting 2011-06-28. There’s an enormous amount of information out there about my opinions… and frankly, you’re free to go look at it.  I won’t rehash most of it here.

But, let’s start with where I come from.  A long time ago, in a land literally far, far away I was a little boy (imagine that).  I played on my computer and learned DOS commands and DOS scripting and even some assembly and BASIC.  When Windows rolled around, I thought it was the silliest idea in the world – why click through 10 menus to get what you want, when you can just type it on the command line in a quarter of the time!?  I’ve… somewhat revised this opinion.  But not much.

Back then, I played single player games like Infocom games, King’s Quest, and even Wasteland.  The Infocom games in particular led me to BBSing, where I found Legend of the Red Dragon and Tradewars 2002.  I do not even want to think about how much time I’ve burned on those two… but Tradewars was definitely the major time sink.  Some time later I heard about this crazy invention called “the internet”… except I don’t think it was called that at the time.  Well, I learned about MUDs there.

And there went my life.  For years.  Literally.  I think one of the first games I discovered was The Legend of Terris on AOL.  I played that until it went pay to play, at which time I was forced to quit.  Sadly, I don’t remember my name from there… but I’m sure it was some form of “Red Dwarf” or “Black Dwarf” or some such.  It might have even been some Latin phrase – I think by that time I was in high school where I took 3 years of Latin.  After that, I moved on to other MUDs – some of which I stayed at for a very long time.

I suppose this is a good time to point out that these MUDs are what made me decide that I wanted to know how to “code”. There are only so many times you can wish something operated a different way before you look into how to do it yourself! I remember that I joined the MUD-Dev list, as well as the ROM-Dev list, and maybe a couple of other MUD development lists. I remember reading email chains by the original Everquest devs (IIRC) as they debated classes vs classlessness, levels vs levellessness, and more. I find myself wishing I could go back and read those email threads again. But, I digress.

So I “learned to code” from video games – and from taking CS classes in high school – around this time. I landed a sweet gig as a Jr Web Designer at a local web design shop, but they got bought by a bigger firm and we all got laid off. I met Manuel LaBore and finished out high school… and then didn’t go to college. What a boneheaded move! But, I met a wonderful lady and married her just in time for the next school year to start… and, well, I went to college. I blew through college in 3 years, majoring first in CS and then in Math. I worked more than full time through it too – which was a necessity when you have a wife and kids.

And then… and then the .com bubble burst. This made it kinda hard to get a job, and I didn’t get one in my industry for a while. But oh man, the job I got when I got it! I did some majorly cool projects at that company – from Perl monkeying to data warehousing to distributed processing, first with Hadoop and then with our own custom solution. Very, very cool stuff with awesome people. I eventually moved on after 4-5 years, and I’m now helping a new company put together a cool distributed processing data warehouse. The problems aren’t as hard as at my last employer, but the work environment is hard to beat and they’re letting me play with cool toys. Fun, fun!

So what can you expect to read about in this blog?  Well, there’ll be game theory discussions, for sure.  There’ll probably also be a fair amount of programming talk.  Hit me up on Twitter (@LiangNuren) if you want.  I’ll see if I can snag some of my old blog entries from my Evepress site and put them up here.

-Liang

Filed under: Personal Life