Liang Nuren – Failing In So Many Ways

Pessimistic vs Optimistic Memoization

def memoize(obj):
    cache = obj.cache = {}

    def memoizer(*args, **kwargs):
        if args not in cache:
            cache[args] = obj(*args, **kwargs)
        return cache[args]
    memoizer.cache = cache  # expose the live cache on the decorated function
    return memoizer

I think one of the first things that anyone does with a dynamic language is write a memoizer to save time on expensive calculations. I like this memoizer because it exposes the cache for clearing – an important feature in tests. Clearing can be accomplished by keeping a list of memoized functions and manually resetting their caches. The actual cache reset looks something like this:

@memoize
def method_name(some_arg):
    return some_arg + 1

method_name.cache.clear()
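For the test-suite reset described above, a minimal sketch might keep a registry of decorated functions so one call can cold-start every cache. The registry and function names here are mine, not from the post:

```python
# Minimal sketch; the decorator is reproduced so the example is self-contained.
def memoize(obj):
    cache = obj.cache = {}

    def memoizer(*args, **kwargs):
        if args not in cache:
            cache[args] = obj(*args, **kwargs)
        return cache[args]
    memoizer.cache = cache  # expose the live cache for clearing
    return memoizer

# Registry of memoized functions so tests can reset every cache at once.
MEMOIZED = []

def tracked_memoize(obj):
    wrapped = memoize(obj)
    MEMOIZED.append(wrapped)
    return wrapped

def reset_all_caches():
    # Call from test setup/teardown to guarantee cold caches.
    for fn in MEMOIZED:
        fn.cache.clear()
```

Note the use of `cache.clear()` rather than rebinding `fn.cache = {}`: the memoizer closes over the original dict, so only an in-place clear actually empties it.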

However, I think there are several things we should know about this particular decorator before just using it:

  • It does not properly account for **kwargs. The thing to remember here is that **kwargs is implicitly a dictionary – an unhashable object. There are several popular methods of hashing a dictionary, but far and away the most popular appears to be hashing on frozenset(dict.items()). Another much less popular way is tuple(zip(dict)). We’ll do some testing to determine which is superior. One important thing to remember here is that large numbers of kwargs and long variable names can lead to quite a performance penalty no matter which one is ultimately better.
  • It does not properly handle unhashable or nested arguments. I think this is probably an acceptable limitation because solving it imposes a large penalty on both code performance and code maintainability. I think it is imperative to have a proper test suite to ensure that @memoized methods are not passed unhashable or nested arguments.
  • There appears to be two competing ways to do caching in Python. The first is the Look Before You Leap approach that conventional wisdom dictates, and is the one used here. Some cursory thought on the matter tells me that a more optimistic method of handling cache hits with try/except might work better. We’ll do some testing to determine which is superior.
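To make the candidates concrete, here is a hedged sketch of the two key-construction schemes and the two lookup styles being compared. The function names are mine, not the benchmark code from the post:

```python
def frozenset_key(args, kwargs):
    # The popular scheme: an order-insensitive, hashable view of the items.
    return (args, frozenset(kwargs.items()))

def zip_key(args, kwargs):
    # The tuple(zip(...)) scheme from the post; note that zipping the
    # dict alone iterates only its keys, not its values.
    return (args, tuple(zip(kwargs)))

def lookup_pessimistic(cache, key, func, args, kwargs):
    # Look Before You Leap: a membership test on every call.
    if key not in cache:
        cache[key] = func(*args, **kwargs)
    return cache[key]

def lookup_optimistic(cache, key, func, args, kwargs):
    # EAFP: assume a hit and pay the exception cost only on a miss.
    try:
        return cache[key]
    except KeyError:
        value = cache[key] = func(*args, **kwargs)
        return value
```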

Each caching strategy was tested over a list of 1 million tuples that utilize kwargs.  The numbers in the legend represent the basic cardinality of the unique values in the tested list.  The cache miss rate can be found by dividing the cardinality by 1 million.  Each memoization strategy was tested 20 times and the test results here are the average. I think that a picture is worth a thousand words, and so I’ve included a pretty graph. However, I’ve also included the base data below.
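The benchmark harness itself isn’t shown in the post; a minimal sketch of how such a comparison might be driven (the function names and the use of timeit are my assumptions, not the original code) could look like:

```python
import random
import timeit

def make_inputs(iterations, cardinality):
    # Tuples drawn from `cardinality` unique values, mirroring the setup
    # described above: miss rate ~ cardinality / iterations.
    return [(random.randrange(cardinality),) for _ in range(iterations)]

def bench(memoize, inputs):
    # Time one pass of a memoized function over the whole input list.
    @memoize
    def work(n):
        return n + 1

    def run():
        for args in inputs:
            work(*args)

    return timeit.timeit(run, number=1)
```

Averaging 20 runs of `bench` per strategy, as the post does, smooths out scheduler noise.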

I think there should be several takeaways from this:

  • tuple(zip(dict)) is superior to frozenset(dict.items())
  • Optimistic caching (try/except) is generally superior to Look Before You Leap (key in cache)
  • There is a noticeable performance penalty for caching on kwargs. It might be worth having several memoize annotations and using the most appropriate one.
  • Lots of **kwargs with long names causes a major performance penalty

This is the final version of the memoizer (many thanks to Bryce Drennan in the comments for catching a bug in the memoizer):

def memoize(obj):
    cache = obj.cache = {}

    def memoizer(*args, **kwargs):
        key = (args, tuple(zip(kwargs.iteritems())))
        try:
            return cache[key]
        except KeyError:
            value = obj(*args, **kwargs)
            cache[key] = value
            return value
    memoizer.cache = cache
    return memoizer
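The listing above targets Python 2 (`iteritems`, old-style `except`). A Python 3 rendition of the same optimistic strategy might look like the following; note that I’ve swapped the zip-based key for `tuple(sorted(kwargs.items()))`, which makes the key independent of keyword order – my substitution, not the variant benchmarked in the post:

```python
def memoize(obj):
    cache = obj.cache = {}

    def memoizer(*args, **kwargs):
        # Sorting the items makes the key deterministic regardless of the
        # order the keyword arguments were passed in.
        key = (args, tuple(sorted(kwargs.items())))
        try:
            return cache[key]          # optimistic: assume a hit
        except KeyError:
            value = cache[key] = obj(*args, **kwargs)
            return value

    memoizer.cache = cache             # expose the cache for test resets
    return memoizer
```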

This is the previous version of the memoizer:

def memoize(obj):
    cache = obj.cache = {}

    def memoizer(*args, **kwargs):
        key = (args, tuple(zip(kwargs)))
        try:
            return cache[key]
        except KeyError:
            cache[key] = value = obj(*args, **kwargs)
            return value
    return memoizer

Raw data

Tiny (Iterations: 1000000, Cardinality: 100)

  • Reference : 1.2129
  • Set : 3.8267
  • Zip : 3.0283
  • Pessimistic : 3.0055
  • Optimistic : 2.4478
  • No kwargs Reference : 0.5133
  • No kwargs Pessimistic : 1.1473
  • No kwargs Optimistic : 0.9309

Low (Iterations: 1000000, Cardinality: 10000)

  • Reference : 1.3167
  • Set : 4.5701
  • Zip : 3.4687
  • Pessimistic : 3.5359
  • Optimistic : 2.9393
  • No kwargs Reference : 0.6553
  • No kwargs Pessimistic : 1.3239
  • No kwargs Optimistic : 1.1201

Med (Iterations: 1000000, Cardinality: 99995)

  • Reference : 1.6757
  • Set : 4.9049
  • Zip : 3.8719
  • Pessimistic : 3.8955
  • Optimistic : 3.2962
  • No kwargs Reference : 0.9838
  • No kwargs Pessimistic : 1.7194
  • No kwargs Optimistic : 1.5371

Filed under: Software Development

Finding Unused Functions in a Python Source Tree

So for the last few months I have been crunching up a storm on getting the analytics out the door for our latest game.  I just finished some cool new features and had to do some major refactors (one step at a time, of course).  I began to suspect that I’d left some stray code hanging out there that wasn’t being used anymore.  I figured a great way to solve the problem was by looking at each function and grepping the source tree to see if it existed anywhere else.  First, let me introduce you to ack-grep (aliased as ‘ack’ below).  The best thing about ack-grep is that it lets you filter out certain file types from your results – and despite being written in Perl it’s frequently much faster than grep.

Now I’ll go over the evolution of a shell command.  Some of the steps are injected here because I knew where I was going – but for your benefit I’ll explain how things evolved.  The code base in question is about 10k lines over 140 files.

$ ack 'def'

… bunch of functions and a bunch of other stuff …

$ ack ' def '

… function definitions …

Now let’s get the function name itself.  It should look like 'def function_name(arguments):' so the first order of business is to cook up a regular expression to filter the line.  I’m better with Perl than with sed, so I did it like this:

$ ack 'def ' | perl -pe 's/.*def (\w+)\(.*/\1/'

I scrolled through looking for any obvious errors but didn’t find any.  Now comes the magic sauce.  xargs -I{} creates a new command for every input line and inserts the input row where {} exists.  Basically, this creates a for-each loop in shell.

$ ack 'def ' | perl -pe 's/.*def (\w+)\(.*/\1/' | xargs -I{} bash -c 'ack "{}"'

One thing I saw was that function usages were being attributed to functions that *contained* their name.

$ ack 'def ' | perl -pe 's/.*def (\w+)\(.*/\1/' | xargs -I{} bash -c 'ack "\b{}\b"'

That’s better, but functions that are defined multiple times (like __init__) are coming up a lot…

$ ack 'def ' | perl -pe 's/.*def (\w+)\(.*/\1/' | sort | uniq | xargs -I{} bash -c 'ack "\b{}\b"'

Ok, but now the text is just flying by so fast and some functions are used hundreds of times and others very few…

$ ack 'def ' | perl -pe 's/.*def (\w+)\(.*/\1/' | sort | uniq | xargs -I{} bash -c 'ack "\b{}\b" | wc -l'

Ok, now I know how many times something appears but not what it is…

$ ack 'def ' | perl -pe 's/.*def (\w+)\(.*/\1/' | sort | uniq | xargs -I{} bash -c 'echo {} && ack "\b{}\b" | wc -l'

Ok, that’s great.  Now I have the name of the function and how many times that name appears in the source tree.  It’s kinda unwieldy though because the name is on a different line than the count and I can’t use grep.

$ ack 'def ' | perl -pe 's/.*def (\w+)\(.*/\1/' | sort | uniq | xargs -I{} bash -c 'echo -n {} && ack "\b{}\b" | wc -l'

Fantastic.  Now I can look at just the functions that rarely get used!  Oh look, there’s so many tests coming up… :-/

$ ack 'def ' | perl -pe 's/.*def (\w+)\(.*/\1/' | sort | uniq | xargs -I{} bash -c 'echo -n {} && ack "\b{}\b" | wc -l' | egrep -v '^test_'

Ok, now let’s limit it to things appearing once…

$ ack 'def ' | perl -pe 's/.*def (\w+)\(.*/\1/' | sort | uniq | xargs -I{} bash -c 'echo -n {} && ack "\b{}\b" | wc -l' | egrep -v '^test_' | egrep ' 1$'

Fantastic.  That’s exactly what I was looking for.
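For what it’s worth, the same hunt can be approximated in a short Python script.  This is a sketch of the pipeline’s logic rather than the shell commands above – the function name, regex, and test_ filter are my own rendering:

```python
import os
import re
from collections import Counter

# Matches "def function_name(" at the start of a (possibly indented) line.
DEF_RE = re.compile(r'^\s*def\s+(\w+)\s*\(')

def find_rarely_used(root, threshold=1):
    # Slurp every .py file under root, like ack restricted to Python files.
    sources = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith('.py'):
                path = os.path.join(dirpath, name)
                with open(path) as fh:
                    sources[path] = fh.read()

    # Collect defined function names (the set plays the role of sort | uniq).
    names = set()
    for text in sources.values():
        for line in text.splitlines():
            match = DEF_RE.match(line)
            if match:
                names.add(match.group(1))

    # Count whole-word occurrences of each name, like ack "\b{}\b" | wc -l.
    counts = Counter()
    for name in names:
        pattern = re.compile(r'\b%s\b' % re.escape(name))
        for text in sources.values():
            counts[name] += len(pattern.findall(text))

    # A name used only at its own "def" site appears exactly once.
    return sorted(name for name, count in counts.items()
                  if count <= threshold and not name.startswith('test_'))
```

The shell version is still handier for interactive exploration, but the script form is easier to drop into a CI check.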

Filed under: Software Development