Self-optimizing data structures: using types to make lists faster

There are lots of optimization opportunities in a statically typed, purely functional language. This post describes an approach that improves the performance of off-the-shelf Haskell list functions by about 25% on average. The code is available as a drop-in replacement for Data.List.inkscape

This is part of an ongoing series on high-performance Haskell programming. Related posts include stream fusion for lists, programming the inliner, Judy arrays and Haskell, making binary parsing faster and applying stream fusion to arrays.


Statically typed, purely functional code is a sweet spot for optimization hackers. All that information available at compile time lets us write some pretty radical transformations, and know they are safe. Stream fusion, for example, relies on having type and effect information available statically, and uses that to make complexity-changing rewrites of your code. Clever algebraic transformations such as call-pattern specialization can make some pretty aggressive performance improvements. We can also make the runtime more efficient: by introducing parallelism, or again, exploiting type and effect information via pointer tagging. All this is easier in a strongly, statically typed and pure setting — there’s just less stuff you have to worry about when doing the transformations.

This post introduces an optimization technique based on types: self-optimizing polymorphic data types. When instantiated to a particular type, the code presented will generate both a custom, specialized data structure, and specialized functions over that type. The result is a general mechanism to improve the data density, and thus performance, of parametrically polymorphic data structures (where the same code operates generically on any type), such as lists, trees and finite maps. The general concept should be useful in any language with parametrically polymorphic types, and its implementation in Haskell, via class-associated data types is particularly elegant (requiring no compiler modification).


Parametric polymorphism is a great programmer tool. It improves expressivity, by letting the programmer use the same code at any possible type, yet still retains type safety. For example, the list type is a polymorphic type, and the length function computes the length of any lists, no matter what the element type is.

-- The list type: either empty, or cons, holds any type 'a'
data [] a = [] | a : [a]

Prelude> :t length
length :: [a] -> Int

The type system ensures that we can never write length to depend on a property of the element type, ensuring the code is maximally generic. It is simply not possible to mess with the elements of the list, given the type. Any attempt to do so will be a compiler error. This gives us some strong guarantees of abstraction, which we will take advantage of for optimization purposes.

Such polymorphic types and functions are flexible, but we pay a cost for this. In order for the code to work on any type ‘a’, we must ensure that all values have a uniform representation. In most languages, this is via an object or closure – some reference to heap allocated data – and Haskell is no exception. All polymorphic components will be stored as pointers to heap allocated thunks.

You can see this tradeoff most vividly in this picture of a list of pairs:

The uniform representation means that all values are represented via pointers to values, so the same code (e.g. map, filter) will work on any element type. But look .. we have cons nodes with pointers to pair nodes, which in turn point to integer nodes. (Also note the explicit sharing that is visible – the integer 1 and 2 are shared between multiple uses – made possible by purity).

Such a structure is generic, but has poor data density. If the compiler could see that the integer nodes hold word-sized values, it could unpack them into their constructors. GHC can already unpack monomorphic types, removing all indirection and ‘thunkiness’ associated with the data, via a manual unpack pragma at the data declaration site:

data PairIntInt = Pair {-# UNPACK #-} Int {-# UNPACK #-} Int

The Int values will now just be stored directly as word-sized fields in the Pair heap object. GHC can also infer the location of these pragmas if you use strict fields, and enable the -funbox-strict-fields optimization:

data PairIntInt = Pair !Int !Int

However, this technique only works on monomorphic fields. It does not make sense for polymorphic fields, since they have variable size, and the compiler does not (except in very specific circumstances) have access to information about what types are in use.

To unpack, say, all elements of a list of pairs, we need to instead generate a custom type:

data ListPairs  = Empty
                | Cons !Int !Int ListPairs

This will happily represent each node with all elements stored locally, but is no longer polymorphic – so any functions have to be written specially for this type. The result however, is a much denser data representation:

Note that unpacking changes the evaluation strategy of the data. Those elements are now strict. The spine of the list is still lazy, however, so it is still just as useful a type for control structures and generators. GHC is remarkably flexible about the control the programmer has over the evaluation strategy (strict, lazy, mixture), so we’ll take advantage of that in this library to produce a spine-lazy, element strict list type that is able to do this unpacking and specialization for us. I encourage you to explore mixed strict and lazy structures as well.

(An aside: these graphs were generated directly from the live Haskell heap via vacuum – another great new tool for Haskell development. I’ve not written about this yet, but you can see a video of it in use here. You can trace any Haskell heap structure to a graph representation, saving it as svg or png, or inspecting it live from the REPL).

So, how do we teach GHC to do this transformation automatically, and how do we do that without sacrificing code reuse — I only want to write the list library once!

Class-associated data types

Our problem can be stated as:

  • when we use a data structure at a particular element type, we require both custom functions, and a custom data type, to be generated

This gives us an insight into how to produce a solution. Type classes in Haskell are precisely the mechanism for having per-type functions. And a recent major extension to type classes has been the addition of class-associated data types (in their most general form, as type families).

These two features together will allow us to specialize the container type at each element type, and generate specialized functions for each use, removing “the uniform-representation tax” and improving performance across the board.


Where previously, the data type for lists is defined as:

data [] a = [] | a : [a]

we will instead associate the data type with a class, and leave the implementation to later: We also have to lift  the introduction and elimination forms for the type into the class (as we have no explicit constructors or pattern matching anymore):

class AdaptList a where
     data List a
     -- | The empty list
     empty   :: List a

     -- | Prepend a value onto a list
     cons    :: a -> List a -> List a

     -- | Is the list empty?
     null    :: List a -> Bool

     -- | The first element of the list
     head    :: List a -> a

     -- | The tail of the list
     tail    :: List a -> List a

Now we just need to write some functions over this abstract list type. First, conversion, to and from “real” lists:

-- | /O(n)/, convert an adaptive list to a regular list
toList :: AdaptList a => List a -> [a]
toList xs
 | null xs   = []
 | otherwise = head xs : toList (tail xs)

-- | /O(n)/, convert an adaptive list to a regular list
fromList :: AdaptList a => [a] -> List a
fromList []     = empty
fromList (x:xs) = x `cons` fromList xs

Note how we have to replace explicit pattern matching with calls to the elimination forms (head, tail, uncons), and constructors are replaced with calls into the type class methods (cons, empty). We will see that GHC is able to inline away all the type class dictionaries, resulting in specialized code when used at any specific type.

Some general functions on lists: map, fold, filter:

-- | /O(n)/. 'map' @f xs@ is the list obtained by applying @f@ to each element
-- of @xs@
map :: (AdaptList a, AdaptList b) => (a -> b) -> List a -> List b
map f as = go as
 go xs
   | null xs   = empty
   | otherwise = f (head xs) `cons` go (tail xs)
{-# INLINE map #-}
-- | /O(n)/. 'foldl', applied to a binary operator, a starting value (typically
-- the left-identity of the operator), and a list, reduces the list
-- using the binary operator, from left to right
foldl :: AdaptList b => (a -> b -> a) -> a -> List b -> a
foldl f z0 xs0 = go z0 xs0
 go !z xs
   | null xs   = z
   | otherwise = go (f z (head xs)) (tail xs)
{-# INLINE foldl #-}

-- | /O(n)/. 'filter', applied to a predicate and a list, returns the list of
-- those elements that satisfy the predicate
filter :: AdaptList a => (a -> Bool) -> List a -> List a
filter p xs0
 | null xs0  = empty
 | otherwise = go xs0
  go xs
   | null xs     = empty
   | p x         = x `cons` go ys
   | otherwise   =          go ys
   where x  = head xs
         ys = tail xs
{-# INLINE filter #-}

These implementations are essentially identical to the Haskell98 Prelude definitions, but with all constructors replaced with type class methods, and pattern matching replaced with “smart deconstructors”. Also, all recursive functions are written in worker/wrapper style, so they may be inlined (crucial to enable the function to be specialized).

So now we have an interface to lists and functions on them, but no concrete list type defined. We will do this via a class-associated type, with one instance for each element type we care about. This is a generic strategy, and so can be automated via some SYB-style generic code generation tricks. The result is a set of instances enumerating all the different specialized data representations statically:

instance AdaptList Int where

 data List Int = EmptyInt | ConsInt {-# UNPACK #-}!Int (List Int)

 empty = EmptyInt
 cons  = ConsInt

 null EmptyInt = True
 null _ = False

 head EmptyInt = errorEmptyList "head"
 head (ConsInt x _) = x

 tail EmptyInt = errorEmptyList "tail"
 tail (ConsInt _ x) = x

And so on for all the other types. Now, when we write a function on say, List Int, GHC will replace all our code with the specialized data type, where the Int elements are unpacked. All calls to the smart constructors and deconstructors will be replaced with the specialized versions as well.

We should thus get denser, faster list types and functions.

You can see the full implementation here, as well as the same technique applied to sums and products, and with the full list API filled out. I’m calling these “adaptive” data structures, as they adapt to a different shape based on how you use them.

Let’s benchmark and see if things get any faster. As usual, I will use criterion to do the measurement and graphing. The best open source benchmarking library around – makes benchmarking fun, the way QuickCheck makes testing fun.

Benchmarks: map

In these benchmarks, I’m measuring the standard GHC 6.10 Data.List functions, and the base list data type, against the version on Hackage. In each case, I generate 100k random integers (using the mersenne-random generator), build a list from those numbers, and measure the time to traverse the structure using each pair of functions. Criterion takes care of running the functions hundreds of times until meaningful data can be extracted. The execution time probability graphs are presented here. Note that they’re on different scales each time.

So, the numbers. Specialized map is 34% faster. Mean goes from 31.9ms to 21.2ms. Garbage collection is reduced by 40% (which seems typical across the board for these functions.

Regular Haskell list type and functions:

Specializing list map::

Looking at the core, we see that when used at an Int type, we get a specialized loop with unboxed parameters generated (no additional un- or re-boxing is required to store values back into the data structure!). This is a transformation GHC can’t do automatically, due to the uniform representation requirement.

Here’s the core for ‘adaptive sum’, taken from ghc-core,

$wgo :: Data.Adaptive.List.List Int -> Int# -> Int#
$wgo xs n = case xs of
    EmptyInt     -> n
    ConsInt y ys -> $wgo ys (+# n y)

Compared to regular lists, we see the additional unboxing of the list parameter that is required:

    $wlgo :: Int# -> [Int] -> Int#
    $wlgo n xs = case xs of
        []     -> n
        (y:ys) -> case y of
                     I# y# -> $wlgo (+# n y#) ys

Benchmarks: filter

Filter is 15% faster. Mean goes from 42.5ms to 36.2ms.

data.list.filter-densities-800x200Specialized version:

data.adaptive.list.filter-densities-800x200Benchmarks: foldl

Reducing a list is 23% faster (sum, length, maximum, product, minimum). Mean goes from 11.2ms to 8.7ms.

data.list.foldl'-densities-800x200and when specialized:



One of the obivous questions is: what about fusion? Stream fusion is an entirely different automatic optimization that optimizes multiple traversals over a data structure into a single, more complicated traversal, dramatically improving time and space usage. The data structures and functions shown in this post complement fusion well — we can still fuse loops, on speciazlied data and functions, removing unnecessary traversals. Stream fusion also enables opportunities for specialization, which will no longer be of much use, as our data types have specialized themselves.

We also lose sharing in this approach. Sorry, but your Int fields will be copied by value. The approach makes most sense for small atomic types anyway, which may be unpacked, and these are precisely the types where sharing is unlikely to be helpful.

Essentially, we now treat the list type and constructors as an abstract view on to some custom, specialized representation. We may be able to regain pattern matching on the abstract structures via GHC’s view patterns. I’ve not tried this yet, but it looks easy enough.

The list implementation is a useful proof of concept — you can use it in your list-heavy programs today if you want. The most promising direction, though, in my view, is for container types such as IntMap and Sequence, where already the constructors are abstract, and there’s a small set of primitive operations, and we really care about data density. There may be some big wins for coalescing of nodes in IntMap, for example.


In summary, we have a general approach to allow, for the first time, uniform specialization of both parametrically polymorphic functions and data, resulting in good speedups with only library modifications required. The Haskell type system, via class-associated type classes, is sufficiently expressive to describe abstract and concrete representation mappings, and the GHC optimizer (particular the inliner, and the unpacking pragma) are enough to precisely control the layout of data, along with the generation of custom functions for each type. The approach should give speedups in any language with a uniform representation policy, though it is certainly easier when you don’t need to modify the compiler.The approach here is very specific, and may be subsumed by more general global information optimization techniques, such as super compilation (coming to GHC). We also rely on the magic of the GHC Inliner to remove all traces of type class dictionaries.

The next steps will be to combine this with list fusion, to have the fastest lazy lists I know of, and to try it out on IntMap and friends, to yield faster polymorphic container types.

The result is satisfying: your Haskell code goes faster, while remaining high level and pure.

Further reading

This is the first time I’ve written down this approach, but the idea has been kicking around for a few months. You can see my WG2.8 talk on the subject here, and a video of my talking about initial experiments in this direction at the Utrecht hackathon. I’d be interested if readers know of any related work for this joint-specialization of data and function code, to eliminate the indirection cost of parametric polymorphism.

Fast mutable collections for Haskell: more benchmarks

Today at HacPDX I did more work on the Judy bindings described yesterday. Here is judy 0.2.1 up (lunchtime release). These are bindings to the (in)famous judy arrays library, allowing Haskell keys and values to be stored in fast, mutable maps and hash structures. You can currently index via words (i.e. you have to hash keys yourself), and you can store value types or pointers to Haskell values allocated with stable pointers.

Today, I started by filling out the API some more, and benchmarking of how each API function scales. The functions I’m interested in are:

size ::  JudyL a -> IO Int
insert :: JA a => Key -> a -> JudyL a -> IO ()
lookup :: JA a => Key      -> JudyL a -> IO (Maybe a)
delete ::         Key      -> JudyL a -> IO ()

Let’s see how the performance of these functions scales with input. Benchmarking is performed via criterion (newly released to the public!) with random input keys and values generated from the mersenne twister.

Implementation methodology

My general strategy with constructing a high performance Haskell library is to:

  • Pick a candidate benchmark set.
  • Pick a good candidate data structure to make the benchmark faster.
  • Carefully implement the type, or the binding to the foreign type and its representation in Haskell
  • Carefully abstract out the unsafeties in any low level interface via type or language features.
  • Function-by-function, check by eye the optimized code that GHC produces (with ghc-core) (in particular, inlining, specialization and unboxing).
  • Function-by-function, benchmark the absolute performance as input grows (with criterion), with reference to model structures.
  • Use QuickCheck to show correctness against a model, with HPC to confirm code coverage of the test.
  • Document extensively with haddock.
  • Collect ideas on how to generalize the interface (via type classes, more flexible types, representation-changing interfaces via type families).


The size function tells you how many keys are in the Judy array. It has O(1) complexity, as we see from empirical measurement, here with randomized arrays with Int values and keys. We get the same cost whether it is 100k, 1M or 10 million elements in the structure. size is fast and scales.


A common operation on associative arrays is lookup(). It better be fast, and not degrade.


size-10M-densities-800x200If we plot the mean lookup times against N, we’ll see the lookup time grows slowly (taking twice as long when the data is 100 times bigger). (N.B. I forgot to change the titles on the above graphs).


Delete has an almost identical profile to insert, unsurprisingly. Time to delete an element grows very slowly with N, and in absolute terms if very fast.




Very fast, scalable mutable maps and hashes for Haskell

I’m at HacPDX working on judy, a finite-map like interface to the classic judy arrays library, which provides fast and scalable mutable collection types for Haskell. While developing this library, where performance is the primary concern, I’m making heavy use of Bryan O’Sullivan’s new criterion benchmarking and statistics suite, announced recently at the Haskell Implementors Workshop.

Criterion is awesome. It is going to change how we design high performance libraries in Haskell. Read on to see why.

The Judy bindings for Haskell store Haskell keys and values in judy arrays, with resources tracked by a ForeignPtr on the Haskell heap. They provide a mutable, imperative collection type, similar to the old Data.HashTable (now slated for removal), but with an interface closer to Data.Map.

The key benefit over previous hashtable implementations for Haskell is that the judy bindings scale very well, as we shall see. Also, unlike, say, Data.IntMap, judy arrays are mutable, making them less useful for some applications. They are not a solution for all container problems. However, if you need large scale collections, the judy binding might be appropriate.

The library is under active development, and currently approximates a mutable IntMap structure, with more work planned to add optimized hashes, type-family based representation switching, and more. It has a straight forward interface:

new    :: JA a => IO (JudyL a)
insert :: JA a => Key -> a -> JudyL a -> IO ()
lookup :: JA a => Key      -> JudyL a -> IO (Maybe a)
delete ::         Key      -> JudyL a -> IO ()

Let’s look at the performance profile.


Here we measure the cost for inserting 1k, 100k and 10M consecutive word-sized keys and values, over repeated runs. Criterion takes care of the measurement and rendering. First 1k values:


Above, we see the timings for 100 runs of inserting one thousand values into an empty judy array. The fastest times were around 0.22ms (220 microseconds) to build the table of 1000 values. The slowest was around 230 microseconds. A tight cluster.

Criterion can then compute the probability density curve, showing a good clumping of times.


Next we see that inserting one hundred thousand elements takes about 100x longer. That is, it scales linearly, as the docs suggest it should. We go from 1k elements in 0.2ms to 100k in 20ms.


Here we see a good clustering of values again. There’s very little variance. I can reliably insert 100k elements in 20ms.

Now the bigger test, 10 million elements. Again, the performance of the Judy bindings scales linearly with input size. 0.2ms for 1k, 20ms for 100k, ~2.2s for 10M, though there are two distinct performance bands at 10M elements.


The density function shows the performance clusters quite clearly. A peak around 2.2s and a broader peak around 2.4s.


Judy arrays scale very, very well. I was able to insert 100 million elements in 22s (10x slower again), using < 1G of memory. IntMap at equilvalent N exhausted memory.


The main problem with Data.HashTable is its reliance on the garbage collector to not get in the way. As the hash sizes grow, heap pressure becomes more of an issue, and the GC runs more often, swamping performance. However, for smaller workloads, and with decent default heap sizes Data.HashTable outperforms the Judy bindings:


While at larger sizes Data.HashTable performance degrades, taking on average 5.6s to insert 10M elements (using +RTS -H1000M -A500M).

insert-hash-10M-densities-600x200With default heap settings Data.HashTable has very poor performance, as GC time dominates the cost.

insert-hash-10M-densities-600x200That is, the Data.HashTable, at N=10M is 20x slower than Judy arrays, and with optimized heap settings, Data.HashTable is 2.5x slower than judy.

Data.IntMap and Data.Map

We can also measure the imperative lookup structures againts persistance, pure structures: Data.IntMap (a big-endian patricia tree) and Data.Map (a size-balanced tree).

Data.IntMap at N=10M, with default heap settings takes on average 7.3s to insert 10M elements, or about 3.3x slower than judy arrays (and slower than the optimized HashTable).

Data.Map at N=10M, with default heap settings, takes on average 24.8s to insert 10M elements, or about 11x slower than judy arrays.


At small scale, (under 1M elements), for simple atomic types being stored, there are a variety of container types available on Hackage which do the job well: IntMap is a good choice, as it is both flexible and fast. At scale, however, judy arrays seem to be the best thing we have at the moment, and make an excellent choice for associative arrays for large scale data. For very large N, it may be the only in-memory option.

You can get judy arrays for Haskell on Hackage now.

I’ll follow up soon with benchmarks for lookup and deletion, and how to generalize the interface.

Data.Binary: performance improvments for Haskell binary parsing

I’ve just uploaded to Hackage Data.Binary which contains some good performance improvements for binary parser users.

Data.Binary is a popular serialization library for Haskell, which uses Get and Put monads to efficiently translate Haskell structures to and from lazy bytestrings (basically, byte streams) in a pure style. It is used by 85 other packages on Hackage, for everything from network packet parsing, trie container types, network traffic analysis, to web server state persistance, and high speed cryptography.

As a refresher, recall that Binary provides both a Get and Put monadic environment. The Put monad gives you a locally scoped “buffer filling” mode, where calls to ‘put’ implicitly append values to a buffer. A sequence of ‘puts’ is a pure computation that returns a bytestring at the end. Like so:

    runPut :: Put -> ByteString
        -- runPut takes a serializer code fragment,
        -- and uses it to fill a bytestring with bytes.

    Data.Binary> runPut (do put 2; put (Just 7); put "hello")

You can also stream these into the zlib binding to gain on the fly compression (via bzlib or zlib). Conversely, you can parse Haskell data from binary formats, using the ‘Get’ monad. Given a bytestring stream, parse it into some Haskell structure:

    runGet :: Get a -> ByteString -> a
       -- runGet takes a binary parser, a bytestring,
       -- and returns some 'a' parsed from the bytesting.

    Data.Binary.Get> let s = pack "\NUL\NUL\NUL\NUL\STX\SOH
    Data.Binary.Get> runGet (do a <- get
                                b <- get
                                c <- get
                                return (a :: Integer,b :: Maybe Integer ,c :: String)) s
    (2,Just 7,"hello")

There are also default Get and Put parsers and serializers in the Binary type classes, which provides ‘encode’ and ‘decode’ methods:

    encode :: Binary a => a -> ByteString
    decode :: Binary a => ByteString -> a

Binary is used for heavy lifting binary wire protocol parsing and writing. As such it needs to be very fast. It uses aggressive inlining, careful data representations, specialization, and some interesting monadic representation transformations, to ensure GHC can optimize user-supplied parsers well. The changes, thanks to Simon Marlow, are two-fold, and improve the performance of reading into Haskell data structures by about 40%.

Firstly, the underlying Get monad representation is changed from:

    newtype Get a = Get { unGet :: S -> (a, S) }


    newtype Get a = Get { unGet :: S -> (# a, S #) }

That is, we use an explict stack/register allocated unboxed tuple to return the parsed value from the parser. This allows GHC to more aggressively optimize code containing repeated reads from the parser state. (As an aside, we also looked at a continuation-based encoding of the Get monad, but there is no performance win due to the very simple tuple return type, which GHC is already able to decompose nicely. Continuation encodings are more useful if we returned, say, an Either).

Secondly, the monadic bind for the Get monad is now strict. Moving from an explicitly lazy:

    m >>= k   = Get (\s -> let (a, s') = unGet m s in unGet (k a) s')


    m >>= k   = Get $ \s -> case unGet m s of
                             (# a, s' #) -> unGet (k a) s'

This seems to also improve the kind of code GHC generates, which coalescing of repeated reads into straight line code without tests.

Overall, there is no user-visible change to Data.Binary, as only the monadic plumbing has been modified. Your code should just get faster. Parsing streams of Word64s on my laptop goes from ~100M/s to 140M/s.

Haskell for Everyone: Hackage and the Haskell Platform : Haskell Implementors Workshop 2009

“FP Super Fun Week”, aka ICFP + Haskell Symposium + CUFP + Haskell Implementors Workshop + DEFUN has come and gone.

Malcolm Wallace did an amazing job recording most of the sessions, so as a result we have the talk and discussion about Hackage and the Haskell Platform.

You can follow along with the slides here.

Haskell for Everyone: Hackage and the Haskell Platform


Improving Data Structures with Associated Types

Original .PDF

This talk describes how to view live heap structures in Haskell, including sharing and unboxing information, and then to use insights on the representation of data to improve the performance and space, for the first time, of uniform polymorphic data types by changing their representation. We’ll use associated data types to guide regular optimizations of polymorphic structures.

The library described in this talk is available on hackage, along with the visualization tool used to build it. There’s also a screencast of the visualization tool in use.

This talk was originally presented at WG2.8 in Frauenchiemsee, Germany in June 2009.

Stream Fusion for Haskell Arrays

PDF Version

Arrays have traditionally been an awkward data structure for Haskell programmers. Despite the large number of array libraries available, they have remained relatively awkward to use in comparison to the rich suite of purely functional data structures, such as fingertrees or finite maps. Arrays have simply not been first class citizens in the language.

In this talk I’ll begin with a survey of the more than a dozen array types available, including some new matrix libraries developed in the past year. I’ll then describe a new efficient, pure, and flexible array library for Haskell with a list like interface, based on work in the Data Parallel Haskell project, that employs stream fusion to dramatically reduce the cost of pure arrays. The implementation will be presented from the ground up, along with a discussion of the entire compilation process of the library, from source to assembly.

The library described in this talk is available on hackage, and is now used by a few projects, including haskell-monte-carlo haskell-pqueue-mtl haskell-statistics-fusion haskell-uvector-algorithms and Bryan O’Sullivan’s new uber benchmark suite.

This talk was originally presented at Galois on August 28th, 2008.


Get every new post delivered to your Inbox.

Join 72 other followers