Back to old tricks .. (or, baby steps in Rust)

I’ve been learning Rust for the past twenty days or so, working through the Blandy & Orendorff book, coding up things as I go. Once I got into playing with Rust traits, closures and associated types, the similarities to programming in Haskell with typeclasses, data structures, closure passing and associated types were pretty obvious.

As a warm up I thought I’d try porting the stream fusion core from Haskell to Rust. This was code I was working on more than a decade ago. How much of it would work or even make sense in today’s Rust?

Footnote: this is the first code I’ve attempted to seriously write for more than 2 years, as I’m diving back into software engineering after an extended sojourn in team building and eng management. I was feeling a bit .. rusty. Let me know if I got anything confused.

TLDR

  • Two versions of the basic stream/list/vector APIs: a data structure with boxed closure-passing and pure state
  • And a more closure-avoiding trait encoding that uses types to index all the code statically

Fun things to discover:

  • most of the typeclass ‘way of seeing’ works pretty much the same in Rust. You index/dispatch/use a similar mental model to program generically. I was able to start guessing the syntax after a couple of days
  • it’s actually easier to get first class dictionaries and play with them
  • compared to the hard work we do in GHC to make everything strict, unboxed and not heap-allocated, this is the default in Rust which makes the optimization story a lot simpler
  • Rust has no GC; instead it uses a nested-region-like allocation strategy by default. I have to commit to specific sharing and linear memory use up front. This feels a lot like an ST-monad-like state-threading system. Borrowing and move semantics take a bit of practice to get used to.
  • the trait version looks pretty similar to the core of the standard Rust iterators, and the performance in toy examples is very good for the amount of effort I put in
  • Rust seems to push towards method/trait/data structure-per-generic-function programming, which is interesting. Encoding things in types does good things, just as it does in Haskell.
  • Non-fancy Haskell, including type class designs, can basically be ported directly to Rust, though you now annotate allocation behavior explicitly.
  • cargo is a highly polished ‘cabal’ with lots of sensible defaults and way more constraints on what is allowed. And a focus on good UX.

As a meta point, it’s amazing to wander into a fresh programming language, 15 years after the original “associated types with class” paper, and find basically all the original concepts available, and extended in some interesting ways. There’s a lot of feeling of deja vu for someone who worked on GHC optimizations/unboxing/streams when writing in Rust.

Version 1: direct Haskell translation

https://github.com/donsbot/experiments-rust/blob/master/stream-fusion/src/closure.rs

First, a direct port of the original paper. A data type for stepping through the elements of a stream, including ‘Skip’ so we can filter things. And a struct holding the stream stepper function and the state. Compared to the Haskell version (in the comment) there’s more ritual to declare what goes on the heap, and to link the lifetimes of objects together. My reward for doing this is not needing a garbage collector. There’s quite an enjoyable serotonin reward when a borrow-checked program finally type checks :-)
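Roughly, the shape of it is this (a sketch of the paper’s encoding rather than the repo’s exact code — the names and lifetime details here are mine):

// One step of a stream: finished, skip the state forward (used by filter),
// or yield an element together with the next state.
enum Step<S, A> {
    Done,
    Skip(S),
    Yield(A, S),
}

// A stream is a boxed stepper function plus the current (pure) seed state.
struct Stream<'a, S, A> {
    next: Box<dyn Fn(&S) -> Step<S, A> + 'a>,
    seed: S,
}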

You can probably infer from the syntax that this will be operationally expensive: a closure boxed up onto the heap. Rust’s explicit control of where things are stored and for how long feels a lot like a type system for scripting allocators, and compared to Haskell you’re exactly aware of what you’re asking the machine to do.

Overall, it’s a fairly pleasing translation, and even though I’m putting the closure on the heap I’m still having it de-allocated when the stream is dropped. Look mum, closure passing without a GC.

We can create empty streams, or streams with a single element in them. Remember a stream is just a function going from one value to the next, and a state to kick it off:
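Something along these lines (again a sketch, with the seed types spelled out rather than hidden):

fn empty<'a, A: 'a>() -> Stream<'a, (), A> {
    Stream {
        seed: (),
        next: Box::new(|_st: &()| Step::Done),
    }
}

fn single<'a, A: Copy + 'a>(x: A) -> Stream<'a, bool, A> {
    Stream {
        seed: false, // false = not yet yielded
        next: Box::new(move |st: &bool| {
            if *st { Step::Done } else { Step::Yield(x, true) }
        }),
    }
}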

The lambda syntax feels a bit heavy at first, but you get used to it. The hardest parts of these definitions were:

  • being explicit about whether my closure was going to borrow or own the lifetime of values it captures. Stuff you never think about in Haskell, with a GC to take care of all that thinking (for a price). You end up with a unique type per closure showing what is captured, which feels _very_ explicit and controlled.
  • no rank-2 type to hide the stream state type behind. The closest I could get was the ‘impl Seed’ opaque return type, but it doesn’t behave much like a proper existential type and tends to leak through the implementation. I’d love to see the canonical way to hide this state value at the type level without being forced to box it up.
  • a la vector ‘Prim’ types in Haskell, we use Copy to say something about when we want the values to be cheap to move (at least, that’s my early mental model)

I can generate a stream of values, a la replicate:
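A sketch of what that might look like, counting down in the seed:

fn replicate<'a, A: Copy + 'a>(n: usize, x: A) -> Stream<'a, usize, A> {
    Stream {
        seed: n, // how many elements are still left to produce
        next: Box::new(move |st: &usize| {
            if *st == 0 {
                Step::Done
            } else {
                Step::Yield(x, *st - 1)
            }
        }),
    }
}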

As long as I’m careful threading lifetime parameters around I can box values and generate streams without using a GC. This is sort of amazing. (And the unique lifetime token-passing discipline feels very like the ST monad and its extensions into regions/nesting). Again, you can sort of “feel” how expensive this is going to be, with the capturing and boxing to the heap explicit. That boxed closure, dynamically invoked, will have a cost.

Let’s consume a stream, via a fold:
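Here’s roughly the fold I ended up with (a sketch; the point is the explicit loop and the ‘mut’ bindings):

fn foldl<'a, S, A, B>(f: impl Fn(B, A) -> B, z: B, s: Stream<'a, S, A>) -> B {
    let mut acc = z;
    let mut seed = s.seed;
    loop {
        match (s.next)(&seed) {
            Step::Done => return acc,
            Step::Skip(st) => seed = st,
            Step::Yield(x, st) => {
                acc = f(acc, x);
                seed = st;
            }
        }
    }
}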

Not too bad. The lack of tail recursion shows up here: while I’d normally write this as a local ‘go’ worker function with an accumulating parameter to get a loop, in Rust we just write a loop and peek and poke the memory directly via a ‘mut’ binding. Sigh, fine, but I promise I’m still thinking in recursion.

Now I can do real functional programming:
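For example (an illustrative use of the pieces above):

fn main() {
    let s = replicate(10, 42i64);
    let total = foldl(|acc: i64, x: i64| acc + x, 0, s);
    println!("{}", total); // 420
}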

What about something a bit more generic: enumFromTo to fill a range with consecutive integer values, of any type supporting addition?
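Here’s a sketch of how that might look, leaning on the num-traits crate for One (hence the A::one() mentioned below); the exact trait bounds in the repo may differ:

use num_traits::One; // external crate: num-traits
use std::ops::Add;

fn enum_from_to<'a, A>(lo: A, hi: A) -> Stream<'a, A, A>
where
    A: Copy + PartialOrd + Add<Output = A> + One + 'a,
{
    Stream {
        seed: lo,
        next: Box::new(move |st: &A| {
            if *st > hi {
                Step::Done
            } else {
                Step::Yield(*st, *st + A::one())
            }
        }),
    }
}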

The trait parameters feel a lot like a super-layered version of the Haskell Num class, where I’m really picking and choosing which methods I want to dispatch to. The numeric overloading is also a bit different (A::one() instead of an overloaded literal). Again, this is almost identical to the Haskell version, but with explicit memory annotations and a more structured type class/trait hierarchy. Other operations, like map, filter, etc. all fall out fairly straightforwardly, as sketched below. Nice: starting to feel like I can definitely be productive in this language.
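For instance, map and filter in this boxed-closure style might look something like this (sketches; filter is where the Skip constructor earns its keep):

fn map_s<'a, S: 'a, A: 'a, B: 'a>(
    f: impl Fn(A) -> B + 'a,
    s: Stream<'a, S, A>,
) -> Stream<'a, S, B> {
    let next0 = s.next;
    Stream {
        seed: s.seed,
        next: Box::new(move |st: &S| match next0(st) {
            Step::Done => Step::Done,
            Step::Skip(st2) => Step::Skip(st2),
            Step::Yield(x, st2) => Step::Yield(f(x), st2),
        }),
    }
}

fn filter_s<'a, S: 'a, A: 'a>(
    p: impl Fn(&A) -> bool + 'a,
    s: Stream<'a, S, A>,
) -> Stream<'a, S, A> {
    let next0 = s.next;
    Stream {
        seed: s.seed,
        next: Box::new(move |st: &S| match next0(st) {
            Step::Done => Step::Done,
            Step::Skip(st2) => Step::Skip(st2),
            Step::Yield(x, st2) => {
                if p(&x) { Step::Yield(x, st2) } else { Step::Skip(st2) }
            }
        }),
    }
}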

Now I can even write a functional pipeline — equivalent to:

   sum . map (*2) . filter (\n -> n`mod`2 == 1) $ [1..1000000::Int]
=> 500000000000

As barebones Rust:
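(Here’s an illustrative version wired together from the sketches above; the repo’s version differs in detail.)

fn main() {
    let s = enum_from_to(1i64, 1_000_000i64);
    let odd = filter_s(|n: &i64| *n % 2 == 1, s);
    let doubled = map_s(|n: i64| n * 2, odd);
    let total = foldl(|acc: i64, x: i64| acc + x, 0, doubled);
    println!("{}", total);
}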

It actually runs and does roughly what I’d expect, for N=1_000_000:

$ cargo run

500000000000

Another moment of deja vu, installing Criterion to benchmark the thing. “cargo bench” integration is pretty sweet:
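The harness is mostly Criterion boilerplate; mine looked something like this (a sketch — the benchmark name and wiring are illustrative, using the combinators sketched above):

use criterion::{criterion_group, criterion_main, Criterion};

fn bench_closure_pipeline(c: &mut Criterion) {
    c.bench_function("closure stream: sum . map (*2) . filter odd", |b| {
        b.iter(|| {
            let s = enum_from_to(1i64, 1_000_000i64);
            let odd = filter_s(|n: &i64| *n % 2 == 1, s);
            let doubled = map_s(|n: i64| n * 2, odd);
            foldl(|acc: i64, x: i64| acc + x, 0, doubled)
        })
    });
}

criterion_group!(benches, bench_closure_pipeline);
criterion_main!(benches);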

So about 7 ms for four logical loops over 1M i64 elements. That’s sort of plausible, and actually not too bad considering I don’t know what I’m doing.

Not great, not terrible.

The overhead of the dynamic dispatch to the boxed closure is almost certainly going to dominate, and it likely breaks inlining and arithmetic optimization, so while we do get a fused loop, we get all the steps of the loop in sequence. I fiddled a bit with inlining the consuming loop, which shaved 1ms off, but that’s about it.

A quick peek at the assembly in --release mode, which I assume does good things, and yeah, this isn’t going to be fast. Tons of registers, allocs and dispatching everywhere. Urk.

But it works! The first thing I directly translated works, and it has basically the behavior you’d expect given the explicit closure calls. Not bad!

A trait API

https://github.com/donsbot/experiments-rust/blob/master/stream-fusion/src/trait.rs

That boxed closure bothers me a bit: we’re dispatching dynamically to something that’s known statically. The usual trick for resolving things statically is to move the work to the type system. In this case, we want to look up the right ‘step’ function by type. So I’ll need a type for each generator and transformer function in the stream API. We can take this approach in Rust too.

Basic idea:

  • move the type of the ‘step’ function of streams into a Stream trait
  • create a data type for each generator or transformer function, then impl Stream for it. This is the key to removing the overhead of resolving the step functions
  • stream elem types can be generic parameters, or specialized associated types
  • the seed state can be associated type-indexed

I banged my head against this a few different ways, and settled on putting the state data into the ‘API key’ type. This actually looks a lot like something we already knew how to do — streams as Rust iterators — Snoyman already wrote about it three years ago! I’ve basically adapted his approach here after a bit of n00b trial and error.

The ‘Step’ type is almost the same, and the polymorphic ‘Stream’ type with its existential seed becomes a trait definition:
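Something like this (a sketch; I’ve given the trait a seed() method to produce the initial state, which may not match the repo exactly):

enum Step<S, A> {
    Done,
    Skip(S),
    Yield(A, S),
}

trait Stream {
    type Item;  // the element type
    type Seed;  // the (formerly existential) stream state
    fn seed(&self) -> Self::Seed;
    fn next(&self, st: Self::Seed) -> Step<Self::Seed, Self::Item>;
}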

What’s a bit different now is how we’re going to resolve the function to generate each step of the stream. That’s now a trait method associated with some instance and element type.

So e.g. if I want to generate an empty stream, I need a type, an instance, and a wrapper:
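Roughly like this (a sketch; names are approximate):

use std::marker::PhantomData;

struct Empty<A>(PhantomData<A>);

fn empty<A>() -> Empty<A> {
    Empty(PhantomData)
}

impl<A> Stream for Empty<A> {
    type Item = A;
    type Seed = ();
    fn seed(&self) -> Self::Seed {}
    fn next(&self, _st: ()) -> Step<(), A> {
        Step::Done
    }
}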

Ok not too bad. My closure for stepping over streams is now a ‘next’ method. What would have been a Stream ‘object’ with an embedded closure is now a trait instance where the ‘next’ function can be resolved statically.

I can convert all the generator functions like this. For example, to replicate a stream I need to know how many elements, and what the element is. Instead of capturing the element in a closure, it’s in an explicit data type:
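A sketch — the seed counts down how many copies remain:

struct Replicate<A> {
    count: usize,
    item: A,
}

fn replicate<A>(count: usize, item: A) -> Replicate<A> {
    Replicate { count, item }
}

impl<A: Copy> Stream for Replicate<A> {
    type Item = A;
    type Seed = usize;
    fn seed(&self) -> usize {
        self.count
    }
    fn next(&self, st: usize) -> Step<usize, A> {
        if st == 0 {
            Step::Done
        } else {
            Step::Yield(self.item, st - 1)
        }
    }
}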

The step function is still basically the same as in the Haskell version, but to get the nice Rust method syntax we have it all talk to ‘self’.

We also need a type for each stream transformer: so a ‘map’ is now a struct with the mapper function, paired with the underlying stream object it maps over.

This part is a bit more involved — when a map is applied to a stream element, we return f(x) of the element, and lift the stream state into a Map stream state for the next step.
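In this sketch I’ve simplified that lifting a little by making the Map seed the same type as the underlying stream’s seed:

struct Map<F, S> {
    f: F,
    stream: S,
}

impl<A, B, F, S> Stream for Map<F, S>
where
    S: Stream<Item = A>,
    F: Fn(A) -> B,
{
    type Item = B;
    type Seed = S::Seed;
    fn seed(&self) -> S::Seed {
        self.stream.seed()
    }
    fn next(&self, st: S::Seed) -> Step<S::Seed, B> {
        match self.stream.next(st) {
            Step::Done => Step::Done,
            Step::Skip(st2) => Step::Skip(st2),
            Step::Yield(x, st2) => Step::Yield((self.f)(x), st2),
        }
    }
}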

I can implement Stream-generic folds now — again, since I have no tail recursion to consume the stream with, I’m looping explicitly. This is our real ‘driver’ of work, the actual loop pulling on the chain of ‘stream.next()’s we’ve built up.
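Something like this (sketch):

fn fold_s<S, B, F>(f: F, z: B, s: &S) -> B
where
    S: Stream,
    F: Fn(B, S::Item) -> B,
{
    let mut acc = z;
    let mut st = s.seed();
    loop {
        match s.next(st) {
            Step::Done => return acc,
            Step::Skip(st2) => st = st2,
            Step::Yield(x, st2) => {
                acc = f(acc, x);
                st = st2;
            }
        }
    }
}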

Ok so with the method syntax this looks pretty nice:
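(To make this compile on its own I’ve included minimal Filter and Range streams in the same style — both are my sketches, not necessarily the repo’s definitions.)

struct Range {
    lo: i64,
    hi: i64,
}

impl Stream for Range {
    type Item = i64;
    type Seed = i64;
    fn seed(&self) -> i64 {
        self.lo
    }
    fn next(&self, st: i64) -> Step<i64, i64> {
        if st > self.hi { Step::Done } else { Step::Yield(st, st + 1) }
    }
}

struct Filter<P, S> {
    p: P,
    stream: S,
}

impl<A, P, S> Stream for Filter<P, S>
where
    S: Stream<Item = A>,
    P: Fn(&A) -> bool,
{
    type Item = A;
    type Seed = S::Seed;
    fn seed(&self) -> S::Seed {
        self.stream.seed()
    }
    fn next(&self, st: S::Seed) -> Step<S::Seed, A> {
        match self.stream.next(st) {
            Step::Done => Step::Done,
            Step::Skip(st2) => Step::Skip(st2),
            Step::Yield(x, st2) => {
                if (self.p)(&x) { Step::Yield(x, st2) } else { Step::Skip(st2) }
            }
        }
    }
}

fn main() {
    // sum . map (*2) . filter odd $ [1..1_000_000], in the trait encoding
    let range: Range = Range { lo: 1, hi: 1_000_000 };
    let odd: Filter<_, Range> = Filter { p: |n: &i64| *n % 2 == 1, stream: range };
    let doubled: Map<_, Filter<_, Range>> = Map { f: |n: i64| n * 2, stream: odd };
    let total: i64 = fold_s(|acc: i64, x: i64| acc + x, 0, &doubled);
    println!("{}", total); // 500000000000
}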

I had to write out the types here to understand how the method resolution works. We build up a nice chain of type information about exactly which function we want to use at which type. The whole pipeline has a Map<Filter<Range<…>>> type, all an instance of Stream.

So this should do OK, right? No boxing of closures; there could be some lookups and dispatch, but there’s enough type information here to resolve all the calls statically. I don’t have much intuition for how Rust will optimize the chain of nested Yield/Skip constructors… but I’m hopeful, given the tags fit in 2 bits, and I don’t use Skip anywhere in this specific program.

288 microseconds to collapse a 1_000_000 element stream. Or about 25x faster. Nice!

So the type information and the commitment to not allocating on the heap do a lot of work for us here. I ask cargo rustc --bin stream_test --release -- --emit asm for fun. And this is basically what I want to see: a single loop, no allocation, a bit of math. Great.

It’s converted the %2 / *2 body into a straight i64 addition loop with strides. I suspect that with a bit of prodding it could resolve this statically to a constant, but that’s just a toy anyway. All the intermediate data structures are gone.

Overall, that’s a pretty satisfying result. With minimal effort I got a fusing iterator/stream API that performs well out of the box. The Rust defaults nudge code towards low overhead by default. That can feel quite satisfying.

Evolving Faster Haskell Programs (now with LLVM!)

… In which I use genetic algorithms to search for optimal LLVM optimizer passes to make Haskell programs faster …

On a sunny day last Spring, I spent some time playing with genetic algorithms (GAs) for breeding faster Haskell programs, by improving the inlining hints suggested to GHC. The results were pretty cool: the GA found new inlining settings for existing Language Benchmarks Game programs — that had already been heavily scrutinized — improving both programs I tried, in one case by 18%.

Now, 12 months later, it’s again a sunny Spring day, and we’ve got quite a few new tools to play with in the Haskell world for working on interesting performance puzzles.

Today we’ll see what more we can extract from the LLVM backend.

GHC Overview

LLVM has a massive suite of low-level optimizations, most of which were previously unavailable to the Haskell programmer. GHC has become, in a way, a Haskell to C-- to LLVM compiler. (C-- is translated to LLVM bytecode, which is then optimized using classic optimizations prior to code generation.) To a first approximation, the LLVM optimizer implements Muchnick (recommended reading), and GHC has just imported 25 years of imperative language optimizations. This is exciting for me.

First GHC does masses of source-to-source transformations on the Haskell Core language (including, e.g., fusion and constructor specialization), before translating Core to C--, an imperative language that acts as an interface between high-level compilers and optimizing code generators. Some transformations take place there, before spitting out LLVM bytecode using David Terei’s LLVM backend to GHC (PDF).

By default GHC just passes -O3 to the LLVM optimizer, which enables a number of passes, including:

  • interprocedural sparse conditional constant propagation.
  • combine instructions to form fewer, simple instructions
  • dead code elimination and basic block merging
  • bottom-up inlining of functions
  • promoting “by reference” arguments to be “by value” arguments
  • transforming tail-recursive calls to local jumps
  • reassociating commutative expressions in an order designed to promote better constant propagation
  • loop identification and simplification
  • a number of SSA analysis passes
  • loop invariant code motion

Now, it is entirely unclear to me whether the default set of optimization passes enabled by llvm -O3 is suitable for the kind of code GHC generates, and I’m not alone in suspecting it’s not ideal. With the GCC backend, we were mostly out of luck in having control over the low-level optimizations: the GCC optimization pipeline wasn’t terribly programmable. LLVM, though, lets us run any analysis or transformation pass in any order we want.

Which LLVM passes to use, in which order to run them, and with what analysis in between, is a huge search problem, though. There are roughly 200 optimization and analysis flags, and these, mostly, can be run in any order, any number of times. If we want to run, say, twenty of the passes (about what -O3 does), that’s, what, about 10^46 arrangements!

So the challenge is to find a set of optimizations that catch some of the idioms in GHC’s generated code, and to hopefully find ways to make Haskell programs even faster.

Making it into a search problem

So I’ll use the approach I took in 2009 for the inlining problem, and use the genetic algorithm library, acovea, to breed the best set of flags for a few programs, where fitness for breeding is determined by total runtime. In this post I’m using GHC head, with the LLVM patch (more info in this post), and a custom patch to remove the default -O3 passed to LLVM.

I first have to give a specification of the flags to GHC + LLVM. Here’s the specification for GHC/LLVM flags that I am going to use — and you can use it too.

Note that LLVM allows us to specify the entire optimization pipeline order on the commandline, including duplicate phases. I don’t think Acovea will apply flags multiple times (no duplicate flags). We’ll see if that matters.

The acovea set of libraries are fairly widely available now, e.g. in Arch Linux. For more background on acovea, see the previous post on inlining — it’s easy to build if not packaged for your distro.

Computing fitness

The programs we want to improve have to report their fitness. To do this I’ll use criterion’s timing function, wrapping up the code we actually want to optimize in some measurement functions. This will just report total wall clock running time for the code under analysis.

We could make our fitness function more robust to noise by using criterion’s sampling and iteration framework, which will run the core code sufficiently often to remove noise. I have notes at the end on what my ideal tool for this looks like.

A warm up: sum of a sequence

To warm up, and see if this stuff works, I’ll begin with the toy program from last week’s fusion post. There, GHC fuses two loops into one, but it fails to collapse that final loop into a constant (something that GCC -O3 can do to manually fused C code). It’s a good test program to see how loops are handled in a new backend.

GHC + GCC isn’t able to spot that the loop is a constant either, but maybe GHC + LLVM can. I don’t know what flags I’ll need to make this happen, so let’s run the GA to explore the search space…

First, the code, wrapped in the measurement function:

import qualified Data.Vector as U

import Criterion.Measurement
import Text.Printf

-- code under analysis: sum [1..n] on vectors.
--
v = U.sum $ U.enumFromTo 1 (100000000 :: Int)

main = do
    d <- time_ $ v `seq` return ()
    putStrLn (secs' d)
secs' :: Double -> String
secs' k
    | k < 0      = '-' : secs' (-k)
    | otherwise  = printf "%.3f" k

We can compile and run this by hand:

$ ghc -Odph --make ex1.hs -fforce-recomp
[1 of 1] Compiling Main             ( ex1.hs, ex1.o )
Linking ex1 ...

$ time ./ex1
0.111
./ex1  0.12s user 0.01s system 101% cpu 0.122 total

While it computes the sum, it prints its fitness. Note that it removes any startup or IO cost incurred in measurement. When compiled with standard Haskell optimizations, and the existing native code generator, the program runs in 122ms. The LLVM backend does better, with its default optimizations added to the mix:

$ ghc -Odph --make ex1.hs -fforce-recomp -fllvm -optlo-O3
[1 of 1] Compiling Main             ( ex1.hs, ex1.o )
Linking ex1 ...

$ time ./ex1
0.051
./ex1  0.06s user 0.00s system 98% cpu 0.061 total

However, despite the value being a constant, llvm -O3 hasn’t removed the loop. Here’s the assembly of the inner loop:

.LBB1_2:
        leaq    1(%rsi), %rax
        addq    %rsi, %r14
        cmpq    $100000001, %rax
        jge     .LBB1_5
        addq    $2, %rsi
        addq    %rax, %r14
.LBB1_1:                                # %tailrecurse
        cmpq    $100000001, %rsi
        jl      .LBB1_2

So I take the code, and wrap it up in the GA tool, then go off and do other things for a few hours:

$ time runacovea -input ex1.hs
    -config /home/dons/share/libacovea/config/ghc-6_12.acovea

I chopped out the data and graphed it for you. Look at how the mean running time drops in each generation.

It quickly drops down to the llvm -O3 plateau, then marches past that to turn the loop into a constant.

Optimistic options:
                        -optlo-globalopt  (1.723)
                    -optlo-loop-unswitch  (1.866)
                          -optlo-mem2reg  (2.536)
                         -optlo-prune-eh  (1.627)
Pessimistic options:
                             -optlo-adce  (-1.862)
                             -optlo-licm  (-1.623)
                               -optlc-O1  (-1.528)
                               -optlc-O3  (-2.149)
Common:
    -Odph -O2 -funbox-strict-fields
    -fllvm
    -optlo-O2 -optlo-globalopt -optlo-mem2reg -optlo-prune-eh

Once it was done, 4 hours later, I was pretty excited. Look what happened by generation 12! Instead of averaging say, 50ms a run, it was averaging 3ms a run. By generation 16, there was no run time at all.

LLVM did it!

It must have found a way to eliminate the loop entirely.

The flags it recommended were:

        -optlo-globalopt
    -optlo-loop-unswitch
          -optlo-mem2reg
         -optlo-prune-eh
 

So let’s try these out:

$  ghc ex1.hs -Odph --make -fforce-recomp -fllvm -optlo-O2
        -optlo-globalopt -optlo-loop-unswitch -optlo-mem2reg -optlo-prune-eh
$ time ./ex1
5000000050000000
./ex1  0.01s user 0.00s system 114% cpu 0.006 total

Bam! Digging around in the assembly, which is munged in all sorts of ways, there it is:

        movabsq $5000000050000000, %r14 # imm = 0x11C3793ADB7080

The sum from 1 to n was computed at compile time by GHC and LLVM. Now, it is unclear to me why these optimizations were what was needed — and I seem to get the constant produced with other flags after -optlo-O2 as well (loop-unswitching, mem2reg). So my theory is that many of these passes share a common clean-up phase, which is actually doing the work. Anyway, there it is: better code than we’ve ever convinced either -fasm or -fvia-C to generate.

I think this is the limit case for what the GA can do. We know enough information was available statically to compute the entire computation, and standard optimization techniques would get us there. It was just a matter of getting the flags right. It will be interesting to see what improvements the GA can find in code that is less obviously amenable to tweaking.

More realistic?

Let’s try summing a stream of random numbers — so there’s no clever closed-form solution for LLVM to find. Can it just, in general, improve loops in Haskell?

First, the source:

import qualified Data.Vector as U
import qualified Data.Vector.Random.Mersenne as R
import System.Random.Mersenne.Pure64

-- compute the sum of a large vector of randoms generated
-- with the mersenne twister.
--
sums g = U.sum (R.randoms g 10000000 :: U.Vector Int)

main = do
    g <- newPureMT
    print (sums g)

A vector is generated using the Mersenne Twister, containing 10^7 random 64 bit integers. We then sum those values. GHC fuses this into a single loop where accumulation and random generation are interleaved. That means it is very simple code. Just a loop, a call to the mersenne twister, and an accumulator. There shouldn’t be that much to improve on.

I launch Acovea on it, as before, and 4 hours later we have:

Optimistic options:
                          -optlo-mem2reg  (1.914)
                       -optlo-scalarrepl  (2.413)
Pessimistic options:
                               -optlo-O1  (-2.577)
                               -optlo-O2  (-2.64)
                               -optlo-O3  (-2.515)

Very interesting: according to Acovea, the default optimization levels are pessimistic here — they don’t help. Let’s check that:

$ ghc --make -Odph -fllvm -optlo-O3 ex2.hs

$ time ./ex2
4460916339009010646
./ex2  0.31s user 0.00s system 90% cpu 0.347 total

And with the “common options” that were suggested:

$ ghc --make -Odph -O2 -funbox-strict-fields -fllvm -optlo-memdep
-optlo-abcd -optlo-loop-unroll -optlo-mem2reg -optlo-scalarrepl 

$ time ./ex2
4322192527286846546
./ex2  0.28s user 0.01s system 97% cpu 0.302 total

So it found a 13% improvement over the default llvm flags.

Sum of Squares

Now, something relatively serious, the sum of squares of a random vector. First, in Haskell.

-- compute the sum of squares of a vector.
--
sumsq g = U.sum (U.map (\x -> x * x) vector)
    where
        vector = R.randoms g 10000000 :: U.Vector Int

GHC then fuses the three loops here (generation, mapping and summing) into a single loop with no allocations, which is then ready for low-level optimizations.

After crunching away for 4 hours, Acovea recommends:

Optimistic options:
                      -optlo-loop-reduce  (1.59)
                       -optlo-scalarrepl  (2.079)

Pessimistic options:
                               -optlo-O1  (-2.591)
                               -optlo-O2  (-2.537)
                               -optlo-O3  (-2.591)
                               -optlc-O0  (-2.591)
                               -optlc-O1  (-1.722)

With regular flags, we get:

$ ghc -v0 --make -fforce-recomp -O2 -funbox-strict-fields
-fllvm -optlo-O3 -optlc-O3 ex3.hs

$ time ./ex3
8241655411004882824       

ex3  0.30s user 0.01s system 96% cpu 0.315 total

And indeed, with its “best of the best”, we get faster code:

$ ghc -v0 --make -fforce-recomp -Odph -O2 -funbox-strict-fields
-fllvm -optlo-disable-inlining -optlo-basicaa -optlo-basiccg
-optlo-count-aa -optlo-domfrontier -optlo-domtree
-optlo-globalsmodref-aa -optlo-memdep -optlo-no-aa
-optlo-postdomtree -optlo-codegenprepare -optlo-abcd
-optlo-functionattrs -optlo-block-placement -optlo-constmerge
-optlo-constprop -optlo-die -optlo-dse -optlo-globaldce
-optlo-globalopt -optlo-indvars -optlo-inline -optlo-ipconstprop
-optlo-ipsccp -optlo-lcssa -optlo-loop-deletion -optlo-loop-index-split
-optlo-loop-unroll -optlo-loop-unswitch -optlo-loopsimplify
-optlo-mem2reg -optlo-memcpyopt -optlo-scalarrepl
-optlo-tailcallelim ex3.hs

$ time ex3
4829650279590649126
ex3  0.29s user 0.01s system 100% cpu 0.296 total

Finding, again, 10% or so.

And though this is a little more complex (three arguments in the loop instead of two), GHC turns this:

sumsq g = U.sum (U.map (\x -> x * x) vector)
    where
        vector = U.enumFromN 1 10000000 :: U.Vector Int

with the same set of flags as the first version, into something that goes from running in 228 ms with the default LLVM flags to 14 ms with the flags found by Acovea in the first example (which seem to work really well for enumerations!).

Summary

The LLVM optimization layer for GHC adds a huge amount of mostly untapped potential. The optimization space is huge, and using a GA or similar approach to custom-optimize particular programs can be useful for finding common flags, or special-purpose flags for niche applications (like scientific simulations). One downside is that it can take quite a while to find a solution, so the search itself could stand to be sped up. Additionally, we don’t yet have a way to apply a pass multiple times (maybe we should run -optlo-$foo 10 times!). GHC just keeps running its optimizer until the code stops changing; maybe we need a similar approach with LLVM?

I also speculate that the large loop bodies generated through significant fusion should give LLVM some good straight-line code to work on.

My next step is to wrap up this approach into a simple Haskell library, modelled on Criterion, such that any Haskeller can easily try this:

import Evolution

main = evolve main'

which will take ‘main’, and start evolving the flags used to compile it, against the current GHC/LLVM spec. It will use criterion to compute the fitness in a robust way (so you don’t need to worry about short or long running times), and at the end will generate a graph of the evolution, and a recommended set of flags to use in future. That’ll be useful to a few people.

We could then apply this tool to all of nofib to compute a general set of useful LLVM flags to use for everyone. Finally, there’s still much scope to improve the bytecode going into LLVM, so that more of the analysis phases will fire. Any Summer of Code students interested?

LLVM also supports plugins loaded on the command line. I’d love to be able to write new passes in Haskell, and load them into LLVM on the fly. This might really make it easy to try out some new FP-specific passes on the SSA-alike representation. Another good Summer of Code project…

The LLVM optimizing backend is fun!