After 3 years, my xmonad configuration now uses GNOME

Nearly three years ago, Spencer Janssen and I started work on xmonad, a tiling window manager for unix that would do what we want, automatically, so we could just concentrate on hacking code, without the window manager getting in the way. The project’s been quite successful — the most downloaded app on Hackage for the last couple of years, and thousands of users. It even has its own twitter, blog, reddit and facebook accounts.

Originally I thought of this project as something of an anti-GNOME: small, lean, and every part does just one thing, but well – in the Unix tradition. And it has stayed lean: around two thousand lines of Haskell for the core system, with the benefit of hundreds of extensions in the contributors’ library — everyone’s config file is potentially a library module new users can import.
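
For example, here’s a hypothetical sketch (the module name and field choices are mine, not from any real user’s config) of a config file doubling as an importable library module:

module MyXMonadConfig (myConfig) where

import XMonad

-- A configuration value that other users can import and extend.
myConfig = defaultConfig
    { terminal    = "urxvt"
    , borderWidth = 2
    }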

Over the years, GNOME and xmonad have started playing well together, to the point that there’s relatively seamless interop between the two projects: you can use the full GNOME environment and swap in xmonad as your window manager, or use a minimal environment with xmonad, adding in the GNOME tools that help you.

Playing well with others is good for your open source software.

I’ve now finally switched my xmonad configuration to use a number of gnome apps, to support the core dynamic tiling provided by xmonad. Here’s my config file:

import XMonad
import XMonad.Config.Gnome
import XMonad.Layout.NoBorders

main = xmonad gnomeConfig
    { terminal   = "term"
    , layoutHook = smartBorders (layoutHook gnomeConfig)
    }

Yeah, that’s it: import XMonad.Config.Gnome, add smart borders, and override the terminal to be my urxvt wrapper. xmonad is configured in Haskell, or in languages Haskell can interoperate with.

My session is started up from .xinitrc as:

gitit &
gnome-panel &
gnome-power-manager &
dbus-launch --exit-with-session xmonad

I use gitit as my personal wiki, and then put a few things in the gnome-panel.

I’m really happy with how easy it now is to use xmonad with all the regular GNOME apps that people would like to see. This kind of friendliness to the dominant tools of the day is good for the project — and good for our users.

So you want to hack Haskell for the Google Summer of Code

The Google Summer of Code is a great chance to work on open source projects as a student, and get training from some experienced hackers, wonderfully sponsored by Google. Haskell.org will be participating for its 5th year.

If you’re thinking about working on Haskell projects, there’s some essential background reading to do first.

Here are some things to think about before you decide to submit a proposal to help out Haskell.org this summer; they should help you write a stronger proposal. This is purely my opinion, and doesn’t necessarily reflect the opinion of the other mentors.

We have limited resources as a community, and the Google Summer of Code has been instrumental in bringing new tools and libraries to our community. Some notable projects from the past few years include:

  • The GHCi debugger
  • Improvements to Haddock
  • Major work on Cabal
  • Generalizing Parsec (parsec3)
  • Shared Libraries for GHC
  • Language.C

These student projects have gone on to have wide impact, or have brought new capabilities to the Haskell community. The pattern has been to fund work on the most popular tools and libraries, or on new areas that Haskell programmers are demanding. Rarely do we fund development of applications in Haskell; instead we concentrate on general infrastructure that supports all Haskell projects.

To succeed with your proposal, you need to propose work on the most important “blockers” to Haskell’s use. Work that:

  • Brings the most value to the community
  • Addresses widespread need
  • Will impact many other libraries and tools
  • Doesn’t rewrite things in Haskell for their own sake
  • Is feasible in 3 months

It can be hard to find projects of the right size: important enough that the work will make a difference (and thus attracts the attention of mentors, who vote on your proposal), yet still feasible to achieve in 3 months.

To help get a sense of what the community thinks is “hot” (though not necessarily important), we set up a crowd-voting site for ideas on Reddit. But be wary: the crowd can get excited about silly things, and doesn’t necessarily have good business sense.

The list of projects can help you get a sense of what Haskellers are thinking about this year. Not everything there is feasible for a summer project though, so be sure to get advice!

Some of the social factors to consider in your application:

  • You have one or two mentors with good expertise in the area. Ideally you’re already lining up the mentors to help watch your work.
  • You’re hanging out in #haskell or on the Haskell Reddit, your blog is
    on Planet Haskell, or you’re going to be at the hackathons.

The more involved you are in the community, the more likely you’ll have a sense of what the most important work to propose is.

And of course you need to have skills to pay the bills:

  • You should have demonstrated competence in Haskell (e.g. you’ve
    uploaded things to Hackage)
  • Demonstrated discipline in open source — self-motivation

As a guide to what reviewers will be thinking about when reading your proposal, you should be able to answer (at least) the following questions from the proposal alone:

  • What are you trying to accomplish?
  • How is it done in Haskell now, and with what limitations?
  • Are there similar projects in other languages? What related work is there?
  • If successful, what difference will it make? Will it enable new Haskell business cases? Performance improvements to many libraries? Better accessibility of Hackage code?
  • What are the mid-term and final results you’re hoping to achieve?
  • How will the result of your work affect other projects?

So, now you know what we’re interested in, start writing that proposal!

Evolving Faster Haskell Programs (now with LLVM!)

… In which I use genetic algorithms to search for optimal LLVM optimizer passes to make Haskell programs faster …

On a sunny day last Spring, I spent some time playing with genetic algorithms (GAs) for breeding faster Haskell programs, by improving the inlining hints suggested to GHC. The results were pretty cool: the GA found new inlining settings for existing Language Benchmarks Game programs — that had already been heavily scrutinized — improving both programs I tried, and in one case by 18%.

Now, 12 months later, it’s again a sunny Spring day, and we’ve got quite a few new tools to play with in the Haskell world for working on interesting performance puzzles.

Today we’ll see what more we can extract from the LLVM backend.

GHC Overview

LLVM has a massive suite of low level optimizations, most of which were previously unavailable to the Haskell programmer. GHC has become, in a way, a Haskell to C-- to LLVM compiler. (C-- is translated to LLVM bytecode, which is then optimized using classic optimizations prior to code generation.) To a first approximation, the LLVM optimizer implements Muchnick (recommended reading), and GHC has just imported 25 years of imperative language optimizations. This is exciting for me.

First, GHC does masses of source-to-source transformations on the Haskell Core language (including, e.g., fusion and constructor specialization), before translating Core to C--, an imperative language that acts as an interface between high-level compilers and optimizing code generators. Some transformations take place there, before GHC spits out LLVM bytecode using David Terei’s LLVM backend to GHC (PDF).

By default GHC just passes -O3 to the LLVM optimizer, which enables a number of passes, including:

  • interprocedural sparse conditional constant propagation
  • combining instructions to form fewer, simpler instructions
  • dead code elimination and basic block merging
  • bottom-up inlining of functions
  • promoting “by reference” arguments to “by value” arguments
  • transforming tail recursive calls into local jumps
  • reassociating commutative expressions in an order designed to promote better constant propagation
  • loop identification and simplification
  • a number of SSA analysis passes
  • loop invariant code motion

Now, it is entirely unclear to me whether the default set of optimization passes enabled by llvm -O3 are suitable for the kind of code GHC generates, and I’m not alone in suspecting they’re not ideal. With the GCC backend, we were mostly out of luck in having control over the low level optimizations. The GCC optimization pipeline wasn’t terribly programmable, but LLVM lets us run any analysis or transformation pass in any order we want.
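
For example, we can hand LLVM an explicit pass pipeline from the GHC command line via -optlo flags (this particular pass selection is illustrative, not a recommendation):

$ ghc -Odph --make ex1.hs -fforce-recomp -fllvm
    -optlo-mem2reg -optlo-instcombine -optlo-gvn -optlo-mem2reg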

Which LLVM passes to use, in which order to run them, and with what analysis in between, is a huge search problem, though. There are roughly 200 optimization and analysis flags, and these, mostly, can be run in any order, any number of times. If we want to run, say, twenty of the passes (about what -O3 does), that’s 200 choices for each of 20 slots: 200^20, or about 10^46 arrangements!

So the challenge is to find a set of optimizations that catch some of the idioms in GHC’s generated code, and to hopefully find ways to make Haskell programs even faster.

Making it into a search problem

So I’ll use the approach I took in 2009 for the inlining problem, and use the genetic algorithm library, acovea, to breed the best set of flags for a few programs, where fitness for breeding is determined by total runtime. In this post I’m using GHC head, with the LLVM patch (more info in this post), and a custom patch to remove the default -O3 passed to LLVM.

I first have to give a specification of the flags to GHC + LLVM. Here’s the specification for GHC/LLVM flags that I am going to use — and you can use it too.

Note that LLVM allows us to specify the entire optimization pipeline order on the commandline, including duplicate phases. I don’t think Acovea will apply flags multiple times (no duplicate flags). We’ll see if that matters.

The acovea set of libraries are fairly widely available now, e.g. in Arch Linux. For more background on acovea, see the previous post on inlining — it’s easy to build if not packaged for your distro.

Computing fitness

The programs we want to improve have to report their fitness. To do this I’ll use criterion’s timing function, wrapping up the code we actually want to optimize in some measurement functions. This will just report total wall clock running time for the code under analysis.

We could make our fitness function more robust to noise by using criterion’s sampling and iteration framework, which will run the core code sufficiently often to remove noise. I have notes at the end on what my ideal tool for this looks like.

A warm up: sum of a sequence

To warm up, and see if this stuff works, I’ll begin with the toy program from last week’s fusion post. There, GHC fuses two loops into one, but it fails to collapse that final loop into a constant (something that GCC -O3 can do to manually fused C code). It’s a good test program to see how loops are handled in a new backend.

GHC + GCC isn’t able to spot that the loop is a constant either, but maybe GHC + LLVM can. I don’t know what flags I’ll need to make this happen, so let’s run the GA to explore the search space…

First, the code, wrapped in the measurement function:

import qualified Data.Vector as U

import Criterion.Measurement
import Text.Printf

-- code under analysis: sum [1..n] on vectors.
--
v = U.sum $ U.enumFromTo 1 (100000000 :: Int)

main = do
    d <- time_ $ v `seq` return ()
    putStrLn (secs' d)

secs' :: Double -> String
secs' k
    | k < 0      = '-' : secs' (-k)
    | otherwise  = printf "%.3f" k

We can compile and run this by hand:

$ ghc -Odph --make ex1.hs -fforce-recomp
[1 of 1] Compiling Main             ( ex1.hs, ex1.o )
Linking ex1 ...

$ time ./ex1
0.111
./ex1  0.12s user 0.01s system 101% cpu 0.122 total

While it computes the sum, it prints its fitness. Note that it removes any startup or IO cost incurred in measurement. When compiled with standard Haskell optimizations, and the existing native code generator, the program runs in 122ms. The LLVM backend does better, with its default optimizations added to the mix:

$ ghc -Odph --make ex1.hs -fforce-recomp -fllvm -optlo-O3
[1 of 1] Compiling Main             ( ex1.hs, ex1.o )
Linking ex1 ...

$ time ./ex1
0.051
./ex1  0.06s user 0.00s system 98% cpu 0.061 total

However, despite the value being a constant, llvm -O3 hasn’t removed the loop. Here’s the assembly of the inner loop:

.LBB1_2:
        leaq    1(%rsi), %rax
        addq    %rsi, %r14
        cmpq    $100000001, %rax
        jge     .LBB1_5
        addq    $2, %rsi
        addq    %rax, %r14
.LBB1_1:                                # %tailrecurse
        cmpq    $100000001, %rsi
        jl      .LBB1_2

So I take the code, and wrap it up in the GA tool, then go off and do other things for a few hours:

$ time runacovea -input ex1.hs
    -config /home/dons/share/libacovea/config/ghc-6_12.acovea

I chopped out the data and graphed it for you. Look at how the mean running time drops with each generation.

It quickly drops down to the llvm -O3 plateau, then marches past that to turn the loop into a constant.

Optimistic options:
                        -optlo-globalopt  (1.723)
                    -optlo-loop-unswitch  (1.866)
                          -optlo-mem2reg  (2.536)
                         -optlo-prune-eh  (1.627)
Pessimistic options:
                             -optlo-adce  (-1.862)
                             -optlo-licm  (-1.623)
                               -optlc-O1  (-1.528)
                               -optlc-O3  (-2.149)
Common:
    -Odph -O2 -funbox-strict-fields
    -fllvm
    -optlo-O2 -optlo-globalopt -optlo-mem2reg -optlo-prune-eh

Once it was done, 4 hours later, I was pretty excited. Look what happened by generation 12! Instead of averaging say, 50ms a run, it was averaging 3ms a run. By generation 16, there was no run time at all.

LLVM did it!

It must have found a way to eliminate the loop entirely.

The flags it recommended were:

    -optlo-globalopt
    -optlo-loop-unswitch
    -optlo-mem2reg
    -optlo-prune-eh

So let’s try these out:

$  ghc ex1.hs -Odph --make -fforce-recomp -fllvm -optlo-O2
        -optlo-globalopt -optlo-loop-unswitch -optlo-mem2reg -optlo-prune-eh
$ time ./ex1
5000000050000000
./ex1  0.01s user 0.00s system 114% cpu 0.006 total

Bam! Digging around in the assembly, which is munged in all sorts of ways, there it is:

        movabsq $5000000050000000, %r14 # imm = 0x11C3793ADB7080

The sum from 1 to n was computed at compile time by GHC and LLVM. Now, it is unclear to me why these optimizations were what was needed — and I seem to get the constant produced with other flags after -optlo-O2 as well (loop-unswitching, mem2reg). So my theory is that many of these passes share a common clean-up phase, which is actually doing the work. Anyway, there it is: better code than we’ve ever convinced either -fasm or -fvia-C to generate.

I think this is the limit case for what the GA can do. We know enough information was available statically to compute the entire computation, and standard optimization techniques would get us there. It was just a matter of getting the flags right. It will be interesting to see what improvements the GA can find in code that is less obviously amenable to tweaking.

More realistic?

Let’s try summing a stream of random numbers — so there’s no clever closed form solution for LLVM to find. Can it just, in general, improve loops in Haskell?

First, the source:

import qualified Data.Vector as U
import qualified Data.Vector.Random.Mersenne as R
import System.Random.Mersenne.Pure64

-- compute the sum of a large vector of randoms generated
-- with the mersenne twister.
--
sums g = U.sum (R.randoms g 10000000 :: U.Vector Int)

main = do
    g <- newPureMT
    print (sums g)

A vector is generated using the Mersenne Twister, containing 10^7 random 64 bit integers. We then sum those values. GHC fuses this into a single loop where accumulation and random generation are interleaved. That means it is very simple code. Just a loop, a call to the mersenne twister, and an accumulator. There shouldn’t be that much to improve on.
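
Roughly, the fused loop has this shape (a hand-written sketch of what fusion produces, not GHC’s actual output): generator stepping and accumulation interleaved in a single pass, with no intermediate vector.

{-# LANGUAGE BangPatterns #-}
import System.Random.Mersenne.Pure64 (PureMT, newPureMT, randomInt64)

-- Sketch of the fused loop: step the generator and accumulate
-- in one pass.
sumRandoms :: Int -> PureMT -> Int
sumRandoms = go 0
  where
    go !acc n g
      | n <= 0    = acc
      | otherwise = let (x, g') = randomInt64 g
                    in  go (acc + fromIntegral x) (n - 1) g'

main :: IO ()
main = do
    g <- newPureMT
    print (sumRandoms 10000000 g)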

I launch Acovea on it, as before, and 4 hours later we have:

Optimistic options:
                          -optlo-mem2reg  (1.914)
                       -optlo-scalarrepl  (2.413)
Pessimistic options:
                               -optlo-O1  (-2.577)
                               -optlo-O2  (-2.64)
                               -optlo-O3  (-2.515)

Very interesting: the default optimization levels actually hurt running times, according to Acovea. Let’s check that:

$ ghc --make -Odph -fllvm -optlo-O3 ex2.hs

$ time ./ex2
4460916339009010646
./ex2  0.31s user 0.00s system 90% cpu 0.347 total

And with the “common options” that were suggested:

$ ghc --make -Odph -O2 -funbox-strict-fields -fllvm -optlo-memdep
-optlo-abcd -optlo-loop-unroll -optlo-mem2reg -optlo-scalarrepl 

$ time ./ex2
4322192527286846546
./ex2  0.28s user 0.01s system 97% cpu 0.302 total

So it found a 13% improvement over the default llvm flags.

Sum of Squares

Now, something relatively serious, the sum of squares of a random vector. First, in Haskell.

-- compute the sum of squares of a vector.
--
sumsq g = U.sum (U.map (\x -> x * x) vector)
    where
        vector = R.randoms g 10000000 :: U.Vector Int

GHC then fuses the three loops here (generation, mapping and summing), into a single loop with no allocations, which is then ready for low level optimizations.

After crunching away for 4 hours, Acovea recommends:

Optimistic options:
                      -optlo-loop-reduce  (1.59)
                       -optlo-scalarrepl  (2.079)

Pessimistic options:
                               -optlo-O1  (-2.591)
                               -optlo-O2  (-2.537)
                               -optlo-O3  (-2.591)
                               -optlc-O0  (-2.591)
                               -optlc-O1  (-1.722)

With regular flags, we get:

$ ghc -v0 --make -fforce-recomp -O2 -funbox-strict-fields
-fllvm -optlo-O3 -optlc-O3 ex3.hs

$ time ./ex3
8241655411004882824       

ex3  0.30s user 0.01s system 96% cpu 0.315 total

And indeed, with its “best of the best” set of flags, we get faster code:

$ ghc -v0 --make -fforce-recomp -Odph -O2 -funbox-strict-fields
-fllvm -optlo-disable-inlining -optlo-basicaa -optlo-basiccg
-optlo-count-aa -optlo-domfrontier -optlo-domtree
-optlo-globalsmodref-aa -optlo-memdep -optlo-no-aa
-optlo-postdomtree -optlo-codegenprepare -optlo-abcd
-optlo-functionattrs -optlo-block-placement -optlo-constmerge
-optlo-constprop -optlo-die -optlo-dse -optlo-globaldce
-optlo-globalopt -optlo-indvars -optlo-inline -optlo-ipconstprop
-optlo-ipsccp -optlo-lcssa -optlo-loop-deletion -optlo-loop-index-split
-optlo-loop-unroll -optlo-loop-unswitch -optlo-loopsimplify
-optlo-mem2reg -optlo-memcpyopt -optlo-scalarrepl
-optlo-tailcallelim ex3.hs

$ time ex3
4829650279590649126
ex3  0.29s user 0.01s system 100% cpu 0.296 total

Finding, again, 10% or so.

And though this is a little more complex (3 arguments in the loop, instead of two), GHC turns this:

sumsq g = U.sum (U.map (\x -> x * x) vector)
    where
        vector = U.enumFromN 1 10000000 :: U.Vector Int

with the same set of flags as the first version, into something that goes from running 228 ms with default LLVM flags to 14 ms with the flags found by Acovea in the first example (which seem to work really well for enumerations!).

Summary

The LLVM optimization layer for GHC adds a huge amount of mostly untapped potential. The optimization space is huge, and using a GA or similar approach to custom-optimize particular programs can be useful for finding common flags, or special purpose flags for niche applications (like scientific simulations). One downside is that it can take quite a while to find a solution, though there are ways we could speed up the search. Additionally, we don’t yet have a way to generate multiple passes (maybe we should run -optlo-$foo 10 times!). GHC just keeps running its own optimizer until the code stops changing; maybe we need a similar approach with LLVM, as sketched below.
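
Here’s a rough sketch of that fixed-point idea (my own invention, assuming the opt tool is on the PATH and that we shuttle textual IR through stdin/stdout; the pass list is illustrative):

import System.Process (readProcess)

-- Iterate a fixed list of LLVM passes over textual IR until the
-- output stops changing, mirroring GHC's own simplifier fixpointing.
optToFixpoint :: [String] -> String -> IO String
optToFixpoint passes = go
  where
    go ir = do
        ir' <- readProcess "opt" ("-S" : passes) ir
        if ir' == ir then return ir else go ir'

main :: IO ()
main = do
    ir  <- readFile "ex1.ll"              -- hypothetical input file
    out <- optToFixpoint ["-mem2reg", "-instcombine", "-gvn"] ir
    writeFile "ex1.opt.ll" out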

I also speculate that the large loop bodies generated through significant fusion should give LLVM some good straight line code to work on.

My next step is to wrap up this approach into a simple Haskell library, modelled on Criterion, such that Haskellers can easily try this:

import Evolution

main = evolve main'

which will take ‘main’, and start evolving the flags used to compile it, against the current GHC/LLVM spec. It will use criterion to compute the fitness in a robust way (so you don’t need to worry about short or long running times), and at the end will generate a graph of the evolution, and a recommended set of flags to use in future. That’ll be useful to a few people.

We could then apply this tool to all of nofib to compute a general set of useful LLVM flags to use for everyone. Finally, there’s still much scope to improve the bytecode going into LLVM, so that more of the analysis phases will fire. Any Summer of Code students interested?

LLVM also supports plugins loaded on the command line. I’d love to be able to write new passes in Haskell, and load them into LLVM on the fly. This might really make it easy to try out some new FP-specific passes on the SSA-alike representation. Another good Summer of Code project…

The LLVM optimizing backend is fun!

Evaluation strategies and synchronization: things to watch for

A talk presented at the weekly Galois Developer Symposium.

The source of the talk and examples are all online.

This talk shows two areas relating to evaluation strategies in Haskell, and how they can subtly interact with threads, synchronization and performance. In addition, we briefly look at the interaction with asynchronous exceptions.

In particular,

  • Be careful with lazy values in synchronization variables (see the sketch after this list)
  • Consider using the strict-concurrency package for synchronization
  • All modify* functions should be used very, very carefully
  • Interactions between async exceptions (e.g. block/killThread) and threads need expert eyes.
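
Here’s a minimal sketch of the first point (my own example, not from the talk): a thunk stored in an MVar is forced by whoever takes it, so the consumer pays for the producer’s work unless the value is forced first (with evaluate, $!, or a strict MVar from strict-concurrency).

import Control.Concurrent
import Control.Exception (evaluate)

main :: IO ()
main = do
    m <- newEmptyMVar

    -- Lazy: the unevaluated thunk goes into the MVar, so the
    -- *taking* thread pays the cost of the sum when it prints.
    putMVar m (sum [1 .. 1000000 :: Int])
    takeMVar m >>= print

    -- Strict: force the value before storing it, so the work
    -- happens here, under this thread's exception handlers.
    v <- evaluate (sum [1 .. 1000000 :: Int])
    putMVar m v
    takeMVar m >>= print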

Fusion makes functional programming fun!

It’s deeply satisfying when the stream fusion optimization kicks in, in GHC Haskell. It is a powerful illustration of how the absence of side effects makes the compiler’s job so much easier, and lets it be far more aggressive.

Here’s a little example to show the real expressive power we get when we have representation-changing optimizations like fusion to rely on. It’s a simple example showing how well GHC does at this kind of stuff these days: add a sequence of numbers. I’ve kept it super minimal so you can mentally compute each transformation and see GHC at work. (Of course there are lots of other ways to write trivial computations like this, but anything more substantial will make the assembly much harder to follow, so for teaching purposes we’ll keep the source problem simple. You can also use any of the 100+ other functions that fuse to construct all sorts of interesting programs with no closed form solution. Get at it!)

We begin with a pipeline of multiple recursive functions, brought together with function composition. As high level as it gets, and any junior Haskell programmer should be able to write stuff like this:

import qualified Data.Vector as U

main = print . U.sum $ U.enumFromTo 1 (100000000 :: Int)

Now, semantically, there is an intermediate 100 million element array being created by enumFromTo. Try doing that in your favorite strict language. However…

… fusion kicks in, and GHC turns our program, via the source-to-source optimization phase, into a single tail recursive function using only unboxed values, with no intermediate array! No heap allocation, no stack, only one loop. A different algorithm!

    loop :: Int# -> Int# -> Int#
    loop x y = case y <= 100000000 of
          False -> x
          True  -> loop (x + y) (y + 1)

The compiler reordered our computations pretty wildly, and it can do this because there are no side effects by default — making large scale transformations like this far more feasible. I used ghc 6.13 with the llvm patches, and ghc -Odph -fllvm --make A.hs

The programmer didn’t have to think about recursion at all in the original program, let alone manual unboxing. Thanks GHC!

And now this tail recursive function gets compiled into an imperative loop via the codeGen phase. First, with the native code backend to GHC:

loop:
    .Lc216:
       cmpq $10000000,%rsi
       jle .Lc219
       movq %r14,%rbx
       movq (%rbp),%rax
       jmp *(%rax)
    .Lc219:
       addq %rsi,%r14
       incq %rsi
       jmp loop

And again, with the LLVM backend to GHC:

loop:
    leaq    1(%rsi), %rax
    addq    %rsi, %r14
    cmpq    $10000001, %rax
    jge     .LBB1_5
    addq    $2, %rsi
    addq    %rax, %r14
test:                                # %tailrecurse
    cmpq    $10000001, %rsi
    jl      loop

And that’s it. Our kernel code is as low level and un-functional as you get, but our input program was the most classic of high level approaches: a pipeline of functions held together with composition, with intermediate data structures all over the place.

And it performs!

With -fasm, the mean running time is 92.9 ms; with -fllvm, it is 79.3 ms (a 15% improvement).

Maybe our compiler is — sometimes — sufficiently smart.

To me, the fusion optimization is the epitome of what functional programming is about. I can write code at a very high level, throwing functions and composition around like candy, and the compiler will make stunning, complexity-changing transformations on my code, so it just runs damn fast. Can your language — functional or otherwise — do this?

Modern Benchmarking in Haskell

Thanks to work by Bryan O’Sullivan, there has been a renaissance in performance benchmarking tools for Haskell, built upon Criterion.

Compared to most other benchmarking frameworks (for any programming language, not just Haskell), criterion focuses on being easy to use, informative, and robust.

The Galois Tech Talk of Feb 23rd presented this work. You can read the slides online, or find the source and examples here.

Criterion uses statistically robust mechanisms for sampling and computing sound microbenchmark results, and is more stable in the presence of noise on the system than naive timings.
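
For example, a minimal criterion benchmark looks like this (a sketch; the benchmark subjects here are arbitrary):

import Criterion.Main
import qualified Data.Vector as U

-- Criterion runs each benchmark many times, analyses the samples,
-- and reports statistically robust estimates.
main :: IO ()
main = defaultMain
    [ bench "vector sum" $ whnf U.sum (U.enumFromTo 1 (1000000 :: Int))
    , bench "list sum"   $ whnf sum   [1 .. 1000000 :: Int]
    ]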

Criterion has in turn spawned some extensions:

  • Progression: compare different criterion graphs
  • NoSlow: a new array benchmark suite based on Criterion

In this talk I will present these tools, how to use them, and how to make your performance benchmarks in Haskell, or languages Haskell can talk to, more reliable. In addition, we’ll explore benchmarks using the new vector package, and GHC’s llvm backend.

Smoking fast Haskell code using GHC’s new LLVM codegen


In this post we’ll play with GHC’s new LLVM code generator backend, and see how much faster some Haskell programs are when compiled with LLVM instead of GCC.

For the kind of loops we get from stream fusion, the -fllvm backend produced a lot better code, up to 3x faster in some cases. There are pretty graphs, and some smoking hot new technology.

Overview

This week David Terei announced that his work on an LLVM code generator backend for the Glasgow Haskell Compiler was ready to try out. Initial reports from his undergraduate thesis held that the LLVM code generator was competitive with the current GHC native code generator, a bit slower than the C backend in general (which uses GCC for code generation), but, tantalisingly, should produce big speedups for particular Haskell programs. In particular, tight loops of the kind generated by the bytestring, vector, data parallel arrays or text libraries. David reported speedups of 25% over the previous best performance we’d got from GHC for data parallel code.

I was very keen to try it out on the vector library — a fast, fusible numerical arrays package (similar to NumPy) which generates some very tight loops. Under the C backend, GCC had been failing to spot that the code GHC generates is actually made of loops, and this led to GCC optimizing the generated code pretty badly. The native code generator does OK, but doesn’t have a lot of the clever low-level optimizations we need for really good bare metal performance.

So how would the new LLVM backend do?

Setting up

To try out the LLVM backend I followed the instructions on the wiki.

  • Check out GHC HEAD from darcs.
  • Apply the LLVM patch.
  • Check out LLVM from svn
  • Apply the GHC patch
  • Build your GHC.

This worked out of the box, and I now have a GHC 6.13 with the -fllvm flag.

$ ghc --info
 [("Project name","The Glorious Glasgow Haskell Compilation System")
 ,("Project version","6.13.20100221")
 ,("Booter version","6.12.1")
 ,("Stage","2")
 ,("Have interpreter","YES")
 ,("Object splitting","YES")
 ,("Have native code generator","YES")
 ,("Have llvm code generator","YES")
 ,("Support SMP","YES")
 ,("Unregisterised","NO")
 ,("Tables next to code","NO")
 ,("Win32 DLLs","")
 ,("RTS ways","l debug  thr thr_debug thr_l  ")
 ,("Leading underscore","NO")
 ,("Debug on","False")
 ,("LibDir","/home/dons/lib/ghc-6.13.20100221")
 ]

Running on a dual core Core 2 laptop:

$ uname -msr
 Linux 2.6.32-ARCH x86_64

You can then install packages as normal, via cabal, and add the -fllvm flag to see GHC build things via the new backend:

$ cabal install primitive --ghc-options=-fllvm

The packages I’m interested in are vector and the benchmarking libraries (criterion and progression), along with some helper code.

I also modified the ghc-core tool to support showing the LLVM generated assembly.

Warm up lap

Let’s check the backend is working (remember to add the -fllvm flag):

$ ghc -O2 --make A.hs -fllvm -fforce-recomp
[1 of 1] Compiling Main             ( A.hs, A.o )
Linking A ...
$ time ./A
"hey"
./A  0.00s user 0.00s system 61% cpu 0.005 total

Good! The LLVM backend is generating working code for x86_64/Linux. Now, something more ambitious … a program from the shootout.

A shootout program

So let’s find some code that’s already been optimized. I’ll compile the pidigits shootout benchmark (where Haskell’s already the fastest entry).

First, with the native code gen:

$ ghc -O2 -fasm A.hs --make -fforce-recomp

$ time ./A 10000 > /dev/null
./A 10000 > /dev/null 3.19s user 0.03s system 91% cpu 3.509 total

With the old GCC backend:

$ ghc -O2 -fvia-C -optc-O3 A.hs --make -fforce-recomp

$ time ./A 10000 > /dev/null
./A 10000 > /dev/null 2.89s user 0.03s system 97% cpu 2.988 total

And with the -fllvm backend:

$ ghc -O2 -fllvm A.hs --make -fforce-recomp

$ time ./A 10000 > /dev/null
./A 10000 > /dev/null 2.86s user 0.02s system 98% cpu 2.936 total

Woo. It runs, and we get a speedup! Now for some serious business.

The Vector Package

Vector is a Haskell library for working with arrays. It provides several array types (boxed, unboxed, C), with a rich interface similar to the lists library, and some functions reminiscent of Data Parallel Haskell. There’s a tutorial on how to use it.

The interface is built entirely around stream fusion combinators — a general form of classic loop fusion made possible by purity. When you do multiple passes over the data (e.g. sum/map/fold/filter/…) the compiler will common up the loops, and discard intermediate arrays, making the code potentially very fast.

The loops that are generated tend to be very register heavy, do no heap allocation, and benefit from clever imperative loop optimizations. Unfortunately, the GCC backend to GHC doesn’t spot that these are actually loops, so doesn’t get to fire many optimizations.

The promise of the LLVM backend is that it will recognize the loops GHC generates from fused code. Let’s see how it performs.

To benchmark these programs, I’ll use the criterion and progression benchmarking libraries. (I had to build the darcs version of gtk2hs, and compile data-accessor-template with the -ftemplate_2_4 flag.)

Simple loops

To start off, let’s generate 1 billion ints, sum them, print the result. That should tell us if our loops are efficient:

import qualified Data.Vector as U
main = print . U.sum $ U.enumFromTo 1 (1000000000 :: Int)

There are two loops in this program. enumFromTo and sum.

The core

GHC compiles these two loops into a single loop, when compiled with -O2 or -Odph:

loop  :: Int# -> Int# -> Int#
loop x y =
     case <=# y 1000000000 of
         False -> x
         True  ->  loop (x +# y) (y +# 1)

This is perfect. We write “sum (enumFromTo 1 n)” and we get a non-allocating loop.

The native backend

GHC 6.13 with the native code generator generates the following assembly for the inner loop:

Main_mainzuzdszdwfoldlMzqzuloop_entry:
 .Lc21u:
 cmpq $1000000000,%rsi
 jle .Lc21x
 movq %r14,%rbx
 movq (%rbp),%rax
 jmp *(%rax)
 .Lc21x:
 addq %rsi,%r14
 incq %rsi
 jmp Main_mainzuzdszdwfoldlMzqzuloop_entry

which runs in:

$ time ./enum
 500000000500000000
 ./enum  1.00s user 0.00s system 99% cpu 1.008 total

The C backend

GHC 6.12.1 with the C backend, (-fvia-C -optc-O3) (I’m having trouble linking programs with the C backend and GHC 6.13), yields a pretty small loop:

Main_mainzuzdszdwfoldlMzqzuloop_info:
 cmpq    $1000000000, %r14
 movq    %r14, %rax
 jle     .L2
 movq    %rsi, %rbx
 jmp     *(%rbp)
 .L2:
 leaq    1(%r14), %r14
 addq    %rax, %rsi
 jmp     Main_mainzuzdszdwfoldlMzqzuloop_info

Which runs slower than the native code generator:

$ time ./enum
 500000000500000000
 ./enum  1.09s user 0.00s system 99% cpu 1.100 total

The LLVM backend

With -O2 -fllvm we get very different code, and it is a bit harder to work out what is going on. LLVM transforms the code far more aggressively.

.LBB1_2:
 leaq    1(%rsi), %rax
 addq    %rsi, %r14
 cmpq    $1000000001, %rax
 jge     .LBB1_5                 # loop exit
 addq    $2, %rsi
 addq    %rax, %r14
 .LBB1_1:                        # %tailrecurse
 cmpq    $1000000001, %rsi
 jl      .LBB1_2

And the proof is in the pudding:

$ time ./enum
 500000000500000000
 ./enum  0.48s user 0.01s system 99% cpu 0.488 total

This is the fastest Haskell we’ve ever generated for this little benchmark (at least without manual loop unrolling)!

The LLVM backend more than halved the running time for this simple loop. But remember: general benchmarks aren’t seeing these kinds of speedups — LLVM really excels on tight numeric code.

Here’s the data presented in a slightly different form, with criterion and progression. The numbers are slightly different, since we won’t inline the length of the vector argument, and we’re wrapping the code in benchmarking wrappers. I wasn’t able to get -fvia-C programs to link under the HEAD, so we’ll exclude those from graphs, but report them in text form.

(Graphs: criterion results with the -fasm backend, with the LLVM backend, and a side-by-side comparison under the progression package.)

The -fasm backend under the progression tool ran around ~1s for each billion ints, while -fllvm was around 0.8s. Note that we get slightly different timings with the loops under each benchmarking tool, due to how the benchmark program and wrapper are optimized.

Zips

Zips are another good candidate, since they turn into nested loops. So, e.g.

import qualified Data.Vector as U
import Data.Bits
main = print . U.sum . U.map (`shiftL` 1) $ U.zipWith (*)
                        (U.enumFromTo 1 (100000000 :: Int))
                        (U.replicate (100000000 :: Int) 42)

Which fuses to this set of loops:

loop  :: Int# -> Int# -> Int# -> Int#
loop =
  \ (sc_s29b :: Int#)
    (sc1_s29c :: Int#)
    (sc2_s29d :: Int#) ->
    case <=# sc1_s29c 100000000 of _ {       False -> sc_s29b;
      True ->
        case <=# sc2_s29d 0 of _ {           False ->
            loop
              (+#
                 sc_s29b (uncheckedIShiftL# (*# sc1_s29c 42) 1))
              (+# sc1_s29c 1)
              (-# sc2_s29d 1);
          True -> sc_s29b
        }
    }

Which, again, is perfect Core. All those functions combined into a single non-allocating loop.

-fasm:

Main_mainzuzdszdwfoldlMzqzuloop_entry:
.Lc2aC:
        cmpq $100000000,%rsi
        jle .Lc2aE
        movq %r14,%rbx
        movq (%rbp),%rax
        jmp *(%rax)
.Lc2aE:
        testq %rdi,%rdi
        jle .Lc2aH
        movq %rsi,%rax
        imulq $42,%rax
        shlq $1,%rax
        addq %rax,%r14
        incq %rsi
        decq %rdi
        jmp Main_mainzuzdszdwfoldlMzqzuloop_entry
.Lc2aH:
        movq %r14,%rbx
        movq (%rbp),%rax
        jmp *(%rax)

Which is reasonable:

$ time ./zipwith
420000004200000000
./zipwith 0.24s user 0.00s system 99% cpu 0.246 total

With the -fvia-C -optc-O3 backend, just the inner loop, since that’s easy to read:

Main_mainzuzdszdwfoldlMzqzuloop_info:
        cmpq    $100000000, %rsi
        jg      .L6
.L2:
        testq   %r14, %r14
        jle     .L6
        leaq    (%rsi,%rsi,4), %rcx
        leaq    -1(%r14), %r14
        leaq    (%rsi,%rcx,4), %rcx
        leaq    1(%rsi), %rsi
        leaq    (%rdi,%rcx,4), %rdi
        jmp     Main_mainzuzdszdwfoldlMzqzuloop_info

Which runs in about the same time as the -fasm backend:

$ time ./zipwith
420000004200000000
./zipwith  0.25s user 0.00s system 99% cpu 0.251 total

With -fllvm the code is wildly different, and I find it pretty hard to reconstruct what transformations LLVM has done.

Main_mainzuzdszdwfoldlMzqzuloop_entry:
# BB#0:                                 # %c2cf
        subq    $8, %rsp
        imulq   $84, %rsi, %rax
        jmp     .LBB1_1
.LBB1_3:                                # %n2cN
                                        #   in Loop: Header=BB1_1 Depth=1
        incq    %rsi
        decq    %rdi
        addq    %rax, %r14
        addq    $84, %rax
.LBB1_1:                                # %tailrecurse
                                        # =>This Inner Loop Header: Depth=1
        cmpq    $100000001, %rsi        # imm = 0x5F5E101
        jge     .LBB1_4
                                        #   in Loop: Header=BB1_1 Depth=1
        testq   %rdi, %rdi
        jg      .LBB1_3
.LBB1_4:                                # %n2ck
        movq    (%rbp), %rax
        movq    %r14, %rbx
        movq    (%rax), %r11
        addq    $8, %rsp
        jmpq    *%r11  # TAILCALL

The “inner loop” is interesting. Nothing like what -fasm or -fvia-C generate. And it’s way faster:

$ time ./zipwith
420000004200000000
./zipwith 0.15s user 0.00s system 99% cpu 0.154 total

So yeah, 40% faster!

Criterion

Here it is again under criterion (same code, but different values of n). With the -fasm backend, the mean execution time is 186 ms; with the -fllvm backend, it is 135 ms (a 27% improvement).

zipwith3

Heavily nested zips are probably the best cases for LLVM, and we see the -fllvm backend do some pretty wild stuff with this:

import qualified Data.Vector.Unboxed as U
import Data.Bits

main = print . U.sum $ U.zipWith3 (\x y z -> x * y * z)
           (U.enumFromTo 1 (100000000 :: Int))
           (U.enumFromTo 2 (100000001 :: Int))
           (U.enumFromTo 7 (100000008 :: Int))

Which fuses to:

main_$s$wfoldlM'_loop [Occ=LoopBreaker]
  :: Int# -> Int# -> Int# -> Int# -> Int#

main_$s$wfoldlM'_loop =
  \ (sc_s2jh :: Int#)
    (sc1_s2ji :: Int#)
    (sc2_s2jj :: Int#)
    (sc3_s2jk :: Int#) ->
    case <=# sc1_s2ji 100000000 of _ {
      False -> sc_s2jh;
      True ->
        case <=# sc2_s2jj 100000001 of _ {
          False -> sc_s2jh;
          True ->
            case <=# sc3_s2jk 100000008 of _ {
              False -> sc_s2jh;
              True ->
                main_$s$wfoldlM'_loop
                  (+#
                     sc_s2jh (*# (*# sc1_s2ji sc2_s2jj) sc3_s2jk))
                  (+# sc1_s2ji 1)
                  (+# sc2_s2jj 1)
                  (+# sc3_s2jk 1)
            }
        }
    }

Great core. With the -fasm backend:

Main_mainzuzdszdwfoldlMzqzuloop_entry:
.Lc2lq:
        cmpq $100000000,%rsi
        jle .Lc2ls
        movq %r14,%rbx
        movq (%rbp),%rax
        jmp *(%rax)
.Lc2ls:
        cmpq $100000001,%rdi
        jle .Lc2lu
        movq %r14,%rbx
        movq (%rbp),%rax
        jmp *(%rax)
.Lc2lu:
        cmpq $100000008,%r8
        jle .Lc2lx
        movq %r14,%rbx
        movq (%rbp),%rax
        jmp *(%rax)
.Lc2lx:
        movq %rdi,%rax
        imulq %r8,%rax
        movq %rsi,%rcx
        imulq %rax,%rcx
        addq %rcx,%r14
        incq %rsi
        incq %rdi
        incq %r8
        jmp Main_mainzuzdszdwfoldlMzqzuloop_entry

Straightforward, and running it:

$ time ./zipwith3
3541230156834269568
./zipwith3  0.47s user 0.01s system 98% cpu 0.484 total

With -fvia-C -optc-O3:

Main_mainzuzdszdwfoldlMzqzuloop_info:
        .text
        .p2align 4,,15
.text
        .align 8
        .type Main_mainzuzdszdwfoldlMzqzuloop_info, @function
# 38 "/tmp/ghc10013_0/ghc10013_0.hc" 1
# 0 "" 2
        cmpq    $100000000, %rdi
        jg      .L9
.L4:
        cmpq    $100000001, %rsi
        jg      .L9
.L5:
        cmpq    $100000008, %r14
        .p2align 4,,5
        jg      .L9
.L7:
        movq    %rsi, %r10
        leaq    1(%rsi), %rsi
        imulq   %rdi, %r10
        leaq    1(%rdi), %rdi
        imulq   %r14, %r10
        leaq    1(%r14), %r14
        leaq    (%r10,%r8), %r8
        jmp     Main_mainzuzdszdwfoldlMzqzuloop_info

And we get a faster result:

$ time ./zipwith3
3541230156834269568
./zipwith3  0.34s user 0.00s system 99% cpu 0.344 total

With -fllvm, it looks like some heavy loop unrolling:

Main_mainzuzdszdwfoldlMzqzuloop_entry:  # @Main_mainzuzdszdwfoldlMzqzuloop_entry
# BB#0:                                 # %c2oz
        subq    $56, %rsp
        cmpq    $100000002, %rdi        # imm = 0x5F5E102
        movl    $100000002, %eax        # imm = 0x5F5E102
        movq    $-2, %rdx
        movq    %r9, 40(%rsp)           # 8-byte Spill
        movq    %r15, 48(%rsp)          # 8-byte Spill
        movq    $-3, %r9
        movq    %r12, 32(%rsp)          # 8-byte Spill
        movq    %r8, %rbx
        movq    %r13, 24(%rsp)          # 8-byte Spill
        movq    %r14, 16(%rsp)          # 8-byte Spill
        leaq    1(%rdi), %r13
        cmovgq  %rdi, %rax
        negq    %rax
        leaq    -1(%rdi,%rax), %rcx
        cmpq    $100000009, %r8         # imm = 0x5F5E109
        movl    $100000009, %eax        # imm = 0x5F5E109
        cmovgq  %r8, %rax
        negq    %rax
        leaq    -1(%r8,%rax), %rax
        cmpq    %rcx, %rax
        cmovaq  %rax, %rcx
        cmpq    $100000001, %rsi        # imm = 0x5F5E101
        movl    $100000001, %eax        # imm = 0x5F5E101
        cmovgq  %rsi, %rax
        negq    %rax
        leaq    -1(%rsi,%rax), %rax
        cmpq    %rax, %rcx
        cmovbeq %rax, %rcx
        imulq   %rdi, %rbx
        imulq   %rsi, %r13
        movq    %rcx, %r10
        subq    %rcx, %rdx
        subq    %rcx, %r9
        imulq   %rsi, %rbx
        addq    %rdi, %r13
        notq    %r10
        movq    %r10, %rax
        imulq   %r10, %rbx
        mulq    %rdx
        addq    16(%rsp), %rbx          # 8-byte Folded Reload
        movq    %rax, %r11
        movq    %rdx, %r15
        movq    %r15, %r12
        movq    %r11, %rax
        andq    $1, %r15
        imulq   %r9, %r12
        mulq    %r9
        shldq   $63, %r11, %r15
        leaq    (%r8,%rdi), %r9
        addq    %rdx, %r12
        movq    $-4, %rdx
        addq    %rsi, %r9
        subq    %rcx, %rdx
        movq    %r12, %r14
        andq    $1, %r12
        leaq    6(%r9,%r9), %r10
        movabsq $6148914691236517205, %r9 # imm = 0x5555555555555555
        movq    %rdx, 8(%rsp)           # 8-byte Spill
        imulq   %rdx, %r14
        leaq    1(%rdi,%rsi), %rdx
        shldq   $63, %rax, %r12
        imulq   %r8, %rdx
        imulq   %r12, %r10
        leaq    1(%rdx,%r13), %rdx
        imulq   %r10, %r9
        imulq   %r15, %rdx
        addq    %rdx, %rbx
        mulq    8(%rsp)                 # 8-byte Folded Reload
        subq    %r9, %rbx
        movq    %r8, %r9
        decq    %r8
        subq    %rcx, %r9
        addq    %rdx, %r14
        movq    %rdi, %rdx
        decq    %r9
        shldq   $62, %rax, %r14
        movq    %rsi, %rax
        subq    %rcx, %rdx
        andq    $-2, %r14
        subq    %rcx, %rax
        decq    %rdx
        addq    %rbx, %r14
        decq    %rax
        .align 16
.LBB2_1:                                # %tailrecurse
                                        # =>This Inner Loop Header: Depth=1
        cmpq    $100000001, %rsi        # imm = 0x5F5E101
        jge     .LBB2_4
# BB#2:                                 # %c2oD
                                        #   in Loop: Header=BB2_1 Depth=1
        cmpq    $100000002, %rdi        # imm = 0x5F5E102
        jge     .LBB2_4
# BB#3:                                 # %c2p5
                                        #   in Loop: Header=BB2_1 Depth=1
        incq    %rsi
        incq    %rdi
        incq    %r8
        cmpq    $100000009, %r8         # imm = 0x5F5E109
        jl      .LBB2_1
.LBB2_4:                                # %n2oE
        movq    (%rbp), %rcx
        movq    %r9, %r8
        movq    24(%rsp), %r13          # 8-byte Reload
        movq    32(%rsp), %r12          # 8-byte Reload
        movq    %r14, %rbx
        movq    %rax, %rsi
        movq    %rdx, %rdi
        movq    40(%rsp), %r9           # 8-byte Reload
        movq    48(%rsp), %r15          # 8-byte Reload
        movq    (%rcx), %r11
        addq    $56, %rsp
        jmpq    *%r11  # TAILCALL

And blows them all out of the water! 3x faster than -fasm! Twice as fast as -fvia-C -optc-O3.

$ time ./zipwith3
3541230156834269568
./zipwith3  0.16s user 0.00s system 99% cpu 0.158 total

From the Statistics package

The statistics package has some more “realistic” microbenchmarks. Let’s look at those. First, computing the mean of a large array of doubles (here all set to ‘pi’).

import qualified Data.Vector.Unboxed as V   -- assumed: the vector-based statistics API
import Statistics.Sample (mean)

main = print (mean (V.replicate 1000000000 (pi :: Double)))

With the -fasm backend:

Main_mainzuzdszdwfoldlMzuloop_entry:
.Lc2b2:
        testq %rsi,%rsi
        jle .Lc2b5
        cvtsi2sdq %r14,%xmm0
        movsd .Ln2b8(%rip),%xmm7
        subsd %xmm5,%xmm7
        divsd %xmm0,%xmm7
        addsd %xmm7,%xmm5
        incq %r14
        decq %rsi
        jmp Main_mainzuzdszdwfoldlMzuloop_entry

Simple, easy.

$ time ./mean
3.141592653589793
./mean  5.58s user 0.01s system 99% cpu 5.599 total

With the -fllvm backend:

Main_mainzuzdszdwfoldlMzuloop_entry:    # @Main_mainzuzdszdwfoldlMzuloop_entry
# BB#0:                                 # %c28E
        subq    $8, %rsp
        movsd   .LCPI3_0(%rip), %xmm0
        jmp     .LBB3_1
        .align 16
.LBB3_3:                                # %n28K.i
                                        #   in Loop: Header=BB3_1 Depth=1
        movapd  %xmm0, %xmm5
        cvtsi2sdq       %rcx, %xmm8
        addq    $-2, %rsi
        addq    $2, %r14
        subsd   %xmm7, %xmm5
        divsd   %xmm8, %xmm5
        addsd   %xmm7, %xmm5
.LBB3_1:                                # %tailrecurse
                                        # =>This Inner Loop Header: Depth=1
        testq   %rsi, %rsi
        jle     .LBB3_5
# BB#2:                                 # %n28K
                                        #   in Loop: Header=BB3_1 Depth=1
        movapd  %xmm0, %xmm7
        cvtsi2sdq       %r14, %xmm8
        leaq    -1(%rsi), %rax
        leaq    1(%r14), %rcx
        subsd   %xmm5, %xmm7
        testq   %rax, %rax
        divsd   %xmm8, %xmm7
        addsd   %xmm5, %xmm7
        jg      .LBB3_3
# BB#4:                                 # %c28J.i
        movq    (%rbp), %rdx
        movq    %rcx, %rbx
        movq    %rcx, %r14
        movq    %rax, %rsi
        movapd  %xmm7, %xmm5
        movq    (%rdx), %r11
        addq    $8, %rsp
        jmpq    *%r11  # TAILCALL
.LBB3_5:                                # %c28J
        movq    (%rbp), %rax
        movq    %r14, %rbx
        movq    (%rax), %r11
        addq    $8, %rsp
        jmpq    *%r11  # TAILCALL

And running it:

$ time ./mean
3.141592653589793
./mean  5.55s user 0.01s system 99% cpu 5.585 total

Some pretty wacky code, but a little faster.

Conclusions

The LLVM backend seems to be holding up to what we hoped: it does a better (sometimes much better) job on tight loops. We get better code than GHC has ever produced before. It seems pretty robust; so far everything I’ve tried has worked.

David’s benchmarks indicate that with the current first attempt at an LLVM backend, most large programs aren’t noticeably faster, but I think the promise we see in these small examples justifies spending more time working on the LLVM backend to GHC. It has much more potential than the GCC backend.

Currently we’re not experimenting with the LLVM optimization layer at all — I think there’s likely to be a lot of win just tweaking those settings (and exposing them to the Haskell programmer via GHC flags).
