The 8 Most Important Haskell.org GSoC Projects

While at ZuriHac, Johan Tibell, David Anderson, Duncan Coutts and I discussed what the highest priority projects for the Haskell community are, in the context of the Google Summer of Code, for which Haskell.org is a mentoring organization for the 5th year.

Here’s our top 8 most important projects, that we would really like to see good applications for. Some of these have tickets already, but some don’t. If you apply to work on projects like those below, you can expect strong support from the mentors, which ultimately determines if you’ll be funded.

For details on what we think you need to consider when applying to execute a project, see this earlier post.

A Package Versioning Policy Checker

Cabal relies on package version ranges to determine what Haskell software to install on your system. Version numbers are essentially “hashes” of the API of the package, and should be computed according to the package versioning policy. However, package authors don’t have a tool to automatically determine what the version number change to their package should be, when they release a new version, leading to mistakes, and needless dependency breakages.

This project would construct a tool that would be able to compute the correct package version number, given a package and an API change. As an extension, it would warn about errors in version ranges in .cabal files.

“cabal test”

Proper test support is essential for good software quality. By improving Cabal’s test support we can test all Cabal packages on continuous build machines which should help us detect breakages earlier. Making it easier to run the tests means that more people will run them and those who already do will run the more often.

Fast text/bytestring HTML combinators

We have Data.Binary for fast serialization of data structures to byte strings to be sent over the wire. High performance web servers need fast HTML generation too, and an approach based on Text.PrettyPrint combinators for filling unicode-friendly Data.Text buffers would be a killer app for web content generation in Haskell. This might mean working on BlazeHTML.

Threadscope with custom probes

ThreadScope is an amazing new tool in the Haskell universe for monitoring executing Haskell processes. It reveals detailed information about thread and GC performance. We’d like to extend the tool with support for new kinds of event hooks. Examples would be watching for MVar locks, STM contention, IO events, and more.

Combine Threadscope with Heap Profiling Tools

ThreadScope lets us monitor thread execution. The Haskell Heap Profiler lets us monitor the Haskell heap live. HPC lets us monitor which code is execution and when. These should all be in an integrated tool for monitoring executing Haskell processes.

LLVM Performance Study

GHC has an LLVM backend. The next step is to look closely at the kind of code we’re generating to LLVM, and the optimizations LLVM performs on GHC’s code, in order to further improve performance of Haskell code.

LLVM Cross Compiler

LLVM has support for many new backends, such as ARM. The challenge is to use this ability to generate native code for other architectures to turn GHC into a cross-compiler (so we could produce, e.g. ARM executables on an x86/Linux box). This will involve linker and build system hacking.

Hackage 2.0 Web Services

Hackage is the central repository for Haskell code.  It hosts around 2000 libraries, and is growing rapidly. It can be hard to determine which packages to use. We believe social mechanisms (comments, voting, …) can be very succesful in helping to both improve the quality of Hackage, and make it easier for developers to know which library to use. This project would bring Hackage 2.0 to a deployable state, and then consider better interfaces to search and sort packages.

These are the 8 projects we felt were the most important to the community. What do you think? Are there other key projects that need to be done , that will benefit large parts of the community, or enable the use of Haskell in new areas of importance?

After 3 years, my xmonad configuration now uses GNOME

Nearly three years ago, Spencer Janssen and I started work on xmonad, a tiling window manager for unix that would do what we want, automatically, so we could just concentrate on hacking code, without the window manager getting in the way. The project’s been quite successful — the most downloaded app on Hackage for the last couple of years, and thousands of users. It even has its own twitter, blogreddit and facebook accounts.

Originally I thought of this project something as the anti-GNOME: small, learn, and every part just does one thing only, but well – in the Unix tradition. And it has stayed lean. Around two thousand lines of Haskell for the core system, but with the benefit of hundreds of extensions in the contributor’s library — everyone’s config file is potentially a library module new users can import.

Over the years, GNOME and xmonad have started playing well together to the point that there’s relatively seemless interop between the two projects: you can use the full GNOME environment, and swap in xmonad as your window manager, or use a minimal environment with xmonad, adding in GNOME tools that help you.

Playing well with others is good for your open source software.

I’ve now finally switched my xmonad configuration to use a number of gnome apps, to support the core dynamic tiling provided by xmonad. Here’s my config file:

import XMonad
import XMonad.Config.Gnome
import XMonad.Layout.NoBorders
main = xmonad
    gnomeConfig {
            terminal = "term"
          , layoutHook  = smartBorders (layoutHook gnomeConfig)
    }

Yeah, that’s it. import XMonad.Config.Gnome, add smart borders, and overide the terminal to be my urxvt wrapper. xmonad is configured in Haskell, or languages Haskell can interoperate with.

My session is started up from .xinitrc as:

gitit &
gnome-panel &
gnome-power-manager &
dbus-launch --exit-with-session xmonad

I use gitit as my personal wiki, and then put a few things in the gnome-panel.

I’m really happy with how easy it now is to use xmonad with all the regular GNOME apps that people would like to see. This kind of friendliness to the dominate tools of the day is good for the project — and good for our users.

So you want to hack Haskell for the Google Summer of Code

The Google Summer of Code is a great chance to work on open source projects as a student, and get training from some experience hackers, wonderfully sponsored by Google. Haskell.org will be participating for its 5th year.

If you’re thinking about working on Haskell projects, you should certainly be reading:

Here are some of the things to think about before you decide to submit a proposal to help out Haskell.org this summer, to help you make stronger proposals. This is purely my opinion, and might not necessarily reflect the opinion of all other mentors.

We have limited resources as a community, and the Google Summer of Code has been instrumental in bringing new tools and libraries to our community. Some notable projects from the past few years include:

  • The GHCi debugger
  • Improvements to Haddock
  • Major work on Cabal
  • Generalizing Parsec (parsec3)
  • Shared Libraries for GHC
  • Language.C

These student projects have gone on to have wide impact, or brought new capabilities to the Haskell community. The pattern has been towards work on the most popular tools and libraries, or work on new areas that Haskell programmers are demanding work on. Rarely do we fund development of applications in Haskell, instead concentrating on general infrastructure that supports all Haskell projects.

To succeed with your project proposal, you need to propose doing work on the most important “blockers” for Haskell’s use. Work that:

  • Brings the most value to the community
  • Addresses widespread need
  • Will impact many other libraries and tools
  • Doesn’t rewrite things in Haskell for their own sake
  • Is feasible in 3 months

It can be hard to find projects of the right size — important enough the work will make a difference — and thus attract the attention of mentors (who vote on your proposal), but is still feasible to achieve in 3 months.

To help get a sense of the mood of what the community thinks is “hot” (though not necessarily important), we set up a crowd voting site for ideas on Reddit. But be wary, the crowd can get excited about silly things, and doesn’t necessarily have good business sense.

The list of projects can help you get a sense for what Haskellers are thinking about this year. Not everything here will is feasible for a summer project though, so be sure to get advice!

Some of the social factors to consider in your application:

  • You have one or two mentors with good expertise in the area. Ideally you’re already lining up the mentors to help watch your work.
  • You’re hanging out in #haskell or on the Haskell Reddit, your blog is
    on Planet Haskell, or you’re going to be at the hackathons.

The more involved you are in the community, the more likely you’ll have a sense of what the most important work to propose is.

And of course you need to have skills to pay the bills:

  • You should have demonstrated competence in Haskell (e.g. you’ve
    uploaded things to Hackage)
  • Demonstrated discipline in open source — self-motivation

As a guide to what reviewers will be thinking about when reading your proposal, you should be able to answer (at least) the following questions from the proposal alone:

  • What are you trying to accomplish?
  • How is it done in Haskell now, and with what limitations?
  • Are there similar projects in other languages? What related work is there?
  • If successful, what difference will it make? Will it enable new Haskell business cases? Performance improvements to many libraries? Better accessibility of Hackage code?
  • What are the mid-term and final results you’re hoping to achieve?
  • How will the result of your work affect other projects?

So, now you know what we’re interested in, start writing that proposal!

Evolving Faster Haskell Programs (now with LLVM!)

… In which I use genetic algorithms to search for optimal LLVM optimizer passes to make Haskell programs faster …

On a sunny day last Spring, I spent some time playing with genetic algorithms (GAs) for breeding faster Haskell programs, by improving the inlining hints suggested to GHC. The results were pretty cool: the GA found new inlining settings for existing Language Benchmark Game programs — that had already been heavily scrutinized — improving both programs I tried, and in one case, by 18%.

Now, 12 months later, it’s again a sunny Spring day, and we’ve got quite a few new tools to play with in the Haskell world, for working on interesting performance puzzles:

Today we’ll see what more we can extract from the LLVM backend.

GHC Overview

LLVM has a massive suite of low level optimizations, most of which were previously unavailable to the Haskell programmer. GHC has become, in a way, a Haskell to C– to LLVM compiler. (C– is translated to LLVM bytecode, which is then optimized using classic optimizations prior to code generation). To a first approximation, the LLVM optimizer implements Muchnik (recommended reading), and GHC has just imported 25 years of imperative language optimizations. This is exciting for me.

First GHC does masses of source to source transformations on the Haskell Core language (including, e.g. fusion and constructor specialization), before translating Core to C–, an imperative language interface between high-level compilers and optimizing code generators. Some transformations take place here, before spitting out LLVM bytecode using David Terei’s LLVM backend to GHC (PDF).

By default GHC just passes -O3 to the LLVM optimizer, which enables a number of passes, including:

  • interprocedural sparse conditional constant propagation.
  • combine instructions to form fewer, simple instructions
  • dead code elimination and basic block merging
  • bottom-up inlining of functions
  • promoting “by reference” arguments to be “by value” argument
  • transformeing tail recursive calls to local jumps
  • reassociating commutative expressions in an order that is designed to promote better constant propagation,
  • loop identification and simplification
  • a number of SSA analsyis passes
  • loop invariant code motion

Now, it is entirely unclear to me whether the default set of optimization passes enabled by llvm -O3 are suitable for the kind of code GHC generates, and I’m not alone in suspecting they’re not ideal. With the GCC backend, we were mostly out of luck in having control over the low level optimizations. The GCC optimization pipeline wasn’t terribly programmable, but LLVM lets us run any analysis or transformation pass in any order we want.

Which LLVM passes to use, and which order to run them, with what analysis in between, is a huge search problem, though. There are roughly 200 optimization and analysis flags, and these, mostly, can be run in any order, any number of times. If we want to run say, twenty of the passes (about what -O3 does), that’s what, about 10^46 arrangements!

So the challenge is to find a set of optimizations that catch some of the idioms in GHC’s generated code, and to hopefully find ways to make Haskell programs even faster.

Making it into a search problem

So I’ll use the approach I took in 2009 for the inlining problem, and use the genetic algorithm library, acovea, to breed the best set of flags for a few programs, where fitness for breeding is determined by total runtime. In this post I’m using GHC head, with the LLVM patch (more info in this post), and a custom patch to remove the default -O3 passed to LLVM.

I first have to give a specification of the flags to GHC + LLVM. Here’s the specification for GHC/LLVM flags that I am going to use — and you can use it too.

Note that LLVM allows us to specify the entire optimization pipeline order on the commandline, including duplicate phases. I don’t think Acovea will apply flags multiple times (no duplicate flags). We’ll see if that matters.

The acovea set of libraries are fairly widely available now, e.g. in Arch Linux. For more background on acovea, see the previous post on inlining — it’s easy to build if not packaged for your distro.

Computing fitness

The programs we want to improve have to report their fitness. To do this I’ll use criterion’s timing function, wrapping up the code we actually want to optimize in some measurement functions. This will just report total wall clock running time for the code under analysis.

We could make our fitness function more robust to noise by using criterion’s sampling and iteration framework, which will run the core code sufficiently often to remove noise. I have notes at the end on what my ideal tool for this looks like.

A warm up: sum of a sequence

To warm up, and see if this stuff works, I’ll begin with the toy program from last week’s fusion post. There, GHC fuses two loops into one, but it fails to collapse that final loop into a constant (something that GCC -O3 can do to manually fused C code). It’s a good test program to see how loops are handled in a new backend.

GHC + GCC isn’t able to spot that the loop is a constant either, however, but maybe GHC + LLVM can. I don’t know what flag I’ll need to have this happen, so let’s run the GA to explore the search space…

First, the code, wrapped in the measurement function:

import qualified Data.Vector as U

import Criterion.Measurement
import Text.Printf

-- code under analysis: sum [1..n] on vectors.
--
v = U.sum $ U.enumFromTo 1 (100000000 :: Int)

main = do
    d <- time_ $ v `seq` return ()
    putStrLn (secs' d)
secs' :: Double -> String
secs' k
    | k < 0      = '-' : secs' (-k)
    | otherwise  = printf "%.3f" k

We can compile and run this by hand:

$ ghc -Odph --make ex1.hs -fforce-recomp
[1 of 1] Compiling Main             ( ex1.hs, ex1.o )
Linking ex1 ...

$ time ./ex1
0.111
./ex1  0.12s user 0.01s system 101% cpu 0.122 total

While it computes the sum, it prints its fitness. Note that it removes any startup or IO cost incurred in measurement. When compiled with standard Haskell optimizations, and the existing native code generator, the program runs in 122ms. The LLVM backend does better, with its default optimizations added to the mix:

$ ghc -Odph --make ex1.hs -fforce-recomp -fllvm -optlo-O3
[1 of 1] Compiling Main             ( ex1.hs, ex1.o )
Linking ex1 ...

$ time ./ex1
0.051
./ex1  0.06s user 0.00s system 98% cpu 0.061 total

However, despite the value being a constant, llvm -O3 hasn’t removed the loop. Here’s the assembly of the inner loop:

.LBB1_2:
        leaq    1(%rsi), %rax
        addq    %rsi, %r14
        cmpq    $100000001, %rax
        jge     .LBB1_5
        addq    $2, %rsi
        addq    %rax, %r14
.LBB1_1:                                # %tailrecurse
        cmpq    $100000001, %rsi
        jl      .LBB1_2

So I take the code, and wrap it up in the GA tool, then go off and do other things for a few hours:

$ time runacovea -input ex1.hs
    -config /home/dons/share/libacovea/config/ghc-6_12.acovea

I chopped out the data and graphed it for you. Look at how the mean running times reduced in each generation.

It quickly drops down to the llvm -O3 plateau, then marches past that to turn the loop into a constant.

Optimistic options:
                        -optlo-globalopt  (1.723)
                    -optlo-loop-unswitch  (1.866)
                          -optlo-mem2reg  (2.536)
                         -optlo-prune-eh  (1.627)
Pessimistic options:
                             -optlo-adce  (-1.862)
                             -optlo-licm  (-1.623)
                               -optlc-O1  (-1.528)
                               -optlc-O3  (-2.149)
Common:
    -Odph -O2 -funbox-strict-fields
    -fllvm
    -optlo-O2 -optlo-globalopt -optlo-mem2reg -optlo-prune-eh

Once it was done, 4 hours later, I was pretty excited. Look what happened by generation 12! Instead of averaging say, 50ms a run, it was averaging 3ms a run. By generation 16, there was no run time at all.

LLVM did it!

It must have found a way to eliminate the loop entirely.

The flags it recommended were:

        -optlo-globalopt
    -optlo-loop-unswitch
          -optlo-mem2reg
         -optlo-prune-eh
 

So let’s try these out:

$  ghc ex1.hs -Odph --make -fforce-recomp -fllvm -optlo-O2
        -optlo-globalopt -optlo-loop-unswitch -optlo-mem2reg -optlo-prune-eh
$ time ./ex1
5000000050000000
./ex1  0.01s user 0.00s system 114% cpu 0.006 total

Bam! Digging around in the assembly, which is munged in all sorts of ways, there it is:

        movabsq $5000000050000000, %r14 # imm = 0x11C3793ADB7080

The sum from 1 to n was computed at compile time by GHC and LLVM. Now, it is unclear to me why these optimizations where what was needed — and I seem to get the constant produced with other flags after -optlo-O2 as well (loop-unswitching, mem2reg). So my theory is that many of these passes share a common clean up phase, which is actually doing the work. Anyway, there it is: better code than we’ve ever convinced either -fasm or -fvia-C to generate.

I think this is the limit case for what the GA can do. We know enough information was available statically to compute the entire computation, and standard optimization techniques would get us there. It was just a matter of getting the flags right. It will be interesting to see what improvements the GA can find in code that is less obviously amenable to tweaking.

More realistic?

Let’s trying summing a stream of random numbers — so there’s no clever closed form solution for LLVM to find. Can it just, in general, improve loops in Haskell?

First, the source:

import qualified Data.Vector as U
import qualified Data.Vector.Random.Mersenne as R
import System.Random.Mersenne.Pure64

-- compute the sum of a large vector of randoms generated
-- with the mersenne twister.
--
sums g = U.sum (R.randoms g 10000000 :: U.Vector Int)

main = do
    g <- newPureMT
    print (sums g)

A vector is generated using the Mersenne Twister, containing 10^7 random 64 bit integers. We then sum those values. GHC fuses this into a single loop where accumulation and random generation are interleaved. That means it is very simple code. Just a loop, a call to the mersenne twister, and an accumulator. There shouldn’t be that much to improve on.

I launch Acovea on it, as before, and 4 hours later we have:

Optimistic options:
                        -optlo-mem2reg  (1.914)
                       -optlo-scalarrepl  (2.413)
Pessimistic options:
                               -optlo-O1  (-2.577)
                               -optlo-O2  (-2.64)
                               -optlo-O3  (-2.515)

Very interesting. the default optimization flags reduce running times, according to Acovea. Let’s check that:

$ ghc --make -Odph -fllvm -optlo-O3 ex2.hs

$ time ./ex2
4460916339009010646
./ex2  0.31s user 0.00s system 90% cpu 0.347 total

And with the “common options” that were suggested:

$ ghc--make -Odph -O2 -funbox-strict-fields -fllvm -optlo-memdep
-optlo-abcd -optlo-loop-unroll -optlo-mem2reg -optlo-scalarrepl 

$ time ./ex2
4322192527286846546
./ex2  0.28s user 0.01s system 97% cpu 0.302 total

So it found a 13% improvement over the default llvm flags.

Sum of Squares

Now, something relatively serious, the sum of squares of a random vector. First, in Haskell.

-- compute the sum of squares of a vector.
--
sumsq g = U.sum (U.map (\x -> x * x) vector)
    where
        vector = R.randoms g 10000000 :: U.Vector Int

GHC then fuses the three loops here (generation, mapping and summing), into a single loop with no allocations, which is then ready for low level optimizations.

After crunching away for 4 hours, Acovea recommends:

Optimistic options:
                      -optlo-loop-reduce  (1.59)
                       -optlo-scalarrepl  (2.079)

Pessimistic options:
                               -optlo-O1  (-2.591)
                               -optlo-O2  (-2.537)
                               -optlo-O3  (-2.591)
                               -optlc-O0  (-2.591)
                               -optlc-O1  (-1.722)

With regular flags, we get:

$ ghc -v0 --make -fforce-recomp -O2 -funbox-strict-fields
-fllvm -optlo-O3 -optlc-O3 ex3.hs

$ time ./ex3
8241655411004882824       

ex3  0.30s user 0.01s system 96% cpu 0.315 total

And indeed, it’s “best of the best”, we get faster code:

$ ghc -v0 --make -fforce-recomp -Odph -O2 -funbox-strict-fields
-fllvm -optlo-disable-inlining -optlo-basicaa -optlo-basiccg
-optlo-count-aa -optlo-domfrontier -optlo-domtree
-optlo-globalsmodref-aa -optlo-memdep -optlo-no-aa
-optlo-postdomtree -optlo-codegenprepare -optlo-abcd
-optlo-functionattrs -optlo-block-placement -optlo-constmerge
-optlo-constprop -optlo-die -optlo-dse -optlo-globaldce
-optlo-globalopt -optlo-indvars -optlo-inline -optlo-ipconstprop
-optlo-ipsccp -optlo-lcssa -optlo-loop-deletion -optlo-loop-index-split
-optlo-loop-unroll -optlo-loop-unswitch -optlo-loopsimplify
-optlo-mem2reg -optlo-memcpyopt -optlo-scalarrepl
-optlo-tailcallelim ex3.hs

$ time ex3
4829650279590649126
ex3  0.29s user 0.01s system 100% cpu 0.296 total

Finding, again, 10% or so.

And though this is a little more complex (3 arguments in the loop, instead of two), GHC turns this:

sumsq g = U.sum (U.map (\x -> x * x) vector)
    where
        vector = U.enumFromN 1 10000000 :: U.Vector Int

with the same set of flags as the first version, into something that goes from running 228 ms with default LLVM flags to 14 ms with the flags found by Acovea in the first example (which seem to work really well for enumerations!).

Summary

The LLVM optimization layer for GHC adds a huge amount of mostly untapped potential. The optimization space is huge, and using a GA or similar approach to custom-optimize particular programs can be useful for finding common flags, or special purpose flags for niche applications (like scientific simulations). One downside is that it can take quite a while to find a solution. We could speed things up in at least two ways: Additionally, we don’t yet have a way to generate multiple passes (maybe we should run -optlo-$foo 10 times!). GHC just keeps running its optimizer until the code stops changing, maybe we need a similar approach with LLVM?

I also speculate that  the large loop bodies generated through significant fusion should give LLVM some good straight line code to work on.

My next step is to wrap up this approach into a simple Haskell library, modelled on Criterion, such than Haskeller can easily try this:

import Evolution

main = evolve main'

which will take ‘main’, and start evolving the flags used to compile it, against the current GHC/LLVM spec. It will use criterion to compute the fitness in a robust way (so you don’t need to worry about short or long running times), and at the end will generate a graph of the evolution, and a recommended set of flags to use in future. That’ll be useful to a few people.

We could then apply this tool to all of nofib to compute a general set of useful LLVM flags to use for everyone. Finally, there’s still much scope to improve the bytecode going into LLVM, so that more of the analysis phases will fire. Any Summer of Code students interested?

LLVM also supports plugins loaded on the command line. I’d love to be able to write new passes in Haskell, and load them into LLVM on the fly. This might really make it easy to try out some new FP-specific passes on the SSA-alike representation. Another good Summer of Code project…

The LLVM optimizing backend is fun!

Evaluation strategies and synchronization: things to watch for

A talk presented at the weekly Galois Developer Symposium.

The source of the talk and examples are all online.

This talk shows two areas relating to evaluation strategies in Haskell, and how they can subtly interact with threads, synchronization and performance. In addition, we briefly look at the interaction with asynchronous exceptions.

In particular,

  • Be careful with lazy values in synchronization variables
  • Consider using the strict-concurrency package for synchronization
  • All modify* functions should be very very carefully used
  • Interactions between async exceptions (e.g. block/killThread) and threads need expert eyes.

Fusion makes functional programming fun!

It’s deeply satisfying when the stream fusion optimization kicks in, in GHC Haskell. It is a powerful illustration of how the absence of side effects makes the compiler’s job so much easier, and lets it be far more aggressive.

Here’s a little example to show the real expressive power that we get when we have representation-changing optimizations like fusion to rely on. A simple example showing how good GHC is doing at this kind of stuff, these days: add a sequence of numbers. I’ve kept it super minimal so you can mentally compute each transformation to see GHC at work. (Of course there are lots of other ways to write trivial computations like this, but anything more substantial will make the assembly much harder to follow. So for teaching purposes, we’ll keep the source problem simple. Now, you can also use any of the 100+ other functions that fuse to construct all sorts of interesting programs with no closed form solution. Get at it!).

We begin with a pipeline of multiple recursive functions, brought together with function composition. As high level as it gets, and any junior Haskell program should be able to write stuff like this:

import qualified Data.Vector as U

    main = print . U.sum $ U.enumFromTo 1 (100000000 :: Int)

Now, semantically, there is an intermediate 100 million element array being created by enumFromTo. Try doing that in your favorite strict language. However…

… fusion kicks in, and GHC turns our program, via the source-to-source optimization phase, into a single tail recursive function, using unboxed values only .. with no intermediate array! No heap allocaiton, no stack, only one loop. A different algorithm!

    loop :: Int# -> Int# -> Int#
    loop x y = case y <= 100000000 of
          False -> x
          True  -> loop (x + y) (y + 1)

The compiler reordered our computations pretty wildly, and it can do this because there are no side effects by default — making large scale transformations like this far more feasible. I used ghc 6.13 with the llvm patches, and ghc -Odph -fllvm –make A.hs

The programmer didn’t have to think about recursion at all in the original program, let alone manual unboxing. Thanks GHC!

And now this tail recursive function gets compiled into an imperative loop via the codeGen phase. First, with the native code backend to GHC:

loop:
    .Lc216:
       cmpq $10000000,%rsi
       jle .Lc219
       movq %r14,%rbx
       movq (%rbp),%rax
       jmp *(%rax)
    .Lc219:
       addq %rsi,%r14
       incq %rsi
       jmp loop

And again, with the LLVM backend to GHC:

loop:
    leaq    1(%rsi), %rax
    addq    %rsi, %r14
    cmpq    $10000001, %rax
    jge     .LBB1_5
    addq    $2, %rsi
    addq    %rax, %r14
test:                                # %tailrecurse
    cmpq    $10000001, %rsi
    jl      loop

And that’s it. Our kernel code is as low level and un-functional as you get, but our input program was the most classic of high level approaches: a pipeline of functions held together with composition, with intermediate data structures all over the place.

And it performs!

With -fasm: mean: 92.9 ms:

With -fllvm: mean: 79.3ms (15% improvement with -fllvm):

Maybe our compiler is — sometimes — sufficiently smart.

To me, the fusion optimization is the epitome of what functional programming is about. I can write code at a very high level, throwing functions and composition around like candy, and the compiler will make stunning, complexity-changing transformations on my code, so it just runs damn fast. Can your language — functional or otherwise — do this?

Modern Benchmarking in Haskell

Thanks to work by Bryan O’Sullivan, there has been a rennaissance in performance benchmarking tools for Haskell, built upon Criterion.

Compared to most other benchmarking frameworks (for any programming language, not just Haskell), criterion focuses on being easy to use, informative, and robust.

The Galois Tech Talk of Feb 23rd presented this work. You can read the slides online, or find the source and examples here.

Criterion uses statistically robust mechanisms for sampling and computing sound microbenchmark results, and is more stable in the presence of noise on the system than naive timings.

Criterion has in turn spawned some extensions:

  • Progression: compare different criterion graphs
  • NoSlow: a new array benchmark suite based on Criterion

In this talk I will present these tools, how to use them, and how to make your performance benchmarks in Haskell, or languages Haskell can talk to, more reliable. In addition, we’ll explore benchmarks using the new vector package, and GHC’s llvm backend.

Follow

Get every new post delivered to your Inbox.

Join 59 other followers