Modern Benchmarking in Haskell

Thanks to work by Bryan O’Sullivan, there has been a rennaissance in performance benchmarking tools for Haskell, built upon Criterion.

Compared to most other benchmarking frameworks (for any programming language, not just Haskell), criterion focuses on being easy to use, informative, and robust.

The Galois Tech Talk of Feb 23rd presented this work. You can read the slides online, or find the source and examples here.

Criterion uses statistically robust mechanisms for sampling and computing sound microbenchmark results, and is more stable in the presence of noise on the system than naive timings.

Criterion has in turn spawned some extensions:

  • Progression: compare different criterion graphs
  • NoSlow: a new array benchmark suite based on Criterion

In this talk I will present these tools, how to use them, and how to make your performance benchmarks in Haskell, or languages Haskell can talk to, more reliable. In addition, we’ll explore benchmarks using the new vector package, and GHC’s llvm backend.

Smoking fast Haskell code using GHC’s new LLVM codegen


In this post we’ll play with GHC’s new LLVM code generator backend, and see how much faster some Haskell programs are when compiled with LLVM instead of GCC.

For the kind of loops we get from stream fusion, the -fllvm backend produced a lot better code, up to 3x faster in some cases. There are pretty graphs, and some smoking hot new technology.

Overview

This week David Terei announced that his work on an LLVM code generator backend for the Glasgow Haskell Compiler was ready to try out. Initial reports from his undergraduate thesis held that the LLVM code generator was competitive with the current GHC native code generator, a bit slower than the C backend in general (which uses GCC for code generation), but, tantalisingly, should produce big speedups for particular Haskell programs. In particular, tight loops of the kind generated by the bytestring, vector, data parallel arrays or text libraries. David reported speedups of 25% over the previous best performance we’d got from GHC for data parallel code.

I was very keen to try it out on the vector library — a fast, fusible numerical arrays package (similar to NumPy), which generates some very tight loops. Under the C backend, GCC has been failing to spot that the code GHC generates were actually loops, and this lead to GCC optimizing the generated code pretty badly. The native code generator does ok, but doesn’t have a lot of the clever low-level optimizations we need for really good bare metal performance.

So how would the new LLVM backend do?

Setting up

To try out the LLVM backend I followed the instructions on the wiki.

  • Check out GHC HEAD from darcs.
  • Apply the LLVM patch.
  • Check out LLVM from svn
  • Apply the GHC patch
  • Build your GHC.

This worked out of the box, and I now have a GHC 6.13 with the -fllvm flag.

$ ghc --info
 [("Project name","The Glorious Glasgow Haskell Compilation System")
 ,("Project version","6.13.20100221")
 ,("Booter version","6.12.1")
 ,("Stage","2")
 ,("Have interpreter","YES")
 ,("Object splitting","YES")
 ,("Have native code generator","YES")
 ,("Have llvm code generator","YES")
 ,("Support SMP","YES")
 ,("Unregisterised","NO")
 ,("Tables next to code","NO")
 ,("Win32 DLLs","")
 ,("RTS ways","l debug  thr thr_debug thr_l  ")
 ,("Leading underscore","NO")
 ,("Debug on","False")
 ,("LibDir","/home/dons/lib/ghc-6.13.20100221")
 ]

Running on a dual core Core 2 laptop:

$ uname -msr
 Linux 2.6.32-ARCH x86_64

You can then install packages as normal, via cabal, and add the -fllvm flag to see GHC build things via the new backend:

$ cabal install primitive --ghc-options=-fllvm

The packages I’m interested in are:

And some helper code in:

I also modifed the ghc-core tool to support showing the LLVM generated assembly.

Warm up lap

Let’s check the backend is working (remember to add the -fllvm flag):

$ ghc -O2 --make A.hs -fllvm -fforce-recomp
[1 of 1] Compiling Main             ( A.hs, A.o )
Linking A ...
$ time ./A
"hey"
./A  0.00s user 0.00s system 61% cpu 0.005 total

Good! The LLVM backend is generating working code for x86_64/Linux. Now, something more ambitious … a program from the shootout.

A shootout program

So let’s find some code that’s already been optimized. I’l compile the pidgits shootout benchmarks (where Haskell’s already the fastest entry).

First, with the native code gen:

$ ghc -O2 -fasm A.hs –make -fforce-recomp

$ time ./A 10000 > /dev/null
./A 10000 > /dev/null 3.19s user 0.03s system 91% cpu 3.509 total

With the old GCC backend:

$ ghc -O2 -fvia-C -optc-O3 A.hs –make -fforce-recomp

$ time ./A 10000 > /dev/null
./A 10000 > /dev/null 2.89s user 0.03s system 97% cpu 2.988 total

And with the -fllvm backend:

$ ghc -O2 -fllvm A.hs –make -fforce-recomp

$ time ./A 10000 > /dev/null
./A 10000 > /dev/null 2.86s user 0.02s system 98% cpu 2.936 total

Woo. It runs, and we get a speedup! Now for some serious business.

The Vector Package

Vector is a Haskell library for working with arrays. It provides several array types (boxed, unboxed, C), with a rich interface similar to the lists library, and some functions reminiscent of Data Parallel Haskell. There’s a tutorial on how to use it.

The interface is built entirely around stream fusion combinators — a general form of classic loop fusion made possible by purity. When you do multiple passes over the data (e.g. sum/map/fold/filter/…) the compiler will common up the loops, and discard intermediate arrays, making the code potentially very fast.

The loops that are generated tend to be very register heavy, do no heap allocation, and benefit from clever imperative loop optimizations. Unfortunately, the GCC backend to GHC doesn’t spot that these are actually loops, so doesn’t get to fire many optimizations.

The promise of the LLVM backend is that it will recognize the loops GHC generates from fused code. Let’s see how it performs.

To benchmark these programs, I’ll use the criterion and progression benchmarking libraries. (I had to build the darcs version of gtk2hs, and compiler data accessor-template with the -ftemplate_2_4 flag)

Simple loops

To start off, let’s generate 1 billion ints, sum them, print the result. That should tell us if our loops are efficient:

import qualified Data.Vector as U
main = print . U.sum $ U.enumFromTo 1 (1000000000 :: Int)

There are two loops in this program. enumFromTo and sum.

The core

GHC compiles these two loops into a single loop, when compiled with -O2 or -Odph:

loop  :: Int# -> Int# -> Int#
loop x y =
     case <=# y 1000000000 of
         False -> x
         True  ->  loop (x +# y) (y +# 1)

This is perfect. We write “sum (enumFromTo 1 n)” and we get a non-allocating loop.

The native backend

GHC 6.13 with the native code generator generates the following assembly for the inner loop:

Main_mainzuzdszdwfoldlMzqzuloop_entry:
 .Lc21u:
 cmpq $1000000000,%rsi
 jle .Lc21x
 movq %r14,%rbx
 movq (%rbp),%rax
 jmp *(%rax)
 .Lc21x:
 addq %rsi,%r14
 incq %rsi
 jmp Main_mainzuzdszdwfoldlMzqzuloop_entry

which runs in:

$ time ./enum
 500000000500000000
 ./enum  1.00s user 0.00s system 99% cpu 1.008 total

The  C backend

GHC 6.12.1 with the C backend, (-fvia-C -optc-O3) (I’m having trouble linking programs with the C backend and GHC 6.13), yields a pretty small loop:

Main_mainzuzdszdwfoldlMzqzuloop_info:
 cmpq    $1000000000, %r14
 movq    %r14, %rax
 jle     .L2
 movq    %rsi, %rbx
 jmp     *(%rbp)
 .L2:
 leaq    1(%r14), %r14
 addq    %rax, %rsi
 jmp     Main_mainzuzdszdwfoldlMzqzuloop_info

Which runs slower than the native code generator:

$ time ./enum
 500000000500000000
 ./enum  1.09s user 0.00s system 99% cpu 1.100 total

The LLVM backend

With -O2 -fllvm we get very different code, and it is a bit harder to work out what is going on. LLVM transforms the code far more aggressively.

.LBB1_2:
 leaq    1(%rsi), %rax
 addq    %rsi, %r14
 cmpq    $1000000001, %rax
 jge     .LBB1_5                 # loop exit
 addq    $2, %rsi
 addq    %rax, %r14
 .LBB1_1:                        # %tailrecurse
 cmpq    $1000000001, %rsi
 jl      .LBB1_2

And the proof is in the pudding:

$ time ./enum
 500000000500000000
 ./enum  0.48s user 0.01s system 99% cpu 0.488 total

This is the fastest Haskell we’ve ever generated for this little benchmark (at least without manual loop unrolling)!

The LLVM backend more than halved the running time for this simple loop. But remember: general benchmarks aren’t seeing these kind of speedups — LLVM is really excelling itself at the tight numeric code.

Here’s the data presented in a slightly different form, with criterion and progression. The numbers are slightly different, since we won’t inline the length of the vector argument, and we’re wrapping the code in benchmarking wrappers. I wasn’t able to get -fvia-C programs to link under the HEAD, so we’ll exclude those from graphs, but report them in text form.

With the -fasm backend:

With the LLVM backend:

Or side-by-side with the progression package:

The -fasm backend under the progression tool ran around ~1s for each billion ints, while -fllvm was around 0.8s. Note that we get slightly different timings with the loops under each benchmarking tool, due to how the benchmark program and wrapper are optimized.

Zips

Zips are another good candidate, since they turn into nested loops. So, e.g.

import qualified Data.Vector as U
import Data.Bits
main = print . U.sum . U.map (`shiftL` 1) $ U.zipWith (*)
                        (U.enumFromTo 1 (100000000 :: Int))
                        (U.replicate (100000000 :: Int) 42)

Which fuses to this set of loops:

loop  :: Int# -> Int# -> Int# -> Int#
loop =
  \ (sc_s29b :: Int#)
    (sc1_s29c :: Int#)
    (sc2_s29d :: Int#) ->
    case <=# sc1_s29c 100000000 of _ {       False -> sc_s29b;
      True ->
        case <=# sc2_s29d 0 of _ {           False ->
            loop
              (+#
                 sc_s29b (uncheckedIShiftL# (*# sc1_s29c 42) 1))
              (+# sc1_s29c 1)
              (-# sc2_s29d 1);
          True -> sc_s29b
        }
    }

Which, again, is perfect Core. All those functions combined into a single non-allocating loop.

-fasm:

Main_mainzuzdszdwfoldlMzqzuloop_entry:
.Lc2aC:
        cmpq $100000000,%rsi
        jle .Lc2aE
        movq %r14,%rbx
        movq (%rbp),%rax
        jmp *(%rax)
.Lc2aE:
        testq %rdi,%rdi
        jle .Lc2aH
        movq %rsi,%rax
        imulq $42,%rax
        shlq $1,%rax
        addq %rax,%r14
        incq %rsi
        decq %rdi
        jmp Main_mainzuzdszdwfoldlMzqzuloop_entry
.Lc2aH:
        movq %r14,%rbx
        movq (%rbp),%rax
        jmp *(%rax)

Which is reasonable:

$ time ./zipwith
420000004200000000
./zipwith 0.24s user 0.00s system 99% cpu 0.246 total

With the -fvia-C -optc-O3 backend, just the inner loop, since that’s easy to read:

Main_mainzuzdszdwfoldlMzqzuloop_info:
        cmpq    $100000000, %rsi
        jg      .L6
.L2:
        testq   %r14, %r14
        jle     .L6
        leaq    (%rsi,%rsi,4), %rcx
        leaq    -1(%r14), %r14
        leaq    (%rsi,%rcx,4), %rcx
        leaq    1(%rsi), %rsi
        leaq    (%rdi,%rcx,4), %rdi
        jmp     Main_mainzuzdszdwfoldlMzqzuloop_info

Which runs in about the same time as the -fasm backend:

$ time ./zipwith
420000004200000000
./zipwith  0.25s user 0.00s system 99% cpu 0.251 total

With -fllvm the code is wildly different, and I find it pretty hard to reconstruct what transformatoins LLVM has done.

Main_mainzuzdszdwfoldlMzqzuloop_entry:
# BB#0:                                 # %c2cf
        subq    $8, %rsp
        imulq   $84, %rsi, %rax
        jmp     .LBB1_1
.LBB1_3:                                # %n2cN
                                        #   in Loop: Header=BB1_1 Depth=1
        incq    %rsi
        decq    %rdi
        addq    %rax, %r14
        addq    $84, %rax
.LBB1_1:                                # %tailrecurse
                                        # =>This Inner Loop Header: Depth=1
        cmpq    $100000001, %rsi        # imm = 0x5F5E101
        jge     .LBB1_4
                                        #   in Loop: Header=BB1_1 Depth=1
        testq   %rdi, %rdi
        jg      .LBB1_3
.LBB1_4:                                # %n2ck
        movq    (%rbp), %rax
        movq    %r14, %rbx
        movq    (%rax), %r11
        addq    $8, %rsp
        jmpq    *%r11  # TAILCALL

The “inner loop” is interesting. Nothing like what -fasm or -fvia-C generate. And it’s way faster:

$ time ./zipwith
420000004200000000
./zipwith 0.15s user 0.00s system 99% cpu 0.154 total

So yeah, 40% faster!

Criterion

Here, under criterion (same code, but different values of n), With the -fasm backend, mean execution time 186ms:

With the -fllvm backend, 135 ms  (27% improvement):

zipwith3

Heavily nested zips are probably the best cases for LLVM, and we see the -fllvm backend do some pretty wild stuff with this:

import qualified Data.Vector.Unboxed as U import Data.Bits main = print . U.sum $ U.zipWith3 (\x y z -> x * y * z) (U.enumFromTo 1 (100000000 :: Int)) (U.enumFromTo 2 (100000001 :: Int)) (U.enumFromTo 7 (100000008 :: Int))

Which fuses to:

main_$s$wfoldlM'_loop [Occ=LoopBreaker]
  :: Int#     -> Int# -> Int# -> Int# -> Int#

main_$s$wfoldlM'_loop =
  \ (sc_s2jh :: Int#)
    (sc1_s2ji :: Int#)
    (sc2_s2jj :: Int#)
    (sc3_s2jk :: Int#) ->
    case  sc_s2jh;
      True ->
        case  sc_s2jh;
          True ->
            case  sc_s2jh;
              True ->
                main_$s$wfoldlM'_loop
                  (+#
                     sc_s2jh (*# (*# sc1_s2ji sc2_s2jj) sc3_s2jk))
                  (+# sc1_s2ji 1)
                  (+# sc2_s2jj 1)
                  (+# sc3_s2jk 1)
            }
        }
    }

Great core. With the -fasm backend:

Main_mainzuzdszdwfoldlMzqzuloop_entry:
.Lc2lq:
        cmpq $100000000,%rsi
        jle .Lc2ls
        movq %r14,%rbx
        movq (%rbp),%rax
        jmp *(%rax)
.Lc2ls:
        cmpq $100000001,%rdi
        jle .Lc2lu
        movq %r14,%rbx
        movq (%rbp),%rax
        jmp *(%rax)
.Lc2lu:
        cmpq $100000008,%r8
        jle .Lc2lx
        movq %r14,%rbx
        movq (%rbp),%rax
        jmp *(%rax)
.Lc2lx:
        movq %rdi,%rax
        imulq %r8,%rax
        movq %rsi,%rcx
        imulq %rax,%rcx
        addq %rcx,%r14
        incq %rsi
        incq %rdi
        incq %r8
        jmp Main_mainzuzdszdwfoldlMzqzuloop_entry

Straight forward, and running it:

$ time ./zipwith3
3541230156834269568
./zipwith3  0.47s user 0.01s system 98% cpu 0.484 total

With -fvia-C -optc-O3:

Main_mainzuzdszdwfoldlMzqzuloop_info:
        .text
        .p2align 4,,15
.text
        .align 8
        .type Main_mainzuzdszdwfoldlMzqzuloop_info, @function
# 38 "/tmp/ghc10013_0/ghc10013_0.hc" 1
# 0 "" 2
        cmpq    $100000000, %rdi
        jg      .L9
.L4:
        cmpq    $100000001, %rsi
        jg      .L9
.L5:
        cmpq    $100000008, %r14
        .p2align 4,,5
        jg      .L9
.L7:
        movq    %rsi, %r10
        leaq    1(%rsi), %rsi
        imulq   %rdi, %r10
        leaq    1(%rdi), %rdi
        imulq   %r14, %r10
        leaq    1(%r14), %r14
        leaq    (%r10,%r8), %r8
        jmp     Main_mainzuzdszdwfoldlMzqzuloop_info

And we get a faster result:

$ time ./zipwith3
3541230156834269568
./zipwith3  0.34s user 0.00s system 99% cpu 0.344 total

-fllvm, looks like some heavy loop unrolling:

Main_mainzuzdszdwfoldlMzqzuloop_entry:  # @Main_mainzuzdszdwfoldlMzqzuloop_entry
# BB#0:                                 # %c2oz
        subq    $56, %rsp
        cmpq    $100000002, %rdi        # imm = 0x5F5E102
        movl    $100000002, %eax        # imm = 0x5F5E102
        movq    $-2, %rdx
        movq    %r9, 40(%rsp)           # 8-byte Spill
        movq    %r15, 48(%rsp)          # 8-byte Spill
        movq    $-3, %r9
        movq    %r12, 32(%rsp)          # 8-byte Spill
        movq    %r8, %rbx
        movq    %r13, 24(%rsp)          # 8-byte Spill
        movq    %r14, 16(%rsp)          # 8-byte Spill
        leaq    1(%rdi), %r13
        cmovgq  %rdi, %rax
        negq    %rax
        leaq    -1(%rdi,%rax), %rcx
        cmpq    $100000009, %r8         # imm = 0x5F5E109
        movl    $100000009, %eax        # imm = 0x5F5E109
        cmovgq  %r8, %rax
        negq    %rax
        leaq    -1(%r8,%rax), %rax
        cmpq    %rcx, %rax
        cmovaq  %rax, %rcx
        cmpq    $100000001, %rsi        # imm = 0x5F5E101
        movl    $100000001, %eax        # imm = 0x5F5E101
        cmovgq  %rsi, %rax
        negq    %rax
        leaq    -1(%rsi,%rax), %rax
        cmpq    %rax, %rcx
        cmovbeq %rax, %rcx
        imulq   %rdi, %rbx
        imulq   %rsi, %r13
        movq    %rcx, %r10
        subq    %rcx, %rdx
        subq    %rcx, %r9
        imulq   %rsi, %rbx
        addq    %rdi, %r13
        notq    %r10
        movq    %r10, %rax
        imulq   %r10, %rbx
        mulq    %rdx
        addq    16(%rsp), %rbx          # 8-byte Folded Reload
        movq    %rax, %r11
        movq    %rdx, %r15
        movq    %r15, %r12
        movq    %r11, %rax
        andq    $1, %r15
        imulq   %r9, %r12
        mulq    %r9
        shldq   $63, %r11, %r15
        leaq    (%r8,%rdi), %r9
        addq    %rdx, %r12
        movq    $-4, %rdx
        addq    %rsi, %r9
        subq    %rcx, %rdx
        movq    %r12, %r14
        andq    $1, %r12
        leaq    6(%r9,%r9), %r10
        movabsq $6148914691236517205, %r9 # imm = 0x5555555555555555
        movq    %rdx, 8(%rsp)           # 8-byte Spill
        imulq   %rdx, %r14
        leaq    1(%rdi,%rsi), %rdx
        shldq   $63, %rax, %r12
        imulq   %r8, %rdx
        imulq   %r12, %r10
        leaq    1(%rdx,%r13), %rdx
        imulq   %r10, %r9
        imulq   %r15, %rdx
        addq    %rdx, %rbx
        mulq    8(%rsp)                 # 8-byte Folded Reload
        subq    %r9, %rbx
        movq    %r8, %r9
        decq    %r8
        subq    %rcx, %r9
        addq    %rdx, %r14
        movq    %rdi, %rdx
        decq    %r9
        shldq   $62, %rax, %r14
        movq    %rsi, %rax
        subq    %rcx, %rdx
        andq    $-2, %r14
        subq    %rcx, %rax
        decq    %rdx
        addq    %rbx, %r14
        decq    %rax
        .align 16
.LBB2_1:                                # %tailrecurse
                                        # =>This Inner Loop Header: Depth=1
        cmpq    $100000001, %rsi        # imm = 0x5F5E101
        jge     .LBB2_4
# BB#2:                                 # %c2oD
                                        #   in Loop: Header=BB2_1 Depth=1
        cmpq    $100000002, %rdi        # imm = 0x5F5E102
        jge     .LBB2_4
# BB#3:                                 # %c2p5
                                        #   in Loop: Header=BB2_1 Depth=1
        incq    %rsi
        incq    %rdi
        incq    %r8
        cmpq    $100000009, %r8         # imm = 0x5F5E109
        jl      .LBB2_1
.LBB2_4:                                # %n2oE
        movq    (%rbp), %rcx
        movq    %r9, %r8
        movq    24(%rsp), %r13          # 8-byte Reload
        movq    32(%rsp), %r12          # 8-byte Reload
        movq    %r14, %rbx
        movq    %rax, %rsi
        movq    %rdx, %rdi
        movq    40(%rsp), %r9           # 8-byte Reload
        movq    48(%rsp), %r15          # 8-byte Reload
        movq    (%rcx), %r11
        addq    $56, %rsp
        jmpq    *%r11  # TAILCALL

And blows them all out of the water! 3x faster than -fasm! Twice as fast as -fvia-C -optc-O3.

$ time ./zipwith3
3541230156834269568
./zipwith3  0.16s user 0.00s system 99% cpu 0.158 total

From the Statistics package

The statistics package has some more “realistic” microbenchmarks. Let’s look at those. First, computing the mean of a large array of doubles (here all set to ‘pi’).

main = print (mean (V.replicate 1000000000 (pi :: Double)))

With the -fasm backend:

Main_mainzuzdszdwfoldlMzuloop_entry:
.Lc2b2:
        testq %rsi,%rsi
        jle .Lc2b5
        cvtsi2sdq %r14,%xmm0
        movsd .Ln2b8(%rip),%xmm7
        subsd %xmm5,%xmm7
        divsd %xmm0,%xmm7
        addsd %xmm7,%xmm5
        incq %r14
        decq %rsi
        jmp Main_mainzuzdszdwfoldlMzuloop_entry

Simple, easy.

$ time ./mean
3.141592653589793
./mean  5.58s user 0.01s system 99% cpu 5.599 total

With the -fllvm backend:

Main_mainzuzdszdwfoldlMzuloop_entry:    # @Main_mainzuzdszdwfoldlMzuloop_entry
# BB#0:                                 # %c28E
        subq    $8, %rsp
        movsd   .LCPI3_0(%rip), %xmm0
        jmp     .LBB3_1
        .align 16
.LBB3_3:                                # %n28K.i
                                        #   in Loop: Header=BB3_1 Depth=1
        movapd  %xmm0, %xmm5
        cvtsi2sdq       %rcx, %xmm8
        addq    $-2, %rsi
        addq    $2, %r14
        subsd   %xmm7, %xmm5
        divsd   %xmm8, %xmm5
        addsd   %xmm7, %xmm5
.LBB3_1:                                # %tailrecurse
                                        # =>This Inner Loop Header: Depth=1
        testq   %rsi, %rsi
        jle     .LBB3_5
# BB#2:                                 # %n28K
                                        #   in Loop: Header=BB3_1 Depth=1
        movapd  %xmm0, %xmm7
        cvtsi2sdq       %r14, %xmm8
        leaq    -1(%rsi), %rax
        leaq    1(%r14), %rcx
        subsd   %xmm5, %xmm7
        testq   %rax, %rax
        divsd   %xmm8, %xmm7
        addsd   %xmm5, %xmm7
        jg      .LBB3_3
# BB#4:                                 # %c28J.i
        movq    (%rbp), %rdx
        movq    %rcx, %rbx
        movq    %rcx, %r14
        movq    %rax, %rsi
        movapd  %xmm7, %xmm5
        movq    (%rdx), %r11
        addq    $8, %rsp
        jmpq    *%r11  # TAILCALL
.LBB3_5:                                # %c28J
        movq    (%rbp), %rax
        movq    %r14, %rbx
        movq    (%rax), %r11
        addq    $8, %rsp
        jmpq    *%r11  # TAILCALL

And running it:

$ time ./mean
3.141592653589793
./mean  5.55s user 0.01s system 99% cpu 5.585 total

Some pretty wacky code, but a little faster.

Conclusions

The LLVM backend seems to be holding up to what we hoped: it does a better (some times much better) job on tight loops. We get better code than GHC has ever produced before. It seems pretty robust, so far everything I’ve tried has worked.

David’s benchmarks indicate that with the current — first attempt — at an LLVM backend most large programs aren’t noticeably faster, but I think the promise we see in these small examples justifies spending more time working on the LLVM backend to GHC. It has much more potential than the GCC backend.

Currently we’re not experimenting with the LLVM optimization layer at all — I think there’s likely to be a lot of win just tweaking those settings (and exposing them to the Haskell programmer via GHC flags).

Migrating from uvector to vector

Roman Leschinkskiy has released the 0.5 version of vector, the future standard non-parallel arrays library forGHC Haskell. This post covers some of the differences between it and uvector, and what to watch for when migrating code to use vector.

The summary is — as of Feb 15, 2010 — you can move to vector now.  In almost all cases you will get identical performance to uvector, but with a nicer interface. There are currently a few small gaps in the API, and a couple of performance tweaks are needed to particular functions, but they should not affect most people (and likely will be fixed in coming days). Note that you should use the -Odph optimization flag for the most reliable results.

Background

vector is one of the results of the multi-year Data Parallel Haskell project, to develop high performance parallel bulk array processing for Haskell, allowing us to do very fast arrays (that is, transparently multicore parallel array processing, outperforming C or C++ by using cores more efficiently.)

While this project concentrates on data parallelism, it has also lead to new approaches to flat, sequential arrays for Haskell. The code has been spun-off in two different packages, which replace the fifteen year old array package with faster, more flexible types and functions:

These two libraries share a common origin, but have different engineering goals. Both libraries make heavy use of loop fusion based on streams to achieve excellent performance. (You can read more about that in a separate post).

uvector is a conservative attempt to develop fast, unboxed arrays with a flexible interface based on fusion, to replace Data.Array in the short term, while vector was immature. uvector has been in active service for about two years now, filling a gap while we waited for vector to mature. uvector has several users now, including haskell-criterion haskell-histogram-fill haskell-hnn haskell-monte-carlo haskell-mwc-random haskell-pqueue-mtlhaskell-safe-freeze haskell-statistics haskell-statistics-fusion haskell-uvector-algorithms. These packages in the medium term should consider moving to vector.

The vector library is far more ambitious, aiming to be the standard array library for high performance problems in all circumstances. It cleanly supports:

  • mutable arrays
  • immutable arrays
  • boxed
  • unboxed
  • storable types

As for uvector, unboxed representations are specialized at compile time via type families, and fusion is used throughout the interface. Unlike uvector, vector supports boxed arrays, and provides inplace fusion of mutable array operations.

If you need transparently parallel arrays, you should consider the dph package, distributed with GHC.

Current status

uvector is stable, and has gone into maintainance mode only. If you like it, you can safely continue to use it for  the foreseeable future, though any performance improvements in the fusion or array types developed in vector will not be backported to uvector.

vector, as of 0.5, has been declared “beta”. You can begin migrating code to use it.

Migrating your code

I’ve just finished porting the uvector micro benchmark suite to vector, and have the following notes on how to migrate your code to use unboxed arrays in the vector package.

Namespaces

The old uvector namespace was Data.Array.Vector with U suffix appended to names. That goes away, and instead you should:

import qualified Data.Vector as U

Most function names are identical, so we have in vector the obvious counterparts to uvector. All these are basically unchanged:

U.length U.null U.empty U.cons U.snoc U.replicate U.head
U.last U.init U.tail U.take U.drop U.map U.filter U.elem U.notElem
U.product U.sum U.maximum U.minimum U.foldl1 U.foldl U.dropWhile U.break
U.find U.all U.any U.and U.or U.maximumBy U.minimumBy
U.enumFromTo

Some function names have changed:

U.++ replaces appendU
U.! replaces indexU

Some functions are missing:

lookupU repeatU

There are many new functions on vectors, in particular, on mutable arrays, and for bulk operations (backpermute, reverse, accum).

Better types

Notably, vector supports boxed types, so you can more easily store Haskell values in fusible arrrays (so you can have, e.g. Integer arrays).

Performance Wibbles

I found only a few differences in performance compared to uvector, and have notified Roman. These shouldn’t affect many users currently, and will likely disappear in coming days.

First, compile with -Odph instead of -O2, this fixes some optimization issues with zips, and probably other things.

Functions to watch out for:

  • zip, zipwith, zipwith3 — uvector was a lot faster (10x) in simple programs — however, moving to -Odph fixes zips entirely.
  • Different fusion results for ’empty’ (kind of a corner case)
  • eq seems to be about twice as slow. Unsure why.
  • Bools don’t seem to be bit packed? At least, Bool unboxed arrays seem a bit slower than in uvector.
  • U.last appears to be optimized differently, though doesn’t affect performance.
  • Pipelines ending in ‘null’ (another corner case) are fused differently (slightly worse performance).

And that’s about it.

As of this post, I’m officially declaring uvector to be in maintainance-only mode, and will be working to improve vector.

Playing with the new Haskell epoll event library

Bryan and Johan have been working hard on replacing the GHC runtime’s concurrency mechanism based on select, with a new one based on epoll. This should improve scalability of Haskell server code in a number of ways — more connections sec, more concurrent connections, and so on.

You can read about their work in a series of posts:

And you can find the current code for haskell-event on github: http://github.com/tibbe/event

After reading this post, on how epoll-based user land callbacks can outperform general purpose preemptive threads for non-compute bound IO work, I wanted to see what improvements we can hope for from an epoll-based version in Haskell.

The full code, which shows how to set up simple socket event notification using the event lib, is on the Haskell wiki, and will be useful for getting a flavor of the event lib use.

The results:

So while a simple forkIO/accept version, doing Handle and String IO would peak out around 7k req/sec, the epoll based version reached over 15k req/sec.

So that’s useful: good epoll-based event code for Haskell. The next step will be to see the redesign of Control.Concurrent based on epoll being merged into GHC.

Multicore Haskell Now! ACM Reflections | Projections 2009

Here are the

on approaches to multicore programming in Haskell, presented at the ACM Reflections | Projections conference at the University of Illinois.

Abstract

Multicore computers are here: is your programming language ready for it? Haskell is: you can take an off-the-shelf copy of GHC and write high performance parallel programs right now.

If you want to program a parallel machine, a purely functional language such as Haskell is a good choice: purity ensures the language is by-default safe for parallel execution, (whilst traditional imperative languages are by-default unsafe). This foundation has enabled Haskell to become something of a melting pot for high level approaches to concurrent and parallel programming, all available with an industrial strength compiler and language toolchain, available now for mainstream multicore programming.

This talk will introduce the features Haskell provides for writing high level parallel and concurrent programs. In particular we’ll focus on lightweight semi-explicit parallelism, software transactional memory, and nested data parallelism, so you can go to work writing multicore programs in Haskell.

I’d like to thank Michael Ilseman, Sameer Sundresh and Jeff Green for their hospitality during my visit.

LACSS 2009: Domain Specific Languages and Haskell

Here are the

on the use of Haskell for constructing EDSLs for high performance computing, along with a 10 minute overview of Haskell support for multicore programming, presented at the LACSS 2009 Workshop on Non-Traditional Programming Models.

Abstract

As the complexity of large-scale computing architecture increases, the effort needed to program these machines efficiently has grown dramatically. The challenge is how to bridge this “programmability gap”, making the hardware more accessible to domain experts. We argue for an approach based on
executable embedded domain specific languages (EDSLs)—small languages with focused expressive power hosted directly in existing high-level programming languages such as Haskell. We provide examples of EDSLs in use in industry today, and describe the advantages EDSLs have over general purpose languages in productivity, performance, correctness and cost.

Self-optimizing data structures: using types to make lists faster

There are lots of optimization opportunities in a statically typed, purely functional language. This post describes an approach that improves the performance of off-the-shelf Haskell list functions by about 25% on average. The code is available as a drop-in replacement for Data.List.inkscape

This is part of an ongoing series on high-performance Haskell programming. Related posts include stream fusion for lists, programming the inliner, Judy arrays and Haskell, making binary parsing faster and applying stream fusion to arrays.

Background

Statically typed, purely functional code is a sweet spot for optimization hackers. All that information available at compile time lets us write some pretty radical transformations, and know they are safe. Stream fusion, for example, relies on having type and effect information available statically, and uses that to make complexity-changing rewrites of your code. Clever algebraic transformations such as call-pattern specialization can make some pretty aggressive performance improvements. We can also make the runtime more efficient: by introducing parallelism, or again, exploiting type and effect information via pointer tagging. All this is easier in a strongly, statically typed and pure setting — there’s just less stuff you have to worry about when doing the transformations.

This post introduces an optimization technique based on types: self-optimizing polymorphic data types. When instantiated to a particular type, the code presented will generate both a custom, specialized data structure, and specialized functions over that type. The result is a general mechanism to improve the data density, and thus performance, of parametrically polymorphic data structures (where the same code operates generically on any type), such as lists, trees and finite maps. The general concept should be useful in any language with parametrically polymorphic types, and its implementation in Haskell, via class-associated data types is particularly elegant (requiring no compiler modification).

Polymorphism

Parametric polymorphism is a great programmer tool. It improves expressivity, by letting the programmer use the same code at any possible type, yet still retains type safety. For example, the list type is a polymorphic type, and the length function computes the length of any lists, no matter what the element type is.

-- The list type: either empty, or cons, holds any type 'a'
data [] a = [] | a : [a]

Prelude> :t length
length :: [a] -> Int

The type system ensures that we can never write length to depend on a property of the element type, ensuring the code is maximally generic. It is simply not possible to mess with the elements of the list, given the type. Any attempt to do so will be a compiler error. This gives us some strong guarantees of abstraction, which we will take advantage of for optimization purposes.

Such polymorphic types and functions are flexible, but we pay a cost for this. In order for the code to work on any type ‘a’, we must ensure that all values have a uniform representation. In most languages, this is via an object or closure – some reference to heap allocated data – and Haskell is no exception. All polymorphic components will be stored as pointers to heap allocated thunks.

You can see this tradeoff most vividly in this picture of a list of pairs:

The uniform representation means that all values are represented via pointers to values, so the same code (e.g. map, filter) will work on any element type. But look .. we have cons nodes with pointers to pair nodes, which in turn point to integer nodes. (Also note the explicit sharing that is visible – the integer 1 and 2 are shared between multiple uses – made possible by purity).

Such a structure is generic, but has poor data density. If the compiler could see that the integer nodes hold word-sized values, it could unpack them into their constructors. GHC can already unpack monomorphic types, removing all indirection and ‘thunkiness’ associated with the data, via a manual unpack pragma at the data declaration site:

data PairIntInt = Pair {-# UNPACK #-} Int {-# UNPACK #-} Int

The Int values will now just be stored directly as word-sized fields in the Pair heap object. GHC can also infer the location of these pragmas if you use strict fields, and enable the -funbox-strict-fields optimization:

data PairIntInt = Pair !Int !Int

However, this technique only works on monomorphic fields. It does not make sense for polymorphic fields, since they have variable size, and the compiler does not (except in very specific circumstances) have access to information about what types are in use.

To unpack, say, all elements of a list of pairs, we need to instead generate a custom type:

data ListPairs  = Empty
                | Cons !Int !Int ListPairs

This will happily represent each node with all elements stored locally, but is no longer polymorphic – so any functions have to be written specially for this type. The result however, is a much denser data representation:

Note that unpacking changes the evaluation strategy of the data. Those elements are now strict. The spine of the list is still lazy, however, so it is still just as useful a type for control structures and generators. GHC is remarkably flexible about the control the programmer has over the evaluation strategy (strict, lazy, mixture), so we’ll take advantage of that in this library to produce a spine-lazy, element strict list type that is able to do this unpacking and specialization for us. I encourage you to explore mixed strict and lazy structures as well.

(An aside: these graphs were generated directly from the live Haskell heap via vacuum – another great new tool for Haskell development. I’ve not written about this yet, but you can see a video of it in use here. You can trace any Haskell heap structure to a graph representation, saving it as svg or png, or inspecting it live from the REPL).

So, how do we teach GHC to do this transformation automatically, and how do we do that without sacrificing code reuse — I only want to write the list library once!

Class-associated data types

Our problem can be stated as:

  • when we use a data structure at a particular element type, we require both custom functions, and a custom data type, to be generated

This gives us an insight into how to produce a solution. Type classes in Haskell are precisely the mechanism for having per-type functions. And a recent major extension to type classes has been the addition of class-associated data types (in their most general form, as type families).

These two features together will allow us to specialize the container type at each element type, and generate specialized functions for each use, removing “the uniform-representation tax” and improving performance across the board.

Implementation

Where previously, the data type for lists is defined as:

data [] a = [] | a : [a]

we will instead associate the data type with a class, and leave the implementation to later: We also have to lift  the introduction and elimination forms for the type into the class (as we have no explicit constructors or pattern matching anymore):

class AdaptList a where
     data List a
     -- | The empty list
     empty   :: List a

     -- | Prepend a value onto a list
     cons    :: a -> List a -> List a

     -- | Is the list empty?
     null    :: List a -> Bool

     -- | The first element of the list
     head    :: List a -> a

     -- | The tail of the list
     tail    :: List a -> List a

Now we just need to write some functions over this abstract list type. First, conversion, to and from “real” lists:

-- | /O(n)/, convert an adaptive list to a regular list
toList :: AdaptList a => List a -> [a]
toList xs
 | null xs   = []
 | otherwise = head xs : toList (tail xs)

-- | /O(n)/, convert an adaptive list to a regular list
fromList :: AdaptList a => [a] -> List a
fromList []     = empty
fromList (x:xs) = x `cons` fromList xs

Note how we have to replace explicit pattern matching with calls to the elimination forms (head, tail, uncons), and constructors are replaced with calls into the type class methods (cons, empty). We will see that GHC is able to inline away all the type class dictionaries, resulting in specialized code when used at any specific type.

Some general functions on lists: map, fold, filter:

-- | /O(n)/. 'map' @f xs@ is the list obtained by applying @f@ to each element
-- of @xs@
--
map :: (AdaptList a, AdaptList b) => (a -> b) -> List a -> List b
map f as = go as
 where
 go xs
   | null xs   = empty
   | otherwise = f (head xs) `cons` go (tail xs)
{-# INLINE map #-}
-- | /O(n)/. 'foldl', applied to a binary operator, a starting value (typically
-- the left-identity of the operator), and a list, reduces the list
-- using the binary operator, from left to right
--
foldl :: AdaptList b => (a -> b -> a) -> a -> List b -> a
foldl f z0 xs0 = go z0 xs0
 where
 go !z xs
   | null xs   = z
   | otherwise = go (f z (head xs)) (tail xs)
{-# INLINE foldl #-}

-- | /O(n)/. 'filter', applied to a predicate and a list, returns the list of
-- those elements that satisfy the predicate
--
filter :: AdaptList a => (a -> Bool) -> List a -> List a
filter p xs0
 | null xs0  = empty
 | otherwise = go xs0
 where
  go xs
   | null xs     = empty
   | p x         = x `cons` go ys
   | otherwise   =          go ys
   where x  = head xs
         ys = tail xs
{-# INLINE filter #-}

These implementations are essentially identical to the Haskell98 Prelude definitions, but with all constructors replaced with type class methods, and pattern matching replaced with “smart deconstructors”. Also, all recursive functions are written in worker/wrapper style, so they may be inlined (crucial to enable the function to be specialized).

So now we have an interface to lists and functions on them, but no concrete list type defined. We will do this via a class-associated type, with one instance for each element type we care about. This is a generic strategy, and so can be automated via some SYB-style generic code generation tricks. The result is a set of instances enumerating all the different specialized data representations statically:

instance AdaptList Int where

 data List Int = EmptyInt | ConsInt {-# UNPACK #-}!Int (List Int)

 empty = EmptyInt
 cons  = ConsInt

 null EmptyInt = True
 null _ = False

 head EmptyInt = errorEmptyList "head"
 head (ConsInt x _) = x

 tail EmptyInt = errorEmptyList "tail"
 tail (ConsInt _ x) = x

And so on for all the other types. Now, when we write a function on say, List Int, GHC will replace all our code with the specialized data type, where the Int elements are unpacked. All calls to the smart constructors and deconstructors will be replaced with the specialized versions as well.

We should thus get denser, faster list types and functions.

You can see the full implementation here, as well as the same technique applied to sums and products, and with the full list API filled out. I’m calling these “adaptive” data structures, as they adapt to a different shape based on how you use them.

Let’s benchmark and see if things get any faster. As usual, I will use criterion to do the measurement and graphing. The best open source benchmarking library around – makes benchmarking fun, the way QuickCheck makes testing fun.

Benchmarks: map

In these benchmarks, I’m measuring the standard GHC 6.10 Data.List functions, and the base list data type, against the version on Hackage. In each case, I generate 100k random integers (using the mersenne-random generator), build a list from those numbers, and measure the time to traverse the structure using each pair of functions. Criterion takes care of running the functions hundreds of times until meaningful data can be extracted. The execution time probability graphs are presented here. Note that they’re on different scales each time.

So, the numbers. Specialized map is 34% faster. Mean goes from 31.9ms to 21.2ms. Garbage collection is reduced by 40% (which seems typical across the board for these functions.

Regular Haskell list type and functions:

data.list.map-densities-800x200

Specializing list map::

data.adaptive.list.map-densities-800x200

Looking at the core, we see that when used at an Int type, we get a specialized loop with unboxed parameters generated (no additional un- or re-boxing is required to store values back into the data structure!). This is a transformation GHC can’t do automatically, due to the uniform representation requirement.

Here’s the core for ‘adaptive sum’, taken from ghc-core,

$wgo :: Data.Adaptive.List.List Int -> Int# -> Int#
$wgo xs n = case xs of
    EmptyInt     -> n
    ConsInt y ys -> $wgo ys (+# n y)

Compared to regular lists, we see the additional unboxing of the list parameter that is required:

    $wlgo :: Int# -> [Int] -> Int#
    $wlgo n xs = case xs of
        []     -> n
        (y:ys) -> case y of
                     I# y# -> $wlgo (+# n y#) ys

Benchmarks: filter

Filter is 15% faster. Mean goes from 42.5ms to 36.2ms.

data.list.filter-densities-800x200Specialized version:

data.adaptive.list.filter-densities-800x200Benchmarks: foldl

Reducing a list is 23% faster (sum, length, maximum, product, minimum). Mean goes from 11.2ms to 8.7ms.

data.list.foldl'-densities-800x200and when specialized:

data.adaptive.list.foldl'-densities-800x200

Notes

One of the obivous questions is: what about fusion? Stream fusion is an entirely different automatic optimization that optimizes multiple traversals over a data structure into a single, more complicated traversal, dramatically improving time and space usage. The data structures and functions shown in this post complement fusion well — we can still fuse loops, on speciazlied data and functions, removing unnecessary traversals. Stream fusion also enables opportunities for specialization, which will no longer be of much use, as our data types have specialized themselves.

We also lose sharing in this approach. Sorry, but your Int fields will be copied by value. The approach makes most sense for small atomic types anyway, which may be unpacked, and these are precisely the types where sharing is unlikely to be helpful.

Essentially, we now treat the list type and constructors as an abstract view on to some custom, specialized representation. We may be able to regain pattern matching on the abstract structures via GHC’s view patterns. I’ve not tried this yet, but it looks easy enough.

The list implementation is a useful proof of concept — you can use it in your list-heavy programs today if you want. The most promising direction, though, in my view, is for container types such as IntMap and Sequence, where already the constructors are abstract, and there’s a small set of primitive operations, and we really care about data density. There may be some big wins for coalescing of nodes in IntMap, for example.

Results

In summary, we have a general approach to allow, for the first time, uniform specialization of both parametrically polymorphic functions and data, resulting in good speedups with only library modifications required. The Haskell type system, via class-associated type classes, is sufficiently expressive to describe abstract and concrete representation mappings, and the GHC optimizer (particular the inliner, and the unpacking pragma) are enough to precisely control the layout of data, along with the generation of custom functions for each type. The approach should give speedups in any language with a uniform representation policy, though it is certainly easier when you don’t need to modify the compiler.The approach here is very specific, and may be subsumed by more general global information optimization techniques, such as super compilation (coming to GHC). We also rely on the magic of the GHC Inliner to remove all traces of type class dictionaries.

The next steps will be to combine this with list fusion, to have the fastest lazy lists I know of, and to try it out on IntMap and friends, to yield faster polymorphic container types.

The result is satisfying: your Haskell code goes faster, while remaining high level and pure.

Further reading

This is the first time I’ve written down this approach, but the idea has been kicking around for a few months. You can see my WG2.8 talk on the subject here, and a video of my talking about initial experiments in this direction at the Utrecht hackathon. I’d be interested if readers know of any related work for this joint-specialization of data and function code, to eliminate the indirection cost of parametric polymorphism.

Follow

Get every new post delivered to your Inbox.

Join 64 other followers