Fusion makes functional programming fun!

It’s deeply satisfying when the stream fusion optimization kicks in, in GHC Haskell. It is a powerful illustration of how the absence of side effects makes the compiler’s job so much easier, and lets it be far more aggressive.

Here’s a little example to show the real expressive power that we get when we have representation-changing optimizations like fusion to rely on. A simple example showing how good GHC is doing at this kind of stuff, these days: add a sequence of numbers. I’ve kept it super minimal so you can mentally compute each transformation to see GHC at work. (Of course there are lots of other ways to write trivial computations like this, but anything more substantial will make the assembly much harder to follow. So for teaching purposes, we’ll keep the source problem simple. Now, you can also use any of the 100+ other functions that fuse to construct all sorts of interesting programs with no closed form solution. Get at it!).

We begin with a pipeline of multiple recursive functions, brought together with function composition. As high level as it gets, and any junior Haskell program should be able to write stuff like this:

```﻿﻿﻿﻿import qualified Data.Vector as U

main = print . U.sum \$ U.enumFromTo 1 (100000000 :: Int)```
```
```

Now, semantically, there is an intermediate 100 million element array being created by enumFromTo. Try doing that in your favorite strict language. However…

… fusion kicks in, and GHC turns our program, via the source-to-source optimization phase, into a single tail recursive function, using unboxed values only .. with no intermediate array! No heap allocaiton, no stack, only one loop. A different algorithm!

```    loop :: Int# -> Int# -> Int#
loop x y = case y <= 100000000 of
False -> x
True  -> loop (x + y) (y + 1)```
```
```

The compiler reordered our computations pretty wildly, and it can do this because there are no side effects by default — making large scale transformations like this far more feasible. I used ghc 6.13 with the llvm patches, and ghc -Odph -fllvm –make A.hs

The programmer didn’t have to think about recursion at all in the original program, let alone manual unboxing. Thanks GHC!

And now this tail recursive function gets compiled into an imperative loop via the codeGen phase. First, with the native code backend to GHC:

```loop:
.Lc216:
cmpq \$10000000,%rsi
jle .Lc219
movq %r14,%rbx
movq (%rbp),%rax
jmp *(%rax)
.Lc219:
incq %rsi
jmp loop```
```
```

And again, with the LLVM backend to GHC:

```loop:
leaq    1(%rsi), %rax
cmpq    \$10000001, %rax
jge     .LBB1_5
test:                                # %tailrecurse
cmpq    \$10000001, %rsi
jl      loop```
```
```

And that’s it. Our kernel code is as low level and un-functional as you get, but our input program was the most classic of high level approaches: a pipeline of functions held together with composition, with intermediate data structures all over the place.

And it performs!

With -fasm: mean: 92.9 ms:

With -fllvm: mean: ﻿79.3ms (15% improvement with -fllvm):

Maybe our compiler is — sometimes — sufficiently smart.

To me, the fusion optimization is the epitome of what functional programming is about. I can write code at a very high level, throwing functions and composition around like candy, and the compiler will make stunning, complexity-changing transformations on my code, so it just runs damn fast. Can your language — functional or otherwise — do this?

Modern Benchmarking in Haskell

﻿﻿Thanks to work by Bryan O’Sullivan, there has been a rennaissance in performance benchmarking tools for Haskell, built upon Criterion.

Compared to most other benchmarking frameworks (for any programming language, not just Haskell), criterion focuses on being easy to use, informative, and robust.

The Galois Tech Talk of Feb 23rd presented this work. You can read the slides online, or find the source and examples here.

Criterion uses statistically robust mechanisms for sampling and computing sound microbenchmark results, and is more stable in the presence of noise on the system than naive timings.

Criterion has in turn spawned some extensions:

• Progression: compare different criterion graphs
• NoSlow: a new array benchmark suite based on Criterion

In this talk I will present these tools, how to use them, and how to make your performance benchmarks in Haskell, or languages Haskell can talk to, more reliable. In addition, we’ll explore benchmarks using the new vector package, and GHC’s llvm backend.

Smoking fast Haskell code using GHC’s new LLVM codegen

In this post we’ll play with GHC’s new LLVM code generator backend, and see how much faster some Haskell programs are when compiled with LLVM instead of GCC.

For the kind of loops we get from stream fusion, the -fllvm backend produced a lot better code, up to 3x faster in some cases. There are pretty graphs, and some smoking hot new technology.

Overview

This week David Terei announced that his work on an LLVM code generator backend for the Glasgow Haskell Compiler was ready to try out. Initial reports from his undergraduate thesis held that the LLVM code generator was competitive with the current GHC native code generator, a bit slower than the C backend in general (which uses GCC for code generation), but, tantalisingly, should produce big speedups for particular Haskell programs. In particular, tight loops of the kind generated by the bytestring, vector, data parallel arrays or text libraries. David reported speedups of 25% over the previous best performance we’d got from GHC for data parallel code.

I was very keen to try it out on the vector library — a fast, fusible numerical arrays package (similar to NumPy), which generates some very tight loops. Under the C backend, GCC has been failing to spot that the code GHC generates were actually loops, and this lead to GCC optimizing the generated code pretty badly. The native code generator does ok, but doesn’t have a lot of the clever low-level optimizations we need for really good bare metal performance.

So how would the new LLVM backend do?

Setting up

To try out the LLVM backend I followed the instructions on the wiki.

• Check out GHC HEAD from darcs.
• Apply the LLVM patch.
• Check out LLVM from svn
• Apply the GHC patch
• Build your GHC.

This worked out of the box, and I now have a GHC 6.13 with the -fllvm flag.

```\$ ghc --info
[("Project name","The Glorious Glasgow Haskell Compilation System")
,("Project version","6.13.20100221")
,("Booter version","6.12.1")
,("Stage","2")
,("Have interpreter","YES")
,("Object splitting","YES")
,("Have native code generator","YES")
,("Have llvm code generator","YES")
,("Support SMP","YES")
,("Unregisterised","NO")
,("Tables next to code","NO")
,("Win32 DLLs","")
,("RTS ways","l debug  thr thr_debug thr_l  ")
,("Debug on","False")
,("LibDir","/home/dons/lib/ghc-6.13.20100221")
]```

Running on a dual core Core 2 laptop:

```\$ uname -msr
Linux 2.6.32-ARCH x86_64```

You can then install packages as normal, via cabal, and add the -fllvm flag to see GHC build things via the new backend:

`\$ cabal install primitive --ghc-options=-fllvm`

The packages I’m interested in are:

And some helper code in:

I also modifed the ghc-core tool to support showing the LLVM generated assembly.

Warm up lap

Let’s check the backend is working (remember to add the -fllvm flag):

```\$ ghc -O2 --make A.hs -fllvm -fforce-recomp
[1 of 1] Compiling Main             ( A.hs, A.o )
\$ time ./A
"hey"
./A  0.00s user 0.00s system 61% cpu 0.005 total```

Good! The LLVM backend is generating working code for x86_64/Linux. Now, something more ambitious … a program from the shootout.

A shootout program

So let’s find some code that’s already been optimized. I’l compile the pidgits shootout benchmarks (where Haskell’s already the fastest entry).

First, with the native code gen:

\$ ghc -O2 -fasm A.hs –make -fforce-recomp

\$ time ./A 10000 > /dev/null
./A 10000 > /dev/null 3.19s user 0.03s system 91% cpu 3.509 total

With the old GCC backend:

\$ ghc -O2 -fvia-C -optc-O3 A.hs –make -fforce-recomp

\$ time ./A 10000 > /dev/null
./A 10000 > /dev/null 2.89s user 0.03s system 97% cpu 2.988 total

And with the -fllvm backend:

\$ ghc -O2 -fllvm A.hs –make -fforce-recomp

\$ time ./A 10000 > /dev/null
./A 10000 > /dev/null 2.86s user 0.02s system 98% cpu 2.936 total

Woo. It runs, and we get a speedup! Now for some serious business.

The Vector Package

Vector is a Haskell library for working with arrays. It provides several array types (boxed, unboxed, C), with a rich interface similar to the lists library, and some functions reminiscent of Data Parallel Haskell. There’s a tutorial on how to use it.

The interface is built entirely around stream fusion combinators — a general form of classic loop fusion made possible by purity. When you do multiple passes over the data (e.g. sum/map/fold/filter/…) the compiler will common up the loops, and discard intermediate arrays, making the code potentially very fast.

The loops that are generated tend to be very register heavy, do no heap allocation, and benefit from clever imperative loop optimizations. Unfortunately, the GCC backend to GHC doesn’t spot that these are actually loops, so doesn’t get to fire many optimizations.

The promise of the LLVM backend is that it will recognize the loops GHC generates from fused code. Let’s see how it performs.

To benchmark these programs, I’ll use the criterion and progression benchmarking libraries. (I had to build the darcs version of gtk2hs, and compiler data accessor-template with the -ftemplate_2_4 flag)

Simple loops

To start off, let’s generate 1 billion ints, sum them, print the result. That should tell us if our loops are efficient:

```import qualified Data.Vector as U
main = print . U.sum \$ U.enumFromTo 1 (1000000000 :: Int)```

There are two loops in this program. enumFromTo and sum.

The core

GHC compiles these two loops into a single loop, when compiled with -O2 or -Odph:

```loop  :: Int# -> Int# -> Int#
loop x y =
case <=# y 1000000000 of```
`         False -> x`
`         True  ->  loop (x +# y) (y +# 1)`

This is perfect. We write “sum (enumFromTo 1 n)” and we get a non-allocating loop.

The native backend

GHC 6.13 with the native code generator generates the following assembly for the inner loop:

```Main_mainzuzdszdwfoldlMzqzuloop_entry:
.Lc21u:
cmpq \$1000000000,%rsi
jle .Lc21x
movq %r14,%rbx
movq (%rbp),%rax
jmp *(%rax)
.Lc21x:
incq %rsi
jmp Main_mainzuzdszdwfoldlMzqzuloop_entry```

which runs in:

```\$ time ./enum
500000000500000000
./enum  1.00s user 0.00s system 99% cpu 1.008 total```

The  C backend

GHC 6.12.1 with the C backend, (-fvia-C -optc-O3) (I’m having trouble linking programs with the C backend and GHC 6.13), yields a pretty small loop:

```Main_mainzuzdszdwfoldlMzqzuloop_info:
cmpq    \$1000000000, %r14
movq    %r14, %rax
jle     .L2
movq    %rsi, %rbx
jmp     *(%rbp)
.L2:
leaq    1(%r14), %r14
jmp     Main_mainzuzdszdwfoldlMzqzuloop_info```

Which runs slower than the native code generator:

```\$ time ./enum
500000000500000000
./enum  1.09s user 0.00s system 99% cpu 1.100 total```

The LLVM backend

With -O2 -fllvm we get very different code, and it is a bit harder to work out what is going on. LLVM transforms the code far more aggressively.

```.LBB1_2:
leaq    1(%rsi), %rax
cmpq    \$1000000001, %rax
jge     .LBB1_5                 # loop exit
.LBB1_1:                        # %tailrecurse
cmpq    \$1000000001, %rsi
jl      .LBB1_2```

And the proof is in the pudding:

```\$ time ./enum
500000000500000000
./enum  0.48s user 0.01s system 99% cpu 0.488 total```

This is the fastest Haskell we’ve ever generated for this little benchmark (at least without manual loop unrolling)!

The LLVM backend more than halved the running time for this simple loop. But remember: general benchmarks aren’t seeing these kind of speedups — LLVM is really excelling itself at the tight numeric code.

Here’s the data presented in a slightly different form, with criterion and progression. The numbers are slightly different, since we won’t inline the length of the vector argument, and we’re wrapping the code in benchmarking wrappers. I wasn’t able to get -fvia-C programs to link under the HEAD, so we’ll exclude those from graphs, but report them in text form.

With the -fasm backend:

With the LLVM backend:

Or side-by-side with the progression package:

The -fasm backend under the progression tool ran around ~1s for each billion ints, while -fllvm was around 0.8s. Note that we get slightly different timings with the loops under each benchmarking tool, due to how the benchmark program and wrapper are optimized.

Zips

Zips are another good candidate, since they turn into nested loops. So, e.g.

```import qualified Data.Vector as U
import Data.Bits
main = print . U.sum . U.map (`shiftL` 1) \$ U.zipWith (*)
(U.enumFromTo 1 (100000000 :: Int))
(U.replicate (100000000 :: Int) 42)```

Which fuses to this set of loops:

```loop  :: Int# -> Int# -> Int# -> Int#
loop =
\ (sc_s29b :: Int#)
(sc1_s29c :: Int#)
(sc2_s29d :: Int#) ->
case <=# sc1_s29c 100000000 of _ {       False -> sc_s29b;
True ->
case <=# sc2_s29d 0 of _ {           False ->
loop
(+#
sc_s29b (uncheckedIShiftL# (*# sc1_s29c 42) 1))
(+# sc1_s29c 1)
(-# sc2_s29d 1);
True -> sc_s29b
}
}```

Which, again, is perfect Core. All those functions combined into a single non-allocating loop.

-fasm:

```Main_mainzuzdszdwfoldlMzqzuloop_entry:
.Lc2aC:
cmpq \$100000000,%rsi
jle .Lc2aE
movq %r14,%rbx
movq (%rbp),%rax
jmp *(%rax)
.Lc2aE:
testq %rdi,%rdi
jle .Lc2aH
movq %rsi,%rax
imulq \$42,%rax
shlq \$1,%rax
incq %rsi
decq %rdi
jmp Main_mainzuzdszdwfoldlMzqzuloop_entry
.Lc2aH:
movq %r14,%rbx
movq (%rbp),%rax
jmp *(%rax)```

Which is reasonable:

\$ time ./zipwith
420000004200000000
./zipwith 0.24s user 0.00s system 99% cpu 0.246 total

With the -fvia-C -optc-O3 backend, just the inner loop, since that’s easy to read:

```Main_mainzuzdszdwfoldlMzqzuloop_info:
cmpq    \$100000000, %rsi
jg      .L6
.L2:
testq   %r14, %r14
jle     .L6
leaq    (%rsi,%rsi,4), %rcx
leaq    -1(%r14), %r14
leaq    (%rsi,%rcx,4), %rcx
leaq    1(%rsi), %rsi
leaq    (%rdi,%rcx,4), %rdi
jmp     Main_mainzuzdszdwfoldlMzqzuloop_info```

Which runs in about the same time as the -fasm backend:

```\$ time ./zipwith
420000004200000000
./zipwith  0.25s user 0.00s system 99% cpu 0.251 total```

With -fllvm the code is wildly different, and I find it pretty hard to reconstruct what transformatoins LLVM has done.

```Main_mainzuzdszdwfoldlMzqzuloop_entry:
# BB#0:                                 # %c2cf
subq    \$8, %rsp
imulq   \$84, %rsi, %rax
jmp     .LBB1_1
.LBB1_3:                                # %n2cN
#   in Loop: Header=BB1_1 Depth=1
incq    %rsi
decq    %rdi
.LBB1_1:                                # %tailrecurse
# =>This Inner Loop Header: Depth=1
cmpq    \$100000001, %rsi        # imm = 0x5F5E101
jge     .LBB1_4
#   in Loop: Header=BB1_1 Depth=1
testq   %rdi, %rdi
jg      .LBB1_3
.LBB1_4:                                # %n2ck
movq    (%rbp), %rax
movq    %r14, %rbx
movq    (%rax), %r11
jmpq    *%r11  # TAILCALL```

The “inner loop” is interesting. Nothing like what -fasm or -fvia-C generate. And it’s way faster:

\$ time ./zipwith
420000004200000000
./zipwith 0.15s user 0.00s system 99% cpu 0.154 total

So yeah, 40% faster!

Criterion

Here, under criterion (same code, but different values of n), With the -fasm backend, mean execution time 186ms:

With the -fllvm backend, 135 ms  (27% improvement):

zipwith3

Heavily nested zips are probably the best cases for LLVM, and we see the -fllvm backend do some pretty wild stuff with this:

import qualified Data.Vector.Unboxed as U import Data.Bits main = print . U.sum \$ U.zipWith3 (\x y z -> x * y * z) (U.enumFromTo 1 (100000000 :: Int)) (U.enumFromTo 2 (100000001 :: Int)) (U.enumFromTo 7 (100000008 :: Int))

Which fuses to:

```main_\$s\$wfoldlM'_loop [Occ=LoopBreaker]
:: Int#     -> Int# -> Int# -> Int# -> Int#

main_\$s\$wfoldlM'_loop =
\ (sc_s2jh :: Int#)
(sc1_s2ji :: Int#)
(sc2_s2jj :: Int#)
(sc3_s2jk :: Int#) ->
case  sc_s2jh;
True ->
case  sc_s2jh;
True ->
case  sc_s2jh;
True ->
main_\$s\$wfoldlM'_loop
(+#
sc_s2jh (*# (*# sc1_s2ji sc2_s2jj) sc3_s2jk))
(+# sc1_s2ji 1)
(+# sc2_s2jj 1)
(+# sc3_s2jk 1)
}
}
}```

Great core. With the -fasm backend:

```Main_mainzuzdszdwfoldlMzqzuloop_entry:
.Lc2lq:
cmpq \$100000000,%rsi
jle .Lc2ls
movq %r14,%rbx
movq (%rbp),%rax
jmp *(%rax)
.Lc2ls:
cmpq \$100000001,%rdi
jle .Lc2lu
movq %r14,%rbx
movq (%rbp),%rax
jmp *(%rax)
.Lc2lu:
cmpq \$100000008,%r8
jle .Lc2lx
movq %r14,%rbx
movq (%rbp),%rax
jmp *(%rax)
.Lc2lx:
movq %rdi,%rax
imulq %r8,%rax
movq %rsi,%rcx
imulq %rax,%rcx
incq %rsi
incq %rdi
incq %r8
jmp Main_mainzuzdszdwfoldlMzqzuloop_entry```

Straight forward, and running it:

```\$ time ./zipwith3
3541230156834269568
./zipwith3  0.47s user 0.01s system 98% cpu 0.484 total```

With -fvia-C -optc-O3:

```Main_mainzuzdszdwfoldlMzqzuloop_info:
.text
.p2align 4,,15
.text
.align 8
.type Main_mainzuzdszdwfoldlMzqzuloop_info, @function
# 38 "/tmp/ghc10013_0/ghc10013_0.hc" 1
# 0 "" 2
cmpq    \$100000000, %rdi
jg      .L9
.L4:
cmpq    \$100000001, %rsi
jg      .L9
.L5:
cmpq    \$100000008, %r14
.p2align 4,,5
jg      .L9
.L7:
movq    %rsi, %r10
leaq    1(%rsi), %rsi
imulq   %rdi, %r10
leaq    1(%rdi), %rdi
imulq   %r14, %r10
leaq    1(%r14), %r14
leaq    (%r10,%r8), %r8
jmp     Main_mainzuzdszdwfoldlMzqzuloop_info```

And we get a faster result:

```\$ time ./zipwith3
3541230156834269568
./zipwith3  0.34s user 0.00s system 99% cpu 0.344 total```

-fllvm, looks like some heavy loop unrolling:

```Main_mainzuzdszdwfoldlMzqzuloop_entry:  # @Main_mainzuzdszdwfoldlMzqzuloop_entry
# BB#0:                                 # %c2oz
subq    \$56, %rsp
cmpq    \$100000002, %rdi        # imm = 0x5F5E102
movl    \$100000002, %eax        # imm = 0x5F5E102
movq    \$-2, %rdx
movq    %r9, 40(%rsp)           # 8-byte Spill
movq    %r15, 48(%rsp)          # 8-byte Spill
movq    \$-3, %r9
movq    %r12, 32(%rsp)          # 8-byte Spill
movq    %r8, %rbx
movq    %r13, 24(%rsp)          # 8-byte Spill
movq    %r14, 16(%rsp)          # 8-byte Spill
leaq    1(%rdi), %r13
cmovgq  %rdi, %rax
negq    %rax
leaq    -1(%rdi,%rax), %rcx
cmpq    \$100000009, %r8         # imm = 0x5F5E109
movl    \$100000009, %eax        # imm = 0x5F5E109
cmovgq  %r8, %rax
negq    %rax
leaq    -1(%r8,%rax), %rax
cmpq    %rcx, %rax
cmovaq  %rax, %rcx
cmpq    \$100000001, %rsi        # imm = 0x5F5E101
movl    \$100000001, %eax        # imm = 0x5F5E101
cmovgq  %rsi, %rax
negq    %rax
leaq    -1(%rsi,%rax), %rax
cmpq    %rax, %rcx
cmovbeq %rax, %rcx
imulq   %rdi, %rbx
imulq   %rsi, %r13
movq    %rcx, %r10
subq    %rcx, %rdx
subq    %rcx, %r9
imulq   %rsi, %rbx
notq    %r10
movq    %r10, %rax
imulq   %r10, %rbx
mulq    %rdx
addq    16(%rsp), %rbx          # 8-byte Folded Reload
movq    %rax, %r11
movq    %rdx, %r15
movq    %r15, %r12
movq    %r11, %rax
andq    \$1, %r15
imulq   %r9, %r12
mulq    %r9
shldq   \$63, %r11, %r15
leaq    (%r8,%rdi), %r9
movq    \$-4, %rdx
subq    %rcx, %rdx
movq    %r12, %r14
andq    \$1, %r12
leaq    6(%r9,%r9), %r10
movabsq \$6148914691236517205, %r9 # imm = 0x5555555555555555
movq    %rdx, 8(%rsp)           # 8-byte Spill
imulq   %rdx, %r14
leaq    1(%rdi,%rsi), %rdx
shldq   \$63, %rax, %r12
imulq   %r8, %rdx
imulq   %r12, %r10
leaq    1(%rdx,%r13), %rdx
imulq   %r10, %r9
imulq   %r15, %rdx
mulq    8(%rsp)                 # 8-byte Folded Reload
subq    %r9, %rbx
movq    %r8, %r9
decq    %r8
subq    %rcx, %r9
movq    %rdi, %rdx
decq    %r9
shldq   \$62, %rax, %r14
movq    %rsi, %rax
subq    %rcx, %rdx
andq    \$-2, %r14
subq    %rcx, %rax
decq    %rdx
decq    %rax
.align 16
.LBB2_1:                                # %tailrecurse
# =>This Inner Loop Header: Depth=1
cmpq    \$100000001, %rsi        # imm = 0x5F5E101
jge     .LBB2_4
# BB#2:                                 # %c2oD
#   in Loop: Header=BB2_1 Depth=1
cmpq    \$100000002, %rdi        # imm = 0x5F5E102
jge     .LBB2_4
# BB#3:                                 # %c2p5
#   in Loop: Header=BB2_1 Depth=1
incq    %rsi
incq    %rdi
incq    %r8
cmpq    \$100000009, %r8         # imm = 0x5F5E109
jl      .LBB2_1
.LBB2_4:                                # %n2oE
movq    (%rbp), %rcx
movq    %r9, %r8
movq    24(%rsp), %r13          # 8-byte Reload
movq    32(%rsp), %r12          # 8-byte Reload
movq    %r14, %rbx
movq    %rax, %rsi
movq    %rdx, %rdi
movq    40(%rsp), %r9           # 8-byte Reload
movq    48(%rsp), %r15          # 8-byte Reload
movq    (%rcx), %r11
jmpq    *%r11  # TAILCALL```

And blows them all out of the water! 3x faster than -fasm! Twice as fast as -fvia-C -optc-O3.

```\$ time ./zipwith3
3541230156834269568
./zipwith3  0.16s user 0.00s system 99% cpu 0.158 total```

From the Statistics package

The statistics package has some more “realistic” microbenchmarks. Let’s look at those. First, computing the mean of a large array of doubles (here all set to ‘pi’).

`main = print (mean (V.replicate 1000000000 (pi :: Double)))`

With the -fasm backend:

```Main_mainzuzdszdwfoldlMzuloop_entry:
.Lc2b2:
testq %rsi,%rsi
jle .Lc2b5
cvtsi2sdq %r14,%xmm0
movsd .Ln2b8(%rip),%xmm7
subsd %xmm5,%xmm7
divsd %xmm0,%xmm7
incq %r14
decq %rsi
jmp Main_mainzuzdszdwfoldlMzuloop_entry```

Simple, easy.

```\$ time ./mean
3.141592653589793
./mean  5.58s user 0.01s system 99% cpu 5.599 total```

With the -fllvm backend:

```Main_mainzuzdszdwfoldlMzuloop_entry:    # @Main_mainzuzdszdwfoldlMzuloop_entry
# BB#0:                                 # %c28E
subq    \$8, %rsp
movsd   .LCPI3_0(%rip), %xmm0
jmp     .LBB3_1
.align 16
.LBB3_3:                                # %n28K.i
#   in Loop: Header=BB3_1 Depth=1
movapd  %xmm0, %xmm5
cvtsi2sdq       %rcx, %xmm8
subsd   %xmm7, %xmm5
divsd   %xmm8, %xmm5
.LBB3_1:                                # %tailrecurse
# =>This Inner Loop Header: Depth=1
testq   %rsi, %rsi
jle     .LBB3_5
# BB#2:                                 # %n28K
#   in Loop: Header=BB3_1 Depth=1
movapd  %xmm0, %xmm7
cvtsi2sdq       %r14, %xmm8
leaq    -1(%rsi), %rax
leaq    1(%r14), %rcx
subsd   %xmm5, %xmm7
testq   %rax, %rax
divsd   %xmm8, %xmm7
jg      .LBB3_3
# BB#4:                                 # %c28J.i
movq    (%rbp), %rdx
movq    %rcx, %rbx
movq    %rcx, %r14
movq    %rax, %rsi
movapd  %xmm7, %xmm5
movq    (%rdx), %r11
jmpq    *%r11  # TAILCALL
.LBB3_5:                                # %c28J
movq    (%rbp), %rax
movq    %r14, %rbx
movq    (%rax), %r11
jmpq    *%r11  # TAILCALL```

And running it:

```\$ time ./mean
3.141592653589793
./mean  5.55s user 0.01s system 99% cpu 5.585 total```

Some pretty wacky code, but a little faster.

Conclusions

The LLVM backend seems to be holding up to what we hoped: it does a better (some times much better) job on tight loops. We get better code than GHC has ever produced before. It seems pretty robust, so far everything I’ve tried has worked.

David’s benchmarks indicate that with the current — first attempt — at an LLVM backend most large programs aren’t noticeably faster, but I think the promise we see in these small examples justifies spending more time working on the LLVM backend to GHC. It has much more potential than the GCC backend.

Currently we’re not experimenting with the LLVM optimization layer at all — I think there’s likely to be a lot of win just tweaking those settings (and exposing them to the Haskell programmer via GHC flags).

Migrating from uvector to vector

Roman Leschinkskiy has released the 0.5 version of vector, the future standard non-parallel arrays library forGHC Haskell. This post covers some of the differences between it and uvector, and what to watch for when migrating code to use vector.

The summary is — as of Feb 15, 2010 — you can move to vector now.  In almost all cases you will get identical performance to uvector, but with a nicer interface. There are currently a few small gaps in the API, and a couple of performance tweaks are needed to particular functions, but they should not affect most people (and likely will be fixed in coming days). Note that you should use the -Odph optimization flag for the most reliable results.

Background

vector is one of the results of the multi-year Data Parallel Haskell project, to develop high performance parallel bulk array processing for Haskell, allowing us to do very fast arrays (that is, transparently multicore parallel array processing, outperforming C or C++ by using cores more efficiently.)

While this project concentrates on data parallelism, it has also lead to new approaches to flat, sequential arrays for Haskell. The code has been spun-off in two different packages, which replace the fifteen year old array package with faster, more flexible types and functions:

These two libraries share a common origin, but have different engineering goals. Both libraries make heavy use of loop fusion based on streams to achieve excellent performance. (You can read more about that in a separate post).

The vector library is far more ambitious, aiming to be the standard array library for high performance problems in all circumstances. It cleanly supports:

• mutable arrays
• immutable arrays
• boxed
• unboxed
• storable types

As for uvector, unboxed representations are specialized at compile time via type families, and fusion is used throughout the interface. Unlike uvector, vector supports boxed arrays, and provides inplace fusion of mutable array operations.

If you need transparently parallel arrays, you should consider the dph package, distributed with GHC.

Current status

uvector is stable, and has gone into maintainance mode only. If you like it, you can safely continue to use it for  the foreseeable future, though any performance improvements in the fusion or array types developed in vector will not be backported to uvector.

vector, as of 0.5, has been declared “beta”. You can begin migrating code to use it.

I’ve just finished porting the uvector micro benchmark suite to vector, and have the following notes on how to migrate your code to use unboxed arrays in the vector package.

Namespaces

The old uvector namespace was Data.Array.Vector with U suffix appended to names. That goes away, and instead you should:

`import qualified Data.Vector as U`

Most function names are identical, so we have in vector the obvious counterparts to uvector. All these are basically unchanged:

`U.length U.null U.empty U.cons U.snoc U.replicate U.head`
`U.last U.init U.tail U.take U.drop U.map U.filter U.elem U.notElem`
`U.product U.sum U.maximum U.minimum U.foldl1 U.foldl U.dropWhile U.break`
`U.find U.all U.any U.and U.or U.maximumBy U.minimumBy`
`U.enumFromTo`

Some function names have changed:

`U.++ replaces appendU`
`U.! replaces indexU`

Some functions are missing:

`lookupU repeatU`

There are many new functions on vectors, in particular, on mutable arrays, and for bulk operations (backpermute, reverse, accum).

Better types

Notably, vector supports boxed types, so you can more easily store Haskell values in fusible arrrays (so you can have, e.g. Integer arrays).

Performance Wibbles

I found only a few differences in performance compared to uvector, and have notified Roman. These shouldn’t affect many users currently, and will likely disappear in coming days.

First, compile with -Odph instead of -O2, this fixes some optimization issues with zips, and probably other things.

Functions to watch out for:

• zip, zipwith, zipwith3 — uvector was a lot faster (10x) in simple programs — however, moving to -Odph fixes zips entirely.
• Different fusion results for ’empty’ (kind of a corner case)
• eq seems to be about twice as slow. Unsure why.
• Bools don’t seem to be bit packed? At least, Bool unboxed arrays seem a bit slower than in uvector.
• U.last appears to be optimized differently, though doesn’t affect performance.
• Pipelines ending in ‘null’ (another corner case) are fused differently (slightly worse performance).

And that’s about it.

As of this post, I’m officially declaring uvector to be in maintainance-only mode, and will be working to improve vector.

Playing with the new Haskell epoll event library

Bryan and Johan have been working hard on replacing the GHC runtime’s concurrency mechanism based on select, with a new one based on epoll. This should improve scalability of Haskell server code in a number of ways — more connections sec, more concurrent connections, and so on.

You can read about their work in a series of posts:

And you can find the current code for haskell-event on github: http://﻿github.com/tibbe/event

After reading this post, on how epoll-based user land callbacks can outperform general purpose preemptive threads for non-compute bound IO work, I wanted to see what improvements we can hope for from an epoll-based version in Haskell.

The full code, which shows how to set up simple socket event notification using the event lib, is on the Haskell wiki, and will be useful for getting a flavor of the event lib use.

The results:

So while a simple forkIO/accept version, doing Handle and String IO would peak out around 7k req/sec, the epoll based version reached over 15k req/sec.

So that’s useful: good epoll-based event code for Haskell. The next step will be to see the redesign of Control.Concurrent based on epoll being merged into GHC.

Multicore Haskell Now! ACM Reflections | Projections 2009

Here are the

on approaches to multicore programming in Haskell, presented at the ACM Reflections | Projections conference at the University of Illinois.

Abstract

Multicore computers are here: is your programming language ready for it? Haskell is: you can take an off-the-shelf copy of GHC and write high performance parallel programs right now.

If you want to program a parallel machine, a purely functional language such as Haskell is a good choice: purity ensures the language is by-default safe for parallel execution, (whilst traditional imperative languages are by-default unsafe). This foundation has enabled Haskell to become something of a melting pot for high level approaches to concurrent and parallel programming, all available with an industrial strength compiler and language toolchain, available now for mainstream multicore programming.

This talk will introduce the features Haskell provides for writing high level parallel and concurrent programs. In particular we’ll focus on lightweight semi-explicit parallelism, software transactional memory, and nested data parallelism, so you can go to work writing multicore programs in Haskell.

I’d like to thank Michael Ilseman, Sameer Sundresh and Jeff Green for their hospitality during my visit.

LACSS 2009: Domain Specific Languages and Haskell

Here are the

on the use of Haskell for constructing EDSLs for high performance computing, along with a 10 minute overview of Haskell support for multicore programming, presented at the LACSS 2009 Workshop on Non-Traditional Programming Models.

Abstract

As the complexity of large-scale computing architecture increases, the effort needed to program these machines efficiently has grown dramatically. The challenge is how to bridge this “programmability gap”, making the hardware more accessible to domain experts. We argue for an approach based on
executable embedded domain specific languages (EDSLs)—small languages with focused expressive power hosted directly in existing high-level programming languages such as Haskell. We provide examples of EDSLs in use in industry today, and describe the advantages EDSLs have over general purpose languages in productivity, performance, correctness and cost.