Friday, May 31, 2013

Type-Safe Runtime Code Generation with (Typed) Template Haskell

Over the past several weeks I have implemented most of Simon Peyton Jones' proposal for a major revision to Template Haskell. This brings several new features to Template Haskell, including:

  1. Typed Template Haskell brackets and splices.
  2. Pattern splices and local declaration splices.
  3. The ability to add (and use) new top-level declarations from within top-level splices.

I will concentrate on the first new feature because it allows us to generate and compile Haskell values at run-time without sacrificing type safety. The code in this post is available on github; the github repository README contains instructions for building the th-new branch of GHC, which is where work on typed Template Haskell is being done. The GHC wiki contains more details about the current implementation status. The plan is that this work will land in HEAD before the 7.8 release of GHC.

The staged power function: untyped

The driving example I will use is the staged power function. The idea is that we want to compute \(x^n\) where \(n\) is known statically (that is, it is a constant we know ahead of time), but \(x\) may vary. John Lato wrote a blog post, Runtime Meta-programming in Haskell, that shows how to implement the staged power function while providing a veneer of type-safety on top of unsafe runtime code generation using "classic" Template Haskell. Examining his implementation is instructive, so I'll start there.

John's "safe" Template Haskell expression type is a newtype wrapper around the Template Haskell expression type, ExpQ, that adds a phantom type parameter.

newtype Meta a = Meta { unMeta :: ExpQ }

This definition means that for all α, we can construct a Meta α from any ExpQ by applying the constructor Meta. For example

e :: Meta Int
e = Meta [|'c'|]

Not very safe! However, using the Meta type, we can write combinators that are safe in the sense that if their inputs are "well-typed" Meta values, then their outputs will be "well-typed" Meta values. John defines one such safe combinator as follows

compose :: Meta (a -> b) -> Meta (b -> c) -> Meta (a -> c)
compose (Meta l) (Meta r) = Meta [| $r . $l |]

Although the compose function uses unsafe operations internally, if its two arguments really are Meta-wrapped Template Haskell expressions with the types a -> b and b -> c, respectively, then the result really will be a Meta-wrapped Template Haskell expression of type a -> c. This isn't enforced by the type system, but it's true nonetheless.
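For example, here is a small usage sketch with two hypothetical Meta values, addOne and double, built from the definitions above:

addOne :: Meta (Int -> Int)
addOne = Meta [| (+ 1) |]

double :: Meta (Int -> Int)
double = Meta [| (* 2) |]

-- represents \x -> (x + 1) * 2
addOneThenDouble :: Meta (Int -> Int)
addOneThenDouble = compose addOne double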

John defines the staged power function as follows

mkPow :: Int -> Meta (Int -> Int)
mkPow 0 = Meta [| const 1 |]
mkPow n = Meta [| \x -> x * $(unMeta $ mkPow (n-1)) x |]

Again, the type system doesn't enforce safety: it doesn't guarantee that the bit of Template Haskell that mkPow produces really has type Int -> Int, but we can see by inspection that it will always return a Template Haskell expression of type Int -> Int. With classic Template Haskell, this is just about all that we can hope for.
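At a use site, the generated expression is spliced in at compile time. A sketch (the splice must live in a module other than the one defining mkPow, because of Template Haskell's stage restriction):

pow5 :: Int -> Int
pow5 = $(unMeta (mkPow 5))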

The staged power function: typed

With the new version of Template Haskell, we can do better. Here is the definition of compose using the new Template Haskell typed brackets and typed splices (I have also changed the definition of compose so that the arguments are in the proper order :)).

compose :: TExpQ (b -> c) -> TExpQ (a -> b) -> TExpQ (a -> c)
compose f g = [|| $$f . $$g ||]

We can see a new data type, TExpQ a, and new syntax for splices and brackets. TExpQ a is a type alias for Q (TExp a), just as ExpQ is an alias for Q Exp. A TExp a is a typed Template Haskell expression—the type checker guarantees that it is the abstract syntax representation of a Haskell term of type a. This is in contrast to Exp, the existing untyped expression representation used by Template Haskell. A value of type Exp is the abstract syntax of a Haskell expression of some type, but a value of type TExp a is the abstract syntax of a Haskell expression of type a.

Values of type TExp a can only be constructed using typed brackets and typed splices. The syntax [||e||] is a typed bracket, and the syntax $$e is a typed splice. No other varieties of splice may occur in a typed bracket. In other words, [||e||] is the only introduction form for TExpQ, and $$e is the elimination form. It is not the only elimination form, though; Template Haskell also provides a constant unType :: TExp a -> Exp that allows coercion from the typed Template Haskell world to the untyped world; it simply throws away the type information.
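Here is a small sketch of the introduction and elimination forms:

one :: TExpQ Int
one = [|| 1 ||]

-- a typed splice inside a typed bracket
incr :: TExpQ (Int -> Int)
incr = [|| \x -> x + $$one ||]

-- forget the type index and drop back into the untyped world
incrExp :: ExpQ
incrExp = fmap unType incr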

We can now write a completely safe version of the mkPow function.

mkPow :: Num a => Int -> TExpQ (a -> a)
mkPow 0 = [|| const 1 ||]
mkPow n = [|| \x -> x * $$(mkPow (n-1)) x ||]

This version is type-checked by the compiler; there is no unsafe newtype wrapping/unwrapping going on.
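Splicing it in now carries a compile-time guarantee about the generated code. A sketch, again in a separate module because of the stage restriction:

pow5 :: Double -> Double
pow5 = $$(mkPow 5)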

Under the covers

If you look at the definition of TExp in the template-haskell library, you'll see that it actually is a newtype wrapper.

newtype TExp a = TExp { unType :: Exp }

Why is this safe? Because the compiler is checking that a typed bracket is well-typed and then creating a TExp whose type index is that type. Of course you can always subvert type safety by importing Language.Haskell.TH.Syntax and mucking about with TExp's. But as long as you stick to typed brackets and splices, you will get type safety.
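For example, nothing stops you from hand-wrapping an ill-typed representation once you have access to the constructor (a deliberately bogus sketch):

import Language.Haskell.TH.Syntax

-- claims to be an Int, but the representation is a Char literal; don't do this
bogus :: TExp Int
bogus = TExp (LitE (CharL 'c'))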

Run time code generation and compilation

Now that we can generate type safe Haskell expressions at run time, can we also safely compile them at run time? The answer is a qualified yes.

First, how can we compile a TExp a? John uses the plugins package, originally written by Don Stewart. I will use the GHC API directly. My implementation strategy is the same as his: print a Template Haskell expression to a file and compile and load the file. For this to work, we need two invariants to hold:

  1. The term that we pretty-print must be a faithful representation of the term that was type checked.
  2. The API that we use to compile the term we print out must assign it the same type that GHC did when it checked the term.

You might think that pretty-printing is safe, but in fact I had to fix a few bugs that caused the pretty-printed versions of some Template Haskell terms to not even parse as valid Haskell! I assume that the underlying expression value is a faithful representation of the term that GHC type checked.

The second property does not obviously hold. First of all, we need to make sure (approximately) that the type environment we use to compile the term is the same as the type environment that was used to type check it. Template Haskell resolves some names so that they are fully qualified. Perhaps we could walk over the term to figure out exactly which modules we need to import to compile it, but I'm not sure that is sufficient. At the moment I've punted, so compilation may fail at run time; see the github repository for the details. Furthermore, even if we got all the imports right, defaulting and such might conceivably get in the way, so we slap a type signature on our term when we print it out to make sure the type of our compiled term really matches the type of the typed bracket.
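To give the flavor of the idea, here is a minimal sketch of printing a term together with an explicit signature; prettyWithSig is a hypothetical helper, not the code from the repository:

{-# LANGUAGE ScopedTypeVariables #-}

import Data.Typeable (Typeable, typeOf)
import Language.Haskell.TH (pprint)
import Language.Haskell.TH.Syntax (TExp, unType)

-- render a typed expression with a type signature, so the compiled
-- term is forced to have the type the bracket was checked at
prettyWithSig :: forall a . Typeable a => TExp a -> String
prettyWithSig e =
    "it :: " ++ show (typeOf (undefined :: a)) ++ "\n" ++
    "it = " ++ pprint (unType e)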

Given the function compile :: forall a . Typeable a => TExp a -> IO a, which is defined in the github repository accompanying this post, we can now compile and run our staged power function. Note that the Typeable constraint is what allows us to print out the type of the term when we feed it to the GHC API.

main :: IO ()
main = do
  powTExp :: (TExp (Double -> Double)) <- runQ (mkPow 23)
  putStrLn . pprint . unType $ powTExp
  pow <- compile powTExp
  print $ pow 2.0
  print $ pow 10.0

Related work

Typed brackets bring much of the power of MetaML [1] to Haskell. They are also similar to F#'s code quotations [2]; however, Haskell's untyped brackets are now completely untyped: the type checker doesn't check them at all. In classic Template Haskell, the compiler attempted to type check all varieties of bracket, but an expression bracket still resulted in a plain old ExpQ. F#'s untyped quotations are type checked, although their precise semantics are unclear to me. I like the untyped/typed distinction that is now made by the new Template Haskell; typed brackets really are typed, and untyped brackets are only parsed, never type checked.

Typed Template Haskell is quite different from MetaHaskell [3], which provides a mechanism for quoting many different object languages while maintaining type safety. MetaHaskell also supports quotation and antiquotation using open code, which introduces a host of difficulties.

Footnotes:

[1] W. Taha and T. Sheard, "Multi-stage programming with explicit annotations," in Proceedings of the 1997 ACM SIGPLAN Symposium on Partial Evaluation and Semantics-Based Program Manipulation (PEPM '97), Amsterdam, The Netherlands, 1997, pp. 203–217.

[3] G. Mainland, "Explicitly heterogeneous metaprogramming with MetaHaskell," in Proceedings of the 17th ACM SIGPLAN International Conference on Functional Programming (ICFP '12), Copenhagen, Denmark, 2012, pp. 311–322.

Wednesday, April 18, 2012

Adding SIMD Support to Data Parallel Haskell

In my previous post I described my work to add support for SIMD instructions to GHC and to exploit this support in the high-level vector library. Despite some issues with the code generated by LLVM (still unresolved), SIMD instructions did lead to a significant performance increase on some small floating-point intensive microbenchmarks.

This performance gain comes at a syntactic cost: code must be re-written to use new SIMD operations. For consumers of the vector library, this is a small cost because high-level SIMD operators are available, but it would be nice not to have to pay any syntactic cost at all. There is an additional "deficiency" in the vector approach to computing with arrays—a lack of parallelism. It's nice to be able to leverage SIMD operations from Haskell code, but it would be even better if we could leverage them in parallel code! There are two primary frameworks for computing with parallel arrays in Haskell: Repa and Data Parallel Haskell (DPH). DPH includes a vectorizer that converts scalar code to vector code, so it seems like there's a good chance we could support SIMD operations in DPH without requiring users to rewrite code. The rest of this post describes (positive) preliminary results in this direction.

My running example will be computing the dot product of two vectors, this time using DPH instead of the vector library. Here is the code:

import Data.Array.Parallel
import qualified Data.Array.Parallel.Prelude.Double as D

dotp :: PArray Double -> PArray Double -> Double
{-# NOINLINE dotp #-}
dotp v w = dotp' (fromPArrayP v) (fromPArrayP w)

dotp' :: [:Double:] -> [:Double:] -> Double
dotp' v w = D.sumP (zipWithP (D.*) v w)

DPH introduces a new parallel array data type (PArray), some new syntax for parallel arrays ([:...:]), and some new combinators for operating on parallel arrays (zipWithP). DPH doesn't quite yet support type classes, so we have to import a special version of the Prelude specialized to the Double type. You can read more about the details on the DPH wiki page.
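To actually call dotp we need to build PArrays from ordinary lists. A hypothetical driver sketch, assuming a fromList function exported by Data.Array.Parallel.PArray (details such as compiling vectorised code in a separate module are glossed over):

import Data.Array.Parallel.PArray (PArray, fromList)

main :: IO ()
main = do
    -- build two small parallel arrays and take their dot product
    let v = fromList [1..16] :: PArray Double
        w = fromList [1..16] :: PArray Double
    print (dotp v w)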

Using the simd branches of ghc, the vector library, and the dph library, we can change a single import statement and our code will use SIMD instructions to compute the dot product:

1:  import Data.Array.Parallel
2:  import qualified Data.Array.Parallel.Prelude.MultiDouble as D
3:  
4:  dotp :: PArray Double -> PArray Double -> Double
5:  {-# NOINLINE dotp #-}
6:  dotp v w = dotp' (fromPArrayP v) (fromPArrayP w)
7:  
8:  dotp' :: [:Double:] -> [:Double:] -> Double
9:  dotp' v w = D.sumP (zipWithP (D.*) v w)

Note that only the import statement on line 2 changed. If I had simply rolled SIMD support directly into Data.Array.Parallel.Prelude.Double, exploitation of SIMD instructions would be automatic, but for now I've put my changes into a separate module (this also makes comparisons easier).

Here are benchmarks taken on a 2.70GHz Intel® Core™ i7-2620M CPU with frequency scaling disabled. I'm using a perf-llvm build of the simd branch of GHC (compiled with GHC 7.4.1; earlier versions have trouble using LLVM to compile GHC). The benchmark program was compiled with the options -rtsopts -threaded -Odph -O2 -fllvm -optlo-O3 -optc-O3 -optc-march=corei7. I did not use -fcpr-off or -fno-liberate-case as suggested by comments in the README in the dph library because these options prevented fusion of SIMD operations. It looks like we really need Constructed product result analysis for efficient code, which is what I would expect. Error bars mark one standard deviation, and times are averaged over 100 runs.

Timings for the dotp function, \(2^{24}\) elements, i7.

The C and vector versions both take advantage of SIMD instructions—in the former case, via intrinsics. I've included three DPH versions: the two versions I already gave code for, and a version that uses DPH's parallel array syntax. The code for this version is:

dotp :: PArray Double -> PArray Double -> Double
{-# NOINLINE dotp #-}
dotp v w = dotp' (fromPArrayP v) (fromPArrayP w)

dotp' :: [:Double:] -> [:Double:] -> Double
dotp' xs ys = D.sumP [:x D.* y | x <- xs | y <- ys:]

Strangely, using syntactic sugar results in a significant performance hit. This is a pre-release version of DPH though, and I am told that this performance gap should be eliminated in time for the next release.

Using SIMD instructions in DPH gives us a big gain over the baseline DPH code, but it looks like we're limited by memory bandwidth, so we're not taking advantage of our extra cores. Fortunately, I have access to a 32-core 2.0GHz AMD Opteron™ 6128. Here is the same benchmark run there:

Timings for the dotp function, \(2^{24}\) elements, AMD Opteron.

Now we're cooking—much better scaling behavior! With DPH we can take advantage of multiple cores and we don't have to re-write our code to make use of SIMD instructions. Before GHC 7.6.1, I'd like to integrate all of this work back into the mainline branches of GHC, vector, and dph. Once that is done, all you will have to do to take advantage of SIMD versions of floating point operations in your DPH code is move to the new release.

Tuesday, March 27, 2012

SIMD Support for the vector library

Now that the ICFP deadline is past, I've returned to working on adding SIMD support to GHC and associated libraries. The short-term goal is to be able to leverage SIMD instructions from the vector library and, hopefully transparently, from Data Parallel Haskell. Back in November I added a fairly complete set of primops to GHC that generate SSE instructions when using the LLVM back-end. If you're interested in the original design for SIMD support in GHC and the state of the implementation, you can read about it here. It's a bit of a hack and requires the LLVM back-end, but it does work.

Primops are necessary, but what we'd really like is a higher-level interface to these low-level instructions. This post is a very short introduction to my recent efforts in that direction. All the code I describe is public; you can find directions for getting up and running with the GHC simd branch on github.

Because this is Haskell, we'll start by introducing a data type for a SIMD vector that is indexed by the type of the scalar values it contains. The term "vector" is already overloaded, so, at Simon Peyton Jones' suggestion, we call a SIMD vector containing scalars of type a a Multi a. Because we want to choose a different primitive representation for each different a, Multi is a type family (actually an associated type family). Along with Multi, we define a type class MultiPrim that allows us to treat primitive operations on Multi's in a uniform way, just as the Prim type class defined by the primitive library allows for scalars. Here's the first part of the definition of the MultiPrim type class and the Multi associated family. You can see that it defines functions for replicating scalars across Multi's, folding a function over the scalar elements of a Multi, reading a Multi out of a ByteArray#, etc. Right now there are instance definitions for Multi Float, Multi Double, Multi Int32, Multi Int64, and Multi Int. This type class and the rest of the code I'll be showing are actually part of the simd branch of the vector library that I've put on github. You can go look there for further details, like the Num instances defined for the Multi's.

class (Prim a, Prim (Multi a)) => MultiPrim a where
    data Multi a

    -- | The number of elements of type @a@ in a @Multi a@.
    multiplicity :: Multi a -> Int

    -- | Replicate a scalar across a @Multi a@.
    multireplicate :: a -> Multi a

    -- | Map a function over the elements of a @Multi a@.
    multimap :: (a -> a) -> Multi a -> Multi a

    -- | Fold a function over the elements of a @Multi a@.
    multifold :: (b -> a -> b) -> b -> Multi a -> b

    -- | Read a multi-value from the array. The offset is in elements of type
    -- @a@ rather than in elements of type @Multi a@.
    indexByteArrayAsMulti# :: ByteArray# -> Int# -> Multi a
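To get a feel for the API, here is a small sketch that replicates a scalar across the lanes and folds it back down; the result depends on the multiplicity (the SSE lane count) of Multi Float:

import Data.Primitive.Multi

-- replicate 1 across every lane, then sum the lanes back up
sumOfOnes :: Float
sumOfOnes = multifold (+) 0 (multireplicate 1 :: Multi Float)
-- with a four-lane Multi Float this evaluates to 4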

Now that we have the Multi type, we would like to use it to operate over Vector's, that is, vector types from the vector library. A Vector has scalar elements, so for us to be able to use SIMD operations on these scalars we need to know something about the representation the Vector uses, namely that it lays out scalars contiguously in memory. The PackedVector type class lets us express this constraint in Haskell's type system. I won't say anything more about it here, but instances are defined for the appropriate vector types in the Data.Vector.Unboxed and Data.Vector.Storable modules.

Of course the next step is to define appropriate versions of our old friends, map, zip, and fold, that will let us exploit SIMD operations. Here they are.

mmap :: (PackedVector v a, PackedVector v b)
     => (a -> b)
     -> (Multi a -> Multi b)
     -> v a
     -> v b
mzipWith :: (PackedVector v a, PackedVector v b, PackedVector v c)
         => (a -> b -> c)
         -> (Multi a -> Multi b -> Multi c)
         -> v a
         -> v b
         -> v c
mfoldl' :: PackedVector v b
        => (a -> b -> a)
        -> (a -> Multi b -> a)
        -> a
        -> v b
        -> a

If you're familiar with the vector library, you may know it uses stream fusion to generate very efficient code: many operations are typically compiled to tight loops similar to what one would get from a C compiler. Stream fusion works by re-expressing high-level operations like map, zip, and fold in terms of "step" functions. Each step function takes some state and an element and produces either some new state and a new element, just some new state, or a value that says it is done processing elements. To support computing over vectors using SIMD operations, I have added a new "stream" variant whose step functions can receive not just scalar elements, but Multi elements. That is, at every step, the stream consumer could be handed either a scalar or a Multi and must be prepared for either case. mmap, mzipWith, and mfoldl' are almost exactly like their scalar-only counterparts, but they each take an extra function argument for handling Multi's.
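For example, an element-wise product can be written by passing both the scalar and the Multi version of the same operation (a sketch; the Num instance for Multi Float supplies the second (*)):

import qualified Data.Vector.Unboxed as U

import Data.Primitive.Multi

-- the scalar (*) handles any leftover elements, while the
-- Multi (*) handles a full SIMD vector's worth at a time
prod :: U.Vector Float -> U.Vector Float -> U.Vector Float
prod = U.mzipWith (*) (*)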

Let's see if this stuff actually works by starting off with something easy: summing up all the elements in a vector. The following code uses the new vector library primitives multifold and U.mfoldl' to exploit SIMD instructions.

import qualified Data.Vector.Unboxed as U

import Data.Primitive.Multi

multisum :: U.Vector Float -> Float
multisum v =
    multifold (+) s ms
  where
    s  :: Float
    ms :: Multi Float
    (s, ms) = U.mfoldl' plus1 plusm (0, 0) v
    plusm (x, mx) my = (x, mx + my)
    plus1 (x, mx) y  = (x + y, mx)

We'll compare it with five other versions. "Scalar" and "Scalar (C)" are plain old scalar versions written in Haskell and C, respectively. "Manual" and "Manual (C)" are hand-written Haskell and C versions, respectively; the Haskell version explicitly iterates over the vector instead of using a fold. The "vector" version is the code we just saw, and the "multivector" version is based on a library I wrote to test out fusion when I first added SSE support to GHC; it implements a small subset of the vector library API. Here we go [1]:

Not bad. The following table gives the timings for vectors with \(2^{24}\) elements. In this case, Haskell is as fast as C. This isn't too surprising, as we've seen before that Haskell can be as fast as C.

Timings for the sum function. Summing vectors of size \(2^{24}\).

Variant          Time (ms)
Scalar           19.7 ± 0.2
Scalar (C)       19.7 ± 0.4
Manual           4.62 ± 0.03
Manual (C)       4.58 ± 0.02
vector           4.62 ± 0.02
multivector      4.62 ± 0.02

Of course, summing up the elements in a vector isn't so hard. The great thing about the vector library is that you can write high-level Haskell code and, through the magic of fusion, end up with a tight inner loop that looks like what you might have gotten out of a C compiler had you chosen to write in C. Let's try a slightly more difficult computation that will require fusion: dot product.

Computing the dot product efficiently requires fusing two loops to perform a combined addition and multiplication. Here is the scalar version in Haskell:

import qualified Data.Vector.Unboxed as U

dotp :: U.Vector Float -> U.Vector Float -> Float
dotp v w =
    U.sum $ U.zipWith (*) v w

And here is our first cut at a SIMD version.

import qualified Data.Vector.Unboxed as U

import Data.Primitive.Multi

multidotp :: U.Vector Float -> U.Vector Float -> Float
multidotp v w =
    multifold (+) s ms
  where
    s  :: Float
    ms :: Multi Float
    (s, ms)          = U.mfoldl' plus1 plusm (0, 0) $ U.mzipWith (*) (*) v w
    plusm (x, mx) my = (x,     mx + my)
    plus1 (x, mx) y  = (x + y, mx)

Let's look at performance once more. Again, "Manual" is a Haskell version that manually iterates over the vector once and fuses the addition and multiplication, the idea being that this is what we would hope to get out of GHC after fusion, inlining, constructor specialization, etc.

For reference, here are the timings for the case with \(n = 2^{24}\) again.

Timings for the dotp function. Calculating the dot product of vectors of size \(2^{24}\).

Variant          Time (ms)
Scalar           16.98 ± 0.08
Scalar (C)       16.63 ± 0.09
Manual           8.87 ± 0.03
Manual (C)       8.64 ± 0.02
vector           13.03 ± 0.07
multivector      9.5 ± 0.1

Not so hot. Although our hand-written Haskell implementation ("Manual" in the plot and table) is competitive with C, the vector version is not. Interestingly, the "multivector" version is competitive. What could be going on?

The first thing that jumps to mind is that fusion might not be kicking in: I could've screwed up the implementation of the SIMD-enabled combinators! To check this hypothesis, let's look at the GHC core generated for the main loop in multidotp (this is the loop that iterates over elements SIMD-vector-wise):

letrec {
  $s$wmfoldlM_loopm_s4ri [Occ=LoopBreaker]
    :: GHC.Prim.Int#
       -> GHC.Prim.Int#
       -> GHC.Prim.~#
            *
            Data.Primitive.Multi.FloatX4.FloatX4
            (Data.Primitive.Multi.Multi GHC.Types.Float)
       -> GHC.Prim.FloatX4#
       -> GHC.Prim.Float#
       -> (# GHC.Types.Float,
             Data.Primitive.Multi.Multi GHC.Types.Float #)
  [LclId, Arity=5, Str=DmdType LLLLL]
  $s$wmfoldlM_loopm_s4ri =
    \ (sc_s4nR :: GHC.Prim.Int#)
      (sc1_s4nS :: GHC.Prim.Int#)
      (sg_s4nT
         :: GHC.Prim.~#
              *
              Data.Primitive.Multi.FloatX4.FloatX4
              (Data.Primitive.Multi.Multi GHC.Types.Float))
      (sc2_s4nU :: GHC.Prim.FloatX4#)
      (sc3_s4nV :: GHC.Prim.Float#) ->
      case GHC.Prim.>=# sc1_s4nS ipv7_aHm of _ {
        GHC.Types.False ->
          case GHC.Prim.indexFloatArrayAsFloatX4#
                 ipv2_s4kn (GHC.Prim.+# ipv_s4kl sc1_s4nS)
          of wild_a4j3 { __DEFAULT ->
          case GHC.Prim.>=# sc_s4nR ipv6_XI3 of _ {
            GHC.Types.False ->
              case GHC.Prim.indexFloatArrayAsFloatX4#
                     ipv5_s4l7 (GHC.Prim.+# ipv3_s4l5 sc_s4nR)
              of wild3_X4jF { __DEFAULT ->
              $s$wmfoldlM_loopm_s4ri
                (GHC.Prim.+# sc_s4nR 4)
                (GHC.Prim.+# sc1_s4nS 4)
                @~ (Sym (Data.Primitive.Multi.NTCo:R:MultiFloat) ; Sym
                     (Data.Primitive.Multi.TFCo:R:MultiFloat))
                (GHC.Prim.plusFloatX4#
                   sc2_s4nU (GHC.Prim.timesFloatX4# wild_a4j3 wild3_X4jF))
                sc3_s4nV
              };
            GHC.Types.True -> ...
          }
          };
        GHC.Types.True -> ...
      };

We can see that the two loops have been fused. I won't show the core for the other Haskell implementations, but I'll note that it looks pretty much the same except for one thing: multidotp is carrying around two pieces of state during the fold it performs, a scalar Float and a Multi Float. That shouldn't make a difference though—these guys should just live in two separate registers. There's only one reasonable thing left to do: look at some assembly.

Just so we have an idea of what we want to see, let's examine the inner loop of the C version first:

.L3:
        movaps  (%rdi,%rax), %xmm0
        mulps   (%rdx,%rax), %xmm0
        addq    $16, %rax
        cmpq    %r8, %rax
        addps   %xmm0, %xmm1
        jne     .L3

Cool. Our array pointers live in rdi and rdx, our index in rax, and the array bounds in r8. Now on to the "manual" Haskell version.

.LBB5_3:                                # %n5oi
                                        # =>This Inner Loop Header: Depth=1
        movups  (%rcx), %xmm2
        movups  (%rdx), %xmm1
        mulps   %xmm2, %xmm1
        addps   %xmm1, %xmm0
        addq    $16, %rcx
        addq    $16, %rdx
        addq    $4, %r14
        cmpq    %r14, %rax
        jg      .LBB5_3

Still pretty good. This time our array pointers live in rcx and rdx, our index in r14, and our bounds in rax. Note that the index is now measured in floats instead of bytes. How about the "multivector" version?

 1:  .LBB1_2:                                # %n3JW.i
 2:                                          #   in Loop: Header=BB1_1 Depth=1
 3:          cmpq    %rax, %r8
 4:          jle     .LBB1_5
 5:  # BB#3:                                 # %n3K9.i
 6:                                          #   in Loop: Header=BB1_1 Depth=1
 7:          movq    8(%rcx), %rdx
 8:          addq    %rax, %rdx
 9:          movq    16(%rcx), %rdi
10:          movups  16(%rdi,%rdx,4), %xmm2
11:          movups  (%rbx), %xmm1
12:          mulps   %xmm2, %xmm1
13:          addps   %xmm1, %xmm0
14:          movups  %xmm0, -56(%rcx)        #
15:          addq    $16, %rbx
16:          addq    $4, %rax
17:  .LBB1_1:                                # %tailrecurse.i
18:                                          # =>This Inner Loop Header: Depth=1
19:          cmpq    %rax, %r9
20:          jg      .LBB1_2

There is definitely more junk here. Still, it's not horrible, except for line 14 where we spill the result to the stack. Apparently the spill doesn't cost us much [2]. Now for the "vector" version that had performance issues.

 1:  .LBB4_2:                                # %n4H3
 2:                                          #   in Loop: Header=BB4_1 Depth=1
 3:          cmpq    %r14, 43(%rbx)
 4:          jle     .LBB4_5
 5:  # BB#3:                                 # %n4Hw
 6:                                          #   in Loop: Header=BB4_1 Depth=1
 7:          movq    35(%rbx), %rdx
 8:          addq    %r14, %rdx
 9:          movq    3(%rbx), %rcx
10:          movq    11(%rbx), %rdi
11:          movups  16(%rdi,%rdx,4), %xmm0
12:          movq    27(%rbx), %rdx
13:          addq    %rsi, %rdx
14:          movups  16(%rcx,%rdx,4), %xmm1
15:          mulps   %xmm0, %xmm1
16:          movups  (%rbp), %xmm0           #
17:          addps   %xmm1, %xmm0
18:          movups  %xmm0, (%rbp)           #
19:          addq    $4, %r14
20:          addq    $4, %rsi
21:  .LBB4_1:                                # %tailrecurse
22:                                          # =>This Inner Loop Header: Depth=1
23:          cmpq    %rsi, %rax
24:          jg      .LBB4_2

Ah-hah, there's our likely culprit: our accumulator is loaded from the stack in line 16 and spilled back in line 18. Yuck! It looks like carrying around that extra bit of state really cost us. I'm not sure why LLVM didn't spill the Float portion of the state to the stack temporarily so that it could use the register for the main loop, but it seems likely that it is related to the GHC calling convention used by the LLVM back-end.

I'm disappointed that we weren't able to get C-competitive performance from our high-level Haskell code, especially since it seems so tantalizingly close. At least there is hope that with some prodding we can convince LLVM to keep our accumulating parameter in a register.

Footnotes:

[1] All timings were done on a laptop with a 2.70GHz Intel® Core™ i7-2620M CPU with frequency scaling disabled. 100 trials were performed at each data point. C code was compiled by GCC at -O3, and llc and opt were invoked with -O3.

[2] On an AMD machine I have access to, this spill does incur a penalty.