Thursday, June 14, 2012

Getting Started with Simulations

To start numerical simulations, we first need the ability to set the initial condition. It will also be a good excercise for Paraiso building blocks, so let's give it a try. Please look at the example.

table :: Named (StaticValue TArray Double)
table = "table" `isNameOf` StaticValue TArray undefined

createBuilder :: Builder Vec2 Int Annotation ()
createBuilder = do
  x01 <- bind $ (cast $ loadIndex (Axis 0)) 
              / (cast $ broadcast $ loadSize (Axis 0))
  y01 <- bind $ (cast $ loadIndex (Axis 1)) 
              / (cast $ broadcast $ loadSize (Axis 1))
  x <- bind $ 4 * (x01-0.5)
  y <- bind $ 5 * (y01-0.5)
  z <- bind $ atan((1- x^2 - (y-(x^2)**(1/3))^2)*10)
  store table z

The Generator.hs makes a bit tedious use of casts and broadcasts we have just learned, to calculate x and y coordinates in real number inbetween 0 and 1 (they are namely x01 and y01.) See how input and output types for casts and broadcasts are inferred. The index and size are of type Int (as specified in the type annotation for createBuilder) while x01, y01, x, y and z are all of type (Builder of) Double because table has that type and we cannot convert type without an explicit cast. The careful reader also might have noticed that 1/3 is treated as Double instead of just being truncated to 0.

Now, Paraiso and the author give hearts to the readers! Thank you!

Thursday, June 7, 2012

Life in Paraiso

Now, let's try Paraiso in a real-world problem. Simulation of artificial life will be a good starting point.

The sample program, like every other Haskell programs, starts with some language extension and imports.

#!/usr/bin/env runhaskell
{-# LANGUAGE NoImplicitPrelude, OverloadedStrings #-}
{-# OPTIONS -Wall #-}

import           Control.Monad
import           Data.Tensor.TypeLevel
import           Language.Paraiso.Annotation (Annotation)
import qualified Language.Paraiso.Annotation.Boundary as Boundary
import           Language.Paraiso.Generator (generateIO)
import qualified Language.Paraiso.Generator.Native as Native
import           Language.Paraiso.Name
import           Language.Paraiso.OM
import           Language.Paraiso.OM.Builder
import           Language.Paraiso.OM.Builder.Boolean (select,eq,ge,le)
import           Language.Paraiso.OM.DynValue as DVal
import           Language.Paraiso.OM.Realm 
import qualified Language.Paraiso.OM.Reduce as Reduce
import           Language.Paraiso.OM.Value (StaticValue(..))
import           Language.Paraiso.Optimization
import           Language.Paraiso.Prelude
import           NumericPrelude hiding ((||),(&&))

-- the main program
main :: IO ()
main = do
  _ <- generateIO mySetup myOM
  return ()

Let's say our computation region to be 80x48 and with cyclic boundary conditions. The new Orthotope Machine will have the name "Life".

-- the code generation setup
mySetup :: Native.Setup Vec2 Int
mySetup = 
  (Native.defaultSetup $ Vec :~ 80 :~ 48)
  { Native.directory = "./dist/" ,
    Native.boundary = compose $ const Boundary.Cyclic
  }

-- the orthotope machine to be generated
myOM :: OM Vec2 Int Annotation
myOM = optimize O3 $
  makeOM (mkName "Life") [] myVars myKernels

We use an array variable cell for the cell states, population for counting the number of alive cells and generation for keeping track of the time.

-- the variables we use
cell :: Named (StaticValue TArray Int)
cell = "cell" `isNameOf` StaticValue TArray  undefined

population :: Named (StaticValue TScalar Int)
population = "population" `isNameOf` StaticValue TScalar undefined

generation :: Named (StaticValue TScalar Int)
generation = "generation" `isNameOf` StaticValue TScalar undefined

myVars :: [Named DynValue]
myVars = [f2d cell, f2d population, f2d generation]

Then, let's make two kernels, one for the initialization of the state and the other for updating the state for one generation.

-- our kernel
myKernels :: [Named (Builder Vec2 Int Annotation ())]
myKernels = ["init" `isNameOf` initBuilder,
             "proceed" `isNameOf` proceedBuilder]

Initialization, is just trivial.

initBuilder :: Builder Vec2 Int Annotation ()
initBuilder = do 
  -- store the initial states.
  store cell 0
  store population 0
  store generation 0

Now, how shall we write the rule of Conway's game of Life? Let's first define the adjacency. In Conway's game of Life, one cell has eight neighbours:

adjVecs :: [Vec2 Int]
adjVecs = zipWith (\x y -> Vec :~ x :~ y)
          [-1, 0, 1,-1, 1,-1, 0, 1]
          [-1,-1,-1, 0, 0, 1, 1, 1]

A timestep of the game begins by loading a Array variable called "cell" as the old state of the simulation.

proceedBuilder :: Builder Vec2 Int Annotation ()
proceedBuilder = do 
  oldCell <- bind $ load cell

We also load a Scalar variable called "generation."

  gen  <- bind $ load generation

shiftedCell <- bind $ shift v oldCell is an expression that yields a cell pattern shifted by amount v from the old cell. Then mapping it over every v in adjVecs creates the neighbour list.

  neighbours <- forM adjVecs $
    \v -> bind $ shift v oldCell

Count the neighbour cells that are alive.

  num <- bind $ sum neighbours

Apply the rule of Conway's game of Life.

  isAlive <- bind $
             (oldCell `eq` 0) && (num `eq` 3) ||
             (oldCell `eq` 1) && (num `ge` 2) && (num `le` 3) 

Update the cell according to the rule. The expression select isAlive 1 0, means isAlive ? 1 : 0 in C.

  newCell <- bind $ select isAlive 1 0

Count the number of alive cells, increment the generation, and store the new cell state.

  store population $ reduce Reduce.Sum newCell
  store generation $ gen + 1
  store cell $ newCell

Wednesday, June 6, 2012

Boundary Condition in Paraiso

The shift instruction of Orthotope Machine is displace the array by a constant vector. Because every array has a finite size, the users must set the boundary condition to specify how the out-of-bound access are treated.
Current Paraiso supports two boundary conditions; open boundary and cyclic boundary.
You can choose boundary condition for every dimension independently
by setting a value of the type vector of Condition in boundary field of the Setup data.
Of the two, the cyclic boundary condition is simple. Out-of-bound accesses see the other side of the array.
The open boundary condition is rather complicated. First, the size of margin regions are determined so that any Paraiso kernel can cover the region you said you wanted to calculate. You have specified the region 0 <= i < localSize in the Setup. Next, each kernel start the calculation from the maximum region including the margins, and calculates as large region as possible. The array values out of that region are undefined. Also, the reduction operators always reduce over the entire margined region. You need to manually mask the margins if you want to.
Please look at the sample program to understand how Paraiso deals with the boundary conditions. This program has three kernels:
myKernels :: [Named (Builder Vec1 Int Annotation ())]
myKernels =
  ["init" `isNameOf` do
      store table $ loadIndex (Axis 0)

  ,"increment" `isNameOf` do
      store table $ 1 + load table

  ,"calculate" `isNameOf` do
      center <- bind $ load table
      right  <- bind $ shift (Vec :~ (-1)) center
      left   <- bind $ shift (Vec :~ ( 1)) center
      ret    <- bind $ 10000 * left + 100 * center + right
      store table ret
      store total $ reduce Reduce.Sum ret
  ]
The first kernel initializes the array elements with their respective indices. The second adds 1 to every element. The third reads adjacent cells and perform some calculations. Let's give it a go:
  maker.init();
  dump();
  maker.increment();
  dump();
  maker.calculate();
  dump();
  cout << "total: " << maker.total() << endl;
  cout << endl;
Say make and two program will be generated. They are the same but for the boundary conditions. Under the cyclic boundary condition, you can read across the boundary, as follows:
$ ./main-cyclic.out
index:      0      1      2      3      4      5      6      7
value:      0      1      2      3      4      5      6      7
index:      0      1      2      3      4      5      6      7
value:      1      2      3      4      5      6      7      8
index:      0      1      2      3      4      5      6      7
value:  80102  10203  20304  30405  40506  50607  60708  70801
total: 363636
On the other hand if you use the open boundary condition, Paraiso adds the margin regions for you, and you can access indices beyond 0 and localSize.
$ ./main-open.out
index:     -1      0      1      2      3      4      5      6      7      8
value:     -1      0      1      2      3      4      5      6      7      8
index:     -1      0      1      2      3      4      5      6      7      8
value:      0      1      2      3      4      5      6      7      8      9
index:     -1      0      1      2      3      4      5      6      7      8
value:      0    102  10203  20304  30405  40506  50607  60708  70809      0
total: 283644

Tuesday, June 5, 2012

Builder Monad

The data-flow graph of OM takes three type parameters vector,gauge and anot, so does the Builder Monads. vector represents the dimension of the array (say, three-dimension,) gauge the type of the index of the array (say, Int,) and anot the annotations given to the graph for analysis and optimization purposes. The combination vector gauge represents an index vector (in this case, three-dimensional vector of Int) for accessing the arrays.
type Graph vector gauge anot = Gr (Node vector gauge anot) Edge
Paraiso programs are described by combining Builder Monad s, which are actually State monads each carrying a half-built data-flow graph. Builder Monads take a few vertices from the data-flow graph, and return (usually one) new vertex. The vertices are of type Value, which carry their Realm information --- whether they are Array or Scalar --- and their content type information.
-- | value type, with its realm and content type discriminated in 
--   type level data
  Value rea con = 
  -- | data obtained from the data-flow graph.
  -- 'realm' carries a type-level realm information, 
  -- 'content' carries only type information and its ingredient is 
  -- insignificant and can be 'undefined'.
  FromNode {realm :: rea, content :: con, node :: G.Node} | 
  -- | data obtained as an immediate value.
  -- 'realm' carries a type-level realm information, 
  -- 'content' is the immediate value to be stored.
  FromImm {realm :: rea, content :: con} deriving (Eq, Show)
The monadic building blocks we use in Paraiso are defined in Language.Paraiso.OM.Builder module. Please go to the page, and tap the "Synopsis" tab.

bind trick

We have already seen several Paraiso programs, and you may have noticed the too frequent use of the term bind,
bind :: (Monad m, Functor m) => m a -> m (m a)
bind = fmap return
Like this:
  x <- bind $ loadIndex (Axis 0) 
  y <- bind $ loadIndex (Axis 1) 
  z <- bind $ x*y
  w <- bind $ x+y
The above program originally looked like this:
  x <- loadIndex (Axis 0) 
  y <- loadIndex (Axis 1) 
  z <- return x * return y
  w <- return x + return y
The right hand side expressions are built of Builder monads while the bound names at the left hand side like x,y,z,w are of type Value. So we need to convert pure values to monads using returns. This cannot be helped, because we cannot have x,y and z of the same type in such expressions like z <- x*y. However, there's a solution. By using bind = fmap return, we apply return only once at the binding timing, instead of everywhere else later on.

OM instructions

The instruction set of the Orthotope Machine is defined as algebraic data type Inst in Language.Paraiso.OM.Graph module, and Language.Paraiso.OM.Builder provides the (almost) corresponding Builder Monad combinators. Here is the table of them. Read B as the general monad symbol m. (The actual definition is type B ret = forall v g a. Builder v g a ret.)
Here is the instruction set:
data Inst vector gauge 
  = Load StaticIdx
  | Store StaticIdx
  | Reduce R.Operator 
  | Broadcast 
  | LoadIndex (Axis vector) 
  | LoadSize (Axis vector) 
  | Shift (vector gauge) 
  | Imm Dynamic 
  | Arith A.Operator 
and the corresponding monad combinators:
--           options                    input nodes            output nodes
load      :: Named (StaticValue r c)                        -> B (Value r c)
store     :: Named (StaticValue r c) -> B (Value r c)       -> B ()
reduce    :: Reduce.Operator         -> B (Value TArray c)  -> B (Value TScalar c)
broadcast ::                            B (Value TScalar c) -> B (Value TArray c)
loadIndex :: Axis v                                         -> B (Value TArray g)
loadSize  :: Axis v                                         -> B (Value TScalar g)
shift     :: v g                     -> B (Value TArray c)  -> B (Value TArray c)
imm       :: c                                              -> B (Value r c)
In short,
  • Load takes a static variable, and loads from it, starting the data-flow graph.
  • Store ends the data-flow graph by storing the result to the specified static variable.
  • Reduce turns an array into a scalar value by use of the specified reduction operator.
  • Broadcast does the opposite and turns a scalar value into an array of the same type.
  • LoadIndex is used to retrieve the array index in the specified direction. The result is always an array.
  • LoadSize is used to retrieve the array size in the specified direction. The result is a scalar, of course.
  • Shift is used to move an array by a certain vector v g.
  • Imm is to introduce an immediate value.
  • Arith is for arithmetic operations.
We will gradually see the detail via examples.

Arithmetic

Where are the combinators for the Arith instructions? Well, thanks to algebraic type classes defiend in numeric-prelude, we can write Builder expression using the usual arithmetic operators, in the same manner as writing ordinary expressions.
(+) :: B (Value r c) -> B (Value r c) -> B (Value r c) 
sin :: B (Value r c) -> B (Value r c)
One important exception to this are Boolean operators. Since Haskell's Boolean operators are of type, say, (==) :: Eq a => a -> a -> Bool, we cannot construct a Builder of Bool out of them. Instead, use functions defined in modules Language.Paraiso.OM.Builder.Boolean and Language.Paraiso.Prelude. Also if cannot be re-used, so select instead.
eq :: B (Value r c) -> B (Value r c) -> B (Value r Bool) 
ne :: B (Value r c) -> B (Value r c) -> B (Value r Bool) 
select :: B (Value r Bool) -> B (Value r c) -> B (Value r c) -> B (Value r c)
And here's a cast operator for type conversion. Here, c2 provides only the type information but its value is never used.
cast ::  c2 -> B (Value r c1) -> B (Value r c2)

Monday, May 28, 2012

The first GPU programming in Paraiso

Now let's move on to the GPU. We need only a few modifications to turn the code generator that uses C++ into that that uses CUDA.
$ diff HelloWorld/Generator.hs HelloGPU/Generator.hs
40c40,42
<   { Native.directory = "./dist/"
---
>   { Native.directory = "./dist/" ,
>     Native.language  = Native.CUDA,
>     Native.cudaGridSize = (32,1)
The following Setup specifies that we want to generate CUDA code, and also specifies the CUDA grid dimensions. (The following code will be executed with dimensions <<<32,1>>>.)
mySetup :: Native.Setup Vec2 Int
mySetup =
  (Native.defaultSetup $ Vec :~ 10 :~ 20)
  { Native.directory = "./dist/" ,
    Native.language  = Native.CUDA,
    Native.cudaGridSize = (32,1)
  }
The generated code has been tested to produce the same result as the C++ version using nvcc 4.1 (with its built-in thrust) on TSUBAME 2.0 at Tokyo Institute of Technology.

Wednesday, May 23, 2012

API for Paraiso-generated codes

Our HelloWorld sample generates C++ code under dist folder. First have a look at the header file:

class TableMaker{
private: std::vector  om_s0_table;
private: int om_s1_total;
private: std::vector  om_m0_5;
public:  TableMaker () : om_s0_table(om_memory_size()),
  om_m0_5(om_memory_size()) {

}

Here we declare and allocate the memory for the static variables. Next comes the accessors for the OM size:

public: int om_size ()  {
return 200;
}
public: int om_size_0 ()  {
return 10;
}
public: int om_size_1 ()  {
return 20;
}

Note that all the auxiliary functions that Paraiso generates have name prefix om_. Users should give their variables and kernels unique names not prefixed with om_ to avoid name collisions.

The next series of size functions, om_memory_size(), returns the actual size of memory allocated. If the translated OM contains shift instructions, Paraiso allocates margin region for communications, and om_memory_size() will be greater than om_size().

public: int om_memory_size ()  {
return 200;
}
public: int om_memory_size_0 ()  {
return 10;
}
public: int om_memory_size_1 ()  {
return 20;
}

And here's the margin functions.

public: int om_lower_margin_0 ()  {
return 0;
}
public: int om_lower_margin_1 ()  {
return 0;
}
public: int om_upper_margin_0 ()  {
return 0;
}
public: int om_upper_margin_1 ()  {
return 0;
}

You get a nice element-wise accessor for each static array variable of the OM. There are accessors for scalar variables, too.

public: int & table (int i0, int i1)  {
return (om_s0_table)[((om_lower_margin_0()) + (i0)) + 
  ((om_memory_size_0()) * ((om_lower_margin_1()) + (i1)))];
}
public: int & total ()  {
return om_s1_total;
}

You also get a direct access to the raw data structure of each Array although, what the term "raw" means depends on the implementation detail.

public: std::vector  & table ()  {
return om_s0_table;
}

Finally, there are class methods that correspond to OM kernels.

public: void create () ;
};

Tuesday, May 22, 2012

Writing your first Orthotope Machine Program

Paraiso is language embedded in Haskell, specialized in the domain of explicit solvers of partial differential equations on uniform mesh. The computation of the domain is represented as the programs of the Orthotope Machine(OM). The Orthotope Machine is design to capture all of the parallelism inherent in the original problem. It is free from the order the elements of the array is stored and accessed. The dependencies between the instructions of the machine are clearly specified and independent instructions can be issued in any order or in parallel. Paraiso is the framework built around this OM; it can translate mathematical descriptions of the PDE-solving algorithm to OM programs; it can convert OM programs to native programs in C++ and CUDA; it can automatically tune them using evolutionary computation. Please refer to Paraiso paper for more details.

But in one short sentence, all you need to do is to create a value of type OM and pass it to the code generator.

Dive into the code

Our first sample code examples/HelloWorld/Generator.hs includes everything you need to tell Paraiso to let him generate code for you. Let's look inside it!

The first few lines are something like #include <stdio.h>.

#!/usr/bin/env runhaskell
{-# LANGUAGE NoImplicitPrelude, OverloadedStrings #-}
{-# OPTIONS -Wall #-}

import           Data.Dynamic
import           Data.Tensor.TypeLevel
import qualified Data.Text.IO as T
import           Language.Paraiso.Annotation (Annotation)
import           Language.Paraiso.Generator (generateIO)
import qualified Language.Paraiso.Generator.Native as Native
import           Language.Paraiso.Name
import           Language.Paraiso.OM
import           Language.Paraiso.OM.Builder
import           Language.Paraiso.OM.DynValue as DVal
import           Language.Paraiso.OM.PrettyPrint (prettyPrintA1)
import           Language.Paraiso.OM.Realm 
import qualified Language.Paraiso.OM.Reduce as Reduce
import           Language.Paraiso.Optimization
import           NumericPrelude
import           System.Process (system)

The first part of the sample defines a few names for the code we are going to generate. The type Name, defined in the module Language.Paraiso.Name, is just a newtype for Text defined in Data.Text module. We newtype it to distinguish the identifiers from the rest of the text.

-- the names we use
table, total, create, tableMaker :: Name
table = mkName "table"
total = mkName "total"
create = mkName "create"
tableMaker = mkName "TableMaker"

First, let's look at main routine. The first line creates an output folder. The second line is for pretty-printing the data-flow graphs of our first Orthotope Machine, namely myOM. This is just for your interest; it's not a necessary step for the code generation. The third line is the most important; it combines the myOM with mySetup and generates the C++ code.

main :: IO ()
main = do
  _ <- system "mkdir -p output"
  T.writeFile "output/OM.txt" $ prettyPrintA1 $ myOM
  _ <- generateIO mySetup myOM
  return () 

The data type Setup, defined in Language.Paraiso.Generator.Native, is responsible for all kinds of data needed at native-code generation timing but not needed in defining the abstract Orthotope Machine. Here, we specify the size of the actual OM to be generated, and the folder for the generated library. The default target language for code generation is C++.

mySetup :: Native.Setup Vec2 Int
mySetup = 
  (Native.defaultSetup $ Vec :~ 10 :~ 20)
  { Native.directory = "./dist/" 
  }

Next, let's look at the definition of myOM. To define an OM, pass four arguments, namely its name, the list of its global annotations, the list of its static variables, the list of its kernels, to function makeOM.

myOM :: OM Vec2 Int Annotation
myOM = optimize O3 $
  makeOM tableMaker [] myVars myKernels

myOM has two static variables. One is the array for the multiplication table; the other is for calculating the total of the table. They are Array andScalar Realm, respectively.

myVars :: [Named DynValue]
myVars = [Named table $ DynValue Array (typeOf (undefined::Int)),
          Named total $ DynValue Scalar (typeOf (undefined::Int))]

myOM has one kernel, called create. A kernel of an OM is a bunch of computation that is performed at once.

myKernels :: [Named (Builder Vec2 Int Annotation ())]
myKernels = [Named create createBuilder]

Finally, you can make OM kernels by using Builder Monad combinators. Here, let's make a two-dimensional OM kernel that loads x and y indices and multiply them. Let's also calculate the grand-sum of the table.

createBuilder :: Builder Vec2 Int Annotation ()
createBuilder = do 
  x <- bind $ loadIndex (undefined::Int) (Axis 0) 
  y <- bind $ loadIndex (undefined::Int) (Axis 1) 
  z <- bind $ x*y
  store table z
  store total $ reduce Reduce.Sum z