Daves Software Blog: April 2013

Tuesday, 23 April 2013

GPU Tree Code examples and applications

I haven't been very active using CUDA recently. I need a lot of energy to actually work on CUDA programs because my cuda coding environment is based in windows xp and everything is running really slowly with MSVC 2010. Also I have problems tracking down errors via the compiler, so sometimes a project will have an error that won't show up in the list of build incidents.

Anyway I thought I'd share some information about GPU tree codes.

A good resource for tree based N-Body is here
http://castle.strw.leidenuniv.nl/software/bonsai-gpu-tree-code.html

This is more like a multi-grid (I think). The multi-grid method is useful in computational fluid dynamics (CFD) and computational cosmology (CC). It is often convenient to only process N-body interactions between particles or vortons inhabiting the same grid sector, or nearby grid sectors, because the center of gravity or center of mass will be a good enough approximation for distant grid cells. For this reason these algorithms have become very popular on CUDA, as they allow users to compute real-time systems with many more particles than a pure brute force N-body calculation.

Another thing to search for is the Barnes-Hut algorithm on GPU. This is a famous multi-grid tree algorithm. In fact I think the code in the link above is based on Barnes-Hut.

Another interesting CUDA application for simulation is named OpenCurrent (see google). Basically this is an industrial strength CFD solver built in CUDA, and so its a seriously useful building block for any CFD simulations, including smoke, fire and others. Incidently, CudaFire is built upon OpenCurrent - I spent a long time trying to compile CudaFire and eventually started a project based on a copy of the CUDA project "FluidsGL", because the title was mentioned in the cuda fire build, and added all the files into the project. I had to change the directory structure, had to download TinyXML and then I had to copy the entire OpenCurrent source codes into the project directory and add those. Then I had to build without debug information - because my computer ran out of memory for building the .exe. The program had a couple of errors too, a thrust functor was constructed with 3 floats as the template argument list instead of 3 float4 objects. Also there was a big linking problem that I solved with some trickery, including forcing it to build with multiply defined symbols. Eventually the program worked, but broke instantly. I saw a bunch of particles in a quad, then the screen went blank, I could cycle through modes to see the slice renderer, the cube etc, but there was nothing going on - i.e. no simulation just a blank space.

Gotta go for a while, anyway watch this space for more information.

Thursday, 18 April 2013

Stack Recursion

CUDA doesn't like recursion, so programmers need to use other techniques to perform recursion. The best way I know about is to use a stack object and a loop. Here is an example from this page ...

http://stackoverflow.com/questions/159590/way-to-go-from-recursion-to-iteration

void foo(Node* node)
{
    if(node == NULL)
       return;
    // Do something with node...
    foo(node->left);
    foo(node->right);
}
can be converted to

void foo(Node* node)
{
    if(node == NULL)
       return;

    // Do something with node...

    stack.push(node->right);
    stack.push(node->left);

    while(!stack.empty()) {
         node1 = stack.pop();
         if(node1 == NULL)
            continue;
         // Do something with node1...
         stack.push(node1->right);
         stack.push(node1->left);
    }

}

I really think this best illustrates the method I would want to use for recursion. The stack object in the loop version is a FILO storage structure (first in last out) and that is why the order of pushing the objects onto the stack is reversed compared to the recursive version.

Pretty simple, and very accurate explanation.

Friday, 5 April 2013

CUDA examples

I have been working on some CUDA examples for the last month, I have made some pretty good progress in the region of Computational Fluid Dynamics and Particle Systems. The techniques used in the example programs are useful for other applications, e.g the NBody particles example can be modified for any N^2 algorithm that will fit on the GPU. I also looked at the example named "Particles" from the CUDA sdk, this features collision of a large set of spheres based in a grid. The program computes a grid index for each particle based on their spacial location

e.g.
if the grid was size 8 and the scale of the grid was 16, then the tile size is 2
a particle at the point 1.5
would have the coordinate given by
gx = (int) (x / tile_size) = (int)(1.5 / 2) = (int)(0.75) = 0 (*see note below)
and therefore resides in node zero, this follows because each grid block is of size 2;

and also a hash number based on the grid index ... something like

hash = (grid_x * width + grid_y) * depth + grid_z;

this is just the volume array expressed in linear numerical order. The particles are then sorted using a GPU parallel radix sort using the hash as the number to sort them by. This list of particles can then be examined in parallel by an algorithm that loads (tile + 1) particles into shared memory and then finds the boundaries between the grid indexes, so that the start and end integers that index the sorted particle array can be stored to allow access to the individual particles based on grid index.

Next the grid is accessed in parallel and the particles that inhabit grid cells are compared for collision.

Clearly this offers a performance improvement to using an N-Body algorithm for particle collision (although maybe not if you want to compute NBody gravity too). The downside to the procedure outlined above is that the grid cells are hard coded to contain a maximum of 4 particles, this is because the particle radius is set to be half the grid cell size, so at most there could be 4 particles overlapping. More information is available in the Nvidia whitepaper for the demo.

I downloaded a legacy version of the same demo that allows more flexible (although less optimal) usage of the code, including an alternative method that sums particles per grid cell using a parallel method called atomicAdd() that basically tells the GPU to add in serial for the variable used. It is a much slower operation, however the results of playing with the particle system, varying the particle size and count per cell, were much clearer and easier using the old code. Perhaps I did not completely understand how to modify the newer and more optimized Kernel to perform the same operations.

Anyway ... I also rendered the grid into a volume texture, and rendered it using a ray march algorithm in CG ... the particles look much better as a set of spheres.

(* to be more precise, the formula could say gx = floor( x / tile_size ); instead of just casting to int)