Daves Software Blog

Sunday, 12 May 2013

C++ 11

Well, I took a quick break from CUDA and GPU programming to look at C++ 11 programming.

The new C++ features are actually quite interesting. The things that I really appreciated included

- the auto keyword,
- the ability to define class member variables in the class definition,
- lamda's,
- "using" instead of "typedef"
- inbuilt types for int32_t and int64_t (and the others)
- the constexpr keyword
- the "move" operator - this is really useful
- pointer types ( shared_prt<> ) with the special destruction behavior
- the concurrency model, C++ really needed decent threading support for a long time

to anyone reading this who needs more ... I really suggest downloading a complete guide or cheat sheet for the language.

Tuesday, 23 April 2013

GPU Tree Code examples and applications

I haven't been very active using CUDA recently. I need a lot of energy to actually work on CUDA programs because my cuda coding environment is based in windows xp and everything is running really slowly with MSVC 2010. Also I have problems tracking down errors via the compiler, so sometimes a project will have an error that won't show up in the list of build incidents.

Anyway I thought I'd share some information about GPU tree codes.

A good resource for tree based N-Body is here
http://castle.strw.leidenuniv.nl/software/bonsai-gpu-tree-code.html

This is more like a multi-grid (I think). The multi-grid method is useful in computational fluid dynamics (CFD) and computational cosmology (CC). It is often convenient to only process N-body interactions between particles or vortons inhabiting the same grid sector, or nearby grid sectors, because the center of gravity or center of mass will be a good enough approximation for distant grid cells. For this reason these algorithms have become very popular on CUDA, as they allow users to compute real-time systems with many more particles than a pure brute force N-body calculation.

Another thing to search for is the Barnes-Hut algorithm on GPU. This is a famous multi-grid tree algorithm. In fact I think the code in the link above is based on Barnes-Hut.

Another interesting CUDA application for simulation is named OpenCurrent (see google). Basically this is an industrial strength CFD solver built in CUDA, and so its a seriously useful building block for any CFD simulations, including smoke, fire and others. Incidently, CudaFire is built upon OpenCurrent - I spent a long time trying to compile CudaFire and eventually started a project based on a copy of the CUDA project "FluidsGL", because the title was mentioned in the cuda fire build, and added all the files into the project. I had to change the directory structure, had to download TinyXML and then I had to copy the entire OpenCurrent source codes into the project directory and add those. Then I had to build without debug information - because my computer ran out of memory for building the .exe. The program had a couple of errors too, a thrust functor was constructed with 3 floats as the template argument list instead of 3 float4 objects. Also there was a big linking problem that I solved with some trickery, including forcing it to build with multiply defined symbols. Eventually the program worked, but broke instantly. I saw a bunch of particles in a quad, then the screen went blank, I could cycle through modes to see the slice renderer, the cube etc, but there was nothing going on - i.e. no simulation just a blank space.

Gotta go for a while, anyway watch this space for more information.

Thursday, 18 April 2013

Stack Recursion

CUDA doesn't like recursion, so programmers need to use other techniques to perform recursion. The best way I know about is to use a stack object and a loop. Here is an example from this page ...

http://stackoverflow.com/questions/159590/way-to-go-from-recursion-to-iteration

void foo(Node* node)
{
    if(node == NULL)
       return;
    // Do something with node...
    foo(node->left);
    foo(node->right);
}
can be converted to

void foo(Node* node)
{
    if(node == NULL)
       return;

    // Do something with node...

    stack.push(node->right);
    stack.push(node->left);

    while(!stack.empty()) {
         node1 = stack.pop();
         if(node1 == NULL)
            continue;
         // Do something with node1...
         stack.push(node1->right);
         stack.push(node1->left);
    }

}

I really think this best illustrates the method I would want to use for recursion. The stack object in the loop version is a FILO storage structure (first in last out) and that is why the order of pushing the objects onto the stack is reversed compared to the recursive version.

Pretty simple, and very accurate explanation.

Friday, 5 April 2013

CUDA examples

I have been working on some CUDA examples for the last month, I have made some pretty good progress in the region of Computational Fluid Dynamics and Particle Systems. The techniques used in the example programs are useful for other applications, e.g the NBody particles example can be modified for any N^2 algorithm that will fit on the GPU. I also looked at the example named "Particles" from the CUDA sdk, this features collision of a large set of spheres based in a grid. The program computes a grid index for each particle based on their spacial location

e.g.
if the grid was size 8 and the scale of the grid was 16, then the tile size is 2
a particle at the point 1.5
would have the coordinate given by
gx = (int) (x / tile_size) = (int)(1.5 / 2) = (int)(0.75) = 0 (*see note below)
and therefore resides in node zero, this follows because each grid block is of size 2;

and also a hash number based on the grid index ... something like

hash = (grid_x * width + grid_y) * depth + grid_z;

this is just the volume array expressed in linear numerical order. The particles are then sorted using a GPU parallel radix sort using the hash as the number to sort them by. This list of particles can then be examined in parallel by an algorithm that loads (tile + 1) particles into shared memory and then finds the boundaries between the grid indexes, so that the start and end integers that index the sorted particle array can be stored to allow access to the individual particles based on grid index.

Next the grid is accessed in parallel and the particles that inhabit grid cells are compared for collision.

Clearly this offers a performance improvement to using an N-Body algorithm for particle collision (although maybe not if you want to compute NBody gravity too). The downside to the procedure outlined above is that the grid cells are hard coded to contain a maximum of 4 particles, this is because the particle radius is set to be half the grid cell size, so at most there could be 4 particles overlapping. More information is available in the Nvidia whitepaper for the demo.

I downloaded a legacy version of the same demo that allows more flexible (although less optimal) usage of the code, including an alternative method that sums particles per grid cell using a parallel method called atomicAdd() that basically tells the GPU to add in serial for the variable used. It is a much slower operation, however the results of playing with the particle system, varying the particle size and count per cell, were much clearer and easier using the old code. Perhaps I did not completely understand how to modify the newer and more optimized Kernel to perform the same operations.

Anyway ... I also rendered the grid into a volume texture, and rendered it using a ray march algorithm in CG ... the particles look much better as a set of spheres.

(* to be more precise, the formula could say gx = floor( x / tile_size ); instead of just casting to int)

Friday, 8 February 2013

Problems that CUDA can't solve

The answer was obvious, of course there are problems that CUDA isn't good for.

One example is Chess. The classic case of answering the question as to whether a move is legal, cannot be solved without the move history information, or the current board state.

For example, the sequence

1.e4 e5 2.Nf3 Nc6 3.Bc4 Qxa1

has an obvious illegal move, Qxa1. There is no real point in making a thread for each move individually
because the state of the board before Qxa1 determines the moves legality. The same is true for many types of state machine, and also for things similar to state machines, including Markov chains.

Sunday, 3 February 2013

Dynamic Programming

I think learning about Dynamic Programming (see wikipedia) would be a very optimal technique to learn and put into practice, because so many problems are solvable with these techniques. The problem with coding using these techniques is similar to the problem of applying optimization algorithms to a real world problem - working out what to use for inputs can be quite confusing, and the general best way to learn it seems to be to do loads of examples. I still can't whip out a GA with an auto-tuning NN to solve any random problem I encounter (e.g Rage-style Mega-textures, or optimal rendering).

Making code parallel

Various algorithms would not seem to be good on parallel architectures, or at least are not using the parallel architecture to its full potential.

This is one reason why the current computer architecture using a (multi-core) CPU and parallel GPU (s) is particularly elegant. One CPU core can be reserved for processing things like the operating system background processes, the user interface, and the ability to halt execution of a process. Also some processes would be best on the CPU ... right ?

In my search for examples, one example I thought would be best suited to a CPU process was simulated annealing, because the algorithm is iterative and each loop depends on the last, so the loops would not be easily computed in parallel. I based my assumption on this nice implementation on google code. However running a google search shows that Simulated annealing can be optimized for CUDA

Link

So I was wrong. I am still searching for an example program that should not be in parallel