Daves Software Blog: CUDA examples

I have been working on some CUDA examples for the last month, I have made some pretty good progress in the region of Computational Fluid Dynamics and Particle Systems. The techniques used in the example programs are useful for other applications, e.g the NBody particles example can be modified for any N^2 algorithm that will fit on the GPU. I also looked at the example named "Particles" from the CUDA sdk, this features collision of a large set of spheres based in a grid. The program computes a grid index for each particle based on their spacial location

e.g.
if the grid was size 8 and the scale of the grid was 16, then the tile size is 2
a particle at the point 1.5
would have the coordinate given by
gx = (int) (x / tile_size) = (int)(1.5 / 2) = (int)(0.75) = 0 (*see note below)
and therefore resides in node zero, this follows because each grid block is of size 2;

and also a hash number based on the grid index ... something like

hash = (grid_x * width + grid_y) * depth + grid_z;

this is just the volume array expressed in linear numerical order. The particles are then sorted using a GPU parallel radix sort using the hash as the number to sort them by. This list of particles can then be examined in parallel by an algorithm that loads (tile + 1) particles into shared memory and then finds the boundaries between the grid indexes, so that the start and end integers that index the sorted particle array can be stored to allow access to the individual particles based on grid index.

Next the grid is accessed in parallel and the particles that inhabit grid cells are compared for collision.

Clearly this offers a performance improvement to using an N-Body algorithm for particle collision (although maybe not if you want to compute NBody gravity too). The downside to the procedure outlined above is that the grid cells are hard coded to contain a maximum of 4 particles, this is because the particle radius is set to be half the grid cell size, so at most there could be 4 particles overlapping. More information is available in the Nvidia whitepaper for the demo.

I downloaded a legacy version of the same demo that allows more flexible (although less optimal) usage of the code, including an alternative method that sums particles per grid cell using a parallel method called atomicAdd() that basically tells the GPU to add in serial for the variable used. It is a much slower operation, however the results of playing with the particle system, varying the particle size and count per cell, were much clearer and easier using the old code. Perhaps I did not completely understand how to modify the newer and more optimized Kernel to perform the same operations.

Anyway ... I also rendered the grid into a volume texture, and rendered it using a ray march algorithm in CG ... the particles look much better as a set of spheres.

(* to be more precise, the formula could say gx = floor( x / tile_size ); instead of just casting to int)

Daves Software Blog

Friday, 5 April 2013

CUDA examples

No comments:

Post a Comment