Daves Software Blog

Friday, 5 April 2013

CUDA examples

I have been working on some CUDA examples for the last month, and I have made some pretty good progress in the areas of Computational Fluid Dynamics and Particle Systems. The techniques used in the example programs are useful for other applications, e.g. the NBody particles example can be modified for any N^2 algorithm that will fit on the GPU. I also looked at the example named "Particles" from the CUDA SDK, which features collisions between a large set of spheres held in a grid. The program computes a grid index for each particle based on its spatial location.

e.g.
if the grid has 8 cells per side and spans 16 units, then the tile size is 16 / 8 = 2
a particle at the point x = 1.5
would have the coordinate given by
gx = (int)(x / tile_size) = (int)(1.5 / 2) = (int)(0.75) = 0 (* see note below)
and therefore resides in cell zero, because each grid cell is of size 2 and so cell zero covers 0 <= x < 2.

The program also computes a hash number based on the grid index, something like

hash = (grid_x * width + grid_y) * depth + grid_z;

This is just the position in the volume array expressed in linear order. The particles are then sorted with a parallel GPU radix sort, using the hash as the sort key. The sorted list can then be examined in parallel by an algorithm that loads (tile + 1) particles into shared memory and finds the boundaries between the grid indexes, so that the start and end positions of each cell in the sorted particle array can be stored. This allows the individual particles to be accessed by grid index.
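A minimal CPU sketch of these steps, assuming a cubic grid (so width and depth are both `cells`) and using std::sort as a stand-in for the GPU radix sort. The names (Particle, cell_hash, find_cell_ranges) are made up for illustration and are not taken from the SDK sample. The loop in find_cell_ranges compares each hash with the one before it, which is roughly what each GPU thread would do once its neighbour's hash is in shared memory:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical types for illustration; not the SDK's own code.
struct Particle { float x, y, z; };

// Cell coordinate along one axis, clamped so stray particles stay inside the grid.
int cell_coord(float p, float tile_size, int cells) {
    int c = static_cast<int>(std::floor(p / tile_size));
    return std::max(0, std::min(c, cells - 1));
}

// Linear index of the particle's cell: the hash from the text, for a cubic grid.
int cell_hash(const Particle& p, float tile_size, int cells) {
    int gx = cell_coord(p.x, tile_size, cells);
    int gy = cell_coord(p.y, tile_size, cells);
    int gz = cell_coord(p.z, tile_size, cells);
    return (gx * cells + gy) * cells + gz;
}

struct CellRanges {
    std::vector<int> start;  // first sorted slot of each cell, -1 if the cell is empty
    std::vector<int> end;    // one past the last sorted slot, -1 if the cell is empty
};

// Sort the hashes, then record where each cell's run begins and ends.
// (The particles themselves would be reordered alongside their hashes.)
CellRanges find_cell_ranges(std::vector<int>& hashes, int cell_count) {
    std::sort(hashes.begin(), hashes.end());  // stand-in for the GPU radix sort
    CellRanges r{std::vector<int>(cell_count, -1), std::vector<int>(cell_count, -1)};
    const int n = static_cast<int>(hashes.size());
    for (int i = 0; i < n; ++i) {
        assert(hashes[i] >= 0 && hashes[i] < cell_count);
        if (i == 0 || hashes[i] != hashes[i - 1]) {
            r.start[hashes[i]] = i;               // a new cell begins here
            if (i > 0) r.end[hashes[i - 1]] = i;  // so the previous cell ends here
        }
    }
    if (n > 0) r.end[hashes[n - 1]] = n;          // close the final run
    return r;
}
```

With 8 cells per side and a tile size of 2, a particle at (1.5, 0, 0) hashes to 0 and one at (3, 0, 0) hashes to 64, so after sorting they end up in different runs of the array.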

Next the grid is accessed in parallel and the particles that inhabit grid cells are compared for collision.
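The per-cell comparison could look something like the sketch below. It assumes the particles are already sorted by cell and that `start` and `end` are the stored range for one cell; all the names are hypothetical. It only compares particles within a single cell, as described above. A fuller version would presumably also have to look at neighbouring cells, since spheres can overlap across a cell boundary.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical particle type for illustration.
struct Sphere { float x, y, z, radius; };

// Two spheres collide when their centres are closer than the sum of their radii.
bool spheres_collide(const Sphere& a, const Sphere& b) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    float reach = a.radius + b.radius;
    return dx * dx + dy * dy + dz * dz < reach * reach;  // compare squared distances
}

// All colliding pairs among sorted[start, end), i.e. within one grid cell.
// On the GPU each cell (or each particle) would be handled by its own thread.
std::vector<std::pair<int, int>> collisions_in_cell(
        const std::vector<Sphere>& sorted, int start, int end) {
    assert(start <= end);
    std::vector<std::pair<int, int>> pairs;
    for (int i = start; i < end; ++i)
        for (int j = i + 1; j < end; ++j)
            if (spheres_collide(sorted[i], sorted[j]))
                pairs.emplace_back(i, j);
    return pairs;
}
```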

Clearly this offers a performance improvement over using an N-Body algorithm for particle collision (although maybe not if you want to compute NBody gravity too). The downside to the procedure outlined above is that the grid cells are hard coded to contain a maximum of 4 particles. This is because the particle radius is set to be half the grid cell size, so at most there could be 4 particles overlapping a cell. More information is available in the Nvidia whitepaper for the demo.

I downloaded a legacy version of the same demo that allows more flexible (although less optimal) usage of the code, including an alternative method that counts particles per grid cell using atomicAdd(), an atomic operation that basically makes the GPU serialize concurrent additions to the same variable. It is a much slower operation; however, the results of playing with the particle system, varying the particle size and count per cell, were much clearer and easier using the old code. Perhaps I did not completely understand how to modify the newer and more optimized kernel to perform the same operations.
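The counting approach could be sketched on the CPU as below, with std::atomic's fetch_add standing in for CUDA's atomicAdd() (both return the value held before the addition). The function name, the fixed-size slot layout and `max_per_cell` are assumptions for illustration, not the legacy demo's actual code. The body of the loop is what one GPU thread would run for its own particle:

```cpp
#include <atomic>
#include <cassert>
#include <vector>

struct CellLists {
    std::vector<int> counts;  // particles seen per cell (may exceed max_per_cell)
    std::vector<int> slots;   // cell * max_per_cell + k -> particle index, -1 if unused
};

CellLists bin_particles(const std::vector<int>& cell_of_particle,
                        int cell_count, int max_per_cell) {
    std::vector<std::atomic<int>> counters(cell_count);
    for (auto& c : counters) c.store(0);
    CellLists out{std::vector<int>(cell_count, 0),
                  std::vector<int>(cell_count * max_per_cell, -1)};
    const int n = static_cast<int>(cell_of_particle.size());
    for (int i = 0; i < n; ++i) {
        int cell = cell_of_particle[i];
        assert(cell >= 0 && cell < cell_count);
        // The old count doubles as this particle's slot within the cell.
        int k = counters[cell].fetch_add(1);
        // Particles beyond the per-cell limit are counted but not stored.
        if (k < max_per_cell) out.slots[cell * max_per_cell + k] = i;
    }
    for (int c = 0; c < cell_count; ++c) out.counts[c] = counters[c].load();
    return out;
}
```

Because every particle landing in the same cell has to take its turn on the same counter, crowded cells are where the serial cost shows up.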

Anyway ... I also rendered the grid into a volume texture, and rendered that using a ray marching algorithm in Cg ... the particles look much better as a set of spheres.

(* to be more precise, the formula should say gx = floor(x / tile_size); instead of just casting to int, because a cast truncates toward zero and so gives the wrong cell for negative x)
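The note above can be checked with a minimal sketch (the function names are hypothetical). The two versions agree for positive coordinates and differ only for negative ones:

```cpp
#include <cassert>
#include <cmath>

// Truncates toward zero, so a quotient of -0.25 becomes 0.
int cell_by_cast(float x, float tile_size) {
    assert(tile_size > 0.0f);
    return static_cast<int>(x / tile_size);
}

// Rounds down, so a quotient of -0.25 becomes -1.
int cell_by_floor(float x, float tile_size) {
    assert(tile_size > 0.0f);
    return static_cast<int>(std::floor(x / tile_size));
}
```

With a tile size of 2, x = 1.5 lands in cell 0 either way, but x = -0.5 lands in cell 0 with the cast and in cell -1 with floor.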

Friday, 8 February 2013

Problems that CUDA can't solve

The answer was obvious: of course there are problems that CUDA isn't good for.

One example is chess. The classic question of whether a move is legal cannot be answered without the move history or the current board state.

For example, the sequence

1.e4 e5 2.Nf3 Nc6 3.Bc4 Qxa1

has an obvious illegal move, Qxa1. There is no real point in making a thread for each move individually, because the state of the board before Qxa1 determines the move's legality. The same is true for many types of state machine, and also for things similar to state machines, including Markov chains.

Sunday, 3 February 2013

Dynamic Programming

I think Dynamic Programming (see Wikipedia) would be a very worthwhile technique to learn and put into practice, because so many problems are solvable with it. The problem with coding using these techniques is similar to the problem of applying optimization algorithms to a real world problem: working out what to use for inputs can be quite confusing, and the best way to learn seems to be to do loads of examples. I still can't whip out a GA with an auto-tuning NN to solve any random problem I encounter (e.g. Rage-style Mega-textures, or optimal rendering).

Making code parallel

Various algorithms do not seem well suited to parallel architectures, or at least do not use the parallel architecture to its full potential.

This is one reason why the current computer architecture using a (multi-core) CPU and parallel GPU(s) is particularly elegant. One CPU core can be reserved for things like the operating system background processes, the user interface, and the ability to halt execution of a process. Also some processes would be best on the CPU ... right?

In my search for examples, one example I thought would be best suited to a CPU process was simulated annealing, because the algorithm is iterative and each loop depends on the last, so the loops would not be easily computed in parallel. I based my assumption on this nice implementation on Google Code. However, running a Google search shows that simulated annealing can be optimized for CUDA.

Link

So I was wrong. I am still searching for an example program that should not be in parallel.
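For reference, the loop dependency that made simulated annealing look like a CPU-only job can be seen in a minimal sketch. This is not the Google Code implementation or the CUDA version mentioned above; the objective function, the step size and the geometric cooling schedule are all assumptions chosen for illustration:

```cpp
#include <cassert>
#include <cmath>
#include <random>

struct AnnealResult { double best_x; double best_energy; };

// Example objective with its minimum at x = 3 (chosen only for illustration).
double energy(double x) { return (x - 3.0) * (x - 3.0); }

AnnealResult anneal(double x0, int steps, unsigned seed) {
    assert(steps >= 0);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> step(-0.5, 0.5);
    std::uniform_real_distribution<double> unit(0.0, 1.0);

    double x = x0, e = energy(x0);
    AnnealResult best{x, e};
    double temperature = 1.0;
    for (int i = 0; i < steps; ++i) {
        // The proposal starts from the state left by the previous iteration,
        // which is why this loop cannot simply be split across threads.
        double candidate = x + step(rng);
        double ce = energy(candidate);
        // Always accept improvements; accept a worse state with
        // probability exp(-(ce - e) / temperature).
        if (ce <= e || unit(rng) < std::exp(-(ce - e) / temperature)) {
            x = candidate;
            e = ce;
        }
        if (e < best.best_energy) best = AnnealResult{x, e};
        temperature *= 0.999;  // geometric cooling (an assumed schedule)
    }
    return best;
}
```

A single chain like this is inherently serial, so a GPU version presumably finds its parallelism elsewhere, for example by running many independent chains at once or by evaluating an expensive energy function in parallel.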

Monday, 14 January 2013

So massive parallelism is the way forward. I am so surprised that society does not conform to the philosophy in its treatment of the unemployed. The work sector in computing is either 40k+ for normal jobs, game jobs for 20k and less, or nothing. Why not employ more coders for less pay? That would get the job done faster, especially if they get a bonus for competition.

Saturday, 12 January 2013

Upgraded the GeForce 210 to a GeForce 440 on the desktop computer, which cost £40. I had to buy 2nd hand because of potential compatibility issues with the PCI-E 3.0 cards in the shops: the desktop is still using the nForce4 chipset, so 2nd hand was the best option. Fortunately there was a Fermi card available in the 2nd hand hardware store.

I am actually quite out of date in video card programming terms, as the old 210 had limited functionality. Another downside to using the old computer is the lack of UVA (Unified Virtual Addressing), which I really think would be interesting. The reason I can't use it is that UVA needs a 64 bit operating system, so even if it would work on the nForce4 it is currently impossible since I am running XP. I am considering investigating UVA on Linux, however.

The software model I am interested in has UVA with the database in virtual memory. However, the VM will need to be loaded in megabyte chunks because you need to free up VM sometimes. This is fine because you can simply keep an index file open all the time, then read from the index file to locate blocks stored in the "Tiled" virtual memory files. After a block has been accessed, it is either kept open for future use or removed from use.
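A rough sketch of how the "keep it or remove it" part might work, assuming a least-recently-used eviction policy (my assumption; the plan above does not fix one). The `load` callback stands in for looking a block up in the index file and reading it from a tile file, and every name here is hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

using Block = std::vector<std::uint8_t>;  // one chunk, e.g. a megabyte of data

class BlockCache {
public:
    BlockCache(std::size_t max_blocks, std::function<Block(int)> load)
        : max_blocks_(max_blocks), load_(std::move(load)) {
        assert(max_blocks > 0);
    }

    const Block& get(int id) {
        auto it = pos_.find(id);
        if (it != pos_.end()) {
            // Already resident: move it to the front as most recently used.
            lru_.splice(lru_.begin(), lru_, it->second);
            return it->second->second;
        }
        if (lru_.size() >= max_blocks_) {
            // Free memory by dropping the least recently used block.
            pos_.erase(lru_.back().first);
            lru_.pop_back();
        }
        lru_.emplace_front(id, load_(id));
        pos_[id] = lru_.begin();
        return lru_.front().second;
    }

    bool resident(int id) const { return pos_.count(id) != 0; }

private:
    using Entry = std::pair<int, Block>;
    std::size_t max_blocks_;
    std::function<Block(int)> load_;
    std::list<Entry> lru_;  // front = most recently used
    std::unordered_map<int, std::list<Entry>::iterator> pos_;
};
```

The index stays open the whole time, while only `max_blocks` chunks are ever held in memory at once.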

There is a lot of low level coding involved with this project.

And it's my birthday.

Friday, 4 January 2013

I have arrived at a conclusion about computer programming .... Google wins. To make a program, simply open Google, type the name of whatever it is you want to create and hit the button. You should see a list of pages with programs and alternatives. No need to code, simply download :)