Sunday, March 9, 2014

I have a program where I do a bunch of calculations on the GPU, then I do memory operations with those results on the CPU, then I take the next batch of data and do the same all over again. Now it would be a lot faster if I could do the first set of calculations and then start on the second batch whilst my CPU churned away at the memory operations. How would I do that?
 
"memory operations" - you mean the transfer of the data to/from the card, or something independent of the card? Don't you just need one thread to manage the card then you can do whatever you like with the rest of the CPU? –  RupJun 22 '11 at 9:32
 
Well both I guess. I transfer the results to the host and then do something with the data, so it's not really independent. –  foges Jun 22 '11 at 10:43

2 Answers

All CUDA kernel calls (e.g. function<<<...>>>()) are asynchronous -- they return control immediately to the calling host thread. Therefore you can always perform CPU work in parallel with GPU work just by putting the CPU work after the kernel call.
If you also need to transfer data from GPU to CPU at the same time, you will need a GPU that has the deviceOverlap field set to true (check using cudaGetDeviceProperties()), and you need to use cudaMemcpyAsync() from a separate CUDA stream.
The NVIDIA CUDA SDK includes examples that demonstrate this functionality, for example "simpleStreams" and "asyncAPI".
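A minimal sketch of what this answer describes, assuming a hypothetical kernel myKernel and made-up buffer names (none of them appear in the thread): the kernel launch and the device-to-host copy are queued on a non-default stream, the CPU keeps working in the meantime, and cudaStreamSynchronize() is the point where the host actually waits for the results.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void myKernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;   // stand-in computation
    }

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("deviceOverlap = %d\n", prop.deviceOverlap);   // 1 if copy/compute overlap is supported

        const int N = 1 << 20;
        const size_t bytes = N * sizeof(float);
        float *d_in, *d_out, *h_out;
        cudaMalloc((void**)&d_in, bytes);
        cudaMalloc((void**)&d_out, bytes);
        cudaMallocHost((void**)&h_out, bytes);   // async copies need page-locked (pinned) host memory
        cudaMemset(d_in, 0, bytes);              // input fill omitted in this sketch

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // The launch returns immediately; the copy is queued behind the kernel in the same stream.
        myKernel<<<(N + 255) / 256, 256, 0, stream>>>(d_in, d_out, N);
        cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);

        // ... CPU work placed here runs concurrently with the GPU ...

        cudaStreamSynchronize(stream);           // wait until the results are actually in h_out

        cudaStreamDestroy(stream);
        cudaFreeHost(h_out);
        cudaFree(d_out);
        cudaFree(d_in);
        return 0;
    }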
 
Ok, thanks, didn't know that. In that case if I have: myCUDAfunc<<<...>>>(); cudaMemcpy(); how do I know that myCUDAfunc() has finished before cudaMemcpy starts copying the results back? –  foges Jun 22 '11 at 10:30
 
There are functions to synchronize, like cudaDeviceSynchronize, cudaStreamSynchronize and cudaEventSynchronize. They are pretty easy to use, but you should check the CUDA C Programming Guide from NVIDIA, available on the CUDA downloads page. It's only 187 pages and gives you much-needed information on working with CUDA. –  jmsu Jun 22 '11 at 10:41
 
@user810045, to be clear: CUDA has the concept of "streams". A stream is a sequence of CUDA runtime API commands that are executed in the order they are issued. If you don't specify a stream identifier to a kernel (the optional 4th parameter in the <<<>>>), it runs in the default stream, which is always synchronous with other default stream CUDA commands. cudaMemcpy, which is also synchronous, will always therefore execute after the kernel if it is called after the kernel. If you want an asynchronous copy, you have to use cudaMemcpyAsync, which requires a stream identifier argument. –  harrism Jun 27 '11 at 1:00 
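A hedged fragment illustrating the ordering these comments describe; myCUDAfunc, d_data, h_data, bytes, blocks, threads and doUnrelatedCpuWork are placeholders rather than names from the thread, and allocation/setup is omitted.

    // Default stream: the copy is issued after the kernel, so it waits for the kernel to
    // finish, and cudaMemcpy itself blocks the CPU until the data has landed in h_data.
    myCUDAfunc<<<blocks, threads>>>(d_data);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    // To overlap CPU work with the kernel instead: launch, do the CPU work,
    // then wait explicitly before touching the results.
    myCUDAfunc<<<blocks, threads>>>(d_data);
    doUnrelatedCpuWork();            // runs while the kernel executes
    cudaDeviceSynchronize();         // block until all queued GPU work has finished
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);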
The basic idea can be something like this:
  • Do 1st batch of calculations on GPU
  • Enter a loop: {
    Copy results from device mem to host mem
    Do next batch of calculations on the GPU (the kernel launch is asynchronous and control returns immediately to the CPU)
    Process results of the previous iteration on CPU
    }
  • Copy results from last iteration from device mem to host mem
  • Process results of last iteration
You can get finer control over asynchronous work between the CPU and GPU by using cudaMemcpyAsync together with CUDA streams (cudaStream_t) and events (cudaEvent_t).
As @harrism said you need your device to support deviceOverlap to do memory transfers and execute kernels at the same time but even if it does not have that option you can at least execute a kernel asynchronously with other computations on the CPU.
Edit: deviceOverlap has been deprecated; one should use the asyncEngineCount property instead.
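A minimal sketch of the loop described above, assuming the per-batch kernel, the CPU-side processing and the input staging are the hypothetical placeholders processBatch, consumeOnCPU and loadBatch (none of these come from the answer). Because the plain cudaMemcpy runs in the default stream, it implicitly waits for the previous kernel, so each CPU processing step overlaps with the next GPU batch.

    loadBatch(0, d_in);                                   // stage the first batch on the device
    processBatch<<<grid, block>>>(d_in, d_out, N);        // batch 0 on the GPU

    for (int b = 1; b < numBatches; ++b) {
        // Waits for the previous kernel, then blocks until the results are on the host.
        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
        loadBatch(b, d_in);                               // stage the next batch
        processBatch<<<grid, block>>>(d_in, d_out, N);    // launch returns immediately
        consumeOnCPU(h_out);                              // CPU handles batch b-1 while the GPU runs batch b
    }

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // results of the last batch
    consumeOnCPU(h_out);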
 
Yup jmsu, that's exactly what I want to do. Glad to know that will work :) –  foges Jun 22 '11 at 10:36
