Some notes about using CUDA Fortran
Cremona, November 2020.
This page is in the "junk" directory! Make sure you check some better material as well, on
the main page
:-)
So you want to move your Fortran code to the GPU.
That's great!
Unfortunately, as of November 2020, using NVIDIA CUDA Fortran is not super intuitive.
The user base is clearly much smaller than that of C++ CUDA, and this is reflected in much less
documentation being available online.
It may happen (quite often, actually) that you compile a program and it runs without errors, but no output
is produced, or only rubbish.
In such cases, it is important that you know the following.
Here are a few notes that will hopefully help you along the way.
I will skip all the very basics (installing CUDA Fortran, programming in CUDA etc etc)
and only discuss some further aspects.
If you need a tutorial on CUDA, check out the channel "Creel" on youtube!
This guy prepared an amazing tutorial on CUDA (C++, but you can easily port the concepts to Fortran).
Part 1: compiling and debugging
The very basics: getting info about the device
Ok, this is basic, but let's remark this anyway.
In case you want to get some detailed info about your device (such as cache size, driver, compute capability etc etc)
you can use the deviceQuery script.
It is shipped in the CUDA examples (look in your installation directory, inside the "Utilities" folder).
You need to compile it before using it, and this will create a useful executable.
Here's what I get:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Tesla K20Xm"
CUDA Driver Version / Runtime Version 11.0 / 10.2
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 5701 MBytes (5977800704 bytes)
(14) Multiprocessors, (192) CUDA Cores/MP: 2688 CUDA Cores
GPU Max Clock rate: 732 MHz (0.73 GHz)
Memory Clock rate: 2600 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.0, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS
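If you prefer, you can also query the same information from CUDA Fortran itself, through the cudafor module. Here's a minimal sketch (the program name is mine; the cudaDeviceProp fields shown are the standard ones mirroring the C runtime struct):

```fortran
! Minimal sketch: query device properties directly from CUDA Fortran.
program query
  use cudafor
  implicit none
  type(cudaDeviceProp) :: prop
  integer :: istat, ndev

  istat = cudaGetDeviceCount(ndev)
  print *, 'Number of CUDA devices: ', ndev

  istat = cudaGetDeviceProperties(prop, 0)   ! query device 0
  print *, 'Device name:        ', trim(prop%name)
  print *, 'Compute capability: ', prop%major, prop%minor
  print *, 'Multiprocessors:    ', prop%multiProcessorCount
  print *, 'Max threads/block:  ', prop%maxThreadsPerBlock
  print *, 'Registers/block:    ', prop%regsPerBlock
end program query
```

This is handy when your program needs to adapt its launch configuration to whatever GPU it finds at run time, instead of hard-coding the numbers from deviceQuery.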
Compiling for a given compute capability (or it won't run)
Again, basic stuff.
Maybe you have tried compiling your program, the compiler didn't complain and you got your executable, but when you ran it
nothing happened.
If so, did you compile for the compute capability supported by your GPU?
First, nvfortran is not limited to CUDA: it can also compile for OpenACC and other models.
So, you first need to specify the -cuda option.
Then, you specify the compute capability.
For my Tesla K20X the compute capability is 3.5, so I specify: -gpu=cc35.
nvfortran -cuda -gpu=cc35 myprogram.cuf
You can find more info by running "nvfortran --help".
Using single or double precision
Most GPUs work better with single precision floating point numbers, but sometimes you need to use double
precision.
There's a useful pair of flags for that:
if you add "-r4" or "-r8" to the compile options, the compiler will
interpret "real" as single or double precision respectively.
You just declare variables as "real", without writing "real(kind=4)" or "real(kind=8)" explicitly, and the
compiler will do everything by itself.
This comes in handy.
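A portable alternative (in case you don't want your precision to depend on a compiler flag) is to define a working-precision kind parameter once and use it everywhere. A sketch, with module and kind names of my own choosing:

```fortran
! Sketch: a working-precision kind parameter, as an alternative to -r4/-r8.
module precision_mod
  implicit none
  integer, parameter :: sp = kind(1.0)    ! single precision
  integer, parameter :: dp = kind(1.0d0)  ! double precision
  integer, parameter :: wp = dp           ! working precision: switch here
end module precision_mod

program demo
  use precision_mod
  implicit none
  real(wp) :: x
  x = 1.0_wp / 3.0_wp   ! the _wp suffix keeps literals in working precision too
  print *, x
end program demo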
Debugging with cuda-memcheck
The cuda-memcheck tool is absolutely a must-know.
Again, assume your code compiles fine, but once you launch it you don't get the expected results
(or maybe the program behaves as if it were not launching the kernels, for some reason).
In all such cases, you should try to run your program using:
cuda-memcheck ./myprogram.exe
This will tell you immediately if something is going wrong or if the program runs smoothly.
For example, cuda-memcheck could tell you that some kernels are actually not being launched for some reason, or it may
warn you that too many resources are requested, causing a kernel launch to fail (maybe too many registers are being used;
see below).
So, use cuda-memcheck!!
And after that, if you want to dig more into your errors, use cuda-gdb.
Debugging with cuda-gdb
Mom Nvidia provides the cuda-gdb package to help us debug our codes.
First, you should compile with debug symbols.
As you may expect, you need the option "-g" for enabling debugging symbols on the host side (the CPU).
You may also need to generate debug symbols on the device side (the GPU), adding the "debug" option to the "-gpu" flag.
Since you have probably already specified a compute capability, you need to separate these by a comma:
nvfortran -cuda -gpu=cc35,debug -g file.cuf
(Just a note: if you are compiling a C++ program with the nvcc compiler, you may need the "-G" option instead.)
Then, you run cuda-gdb as if it was gdb.
Note that errors often come as API errors (for instance, the way you call or use the kernels that you have implemented
may be wrong, or the way you conceived a kernel does not match the way it actually works).
Your application may keep running without giving an explicit error.
cuda-gdb will detect these as warnings, and it will pass over them without stopping.
So, once you open cuda-gdb, you should issue this (just before using the command "run"):
set cuda api_failures stop
Finally, you can use the "backtrace" command (bt) and "frame" as usual.
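To put the pieces together, a typical session might look something like this (the program name is a placeholder; the "..." stands for whatever your program prints before the failure):

```
$ nvfortran -cuda -gpu=cc35,debug -g myprogram.cuf -o myprogram.exe
$ cuda-gdb ./myprogram.exe
(cuda-gdb) set cuda api_failures stop    # stop on API errors instead of just warning
(cuda-gdb) run
...                                       # execution stops at the failing API call
(cuda-gdb) bt                             # backtrace to locate the call site
(cuda-gdb) frame 1                        # inspect a specific frame
```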
Part 2: running your application (practical considerations)
In this section, just a couple of notes on problems that you may encounter while running your programs.
First, while your application runs, you can inspect the GPU temperature, power consumption, available
memory etc with the nvidia-smi command-line tool.
Blocks are mapped to multiprocessors
So, you see from the deviceQuery that your GPU has (say..) 14 multiprocessors.
What is the speed-up with respect to a serial CPU-only code?
How many operations will it run at the same time?
The answer to the first question is not simple, and clearly depends on the GPU processors clock,
memory access paths etc etc.
However, roughly speaking, it's clear that the more threads can be run in parallel, the higher the
expected speed-up will be.
Consider the following:
people usually use CUDA such that a thread is associated to each basic element of the computation.
Then, threads are grouped in blocks.
Let's make an example:
while solving partial differential equations, each thread could be mapped to a different grid point (or cell).
When solving for particles, a thread could be mapped to an individual particle.
Threads are grouped in blocks and there is a maximum number of threads per block (something like 1024,
depending on the compute capability).
Now, the key point here is that blocks are mapped to multiprocessors.
This means that, if you have 14 multiprocessors in your GPU, each one will take care of a block.
In this case, the GPU will be able to process 14 blocks at a time (not necessarily exactly concurrently,
but in chunks of 14 blocks at a time).
If in the launch configuration you specify 1024 threads per block and if each thread represents one
grid point in your simulation, then the kernel will process 1024*14 cells each time.
What about larger simulation grids (or more particles)? Can you have more blocks?
Sure thing: the kernel will just deal with the first 14 blocks, and once these are done, the next chunk of 14 blocks will be processed,
and so on until all blocks have been treated.
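The "one thread per cell" pattern above can be sketched like this in CUDA Fortran. All the names (update_cells, u_d, n) are illustrative, and the per-cell update is a placeholder:

```fortran
! Sketch: one thread per grid cell, blocks of 1024 threads.
module kernels
  use cudafor
contains
  attributes(global) subroutine update_cells(u, n)
    real, device :: u(n)
    integer, value :: n
    integer :: i
    ! Global index of this thread (CUDA Fortran indices are 1-based)
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) u(i) = 2.0 * u(i)   ! placeholder per-cell update
  end subroutine update_cells
end module kernels

program launch_demo
  use cudafor
  use kernels
  implicit none
  integer, parameter :: n = 100000
  integer :: nthreads, nblocks
  real, device :: u_d(n)

  u_d = 1.0
  nthreads = 1024                           ! max threads per block on cc 3.5
  nblocks  = (n + nthreads - 1) / nthreads  ! enough blocks to cover all cells
  call update_cells<<<nblocks, nthreads>>>(u_d, n)
end program launch_demo
```

With n = 100000 and 1024 threads per block, this launches 98 blocks, which the 14 multiprocessors then chew through chunk by chunk, exactly as described above.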
Having this clear in mind is quite important.
For example, if you need to implement a time-integration method, you need to make sure that each grid cell (or every particle)
has been processed before passing to the next time step.
Just to give you an idea:
here's what happens if you don't take this into account.
In case you wonder, that was supposed to be a cylindrical shock wave solved with a RK-2 time integration scheme, in which I should
synchronize the threads before passing from the first to the second stage.
I was clearly failing to do that, and both the first and the second stage were executed on chunks of 14 blocks each, even before the kernel
had started on other regions of the domain.
Indeed, if you zoom in, you can see that the solution looks kind of good in chunks of 14 blocks each.
Each block was made of 32x32 grid cells, each one mapped to a different thread.
With this lack of synchronization, the inter-cell fluxes were going crazy.
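One way to get the needed grid-wide synchronization is to split each RK stage into its own kernel: kernels launched on the same (default) stream execute in order, so the second stage only starts after the first has finished on the whole grid. A fragment (not a complete program; rk_stage1/rk_stage2 and the arrays are illustrative names, not my actual code):

```fortran
! Sketch: separate kernel launches give a grid-wide sync between RK stages.
do step = 1, nsteps
  call rk_stage1<<<nblocks, nthreads>>>(u_d, u1_d, dt)  ! predictor, all cells
  call rk_stage2<<<nblocks, nthreads>>>(u_d, u1_d, dt)  ! corrector, runs only
                                                        ! after stage 1 is done
end do
istat = cudaDeviceSynchronize()  ! be sure the GPU is done before using results on the host
```

Note that syncthreads() inside a kernel only synchronizes threads within one block, so it would not have fixed this: the sync between stages has to happen across the whole grid.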
Running out of resources: too many variables!
Here's another issue that you may run into.
Complicated kernels may run out of resources.
How do you notice that?
Nothing works, with no apparent errors.
Remedy: run cuda-memcheck.
This is frequently caused by your kernel requiring too many registers.
If you use the deviceQuery script, you'll see that there is a maximum number of registers
available for each block.
In computers (and GPUs), registers are extremely small chunks of memory that are right next to the processor.
In practice, they store the local variables that you declare inside subroutines.
When I say they are small, I mean small!
Like 4, 8, 16, or 32 bits, depending on the architecture.
In my case they are 32 bits.
In the Tesla K20X, there are max 65536 registers available for each block.
If I want to run with, say, 1024 threads per block, then each thread has only 65536/1024 = 64 registers available.
This means that we can store something like 64 local variables only!!!
Not exactly, but that gives you an idea.
If your kernels and subroutines are very simple, this may very well be enough.
If on the other hand they start being more complex (in CFD for example, higher order accuracies etc etc)
you may start to face some limitations.
Variables that do not fit in the registers may be automatically spilled to a higher-level memory (with a loss in performance),
but this may not happen, and you may run into trouble.
How do you detect what's going on? Use cuda-memcheck.
How do you solve the problem? Try reducing the number of threads per block.
Note that you can check the registers occupancy using the nvprof profiler with the --print-gpu-trace flag.
This will show the "Regs" (registers) being used per thread.
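For reference, here are the commands I would try for inspecting and limiting register usage. The -gpu sub-options (ptxinfo, maxregcount) are as I recall them from the NVHPC docs; check "nvfortran --help" for your version:

```
# Per-kernel register count at compile time (ptxas info):
nvfortran -cuda -gpu=cc35,ptxinfo myprogram.cuf

# Per-launch register count at run time ("Regs" column):
nvprof --print-gpu-trace ./myprogram.exe

# Cap the registers per thread (extra variables spill to local memory):
nvfortran -cuda -gpu=cc35,maxregcount:32 myprogram.cuf
```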
Ok, that's it for now.
Enjoy your CUDA programming!
Cheers,
Stefano