Some notes about using CUDA Fortran

Cremona, November 2020.
This page is in the "junk" directory! Make sure you check some better material as well, on the main page :-)

So you want to move your Fortran code to the GPU. That's great!
Unfortunately, as of November 2020, using NVIDIA CUDA Fortran is not super intuitive. The user base is clearly much smaller than that of C++ CUDA, and this is reflected in the much smaller amount of documentation available online. It may happen (quite often, actually) that you compile a program and it runs without complaint, but no output is produced, or maybe just rubbish. In such cases, it is important that you know the following stuff.

Here are a few notes that will hopefully help you along the way. I will skip all the very basics (installing CUDA Fortran, programming in CUDA etc etc) and only discuss some further aspects. If you need a tutorial on CUDA, check out the channel "Creel" on YouTube! This guy prepared an amazing tutorial on CUDA (C++, but you can easily port the concepts to Fortran).

Part 1: compiling and debugging

The very basics: getting info about the device

Ok, this is basic, but let's remark it anyway. In case you want to get some detailed info about your device (such as cache size, driver, compute capability etc etc) you can use the deviceQuery utility. It is shipped with the CUDA samples (look in your installation directory, inside the "Utilities" folder). You need to compile it before use, and this will create a useful executable. Here's what I get:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla K20Xm"
  CUDA Driver Version / Runtime Version          11.0 / 10.2
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 5701 MBytes (5977800704 bytes)
  (14) Multiprocessors, (192) CUDA Cores/MP:     2688 CUDA Cores
  GPU Max Clock rate:                            732 MHz (0.73 GHz)
  Memory Clock rate:                             2600 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.0, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS
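
If you prefer to get this info from inside your own code, the cudafor module exposes cudaGetDeviceProperties. Here's a minimal sketch (the program name and the fields I print are just the ones I find most useful; pick yours from the cudaDeviceProp type):

program query_device
  use cudafor
  implicit none
  type(cudaDeviceProp) :: prop
  integer :: istat, ndev, i

  istat = cudaGetDeviceCount(ndev)
  print *, 'Number of CUDA devices: ', ndev

  do i = 0, ndev - 1
    istat = cudaGetDeviceProperties(prop, i)
    print *, 'Device ', i, ': ', trim(prop%name)
    print *, '  Compute capability:      ', prop%major, prop%minor
    print *, '  Multiprocessors:         ', prop%multiProcessorCount
    print *, '  Max threads per block:   ', prop%maxThreadsPerBlock
    print *, '  Shared memory per block: ', prop%sharedMemPerBlock, ' bytes'
    print *, '  Registers per block:     ', prop%regsPerBlock
  end do
end program query_device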

Compiling for a given compute capability (or it won't run)

Again, basic stuff. Maybe you have tried compiling your program, the compiler didn't complain and you got your executable, but when running it nothing happened. If so, did you compile for the compute capability supported by your GPU?

First, nvfortran is not limited to CUDA: it can also compile for OpenACC and other stuff. So, you need to specify the -cuda option. Then, you specify the compute capability. For my Tesla K20X the compute capability is 3.5, so I specify: -gpu=cc35.

nvfortran -cuda -gpu=cc35 myprogram.cuf

You can find plenty of information by running "nvfortran --help".

Using single or double precision

Most GPUs work better with single precision floating point numbers, but sometimes you need double precision. There's a useful flag for that: if you add "-r4" or "-r8" to the compilation options, the compiler will interpret "real" as single or double precision respectively. You just declare variables as "real", without writing "real(kind=8)" or "real(kind=4)" explicitly, and the compiler will do everything by itself. This comes in handy.
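
Just to give an idea, with a tiny (hypothetical) program like this:

program precision_test
  implicit none
  real :: x          ! kind decided by the -r4/-r8 flag at compile time
  x = 1.0
  print *, 'storage size of "real":', storage_size(x), 'bits'
end program precision_test

compiling with "nvfortran -cuda -gpu=cc35 -r8 precision_test.cuf" should report 64 bits, while "-r4" gives 32, without touching the source.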

Debugging with cuda-memcheck

The cuda-memcheck tool is absolutely a must-know. Again, assume your code compiles fine, but once you launch it you don't get the expected results (or maybe the program behaves as if the kernels were never launched, for some reason). In all such cases, you should try to run your program using:

cuda-memcheck ./myprogram.exe

This will tell you immediately if something is going wrong or if the program runs smoothly. For example, cuda-memcheck could tell you that some kernels are actually not being launched for some reason, or it may warn you that too many resources are requested and that this causes a kernel to fail (too many registers being used maybe, see below). So, use cuda-memcheck!! And after that, if you want to dig more into your errors, use cuda-gdb.
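
By the way, you can also catch some of these failures from inside the code, by checking the error status after each kernel launch. Here's a minimal sketch (the kernel and array names are made up; cudaGetLastError catches launch errors, cudaDeviceSynchronize catches errors during execution):

module kernels
  use cudafor
contains
  attributes(global) subroutine mykernel(a)
    real :: a(*)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    a(i) = 2.0 * a(i)
  end subroutine mykernel
end module kernels

program check_errors
  use cudafor
  use kernels
  implicit none
  real, device :: a_d(1024)
  integer :: istat

  a_d = 1.0
  call mykernel<<<4, 256>>>(a_d)   ! 4 blocks of 256 threads cover the 1024 elements

  istat = cudaGetLastError()       ! did the launch itself succeed?
  if (istat /= cudaSuccess) print *, 'launch error: ', cudaGetErrorString(istat)

  istat = cudaDeviceSynchronize()  ! did the kernel run to completion?
  if (istat /= cudaSuccess) print *, 'runtime error: ', cudaGetErrorString(istat)
end program check_errors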

Debugging with cuda-gdb

Mom NVIDIA provides the cuda-gdb package to help us debug our codes. First, you should compile with debug symbols. As you may expect, you need the "-g" option to enable debugging symbols on the host side (the CPU). You may also need to generate debug symbols on the device side (the GPU), by adding the "debug" option to the "-gpu" flag. Since you have probably already specified a compute capability, you need to separate the two by a comma:

nvfortran -cuda -gpu=cc35,debug -g file.cuf

(Just a note: if you are compiling a C++ program with the nvcc compiler, you may need the "-G" option instead.) Then, you run cuda-gdb as if it were gdb. Note that errors often come as API errors (for instance, the way you call or use the kernels that you have implemented may be wrong, or the way you conceived your kernel does not match the way it actually works). Your application may keep running without giving an explicit error. cuda-gdb detects these as warnings and passes over them without stopping. So, once you open cuda-gdb, you should run (just before using the "run" command):

set cuda api_failures stop

Finally, you can use the "backtrace" command (bt) and "frame" as usual.
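
Putting it all together, a typical session might look something like this (the program name is hypothetical):

nvfortran -cuda -gpu=cc35,debug -g myprogram.cuf -o myprogram.exe
cuda-gdb ./myprogram.exe
(cuda-gdb) set cuda api_failures stop
(cuda-gdb) run
...cuda-gdb now stops at the first failing API call or kernel error...
(cuda-gdb) bt
(cuda-gdb) frame 0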

Part 2: running your application (practical considerations)

In this section, just a couple of notes on problems that you may encounter while running your programs. First, while your application runs, you can inspect the GPU temperature, power consumption, available memory etc with the nvidia-smi command-line tool.
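
For instance, to refresh the readings every second while your program is running:

nvidia-smi -l 1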

Blocks are mapped to multiprocessors

So, you see from deviceQuery that your GPU has (say...) 14 multiprocessors. What is the speed-up with respect to a serial CPU-only code? How many operations will it run at the same time?
The answer to the first question is not simple, and clearly depends on the GPU clock, memory access patterns etc etc. However, roughly speaking, it's clear that the more threads can be run in parallel, the higher the expected speed-up will be.

Consider the following: people usually use CUDA such that a thread is associated with each basic element of the computation. Then, threads are grouped into blocks.
Let's make an example: when solving partial differential equations, each thread could be mapped to a different grid point (or cell). When solving for particles, a thread could be mapped to an individual particle.
There is a maximum number of threads per block (something like 1024, depending on the compute capability). Now, the key point here is that blocks are mapped to multiprocessors. This means that, if you have 14 multiprocessors in your GPU, each one will take care of a block. In this case, the GPU will be able to process 14 blocks at a time (not necessarily exactly concurrently, but in chunks of 14 blocks at a time). If in the launch configuration you specify 1024 threads per block, and each thread represents one grid point in your simulation, then the kernel will process 1024*14 cells each time.
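
In CUDA Fortran, such a launch configuration looks something like this (just a sketch: update_cells, u_d, nx and ny are placeholders for your own kernel and data):

type(dim3) :: tBlock, grid

tBlock = dim3(32, 32, 1)                         ! 32x32 = 1024 threads per block
grid   = dim3((nx + tBlock%x - 1) / tBlock%x, &  ! enough blocks to cover all cells
              (ny + tBlock%y - 1) / tBlock%y, 1)

call update_cells<<<grid, tBlock>>>(u_d, nx, ny) ! one thread per grid cell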

What about larger simulation grids (or more particles)? Can you have more blocks?
Sure thing: the GPU will deal with the first 14 blocks, and once these are done the next chunk of 14 blocks will be treated, and so on until all blocks have been processed.

Having this clear in mind is quite important. For example, if you implement a time-integration method, you need to make sure that every grid cell (or every particle) has been processed before passing to the next time step.
Just to give you an idea, here's what happens if you don't take this into account. In case you wonder, that was supposed to be a cylindrical shock wave solved with an RK-2 time integration scheme, in which I should have synchronized the threads before passing from the first to the second stage. I was clearly failing to do that, and both the first and second stages were executed on chunks of 14 blocks each, even before the kernel could start on other regions of the domain. Indeed, if you zoom in, you can see that the solution kind of looks good in chunks of 14 blocks each. Each block was made of 32x32 grid cells, each one mapped to a different thread. With this missing synchronization, the inter-cell fluxes were going crazy.

[Figure: failed synchronization]
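
A simple way to get that synchronization is not to cram both stages into a single kernel, but to launch one kernel per stage: kernels launched on the same stream execute in order, so the second stage only starts once the first one has completed over the whole grid. A sketch of the idea, with made-up kernel and variable names:

do it = 1, nsteps
  ! stage 1: must finish on the whole domain before stage 2 starts
  call rk2_stage1<<<grid, tBlock>>>(u_d, u_star_d, nx, ny, dt)
  ! stage 2: same (default) stream, so it waits for stage 1 to complete
  call rk2_stage2<<<grid, tBlock>>>(u_d, u_star_d, nx, ny, dt)
end do

istat = cudaDeviceSynchronize()   ! wait for everything before copying results back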

Running out of resources: too many variables!

Here's another issue that you may run into.

Complicated kernels may run out of resources. How do you notice? Nothing works, with no apparent errors. Remedy: run cuda-memcheck.

This is frequently because of too many registers being required by your kernel. If you look at the deviceQuery output, you'll see that there is a maximum number of registers available per block. In computers (and GPUs), registers are extremely small chunks of memory that sit right next to the processor. In practice, they store the local variables that you declare inside your subroutines. When I say they are small, I mean small! Like 4, 8, 16, 32 bits, depending on the architecture. In my case they are 32 bits.
In the Tesla K20X, there are at most 65536 registers available per block. If I want to run with, say, 1024 threads per block, then each thread has only 65536/1024 = 64 registers available. This means that we can store something like 64 local variables only!!! Not exactly, but that gives you an idea.
If your kernels and subroutines are very simple, this may very well be enough. If, on the other hand, they start being more complex (in CFD, for example, higher order accuracy etc etc), you may start to face some limitations. Variables that do not fit in the registers may be automatically spilled to a slower level of memory (with a loss in performance), but this may not always save you, and you may run into trouble.

How do you detect what's going on? Use cuda-memcheck.
How do you solve the problem? Try reducing the number of threads per block (see the sketch below).
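
For example, going from 1024 to 256 threads per block leaves 65536/256 = 256 registers per thread; in the launch configuration this would be something like (placeholder names again):

tBlock = dim3(16, 16, 1)                          ! 256 threads per block instead of 1024
grid   = dim3((nx + 15) / 16, (ny + 15) / 16, 1)  ! more blocks, to still cover all cells
call update_cells<<<grid, tBlock>>>(u_d, nx, ny)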

Note that you can check register usage with the nvprof profiler and the --print-gpu-trace flag. This will show the "Regs" (registers) used per thread.
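
For example:

nvprof --print-gpu-trace ./myprogram.exe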

Ok, that's it for now. Enjoy your CUDA programming!
Cheers,
Stefano