Some notes - setting up an NVidia GPU for training neural nets

April 2020, Cremona, Italy

Hi all!
Some notes (mainly for my own future reference) on setting up a GPU for training neural nets. On this page, the OS is Debian 10 "Buster".

Hardware setup

I got an (Asus) NVidia GeForce GTX 760 GPU off ebay for quite cheap. It's a bit old now (it dates back to 2013) but should be just fine, with around 1000 CUDA cores. The only catch is that we will have to compile TensorFlow ourselves, specifying support for this GPU.

For the computer, I'm using a Lenovo Edge 72.
In my setup, I will keep the monitor attached to the integrated graphics card (IGC = Integrated Graphics Chip) and keep the GTX 760 only for computations. Make sure that your motherboard supports this: old ones (for example, an old HP that I had) may not allow the IGC to be used when another GPU is plugged in. If unsure, check the manual of your motherboard.

Practical notes: Cost of GPU + computer: about 150 EUR.


BIOS configuration

...Yes, my computer is old enough to still have a BIOS.
My configuration: monitor connected to the integrated graphics card (IGC) and GPU used only for computations. However, the BIOS may activate the GPU instead of the IGC. So, check the BIOS settings (perhaps with the GPU unplugged) and find the "video" section. Make sure that the graphics card parameter is NOT set to AUTO, but points directly to the IGC.

Setup the OS

Ok, here comes the trickier part, installing the driver and configuring the system so that everything works. I'm using GNU/Linux, Debian 10.

As you may know, NVidia drivers are proprietary, not open-source. However, there is an open-source version which may be activated by default when you install the operating system: it's called "nouveau". This version is less performant than the proprietary NVidia driver, so we shall install the proprietary one instead. Installing it will temporarily break the video settings, so we'll have to restore them by editing the xorg.conf file.

Step 0: lspci

Before starting, check that your system recognizes the GPU:

$ lspci | grep VGA

00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller (rev 09)
01:00.0 VGA compatible controller: NVIDIA Corporation GK104 [GeForce GTX 760] (rev a1)


Step 1: disable nouveau drivers

First, we remove all automatically installed NVidia stuff. In Debian-based distributions:

$ sudo apt-get remove --purge nvidia-*
$ sudo nvidia-uninstall


Note that this may be a bit of a problem if your IGC is also from NVidia... but I guess that's rarely the case. Not sure though.
The nouveau drivers die hard, so we make sure they are not loaded by blacklisting them. Go to the directory "/etc/modprobe.d" and take a look: you may already have a file "blacklist-nouveau.conf" or similar. We create such a file (the filename doesn't matter, it will just be read at the proper time):

$ sudo vim /etc/modprobe.d/blacklist-nouveau.conf

... and add this stuff in the file...

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off


This should be enough. On Debian, you should also regenerate the initramfs (sudo update-initramfs -u), since it may still contain the nouveau module. Changes will take place after rebooting. So, reboot your system (or, if you prefer, try sudo rmmod nouveau to unload the module now — this will only work if nothing is using it).

Step 2: Install NVidia drivers

Ok, you can now download the NVidia drivers from the website, selecting your own GPU. Make sure you select the 64-bit version (unless you have a 32-bit machine). Note that NVidia updates the drivers quite frequently, fixing bugs and adding features, so download the latest one. In my case, the file is called "NVIDIA-Linux-x86_64-440.82.run".
Installing them is pretty easy. Make the file executable:

chmod +x ./NVIDIA-Linux-x86_64-440.82.run


The driver must be installed outside of the graphics system (X). So, press CTRL-ALT-F1 and you'll drop out of X to a terminal. To get back to X, press CTRL-ALT-F7 on many systems (note that on some systems, CTRL-ALT-F1 is actually X itself).
Log in and then stop the X system with:

$ sudo service lightdm stop

# Alternatively, you could have switched to runlevel 3:
$ sudo init 3


I hope you didn't have anything important running in X.

We are ready to run the NVidia installer. Since I want to use the integrated graphics card for the monitor and the GPU only for computations, I will tell the installer to skip the OpenGL files. Run:

$ ./NVIDIA-Linux-x86_64-440.82.run --no-opengl-files


This step should run just fine, unless you still have the nouveau drivers running around. You can accept the options that it suggests. Reboot the system. If everything went as expected, quite likely X is broken now. Keep reading.

Step 3: set the xorg.conf

Probably, at this point, the booted system won't show the login screen but a black screen with a blinking cursor. This is because the NVidia installer probably rewrote the xorg.conf file, which tells the graphics system X how to configure the outputs. So, we need to fix this file. Luckily, it's quite simple, and the file is quite intuitive.
First, press CTRL-ALT-F1 again (or another F-key) and log in at the terminal. Check the PCI configuration with lspci | grep VGA:

$ lspci | grep VGA

00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller (rev 09)
01:00.0 VGA compatible controller: NVIDIA Corporation GK104 [GeForce GTX 760] (rev a1)


This tells us that the integrated card is on the PCI bus 00:02.0 and the GPU on 01:00.0. In the format of the xorg.conf file the IGC is on PCI:0@0:2:0. So, you should now modify the xorg.conf file:

$ sudo vim /etc/X11/xorg.conf


Mine looks like the following. First of all, there is a "ServerLayout" section defining two screens, screen 0 and screen 1. The first screen is the actual one; we define the second just to map it to the GPU. Then, using the Section "Screen" keyword, we define one Screen for the IGC (called "intel") and one for the GPU (called "nvidia"), and map each Screen to a different "Device". We also create the two Devices, one for the IGC ("intel") and one for the GPU ("nvidia"). For the GPU, I guess we don't really need to put the correct address, since X will not be managing it. Just check out the file, it's quite easy:

# Configuration for /etc/X11/xorg.conf

Section "ServerLayout"
	Identifier 	"Layout0"
	Screen  	0  "intel"
	Screen  	1  "nvidia"
	InputDevice	"Keyboard0" "CoreKeyboard"
	InputDevice	"Mouse0" "CorePointer"
EndSection

Section "Files"
EndSection

# ========== MOUSE AND KEYBOARD ================

Section "InputDevice"
	# generated from default
	Identifier 	"Mouse0"
	Driver     	"mouse"
	Option     	"Protocol" "auto"
	Option     	"Device" "/dev/psaux"
	Option     	"Emulate3Buttons" "no"
	Option     	"ZAxisMapping" "4 5"
EndSection

Section "InputDevice"
	# generated from default
	Identifier 	"Keyboard0"
	Driver     	"kbd"
EndSection

# ========== Integrated card and GPU ================

# This device maps to the IGC
Section "Device"
	Identifier 	"intel"
	Driver     	"intel"
	BusID      	"PCI:0@0:2:0"
	Option     	"AccelMethod" "SNA"
EndSection

# This device maps to the GPU
Section "Device"
	Identifier 	"nvidia"
	Driver     	"nvidia"
	BusID      	"PCI:0@1:0:0"
	Option     	"ConstrainCursor" "off"
EndSection

# This Screen calls the IGC device
Section "Screen"
	Identifier "intel"
	Device 	"intel"
EndSection

# This Screen calls the GPU device
Section "Screen"
	Identifier 	"nvidia"
	Device     	"nvidia"
	Option     	"AllowEmptyInitialConfiguration" "on"
	Option     	"IgnoreDisplayDevices" "CRT"
EndSection
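
A gotcha about the BusID lines above: lspci prints the bus, device and function numbers in hexadecimal, while the BusID in xorg.conf must be given in decimal. Here the numbers (0, 1, 2) happen to be the same in both bases, but on a machine with, say, bus 0a you must convert. A minimal Python sketch of the conversion (using the plain PCI:bus:device:function syntax; the bus@domain form used above works the same way):

```python
# Convert an lspci slot such as "0a:00.0" (hexadecimal) into the decimal
# "PCI:bus:device:function" form that the xorg.conf BusID option expects.
def lspci_to_xorg_busid(slot):
    bus, dev_fn = slot.split(":")
    dev, fn = dev_fn.split(".")
    return "PCI:%d:%d:%d" % (int(bus, 16), int(dev, 16), int(fn, 16))

print(lspci_to_xorg_busid("00:02.0"))  # PCI:0:2:0  (the IGC)
print(lspci_to_xorg_busid("01:00.0"))  # PCI:1:0:0  (the GTX 760)
```

For bus numbers above 9 the two notations diverge: 0a in lspci becomes 10 in the xorg BusID.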


So, now reboot and magically everything should be working! You can check that your installation went well and the status of your GPU by running:


$ nvidia-smi

Sun Apr 12 13:15:17 2020  	 
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82     Driver Version: 440.82      CUDA Version: 10.2        |
|-------------------------------+----------------------+----------------------+
| GPU  Name    	Persistence-M   | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 760 	Off     | 00000000:01:00.0 N/A |                  N/A |
| 29%   30C P8  N/A /  N/A      |     12MiB /  1999MiB |    N/A       Default |
+-------------------------------+----------------------+----------------------+
                                                                          	 
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU      PID   Type   Process name                              Usage      |
|=============================================================================|
|   0                    Not Supported                                        |
+-----------------------------------------------------------------------------+


As you can see, the GPU is properly recognized, although no further info is available for this type of GPU (NVidia exposes it only for higher-end GPUs).

Installing the software

The "compute capability" of a CUDA-capable GPU describes the specifications of the GPU and the set of instructions it is able to run.
As of April 2020, TensorFlow is compiled so as to support, by default, GPUs with compute capability 3.5 or higher.

GPU with compute capability ≥ 3.5

If that's your case, then you just have to install the GPU-enabled version of TensorFlow for Python. You can set it up in a Conda (Anaconda/Miniconda) environment, or with pip:

$ pip install tensorflow-gpu

And you will be good to go. Now load TensorFlow from your python3 script and enjoy!
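
Before training anything, it's worth a quick sanity check that TensorFlow actually sees the GPU. A small sketch (guarded so it also runs where TensorFlow is missing; tf.config.list_physical_devices needs TensorFlow 2.1 or newer):

```python
import importlib.util

def gpu_devices():
    """Return the GPUs TensorFlow can see, or None if TF isn't installed."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf  # needs TF >= 2.1 for this API
    return tf.config.list_physical_devices("GPU")

print(gpu_devices())  # a non-empty list if the GPU is visible; None if TF is missing
```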

GPU with compute capability 3.0

My GPU has compute capability 3.0, which is not supported by default by TensorFlow. So, I will build it from source and specify that option. Check out the instructions on the TensorFlow web page.
Basically, this will compile TensorFlow from source and create a package that you can then install into your Python environment. Compiling TensorFlow on a low-spec computer takes forever, let me tell you. In short, the procedure becomes:

### Install pre-requisites

$ pip install -U --user pip six numpy wheel setuptools mock 'future>=0.17.1'
$ pip install -U --user keras_applications --no-deps
$ pip install -U --user keras_preprocessing --no-deps

### Need Go language, for bazelisk

$ sudo apt-get install golang-go

### Install bazelisk

$ git clone https://github.com/bazelbuild/bazelisk.git
$ cd bazelisk
$ ./build.sh
$ cd ..

### Create an alias to bazelisk called "bazel", for convenience
### (put the proper path here and reference to the proper binary)

$ echo "alias bazel='~/bazelisk/bin/bazelisk-linux-amd64'" >> ~/.bash_aliases


Then, we need to install CUDA.
You'll need the CUDA SDK, which you can download from the NVidia CUDA website. Unfortunately, the CUDA package shipped with Debian supports the "nouveau" driver only, so we need to download it from the website. There is no direct support for Debian, but there are Ubuntu packages, which will do just fine. Just download the ".run" file, since messing with the repositories may not work. Follow the instructions on the CUDA website.

Again, you need to go to runlevel 3 to install this... At this point, I accepted to install the CUDA driver bundled with the SDK, which made the previous installation of the NVidia driver redundant. I'll keep it this way.

Ok, it's time to download and compile TensorFlow. First, you need to modify the configure.py file by changing the default compute capability: put 3.0 in place of 3.5. Then, configure:

$ git clone https://github.com/tensorflow/tensorflow.git
$ cd tensorflow
$ ./configure
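
The one-line edit to configure.py mentioned above can also be scripted. This is only a sketch: it does a blind textual replacement, and the exact default string inside configure.py may differ between TensorFlow versions, so inspect your checkout first:

```python
# Sketch: replace the default compute capability "3.5" with "3.0" in
# configure.py before running ./configure. Blind replacement -- check
# the file first, as the default string may differ in your checkout.
from pathlib import Path

def lower_capability(text, old="3.5", new="3.0"):
    return text.replace(old, new)

cfg = Path("configure.py")
if cfg.exists():
    cfg.write_text(lower_capability(cfg.read_text()))
```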


During the configuration you will be asked some questions. Make sure you enable CUDA support at the proper stage, and double-check the compute capabilities when prompted. Then, compile with bazelisk.

Notice that bazelisk is quite memory-hungry, and if you have a limited system, compiling TensorFlow becomes an issue. I have 4 GB of RAM and after 4 hours of compilation it crashed, throwing a "critical error: out of memory". Passing only the --local_ram_resources=2048 flag, the compilation crashed as well. The amount of time it takes is largely due to the system swapping very often (check with htop).
So, you should give a set of additional options, limiting the number of jobs and the RAM used:

bazel build --config=opt --config=cuda --local_ram_resources=2048     \
                                       --local_cpu_resources=2        \
                                       --jobs=2                       \
                                       --ram_utilization_factor=30    \
                                       //tensorflow/tools/pip_package:build_pip_package


With this setup, it took some 9.5 hours (...) but it compiled successfully! At this point, you will have a directory bazel-bin in the current working directory, and inside it you will eventually find an executable build_pip_package. We build a package for pip in the /tmp directory:

$ ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg


And finally you can install it with pip by:

$ pip install /tmp/tensorflow_pkg/whatevernameithas.whl


Note that you may need to use Python 3.5, since Python 3.7 did not work for me.


ANOTHER OPTION: using Theano

OK, so another option is using Theano. First, as of now it needs a Python version greater than 3.4 but below 3.6. I have 3.7 installed, so let's use conda to create a virtual environment with Python 3.5.
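
To codify that version constraint, here is a tiny hypothetical helper (my own, not part of Theano) mirroring the "greater than 3.4, below 3.6" requirement:

```python
import sys

def theano_python_ok(version=None):
    """True if the Python version is in the range Theano wants (>3.4, <3.6)."""
    v = tuple((version or sys.version_info)[:2])
    return (3, 4) < v < (3, 6)

print(theano_python_ok((3, 5)))  # True
print(theano_python_ok((3, 7)))  # False
```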

Install miniconda (or anaconda) first. Then create a virtual environment by:

$ conda create --name neural_net_theano_py3p5 python=3.5
$ conda activate neural_net_theano_py3p5


Now let's install some packages that Theano needs. First of all, you'll need the CUDA SDK and driver: if you haven't installed them yet, follow the same procedure described in the TensorFlow section above (download the ".run" file from the NVidia CUDA website and install it from runlevel 3).

Then, you need to install the cuDNN library (CUDA Deep Neural Network), which is used by both Theano and TensorFlow. You'll need to create a profile and log in to the NVidia website, then follow the instructions to download and install. The installation boils down to downloading a tar.gz file and copying files to the proper locations in /usr/local/cuda/... Ok, once this is done, let's install Theano:

$ conda install numpy scipy mkl
$ conda install theano pygpu


You will need one more step before running it, since it will probably not find the header files for the cuDNN library. Create a file .theanorc in your home directory, with:


[global]
device = cuda
floatX = float32

[dnn]
include_path=/usr/local/cuda-10.2/include
library_path=/usr/local/cuda-10.2/lib64


At this point, Theano should work.
