Installing autoware + docker (cont'd)
This is a continuation of Installing Autoware on Ubuntu 20.04.  

I'm starting over after working with the system76 support engineers to address some issues I ran into.  The tl;dr is that I ended up having a mix of packages from the system76 apt repositories and the nvidia apt repositories, and these packages didn't play well together.  I removed all of the system76 packages and re-installed exclusively from the nvidia apt repositories.

The current state of my drivers:

# nvidia-smi
Mon Oct 31 12:52:32 2022
| NVIDIA-SMI 515.65.07    Driver Version: 515.65.07    CUDA Version: 11.7     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   41C    P8     5W /  N/A |    168MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|    0   N/A  N/A      1393      G   /usr/lib/xorg/Xorg                 32MiB |
|    0   N/A  N/A      2203      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      4500      C   /usr/NX/bin/nxnode.bin            126MiB |

Nvidia packages

# dpkg -l | grep nvidia-
ii  libnvidia-cfg1-515:amd64                     515.65.07-0ubuntu1                                                    amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-common-515                         515.65.07-0ubuntu1                                                    all          Shared files used by the NVIDIA libraries
ii  libnvidia-compute-515:amd64                  515.65.07-0ubuntu1                                                    amd64        NVIDIA libcompute package
ii  libnvidia-compute-515:i386                   515.65.07-0ubuntu1                                                    i386         NVIDIA libcompute package
ii  libnvidia-decode-515:amd64                   515.65.07-0ubuntu1                                                    amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-decode-515:i386                    515.65.07-0ubuntu1                                                    i386         NVIDIA Video Decoding runtime libraries
ii  libnvidia-encode-515:amd64                   515.65.07-0ubuntu1                                                    amd64        NVENC Video Encoding runtime library
ii  libnvidia-encode-515:i386                    515.65.07-0ubuntu1                                                    i386         NVENC Video Encoding runtime library
ii  libnvidia-extra-515:amd64                    515.65.07-0ubuntu1                                                    amd64        Extra libraries for the NVIDIA driver
ii  libnvidia-fbc1-515:amd64                     515.65.07-0ubuntu1                                                    amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-515:amd64                       515.65.07-0ubuntu1                                                    amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  nvidia-compute-utils-515                     515.65.07-0ubuntu1                                                    amd64        NVIDIA compute utilities
ii  nvidia-dkms-515                              515.65.07-0ubuntu1                                                    amd64        NVIDIA DKMS package
ii  nvidia-driver-515                            515.65.07-0ubuntu1                                                    amd64        NVIDIA driver metapackage
ii  nvidia-kernel-common-515                     515.65.07-0ubuntu1                                                    amd64        Shared files used with the kernel module
ii  nvidia-kernel-source-515                     515.65.07-0ubuntu1                                                    amd64        NVIDIA kernel source package
ii  nvidia-settings                              520.61.05-0ubuntu1                                                    amd64        Tool for configuring the NVIDIA graphics driver
ii  nvidia-utils-515                             515.65.07-0ubuntu1                                                    amd64        NVIDIA driver support binaries
ii  screen-resolution-extra                      0.18build1                                                            all          Extension for the nvidia-settings control panel
ii  xserver-xorg-video-nvidia-515                515.65.07-0ubuntu1                                                    amd64        NVIDIA binary Xorg driver
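
Since the original problem was a mix of packages from two sources, it is worth confirming that the installed nvidia packages agree on a driver version. A minimal sketch, assuming the dpkg -l column layout shown above; nvidia_versions is a made-up helper name, not part of any tool:

```shell
# Print each distinct version among installed packages whose *name*
# contains "nvidia". Reads `dpkg -l` output on stdin.
nvidia_versions() {
    awk '$1 == "ii" && $2 ~ /nvidia/ { print $3 }' | LC_ALL=C sort -u
}

# Usage on a real machine:
#   dpkg -l | nvidia_versions
```

Against the listing above this prints 515.65.07-0ubuntu1 and 520.61.05-0ubuntu1; the second comes from nvidia-settings, which is already on the 520 series.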

Cuda packages

# dpkg -l | grep cuda-
ii  cuda-repo-ubuntu2004-11-6-local              11.6.0-510.39.01-1                                                    amd64        cuda repository configuration files
ii  cuda-toolkit-11-6-config-common              11.6.55-1                                                             all          Common config package for CUDA Toolkit 11.6.
ii  cuda-toolkit-11-8-config-common              11.8.89-1                                                             all          Common config package for CUDA Toolkit 11.8.
ii  cuda-toolkit-11-config-common                11.8.89-1                                                             all          Common config package for CUDA Toolkit 11.
ii  cuda-toolkit-config-common                   11.8.89-1                                                             all          Common config package for CUDA Toolkit.

Re-install cuda 11-6

Based on these instructions with a slight modification:

$ cuda_version=11-6
$ sudo apt install cuda-${cuda_version} --no-install-recommends

I ignored warnings like W: Sources disagree on hashes for supposely identical version '11.6.55-1' of 'cuda-cudart-11-6:amd64'. (the misspelling is apt's own), which I haven't tracked down yet.
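
One way to start tracking that warning down is to list which repositories offer the package. The sketch below assumes the usual apt-cache policy layout, where each source line is indented and begins with a pin priority; policy_sources is a hypothetical helper name. A possible cause, not verified here, is that the local cuda-repo-ubuntu2004-11-6-local repository and another configured repository both carry version 11.6.55-1.

```shell
# Print the sources listed under a package's version table.
# Reads `apt-cache policy <pkg>` output on stdin.
policy_sources() {
    # Source lines look like "        600 <uri-or-path> ...".
    # Version lines start with "***" or a version string, so they are skipped.
    awk '/^ +[0-9]+ / { print $2 }' | LC_ALL=C sort -u
}

# Usage on a real machine:
#   apt-cache policy cuda-cudart-11-6 | policy_sources
```

If more than one repository shows up for the same version, that would be the first place to look.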


After upgrading:

$ nvidia-smi
Mon Oct 31 15:52:57 2022       
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
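
Between the two nvidia-smi listings the driver went from 515.65.07 to 520.61.05. The "CUDA Version" in the banner is the newest CUDA release the driver supports, not the installed toolkit, so 11.8 here does not contradict the cuda-11-6 install. A small sketch for pulling both numbers out of the banner, assuming the banner format shown above; smi_versions is a made-up name:

```shell
# Extract driver and CUDA versions from the nvidia-smi banner line.
# Reads `nvidia-smi` output on stdin; prints nothing if no banner is found.
smi_versions() {
    sed -n 's/.*Driver Version: \([0-9.]*\).*CUDA Version: \([0-9.]*\).*/driver=\1 cuda=\2/p'
}

# Usage on a real machine:
#   nvidia-smi | smi_versions
```

To see the toolkit version itself, nvcc --version is the usual check, assuming nvcc is on the PATH.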

Start autoware container

$ rocker --nvidia --x11 --user --volume $HOME/Development/autoware --volume $HOME/Development/autoware_map --
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

As documented here, this error happens because the nvidia-container-toolkit is missing.  To fix this:

$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

I still got the same error, so I re-installed the nvidia-docker2 package that I had purged earlier, based on these instructions:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -s -L | sudo apt-key add - \
      && curl -s -L$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

At this point, re-running the rocker command worked and dropped me into the container.
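
A quick way to confirm that docker picked up the nvidia runtime after the restart is to look at the Runtimes line of docker info. This is a sketch, assuming nvidia-docker2 registered a runtime under its default name nvidia and that docker info prints a Runtimes line; has_nvidia_runtime is a hypothetical helper:

```shell
# Succeed if `docker info` output (on stdin) lists a runtime named "nvidia".
has_nvidia_runtime() {
    grep -E '^ *Runtimes:' | grep -qw nvidia
}

# Usage on a real machine:
#   docker info 2>/dev/null | has_nvidia_runtime && echo "nvidia runtime registered"
```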

Hitting rviz2 errors

I'm hitting the same errors described in this rviz2 discussion:

$ rviz2
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-tleyden'
libGL error: MESA-LOADER: failed to retrieve device information
libGL error: MESA-LOADER: failed to retrieve device information
[ERROR] [1666389050.804997231] [rviz2]: RenderingAPIException: OpenGL 1.5 is not supported in GLRenderSystem::initialiseContext at /tmp/binarydeb/ros-galactic-rviz-ogre-vendor-8.5.1/.obj-x86_64-linux-gnu/ogre-v1.12.1-prefix/src/ogre-v1.12.1/RenderSystems/GL/src/OgreGLRenderSystem.cpp (line 1201)

Use system76 version of nvidia prime-select

nvidia prime-select is a tool that controls whether GUI apps default to running on the nvidia graphics card or on the graphics built into the CPU.

System76 machines, however, ship their own version of nvidia prime-select, which supports the following modes (details here):

  1. nvidia - use the nvidia GPU graphics card for everything
  2. integrated - use the CPU integrated graphics card for everything
  3. hybrid - use the integrated CPU graphics card for most things, but allow an override mechanism to run things on the nvidia GPU.

I have not been able to figure out how to use hybrid mode with the override environment variables to force apps onto the GPU from within the container, but I was able to brute-force it by switching from hybrid to nvidia mode:

system76-power graphics nvidia

Rviz2 running on the GPU

Now when I start the autoware container and run rviz2 it no longer shows any errors:

$ rviz2
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-tleyden'
[INFO] [1667247760.746904715] [rviz2]: Stereo is NOT SUPPORTED
[INFO] [1667247760.747021907] [rviz2]: OpenGl version: 3.1 (GLSL 1.4)
[INFO] [1667247760.767387690] [rviz2]: Stereo is NOT SUPPORTED
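
The log reports an OpenGL version but not which device is rendering. One more direct check is the renderer string from glxinfo, run inside the container. This is a sketch under the assumption that glxinfo (from mesa-utils) is installed in the container, which this post does not confirm; gl_renderer and renders_on_nvidia are made-up names:

```shell
# Print the OpenGL renderer string from `glxinfo -B` output on stdin.
gl_renderer() {
    sed -n 's/^ *OpenGL renderer string: *//p'
}

# Succeed if the renderer string on stdin's glxinfo output mentions NVIDIA.
renders_on_nvidia() {
    gl_renderer | grep -qi nvidia
}

# Usage inside the container:
#   glxinfo -B | renders_on_nvidia && echo "rendering on the nvidia GPU"
```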

Unfortunately it's not straightforward to verify this with nvidia-smi, which does not seem to show processes running in containers.  I was, however, able to verify it indirectly in two ways:

  1. In my experience, the only way rviz2 will even start from a docker container on the integrated CPU graphics card is to pass in /dev/dri as a container argument.  Since rviz2 started without this argument, it must be running on the GPU.
  2. I started 5 rviz2 processes and verified with nvidia-smi that the amount of GPU memory available was steadily decreasing each time I started another rviz2 process.
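
The second check can be made a bit more precise by reading the used-memory figure before and after starting rviz2. A rough sketch, run on the host; gpu_mem_used and mem_delta are hypothetical helpers, and the 10-second wait is an arbitrary guess at rviz2's startup time:

```shell
# Print used GPU memory in MiB for the first GPU. Requires nvidia-smi.
gpu_mem_used() {
    nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1
}

# Print how much a reading ($2) grew relative to a baseline ($1), in MiB.
mem_delta() {
    echo $(( $2 - $1 ))
}

# Usage on the host:
#   before=$(gpu_mem_used)
#   # ...start rviz2 in the container, then wait, e.g. sleep 10...
#   after=$(gpu_mem_used)
#   echo "GPU memory grew by $(mem_delta "$before" "$after") MiB"
```

A clearly positive delta on each additional rviz2 process is consistent with rendering on the GPU, though other GPU clients starting at the same time would skew it.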

Future work: making it work in hybrid mode

The problem with running everything on the GPU is that it uses up precious GPU memory, even for things that don't need to be on the GPU.

From my experience so far, the normal approach to forcing things to run on the GPU does not work in docker containers, but I'm hoping someone on the nvidia forum will answer with more details.
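
For reference, the usual host-side override is NVIDIA's PRIME render offload environment variables. The sketch below assumes those are the override variables in question; prime_run is an illustrative wrapper defined here, and as described above this approach has not worked from inside the container:

```shell
# Run a command with NVIDIA's PRIME render offload variables set,
# asking GLX applications to render on the nvidia GPU in hybrid mode.
prime_run() {
    __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia "$@"
}

# Usage on the host, in hybrid mode:
#   prime_run glxinfo -B
```

Wrapping the variables in a function keeps them scoped to one command, so everything else keeps running on the integrated graphics.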

In the meantime, at least the system76-power graphics nvidia mode worked, and I would guess that using nvidia prime-select to set the default to the GPU would also work.