Introduction to the NVIDIA DGX A100 System
NVIDIA DGX A100 DU-09821-001_v01
Provides active health monitoring and system alerts for NVIDIA DGX nodes in a data
center. It also provides simple commands for checking the health of the DGX A100
system from the command line.
‣ Data Center GPU Manager (DCGM)
This software enables node-wide administration of GPUs and can be used for cluster
and data-center level management.
‣ DGX A100 system support packages
‣ The NVIDIA GPU driver
‣ Docker Engine
‣ NVIDIA Container Toolkit
‣ Mellanox OpenFabrics Enterprise Distribution for Linux (MOFED)
‣ Mellanox Software Tools (MST)
‣ cachefilesd (daemon for managing cache data storage)
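As a quick way to see which of the components listed above are present on a system, the following Python sketch looks up each component's command-line tool on the PATH and, when NVSM is installed, runs its health check. The executable names in `DGX_TOOLS` and the helper names `find_tools` and `nvsm_health` are assumptions made for illustration and are not taken from this guide. `nvsm show health` normally needs root privileges, and `cachefilesd` may be installed in a directory that is not on a regular user's PATH.

```python
import shutil
import subprocess

# Executable names are assumptions about how each package exposes its CLI;
# adjust them to match your installation.
DGX_TOOLS = {
    "NVSM": "nvsm",
    "DCGM": "dcgmi",
    "NVIDIA GPU driver": "nvidia-smi",
    "Docker Engine": "docker",
    "NVIDIA Container Toolkit": "nvidia-container-cli",
    "MOFED": "ofed_info",
    "MST": "mst",
    "cachefilesd": "cachefilesd",
}


def find_tools(tools=DGX_TOOLS):
    """Map each component name to the path of its executable, or None if not on PATH."""
    return {name: shutil.which(cmd) for name, cmd in tools.items()}


def nvsm_health(timeout=600):
    """Run `nvsm show health` and return its text output, or None if nvsm is absent."""
    if shutil.which("nvsm") is None:
        return None
    result = subprocess.run(
        ["nvsm", "show", "health"],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,  # a failing health check should not raise; inspect the output instead
    )
    return result.stdout
```

Calling `find_tools()` returns a dictionary that can be printed to spot missing components. `nvsm_health()` returns the raw report text and leaves interpretation to the caller, because the report format is not described here.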
1.5. Additional Documentation
This section provides links to additional documentation.
‣ MIG User Guide
The new Multi-Instance GPU (MIG) feature allows the NVIDIA A100 GPU to be securely
partitioned into up to seven separate GPU Instances for CUDA applications.
‣ NGC Container Registry for DGX
Describes how to access the NGC container registry to use containerized, GPU-accelerated
deep learning applications on your DGX A100 system.
‣ NVSM Software User Guide
Contains instructions for using the NVIDIA System Management software.
‣ DCGM Software User Guide
Contains instructions for using the Data Center GPU Manager software.
1.6. Customer Support
Contact NVIDIA Enterprise Support for assistance in reporting, troubleshooting, or diagnosing
problems with your DGX A100 system. Also contact NVIDIA Enterprise Support for assistance
in moving the DGX A100 system.
‣ For contracted Enterprise Support questions, send an email to
enterprisesupport@nvidia.com.
‣ For additional details about how to obtain support, go to NVIDIA Enterprise Support.
Our support team can help collect appropriate information about your issue and involve
internal resources as needed.