Specification | Value |
---|---|
GPU | 8 x NVIDIA A100 Tensor Core GPUs |
System Memory | 1 TB DDR4 |
Storage | 15 TB NVMe SSD |
GPU Memory | 320 GB total (40 GB per GPU) |
CPU | 2 x 64-Core AMD EPYC 7742 |
Networking | 8 x 200 Gb/s InfiniBand or Ethernet |
Interconnect | NVIDIA NVLink |
Provides information about the hardware components of the DGX A100 system.
Details the two DGX A100 system models and their component specifications.
Details the physical dimensions and form factor of the DGX A100 system.
Outlines the power supply capabilities and configurations for the DGX A100 system.
Explains the N+N redundancy configuration for the DGX A100 power supply units.
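
As a rough illustration of what N+N redundancy means, the sketch below assumes a chassis with 2N power supplies in which any N working supplies can carry the full load. The function name and the six-supply figure in the usage line are illustrative assumptions, not values taken from this page.

```python
def survives_failures(total_psus: int, failed_psus: int) -> bool:
    """Return True if an N+N configuration can still carry the load.

    Assumes total_psus is even (2N) and that any N working supplies
    are enough to power the system.
    """
    if total_psus <= 0 or total_psus % 2 != 0:
        raise ValueError("N+N redundancy needs an even, positive PSU count")
    if not 0 <= failed_psus <= total_psus:
        raise ValueError("failed_psus must be between 0 and total_psus")
    required = total_psus // 2           # N supplies carry the full load
    working = total_psus - failed_psus
    return working >= required


# Hypothetical 3+3 layout: losing three supplies is tolerated, a fourth is not.
print(survives_failures(6, 3), survives_failures(6, 4))
```
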
Specifies the approved locking power cord types for the DGX A100 system.
Provides instructions on how to correctly use the locking power cords for the DGX A100.
Details the environmental operating conditions for the DGX A100 system.
Describes the connections and controls located on the front panel of the DGX A100.
Details the front panel components and their functions when the bezel is attached.
Details the front panel components and their functions when the bezel is removed.
Shows an image and describes the rear panel modules of the DGX A100.
Details the connections and controls on the motherboard of the DGX A100 system.
Illustrates the components located on the motherboard tray of the DGX A100.
Illustrates the components located on the GPU tray of the DGX A100 system.
Covers network connectivity, including cables and adapters for the DGX A100.
Details the network ports available on the DGX A100 system.
Lists supported network cables and adapters, determined by ConnectX firmware.
Presents an image illustrating the internal system topology of the DGX A100.
Describes the DGX OS software stack and its components pre-installed on the system.
Provides links to other relevant documentation for the DGX A100 system.
Details how to contact NVIDIA Enterprise Support for assistance with the DGX A100 system.
Explains how to connect to the DGX A100 console via direct or BMC remote connection.
Details how to establish a direct console connection using a display and keyboard.
Explains how to establish a remote console connection to the DGX A100 using the BMC.
Details how to establish an SSH connection to the DGX A100 operating system.
Walks through the initial system setup process after powering on the DGX A100 for the first time.
Provides instructions for updating the DGX A100 software to the latest version.
Explains how to enable the srp_daemon for Mellanox drivers, if needed.
Covers installation prerequisites and site information requirements for the DGX A100.
Provides information on registering the DGX A100 system.
Explains the concept and importance of obtaining an NGC account.
Details the specific startup and shutdown sequences for the DGX A100 system.
Provides considerations for a smooth startup process of the DGX A100 system.
Provides considerations for a safe and proper shutdown of the DGX A100 system.
Explains how to perform a health check using NVSM and verify the Docker and NVIDIA driver installation.
Provides instructions for running the DGX stress test using NVSM before production use.
Explains how to run NGC containers with GPU support on DGX A100 systems.
Details how to use the 'docker run --gpus' command for GPU-enabled containers.
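
A minimal sketch of how a `docker run --gpus` invocation might be assembled from a script. The helper name and the image reference `nvcr.io/<org>/<image>:<tag>` are placeholders, not names from the manual; only the `--gpus` flag itself comes from the text above.

```python
import shlex


def docker_gpu_command(image, gpus="all", extra_args=()):
    """Build an argv list for `docker run --gpus ...`.

    gpus may be "all" or a GPU count such as 2; it is passed through
    unchanged as the value of the --gpus flag.
    """
    return ["docker", "run", "--gpus", str(gpus), "--rm", *extra_args, image]


# Placeholder image reference; substitute a real NGC image and tag.
cmd = docker_gpu_command("nvcr.io/<org>/<image>:<tag>", gpus=2, extra_args=["-it"])
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be handed to `subprocess.run` without shell quoting issues.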
Explains using NVIDIA Container Runtime for Docker to run GPU-accelerated containers.
Addresses CPU mitigations for side-channel vulnerabilities and their performance impact.
Shows how to check if CPU mitigations are enabled or disabled on the DGX system.
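
One generic way to inspect mitigation status on a Linux system is to read the kernel's sysfs vulnerability files. The sketch below assumes that interface is available; it is not necessarily the procedure the manual describes, and the function names are illustrative.

```python
from pathlib import Path


def read_mitigation_status(base="/sys/devices/system/cpu/vulnerabilities"):
    """Return {vulnerability_name: status_text} from the sysfs directory.

    Returns an empty dict if the directory does not exist
    (for example on a non-Linux host).
    """
    status = {}
    base_path = Path(base)
    if not base_path.is_dir():
        return status
    for entry in sorted(base_path.iterdir()):
        if entry.is_file():
            status[entry.name] = entry.read_text().strip()
    return status


def vulnerable_entries(status):
    """List the entries whose status text begins with 'Vulnerable'."""
    return [name for name, text in status.items() if text.startswith("Vulnerable")]


for name, text in read_mitigation_status().items():
    print(f"{name}: {text}")
```
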
Provides instructions to disable CPU mitigations for improved performance on DGX nodes.
Provides instructions to re-enable CPU mitigations, restoring security hardening.
Explains how to manage the DGX crash dump feature using a provided script.
Details how to use the DGX crash dump configuration script to enable or disable dumps.
Describes connecting via Serial Over LAN to view console output during crash dumps.
Introduces the self-encrypting drive (SED) management software and its capabilities.
Provides steps to install the nv-disk-encrypt and optional TPM2 tools packages.
Explains DGX A100 BIOS setup controls for Trusted Computing features like TPM and Block SID.
Shows how to identify drives that support the Self-Encrypting Drive (SED) feature.
Details enabling TPM and disabling Block SID requests in the BIOS setup.
Explains how to initialize the DGX A100 system for drive encryption using nv-disk-encrypt.
Explains how to enable drive locking for SEDs after initialization using nv-disk-encrypt.
Demonstrates specifying drive/password mapping using a JSON file for initialization.
Explains how to determine which drives are eligible for self-encryption management.
Explains how to create JSON files for drive/password mapping and initialize the system.
Shows how to use -k and -r options to generate random passwords during initialization.
Details how to specify passwords manually when initializing drives.
Explains how to disable drive locks, allowing free read/write after power-on.
Provides instructions on how to export drive keys from the vault to a file.
Details how to securely erase data from DGX A100 system SSDs, including RAID destruction.
Explains how to clear the TPM contents to regain access after losing the TPM password.
Outlines steps for managing disk passwords, adding, or replacing drives in the system.
Explains how to recover from lost encryption keys, including factory-reset consequences.
Details how to configure network proxy settings for the DGX A100 system.
Explains how to set proxy addresses for the OS and general applications in /etc/environment.
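
A small sketch of the kind of lines that typically go into `/etc/environment` for a proxy. The proxy URL, the default `no_proxy` entries, and the choice to emit both lowercase and uppercase variable names are assumptions made for illustration.

```python
def proxy_environment_lines(proxy_url, no_proxy=("localhost", "127.0.0.1")):
    """Return KEY="value" lines suitable for appending to /etc/environment."""
    lines = []
    for name in ("http_proxy", "https_proxy"):
        # Some tools read the lowercase name, others the uppercase one.
        lines.append(f'{name}="{proxy_url}"')
        lines.append(f'{name.upper()}="{proxy_url}"')
    joined = ",".join(no_proxy)
    lines.append(f'no_proxy="{joined}"')
    lines.append(f'NO_PROXY="{joined}"')
    return lines


# Hypothetical proxy address.
print("\n".join(proxy_environment_lines("http://proxy.example.com:3128")))
```
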
Details how to configure proxy settings specifically for the apt package manager.
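
For apt, proxy settings are usually expressed as `Acquire::<scheme>::Proxy` directives in a file under `/etc/apt/apt.conf.d/`. The sketch below generates such a fragment; the proxy URL is hypothetical, and the target file name is left to the reader because this page does not state it.

```python
def apt_proxy_config(proxy_url):
    """Return apt.conf text that routes http and https fetches via proxy_url."""
    return "".join(
        f'Acquire::{scheme}::Proxy "{proxy_url}";\n' for scheme in ("http", "https")
    )


# Hypothetical proxy address.
print(apt_proxy_config("http://proxy.example.com:3128/"), end="")
```
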
Explains configuring proxy environment variables for Docker to access NGC registry.
Explains how to change the default Docker IP addresses to avoid network conflicts.
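
A hedged sketch of one common approach: setting the `bip` key in Docker's `daemon.json` after checking that the chosen bridge subnet does not overlap networks already in use. The subnets shown are examples only, and the manual may configure Docker through a different file or option.

```python
import ipaddress
import json


def docker_daemon_config(bridge_cidr, reserved_networks=()):
    """Return daemon.json text that sets the docker0 bridge address.

    bridge_cidr is the bridge IP with prefix length, e.g. "192.168.127.1/24".
    Raises ValueError if it overlaps any network in reserved_networks.
    Assumes all addresses are IPv4.
    """
    bridge = ipaddress.ip_interface(bridge_cidr)
    for net in reserved_networks:
        if bridge.network.overlaps(ipaddress.ip_network(net)):
            raise ValueError(f"{bridge_cidr} overlaps reserved network {net}")
    return json.dumps({"bip": str(bridge)}, indent=2)


# Example values only: keep the bridge clear of a site network on 172.17.0.0/16.
print(docker_daemon_config("192.168.127.1/24", ["172.17.0.0/16"]))
```
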
Lists required open ports on the firewall for DGX A100 system communication.
Specifies URLs and network access requirements for running NGC containers.
Explains how to set a static IP address for the BMC when DHCP is not supported.
Details setting a static BMC IP address using the ipmitool command-line utility.
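
The `ipmitool lan set` subcommands for a static address can be sketched as a list of argv lists. LAN channel 1 and the addresses in the usage line are assumptions; check the channel number and network values for the actual system before running anything.

```python
import ipaddress


def bmc_static_ip_commands(ip, netmask, gateway, channel=1):
    """Return the ipmitool commands that assign a static BMC address."""
    for value in (ip, netmask, gateway):
        ipaddress.IPv4Address(value)     # raises ValueError if malformed
    base = ["ipmitool", "lan", "set", str(channel)]
    return [
        base + ["ipsrc", "static"],            # switch from DHCP to static
        base + ["ipaddr", ip],
        base + ["netmask", netmask],
        base + ["defgw", "ipaddr", gateway],   # default gateway
    ]


# Example addresses only.
for argv in bmc_static_ip_commands("10.0.0.50", "255.255.255.0", "10.0.0.1"):
    print(" ".join(argv))
```
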
Describes setting a static BMC IP address through the system BIOS utility.
Explains how to configure static IP addresses for network interfaces from the Ubuntu command line.
Explains how to switch network ports between InfiniBand and Ethernet configurations.
Details starting Mellanox Software Tools and determining current port configurations.
Provides steps to switch port configurations using the mlxconfig command.
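
As an illustration, `mlxconfig` selects a port's protocol through the `LINK_TYPE_P<n>` setting (1 for InfiniBand, 2 for Ethernet). The sketch below only builds the command; the device path in the usage line is a placeholder, and the change normally takes effect only after a reboot or firmware reset.

```python
LINK_TYPES = {"ib": 1, "eth": 2}   # mlxconfig LINK_TYPE values


def mlxconfig_set_link_type(device, link_type, ports=(1,)):
    """Return an argv list that sets LINK_TYPE on the given ports of a device."""
    try:
        value = LINK_TYPES[link_type.lower()]
    except KeyError:
        raise ValueError("link_type must be 'ib' or 'eth'") from None
    settings = [f"LINK_TYPE_P{port}={value}" for port in ports]
    return ["mlxconfig", "-d", device, "set", *settings]


# Placeholder device path; list the real ones with the Mellanox Software Tools.
print(" ".join(mlxconfig_set_link_type("/dev/mst/<device>", "eth")))
```
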
Explains how to set filesystem quotas to limit disk space usage for NGC containers.
Explains how to switch the RAID level between RAID 0 and RAID 5 to trade storage capacity for redundancy.
Details configuring NVSM for custom drive partitioning and non-default RAID setups.
Provides instructions for updating the DGX A100 software via the NVIDIA public repository.
Details network connectivity checks needed before performing software updates.
Provides the step-by-step process for updating the DGX A100 software using apt.
Explains how to obtain the DGX A100 software ISO image and checksum file for restoration.
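
Once the ISO and its checksum file are downloaded, the image can be verified before use. This is a minimal sketch: the hash algorithm is a parameter because this page does not say which one the checksum file uses, and the function names are illustrative.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in chunks so a multi-gigabyte ISO is never fully in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex, algorithm="sha256"):
    """Compare the file's digest with the published hex string."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

A call such as `matches_checksum("dgx.iso", published_value, "md5")` would return `True` only if the download is intact; both the file name and the algorithm here are placeholders.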
Explains how to reimage the DGX A100 system remotely using the BMC.
Explains how to create bootable media (USB/DVD) for DGX A100 software installation.
Details creating a bootable USB flash drive using the dd command on Linux.
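
A cautious sketch of how the `dd` invocation might be assembled. The block size, the use of `sudo`, and both paths are assumptions; writing to the wrong device destroys its contents, so the helper only builds the command and leaves execution to the operator.

```python
def dd_write_command(iso_path, device, block_size=2048):
    """Return an argv list that writes iso_path to a whole block device."""
    if not device.startswith("/dev/"):
        raise ValueError("device must be a path under /dev/")
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return ["sudo", "dd", f"if={iso_path}", f"of={device}", f"bs={block_size}"]


# Placeholder paths: confirm the USB device name (for example with lsblk) first.
print(" ".join(dd_write_command("/path/to/dgx.iso", "/dev/sdX")))
```
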
Details creating a bootable USB flash drive using the Rufus utility on Windows.
Explains the process of reimaging the DGX A100 system using a USB flash drive.
Describes how to retain the RAID partition during OS installation or reimaging.
Explains the option to encrypt the DGX OS root filesystem during installation.
Describes booting into a live environment for debugging without modifying disks.
Details the option to perform an extensive check of the installation media for defects.
Provides steps to connect to the Baseboard Management Controller (BMC) via a web browser.
Describes the main controls and navigation within the BMC interface.
Details how to add or remove users and change BMC login credentials.
Explains how to access the DGX A100 console remotely via the KVM feature in the BMC.
Explains how to set up external user authentication services such as Active Directory or LDAP.
Details configuring platform event filters within the BMC settings.
Explains how to upload or generate SSL certificates for the BMC.
Describes how to view the SSL certificate details on the BMC SSL Settings page.
Provides information and steps for generating an SSL certificate within the BMC.
Details the requirements and steps for uploading an SSL certificate to the BMC.
Explains how to update the SBIOS certificate, often required for SSL authentication.
Provides instructions on how to access the System BIOS (SBIOS) setup utility.
Details how to set the system's boot order from the SBIOS setup or boot menu.
Explains how to access SBIOS settings via a local terminal using Serial-over-LAN (SOL).
Covers user-level security practices for protecting the DGX A100 from unauthorized access.
Recommends securing the BMC port with a dedicated management network and firewall.
Details security measures incorporated into the NVIDIA DGX A100 system.
Explains Secure Flash for preventing unsigned firmware from being installed on the DGX A100.
Describes the firmware encryption algorithm (AES-CBC) and key strength for DGX A100 firmware.
Explains the concept of signing to ensure firmware integrity.
Refers to configuring NVSM security for system management.
Explains how to securely delete data from DGX A100 SSDs to permanently destroy stored data.
Lists prerequisites for secure data deletion, including bootable media with DGX OS ISO.
Provides step-by-step instructions to securely delete data from DGX A100 system SSDs.
Lists the Redfish features supported by the DGX A100 system for management.
Describes methods for installing DGX A100 software on air-gapped systems.
Explains how to reimage an air-gapped DGX A100 system.
Details creating a local repository mirror for updating DGX systems in air-gapped environments.
Provides steps to create a repository mirror on a DGX OS 4 system.
Explains how to configure an air-gapped DGX OS 4 system to use the local repository mirror.
Explains how to configure an air-gapped DGX OS 5 system to use the local repository mirror.
Explains how to install Docker containers hosted on the NVIDIA NGC Container Registry.
Provides general safety information and precautions for using the DGX A100 server.
Explains safety symbols used in documentation and on the product, denoting CAUTION and WARNING.
Describes the intended application environments and suitability of the DGX A100.
Provides guidelines for selecting an appropriate site for installing the DGX A100 system.
Offers information on safe handling practices for the DGX A100 equipment to prevent injury or damage.
Details electrical precautions, including power and electrical warnings, and power cord requirements.
Provides warnings and instructions for safely accessing the DGX A100 system's interior.
Outlines installation guidelines and warnings related to mounting the DGX A100 system in a rack.
Provides information and precautions for handling electrostatic discharge (ESD) to protect components.
Discusses other potential hazards, including perchlorate material and nickel in the bezel.
Details the compliance of the DGX A100 system with US FCC regulations (Class A).
Explains the accreditation of TÜV Rheinland for US and Canadian certification.
Details compliance with Canadian Interference-Causing Equipment Regulation (Class A).
Outlines CE marking and compliance with EU directives for the DGX A100.
States that the product meets applicable EMC requirements for Class A information technology equipment (ITE).
Indicates INMETRO compliance for Brazil.
Mentions Voluntary Control Council for Interference (VCCI) compliance for Japan.
Details compliance with Korean regulations for Class A electromagnetic wave suitability equipment.
States no specific certification is needed for China due to power consumption.
Indicates Bureau of Standards, Metrology & Inspection (BSMI) and Taiwan RoHS compliance.
Details compliance with Customs Union Technical Regulations (CU TR) and Federal Agency of Communication.
States compliance with Israeli Standards Institution (SII) regulations.
Details India RoHS compliance and Bureau of Indian Standards (BIS) verification.
Lists South African Bureau of Standards (SABS) and NRCS compliance standards.
Details UK Conformity Assessed (UKCA) compliance with relevant UK regulations.