Nvidia DGX A100 User Manual

Nvidia DGX A100
118 pages
Quick Start and Basic Operation
NVIDIA DGX A100 DU-09821-001 _v01|29
4.8.2. Disabling CPU Mitigations
CAUTION: The following steps disable the CPU mitigations provided by the DGX OS Server
software.
1. Install the nv-mitigations-off package.
$ sudo apt install nv-mitigations-off -y
2. Reboot the system.
3. Verify CPU mitigations are disabled.
$ cat /sys/devices/system/cpu/vulnerabilities/*
The output should include several Vulnerable lines. See Determining the CPU Mitigation
State of the DGX System for example output.
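
The check in step 3 can also be scripted. The following is a minimal sketch, not part of the DGX OS Server software: the function name summarize_mitigations and its optional directory argument are illustrative assumptions. It relies only on the sysfs path shown above and on the kernel convention that each file holds a one-line status such as Vulnerable, Mitigation: ..., or Not affected.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with DGX OS): print each vulnerability file's
# status and a final tally of Vulnerable and Mitigation entries.
summarize_mitigations() {
    # Optional first argument overrides the directory; the default is the
    # sysfs path used in this section.
    dir="${1:-/sys/devices/system/cpu/vulnerabilities}"
    vulnerable=0
    mitigated=0
    for f in "$dir"/*; do
        # Skip anything that is not a regular file (including an unmatched glob).
        [ -f "$f" ] || continue
        status=$(cat "$f")
        case "$status" in
            Vulnerable*) vulnerable=$((vulnerable + 1)) ;;
            Mitigation*) mitigated=$((mitigated + 1)) ;;
        esac
        printf '%s: %s\n' "$(basename "$f")" "$status"
    done
    printf 'vulnerable=%d mitigated=%d\n' "$vulnerable" "$mitigated"
}
```

After sourcing the script, run summarize_mitigations with no arguments. After the mitigations are disabled and the system is rebooted, the vulnerable count on the last line would be expected to be greater than zero.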
4.8.3. Re-enabling CPU Mitigations
1. Remove the nv-mitigations-off package.
$ sudo apt purge nv-mitigations-off
2. Reboot the system.
3. Verify CPU mitigations are enabled.
$ cat /sys/devices/system/cpu/vulnerabilities/*
The output should include several lines that begin with Mitigation:. See Determining the
CPU Mitigation State of the DGX System for example output.
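
For unattended verification after the reboot, a pass/fail check may be more convenient than reading the output by eye. The sketch below is an assumption-laden illustration rather than an NVIDIA-provided tool: the function name mitigations_enabled, its optional directory argument, and its return codes are hypothetical. It treats any file whose content starts with Vulnerable as a failure.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with DGX OS).
# Returns 0 if no vulnerability file reports "Vulnerable",
# 1 if at least one does, and 2 if the directory is missing.
mitigations_enabled() {
    dir="${1:-/sys/devices/system/cpu/vulnerabilities}"
    if [ ! -d "$dir" ]; then
        printf 'Directory not found: %s\n' "$dir" >&2
        return 2
    fi
    # grep -l lists the files that contain a match; -s suppresses errors
    # about unreadable or nonexistent files. "|| true" keeps a no-match
    # result from aborting callers that run with "set -e".
    hits=$(grep -ls '^Vulnerable' "$dir"/*) || true
    if [ -n "$hits" ]; then
        printf 'Still reported as vulnerable:\n%s\n' "$hits"
        return 1
    fi
    printf 'No Vulnerable entries found\n'
    return 0
}
```

After sourcing the script, a call such as mitigations_enabled && echo OK prints OK only when no entry reports Vulnerable. Entries that read Not affected are treated as passing.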

Nvidia DGX A100 Specifications

General
GPU: 8 x NVIDIA A100 Tensor Core GPUs
System Memory: 1 TB DDR4
Storage: 15 TB NVMe SSD
GPU Memory: 320 GB total (40 GB per GPU)
CPU: 2 x 64-Core AMD EPYC 7742
Networking: 8 x 200 Gb/s InfiniBand or Ethernet
Interconnect: NVIDIA NVLink

Summary

NVIDIA DGX A100 System Introduction

Hardware Overview

Provides information about the hardware components of the DGX A100 system.

DGX A100 Models and Component Descriptions

Details the two DGX A100 system models and their component specifications.

Mechanical Specifications

Details the physical dimensions and form factor of the DGX A100 system.

Power Specifications

Outlines the power supply capabilities and configurations for the DGX A100 system.

Support for N+N Redundancy

Explains the N+N redundancy configuration for the DGX A100 power supply units.

DGX A100 Locking Power Cord Specification

Specifies the approved locking power cord types for the DGX A100 system.

Using the Locking Power Cords

Provides instructions on how to correctly use the locking power cords for the DGX A100.

Environmental Specifications

Details the environmental operating conditions for the DGX A100 system.

Front Panel Connections and Controls

Describes the connections and controls located on the front panel of the DGX A100.

With a Bezel

Details the front panel components and their functions when the bezel is attached.

With the Bezel Removed

Details the front panel components and their functions when the bezel is removed.

Rear Panel Modules

Shows an image and describes the rear panel modules of the DGX A100.

Motherboard Connections and Controls

Details the connections and controls on the motherboard of the DGX A100 system.

Motherboard Tray Components

Illustrates the components located on the motherboard tray of the DGX A100.

GPU Tray Components

Illustrates the components located on the GPU tray of the DGX A100 system.

Network Connections, Cables, and Adaptors

Covers network connectivity, including cables and adapters for the DGX A100.

Network Ports

Details the network ports available on the DGX A100 system.

Supported Network Cables and Adaptors

Lists supported network cables and adapters, determined by ConnectX firmware.

DGX A100 System Topology

Presents an image illustrating the internal system topology of the DGX A100.

DGX OS Software

Describes the DGX OS software stack and its components pre-installed on the system.

Additional Documentation

Provides links to other relevant documentation for the DGX A100 system.

Customer Support

Details how to contact NVIDIA Enterprise Support for assistance with the DGX A100 system.

Connecting to the DGX A100

Connecting to the Console

Explains how to connect to the DGX A100 console via direct or BMC remote connection.

Direct Connection

Details how to establish a direct console connection using a display and keyboard.

Remote Connection through the BMC

Explains how to establish a remote console connection to the DGX A100 using the BMC.

SSH Connection to the OS

Details how to establish an SSH connection to the DGX A100 operating system.

First Boot Setup

Setting up the System

Guides through the initial system setup process after powering on the DGX A100 for the first time.

Post Setup Tasks

Covers tasks to perform after first boot, including obtaining software updates and enabling the srp daemon.

Obtaining Software Updates

Provides instructions for updating the DGX A100 software to the latest version.

Enabling the srp Daemon

Explains how to enable the srp_daemon for Mellanox drivers, if needed.

Quick Start and Basic Operation

Installation and Configuration

Covers installation prerequisites and site information requirements for the DGX A100.

Registration

Provides information on registering the DGX A100 system.

Obtaining an NGC Account

Explains the concept and importance of obtaining an NGC account.

Turning DGX A100 On and Off

Details the specific startup and shutdown sequences for the DGX A100 system.

Startup Considerations

Provides considerations for a smooth startup process of the DGX A100 system.

Shutdown Considerations

Provides considerations for a safe and proper shutdown of the DGX A100 system.

Verifying Functionality - Quick Health Check

Explains how to perform a health check using NVSM and how to verify Docker and the NVIDIA driver.

Running the Pre-flight Test

Provides instructions for running the DGX stress test using NVSM before production use.

Running NGC Containers with GPU Support

Explains how to run NGC containers with GPU support on DGX A100 systems.

Using Native GPU Support

Details how to use the 'docker run --gpus' command for GPU-enabled containers.

Using the NVIDIA Container Runtime for Docker

Explains using NVIDIA Container Runtime for Docker to run GPU-accelerated containers.

Managing CPU Mitigations

Addresses CPU mitigations for side-channel vulnerabilities and their performance impact.

Determining the CPU Mitigation State of the DGX System

Shows how to check if CPU mitigations are enabled or disabled on the DGX system.

Disabling CPU Mitigations

Provides instructions to disable CPU mitigations for improved performance on DGX nodes.

Re-enabling CPU Mitigations

Provides instructions to re-enable CPU mitigations, restoring security hardening.

Additional Features and Instructions

Managing the DGX Crash Dump Feature

Explains how to manage the DGX crash dump feature using a provided script.

Using the Script

Details how to use the DGX crash dump configuration script to enable or disable dumps.

Connecting to Serial Over LAN to View the Console

Describes connecting via Serial Over LAN to view console output during crash dumps.

Managing the DGX A100 Self-Encrypting Drives

Overview

Introduces the self-encrypting drive (SED) management software and its capabilities.

Installing the Software

Provides steps to install the nv-disk-encrypt and optional TPM2 tools packages.

Configuring Trusted Computing

Explains DGX A100 BIOS setup controls for Trusted Computing features like TPM and Block SID.

Determining Whether Drives Support SID

Shows how to identify drives that support the Self-Encrypting Drive (SED) feature.

Enabling the TPM and Preventing the BIOS from Sending Block SID Requests

Details enabling TPM and disabling Block SID requests in the BIOS setup.

Initializing the System for Drive Encryption

Guides on initializing the DGX A100 system for drive encryption using nv-disk-encrypt.

Enabling Drive Locking

Explains how to enable drive locking for SEDs after initialization using nv-disk-encrypt.

Initialization Examples

Demonstrates specifying drive/password mapping using a JSON file for initialization.

Determining Which Drives Can be Managed as Self-Encrypting

Explains how to determine which drives are eligible for self-encryption management.

Creating the Drive/Password Mapping JSON Files and Using it to Initialize the System

Guides on creating JSON files for drive/password mapping and initializing the system.

Example 2: Generating Random Passwords

Shows how to use -k and -r options to generate random passwords during initialization.

Example 3: Specifying Passwords One at a Time When Prompted

Details how to specify passwords manually when initializing drives.

Disabling Drive Locking

Explains how to disable drive locks, allowing free read/write after power-on.

Exporting the Vault

Provides instructions on how to export drive keys from the vault to a file.

Erasing Your Data

Details how to securely erase data from DGX A100 system SSDs, including RAID destruction.

Clearing the TPM

Guides on clearing the TPM contents to regain access after losing the TPM password.

Changing Disk Passwords, Adding Disks, or Replacing Disks

Outlines steps for managing disk passwords, adding, or replacing drives in the system.

Recovering From Lost Keys

Explains how to recover from lost encryption keys, including factory-reset consequences.

Network Configuration

Configuring Network Proxies

Details how to configure network proxy settings for the DGX A100 system.

For the OS and Most Applications

Explains how to set proxy addresses for the OS and general applications in /etc/environment.

For apt

Details how to configure proxy settings specifically for the apt package manager.

For Docker

Explains configuring proxy environment variables for Docker to access NGC registry.

Configuring Docker IP Addresses

Guides on changing default Docker IP addresses to avoid network conflicts.

Open Ports

Lists required open ports on the firewall for DGX A100 system communication.

Connectivity Requirements for NGC Containers

Specifies URLs and network access requirements for running NGC containers.

Configuring a Static IP Address for the BMC

Explains how to set a static IP address for the BMC when DHCP is not supported.

Configuring a BMC Static Address by Using ipmitool

Details setting a static BMC IP address using the ipmitool command-line utility.

Configuring a BMC Static IP Address by Using the System BIOS

Describes setting a static BMC IP address through the system BIOS utility.

Configuring Static IP Addresses for the Network Ports

Guides on configuring static IP addresses for network interfaces from the Ubuntu command line.

Switching Between InfiniBand and Ethernet

Explains how to switch network ports between InfiniBand and Ethernet configurations.

Starting the Mellanox Software Tools and Determining the Current Port Configuration

Details starting Mellanox Software Tools and determining current port configurations.

Switching the Port Configuration

Provides steps to switch port configurations using the mlxconfig command.

Configuring Storage

Setting Filesystem Quotas

Explains how to set filesystem quotas to limit disk space usage for NGC containers.

Switching Between RAID 0 and RAID 5

Guides on switching the RAID level between RAID 0 and RAID 5 for storage capacity and redundancy.

Configuring Support for Custom Drive Partitioning

Details configuring NVSM for custom drive partitioning and non-default RAID setups.

Updating and Restoring the Software

Updating the DGX A100 Software

Provides instructions for updating the DGX A100 software via the NVIDIA public repository.

Connectivity Requirements for Software Updates

Details network connectivity checks needed before performing software updates.

Update Instructions

Provides the step-by-step process for updating the DGX A100 software using apt.

Restoring the DGX A100 Software Image

Covers restoring the DGX A100 software image, either remotely through the BMC or from a bootable USB flash drive.

Obtaining the DGX A100 Software ISO Image and Checksum File

Guides on obtaining the DGX A100 software ISO image and checksum file for restoration.

Remotely Reimaging the System

Explains how to reimage the DGX A100 system remotely using the BMC.

Creating a Bootable Installation Medium

Guides on creating bootable media (USB/DVD) for DGX A100 software installation.

Creating a Bootable USB Flash Drive by Using the dd Command

Details creating a bootable USB flash drive using the dd command on Linux.

Creating a Bootable USB Flash Drive by Using Akeo Rufus

Details creating a bootable USB flash drive using the Rufus utility on Windows.

Reimaging the System from a USB Flash Drive

Explains the process of reimaging the DGX A100 system using a USB flash drive.

Installation Options

Describes the options available when installing or reimaging the OS.

Retaining the RAID Partition While Installing the OS

Describes how to retain the RAID partition during OS installation or reimaging.

Advanced Installation Option (Encrypted Root - DGX OS 5 or Later)

Explains the option to encrypt the DGX OS root filesystem during installation.

Boot into Live Environment (DGX OS 5 or Later)

Describes booting into a live environment for debugging without modifying disks.

Check Disc for Defects (DGX OS 5 or Later)

Details the option to perform an extensive check of the installation media for defects.

Using the BMC

Connecting to the BMC

Provides steps to connect to the Baseboard Management Controller (BMC) via a web browser.

Overview of BMC Controls

Describes the main controls and navigation within the BMC interface.

Common BMC Tasks

Covers common BMC tasks, such as changing the login credentials and using the remote console.

Changing the BMC Login Credentials

Details how to add or remove users and change BMC login credentials.

Using the Remote Console

Explains how to access the DGX A100 console remotely via the KVM feature in the BMC.

Setting Up Active Directory or LDAP/E-Directory

Guides on setting up external user authentication services like Active Directory or LDAP.

Configuring Platform Event Filters

Details configuring platform event filters within the BMC settings.

Uploading or Generating SSL Certificates

Explains how to upload or generate SSL certificates for the BMC.

Viewing the SSL Certificate

Describes how to view the SSL certificate details on the BMC SSL Settings page.

Generating the SSL Certificate

Provides information and steps for generating an SSL certificate within the BMC.

Uploading the SSL Certificate

Details the requirements and steps for uploading an SSL certificate to the BMC.

Updating the SBIOS Certificate

Guides on updating the SBIOS certificate, often required for SSL authentication.

SBIOS Settings

Accessing the SBIOS Setup

Provides instructions on how to access the System BIOS (SBIOS) setup utility.

Configuring the Boot Order

Details how to set the system's boot order from the SBIOS setup or boot menu.

Configuring the local terminal to access the SBIOS settings screen

Explains how to access SBIOS settings via a local terminal using Serial-over-LAN (SOL).

Security

User Security Measures

Covers user-level security practices for protecting the DGX A100 from unauthorized access.

Securing the BMC Port

Recommends securing the BMC port with a dedicated management network and firewall.

System Security Measures

Details security measures incorporated into the NVIDIA DGX A100 system.

Secure Flash of DGX A100 Firmware

Explains Secure Flash for preventing unsigned firmware from being installed on the DGX A100.

Encryption

Describes the firmware encryption algorithm (AES-CBC) and key strength for DGX A100 firmware.

Signing

Explains the concept of signing to ensure firmware integrity.

NVSM Security

Refers to configuring NVSM security for system management.

Secure Data Deletion

Explains how to securely delete data from DGX A100 SSDs to permanently destroy stored data.

Prerequisites

Lists prerequisites for secure data deletion, including bootable media with DGX OS ISO.

Instructions

Provides step-by-step instructions to securely delete data from DGX A100 system SSDs.

Redfish APIs Support

Supported Redfish Features

Lists the Redfish features supported by the DGX A100 system for management.

Installing Software on Air-Gapped DGX A100 Systems

Installing NVIDIA DGX A100 Software

Describes methods for installing DGX A100 software on air-gapped systems.

Reimaging the System

Guides on reimaging an air-gapped DGX A100 system.

Creating a Local Mirror of the NVIDIA and Canonical Repositories

Details creating a local repository mirror for updating DGX systems in air-gapped environments.

Creating the Mirror in a DGX OS 4 System

Provides steps to create a repository mirror on a DGX OS 4 system.

Configuring the Target Air-Gapped DGX OS 4 System

Guides on configuring an air-gapped DGX OS 4 system to use the local repository mirror.

Configuring the Target Air-Gapped DGX OS 5 System

Guides on configuring an air-gapped DGX OS 5 system to use the local repository mirror.

Installing Docker Containers

Explains how to install Docker containers hosted on the NVIDIA NGC Container Registry.

Safety

Safety Information

Provides general safety information and precautions for using the DGX A100 server.

Safety Warnings and Cautions

Explains safety symbols used in documentation and on the product, denoting CAUTION and WARNING.

Intended Application Uses

Describes the intended application environments and suitability of the DGX A100.

Site Selection

Provides guidelines for selecting an appropriate site for installing the DGX A100 system.

Equipment Handling Practices

Offers information on safe handling practices for the DGX A100 equipment to prevent injury or damage.

Electrical Precautions

Details electrical precautions, including power and electrical warnings, and power cord requirements.

System Access Warnings

Provides warnings and instructions for safely accessing the DGX A100 system's interior.

Rack Mount Warnings

Outlines installation guidelines and warnings related to mounting the DGX A100 system in a rack.

Electrostatic Discharge

Provides information and precautions regarding electrostatic discharge (ESD) to protect components.

Other Hazards

Discusses other potential hazards, including perchlorate material and nickel in the bezel.

Compliance

United States

Details the compliance of the DGX A100 system with US FCC regulations (Class A).

United States/Canada

Explains the accreditation of TÜV Rheinland for US and Canadian certification.

Canada

Details compliance with Canadian Interference-Causing Equipment Regulation (Class A).

CE

Outlines CE marking and compliance with EU directives for the DGX A100.

Australia and New Zealand

States that the product meets applicable EMC requirements for Class A information technology equipment (I.T.E.).

Brazil

Indicates INMETRO compliance for Brazil.

Japan

Mentions Voluntary Control Council for Interference (VCCI) compliance for Japan.

South Korea

Details compliance with Korean regulations for Class A electromagnetic wave suitability equipment.

China

States no specific certification is needed for China due to power consumption.

Taiwan

Indicates Bureau of Standards, Metrology & Inspection (BSMI) and Taiwan RoHS compliance.

Russia/Kazakhstan/Belarus

Details compliance with Customs Union Technical Regulations (CU TR) and Federal Agency of Communication.

Israel

States compliance with Israeli Standards Institution (SII) regulations.

India

Details India RoHS compliance and Bureau of Indian Standards (BIS) verification.

South Africa

Lists South African Bureau of Standards (SABS) and NRCS compliance standards.

Great Britain (England, Wales, and Scotland)

Details UK Conformity Assessed (UKCA) compliance with relevant UK regulations.
