NVIDIA · November 2020

A100 80GB SXM

The NVIDIA A100 80GB SXM is a high-performance GPU designed for data centers, targeting AI, machine learning, and high-performance computing workloads. Built on the Ampere architecture, it offers significant improvements in memory capacity and bandwidth over its predecessors. The 80GB variant provides the extra memory headroom needed for large-scale models and datasets, making it well suited to demanding applications.

A100 80GB SXM
VRAM: 80 GB
FP32 TFLOPS: 19.5
CUDA Cores: 6,912
TDP: 400 W

Provider Marketplace

Cheapest: from $0.69/hour
Best Value: from $0.69/hour
Enterprise Choice: from $3.50/hour

All Cloud Providers

19 options available. All listings are on-demand with global availability; prices are the estimated cost per hour.

Provider                   Est. cost
FluidStack (cheapest)      $0.69/hour
(unnamed listing)          $0.69/hour
(unnamed listing)          $0.69/hour
(unnamed listing)          $0.69/hour
(unnamed listing)          $0.69/hour
Vast.ai                    $0.69/hour
Thunder Compute            $0.78/hour
Lambda Labs                $1.29/hour
(unnamed listing)          $1.36/hour
RunPod                     $1.39/hour
(unnamed listing)          $1.39/hour
(unnamed listing)          $1.39/hour
(unnamed listing)          $1.39/hour
(unnamed listing)          $1.42/hour
(unnamed listing)          $1.47/hour
(unnamed listing)          $1.60/hour
Vultr                      $2.40/hour
(unnamed listing)          $2.40/hour
(unnamed listing)          $3.50/hour
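
To put the hourly rates above in context, here is a rough sketch of what a sustained month of usage would cost, assuming 730 hours per month and the sample rates from the table above; actual bills depend on commitments, egress, and storage.

    HOURS_PER_MONTH = 730  # roughly 24 h x 30.4 days

    # Sample on-demand rates from the table above (USD per GPU-hour).
    rates = {
        "FluidStack": 0.69,
        "Thunder Compute": 0.78,
        "Lambda Labs": 1.29,
        "Vultr": 2.40,
    }

    for provider, hourly in sorted(rates.items(), key=lambda kv: kv[1]):
        monthly = hourly * HOURS_PER_MONTH
        print(f"{provider:<16} ${hourly:.2f}/hr  ~${monthly:,.0f}/month")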

Compute Performance

FP64: 9.7 TFLOPS (19.5 TFLOPS via FP64 Tensor Cores)
FP32: 19.5 TFLOPS
TF32: 156 TFLOPS dense, 312 TFLOPS with 2:4 sparsity
FP16: 312 TFLOPS dense, 624 TFLOPS with 2:4 sparsity
BF16: 312 TFLOPS dense, 624 TFLOPS with 2:4 sparsity
FP8: Not supported
INT8: 624 TOPS dense, 1,248 TOPS with 2:4 sparsity
INT4: 1,248 TOPS dense, 2,496 TOPS with 2:4 sparsity

Architecture

Microarchitecture: Ampere
Process Node: TSMC N7
Die Size: 826 mm²
Transistors: 54.2B
Compute Units: 108 SMs
Tensor Cores: 432 (3rd generation)
RT Cores: None
Matrix Engine: 3rd-gen Tensor Cores
Base Clock: 1095 MHz
Boost Clock: 1410 MHz
Transformer Engine: Not supported (introduced with Hopper)
Sparse Acceleration: Supported (2:4 structured sparsity)
Dynamic Precision: Supported (FP16/BF16/TF32/FP32/INT8/INT4)
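
As a minimal sketch of how the TF32 and BF16 Tensor Core paths are typically exercised from PyTorch on Ampere (the layer and tensor shapes here are placeholders, not a recommended configuration):

    import torch

    # TF32 is used for FP32 matmuls on Ampere when enabled; the default has
    # varied across PyTorch releases, so set it explicitly.
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

    model = torch.nn.Linear(4096, 4096).cuda()   # placeholder model
    x = torch.randn(8, 4096, device="cuda")

    # BF16 autocast routes the matmul through the 3rd-gen Tensor Cores.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        y = model(x)
    print(y.dtype)  # torch.bfloat16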

Memory & VRAM

Memory Type: HBM2e
Total Capacity: 80 GB
Bandwidth: 2,039 GB/s
Bus Width: 5120-bit
HBM Stacks: 5
ECC Support: Yes (inline)
Unified Memory: Yes (CUDA Unified Memory)
Compression: Not specified
NUMA Awareness: Not specified
Memory Pooling: Not supported
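
A back-of-envelope sketch of how the 80 GB capacity maps to model sizes, counting weights only; the per-parameter byte counts are common rules of thumb, not vendor figures, and activations, optimizer state, and framework overhead are ignored:

    # Rough VRAM estimate: weights only.
    VRAM_GB = 80

    def weights_gb(params_billion: float, bytes_per_param: int) -> float:
        return params_billion * 1e9 * bytes_per_param / 1024**3

    for params_b in (7, 13, 30, 70):
        fp16 = weights_gb(params_b, 2)   # FP16/BF16 weights
        int8 = weights_gb(params_b, 1)   # 8-bit quantized weights
        fits = "fits" if fp16 < VRAM_GB else "needs >1 GPU"
        print(f"{params_b:>3}B params: FP16 ~{fp16:6.1f} GB ({fits}), INT8 ~{int8:6.1f} GB")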

Connectivity & Scaling

Interconnect: NVLink
Generation: NVLink 3 (3rd generation)
Interconnect Bandwidth: 600 GB/s total NVLink bandwidth per GPU
PCIe Interface: PCIe Gen 4 x16
CXL Support: Not supported
Topology: Fully connected NVLink mesh (via HGX baseboard NVSwitches)
Max GPUs/Node: 8
Scale-Out: Yes (via InfiniBand HDR/NDR or RoCE v2)
GPUDirect RDMA: Yes
P2P Memory: Yes
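
A minimal check, assuming a multi-GPU HGX node with PyTorch installed, that peer-to-peer (NVLink/NVSwitch) access is visible between device pairs:

    import torch

    # On an HGX A100 baseboard every GPU pair should report peer access.
    n = torch.cuda.device_count()
    for src in range(n):
        for dst in range(n):
            if src != dst and not torch.cuda.can_device_access_peer(src, dst):
                print(f"GPU{src} -> GPU{dst}: no peer access")
    print(f"Checked {n} device(s).")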

Virtualization

MIG Support: Supported
MIG Partitions: 7 instances (max); see the sketch after this list
SR-IOV: Not supported
vGPU Readiness: Supported (NVIDIA vGPU)
K8s Readiness: Certified (NVIDIA GPU Operator)
GPU Sharing: MIG, time-slicing, MPS, vGPU
Virt Efficiency: Near bare-metal (vendor claim)
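
As an illustrative sketch of the MIG partitioning noted above, the nvidia-smi calls below enable MIG mode on GPU 0 and carve out a 1g.10gb slice (repeat, or pass a comma-separated profile list, to create up to seven). Exact profile names and behavior depend on the driver version, and the commands require root on an idle GPU:

    import subprocess

    def run(cmd):
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)

    run(["nvidia-smi", "-i", "0", "-mig", "1"])                     # enable MIG mode on GPU 0
    run(["nvidia-smi", "mig", "-i", "0", "-cgi", "1g.10gb", "-C"])  # create one GPU + compute instance
    run(["nvidia-smi", "-L"])                                       # list the resulting MIG devices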

Power & Efficiency

TDP: 400 W
Peak Power: 450 W
Idle Power: 50-70 W
Perf/Watt: ~0.05 TFLOPS FP32 per W and ~0.02 TFLOPS FP64 per W at the 400 W TDP (theoretical peak; varies by workload)
PSU Required: N/A (powered through the SXM4 socket)
Connectors: SXM4 edge connector (direct board power, no external PCIe power connectors)
Thermal Limits: Operation up to 85°C GPU temperature; requires high-performance liquid or forced-air cooling
Efficiency: N/A

Physical Design

Form Factor: SXM4 module
FHFL: N/A (not a PCIe add-in card)
Slot Width: N/A
Dimensions: 110 mm x 140 mm
Weight: 1.8-2.2 kg
Cooling: Passive module; cooled by chassis airflow or direct liquid cooling (cold plate)
Rack Density: Optimized for high-density GPU baseboards (HGX A100 4- and 8-GPU)

Thermals & Cooling

Airflow: Server chassis airflow required (not published)
Temp Range: 0°C to 45°C ambient (operating)
Throttling: Thermal-based clock reduction at the Tjunction limit
Noise Level: Not applicable (passive module)
Liquid Cooling: Optional (air-cooled HGX configurations are standard)
DC Heat: High (rack-scale deployment recommended)

Software Ecosystem

CUDA: CUDA 12.x supported
ROCm: Not supported
oneAPI: Not supported
PyTorch: Officially supported
TensorFlow: Officially supported
JAX: Supported via the CUDA backend
HuggingFace: Optimized (CUDA kernels available)
Triton Server: Supported
Docker: Official container images available
Compiler Stack: Mature CUDA compiler stack
Kernel Optim: Upstream Linux kernel support for NVIDIA datacenter GPUs documented
Driver Stability: Enterprise-grade stability
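
A quick sanity check of the CUDA/PyTorch stack on an A100 node; the device name and compute capability in the comments are the expected values, not guaranteed output:

    import torch

    print("CUDA available:", torch.cuda.is_available())
    print("Device:", torch.cuda.get_device_name(0))                    # expect "NVIDIA A100-SXM4-80GB"
    print("Compute capability:", torch.cuda.get_device_capability(0))  # (8, 0) for GA100
    print("BF16 supported:", torch.cuda.is_bf16_supported())
    print("Torch CUDA build:", torch.version.cuda)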

Server & Deployment

OEM Availability: Tier-1 OEMs (Dell, HPE, Supermicro)
Preconfigured: 4U 8-GPU systems
DGX/HGX: Core of the DGX A100 system and HGX A100 baseboard
Rack-Scale: NVSwitch within the node, InfiniBand scale-out across nodes
Edge Deploy: Not typically suited for edge deployments due to high TDP
Ref Architectures: NVIDIA DGX SuperPOD (HGX A100 reference designs)

System Compatibility

CPU Pairing: Integrated with the platform CPU (HGX/DGX architecture)
NUMA: Standard NUMA behavior
Required PCIe: Not applicable (SXM module)
Motherboard: Platform-specific (HGX baseboard with SXM4 sockets required)
Rack Power: Contact vendor for rack power planning
BIOS Limits: Not specified
CXL Ready: Not supported
OS Compat: RHEL and Ubuntu LTS supported; Windows Server supported

Benchmarks & Throughput

Structured Sparsity

Supported (up to 2x vs dense)
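
The 2:4 pattern itself is simple to illustrate; the sketch below prunes the two smallest-magnitude weights in every group of four. It is NumPy only, so it shows the pruning pattern rather than the sparse Tensor Core speedup:

    import numpy as np

    # 2:4 structured sparsity: in every contiguous group of 4 weights, keep the
    # 2 largest magnitudes and zero the rest. The sparse Tensor Cores then skip
    # the zeroed positions.
    def prune_2_4(weights: np.ndarray) -> np.ndarray:
        w = weights.reshape(-1, 4).copy()
        # indices of the two smallest-magnitude entries in each group of four
        drop = np.argsort(np.abs(w), axis=1)[:, :2]
        np.put_along_axis(w, drop, 0.0, axis=1)
        return w.reshape(weights.shape)

    w = np.random.randn(2, 8).astype(np.float32)
    print(prune_2_4(w))   # every group of 4 now has exactly 2 zeros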

Transformer Throughput

High transformer throughput via FP16/BF16 Tensor Cores; the FP8 Transformer Engine is a Hopper-generation feature and is not present on the A100.

Multi-GPU Scalability

Scaling Efficiency

Single GPU: The A100 80GB SXM offers high efficiency with its large memory capacity and high bandwidth, suitable for memory-intensive workloads.
2-GPU: With NVLink, two A100 80GB SXM GPUs achieve near-linear scaling thanks to high inter-GPU bandwidth.
4-GPU: Scaling remains near-linear with four GPUs, as NVSwitch manages communication between GPUs effectively.
8-GPU: Eight-GPU configurations maintain near-linear scaling, leveraging NVSwitch to minimize communication overhead.
64+ GPU: At scales beyond 64 GPUs, InfiniBand or Ethernet overhead becomes significant, requiring careful network architecture to maintain performance.

Scaling Characteristics

Cross-Node Latency: GPUDirect RDMA support minimizes cross-node latency, allowing efficient multi-node training.
Network Bottlenecks: Without NVLink, the host-to-device bridge is typically the primary bottleneck; with NVLink, the bottleneck shifts to the network interconnect at scale.
Parallelism: Supports data, model, pipeline, and tensor parallelism, and is compatible with frameworks such as DeepSpeed and Megatron-LM for distributed training (see the sketch below).
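
A minimal data-parallel skeleton for a single 8-GPU node, using PyTorch DistributedDataParallel over NCCL; the model and training loop are placeholders:

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Launch with: torchrun --nproc_per_node=8 train.py
    def main():
        dist.init_process_group("nccl")            # NCCL rides NVLink inside the node
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(4096, 4096).cuda() # placeholder model
        model = DDP(model, device_ids=[local_rank])
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

        for _ in range(10):                        # placeholder training loop
            x = torch.randn(32, 4096, device="cuda")
            loss = model(x).square().mean()
            opt.zero_grad()
            loss.backward()                        # gradients all-reduced across GPUs
            opt.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()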

Workload Readiness

LLM Training

Built on the Ampere architecture, the A100 80GB SXM is highly suitable for training large language models. Its high VRAM and NVLink support allow it to handle models up to roughly 70B parameters on a single node and to scale efficiently to 400B+ parameter models in multi-node setups.

LLM Inference

The A100 excels in LLM inference with its large VRAM providing ample KV cache headroom, enabling high token-per-second throughput. Ideal for serving large models efficiently.
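
KV cache headroom can be estimated with simple arithmetic; the layer and head counts below are illustrative (roughly a 70B-class model with grouped-query attention), not measured figures:

    # KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem
    def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
        return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1024**3

    # Illustrative 70B-class config: 80 layers, 8 KV heads, head_dim 128, FP16 cache.
    print(kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=4096, batch=16))  # ~20 GB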

Vision Training

With its 3rd-gen Tensor Cores, the A100 is highly effective for vision model training, supporting large batch sizes and complex architectures with ease.

Diffusion Models

The A100's large VRAM and Tensor Cores make it well-suited for training and inference of diffusion models, handling high computational demands efficiently.

Multimodal AI

The A100's versatility and large memory capacity make it ideal for multimodal AI tasks, supporting complex models that integrate vision, language, and other modalities.

Reinforcement Learning

The A100 is effective for reinforcement learning workloads, benefiting from its high throughput and ability to handle large state and action spaces.

HPC / Simulation

The A100 supports FP64 and FP64 Tensor Core computation, making it suitable for HPC simulations that require double precision.

Scientific Computing

With robust FP64 support, the A100 is well-suited for scientific computing tasks that demand high precision and large-scale computations.

Edge Inference

The A100's high TDP and form factor are not optimized for edge inference, where power efficiency and compactness are critical.

Real-Time Serving

The A100 is capable of real-time AI serving, leveraging its high throughput and large memory to handle demanding workloads efficiently.

Fine-Tuning

The A100's large VRAM supports full fine-tuning of large models, making it highly efficient for this purpose.

LoRA Efficiency

The A100 handles LoRA fine-tuning efficiently; its large VRAM also makes full fine-tuning practical for many model sizes, so LoRA is most valuable when fitting larger models or larger batch sizes onto a single GPU.
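
A minimal LoRA setup sketch, assuming the Hugging Face transformers and peft libraries; the model name and target module names are examples and depend on the model being tuned:

    import torch
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Example base model; swap in whichever checkpoint is being tuned.
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16, device_map="auto"
    )
    # Rank-16 adapters on the attention projections (module names vary by architecture).
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically well under 1% of total parameters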

Market Authority

MLPerf Ranking

Officially reported in MLPerf Training and Inference results (v1.0 and later), with A100 80GB SXM featured in submissions from NVIDIA and partner OEMs.

Cloud Adoption

Publicly confirmed by Google Cloud, Microsoft Azure, and Amazon Web Services (AWS) as available in their cloud GPU offerings.

Supercomputer Usage

Used in top supercomputers such as Selene (NVIDIA), Perlmutter (NERSC), and Leonardo (CINECA), as documented in the TOP500 list.

Research Citations

Widely cited in research papers for large-scale deep learning, including works published in NeurIPS, ICML, and Nature; Google Scholar returns thousands of results for 'A100 80GB SXM'.

Community Benchmarks

Featured in community benchmarks such as MLPerf, Hugging Face leaderboards, and open-source ML performance comparisons.

GitHub Support

Extensive support in major deep learning frameworks (PyTorch, TensorFlow, JAX) and libraries (DeepSpeed, Megatron-LM) with explicit optimizations for A100 80GB SXM, as seen in official and community GitHub repositories.

Enterprise Cases

NVIDIA and partners have published case studies highlighting A100 80GB SXM deployments in industries such as healthcare (Clara), finance, and automotive (Mercedes-Benz AI research).

Key Strengths

This GPU excels at AI training and inference, offering exceptional performance for deep learning frameworks. Its large memory capacity and high bandwidth make it particularly effective for large-scale models and data-intensive tasks. The A100's support for multi-instance GPU (MIG) technology allows for efficient resource partitioning, enhancing its versatility.

Limitations

While the A100 80GB SXM offers exceptional performance, its high power consumption and cooling requirements may limit its use to well-equipped data centers. The SXM form factor restricts compatibility to specific platforms, and its premium pricing can be a barrier for smaller organizations. Availability may also be constrained by high demand and production limitations.

Expert Insight

The A100 represents a strategic leap in AI compute. When comparing cloud providers, consider not just the hourly rate, but also the interconnect bandwidth (InfiniBand/NVLink) and regional availability which can significantly impact total cost of ownership for large-scale training.

Glossary Terms

FP32 TFLOPS
VRAM
TDP
Cores
Information updated daily. Cloud pricing subject to vendor availability.