NVIDIA · November 2020

A100

80GB PCIe

The NVIDIA A100 80GB PCIe is a high-performance GPU designed for data centers, targeting AI, machine learning, and high-performance computing workloads. It is part of the Ampere architecture, offering significant improvements in performance and memory capacity over its predecessors. The 80GB variant provides ample memory for large-scale models and datasets, making it ideal for demanding applications.

A100 80GB PCIe
VRAM
80 GB
FP32 TFLOPS
19.5 TFLOPS
CUDA Cores
6,912
TDP
300 W

Provider Marketplace

Cheapest: from $0.44/hour
Best Value: from $1.19/hour
Enterprise Choice: from $10.00/hour

All Cloud Providers

8 options available, all on-demand with global availability (estimated hourly cost):
Fluence (cheapest): $0.44/hour
RunPod: $1.39/hour
Microsoft Azure: $1.39/hour
Other listed providers: $1.19/hour, $1.39/hour, $5.00/hour, and two at $10.00/hour

Compute Performance

FP64: 9.7 TFLOPS (19.5 TFLOPS Tensor Core)
FP32: 19.5 TFLOPS
TF32: 312 TFLOPS (Sparse), 156 TFLOPS (Dense)
FP16: 624 TFLOPS (Sparse), 312 TFLOPS (Dense)
BF16: 624 TFLOPS (Sparse), 312 TFLOPS (Dense)
FP8: Not Supported
INT8: 1,248 TOPS (Sparse), 624 TOPS (Dense)
INT4: 2,496 TOPS (Sparse), 1,248 TOPS (Dense)
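Whether these Tensor Core rates are actually reached depends on the framework opting into the TF32/BF16 paths. A minimal PyTorch sketch (assuming a recent 2.x release; flag defaults vary by version):

```python
import torch

# TF32 accelerates FP32 matmuls on Ampere Tensor Cores; it is disabled for
# matmul by default in recent PyTorch releases, so A100 users often enable it.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# BF16 autocast routes GEMMs through the BF16 Tensor Core path.
model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(8192, 4096, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)
```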

Architecture

Microarchitecture: Ampere
Process Node: TSMC N7
Die Size: 826 mm²
Transistors: 54.2B
Compute Units: 108 SMs
Tensor Cores: 432 (3rd Gen)
RT Cores: None
Matrix Engine: Tensor Cores
Base Clock: 765 MHz
Boost Clock: 1,410 MHz
Transformer Engine: Not Supported (introduced with Hopper)
Sparse Acceleration: Supported (2:4 structured sparsity)
Dynamic Precision: Supported (FP16/FP32/INT8/INT4)

Memory & VRAM

Memory Type: HBM2e
Total Capacity: 80 GB
Bandwidth: 1,935 GB/s
Bus Width: 5,120-bit
HBM Stacks: 5
ECC Support: Yes (Inline)
Unified Memory: Yes (CUDA Unified Memory)
Compression: Not Published
NUMA Awareness: Not Published
Memory Pooling: Not Supported
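For capacity planning against the 80 GB of HBM2e, a rough back-of-the-envelope sketch in Python; the per-parameter byte counts are common rules of thumb for mixed-precision training with Adam, not vendor figures:

```python
# Rough sketch: will a model fit in 80 GB for inference or full fine-tuning?
def estimate_vram_gb(params_b: float, training: bool = False) -> float:
    bytes_per_param = 2                      # FP16/BF16 weights
    if training:
        # FP32 master weights + Adam moments + gradients (mixed precision)
        bytes_per_param += 4 + 8 + 2
    return params_b * 1e9 * bytes_per_param / 1024**3

for size in (7, 13, 30, 70):
    print(f"{size}B: inference ~{estimate_vram_gb(size):.0f} GB, "
          f"training ~{estimate_vram_gb(size, training=True):.0f} GB")
```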

Connectivity & Scaling

Interconnect: PCIe
Generation: PCIe Gen 4
Interconnect Bandwidth: 31.5 GB/s (PCIe Gen 4 x16, per direction)
PCIe Interface: PCIe Gen 4 x16
CXL Support: Not Supported
Topology: PCIe switch or CPU root complex
Max GPUs/Node: 8
Scale-Out: Yes (via InfiniBand or Ethernet)
GPUDirect RDMA: Yes
P2P Memory: Yes (via PCIe BAR1, limited performance)
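A quick way to confirm PCIe peer-to-peer reachability from PyTorch (a sketch; actual transfer speed remains bounded by PCIe Gen 4 rather than NVLink):

```python
import torch

# Check which GPU pairs can access each other's memory over PCIe.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")
```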

Virtualization

MIG Support: Supported
MIG Partitions: 7 instances (max)
SR-IOV: Not Supported
vGPU Readiness: Supported (NVIDIA vGPU)
K8s Readiness: Certified (NVIDIA GPU Operator)
GPU Sharing: MIG, Time-Slicing, MPS, vGPU
Virt Efficiency: Near bare-metal (vendor claim)
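A hedged sketch of carving the card into MIG instances with nvidia-smi, driven from Python; it assumes root privileges, an idle GPU, and that the chosen profile names exist on the installed driver (verify with `nvidia-smi mig -lgip`):

```python
import subprocess

def run(cmd: str) -> None:
    # Print and execute an nvidia-smi command; raises on failure.
    print("$", cmd)
    subprocess.run(cmd.split(), check=True)

run("nvidia-smi -i 0 -mig 1")                       # enable MIG mode on GPU 0
run("nvidia-smi mig -i 0 -cgi 3g.40gb,3g.40gb -C")  # two instances + compute instances
run("nvidia-smi -L")                                # list resulting MIG devices
```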

Power & Efficiency

TDP: 300 W
Peak Power: 320-340 W
Idle Power: 35-50 W
Perf / Watt: ~0.065 FP32 TFLOPS/W (19.5 TFLOPS / 300 W)
PSU Required: N/A
Connectors: 1x 8-pin PCIe
Thermal Limits: Max GPU temperature 85°C
Efficiency: N/A

Physical Design

Form Factor: PCIe card
FHFL: Full Height, Full Length
Slot Width: Double
Dimensions: 267 mm x 112 mm
Weight: 1.8–2.2 kg
Cooling: Passive
Rack Density: Standard PCIe server GPU density

Thermals & Cooling

Airflow: Requires front-to-back chassis airflow (required airflow rate not published)
Temp Range: 0°C to 45°C
Throttling: Thermal-based clock reduction at the Tjunction limit
Noise Level: Not Applicable (passive module)
Liquid Cooling: Air-cooled
DC Heat: High (rack-scale deployment recommended)

Software Ecosystem

CUDA: CUDA 12.x supported
ROCm: Not Supported
oneAPI: Not Supported
PyTorch: Officially supported
TensorFlow: Officially supported
JAX: Supported via CUDA backend
HuggingFace: Optimized (CUDA kernels available)
Triton Server: Supported
Docker: Official container images available
Compiler Stack: Mature CUDA compiler stack
Kernel Optim: Upstream Linux kernel support for NVIDIA datacenter GPUs documented
Driver Stability: Enterprise-grade stability
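Before enabling Ampere-specific paths, it is worth confirming the runtime actually reports compute capability 8.0 (sm_80). A short PyTorch check:

```python
import torch

# Confirm an A100-class device is visible before relying on TF32/BF16 paths.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, f"cc {props.major}.{props.minor}",
          f"{props.total_memory / 1024**3:.0f} GB")
    assert (props.major, props.minor) >= (8, 0), "Ampere features unavailable"
```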

Server & Deployment

OEM Availability: Tier-1 OEMs: Dell, HPE, Supermicro
Preconfigured: 2U/4U universal GPU servers
DGX/HGX: Not used in DGX systems or HGX baseboards (those use SXM modules)
Rack-Scale: InfiniBand scale-out
Edge Deploy: Limited suitability for edge deployment due to high TDP
Ref Architectures: NVIDIA MGX, SuperPOD

System Compatibility

CPU Pairing: Dual-socket Intel Xeon Scalable or AMD EPYC 7003/9004 class recommended
NUMA: Standard NUMA behavior
Required PCIe: PCIe Gen 4 x16 recommended
Motherboard: Full-length, double-width PCIe Gen 4 x16 slot required
Rack Power: Contact vendor for rack power planning
BIOS Limits: Above 4G decoding and Resizable BAR recommended; SR-IOV support not published
CXL Ready: No CXL memory expansion
OS Compat: Supported on major Linux distributions (RHEL, Ubuntu LTS); Windows Server supported

Benchmarks & Throughput

Structured Sparsity

Supported (up to 2x vs dense)

Transformer Throughput

High transformer throughput via third-generation Tensor Cores (no dedicated Transformer Engine, which was introduced with the Hopper generation)

Multi-GPU Scalability

Scaling Efficiency

Single GPU: The A100 80GB PCIe offers high efficiency for single-GPU tasks, leveraging its large memory capacity and high compute throughput.
2-GPU: Scaling between two GPUs is limited by PCIe Gen 4 bandwidth, roughly 32 GB/s per direction, unless an optional NVLink bridge links the pair.
4-GPU: Scaling across four GPUs is further constrained by PCIe lane contention, with diminishing returns as more GPUs share bandwidth.
8-GPU: Scaling to eight GPUs is significantly limited by PCIe bandwidth, as there is no NVLink/NVSwitch fabric to provide higher inter-GPU communication speeds.
64+ GPU: At large scales, InfiniBand or Ethernet overhead becomes a factor, with network latency and bandwidth affecting distributed training efficiency.

Scaling Characteristics

Cross-Node Latency: GPUDirect RDMA support helps reduce cross-node latency, but performance still depends on the network fabric, such as InfiniBand or high-speed Ethernet.
Network Bottlenecks: The primary bottleneck is the lack of an NVLink fabric, leading to reliance on PCIe bandwidth for inter-GPU communication.
Parallelism: Supports data, model, pipeline, and tensor parallelism, and is compatible with frameworks like DeepSpeed and Megatron-LM for distributed training, as sketched below.
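A minimal data-parallel sketch for PCIe-attached A100s, assuming a `torchrun --nproc_per_node=8 train.py` launch; NCCL uses PCIe within the node and GPUDirect RDMA across nodes when available:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets MASTER_ADDR/PORT and LOCAL_RANK for each worker.
    dist.init_process_group("nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(4096, 4096).cuda()
    ddp_model = DDP(model, device_ids=[rank])

    opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)
    x = torch.randn(64, 4096, device="cuda")
    loss = ddp_model(x).pow(2).mean()   # dummy loss; gradients all-reduce over PCIe
    loss.backward()
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```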

Workload Readiness

LLM Training

The A100 80GB PCIe, based on the Ampere architecture, is suitable for training large models up to roughly 70B parameters on a single node and can scale to 400B+ parameter models in multi-node setups thanks to its high VRAM and high-speed scale-out networking (InfiniBand or Ethernet).

LLM Inference

Highly efficient for inference with its large VRAM allowing for substantial KV cache, supporting high token-per-second throughput for large models.
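KV-cache headroom is the main reason the 80 GB card helps inference. A rough sizing sketch; the layer and head counts below describe a hypothetical 70B-class model in FP16, not a specific checkpoint:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes * seq_len * batch
def kv_cache_gb(layers=80, kv_heads=8, head_dim=128,
                seq_len=4096, batch=8, bytes_per_elem=2):
    total = 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len * batch
    return total / 1024**3

print(f"KV cache ≈ {kv_cache_gb():.1f} GB")  # leaves the rest of the 80 GB for weights
```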

Vision Training

Excellent for vision training tasks due to its large VRAM and Tensor Cores, enabling efficient processing of large batch sizes and complex models.

Diffusion Models

Well-suited for diffusion models, leveraging its Tensor Cores and large memory to handle the computational demands of these models efficiently.

Multimodal AI

Capable of handling multimodal AI workloads effectively, thanks to its ample VRAM and versatile architecture supporting diverse data types.

Reinforcement Learning

Effective for reinforcement learning tasks, providing the necessary computational power and memory bandwidth for complex simulations and model updates.

HPC / Simulation

Strong support for HPC simulations with robust FP64 performance, making it suitable for scientific and engineering applications requiring high precision.

Scientific Computing

Ideal for scientific computing tasks, offering excellent double precision performance and large memory capacity for data-intensive computations.

Edge Inference

Not optimal for edge inference due to its high power consumption and large form factor, better suited for data center environments.

Real-Time Serving

Capable of real-time AI serving with high throughput and low latency, supported by its powerful Tensor Cores and large memory capacity.

Fine-Tuning

Highly efficient for full fine-tuning tasks, leveraging its large VRAM to manage extensive model parameters and gradients.

LoRA Efficiency

Efficient for LoRA fine-tuning: adapter training updates only a small fraction of parameters, so the 80 GB of VRAM comfortably holds the frozen base model plus adapters while maintaining throughput.
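A hedged PEFT-style configuration sketch; the checkpoint name and target modules are illustrative placeholders, and the exact arguments depend on the installed `peft` and `transformers` versions:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical checkpoint name; swap in the model actually being tuned.
base = AutoModelForCausalLM.from_pretrained(
    "example-org/example-7b", torch_dtype=torch.bfloat16
).cuda()

config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # module names vary by architecture
)
model = get_peft_model(base, config)
model.print_trainable_parameters()          # typically <1% of base parameters
```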

Market Authority

MLPerf Ranking

The NVIDIA A100 80GB PCIe is officially listed in MLPerf Training and Inference results (v1.1, v2.0, v2.1, v3.0) as a tested system by NVIDIA and partners. Results are published for both single-node and multi-node configurations.

Cloud Adoption

NVIDIA and hyperscalers (AWS, Google Cloud, Microsoft Azure) publicly confirm availability of A100-based instances (e.g., AWS P4d, Azure ND A100 v4, Google Cloud A2), and the 80GB PCIe card is also offered by specialist GPU clouds.

Supercomputer Usage

A100 GPUs are deployed in top supercomputers such as Perlmutter (NERSC), Selene (NVIDIA), and Leonardo (CINECA), as confirmed by official system documentation and TOP500 listings.

Research Citations

Thousands of research papers on arXiv and IEEE Xplore explicitly reference the use of NVIDIA A100 80GB PCIe for deep learning and HPC workloads (search: 'A100 80GB PCIe').

Community Benchmarks

A100 80GB PCIe results are included in open community benchmarks such as MLPerf, DAWNBench, and Hugging Face leaderboards, with users posting reproducible results.

GitHub Support

Widespread support for A100 80GB PCIe in major deep learning frameworks (PyTorch, TensorFlow, JAX) and libraries (DeepSpeed, Megatron-LM, Hugging Face Transformers) with explicit optimization flags and documentation.

Enterprise Cases

NVIDIA and partners (e.g., Microsoft, Oracle, Dell) have published enterprise case studies highlighting A100 80GB PCIe deployments for AI training, inference, and HPC workloads.

Key Strengths

This GPU excels at AI training and inference, offering exceptional performance for deep learning frameworks like TensorFlow and PyTorch. Its large memory capacity and high bandwidth make it particularly effective for large-scale models and data-intensive tasks. The A100's support for multi-instance GPU (MIG) technology allows for efficient resource partitioning.

Limitations

While the A100 80GB PCIe offers excellent performance, it lacks the NVLink/NVSwitch fabric of the SXM variant (only an optional NVLink bridge can link pairs of cards), which can be a limitation for workloads requiring heavy inter-GPU communication. Its high power consumption necessitates adequate power delivery and cooling infrastructure, and availability can be constrained by high demand and production limits.

Expert Insight

The A100 represents a strategic leap in AI compute. When comparing cloud providers, consider not just the hourly rate, but also the interconnect bandwidth (InfiniBand/NVLink) and regional availability which can significantly impact total cost of ownership for large-scale training.
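To make that concrete, a toy cost comparison in Python; the scaling efficiencies are illustrative assumptions, not measured values:

```python
# A cheaper hourly rate can lose to a better interconnect once scaling
# efficiency is factored in.
def job_cost(ideal_gpu_hours: float, rate_per_gpu_hour: float, scaling_efficiency: float) -> float:
    return ideal_gpu_hours / scaling_efficiency * rate_per_gpu_hour

ideal_gpu_hours = 10_000
print("Provider A ($1.19/h, 70% eff):", f"${job_cost(ideal_gpu_hours, 1.19, 0.70):,.0f}")
print("Provider B ($1.39/h, 90% eff):", f"${job_cost(ideal_gpu_hours, 1.39, 0.90):,.0f}")
```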

Glossary Terms

FP32 TFLOPS
VRAM
TDP
Cores
Information updated daily. Cloud pricing subject to vendor availability.