NVIDIA · November 2020

A100

80GB PCIe

The NVIDIA A100 80GB PCIe is a high-performance GPU designed for data centers, targeting AI, machine learning, and high-performance computing workloads. It is part of the Ampere architecture, offering significant improvements in performance and memory capacity over its predecessors. The 80GB variant provides ample memory for large-scale models and datasets, making it ideal for demanding applications.

A100 80GB PCIe
VRAM
80 GB
FP32 TFLOPS
19.5 TFLOPS
CUDA Cores
6,912
TDP
300 W

Provider Marketplace

Cheapest: from $0.44/hour
Best Value: from $1.19/hour
Enterprise Choice: from $10.00/hour

All Cloud Providers

8 options available, all on-demand with global availability (estimated hourly cost):
Fluence (cheapest): $0.44/hour
RunPod: $1.39/hour
Microsoft Azure: $1.39/hour
Other listed providers: $1.19/hour, $1.39/hour, $5.00/hour, and two at $10.00/hour

Compute Performance

FP64: 9.7 TFLOPS (19.5 TFLOPS Tensor Core)
FP32: 19.5 TFLOPS
TF32: 312 TFLOPS (Sparse), 156 TFLOPS (Dense)
FP16: 624 TFLOPS (Sparse), 312 TFLOPS (Dense)
BF16: 624 TFLOPS (Sparse), 312 TFLOPS (Dense)
FP8: Not Supported
INT8: 1,248 TOPS (Sparse), 624 TOPS (Dense)
INT4: 2,496 TOPS (Sparse), 1,248 TOPS (Dense)
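Whether these Tensor Core rates are actually reached depends on the framework opting into the TF32/BF16 paths. A minimal PyTorch sketch (assuming a recent 2.x release; flag defaults vary by version):

```python
import torch

# TF32 accelerates FP32 matmuls on Ampere Tensor Cores; it is disabled for
# matmul by default in recent PyTorch releases, so A100 users often enable it.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# BF16 autocast routes GEMMs through the BF16 Tensor Core path.
model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(8192, 4096, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)
```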

Architecture

Microarchitecture: Ampere
Process Node: TSMC N7
Die Size: 826 mm²
Transistors: 54.2B
Compute Units: 108 SMs
Tensor Cores: 432 (3rd Gen)
RT Cores: None
Matrix Engine: Tensor Cores
Base Clock: 765 MHz
Boost Clock: 1,410 MHz
Transformer Engine: Not Supported (introduced with Hopper)
Sparse Acceleration: Supported (2:4 structured sparsity)
Dynamic Precision: Supported (FP16/FP32/INT8/INT4)

Memory & VRAM

Memory Type: HBM2e
Total Capacity: 80 GB
Bandwidth: 1,935 GB/s
Bus Width: 5,120-bit
HBM Stacks: 5
ECC Support: Yes (Inline)
Unified Memory: Yes (CUDA Unified Memory)
Compression: Not Published
NUMA Awareness: Not Published
Memory Pooling: Not Supported
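For capacity planning against the 80 GB of HBM2e, a rough back-of-the-envelope sketch in Python; the per-parameter byte counts are common rules of thumb for mixed-precision training with Adam, not vendor figures:

```python
# Rough sketch: will a model fit in 80 GB for inference or full fine-tuning?
def estimate_vram_gb(params_b: float, training: bool = False) -> float:
    bytes_per_param = 2                      # FP16/BF16 weights
    if training:
        # FP32 master weights + Adam moments + gradients (mixed precision)
        bytes_per_param += 4 + 8 + 2
    return params_b * 1e9 * bytes_per_param / 1024**3

for size in (7, 13, 30, 70):
    print(f"{size}B: inference ~{estimate_vram_gb(size):.0f} GB, "
          f"training ~{estimate_vram_gb(size, training=True):.0f} GB")
```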

Connectivity & Scaling

Interconnect: PCIe
Generation: PCIe Gen 4
Interconnect Bandwidth: 31.5 GB/s (PCIe Gen 4 x16, per direction)
PCIe Interface: PCIe Gen 4 x16
CXL Support: Not Supported
Topology: PCIe switch or CPU root complex
Max GPUs/Node: 8
Scale-Out: Yes (via InfiniBand or Ethernet)
GPUDirect RDMA: Yes
P2P Memory: Yes (via PCIe BAR1, limited performance)
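A quick way to confirm PCIe peer-to-peer reachability from PyTorch (a sketch; actual transfer speed remains bounded by PCIe Gen 4 rather than NVLink):

```python
import torch

# Check which GPU pairs can access each other's memory over PCIe.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")
```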

Virtualization

MIG Support: Supported
MIG Partitions: 7 instances (max)
SR-IOV: Not Supported
vGPU Readiness: Supported (NVIDIA vGPU)
K8s Readiness: Certified (NVIDIA GPU Operator)
GPU Sharing: MIG, Time-Slicing, MPS, vGPU
Virt Efficiency: Near bare-metal (vendor claim)
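A hedged sketch of carving the card into MIG instances with nvidia-smi, driven from Python; it assumes root privileges, an idle GPU, and that the chosen profile names exist on the installed driver (verify with `nvidia-smi mig -lgip`):

```python
import subprocess

def run(cmd: str) -> None:
    # Print and execute an nvidia-smi command; raises on failure.
    print("$", cmd)
    subprocess.run(cmd.split(), check=True)

run("nvidia-smi -i 0 -mig 1")                       # enable MIG mode on GPU 0
run("nvidia-smi mig -i 0 -cgi 3g.40gb,3g.40gb -C")  # two instances + compute instances
run("nvidia-smi -L")                                # list resulting MIG devices
```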

Power & Efficiency

TDP: 300 W
Peak Power: 320-340 W
Idle Power: 35-50 W
Perf / Watt: ~0.065 FP32 TFLOPS/W (19.5 TFLOPS / 300 W)
PSU Required: N/A
Connectors: 1x 8-pin PCIe
Thermal Limits: Max GPU temperature 85°C
Efficiency: N/A

Physical Design

Form Factor: PCIe card
FHFL: Full Height, Full Length
Slot Width: Double
Dimensions: 267 mm x 112 mm
Weight: 1.8–2.2 kg
Cooling: Passive
Rack Density: Standard PCIe server GPU density

Thermals & Cooling

Airflow: Requires front-to-back chassis airflow (required airflow rate not published)
Temp Range: 0°C to 45°C
Throttling: Thermal-based clock reduction at the Tjunction limit
Noise Level: Not Applicable (passive module)
Liquid Cooling: Air-cooled
DC Heat: High (rack-scale deployment recommended)

Software Ecosystem

CUDA: CUDA 12.x supported
ROCm: Not Supported
oneAPI: Not Supported
PyTorch: Officially supported
TensorFlow: Officially supported
JAX: Supported via CUDA backend
HuggingFace: Optimized (CUDA kernels available)
Triton Server: Supported
Docker: Official container images available
Compiler Stack: Mature CUDA compiler stack
Kernel Optim: Upstream Linux kernel support for NVIDIA datacenter GPUs documented
Driver Stability: Enterprise-grade stability
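Before enabling Ampere-specific paths, it is worth confirming the runtime actually reports compute capability 8.0 (sm_80). A short PyTorch check:

```python
import torch

# Confirm an A100-class device is visible before relying on TF32/BF16 paths.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, f"cc {props.major}.{props.minor}",
          f"{props.total_memory / 1024**3:.0f} GB")
    assert (props.major, props.minor) >= (8, 0), "Ampere features unavailable"
```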

Server & Deployment

OEM Availability: Tier-1 OEMs: Dell, HPE, Supermicro
Preconfigured: 2U/4U universal GPU servers
DGX/HGX: Not used in DGX systems or HGX baseboards (those use SXM modules)
Rack-Scale: InfiniBand scale-out
Edge Deploy: Limited suitability for edge deployment due to high TDP
Ref Architectures: NVIDIA MGX, SuperPOD

System Compatibility

CPU Pairing: Dual-socket Intel Xeon Scalable or AMD EPYC 7003/9004 class recommended
NUMA: Standard NUMA behavior
Required PCIe: PCIe Gen 4 x16 recommended
Motherboard: Full-length, double-width PCIe Gen 4 x16 slot required
Rack Power: Contact vendor for rack power planning
BIOS Limits: Above 4G decoding and Resizable BAR recommended; SR-IOV support not published
CXL Ready: No CXL memory expansion
OS Compat: Supported on major Linux distributions (RHEL, Ubuntu LTS); Windows Server supported

Benchmarks & Throughput

Structured Sparsity

Supported (up to 2x vs dense)

Transformer Throughput

High transformer throughput via third-generation Tensor Cores (no dedicated Transformer Engine, which was introduced with the Hopper generation)

Multi-GPU Scalability

Scaling Efficiency

Single GPU: The A100 80GB PCIe offers high efficiency for single-GPU tasks, leveraging its large memory capacity and high compute throughput.
2-GPU: Scaling between two GPUs is limited by PCIe Gen 4 bandwidth, roughly 32 GB/s per direction, unless an optional NVLink bridge links the pair.
4-GPU: Scaling across four GPUs is further constrained by PCIe lane contention, with diminishing returns as more GPUs share bandwidth.
8-GPU: Scaling to eight GPUs is significantly limited by PCIe bandwidth, as there is no NVLink/NVSwitch fabric to provide higher inter-GPU communication speeds.
64+ GPU: At large scales, InfiniBand or Ethernet overhead becomes a factor, with network latency and bandwidth affecting distributed training efficiency.

Scaling Characteristics

Cross-Node Latency: GPUDirect RDMA support helps reduce cross-node latency, but performance still depends on the network fabric, such as InfiniBand or high-speed Ethernet.
Network Bottlenecks: The primary bottleneck is the lack of an NVLink fabric, leading to reliance on PCIe bandwidth for inter-GPU communication.
Parallelism: Supports data, model, pipeline, and tensor parallelism, and is compatible with frameworks like DeepSpeed and Megatron-LM for distributed training, as sketched below.
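A minimal data-parallel sketch for PCIe-attached A100s, assuming a `torchrun --nproc_per_node=8 train.py` launch; NCCL uses PCIe within the node and GPUDirect RDMA across nodes when available:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets MASTER_ADDR/PORT and LOCAL_RANK for each worker.
    dist.init_process_group("nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(4096, 4096).cuda()
    ddp_model = DDP(model, device_ids=[rank])

    opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)
    x = torch.randn(64, 4096, device="cuda")
    loss = ddp_model(x).pow(2).mean()   # dummy loss; gradients all-reduce over PCIe
    loss.backward()
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```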

Workload Readiness

LLM Training

The A100 80GB PCIe, based on the Ampere architecture, is suitable for training large models up to roughly 70B parameters on a single node and can scale to 400B+ parameter models in multi-node setups thanks to its high VRAM and high-speed scale-out networking (InfiniBand or Ethernet).

LLM Inference

Highly efficient for inference with its large VRAM allowing for substantial KV cache, supporting high token-per-second throughput for large models.
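KV-cache headroom is the main reason the 80 GB card helps inference. A rough sizing sketch; the layer and head counts below describe a hypothetical 70B-class model in FP16, not a specific checkpoint:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes * seq_len * batch
def kv_cache_gb(layers=80, kv_heads=8, head_dim=128,
                seq_len=4096, batch=8, bytes_per_elem=2):
    total = 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len * batch
    return total / 1024**3

print(f"KV cache ≈ {kv_cache_gb():.1f} GB")  # leaves the rest of the 80 GB for weights
```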

Vision Training

Excellent for vision training tasks due to its large VRAM and Tensor Cores, enabling efficient processing of large batch sizes and complex models.

Diffusion Models

Well-suited for diffusion models, leveraging its Tensor Cores and large memory to handle the computational demands of these models efficiently.

Multimodal AI

Capable of handling multimodal AI workloads effectively, thanks to its ample VRAM and versatile architecture supporting diverse data types.

Reinforcement Learning

Effective for reinforcement learning tasks, providing the necessary computational power and memory bandwidth for complex simulations and model updates.

HPC / Simulation

Strong support for HPC simulations with robust FP64 performance, making it suitable for scientific and engineering applications requiring high precision.

Scientific Computing

Ideal for scientific computing tasks, offering excellent double precision performance and large memory capacity for data-intensive computations.

Edge Inference

Not optimal for edge inference due to its high power consumption and large form factor, better suited for data center environments.

Real-Time Serving

Capable of real-time AI serving with high throughput and low latency, supported by its powerful Tensor Cores and large memory capacity.

Fine-Tuning

Highly efficient for full fine-tuning tasks, leveraging its large VRAM to manage extensive model parameters and gradients.

LoRA Efficiency

Efficient for LoRA fine-tuning: adapter training updates only a small fraction of parameters, so the 80 GB of VRAM comfortably holds the frozen base model plus adapters while maintaining throughput.
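A hedged PEFT-style configuration sketch; the checkpoint name and target modules are illustrative placeholders, and the exact arguments depend on the installed `peft` and `transformers` versions:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical checkpoint name; swap in the model actually being tuned.
base = AutoModelForCausalLM.from_pretrained(
    "example-org/example-7b", torch_dtype=torch.bfloat16
).cuda()

config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # module names vary by architecture
)
model = get_peft_model(base, config)
model.print_trainable_parameters()          # typically <1% of base parameters
```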

Market Authority

MLPerf Ranking

The NVIDIA A100 80GB PCIe is officially listed in MLPerf Training and Inference results (v1.1, v2.0, v2.1, v3.0) as a tested system by NVIDIA and partners. Results are published for both single-node and multi-node configurations.

Cloud Adoption

NVIDIA and hyperscalers (AWS, Google Cloud, Microsoft Azure) publicly confirm availability of A100-based instances (e.g., AWS P4d, Azure ND A100 v4, Google Cloud A2), and the 80GB PCIe card is also offered by specialist GPU clouds.

Supercomputer Usage

A100 GPUs are deployed in top supercomputers such as Perlmutter (NERSC), Selene (NVIDIA), and Leonardo (CINECA), as confirmed by official system documentation and TOP500 listings.

Research Citations

Thousands of research papers on arXiv and IEEE Xplore explicitly reference the use of NVIDIA A100 80GB PCIe for deep learning and HPC workloads (search: 'A100 80GB PCIe').

Community Benchmarks

A100 80GB PCIe results are included in open community benchmarks such as MLPerf, DAWNBench, and Hugging Face leaderboards, with users posting reproducible results.

GitHub Support

Widespread support for A100 80GB PCIe in major deep learning frameworks (PyTorch, TensorFlow, JAX) and libraries (DeepSpeed, Megatron-LM, Hugging Face Transformers) with explicit optimization flags and documentation.

Enterprise Cases

NVIDIA and partners (e.g., Microsoft, Oracle, Dell) have published enterprise case studies highlighting A100 80GB PCIe deployments for AI training, inference, and HPC workloads.

Key Strengths

This GPU excels at AI training and inference, offering exceptional performance for deep learning frameworks like TensorFlow and PyTorch. Its large memory capacity and high bandwidth make it particularly effective for large-scale models and data-intensive tasks. The A100's support for multi-instance GPU (MIG) technology allows for efficient resource partitioning.

Limitations

While the A100 80GB PCIe offers excellent performance, it lacks the NVLink/NVSwitch fabric of the SXM variant (only an optional NVLink bridge can link pairs of cards), which can be a limitation for workloads requiring heavy inter-GPU communication. Its high power consumption necessitates adequate power delivery and cooling infrastructure, and availability can be constrained by high demand and production limits.

Expert Insight

The A100 represents a strategic leap in AI compute. When comparing cloud providers, consider not just the hourly rate, but also the interconnect bandwidth (InfiniBand/NVLink) and regional availability which can significantly impact total cost of ownership for large-scale training.
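To make that concrete, a toy cost comparison in Python; the scaling efficiencies are illustrative assumptions, not measured values:

```python
# A cheaper hourly rate can lose to a better interconnect once scaling
# efficiency is factored in.
def job_cost(ideal_gpu_hours: float, rate_per_gpu_hour: float, scaling_efficiency: float) -> float:
    return ideal_gpu_hours / scaling_efficiency * rate_per_gpu_hour

ideal_gpu_hours = 10_000
print("Provider A ($1.19/h, 70% eff):", f"${job_cost(ideal_gpu_hours, 1.19, 0.70):,.0f}")
print("Provider B ($1.39/h, 90% eff):", f"${job_cost(ideal_gpu_hours, 1.39, 0.90):,.0f}")
```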

Glossary Terms

FP32 TFLOPS
VRAM
TDP
Cores
Information updated daily. Cloud pricing subject to vendor availability.