NVIDIA · 2022-03-27

H100

SXM

The NVIDIA H100 SXM variant features exceptional performance and scalability for a wide range of workloads. It includes fourth-generation Tensor Cores and a Transformer Engine with FP8 precision, providing up to 4X faster training over the prior generation for large language models.
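
The FP8 path is typically driven from PyTorch through NVIDIA's Transformer Engine library. The snippet below is a minimal sketch of that flow, assuming the transformer-engine package is installed; the layer sizes and recipe settings are illustrative, and exact recipe arguments can vary between library versions.

```python
# Minimal sketch: running a linear layer in FP8 via NVIDIA Transformer Engine.
# Assumes an FP8-capable GPU (e.g. H100) and the `transformer-engine` package.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID = E4M3 for the forward pass, E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)   # the matmul executes in FP8 on the Tensor Cores

print(y.shape)
```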

H100 SXM
VRAM: 80 GB
FP32: 67 TFLOPS
CUDA Cores: 16,896
TDP: Up to 700 W (configurable)

Provider Marketplace

Cheapest: from $1.90/hour
Best Value: from $1.90/hour
Enterprise Choice: from $2.94/hour

All Cloud Providers

3 options available

Hyperstack (Cheapest) · On-Demand · Global Availability · est. $1.90/hour
On-Demand · Global Availability · est. $1.90/hour
On-Demand · Global Availability · est. $2.94/hour

Compute Performance

FP64: 34 TFLOPS
FP32: 67 TFLOPS
TF32: 67 TFLOPS (Dense), 133 TFLOPS (Sparse)
FP16: 133 TFLOPS (Dense), 197 TFLOPS (Sparse)
BF16: 133 TFLOPS (Dense), 197 TFLOPS (Sparse)
FP8: 263 TFLOPS (Dense), 395 TFLOPS (Sparse)
INT8: 263 TOPS (Dense), 527 TOPS (Sparse)
INT4: 527 TOPS (Dense), 1050 TOPS (Sparse)
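
To relate these peak figures to what a workload actually achieves, a simple timing loop over a dense BF16 matmul gives an effective TFLOPS number. This is a rough, hedged sketch using standard PyTorch only; the matrix size and iteration count are arbitrary choices.

```python
# Rough effective-TFLOPS measurement for a dense BF16 matmul (illustrative only).
import torch

n = 8192
a = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
b = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)

# Warm up so kernels are compiled/cached before timing.
for _ in range(3):
    torch.matmul(a, b)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 20

start.record()
for _ in range(iters):
    torch.matmul(a, b)
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1e3 / iters   # elapsed_time returns milliseconds
tflops = 2 * n**3 / seconds / 1e12                # 2*N^3 FLOPs per N x N matmul
print(f"effective dense BF16 throughput: {tflops:.1f} TFLOPS")
```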

Architecture

Microarchitecture: Hopper
Process Node: TSMC 4N
Die Size: 814 mm²
Transistors: 80B
Compute Units: 132 SMs
Tensor Cores: 4th Gen, 528 Tensor Cores
RT Cores: None
Matrix Engine: Transformer Engine (FP8/FP16/BF16)
Base Clock: 1400 MHz
Boost Clock: 1770 MHz
Transformer Engine: Yes (Gen 1)
Sparse Acceleration: Supported (2:4 structured sparsity)
Dynamic Precision: Supported (FP8/FP16/BF16/TF32)

Memory & VRAM

Memory Type: HBM3
Total Capacity: 80 GB
Bandwidth: 3.0 TB/s
Bus Width: 5120-bit
HBM Stacks: 5
ECC Support: Yes (Inline)
Unified Memory: Yes (CUDA Unified Memory)
Compression: Not Published
NUMA Awareness: Not Published
Memory Pooling: NVLink memory pooling supported via NVLink Switch System
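
The capacity and SM count listed above can be confirmed from inside a running job with standard PyTorch device queries; the sketch below uses only public torch.cuda calls.

```python
# Query visible memory and SM count on device 0 (standard PyTorch calls).
import torch

props = torch.cuda.get_device_properties(0)
free, total = torch.cuda.mem_get_info(0)   # bytes free / total on device 0

print(props.name)                                        # e.g. an H100 80GB HBM3 part
print(f"total memory: {props.total_memory / 2**30:.1f} GiB")
print(f"free memory : {free / 2**30:.1f} GiB")
print(f"SM count    : {props.multi_processor_count}")    # 132 on H100 SXM
```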

Connectivity & Scaling

Interconnect: NVLink
Generation: NVLink 4
NVLink Bandwidth: 900 GB/s
PCIe Interface: PCIe Gen 5 x16
CXL Support: Not Supported
Topology: Fully connected NVLink mesh (via HGX baseboard)
Max GPUs/Node: 8
Scale-Out: Yes (InfiniBand NDR/RoCE v2 via NICs)
GPUDirect RDMA: Yes
P2P Memory: Yes (see the check below)
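
On an HGX baseboard every GPU pair is reachable over NVLink, so peer-to-peer copies avoid staging through host memory. The sketch below is a minimal check using plain PyTorch, assuming at least two visible GPUs.

```python
# Check P2P reachability between GPU 0 and GPU 1 and perform a direct copy.
import torch

assert torch.cuda.device_count() >= 2, "needs at least two visible GPUs"

if torch.cuda.can_device_access_peer(0, 1):
    print("GPU 0 can access GPU 1 directly (P2P path; NVLink on HGX systems)")

x = torch.randn(1024, 1024, device="cuda:0")
y = x.to("cuda:1", non_blocking=True)   # device-to-device copy, no host round-trip
torch.cuda.synchronize()
print(y.device)
```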

Virtualization

MIG Support: Supported (see the query sketch below)
MIG Partitions: 7 instances (max)
SR-IOV: Not Supported
vGPU Readiness: Supported (NVIDIA vGPU)
K8s Readiness: Certified (NVIDIA GPU Operator)
GPU Sharing: MIG, Time-Slicing, MPS, vGPU
Virt Efficiency: Near bare-metal (vendor claim)
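
A scheduler or launcher can confirm whether MIG is enabled before placing work. The sketch below queries the MIG mode through NVML; it assumes the nvidia-ml-py (pynvml) bindings are installed on the host.

```python
# Check MIG mode on GPU 0 via NVML (requires the nvidia-ml-py package).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

current_mode, pending_mode = pynvml.nvmlDeviceGetMigMode(handle)
print("MIG enabled:", current_mode == pynvml.NVML_DEVICE_MIG_ENABLE)

pynvml.nvmlShutdown()
```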

Power & Efficiency

TDP: 700 W
Peak Power: 700-800 W
Idle Power: 70-100 W
Perf / Watt: Up to 26 TFLOPS/W (FP8, theoretical peak)
PSU Required: N/A
Connectors: Direct busbar (SXM socket, no external connectors)
Thermal Limits: Max GPU temperature 85°C (throttling above 85°C)
Efficiency: N/A

Physical Design

Form Factor: SXM5 module
FHFL: N/A
Slot Width: N/A
Dimensions: 112 mm x 157 mm
Weight: 1.8–2.2 kg
Cooling: Passive heatsink with chassis airflow; direct liquid cooling (cold plate) available in liquid-cooled HGX systems
Rack Density: Optimized for high-density GPU servers (HGX H100 4/8-GPU baseboards)

Thermals & Cooling

Airflow: Server chassis airflow required (CFM not published)
Temp Range: 0°C to 45°C
Throttling: Thermal-based clock reduction at Tjunction limit
Noise Level: Not applicable (passive module)
Liquid Cooling: Optional (direct-to-chip cold plates in liquid-cooled HGX configurations)
DC Heat: High (rack-scale deployment recommended)

Software Ecosystem

CUDA: CUDA 12.x supported
ROCm: Not Supported
oneAPI: Not Supported
PyTorch: Officially supported
TensorFlow: Officially supported
JAX: Supported via CUDA backend
HuggingFace: Optimized (CUDA kernels available)
Triton Server: Supported
Docker: Official container images available
Compiler Stack: Mature CUDA compiler stack
Kernel Optim: Standard driver-based support
Driver Stability: Enterprise-grade stability

Server & Deployment

OEM Availability: Tier-1 OEMs (Dell, HPE, Supermicro)
Preconfigured: 4U 8-GPU systems, 2U 4-GPU systems
DGX/HGX: Core of HGX baseboards, available in DGX systems
Rack-Scale: NVLink Switch System, InfiniBand scale-out
Edge Deploy: Not suitable for edge deployment due to high TDP
Ref Architectures: NVIDIA MGX, SuperPOD

System Compatibility

CPU Pairing: Integrated with platform CPU (HGX/DGX architecture)
NUMA: NUMA locality impacts GPU-to-CPU bandwidth; optimal performance with a balanced memory configuration
Required PCIe: Not Applicable (SXM/OAM)
Motherboard: Platform-specific (HGX baseboard with SXM socket required)
Rack Power: Contact vendor for rack power planning
BIOS Limits: Not Published
CXL Ready: Not Supported
OS Compat: Supported on major Linux distributions (RHEL, Ubuntu LTS); Windows Server supported

Benchmarks & Throughput

Structured Sparsity

Supported (up to 2x vs dense)
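
The 2:4 pattern keeps the two largest-magnitude values in every contiguous group of four weights, which is what the sparse Tensor Core path accelerates. The sketch below is purely illustrative (plain PyTorch, not the library pruning workflow) and only shows how such a mask is formed.

```python
# Illustrative 2:4 structured-sparsity mask: keep the 2 largest of every 4 weights.
import torch

w = torch.randn(4096, 4096)                    # dense weight matrix
groups = w.abs().reshape(-1, 4)                # contiguous groups of 4 values
top2 = groups.topk(k=2, dim=1).indices         # indices of the 2 largest magnitudes
mask = torch.zeros_like(groups).scatter_(1, top2, 1.0).reshape_as(w)

w_sparse = w * mask                            # exactly 50% of the weights remain
print(f"kept fraction: {mask.mean().item():.2f}")   # -> 0.50
```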

Transformer Throughput

Supported (Transformer Engine)

Multi-GPU Scalability

Scaling Efficiency

Single GPU: Baseline performance; all compute and HBM3 bandwidth are local to the device, with no interconnect overhead.
2-GPU: Near-linear scaling thanks to NVLink's high bandwidth, which allows efficient data transfer between the two GPUs.
4-GPU: Continues to scale near-linearly, as the NVSwitch architecture manages communication between multiple GPUs effectively.
8-GPU: Maintains near-linear scaling up to 8 GPUs, leveraging NVSwitch to handle inter-GPU communication without significant bottlenecks.
64+ GPU: Scalability is affected by InfiniBand/Ethernet overhead, but multi-rail networking and GPUDirect RDMA help mitigate latency.

Scaling Characteristics

Cross-Node Latency: Minimized through GPUDirect RDMA support, which enables efficient data transfer across nodes in distributed training.
Network Bottlenecks: Within a node, NVLink and NVSwitch remove the host-to-device bridge as a limitation; at multi-node scale, the network fabric and VRAM pressure from very large models become the main constraints.
Parallelism: Supports data, model, pipeline, and tensor parallelism, enabling efficient distributed training with frameworks such as DeepSpeed and Megatron (see the sketch below).
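
As a concrete starting point, the skeleton below shows a minimal data-parallel job over the NCCL backend, launched with torchrun. It is a hedged sketch: the model, batch size, and optimizer are placeholders, and real training would use one of the frameworks named above.

```python
# Minimal DDP skeleton for multi-GPU training over NCCL.
# Launch example: torchrun --nproc_per_node=8 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")        # NCCL rides NVLink/NVSwitch inside a node
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(4096, 4096).cuda(), device_ids=[local_rank])
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(10):
    x = torch.randn(32, 4096, device="cuda")
    loss = model(x).square().mean()            # dummy loss; DDP all-reduces the gradients
    loss.backward()
    opt.step()
    opt.zero_grad()

dist.destroy_process_group()
```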

Workload Readiness

LLM Training

The H100 SXM, based on the Hopper architecture, is highly suitable for training large language models, including 70B and 400B+ models, especially in multi-node configurations due to its high VRAM and advanced interconnect capabilities.

LLM Inference

The H100 SXM excels in LLM inference with its 4th-gen Tensor cores, providing high token-per-second throughput and ample KV cache headroom for efficient inference of large models.
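
A back-of-envelope sizing makes the KV-cache headroom concrete. The numbers below are illustrative assumptions for a hypothetical 8B-class decoder with grouped-query attention and a BF16 cache, not measured figures for any specific model.

```python
# Back-of-envelope KV-cache headroom on one 80 GB H100 (illustrative assumptions).
layers, kv_heads, head_dim = 32, 8, 128      # hypothetical 8B-class decoder (GQA)
bytes_per_elt = 2                            # BF16 keys and values

weights_gb = 8e9 * 2 / 1e9                   # ~16 GB of BF16 weights
free_gb = 80 - weights_gb - 4                # reserve ~4 GB for activations/overhead

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elt   # K and V
token_budget = free_gb * 1e9 / kv_bytes_per_token

print(f"KV cache per token  : {kv_bytes_per_token / 1024:.0f} KiB")
print(f"approx. token budget: {token_budget / 1e6:.2f} M tokens across all sequences")
```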

Vision Training

With its advanced Tensor cores and high memory bandwidth, the H100 SXM is highly efficient for training large-scale vision models, offering significant improvements over previous architectures.

Diffusion Models

The H100 SXM is well-suited for diffusion models, benefiting from its high computational throughput and memory capacity, enabling efficient training and inference of complex generative models.

Multimodal AI

The H100 SXM's architecture supports multimodal AI workloads effectively, leveraging its high compute power and memory to handle diverse data types and complex model architectures.

Reinforcement Learning

The H100 SXM provides excellent performance for reinforcement learning tasks, with its high throughput and efficient parallel processing capabilities, enabling rapid training of complex agents.

HPC / Simulation

The H100 SXM offers strong FP64 performance, making it suitable for HPC simulations that require double precision, although it is optimized more for AI workloads.

Scientific Computing

The H100 SXM is capable of handling scientific computing tasks, especially those that can leverage its Tensor cores and high memory bandwidth for accelerated computations.

Edge Inference

The H100 SXM is not ideal for edge inference due to its high power consumption and large form factor, making it more suitable for data center deployments.

Real-Time Serving

The H100 SXM is highly efficient for real-time AI serving, providing low latency and high throughput for demanding applications, thanks to its advanced architecture and Tensor cores.

Fine-Tuning

The H100 SXM is highly efficient for full fine-tuning tasks, leveraging its large VRAM and compute capabilities to handle extensive model updates.

LoRA Efficiency

The H100 SXM supports efficient LoRA fine-tuning, benefiting from its advanced architecture to perform low-rank adaptations with lower VRAM requirements.
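
A typical LoRA setup on this class of hardware looks like the sketch below, which assumes the Hugging Face transformers and peft packages; the checkpoint name and target modules are placeholders that depend on the model being adapted.

```python
# Minimal LoRA fine-tuning setup (assumes the `transformers` and `peft` packages).
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # placeholder checkpoint
    torch_dtype=torch.bfloat16,
).to("cuda")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],     # attention projections; model-specific
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()           # typically well under 1% of weights train
```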

Market Authority

MLPerf Ranking

NVIDIA H100 SXM is officially reported in MLPerf Training v3.1 and Inference v3.1 results, consistently ranking at or near the top across multiple benchmarks.

Cloud Adoption

Publicly confirmed by NVIDIA, H100 SXM is adopted by AWS (Amazon EC2 P5 instances), Google Cloud (A3 supercomputers), Microsoft Azure (ND H100 v5 VMs), and Oracle Cloud.

Supercomputer Usage

H100 SXM is deployed in top supercomputers such as NVIDIA's Eos and Microsoft Azure's Eagle, both of which rank highly on the TOP500 list.

Research Citations

H100 SXM is cited in numerous 2023-2024 research papers, particularly in large language model training and high-performance computing, as indexed by arXiv and IEEE Xplore.

Community Benchmarks

H100 SXM results are widely shared in open MLPerf submissions and community-led benchmarks, including Hugging Face and MLCommons forums.

GitHub Support

Extensive support for H100 SXM optimizations is present in major repositories such as PyTorch, TensorFlow, DeepSpeed, and NVIDIA's CUDA samples.

Enterprise Cases

NVIDIA has published case studies highlighting H100 SXM deployments at organizations like ServiceNow, OpenAI, and various healthcare and automotive enterprises.

Key Strengths

The H100 SXM excels in AI and machine learning workloads, particularly in training large neural networks and performing inference at scale. It offers significant performance improvements over its predecessors due to its advanced architecture and increased memory bandwidth. The H100 is also well-suited for high-performance computing (HPC) applications, providing exceptional computational power and efficiency.

Limitations

One limitation of the H100 SXM is its high power consumption, which may not be suitable for all datacenter environments. Additionally, its reliance on specific server platforms and cooling solutions can limit deployment flexibility. Availability can be constrained due to high demand and production capacities, potentially leading to longer lead times for procurement.

Expert Insight

The H100 represents a strategic leap in AI compute. When comparing cloud providers, consider not just the hourly rate, but also the interconnect bandwidth (InfiniBand/NVLink) and regional availability which can significantly impact total cost of ownership for large-scale training.

Glossary Terms

FP32 TFLOPS
VRAM
TDP
Cores
Information updated daily. Cloud pricing subject to vendor availability.