NVIDIA · 2022-03-27

H100 NVL

The NVIDIA H100 NVL is optimized for large language model inference, offering up to 5x the performance of NVIDIA A100 systems on LLMs of up to 70 billion parameters. It pairs two PCIe cards over an NVLink bridge and carries 188 GB of HBM3 memory (94 GB per GPU) for enhanced performance and scalability.
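
As a quick sanity check on that 70B figure, the arithmetic below estimates weights-only memory at common precisions against the pair's 188 GB. This is an illustrative sketch, not an official sizing guide; KV cache and activations are deliberately excluded.

```python
# Weights-only memory estimate for a 70B-parameter model (illustrative
# assumption: parameters dominate; KV cache and activations excluded).
params = 70e9
bytes_per_param = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    weights_gb = params * nbytes / 1e9
    verdict = "fits" if weights_gb < 188 else "exceeds"
    print(f"{fmt}: {weights_gb:.0f} GB of weights ({verdict} the 188 GB pair)")
```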

H100 NVL
VRAM
94 GB (per GPU; 188 GB per NVL pair)
FP32 TFLOPS
67 TFLOPS
CUDA Cores
16,896 (Per GPU)
TDP
350-400 W (configurable, per GPU)

Provider Marketplace

Cheapest: starting from $0.00/hour
Best Value: starting from $0.00/hour
Enterprise Choice: starting from $3.07/hour

All Cloud Providers

2 options available

Crusoe Cloud (Cheapest): On-Demand, Global Availability
$0.00/hour (estimated cost)

RunPod: On-Demand, Global Availability
$3.07/hour (estimated cost)

Compute Performance

FP64: 34 TFLOPS
FP32: 67 TFLOPS
TF32: 133 TFLOPS (dense), 266 TFLOPS (sparse)
FP16: 133 TFLOPS (dense), 266 TFLOPS (sparse)
BF16: 133 TFLOPS (dense), 266 TFLOPS (sparse)
FP8: 266 TFLOPS (dense), 532 TFLOPS (sparse)
INT8: 266 TOPS (dense), 532 TOPS (sparse)
INT4: 532 TOPS (dense), 1064 TOPS (sparse)
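
To see how close a given system gets to these dense figures, a rough PyTorch matmul probe like the one below is often enough. It is a sketch, not a validated benchmark: measured TFLOPS depend on clocks, power limits, and matrix shape (the 8192 size is an arbitrary choice).

```python
import time
import torch

# Dense FP16 matmul throughput probe (a sketch, not a validated benchmark;
# results vary with clocks, power limits, and the arbitrary 8192 size).
n = 8192
a = torch.randn(n, n, device="cuda", dtype=torch.float16)
b = torch.randn(n, n, device="cuda", dtype=torch.float16)

torch.matmul(a, b)                 # warm-up
torch.cuda.synchronize()

iters = 50
t0 = time.perf_counter()
for _ in range(iters):
    torch.matmul(a, b)
torch.cuda.synchronize()
dt = time.perf_counter() - t0

flops = 2 * n**3 * iters           # 2*n^3 FLOPs per n x n matmul
print(f"Measured FP16 matmul: {flops / dt / 1e12:.1f} TFLOPS")
```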

Architecture

Microarchitecture: Hopper
Process Node: TSMC 4N
Die Size: 814 mm²
Transistors: 80B
Compute Units: 132 SMs
Tensor Cores: 528 (4th Gen)
RT Cores: None
Matrix Engine: Transformer Engine (FP8/FP16/BF16)
Base Clock: Not specified
Boost Clock: Not specified
Transformer Engine: Yes (Gen 1)
Sparse Acceleration: Supported (2:4 structured sparsity)
Dynamic Precision: Supported (FP8/FP16/BF16/TF32)

Memory & VRAM

Memory Type: HBM3
Total Capacity: 94 GB per GPU (188 GB per NVL pair)
Bandwidth: 4.8 TB/s
Bus Width: 6144-bit
HBM Stacks: 6
ECC Support: Yes (inline)
Unified Memory: Yes (CUDA Unified Memory)
Compression: Not specified
NUMA Awareness: Not specified
Memory Pooling: Yes (NVLink memory pooling)
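
Achievable bandwidth can be spot-checked with a large device-to-device copy, as in this hedged sketch (the ~4 GiB buffer size is an arbitrary assumption; a copy reads and writes each byte once, so effective traffic is twice the tensor size).

```python
import time
import torch

# Device-memory bandwidth probe (illustrative; the ~4 GiB buffer is an
# arbitrary choice). A copy reads and writes each byte once, so the
# effective traffic per iteration is twice the tensor size.
x = torch.empty(2 * 1024**3, dtype=torch.float16, device="cuda")  # 4 GiB
y = torch.empty_like(x)

y.copy_(x)                 # warm-up
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    y.copy_(x)
torch.cuda.synchronize()
dt = time.perf_counter() - t0

bytes_moved = 2 * x.numel() * x.element_size() * iters
print(f"Effective copy bandwidth: {bytes_moved / dt / 1e12:.2f} TB/s")
```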

Connectivity & Scaling

Interconnect: NVLink
Generation: NVLink 4
IB Bandwidth: 1.8 TB/s
PCIe Interface: PCIe Gen 5 x16
CXL Support: Not specified
Topology: NVLink domain with NVSwitch, fully connected mesh
Max GPUs/Node: 4
Scale-Out: Yes
GPUDirect RDMA: Yes
P2P Memory: Yes
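
A quick way to confirm P2P memory access over the NVLink bridge is PyTorch's peer-access query, sketched below (it assumes the two NVL GPUs are exposed as CUDA devices 0 and 1).

```python
import torch

# Peer-access sanity check across the NVLink-bridged pair (assumes the
# two NVL GPUs are visible as CUDA devices 0 and 1).
count = torch.cuda.device_count()
print(f"{count} visible GPUs")
for i in range(count):
    for j in range(count):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: P2P {'enabled' if ok else 'unavailable'}")
```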

Virtualization

MIG Support: Supported
MIG Partitions: 7 instances (max)
SR-IOV: Not Supported
vGPU Readiness: Supported (NVIDIA vGPU)
K8s Readiness: Certified (NVIDIA GPU Operator)
GPU Sharing: MIG, Time-Slicing, MPS, vGPU
Virt Efficiency: Near bare-metal (vendor claim)
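
MIG state can be inspected programmatically through NVML, as in this read-only sketch using the nvidia-ml-py bindings (pip install nvidia-ml-py). Enabling MIG itself requires administrative tooling such as nvidia-smi and is not shown here.

```python
import pynvml  # pip install nvidia-ml-py

# Read-only MIG inspection sketch; nvmlDeviceGetMigMode raises an
# NVMLError on GPUs that do not support MIG.
pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    current, pending = pynvml.nvmlDeviceGetMigMode(handle)
    print(f"GPU {i} ({name}): MIG current={current}, pending={pending}")
pynvml.nvmlShutdown()
```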

Power & Efficiency

TDP: 700 W (per NVL pair; 350-400 W configurable per GPU)
Peak Power: 700-750 W
Idle Power: 70-100 W
Perf / Watt: Up to 26 TFLOPS FP16/W (theoretical, workload-dependent)
PSU Required: N/A
Connectors: 2x PCIe 8-pin per GPU
Thermal Limits: Up to 85°C GPU temperature; requires high airflow or liquid cooling in dense deployments
Efficiency: N/A
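
The configurable power cap can be monitored at runtime via NVML. The sketch below reads the live draw and the enforced limit for device 0; both values are reported by the driver in milliwatts.

```python
import pynvml  # pip install nvidia-ml-py

# Live power monitoring sketch for device 0; NVML reports milliwatts.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
draw_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000
print(f"Power draw: {draw_w:.0f} W (enforced limit: {limit_w:.0f} W)")
pynvml.nvmlShutdown()
```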

Physical Design

Form Factor: Dual-slot PCIe card; two cards joined by NVLink bridges in the H100 NVL configuration
FHFL: Yes (full height, full length)
Slot Width: Dual slot
Dimensions: 267 mm x 112 mm x 41 mm (per card)
Weight: 1.8-2.2 kg (per card)
Cooling: Passive (requires external server/board cooling)
Rack Density: Optimized for high-density GPU servers (NVLink bridges, multi-GPU chassis)

Thermals & Cooling

Airflow: Passive heatsink; requires high chassis airflow
Temp Range: 0°C to 45°C (ambient)
Throttling: Thermal-based clock reduction at the Tjunction limit
Noise Level: Not applicable (passive module)
Liquid Cooling: Direct-to-chip liquid cooling available for dense deployments
DC Heat: High (rack-scale deployment recommended)

Software Ecosystem

CUDA: CUDA 12.x supported
ROCm: Not supported
oneAPI: Not supported
PyTorch: Officially supported
TensorFlow: Officially supported
JAX: Supported via CUDA backend
HuggingFace: Optimized (CUDA kernels available)
Triton Server: Supported
Docker: Official container images available
Compiler Stack: Mature CUDA compiler stack
Kernel Optim: Standard driver-based support
Driver Stability: Enterprise-grade stability
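
For FP8 on the Transformer Engine, NVIDIA's transformer-engine package wraps the precision handling. The sketch below shows the basic fp8_autocast pattern (layer sizes are arbitrary placeholders; the package and a Hopper GPU are assumed to be present).

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# FP8 Transformer Engine sketch; sizes are arbitrary placeholders and the
# transformer-engine package plus a Hopper GPU are assumed.
fp8_recipe = recipe.DelayedScaling()             # default delayed-scaling recipe
layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.float16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
print(y.shape)   # torch.Size([16, 4096])
```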

Server & Deployment

OEM Availability: Dell, HPE, Supermicro (Tier-1 OEMs)
Preconfigured: 4U 8-GPU systems
DGX/HGX: Core of HGX baseboards
Rack-Scale: NVLink Switch System, InfiniBand scale-out
Edge Deploy: Not suitable for edge deployment due to high TDP
Ref Architectures: NVIDIA MGX, OVX, SuperPOD

System Compatibility

CPU Pairing: Integrated with platform CPU (HGX/DGX architecture)
NUMA: Standard NUMA behavior
Required PCIe: PCIe Gen 5 x16
Motherboard: Platform-specific (HGX/NVL baseboard)
Rack Power: Contact vendor for rack power planning
BIOS Limits: Not specified
CXL Ready: Not supported
OS Compat: RHEL and Ubuntu LTS supported; Windows Server supported

Benchmarks & Throughput

Structured Sparsity

Supported (up to 2x vs dense)
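
PyTorch exposes 2:4 semi-structured sparsity as a prototype API in recent releases; the sketch below builds a valid 2:4 pattern by hand and runs a sparse matmul. Exact API surface and supported shapes vary by PyTorch version, so treat this as an assumption-laden illustration.

```python
import torch
from torch.sparse import to_sparse_semi_structured

# Prototype 2:4 semi-structured sparsity sketch (PyTorch 2.1+; API and
# supported shapes vary by version). Keep 2 of every 4 weights per row.
keep = torch.tensor([1, 1, 0, 0], dtype=torch.bool, device="cuda")
w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
w_24 = w * keep.repeat(4096, 1024)           # valid 2:4 pattern, (4096, 4096)

w_sparse = to_sparse_semi_structured(w_24)   # compressed representation
x = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
y = torch.mm(w_sparse, x)                    # runs on sparse tensor cores
print(y.shape)
```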

Transformer Throughput

Supported (Transformer Engine)

Multi-GPU Scalability

Scaling Efficiency

Single GPU: High standalone efficiency, with substantial compute from the Hopper architecture.
2-GPU: With the NVLink bridge, two GPUs scale efficiently, minimizing latency and maximizing bandwidth.
4-GPU: Scaling to four GPUs remains efficient with NVLink bridges, though traffic that falls back to PCIe is bandwidth-limited.
8-GPU: Near-linear scaling is achievable with NVLink bridges, but PCIe-only configurations may face bandwidth contention.
64+ GPU: InfiniBand or RoCE v2 is necessary to manage network overhead at this scale; inter-node communication is the main potential bottleneck.

Scaling Characteristics

Cross-Node Latency: GPUDirect RDMA support reduces cross-node latency, which is essential for distributed performance.
Network Bottlenecks: PCIe bandwidth limits and VRAM pressure are the most likely bottlenecks in data-intensive workloads.
Parallelism: Supports data, model, pipeline, and tensor parallelism; compatible with frameworks such as DeepSpeed and Megatron for distributed training. A minimal bandwidth probe is sketched below.
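
The probe below measures all-reduce bus bandwidth with torch.distributed over NCCL, which is one way to verify the NVLink bridge is actually being used. It assumes a two-GPU launch via torchrun (the script name is a placeholder).

```python
import time
import torch
import torch.distributed as dist

# All-reduce bandwidth probe (a sketch). Launch on the NVL pair with:
#   torchrun --nproc_per_node=2 allreduce_probe.py   # placeholder filename
dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

x = torch.randn(64 * 1024**2, device="cuda")   # 64M floats, 256 MiB
for _ in range(5):                             # warm-up
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
dt = time.perf_counter() - t0

# Ring all-reduce moves ~2*(n-1)/n of the buffer per rank per iteration.
n = dist.get_world_size()
gb = x.numel() * 4 * 2 * (n - 1) / n * iters / 1e9
if rank == 0:
    print(f"All-reduce bus bandwidth: {gb / dt:.1f} GB/s")
dist.destroy_process_group()
```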

Workload Readiness

LLM Training

Based on the Hopper architecture, the H100 NVL is well suited to multi-node training of large language models (400B+ parameters), thanks to its high VRAM capacity and interconnect bandwidth.

LLM Inference

The H100 NVL excels in LLM inference with high token-per-second throughput and ample KV cache headroom, making it ideal for large-scale deployments.
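
To make the KV-cache headroom claim concrete, the sketch below sizes the cache for assumed Llama-2-70B-like shapes (80 layers, 8 KV heads of dimension 128 via GQA, FP16 cache); these shapes are illustrative assumptions, not vendor figures.

```python
# KV-cache sizing sketch (assumed Llama-2-70B-like shapes; GQA).
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2     # FP16 cache
per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V, bytes
print(f"KV cache per token: {per_token / 1e6:.2f} MB")

weights_gb = 70e9 * 2 / 1e9            # FP16 weights for a 70B model
free_gb = 188 - weights_gb             # headroom on the 188 GB pair
tokens = free_gb * 1e9 / per_token
print(f"~{tokens:,.0f} cached tokens fit in the remaining {free_gb:.0f} GB")
```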

Vision Training

With its advanced Tensor cores and substantial VRAM, the H100 NVL is highly efficient for training large vision models, supporting complex architectures and large batch sizes.

Diffusion Models

The H100 NVL is well-suited for diffusion models, offering high computational throughput and memory bandwidth necessary for training and inference of complex generative models.

Multimodal AI

The H100 NVL's architecture supports multimodal AI tasks efficiently, providing the necessary compute power and memory bandwidth for handling diverse data types simultaneously.

Reinforcement Learning

The H100 NVL is highly capable for reinforcement learning workloads, offering fast computation and high memory capacity to handle complex environments and large state spaces.

HPC / Simulation

The H100 NVL provides strong support for HPC simulations with its robust FP64 performance, making it suitable for scientific and engineering simulations requiring high precision.

Scientific Computing

With excellent double precision capabilities, the H100 NVL is ideal for scientific computing tasks that demand high accuracy and computational power.

Edge Inference

The H100 NVL is not optimized for edge inference due to its high power consumption and large form factor, making it more suitable for data center environments.

Real-Time Serving

The H100 NVL is highly efficient for real-time AI serving, offering low latency and high throughput for demanding applications.

Fine-Tuning

The H100 NVL is highly efficient for full fine-tuning tasks, leveraging its large VRAM and advanced architecture to handle extensive model updates.

LoRA Efficiency

The H100 NVL is also efficient for LoRA fine-tuning, providing sufficient memory and compute resources to support parameter-efficient training methods.
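
A typical parameter-efficient setup on this card uses Hugging Face PEFT, as in this hedged sketch (the model ID and target module names are placeholders that follow common Llama conventions).

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# LoRA sketch with Hugging Face PEFT (model ID and target modules are
# placeholder assumptions following common Llama naming).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", torch_dtype="auto", device_map="auto")

config = LoraConfig(r=16, lora_alpha=32,
                    target_modules=["q_proj", "v_proj"],
                    lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()   # typically well under 1% trainable
```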

Market Authority

Cloud Adoption

NVIDIA has publicly confirmed H100 NVL adoption by Microsoft Azure and Oracle Cloud Infrastructure.

Research Citations

Limited; as of June 2024, few peer-reviewed papers explicitly cite H100 NVL due to its recent release.

GitHub Support

Some emerging support; select repositories (e.g., NVIDIA/DeepLearningExamples) mention H100 NVL compatibility, but widespread optimization is not yet prevalent.

Key Strengths

The H100 NVL excels at large-scale AI training and inference tasks, particularly in natural language processing and deep learning models. Its architecture is optimized for transformer models, offering significant performance improvements over previous generations. The GPU's high memory bandwidth and advanced tensor cores make it ideal for demanding computational workloads.

Limitations

The H100 NVL's high power requirements and need for advanced cooling solutions can be a limitation for some deployments. Additionally, its premium pricing and availability constraints may pose challenges for smaller organizations. Users should also consider the infrastructure investment needed to fully leverage its capabilities.

Expert Insight

The H100 represents a strategic leap in AI compute. When comparing cloud providers, consider not just the hourly rate but also the interconnect bandwidth (InfiniBand/NVLink) and regional availability, which can significantly affect total cost of ownership for large-scale training.
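
That point about interconnect versus hourly rate can be made concrete with toy numbers: if weaker interconnect drops scaling efficiency, the cheaper provider can cost more for the same job. All figures below are illustrative assumptions, not quotes.

```python
# Toy TCO comparison: identical job, different scaling efficiency.
# All numbers are illustrative assumptions, not real quotes.
providers = {
    "A": {"rate_usd": 3.07, "scaling_eff": 0.92},  # strong interconnect
    "B": {"rate_usd": 2.50, "scaling_eff": 0.70},  # weaker interconnect
}
ideal_gpu_hours = 10_000   # GPU-hours the job would take at perfect scaling

for name, p in providers.items():
    billed_hours = ideal_gpu_hours / p["scaling_eff"]
    cost = billed_hours * p["rate_usd"]
    print(f"Provider {name}: {billed_hours:,.0f} GPU-hours -> ${cost:,.0f}")
```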

Glossary Terms

FP32 TFLOPS
VRAM
TDP
Cores
Information updated daily. Cloud pricing subject to vendor availability.