NVIDIA · 2022-03-27

H100 NVL

The NVIDIA H100 NVL is optimized for large language model inference, offering up to 5x the performance of NVIDIA A100 systems on LLMs of up to 70 billion parameters. It pairs two PCIe cards over an NVLink bridge and carries 188 GB of HBM3 memory (94 GB per GPU) for enhanced performance and scalability.
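
As a quick sanity check on that 70B figure, the arithmetic below estimates weights-only memory at common precisions against the pair's 188 GB. This is an illustrative sketch, not an official sizing guide; KV cache and activations are deliberately excluded.

```python
# Weights-only memory estimate for a 70B-parameter model (illustrative
# assumption: parameters dominate; KV cache and activations excluded).
params = 70e9
bytes_per_param = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    weights_gb = params * nbytes / 1e9
    verdict = "fits" if weights_gb < 188 else "exceeds"
    print(f"{fmt}: {weights_gb:.0f} GB of weights ({verdict} the 188 GB pair)")
```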

H100 NVL
VRAM
94 GB (per GPU; 188 GB per NVL pair)
FP32 TFLOPS
67 TFLOPS
CUDA Cores
16,896 (Per GPU)
TDP
350-400 W (configurable, per GPU)

Provider Marketplace

Cheapest: starting from $0.00/hour
Best Value: starting from $0.00/hour
Enterprise Choice: starting from $3.07/hour

All Cloud Providers

2 options available

Crusoe Cloud (Cheapest): On-Demand, Global Availability
$0.00/hour (estimated cost)

RunPod: On-Demand, Global Availability
$3.07/hour (estimated cost)

Compute Performance

FP64: 34 TFLOPS
FP32: 67 TFLOPS
TF32: 133 TFLOPS (dense), 266 TFLOPS (sparse)
FP16: 133 TFLOPS (dense), 266 TFLOPS (sparse)
BF16: 133 TFLOPS (dense), 266 TFLOPS (sparse)
FP8: 266 TFLOPS (dense), 532 TFLOPS (sparse)
INT8: 266 TOPS (dense), 532 TOPS (sparse)
INT4: 532 TOPS (dense), 1064 TOPS (sparse)
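
To see how close a given system gets to these dense figures, a rough PyTorch matmul probe like the one below is often enough. It is a sketch, not a validated benchmark: measured TFLOPS depend on clocks, power limits, and matrix shape (the 8192 size is an arbitrary choice).

```python
import time
import torch

# Dense FP16 matmul throughput probe (a sketch, not a validated benchmark;
# results vary with clocks, power limits, and the arbitrary 8192 size).
n = 8192
a = torch.randn(n, n, device="cuda", dtype=torch.float16)
b = torch.randn(n, n, device="cuda", dtype=torch.float16)

torch.matmul(a, b)                 # warm-up
torch.cuda.synchronize()

iters = 50
t0 = time.perf_counter()
for _ in range(iters):
    torch.matmul(a, b)
torch.cuda.synchronize()
dt = time.perf_counter() - t0

flops = 2 * n**3 * iters           # 2*n^3 FLOPs per n x n matmul
print(f"Measured FP16 matmul: {flops / dt / 1e12:.1f} TFLOPS")
```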

Architecture

Microarchitecture: Hopper
Process Node: TSMC 4N
Die Size: 814 mm²
Transistors: 80B
Compute Units: 132 SMs
Tensor Cores: 528 (4th Gen)
RT Cores: None
Matrix Engine: Transformer Engine (FP8/FP16/BF16)
Base Clock: Not specified
Boost Clock: Not specified
Transformer Engine: Yes (Gen 1)
Sparse Acceleration: Supported (2:4 structured sparsity)
Dynamic Precision: Supported (FP8/FP16/BF16/TF32)

Memory & VRAM

Memory Type: HBM3
Total Capacity: 94 GB per GPU (188 GB per NVL pair)
Bandwidth: 4.8 TB/s
Bus Width: 6144-bit
HBM Stacks: 6
ECC Support: Yes (inline)
Unified Memory: Yes (CUDA Unified Memory)
Compression: Not specified
NUMA Awareness: Not specified
Memory Pooling: Yes (NVLink memory pooling)
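
Achievable bandwidth can be spot-checked with a large device-to-device copy, as in this hedged sketch (the ~4 GiB buffer size is an arbitrary assumption; a copy reads and writes each byte once, so effective traffic is twice the tensor size).

```python
import time
import torch

# Device-memory bandwidth probe (illustrative; the ~4 GiB buffer is an
# arbitrary choice). A copy reads and writes each byte once, so the
# effective traffic per iteration is twice the tensor size.
x = torch.empty(2 * 1024**3, dtype=torch.float16, device="cuda")  # 4 GiB
y = torch.empty_like(x)

y.copy_(x)                 # warm-up
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    y.copy_(x)
torch.cuda.synchronize()
dt = time.perf_counter() - t0

bytes_moved = 2 * x.numel() * x.element_size() * iters
print(f"Effective copy bandwidth: {bytes_moved / dt / 1e12:.2f} TB/s")
```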

Connectivity & Scaling

Interconnect: NVLink
Generation: NVLink 4
IB Bandwidth: 1.8 TB/s
PCIe Interface: PCIe Gen 5 x16
CXL Support: Not specified
Topology: NVLink domain with NVSwitch, fully connected mesh
Max GPUs/Node: 4
Scale-Out: Yes
GPUDirect RDMA: Yes
P2P Memory: Yes
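
A quick way to confirm P2P memory access over the NVLink bridge is PyTorch's peer-access query, sketched below (it assumes the two NVL GPUs are exposed as CUDA devices 0 and 1).

```python
import torch

# Peer-access sanity check across the NVLink-bridged pair (assumes the
# two NVL GPUs are visible as CUDA devices 0 and 1).
count = torch.cuda.device_count()
print(f"{count} visible GPUs")
for i in range(count):
    for j in range(count):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: P2P {'enabled' if ok else 'unavailable'}")
```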

Virtualization

MIG Support: Supported
MIG Partitions: 7 instances (max)
SR-IOV: Not Supported
vGPU Readiness: Supported (NVIDIA vGPU)
K8s Readiness: Certified (NVIDIA GPU Operator)
GPU Sharing: MIG, Time-Slicing, MPS, vGPU
Virt Efficiency: Near bare-metal (vendor claim)
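
MIG state can be inspected programmatically through NVML, as in this read-only sketch using the nvidia-ml-py bindings (pip install nvidia-ml-py). Enabling MIG itself requires administrative tooling such as nvidia-smi and is not shown here.

```python
import pynvml  # pip install nvidia-ml-py

# Read-only MIG inspection sketch; nvmlDeviceGetMigMode raises an
# NVMLError on GPUs that do not support MIG.
pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    current, pending = pynvml.nvmlDeviceGetMigMode(handle)
    print(f"GPU {i} ({name}): MIG current={current}, pending={pending}")
pynvml.nvmlShutdown()
```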

Power & Efficiency

TDP: 700 W (per NVL pair; 350-400 W configurable per GPU)
Peak Power: 700-750 W
Idle Power: 70-100 W
Perf / Watt: Up to 26 TFLOPS FP16/W (theoretical, workload-dependent)
PSU Required: N/A
Connectors: 2x PCIe 8-pin per GPU
Thermal Limits: Up to 85°C GPU temperature; requires high airflow or liquid cooling in dense deployments
Efficiency: N/A
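
The configurable power cap can be monitored at runtime via NVML. The sketch below reads the live draw and the enforced limit for device 0; both values are reported by the driver in milliwatts.

```python
import pynvml  # pip install nvidia-ml-py

# Live power monitoring sketch for device 0; NVML reports milliwatts.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
draw_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000
print(f"Power draw: {draw_w:.0f} W (enforced limit: {limit_w:.0f} W)")
pynvml.nvmlShutdown()
```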

Physical Design

Form Factor: Dual-slot PCIe card; two cards joined by NVLink bridges in the H100 NVL configuration
FHFL: Yes (full height, full length)
Slot Width: Dual slot
Dimensions: 267 mm x 112 mm x 41 mm (per card)
Weight: 1.8-2.2 kg (per card)
Cooling: Passive (requires external server/board cooling)
Rack Density: Optimized for high-density GPU servers (NVLink bridges, multi-GPU chassis)

Thermals & Cooling

Airflow: Passive heatsink; requires high chassis airflow
Temp Range: 0°C to 45°C (ambient)
Throttling: Thermal-based clock reduction at the Tjunction limit
Noise Level: Not applicable (passive module)
Liquid Cooling: Direct-to-chip liquid cooling available for dense deployments
DC Heat: High (rack-scale deployment recommended)

Software Ecosystem

CUDA: CUDA 12.x supported
ROCm: Not supported
oneAPI: Not supported
PyTorch: Officially supported
TensorFlow: Officially supported
JAX: Supported via CUDA backend
HuggingFace: Optimized (CUDA kernels available)
Triton Server: Supported
Docker: Official container images available
Compiler Stack: Mature CUDA compiler stack
Kernel Optim: Standard driver-based support
Driver Stability: Enterprise-grade stability
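
For FP8 on the Transformer Engine, NVIDIA's transformer-engine package wraps the precision handling. The sketch below shows the basic fp8_autocast pattern (layer sizes are arbitrary placeholders; the package and a Hopper GPU are assumed to be present).

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# FP8 Transformer Engine sketch; sizes are arbitrary placeholders and the
# transformer-engine package plus a Hopper GPU are assumed.
fp8_recipe = recipe.DelayedScaling()             # default delayed-scaling recipe
layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.float16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
print(y.shape)   # torch.Size([16, 4096])
```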

Server & Deployment

OEM Availability: Dell, HPE, Supermicro (Tier-1 OEMs)
Preconfigured: 4U 8-GPU systems
DGX/HGX: Core of HGX baseboards
Rack-Scale: NVLink Switch System, InfiniBand scale-out
Edge Deploy: Not suitable for edge deployment due to high TDP
Ref Architectures: NVIDIA MGX, OVX, SuperPOD

System Compatibility

CPU Pairing: Integrated with platform CPU (HGX/DGX architecture)
NUMA: Standard NUMA behavior
Required PCIe: PCIe Gen 5 x16
Motherboard: Platform-specific (HGX/NVL baseboard)
Rack Power: Contact vendor for rack power planning
BIOS Limits: Not specified
CXL Ready: Not supported
OS Compat: RHEL and Ubuntu LTS supported; Windows Server supported

Benchmarks & Throughput

Structured Sparsity

Supported (up to 2x vs dense)
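
PyTorch exposes 2:4 semi-structured sparsity as a prototype API in recent releases; the sketch below builds a valid 2:4 pattern by hand and runs a sparse matmul. Exact API surface and supported shapes vary by PyTorch version, so treat this as an assumption-laden illustration.

```python
import torch
from torch.sparse import to_sparse_semi_structured

# Prototype 2:4 semi-structured sparsity sketch (PyTorch 2.1+; API and
# supported shapes vary by version). Keep 2 of every 4 weights per row.
keep = torch.tensor([1, 1, 0, 0], dtype=torch.bool, device="cuda")
w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
w_24 = w * keep.repeat(4096, 1024)           # valid 2:4 pattern, (4096, 4096)

w_sparse = to_sparse_semi_structured(w_24)   # compressed representation
x = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
y = torch.mm(w_sparse, x)                    # runs on sparse tensor cores
print(y.shape)
```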

Transformer Throughput

Supported (Transformer Engine)

Multi-GPU Scalability

Scaling Efficiency

Single GPU: High standalone efficiency, with substantial compute from the Hopper architecture.
2-GPU: With the NVLink bridge, two GPUs scale efficiently, minimizing latency and maximizing bandwidth.
4-GPU: Scaling to four GPUs remains efficient with NVLink bridges, though traffic that falls back to PCIe is bandwidth-limited.
8-GPU: Near-linear scaling is achievable with NVLink bridges, but PCIe-only configurations may face bandwidth contention.
64+ GPU: InfiniBand or RoCE v2 is necessary to manage network overhead at this scale; inter-node communication is the main potential bottleneck.

Scaling Characteristics

Cross-Node Latency: GPUDirect RDMA support reduces cross-node latency, which is essential for distributed performance.
Network Bottlenecks: PCIe bandwidth limits and VRAM pressure are the most likely bottlenecks in data-intensive workloads.
Parallelism: Supports data, model, pipeline, and tensor parallelism; compatible with frameworks such as DeepSpeed and Megatron for distributed training. A minimal bandwidth probe is sketched below.
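
The probe below measures all-reduce bus bandwidth with torch.distributed over NCCL, which is one way to verify the NVLink bridge is actually being used. It assumes a two-GPU launch via torchrun (the script name is a placeholder).

```python
import time
import torch
import torch.distributed as dist

# All-reduce bandwidth probe (a sketch). Launch on the NVL pair with:
#   torchrun --nproc_per_node=2 allreduce_probe.py   # placeholder filename
dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

x = torch.randn(64 * 1024**2, device="cuda")   # 64M floats, 256 MiB
for _ in range(5):                             # warm-up
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
dt = time.perf_counter() - t0

# Ring all-reduce moves ~2*(n-1)/n of the buffer per rank per iteration.
n = dist.get_world_size()
gb = x.numel() * 4 * 2 * (n - 1) / n * iters / 1e9
if rank == 0:
    print(f"All-reduce bus bandwidth: {gb / dt:.1f} GB/s")
dist.destroy_process_group()
```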

Workload Readiness

LLM Training

Based on the Hopper architecture, the H100 NVL is well suited to multi-node training of large language models (400B+ parameters), thanks to its high VRAM capacity and interconnect bandwidth.

LLM Inference

The H100 NVL excels in LLM inference with high token-per-second throughput and ample KV cache headroom, making it ideal for large-scale deployments.
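
To make the KV-cache headroom claim concrete, the sketch below sizes the cache for assumed Llama-2-70B-like shapes (80 layers, 8 KV heads of dimension 128 via GQA, FP16 cache); these shapes are illustrative assumptions, not vendor figures.

```python
# KV-cache sizing sketch (assumed Llama-2-70B-like shapes; GQA).
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2     # FP16 cache
per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V, bytes
print(f"KV cache per token: {per_token / 1e6:.2f} MB")

weights_gb = 70e9 * 2 / 1e9            # FP16 weights for a 70B model
free_gb = 188 - weights_gb             # headroom on the 188 GB pair
tokens = free_gb * 1e9 / per_token
print(f"~{tokens:,.0f} cached tokens fit in the remaining {free_gb:.0f} GB")
```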

Vision Training

With its advanced Tensor cores and substantial VRAM, the H100 NVL is highly efficient for training large vision models, supporting complex architectures and large batch sizes.

Diffusion Models

The H100 NVL is well-suited for diffusion models, offering high computational throughput and memory bandwidth necessary for training and inference of complex generative models.

Multimodal AI

The H100 NVL's architecture supports multimodal AI tasks efficiently, providing the necessary compute power and memory bandwidth for handling diverse data types simultaneously.

Reinforcement Learning

The H100 NVL is highly capable for reinforcement learning workloads, offering fast computation and high memory capacity to handle complex environments and large state spaces.

HPC / Simulation

The H100 NVL provides strong support for HPC simulations with its robust FP64 performance, making it suitable for scientific and engineering simulations requiring high precision.

Scientific Computing

With excellent double precision capabilities, the H100 NVL is ideal for scientific computing tasks that demand high accuracy and computational power.

Edge Inference

The H100 NVL is not optimized for edge inference due to its high power consumption and large form factor, making it more suitable for data center environments.

Real-Time Serving

The H100 NVL is highly efficient for real-time AI serving, offering low latency and high throughput for demanding applications.

Fine-Tuning

The H100 NVL is highly efficient for full fine-tuning tasks, leveraging its large VRAM and advanced architecture to handle extensive model updates.

LoRA Efficiency

The H100 NVL is also efficient for LoRA fine-tuning, providing sufficient memory and compute resources to support parameter-efficient training methods.
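
A typical parameter-efficient setup on this card uses Hugging Face PEFT, as in this hedged sketch (the model ID and target module names are placeholders that follow common Llama conventions).

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# LoRA sketch with Hugging Face PEFT (model ID and target modules are
# placeholder assumptions following common Llama naming).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", torch_dtype="auto", device_map="auto")

config = LoraConfig(r=16, lora_alpha=32,
                    target_modules=["q_proj", "v_proj"],
                    lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()   # typically well under 1% trainable
```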

Market Authority

Cloud Adoption

NVIDIA has publicly confirmed H100 NVL adoption by Microsoft Azure and Oracle Cloud Infrastructure.

Research Citations

Limited; as of June 2024, few peer-reviewed papers explicitly cite H100 NVL due to its recent release.

GitHub Support

Some emerging support; select repositories (e.g., NVIDIA/DeepLearningExamples) mention H100 NVL compatibility, but widespread optimization is not yet prevalent.

Key Strengths

The H100 NVL excels at large-scale AI training and inference tasks, particularly in natural language processing and deep learning models. Its architecture is optimized for transformer models, offering significant performance improvements over previous generations. The GPU's high memory bandwidth and advanced tensor cores make it ideal for demanding computational workloads.

Limitations

The H100 NVL's high power requirements and need for advanced cooling solutions can be a limitation for some deployments. Additionally, its premium pricing and availability constraints may pose challenges for smaller organizations. Users should also consider the infrastructure investment needed to fully leverage its capabilities.

Expert Insight

The H100 represents a strategic leap in AI compute. When comparing cloud providers, consider not just the hourly rate but also the interconnect bandwidth (InfiniBand/NVLink) and regional availability, which can significantly affect total cost of ownership for large-scale training.
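
That point about interconnect versus hourly rate can be made concrete with toy numbers: if weaker interconnect drops scaling efficiency, the cheaper provider can cost more for the same job. All figures below are illustrative assumptions, not quotes.

```python
# Toy TCO comparison: identical job, different scaling efficiency.
# All numbers are illustrative assumptions, not real quotes.
providers = {
    "A": {"rate_usd": 3.07, "scaling_eff": 0.92},  # strong interconnect
    "B": {"rate_usd": 2.50, "scaling_eff": 0.70},  # weaker interconnect
}
ideal_gpu_hours = 10_000   # GPU-hours the job would take at perfect scaling

for name, p in providers.items():
    billed_hours = ideal_gpu_hours / p["scaling_eff"]
    cost = billed_hours * p["rate_usd"]
    print(f"Provider {name}: {billed_hours:,.0f} GPU-hours -> ${cost:,.0f}")
```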

Glossary Terms

FP32 TFLOPS
VRAM
TDP
Cores
Information updated daily. Cloud pricing subject to vendor availability.