NVIDIA · 2022-03-27

H100

SXM

The NVIDIA H100 SXM variant features exceptional performance and scalability for a wide range of workloads. It includes fourth-generation Tensor Cores and a Transformer Engine with FP8 precision, providing up to 4X faster training over the prior generation for large language models.
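
The FP8 path is typically driven from PyTorch through NVIDIA's Transformer Engine library. The snippet below is a minimal sketch of that flow, assuming the transformer-engine package is installed; the layer sizes and recipe settings are illustrative, and exact recipe arguments can vary between library versions.

```python
# Minimal sketch: running a linear layer in FP8 via NVIDIA Transformer Engine.
# Assumes an FP8-capable GPU (e.g. H100) and the `transformer-engine` package.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID = E4M3 for the forward pass, E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)   # the matmul executes in FP8 on the Tensor Cores

print(y.shape)
```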

H100 SXM
VRAM: 80 GB
FP32: 67 TFLOPS
CUDA Cores: 16,896
TDP: Up to 700 W (configurable)

Provider Marketplace

Cheapest: from $1.90/hour
Best Value: from $1.90/hour
Enterprise Choice: from $2.94/hour

All Cloud Providers

3 options available

Hyperstack (Cheapest) · On-Demand · Global Availability · est. $1.90/hour
On-Demand · Global Availability · est. $1.90/hour
On-Demand · Global Availability · est. $2.94/hour

Compute Performance

FP64: 34 TFLOPS
FP32: 67 TFLOPS
TF32: 67 TFLOPS (Dense), 133 TFLOPS (Sparse)
FP16: 133 TFLOPS (Dense), 197 TFLOPS (Sparse)
BF16: 133 TFLOPS (Dense), 197 TFLOPS (Sparse)
FP8: 263 TFLOPS (Dense), 395 TFLOPS (Sparse)
INT8: 263 TOPS (Dense), 527 TOPS (Sparse)
INT4: 527 TOPS (Dense), 1050 TOPS (Sparse)
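
To relate these peak figures to what a workload actually achieves, a simple timing loop over a dense BF16 matmul gives an effective TFLOPS number. This is a rough, hedged sketch using standard PyTorch only; the matrix size and iteration count are arbitrary choices.

```python
# Rough effective-TFLOPS measurement for a dense BF16 matmul (illustrative only).
import torch

n = 8192
a = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
b = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)

# Warm up so kernels are compiled/cached before timing.
for _ in range(3):
    torch.matmul(a, b)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 20

start.record()
for _ in range(iters):
    torch.matmul(a, b)
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1e3 / iters   # elapsed_time returns milliseconds
tflops = 2 * n**3 / seconds / 1e12                # 2*N^3 FLOPs per N x N matmul
print(f"effective dense BF16 throughput: {tflops:.1f} TFLOPS")
```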

Architecture

Microarchitecture: Hopper
Process Node: TSMC 4N
Die Size: 814 mm²
Transistors: 80B
Compute Units: 132 SMs
Tensor Cores: 4th Gen, 528 Tensor Cores
RT Cores: None
Matrix Engine: Transformer Engine (FP8/FP16/BF16)
Base Clock: 1400 MHz
Boost Clock: 1770 MHz
Transformer Engine: Yes (Gen 1)
Sparse Acceleration: Supported (2:4 structured sparsity)
Dynamic Precision: Supported (FP8/FP16/BF16/TF32)

Memory & VRAM

Memory Type: HBM3
Total Capacity: 80 GB
Bandwidth: 3.0 TB/s
Bus Width: 5120-bit
HBM Stacks: 5
ECC Support: Yes (Inline)
Unified Memory: Yes (CUDA Unified Memory)
Compression: Not Published
NUMA Awareness: Not Published
Memory Pooling: NVLink memory pooling supported via NVLink Switch System
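
The capacity and SM count listed above can be confirmed from inside a running job with standard PyTorch device queries; the sketch below uses only public torch.cuda calls.

```python
# Query visible memory and SM count on device 0 (standard PyTorch calls).
import torch

props = torch.cuda.get_device_properties(0)
free, total = torch.cuda.mem_get_info(0)   # bytes free / total on device 0

print(props.name)                                        # e.g. an H100 80GB HBM3 part
print(f"total memory: {props.total_memory / 2**30:.1f} GiB")
print(f"free memory : {free / 2**30:.1f} GiB")
print(f"SM count    : {props.multi_processor_count}")    # 132 on H100 SXM
```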

Connectivity & Scaling

Interconnect: NVLink
Generation: NVLink 4
NVLink Bandwidth: 900 GB/s
PCIe Interface: PCIe Gen 5 x16
CXL Support: Not Supported
Topology: Fully connected NVLink mesh (via HGX baseboard)
Max GPUs/Node: 8
Scale-Out: Yes (InfiniBand NDR/RoCE v2 via NICs)
GPUDirect RDMA: Yes
P2P Memory: Yes (see the check below)
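
On an HGX baseboard every GPU pair is reachable over NVLink, so peer-to-peer copies avoid staging through host memory. The sketch below is a minimal check using plain PyTorch, assuming at least two visible GPUs.

```python
# Check P2P reachability between GPU 0 and GPU 1 and perform a direct copy.
import torch

assert torch.cuda.device_count() >= 2, "needs at least two visible GPUs"

if torch.cuda.can_device_access_peer(0, 1):
    print("GPU 0 can access GPU 1 directly (P2P path; NVLink on HGX systems)")

x = torch.randn(1024, 1024, device="cuda:0")
y = x.to("cuda:1", non_blocking=True)   # device-to-device copy, no host round-trip
torch.cuda.synchronize()
print(y.device)
```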

Virtualization

MIG Support: Supported (see the query sketch below)
MIG Partitions: 7 instances (max)
SR-IOV: Not Supported
vGPU Readiness: Supported (NVIDIA vGPU)
K8s Readiness: Certified (NVIDIA GPU Operator)
GPU Sharing: MIG, Time-Slicing, MPS, vGPU
Virt Efficiency: Near bare-metal (vendor claim)
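
A scheduler or launcher can confirm whether MIG is enabled before placing work. The sketch below queries the MIG mode through NVML; it assumes the nvidia-ml-py (pynvml) bindings are installed on the host.

```python
# Check MIG mode on GPU 0 via NVML (requires the nvidia-ml-py package).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

current_mode, pending_mode = pynvml.nvmlDeviceGetMigMode(handle)
print("MIG enabled:", current_mode == pynvml.NVML_DEVICE_MIG_ENABLE)

pynvml.nvmlShutdown()
```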

Power & Efficiency

TDP: 700 W
Peak Power: 700-800 W
Idle Power: 70-100 W
Perf / Watt: Up to 26 TFLOPS/W (FP8, theoretical peak)
PSU Required: N/A
Connectors: Direct busbar (SXM socket, no external connectors)
Thermal Limits: Max GPU temperature 85°C (throttling above 85°C)
Efficiency: N/A

Physical Design

Form Factor: SXM5 module
FHFL: N/A
Slot Width: N/A
Dimensions: 112 mm x 157 mm
Weight: 1.8–2.2 kg
Cooling: Passive heatsink with chassis airflow; direct liquid cooling (cold plate) available in liquid-cooled HGX systems
Rack Density: Optimized for high-density GPU servers (HGX H100 4/8-GPU baseboards)

Thermals & Cooling

Airflow: Server chassis airflow required (CFM not published)
Temp Range: 0°C to 45°C
Throttling: Thermal-based clock reduction at Tjunction limit
Noise Level: Not applicable (passive module)
Liquid Cooling: Optional (direct-to-chip cold plates in liquid-cooled HGX configurations)
DC Heat: High (rack-scale deployment recommended)

Software Ecosystem

CUDA: CUDA 12.x supported
ROCm: Not Supported
oneAPI: Not Supported
PyTorch: Officially supported
TensorFlow: Officially supported
JAX: Supported via CUDA backend
HuggingFace: Optimized (CUDA kernels available)
Triton Server: Supported
Docker: Official container images available
Compiler Stack: Mature CUDA compiler stack
Kernel Optim: Standard driver-based support
Driver Stability: Enterprise-grade stability

Server & Deployment

OEM Availability: Tier-1 OEMs (Dell, HPE, Supermicro)
Preconfigured: 4U 8-GPU systems, 2U 4-GPU systems
DGX/HGX: Core of HGX baseboards, available in DGX systems
Rack-Scale: NVLink Switch System, InfiniBand scale-out
Edge Deploy: Not suitable for edge deployment due to high TDP
Ref Architectures: NVIDIA MGX, SuperPOD

System Compatibility

CPU Pairing: Integrated with platform CPU (HGX/DGX architecture)
NUMA: NUMA locality impacts GPU-to-CPU bandwidth; optimal performance with a balanced memory configuration
Required PCIe: Not Applicable (SXM/OAM)
Motherboard: Platform-specific (HGX baseboard with SXM socket required)
Rack Power: Contact vendor for rack power planning
BIOS Limits: Not Published
CXL Ready: Not Supported
OS Compat: Supported on major Linux distributions (RHEL, Ubuntu LTS); Windows Server supported

Benchmarks & Throughput

Structured Sparsity

Supported (up to 2x vs dense)
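
The 2:4 pattern keeps the two largest-magnitude values in every contiguous group of four weights, which is what the sparse Tensor Core path accelerates. The sketch below is purely illustrative (plain PyTorch, not the library pruning workflow) and only shows how such a mask is formed.

```python
# Illustrative 2:4 structured-sparsity mask: keep the 2 largest of every 4 weights.
import torch

w = torch.randn(4096, 4096)                    # dense weight matrix
groups = w.abs().reshape(-1, 4)                # contiguous groups of 4 values
top2 = groups.topk(k=2, dim=1).indices         # indices of the 2 largest magnitudes
mask = torch.zeros_like(groups).scatter_(1, top2, 1.0).reshape_as(w)

w_sparse = w * mask                            # exactly 50% of the weights remain
print(f"kept fraction: {mask.mean().item():.2f}")   # -> 0.50
```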

Transformer Throughput

Supported (Transformer Engine)

Multi-GPU Scalability

Scaling Efficiency

Single GPU: Baseline performance; all compute and HBM3 bandwidth are local to the device, with no interconnect overhead.
2-GPU: Near-linear scaling thanks to NVLink's high bandwidth, which allows efficient data transfer between the two GPUs.
4-GPU: Continues to scale near-linearly, as the NVSwitch architecture manages communication between multiple GPUs effectively.
8-GPU: Maintains near-linear scaling up to 8 GPUs, leveraging NVSwitch to handle inter-GPU communication without significant bottlenecks.
64+ GPU: Scalability is affected by InfiniBand/Ethernet overhead, but multi-rail networking and GPUDirect RDMA help mitigate latency.

Scaling Characteristics

Cross-Node Latency: Minimized through GPUDirect RDMA support, which enables efficient data transfer across nodes in distributed training.
Network Bottlenecks: Within a node, NVLink and NVSwitch remove the host-to-device bridge as a limitation; at multi-node scale, the network fabric and VRAM pressure from very large models become the main constraints.
Parallelism: Supports data, model, pipeline, and tensor parallelism, enabling efficient distributed training with frameworks such as DeepSpeed and Megatron (see the sketch below).
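
As a concrete starting point, the skeleton below shows a minimal data-parallel job over the NCCL backend, launched with torchrun. It is a hedged sketch: the model, batch size, and optimizer are placeholders, and real training would use one of the frameworks named above.

```python
# Minimal DDP skeleton for multi-GPU training over NCCL.
# Launch example: torchrun --nproc_per_node=8 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")        # NCCL rides NVLink/NVSwitch inside a node
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(4096, 4096).cuda(), device_ids=[local_rank])
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(10):
    x = torch.randn(32, 4096, device="cuda")
    loss = model(x).square().mean()            # dummy loss; DDP all-reduces the gradients
    loss.backward()
    opt.step()
    opt.zero_grad()

dist.destroy_process_group()
```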

Workload Readiness

LLM Training

The H100 SXM, based on the Hopper architecture, is highly suitable for training large language models, including 70B and 400B+ models, especially in multi-node configurations due to its high VRAM and advanced interconnect capabilities.

LLM Inference

The H100 SXM excels in LLM inference with its 4th-gen Tensor cores, providing high token-per-second throughput and ample KV cache headroom for efficient inference of large models.
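
A back-of-envelope sizing makes the KV-cache headroom concrete. The numbers below are illustrative assumptions for a hypothetical 8B-class decoder with grouped-query attention and a BF16 cache, not measured figures for any specific model.

```python
# Back-of-envelope KV-cache headroom on one 80 GB H100 (illustrative assumptions).
layers, kv_heads, head_dim = 32, 8, 128      # hypothetical 8B-class decoder (GQA)
bytes_per_elt = 2                            # BF16 keys and values

weights_gb = 8e9 * 2 / 1e9                   # ~16 GB of BF16 weights
free_gb = 80 - weights_gb - 4                # reserve ~4 GB for activations/overhead

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elt   # K and V
token_budget = free_gb * 1e9 / kv_bytes_per_token

print(f"KV cache per token  : {kv_bytes_per_token / 1024:.0f} KiB")
print(f"approx. token budget: {token_budget / 1e6:.2f} M tokens across all sequences")
```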

Vision Training

With its advanced Tensor cores and high memory bandwidth, the H100 SXM is highly efficient for training large-scale vision models, offering significant improvements over previous architectures.

Diffusion Models

The H100 SXM is well-suited for diffusion models, benefiting from its high computational throughput and memory capacity, enabling efficient training and inference of complex generative models.

Multimodal AI

The H100 SXM's architecture supports multimodal AI workloads effectively, leveraging its high compute power and memory to handle diverse data types and complex model architectures.

Reinforcement Learning

The H100 SXM provides excellent performance for reinforcement learning tasks, with its high throughput and efficient parallel processing capabilities, enabling rapid training of complex agents.

HPC / Simulation

The H100 SXM offers strong FP64 performance, making it suitable for HPC simulations that require double precision, although it is optimized more for AI workloads.

Scientific Computing

The H100 SXM is capable of handling scientific computing tasks, especially those that can leverage its Tensor cores and high memory bandwidth for accelerated computations.

Edge Inference

The H100 SXM is not ideal for edge inference due to its high power consumption and large form factor, making it more suitable for data center deployments.

Real-Time Serving

The H100 SXM is highly efficient for real-time AI serving, providing low latency and high throughput for demanding applications, thanks to its advanced architecture and Tensor cores.

Fine-Tuning

The H100 SXM is highly efficient for full fine-tuning tasks, leveraging its large VRAM and compute capabilities to handle extensive model updates.

LoRA Efficiency

The H100 SXM supports efficient LoRA fine-tuning, benefiting from its advanced architecture to perform low-rank adaptations with lower VRAM requirements.
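
A typical LoRA setup on this class of hardware looks like the sketch below, which assumes the Hugging Face transformers and peft packages; the checkpoint name and target modules are placeholders that depend on the model being adapted.

```python
# Minimal LoRA fine-tuning setup (assumes the `transformers` and `peft` packages).
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # placeholder checkpoint
    torch_dtype=torch.bfloat16,
).to("cuda")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],     # attention projections; model-specific
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()           # typically well under 1% of weights train
```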

Market Authority

MLPerf Ranking

NVIDIA H100 SXM is officially reported in MLPerf Training v3.1 and Inference v3.1 results, consistently ranking at or near the top across multiple benchmarks.

Cloud Adoption

Publicly confirmed by NVIDIA, H100 SXM is adopted by AWS (Amazon EC2 P5 instances), Google Cloud (A3 supercomputers), Microsoft Azure (ND H100 v5 VMs), and Oracle Cloud.

Supercomputer Usage

H100 SXM is deployed in top supercomputers such as NVIDIA's Eos and Microsoft Azure's Eagle, both of which rank highly on the TOP500 list.

Research Citations

H100 SXM is cited in numerous 2023-2024 research papers, particularly in large language model training and high-performance computing, as indexed by arXiv and IEEE Xplore.

Community Benchmarks

H100 SXM results are widely shared in open MLPerf submissions and community-led benchmarks, including Hugging Face and MLCommons forums.

GitHub Support

Extensive support for H100 SXM optimizations is present in major repositories such as PyTorch, TensorFlow, DeepSpeed, and NVIDIA's CUDA samples.

Enterprise Cases

NVIDIA has published case studies highlighting H100 SXM deployments at organizations like ServiceNow, OpenAI, and various healthcare and automotive enterprises.

Key Strengths

The H100 SXM excels in AI and machine learning workloads, particularly in training large neural networks and performing inference at scale. It offers significant performance improvements over its predecessors due to its advanced architecture and increased memory bandwidth. The H100 is also well-suited for high-performance computing (HPC) applications, providing exceptional computational power and efficiency.

Limitations

One limitation of the H100 SXM is its high power consumption, which may not be suitable for all datacenter environments. Additionally, its reliance on specific server platforms and cooling solutions can limit deployment flexibility. Availability can be constrained due to high demand and production capacities, potentially leading to longer lead times for procurement.

Expert Insight

The H100 represents a strategic leap in AI compute. When comparing cloud providers, consider not just the hourly rate, but also the interconnect bandwidth (InfiniBand/NVLink) and regional availability which can significantly impact total cost of ownership for large-scale training.

Glossary Terms

FP32 TFLOPS
VRAM
TDP
Cores
Information updated daily. Cloud pricing subject to vendor availability.