NVIDIA H100 NVL1 · March 2023

The NVIDIA H100 NVL1 is a high-performance GPU designed for data centers, targeting AI training and inference workloads. Built on the Hopper architecture, it offers significant improvements in performance and efficiency over its predecessors. Optimized for large-scale AI models and high-performance computing tasks, it is a top choice for enterprises and research institutions.

H100 NVL1

VRAM: 94 GB
FP32: 67 TFLOPS
CUDA Cores: 16,896

Provider Marketplace

Cheapest: from $1.91/hour
Best Value: from $1.91/hour
Enterprise Choice: from $3.07/hour

All Cloud Providers

2 options available

TensorDock (Cheapest): On-Demand, Global Availability, estimated $1.91/hour
RunPod: On-Demand, Global Availability, estimated $3.07/hour
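
For budgeting, the hourly rates above translate directly into job-level costs. A minimal sketch, assuming the listed on-demand rates and a hypothetical job size (no spot pricing, storage, or egress fees):

```python
# Estimate job cost from the on-demand rates listed above.
# Rates are the marketplace figures; job parameters are hypothetical.
RATES_PER_HOUR = {
    "TensorDock": 1.91,
    "RunPod": 3.07,
}

def job_cost(provider: str, gpus: int, hours: float) -> float:
    """Total on-demand cost for a multi-GPU job, ignoring storage/egress."""
    return RATES_PER_HOUR[provider] * gpus * hours

# Example: a 7-day, 8-GPU fine-tuning run.
for provider in RATES_PER_HOUR:
    print(f"{provider}: ${job_cost(provider, gpus=8, hours=24 * 7):,.2f}")
```

At these rates, the same week-long 8-GPU run comes to roughly $2,566 on TensorDock versus $4,126 on RunPod, so the per-hour delta compounds quickly at scale.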

Compute Performance

FP64: 34 TFLOPS
FP32: 67 TFLOPS
TF32: 133 TFLOPS (dense), 266 TFLOPS (sparse)
FP16: 133 TFLOPS (dense), 266 TFLOPS (sparse)
BF16: 133 TFLOPS (dense), 266 TFLOPS (sparse)
FP8: 266 TFLOPS (dense), 532 TFLOPS (sparse)
INT8: 266 TOPS (dense), 532 TOPS (sparse)
INT4: 532 TOPS (dense), 1064 TOPS (sparse)

Architecture

Microarchitecture: Hopper
Process Node: TSMC 4N
Die Size: 814 mm²
Transistors: 80B
Compute Units: 132 SMs
Tensor Cores: 528 (4th Gen)
RT Cores: None
Matrix Engine: Transformer Engine (FP8/FP16/BF16)
Base Clock: Not specified
Boost Clock: Not specified
Transformer Engine: Yes (Gen 1)
Sparse Acceleration: Supported (2:4 structured sparsity)
Dynamic Precision: Supported (FP8/FP16/BF16/TF32)
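
The Transformer Engine row refers to NVIDIA's library of the same name, which casts eligible layers to FP8 at run time. A minimal sketch using the transformer_engine PyTorch bindings; layer sizes are hypothetical and recipe options vary by library version:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# FP8 scaling recipe; DelayedScaling tracks a history of per-tensor amax values.
fp8_recipe = recipe.DelayedScaling(margin=0, amax_history_len=16)

# te.Linear is a drop-in replacement for torch.nn.Linear with FP8 support.
layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

# Matmuls inside this context run on FP8 tensor cores where eligible.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
print(y.shape)  # torch.Size([16, 4096])
```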

Memory & VRAM

Memory Type: HBM3
Total Capacity: 94 GB
Bandwidth: 4.8 TB/s
Bus Width: 6144-bit
HBM Stacks: 6
ECC Support: Yes (Inline)
Unified Memory: Yes (CUDA Unified Memory)
Compression: Not specified
NUMA Awareness: Not specified
Memory Pooling: NVLink memory pooling supported with NVLink Switch System
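
The bandwidth figure pairs with the compute numbers above to give a rough roofline: any kernel whose arithmetic intensity falls below peak FLOPS divided by bandwidth is memory-bound. A back-of-envelope sketch using the spec-sheet figures:

```python
# Roofline ridge points derived from the spec-sheet figures above.
BANDWIDTH = 4.8e12          # bytes/s
PEAKS = {                   # dense peak throughput, FLOP/s
    "FP32": 67e12,
    "FP16 (Tensor)": 133e12,
    "FP8 (Tensor)": 266e12,
}

for name, peak in PEAKS.items():
    # A kernel needs at least this many FLOPs per byte moved
    # to be compute-bound rather than bandwidth-bound.
    ridge = peak / BANDWIDTH
    print(f"{name}: ridge point ~{ridge:.0f} FLOP/byte")
```

By this estimate, FP32 work needs roughly 14 FLOPs per byte to saturate the ALUs; large GEMMs clear that easily, while memory-bound ops such as layernorm and KV-cache reads do not.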

Connectivity & Scaling

Interconnect: NVLink Switch
Generation: NVLink 4
Interconnect Bandwidth: 1.8 TB/s
PCIe Interface: PCIe Gen 5 x16
CXL Support: Not Supported
Topology: NVLink domain via NVLink Switch; all-to-all within NVL1 system
Max GPUs/Node: 2
Scale-Out: Yes (via InfiniBand NDR/RoCE v2)
GPUDirect RDMA: Yes
P2P Memory: Yes

Virtualization

MIG Support: Supported
MIG Partitions: 7 instances (max)
SR-IOV: Not Supported
vGPU Readiness: Supported (NVIDIA vGPU)
K8s Readiness: Certified (NVIDIA GPU Operator)
GPU Sharing: MIG, Time-Slicing, MPS, vGPU
Virt Efficiency: Near bare-metal (vendor claim)
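
MIG partitioning is driven through nvidia-smi. A minimal sketch that enables MIG mode and lists the available instance profiles; run with root privileges, and note that profile IDs differ by driver version, so the creation step uses a placeholder ID:

```python
import subprocess

def run(cmd: list[str]) -> None:
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Enable MIG mode on GPU 0 (takes effect after a GPU reset).
run(["nvidia-smi", "-i", "0", "-mig", "1"])

# List the GPU instance profiles this driver exposes.
run(["nvidia-smi", "mig", "-lgip"])

# Create a GPU instance from a profile ID taken from the listing above
# (ID 19 is a placeholder; -C also creates the matching compute instance).
run(["nvidia-smi", "mig", "-cgi", "19", "-C"])
```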

Power & Efficiency

TDP: 700 W
Peak Power: 700-750 W
Idle Power: 60-80 W
Perf/Watt: Up to 0.45 TFLOPS FP16/W
PSU Required: N/A
Connectors: 2x PCIe 8-pin (per GPU)
Thermal Limits: Max GPU temperature 85°C; requires high-efficiency data center cooling
Efficiency: N/A

Physical Design

Form Factor: Dual SXM5 module (NVLink 1:1 pair, H100 NVL1)
FHFL: N/A
Slot Width: N/A
Dimensions: 143 mm x 78 mm x 32 mm (per SXM5 module)
Weight: 1.8–2.2 kg (per SXM5 module)
Cooling: Passive (requires external server cooling)
Rack Density: Designed for high-density GPU servers with SXM5 sockets; supports NVLink interconnect for multi-GPU scaling

Thermals & Cooling

Airflow: Direct-to-chip liquid cooling
Temp Range: 0°C to 45°C
Throttling: Thermal-based clock reduction at Tjunction limit
Noise Level: Not Applicable (Passive Module)
Liquid Cooling: Direct-to-chip liquid cooling required
DC Heat: High (rack-scale deployment recommended)

Software Ecosystem

CUDA: CUDA 12.x supported
ROCm: Not Supported
oneAPI: Not Supported
PyTorch: Officially supported
TensorFlow: Officially supported
JAX: Supported via CUDA backend
HuggingFace: Optimized (CUDA kernels available)
Triton Server: Supported
Docker: Official container images available
Compiler Stack: Mature CUDA compiler stack
Kernel Optim: Standard driver-based support
Driver Stability: Enterprise-grade stability

Server & Deployment

OEM Availability: Tier-1 OEMs: Dell, HPE, Supermicro
Preconfigured: 4U 8-GPU systems
DGX/HGX: Core of HGX baseboards
Rack-Scale: NVLink Switch System, InfiniBand scale-out
Edge Deploy: Not suitable for edge deployment due to high TDP
Ref Architectures: NVIDIA MGX, OVX, SuperPOD

System Compatibility

CPU Pairing: Integrated with platform CPU (HGX/DGX architecture)
NUMA: Standard NUMA behavior
Required PCIe: Not Applicable (SXM/OAM)
Motherboard: Platform-specific (HGX/NVL baseboard)
Rack Power: Contact vendor for rack power planning
BIOS Limits: Not specified
CXL Ready: Not Supported
OS Compat: RHEL and Ubuntu LTS supported; Windows Server supported

Benchmarks & Throughput

Structured Sparsity: Supported (up to 2x vs. dense)

Transformer Throughput: Supported (Transformer Engine)
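
The 2x sparse figures assume weights pruned to the 2:4 pattern (at most two nonzeros in every group of four), which the tensor cores can then skip in hardware. A minimal NumPy sketch of the pruning step, keeping the two largest-magnitude weights per group:

```python
import numpy as np

def prune_2_4(w: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude weights in each group of 4.

    Assumes w.size is divisible by 4.
    """
    flat = w.reshape(-1, 4)
    # Indices of the 2 smallest |w| per group of 4.
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]
    out = flat.copy()
    np.put_along_axis(out, drop, 0.0, axis=1)
    return out.reshape(w.shape)

w = np.random.randn(4, 8).astype(np.float32)
sparse_w = prune_2_4(w)
assert (np.count_nonzero(sparse_w.reshape(-1, 4), axis=1) <= 2).all()
```

In practice, magnitude pruning is followed by fine-tuning to recover accuracy, and the speedup only materializes through sparse-aware kernels (e.g. cuSPARSELt-backed paths).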

Multi-GPU Scalability

Scaling Efficiency

Single GPU: The H100 NVL1 offers high efficiency with its advanced architecture, maximizing throughput for single-GPU workloads.
2-GPU: With NVLink bridge support, two GPUs can achieve near-linear scaling due to high interconnect bandwidth.
4-GPU: Scaling remains efficient with four GPUs, leveraging NVLink bridges to minimize latency and maximize bandwidth.
8-GPU: Scaling is near-linear up to 8 GPUs using NVLink bridges, ensuring high-bandwidth, low-latency communication.
64+ GPU: At this scale, InfiniBand or RoCE v2 overhead becomes significant, requiring careful network topology design to minimize latency and maximize throughput.

Scaling Characteristics

Cross-Node Latency: Supports GPUDirect RDMA, allowing for low-latency, high-bandwidth communication across nodes, essential for distributed training.
Network Bottlenecks: Potential bottlenecks include host-to-device PCIe bandwidth limitations and VRAM pressure in memory-intensive workloads.
Parallelism: Supports data, model, pipeline, and tensor parallelism, compatible with frameworks like DeepSpeed and Megatron for efficient distributed training.
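
For the data-parallel case, the standard pattern on NVLink/InfiniBand systems is NCCL-backed DistributedDataParallel. A minimal sketch, assuming launch via torchrun so RANK/LOCAL_RANK/WORLD_SIZE are set in the environment; the model is a stand-in:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# NCCL rides NVLink within a node and InfiniBand/RoCE across nodes.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(4096, 4096).to(f"cuda:{local_rank}")  # stand-in model
model = DDP(model, device_ids=[local_rank])

x = torch.randn(32, 4096, device=f"cuda:{local_rank}")
loss = model(x).sum()
loss.backward()  # gradients are all-reduced across ranks here
dist.destroy_process_group()
```

Launched as, e.g., `torchrun --nproc_per_node=8 train.py` on each node; NCCL then picks the fastest available transport automatically.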

Workload Readiness

LLM Training

The H100 NVL1, based on the Hopper architecture, is highly suitable for training large language models, supporting up to 400B+ parameter models in a multi-node setup due to its high VRAM capacity and advanced interconnects.
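
The multi-node claim follows from simple memory arithmetic: with mixed-precision Adam, each parameter typically costs about 16 bytes of state (2 for bf16 weights, 2 for gradients, 12 for fp32 master weights and optimizer moments), before activations. A back-of-envelope sketch for a hypothetical 400B-parameter run with ZeRO-style sharding:

```python
PARAMS = 400e9                 # hypothetical model size
BYTES_PER_PARAM = 2 + 2 + 12   # bf16 weights + bf16 grads + fp32 Adam states
VRAM_PER_GPU = 94e9            # bytes, from the spec above

total_state = PARAMS * BYTES_PER_PARAM
print(f"Total training state: {total_state / 1e12:.1f} TB")

# ZeRO-3-style sharding spreads all of this state evenly across GPUs.
for n_gpus in (64, 128, 256):
    per_gpu = total_state / n_gpus
    print(f"{n_gpus} GPUs: {per_gpu / 1e9:.0f} GB/GPU "
          f"({per_gpu / VRAM_PER_GPU:.0%} of 94 GB, before activations)")
```

Even with full sharding, the state alone overflows 64 GPUs once activations are counted, which is why 400B-class training is inherently a multi-node exercise.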

LLM Inference

Optimized for high throughput inference with 4th-gen Tensor cores, providing excellent token-per-second performance and sufficient KV cache for large models.
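
The KV-cache claim is easy to sanity-check: cache size is 2 (K and V) x layers x KV heads x head dim x bytes per element x tokens x batch. A sketch with a hypothetical 70B-class configuration (80 layers, 8 grouped-query KV heads of dimension 128, FP16 cache):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Size of the K/V cache for one model instance."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len * batch

# Hypothetical 70B-class config with grouped-query attention.
cfg = dict(layers=80, kv_heads=8, head_dim=128)

for batch in (1, 32, 128):
    gb = kv_cache_bytes(**cfg, seq_len=8192, batch=batch) / 1e9
    print(f"batch {batch:3d} @ 8k context: {gb:6.1f} GB of KV cache")
```

On a 94 GB card the model weights compete for the same space, so the extra capacity over an 80 GB H100 largely buys KV-cache headroom for longer contexts and larger batches.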

Vision Training

Ideal for vision model training with its high computational throughput and efficient tensor operations, supporting large batch sizes and complex models.

Diffusion Models

Highly capable for diffusion models due to its large memory bandwidth and tensor core optimizations, allowing for efficient training and inference.

Multimodal AI

Well-suited for multimodal AI tasks, leveraging its advanced architecture to handle diverse data types and complex model architectures efficiently.

Reinforcement Learning

Excellent for reinforcement learning with its high parallel processing capabilities and fast memory access, enabling rapid environment simulation and model updates.

HPC / Simulation

Strong performance in HPC simulations with robust FP64 support, making it suitable for scientific and engineering applications requiring high precision.

Scientific Computing

Highly effective for scientific computing tasks, offering substantial computational power and memory bandwidth for data-intensive workloads.

Edge Inference

Less suitable for edge inference due to high power consumption and large form factor, better suited for data center deployments.

Real-Time Serving

Capable of real-time AI serving with low latency and high throughput, ideal for demanding AI applications in a data center environment.

Fine-Tuning

Highly efficient for full fine-tuning tasks due to its large VRAM and advanced tensor core capabilities, supporting complex model adjustments.

LoRA Efficiency

Efficient for LoRA techniques, providing sufficient memory and processing power to handle parameter-efficient tuning methods.
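
The parameter-efficiency claim is straightforward to quantify: a rank-r LoRA adapter on a d_out x d_in weight adds r x (d_in + d_out) trainable parameters. A sketch comparing adapter size to full fine-tuning for a hypothetical 7B-class model:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one rank-r adapter pair (A: r x d_in, B: d_out x r)."""
    return rank * (d_in + d_out)

# Hypothetical 7B-class model: 32 layers, hidden size 4096,
# adapting the four attention projections (q, k, v, o) per layer.
hidden, layers, rank = 4096, 32, 16
per_layer = 4 * lora_params(hidden, hidden, rank)
total = layers * per_layer

print(f"LoRA trainable params: {total / 1e6:.1f}M")   # ~16.8M
print(f"Fraction of 7B model:  {total / 7e9:.3%}")    # ~0.24%
```

At bf16, that is tens of megabytes of optimizer state instead of tens of gigabytes, which is why a single 94 GB card can fine-tune models far larger than it could train fully.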

Market Authority

Key Strengths

The H100 NVL1 excels at AI training and inference, particularly for large language models and complex neural networks. Its advanced architecture and high memory bandwidth provide significant performance advantages in deep learning and scientific computing tasks, making it a preferred choice for demanding workloads.

Limitations

While the H100 NVL1 offers exceptional performance, it comes with a high power requirement and cost, which may be a consideration for some organizations. Availability can be limited due to high demand, and its advanced features may require specific software optimizations to fully leverage its capabilities.

Expert Insight

The H100 represents a strategic leap in AI compute. When comparing cloud providers, consider not just the hourly rate, but also the interconnect bandwidth (InfiniBand/NVLink) and regional availability which can significantly impact total cost of ownership for large-scale training.

Glossary Terms

FP32 TFLOPS
VRAM
TDP
Cores
Information updated daily. Cloud pricing subject to vendor availability.