NVIDIA H100 NVL1 · March 2023

The NVIDIA H100 NVL1 is a high-performance GPU designed for data centers, targeting AI training and inference workloads. Built on the Hopper architecture, it offers significant improvements in performance and efficiency over its predecessors. Optimized for large-scale AI models and high-performance computing tasks, it is a top choice for enterprises and research institutions.

H100 NVL1

VRAM: 94 GB
FP32: 67 TFLOPS
CUDA Cores: 16,896

Provider Marketplace

Cheapest: from $1.91/hour
Best Value: from $1.91/hour
Enterprise Choice: from $3.07/hour

All Cloud Providers

2 options available

TensorDock (Cheapest): On-Demand, Global Availability, estimated $1.91/hour
RunPod: On-Demand, Global Availability, estimated $3.07/hour
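
For budgeting, the hourly rates above translate directly into job-level costs. A minimal sketch, assuming the listed on-demand rates and a hypothetical job size (no spot pricing, storage, or egress fees):

```python
# Estimate job cost from the on-demand rates listed above.
# Rates are the marketplace figures; job parameters are hypothetical.
RATES_PER_HOUR = {
    "TensorDock": 1.91,
    "RunPod": 3.07,
}

def job_cost(provider: str, gpus: int, hours: float) -> float:
    """Total on-demand cost for a multi-GPU job, ignoring storage/egress."""
    return RATES_PER_HOUR[provider] * gpus * hours

# Example: a 7-day, 8-GPU fine-tuning run.
for provider in RATES_PER_HOUR:
    print(f"{provider}: ${job_cost(provider, gpus=8, hours=24 * 7):,.2f}")
```

At these rates, the same week-long 8-GPU run comes to roughly $2,566 on TensorDock versus $4,126 on RunPod, so the per-hour delta compounds quickly at scale.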

Compute Performance

FP64: 34 TFLOPS
FP32: 67 TFLOPS
TF32: 133 TFLOPS (dense), 266 TFLOPS (sparse)
FP16: 133 TFLOPS (dense), 266 TFLOPS (sparse)
BF16: 133 TFLOPS (dense), 266 TFLOPS (sparse)
FP8: 266 TFLOPS (dense), 532 TFLOPS (sparse)
INT8: 266 TOPS (dense), 532 TOPS (sparse)
INT4: 532 TOPS (dense), 1064 TOPS (sparse)

Architecture

Microarchitecture: Hopper
Process Node: TSMC 4N
Die Size: 814 mm²
Transistors: 80B
Compute Units: 132 SMs
Tensor Cores: 528 (4th Gen)
RT Cores: None
Matrix Engine: Transformer Engine (FP8/FP16/BF16)
Base Clock: Not specified
Boost Clock: Not specified
Transformer Engine: Yes (Gen 1)
Sparse Acceleration: Supported (2:4 structured sparsity)
Dynamic Precision: Supported (FP8/FP16/BF16/TF32)
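
The Transformer Engine row refers to NVIDIA's library of the same name, which casts eligible layers to FP8 at run time. A minimal sketch using the transformer_engine PyTorch bindings; layer sizes are hypothetical and recipe options vary by library version:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# FP8 scaling recipe; DelayedScaling tracks a history of per-tensor amax values.
fp8_recipe = recipe.DelayedScaling(margin=0, amax_history_len=16)

# te.Linear is a drop-in replacement for torch.nn.Linear with FP8 support.
layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

# Matmuls inside this context run on FP8 tensor cores where eligible.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
print(y.shape)  # torch.Size([16, 4096])
```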

Memory & VRAM

Memory Type: HBM3
Total Capacity: 94 GB
Bandwidth: 4.8 TB/s
Bus Width: 6144-bit
HBM Stacks: 6
ECC Support: Yes (Inline)
Unified Memory: Yes (CUDA Unified Memory)
Compression: Not specified
NUMA Awareness: Not specified
Memory Pooling: NVLink memory pooling supported with NVLink Switch System
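
The bandwidth figure pairs with the compute numbers above to give a rough roofline: any kernel whose arithmetic intensity falls below peak FLOPS divided by bandwidth is memory-bound. A back-of-envelope sketch using the spec-sheet figures:

```python
# Roofline ridge points derived from the spec-sheet figures above.
BANDWIDTH = 4.8e12          # bytes/s
PEAKS = {                   # dense peak throughput, FLOP/s
    "FP32": 67e12,
    "FP16 (Tensor)": 133e12,
    "FP8 (Tensor)": 266e12,
}

for name, peak in PEAKS.items():
    # A kernel needs at least this many FLOPs per byte moved
    # to be compute-bound rather than bandwidth-bound.
    ridge = peak / BANDWIDTH
    print(f"{name}: ridge point ~{ridge:.0f} FLOP/byte")
```

By this estimate, FP32 work needs roughly 14 FLOPs per byte to saturate the ALUs; large GEMMs clear that easily, while memory-bound ops such as layernorm and KV-cache reads do not.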

Connectivity & Scaling

Interconnect: NVLink Switch
Generation: NVLink 4
Interconnect Bandwidth: 1.8 TB/s
PCIe Interface: PCIe Gen 5 x16
CXL Support: Not Supported
Topology: NVLink domain via NVLink Switch; all-to-all within NVL1 system
Max GPUs/Node: 2
Scale-Out: Yes (via InfiniBand NDR/RoCE v2)
GPUDirect RDMA: Yes
P2P Memory: Yes

Virtualization

MIG Support: Supported
MIG Partitions: 7 instances (max)
SR-IOV: Not Supported
vGPU Readiness: Supported (NVIDIA vGPU)
K8s Readiness: Certified (NVIDIA GPU Operator)
GPU Sharing: MIG, Time-Slicing, MPS, vGPU
Virt Efficiency: Near bare-metal (vendor claim)
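
MIG partitioning is driven through nvidia-smi. A minimal sketch that enables MIG mode and lists the available instance profiles; run with root privileges, and note that profile IDs differ by driver version, so the creation step uses a placeholder ID:

```python
import subprocess

def run(cmd: list[str]) -> None:
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Enable MIG mode on GPU 0 (takes effect after a GPU reset).
run(["nvidia-smi", "-i", "0", "-mig", "1"])

# List the GPU instance profiles this driver exposes.
run(["nvidia-smi", "mig", "-lgip"])

# Create a GPU instance from a profile ID taken from the listing above
# (ID 19 is a placeholder; -C also creates the matching compute instance).
run(["nvidia-smi", "mig", "-cgi", "19", "-C"])
```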

Power & Efficiency

TDP: 700 W
Peak Power: 700-750 W
Idle Power: 60-80 W
Perf/Watt: Up to 0.45 TFLOPS FP16/W
PSU Required: N/A
Connectors: 2x PCIe 8-pin (per GPU)
Thermal Limits: Max GPU temperature 85°C; requires high-efficiency data center cooling
Efficiency: N/A

Physical Design

Form Factor: Dual SXM5 module (NVLink 1:1 pair, H100 NVL1)
FHFL: N/A
Slot Width: N/A
Dimensions: 143 mm x 78 mm x 32 mm (per SXM5 module)
Weight: 1.8–2.2 kg (per SXM5 module)
Cooling: Passive (requires external server cooling)
Rack Density: Designed for high-density GPU servers with SXM5 sockets; supports NVLink interconnect for multi-GPU scaling

Thermals & Cooling

Airflow: Direct-to-chip liquid cooling
Temp Range: 0°C to 45°C
Throttling: Thermal-based clock reduction at Tjunction limit
Noise Level: Not Applicable (Passive Module)
Liquid Cooling: Direct-to-chip liquid cooling required
DC Heat: High (rack-scale deployment recommended)

Software Ecosystem

CUDA: CUDA 12.x supported
ROCm: Not Supported
oneAPI: Not Supported
PyTorch: Officially supported
TensorFlow: Officially supported
JAX: Supported via CUDA backend
HuggingFace: Optimized (CUDA kernels available)
Triton Server: Supported
Docker: Official container images available
Compiler Stack: Mature CUDA compiler stack
Kernel Optim: Standard driver-based support
Driver Stability: Enterprise-grade stability

Server & Deployment

OEM Availability: Tier-1 OEMs: Dell, HPE, Supermicro
Preconfigured: 4U 8-GPU systems
DGX/HGX: Core of HGX baseboards
Rack-Scale: NVLink Switch System, InfiniBand scale-out
Edge Deploy: Not suitable for edge deployment due to high TDP
Ref Architectures: NVIDIA MGX, OVX, SuperPOD

System Compatibility

CPU Pairing: Integrated with platform CPU (HGX/DGX architecture)
NUMA: Standard NUMA behavior
Required PCIe: Not Applicable (SXM/OAM)
Motherboard: Platform-specific (HGX/NVL baseboard)
Rack Power: Contact vendor for rack power planning
BIOS Limits: Not specified
CXL Ready: Not Supported
OS Compat: RHEL and Ubuntu LTS supported; Windows Server supported

Benchmarks & Throughput

Structured Sparsity: Supported (up to 2x vs. dense)

Transformer Throughput: Supported (Transformer Engine)
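
The 2x sparse figures assume weights pruned to the 2:4 pattern (at most two nonzeros in every group of four), which the tensor cores can then skip in hardware. A minimal NumPy sketch of the pruning step, keeping the two largest-magnitude weights per group:

```python
import numpy as np

def prune_2_4(w: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude weights in each group of 4.

    Assumes w.size is divisible by 4.
    """
    flat = w.reshape(-1, 4)
    # Indices of the 2 smallest |w| per group of 4.
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]
    out = flat.copy()
    np.put_along_axis(out, drop, 0.0, axis=1)
    return out.reshape(w.shape)

w = np.random.randn(4, 8).astype(np.float32)
sparse_w = prune_2_4(w)
assert (np.count_nonzero(sparse_w.reshape(-1, 4), axis=1) <= 2).all()
```

In practice, magnitude pruning is followed by fine-tuning to recover accuracy, and the speedup only materializes through sparse-aware kernels (e.g. cuSPARSELt-backed paths).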

Multi-GPU Scalability

Scaling Efficiency

Single GPU: The H100 NVL1 offers high efficiency with its advanced architecture, maximizing throughput for single-GPU workloads.
2-GPU: With NVLink bridge support, two GPUs can achieve near-linear scaling due to high interconnect bandwidth.
4-GPU: Scaling remains efficient with four GPUs, leveraging NVLink bridges to minimize latency and maximize bandwidth.
8-GPU: Scaling is near-linear up to 8 GPUs using NVLink bridges, ensuring high-bandwidth, low-latency communication.
64+ GPU: At this scale, InfiniBand or RoCE v2 overhead becomes significant, requiring careful network topology design to minimize latency and maximize throughput.

Scaling Characteristics

Cross-Node Latency: Supports GPUDirect RDMA, allowing for low-latency, high-bandwidth communication across nodes, essential for distributed training.
Network Bottlenecks: Potential bottlenecks include host-to-device PCIe bandwidth limitations and VRAM pressure in memory-intensive workloads.
Parallelism: Supports data, model, pipeline, and tensor parallelism, compatible with frameworks like DeepSpeed and Megatron for efficient distributed training.
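
For the data-parallel case, the standard pattern on NVLink/InfiniBand systems is NCCL-backed DistributedDataParallel. A minimal sketch, assuming launch via torchrun so RANK/LOCAL_RANK/WORLD_SIZE are set in the environment; the model is a stand-in:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# NCCL rides NVLink within a node and InfiniBand/RoCE across nodes.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(4096, 4096).to(f"cuda:{local_rank}")  # stand-in model
model = DDP(model, device_ids=[local_rank])

x = torch.randn(32, 4096, device=f"cuda:{local_rank}")
loss = model(x).sum()
loss.backward()  # gradients are all-reduced across ranks here
dist.destroy_process_group()
```

Launched as, e.g., `torchrun --nproc_per_node=8 train.py` on each node; NCCL then picks the fastest available transport automatically.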

Workload Readiness

LLM Training

The H100 NVL1, based on the Hopper architecture, is highly suitable for training large language models, supporting up to 400B+ parameter models in a multi-node setup due to its high VRAM capacity and advanced interconnects.
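
The multi-node claim follows from simple memory arithmetic: with mixed-precision Adam, each parameter typically costs about 16 bytes of state (2 for bf16 weights, 2 for gradients, 12 for fp32 master weights and optimizer moments), before activations. A back-of-envelope sketch for a hypothetical 400B-parameter run with ZeRO-style sharding:

```python
PARAMS = 400e9                 # hypothetical model size
BYTES_PER_PARAM = 2 + 2 + 12   # bf16 weights + bf16 grads + fp32 Adam states
VRAM_PER_GPU = 94e9            # bytes, from the spec above

total_state = PARAMS * BYTES_PER_PARAM
print(f"Total training state: {total_state / 1e12:.1f} TB")

# ZeRO-3-style sharding spreads all of this state evenly across GPUs.
for n_gpus in (64, 128, 256):
    per_gpu = total_state / n_gpus
    print(f"{n_gpus} GPUs: {per_gpu / 1e9:.0f} GB/GPU "
          f"({per_gpu / VRAM_PER_GPU:.0%} of 94 GB, before activations)")
```

Even with full sharding, the state alone overflows 64 GPUs once activations are counted, which is why 400B-class training is inherently a multi-node exercise.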

LLM Inference

Optimized for high throughput inference with 4th-gen Tensor cores, providing excellent token-per-second performance and sufficient KV cache for large models.
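
The KV-cache claim is easy to sanity-check: cache size is 2 (K and V) x layers x KV heads x head dim x bytes per element x tokens x batch. A sketch with a hypothetical 70B-class configuration (80 layers, 8 grouped-query KV heads of dimension 128, FP16 cache):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Size of the K/V cache for one model instance."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len * batch

# Hypothetical 70B-class config with grouped-query attention.
cfg = dict(layers=80, kv_heads=8, head_dim=128)

for batch in (1, 32, 128):
    gb = kv_cache_bytes(**cfg, seq_len=8192, batch=batch) / 1e9
    print(f"batch {batch:3d} @ 8k context: {gb:6.1f} GB of KV cache")
```

On a 94 GB card the model weights compete for the same space, so the extra capacity over an 80 GB H100 largely buys KV-cache headroom for longer contexts and larger batches.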

Vision Training

Ideal for vision model training with its high computational throughput and efficient tensor operations, supporting large batch sizes and complex models.

Diffusion Models

Highly capable for diffusion models due to its large memory bandwidth and tensor core optimizations, allowing for efficient training and inference.

Multimodal AI

Well-suited for multimodal AI tasks, leveraging its advanced architecture to handle diverse data types and complex model architectures efficiently.

Reinforcement Learning

Excellent for reinforcement learning with its high parallel processing capabilities and fast memory access, enabling rapid environment simulation and model updates.

HPC / Simulation

Strong performance in HPC simulations with robust FP64 support, making it suitable for scientific and engineering applications requiring high precision.

Scientific Computing

Highly effective for scientific computing tasks, offering substantial computational power and memory bandwidth for data-intensive workloads.

Edge Inference

Less suitable for edge inference due to high power consumption and large form factor, better suited for data center deployments.

Real-Time Serving

Capable of real-time AI serving with low latency and high throughput, ideal for demanding AI applications in a data center environment.

Fine-Tuning

Highly efficient for full fine-tuning tasks due to its large VRAM and advanced tensor core capabilities, supporting complex model adjustments.

LoRA Efficiency

Efficient for LoRA techniques, providing sufficient memory and processing power to handle parameter-efficient tuning methods.
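
The parameter-efficiency claim is straightforward to quantify: a rank-r LoRA adapter on a d_out x d_in weight adds r x (d_in + d_out) trainable parameters. A sketch comparing adapter size to full fine-tuning for a hypothetical 7B-class model:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one rank-r adapter pair (A: r x d_in, B: d_out x r)."""
    return rank * (d_in + d_out)

# Hypothetical 7B-class model: 32 layers, hidden size 4096,
# adapting the four attention projections (q, k, v, o) per layer.
hidden, layers, rank = 4096, 32, 16
per_layer = 4 * lora_params(hidden, hidden, rank)
total = layers * per_layer

print(f"LoRA trainable params: {total / 1e6:.1f}M")   # ~16.8M
print(f"Fraction of 7B model:  {total / 7e9:.3%}")    # ~0.24%
```

At bf16, that is tens of megabytes of optimizer state instead of tens of gigabytes, which is why a single 94 GB card can fine-tune models far larger than it could train fully.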

Market Authority

Key Strengths

The H100 NVL1 excels at AI training and inference, particularly for large language models and complex neural networks. Its advanced architecture and high memory bandwidth provide significant performance advantages in deep learning and scientific computing tasks, making it a preferred choice for demanding workloads.

Limitations

While the H100 NVL1 offers exceptional performance, it comes with a high power requirement and cost, which may be a consideration for some organizations. Availability can be limited due to high demand, and its advanced features may require specific software optimizations to fully leverage its capabilities.

Expert Insight

The H100 represents a strategic leap in AI compute. When comparing cloud providers, consider not just the hourly rate, but also the interconnect bandwidth (InfiniBand/NVLink) and regional availability which can significantly impact total cost of ownership for large-scale training.

Glossary Terms

FP32 TFLOPS
VRAM
TDP
Cores
Information updated daily. Cloud pricing subject to vendor availability.