NVIDIA · March 2022

H100 PCIe

The NVIDIA H100 PCIe is a high-performance GPU designed for data centers, targeting AI, machine learning, and high-performance computing workloads. Built on the Hopper architecture, it offers significant improvements in performance and efficiency over its Ampere predecessors. The PCIe variant is optimized for PCIe-based systems, providing deployment flexibility across a wide range of server configurations.

H100 PCIe
VRAM: 80 GB
FP32: 51 TFLOPS
CUDA Cores: 14,592

Provider Marketplace

Cheapest: starting from $1.99/hour
Best Value: starting from $2.39/hour
Enterprise Choice: starting from $2.39/hour

All Cloud Providers

3 options available

Civo (Cheapest) · On-Demand · Global Availability · $1.99/hour (estimated)
RunPod (Best Value) · On-Demand · Global Availability · $2.39/hour (estimated)
RunPod (Enterprise Choice) · On-Demand · Global Availability · $2.39/hour (estimated)

Compute Performance

FP64: 26 TFLOPS
FP32: 51 TFLOPS
TF32: 51 TFLOPS (Dense), 101 TFLOPS (Sparse)
FP16: 101 TFLOPS (Dense), 202 TFLOPS (Sparse)
BF16: 101 TFLOPS (Dense), 202 TFLOPS (Sparse)
FP8: 202 TFLOPS (Dense), 404 TFLOPS (Sparse)
INT8: 202 TOPS (Dense), 404 TOPS (Sparse)
INT4: 404 TOPS (Dense), 808 TOPS (Sparse)
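Peak figures like these are rarely sustained in practice. A rough way to sanity-check real matmul throughput is to time a large GEMM yourself; the sketch below uses PyTorch on a single GPU, and the matrix size and iteration count are illustrative assumptions, not a standard benchmark.

```python
# Minimal sketch: time a large FP16 matmul to estimate sustained
# Tensor Core throughput on one GPU. Sizes are arbitrary assumptions.
import time
import torch

def measured_tflops(n=8192, iters=50, dtype=torch.float16):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    flops = 2 * n**3 * iters          # 2*n^3 FLOPs per n x n matmul
    return flops / elapsed / 1e12

if __name__ == "__main__":
    print(f"~{measured_tflops():.1f} TFLOPS (FP16 dense, measured)")
```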

Architecture

Microarchitecture: Hopper
Process Node: TSMC 4N
Die Size: 814 mm²
Transistors: 80B
Compute Units: 114 SMs
Tensor Cores: 456 (4th Gen)
RT Cores: None
Matrix Engine: Transformer Engine (FP8/FP16/BF16)
Base Clock: 1035 MHz
Boost Clock: 1770 MHz
Transformer Engine: Yes (Gen 1)
Sparse Acceleration: Supported (2:4 structured sparsity)
Dynamic Precision: Supported (FP8/FP16/BF16/TF32)
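The Transformer Engine mixes FP8 with higher-precision formats on a per-layer basis. A hedged sketch of what that looks like in practice, assuming NVIDIA's `transformer_engine` package is installed; the layer sizes below are illustrative:

```python
# Hedged sketch of the FP8 Transformer Engine path; assumes
# `transformer_engine` is installed (pip install transformer-engine).
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID)  # E4M3 fwd / E5M2 bwd
layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)                      # GEMM runs through FP8 Tensor Cores
```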

Memory & VRAM

Memory Type: HBM2e
Total Capacity: 80 GB
Bandwidth: 2.0 TB/s
Bus Width: 5120-bit
HBM Stacks: 5
ECC Support: Yes (Inline)
Unified Memory: Yes (CUDA Unified Memory)
Compression: Not Published
NUMA Awareness: Not Published
Memory Pooling: Not Supported
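To see what 80 GB buys in practice, a quick back-of-the-envelope check of whether a model fits in VRAM; the parameter count, cache size, and overhead factor below are assumptions, not measurements:

```python
# Rough sketch: does a model fit in the 80 GB of HBM2e?
def fits_in_vram(params_b, bytes_per_param=2, kv_cache_gb=10,
                 overhead=1.2, vram_gb=80):
    weights_gb = params_b * bytes_per_param   # 1e9 params * bytes / 1e9
    total_gb = (weights_gb + kv_cache_gb) * overhead
    return total_gb, total_gb <= vram_gb

total, ok = fits_in_vram(params_b=30)         # 30B params in FP16/BF16
print(f"~{total:.0f} GB needed; fits: {ok}")
# 30B at 2 bytes ~= 60 GB of weights; with cache and overhead (~84 GB)
# it no longer fits, which is why quantization or sharding is common.
```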

Connectivity & Scaling

Interconnect: PCIe
Generation: PCIe Gen 5
IB Bandwidth: 64 GB/s (bi-directional, per GPU)
PCIe Interface: PCIe Gen 5 x16
CXL Support: Not Published
Topology: PCIe switch or CPU root complex
Max GPUs/Node: 4
Scale-Out: Yes (via InfiniBand NDR/XDR or RoCE v2)
GPUDirect RDMA: Yes
P2P Memory: Yes (via PCIe BAR1; limited compared to NVLink)
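Peer-to-peer access over PCIe can be verified from PyTorch; a minimal sketch, assuming a node with at least two visible CUDA devices:

```python
# Minimal sketch: check PCIe peer-to-peer access between two GPUs.
import torch

if torch.cuda.device_count() >= 2:
    p2p = torch.cuda.can_device_access_peer(0, 1)
    print(f"GPU0 -> GPU1 peer access: {p2p}")
    # Over PCIe, P2P copies are capped by link bandwidth
    # (~64 GB/s on Gen 5 x16), well below NVLink-class fabrics.
    src = torch.randn(1 << 20, device="cuda:0")
    dst = src.to("cuda:1")   # routed via P2P when enabled, else via host
```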

Virtualization

MIG Support: Supported
MIG Partitions: 7 instances (max)
SR-IOV: Not Supported
vGPU Readiness: Supported (NVIDIA vGPU)
K8s Readiness: Certified (NVIDIA GPU Operator)
GPU Sharing: MIG, Time-Slicing, MPS, vGPU
Virt Efficiency: Near bare-metal (vendor claim)
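MIG state can be queried programmatically; a hedged sketch using the `nvidia-ml-py` bindings (import name `pynvml`), assuming an NVIDIA driver is present:

```python
# Hedged sketch: query whether MIG mode is enabled on GPU 0.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
current, pending = pynvml.nvmlDeviceGetMigMode(handle)
print(f"MIG enabled: {current == pynvml.NVML_DEVICE_MIG_ENABLE}")
pynvml.nvmlShutdown()
```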

Power & Efficiency

TDP: 350 W
Peak Power: 350-400 W
Idle Power: 40-60 W
Perf / Watt: ≈0.29 TFLOPS/W (101 TFLOPS FP16 dense / 350 W, theoretical peak)
PSU Required: N/A
Connectors: 1x 16-pin PCIe (CEM5) power connector + PCIe slot
Thermal Limits: Max GPU temperature 85°C
Efficiency: N/A
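For transparency, the perf-per-watt arithmetic above in a few lines of Python, using the spec-sheet figures (theoretical peak only):

```python
# Perf/watt from the spec-sheet figures above; theoretical peak only.
tflops_fp16_dense = 101
tdp_watts = 350
print(f"{tflops_fp16_dense / tdp_watts:.2f} TFLOPS/W")   # ~0.29
```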

Physical Design

Form Factor: PCIe card
FHFL: Full Height, Full Length
Slot Width: Dual slot
Dimensions: 267 mm x 112 mm
Weight: 1.5–1.8 kg
Cooling: Passive
Rack Density: Standard PCIe server GPU density

Thermals & Cooling

Airflow: Requires front-to-back chassis airflow (minimum airflow spec not published)
Temp Range: 0°C to 45°C (ambient)
Throttling: Thermal-based clock reduction at Tjunction limit
Noise Level: Not applicable (passive module)
Liquid Cooling: No (air-cooled)
DC Heat: High (rack-scale deployment recommended)
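Because the card is passively cooled, monitoring temperature against the 85°C limit is worth automating; a minimal telemetry sketch, again assuming `nvidia-ml-py`:

```python
# Minimal telemetry sketch: confirm the passive card is getting
# enough chassis airflow by watching temperature and power draw.
import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
power = pynvml.nvmlDeviceGetPowerUsage(h) / 1000          # mW -> W
print(f"GPU temp: {temp} C (throttles near 85 C), power: {power:.0f} W")
pynvml.nvmlShutdown()
```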

Software Ecosystem

CUDA: CUDA 12.x supported
ROCm: Not Supported
oneAPI: Not Supported
PyTorch: Officially supported
TensorFlow: Officially supported
JAX: Supported via CUDA backend
HuggingFace: Optimized (CUDA kernels available)
Triton Server: Supported
Docker: Official container images available
Compiler Stack: Mature CUDA compiler stack
Kernel Optim: Upstream Linux kernel support for NVIDIA datacenter GPUs documented
Driver Stability: Enterprise-grade stability
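A quick sanity check that the software stack sees a Hopper-class device (compute capability 9.0) and opts in to TF32 for FP32 matmuls; a minimal PyTorch sketch:

```python
# Verify the stack sees a Hopper-class GPU and enable TF32 matmuls.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), f"sm_{major}{minor}")
assert (major, minor) >= (9, 0), "expected Hopper (sm_90) or newer"

# Opt in to TF32 for FP32 matmuls (already default for cuDNN convs).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```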

Server & Deployment

OEM Availability: Tier-1 OEMs (Dell, HPE, Supermicro)
Preconfigured: 2U/4U universal GPU servers
DGX/HGX: Not the core of a DGX system; typically used in PCIe configurations
Rack-Scale: InfiniBand scale-out for high-performance computing clusters
Edge Deploy: Limited suitability for edge deployments due to higher TDP
Ref Architectures: NVIDIA MGX, OVX

System Compatibility

CPU Pairing: Dual-socket Intel Xeon Scalable or AMD EPYC 7003/9004 class recommended
NUMA: Standard NUMA behavior
Required PCIe: PCIe Gen 5 x16 recommended
Motherboard: Full-length, double-width PCIe Gen 5 x16 slot required
Rack Power: Contact vendor for rack power planning
BIOS Limits: Resizable BAR and Above 4G decoding required; SR-IOV support Not Published
CXL Ready: No CXL memory expansion
OS Compat: RHEL, Ubuntu LTS, and Windows Server supported
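Whether the card actually negotiated Gen 5 x16 can be confirmed at runtime; a hedged sketch using `nvidia-ml-py` (a Gen 4 slot will report generation 4 here):

```python
# Hedged sketch: confirm the negotiated PCIe link generation and width.
import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(h)
width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(h)
print(f"PCIe link: Gen {gen} x{width}")
pynvml.nvmlShutdown()
```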

Benchmarks & Throughput

Structured Sparsity

Supported (up to 2x vs dense)
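PyTorch (2.1+) exposes this 2:4 pattern through semi-structured sparse tensors; a hedged sketch, with an illustrative mask and sizes chosen to match what the sparse kernels accept (FP16 weights, dimensions in multiples of 64):

```python
# Hedged sketch of 2:4 semi-structured sparsity in PyTorch 2.1+.
import torch
import torch.nn as nn
from torch.sparse import to_sparse_semi_structured

linear = nn.Linear(4096, 4096).half().cuda().eval()
# Zero 2 of every 4 weights to satisfy the 2:4 pattern (toy mask;
# real workflows prune by magnitude instead).
mask = torch.tensor([1, 1, 0, 0], device="cuda").tile(4096, 1024).bool()
linear.weight = nn.Parameter(
    to_sparse_semi_structured(linear.weight.masked_fill(~mask, 0)))

x = torch.rand(4096, 4096, device="cuda").half()
with torch.inference_mode():
    y = linear(x)                 # runs on the sparse Tensor Core path
```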

Transformer Throughput

Supported (Transformer Engine)

Multi-GPU Scalability

Scaling Efficiency

Single GPU: The H100 PCIe offers high single-GPU efficiency, with PCIe Gen 5 providing up to 64 GB/s for data transfer.
2-GPU: Scaling between two GPUs is limited by PCIe lane contention, with a maximum of 64 GB/s bandwidth per GPU.
4-GPU: Four-GPU scaling is constrained by PCIe bandwidth, leading to diminishing returns as more GPUs contend for the same PCIe lanes.
8-GPU: Scaling to eight GPUs is further limited by PCIe bandwidth, with significant contention and reduced efficiency compared to NVLink configurations.
64+ GPU: At scales of 64 GPUs or more, InfiniBand or Ethernet overhead becomes significant, requiring careful network topology design to minimize latency and maximize throughput.

Scaling Characteristics

Cross-Node Latency: Minimized with GPUDirect RDMA support, allowing efficient data transfer across nodes in a distributed training setup.
Network Bottlenecks: The primary bottleneck is the absence of an NVLink fabric (an optional NVLink bridge links only pairs of cards), leading to reliance on PCIe bandwidth and potential VRAM pressure in large models.
Parallelism: Supports data, model, pipeline, and tensor parallelism, and is compatible with frameworks like DeepSpeed and Megatron for efficient distributed training; a minimal data-parallel sketch follows below.
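The sketch below shows the kind of data-parallel setup these frameworks build on, launched with `torchrun --nproc_per_node=4`; sizes and process counts are illustrative. NCCL routes the gradient all-reduce over PCIe P2P within a node and over InfiniBand/RoCE with GPUDirect RDMA across nodes when available.

```python
# Minimal DistributedDataParallel sketch (launch with torchrun).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(rank)

model = torch.nn.Linear(4096, 4096).cuda()
model = DDP(model, device_ids=[rank])

x = torch.randn(32, 4096, device="cuda")
model(x).sum().backward()             # gradient all-reduce happens here
dist.destroy_process_group()
```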

Workload Readiness

LLM Training

The H100 PCIe, based on the Hopper architecture, is well suited to training large language models. With multi-node scalability, models into the hundreds of billions of parameters can be trained when sharded across many GPUs, aided by its 80 GB of VRAM and GPUDirect RDMA scale-out.

LLM Inference

The H100 PCIe is highly efficient for inference tasks, offering excellent token-per-second performance and sufficient KV cache headroom, making it ideal for deploying large-scale language models.

Vision Training

With its advanced Tensor Cores and high memory bandwidth, the H100 PCIe excels in vision training tasks, providing significant speedups for large-scale image classification and object detection models.

Diffusion Models

The H100 PCIe is well-suited for diffusion models, benefiting from its high computational throughput and memory capacity, enabling efficient training and inference of complex generative models.

Multimodal AI

The H100 PCIe's architecture supports multimodal AI tasks effectively, leveraging its Tensor Cores for processing diverse data types and large datasets, making it ideal for applications like image-text models.

Reinforcement Learning

The H100 PCIe offers excellent performance for reinforcement learning, with its high throughput and ability to handle complex simulations and large state spaces efficiently.

HPC / Simulation

The H100 PCIe provides robust support for HPC simulations, with strong FP64 performance, making it suitable for scientific and engineering applications requiring high precision.

Scientific Computing

The H100 PCIe excels in scientific computing tasks, offering high double-precision performance and memory bandwidth, ideal for complex simulations and data analysis.

Edge Inference

The H100 PCIe is less suited for edge inference due to its higher power consumption and larger form factor, making it more appropriate for data center deployments.

Real-Time Serving

The H100 PCIe is highly capable for real-time AI serving, with its low latency and high throughput, making it ideal for deploying AI models in production environments.

Fine-Tuning

The H100 PCIe is highly efficient for full fine-tuning tasks, thanks to its large VRAM and advanced architecture, allowing for the fine-tuning of large models with minimal overhead.

LoRA Efficiency

The H100 PCIe is also efficient for LoRA fine-tuning; its 80 GB of VRAM leaves ample headroom for parameter-efficient training of adapters on large base models.
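A hedged sketch of LoRA fine-tuning with Hugging Face's `peft` library; the base model and hyperparameters below are placeholders, not recommendations:

```python
# Hedged LoRA sketch using the `peft` library; gpt2 is a stand-in model.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                    target_modules=["c_attn"])   # gpt2 attention proj
model = get_peft_model(model, config)
model.print_trainable_parameters()   # tiny fraction of the base weights
```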

Market Authority

MLPerf Ranking

The NVIDIA H100 PCIe is officially listed in MLPerf Training v3.0 and Inference v3.1 results, with performance data published by NVIDIA and partners.

Cloud Adoption

H100-class GPUs were generally available on Google Cloud, Microsoft Azure, and Amazon Web Services (AWS) as of late 2023, though the hyperscalers primarily deploy the SXM variant; the PCIe card is more common among specialist GPU clouds.

Supercomputer Usage

H100-class GPUs power supercomputers such as NVIDIA Eos, though such flagship systems use the SXM variant; the PCIe card appears mainly in enterprise and academic clusters rather than in flagship supercomputers.

Research Citations

The H100 PCIe is cited in peer-reviewed papers and arXiv preprints from 2023 onward, particularly in large language model and HPC research.

Community Benchmarks

Community benchmarks for H100 PCIe are available on sites like MLPerf, Hugging Face forums, and independent blogs, though most public benchmarks focus on the SXM variant.

GitHub Support

Official support for H100 PCIe is present in major deep learning frameworks (PyTorch, TensorFlow, JAX) and libraries (NVIDIA cuDNN, CUDA 12.x), with explicit references in GitHub repositories and release notes.

Enterprise Cases

NVIDIA and partners (e.g., Dell, HPE) have published case studies highlighting H100 PCIe deployments in enterprise AI and HPC workloads.

Key Strengths

The H100 PCIe excels at AI training and inference, offering substantial performance gains in deep learning workloads due to its advanced tensor cores and high memory bandwidth. It is also well-suited for scientific simulations and data analytics, providing a versatile solution for complex computational tasks.

Limitations

While the H100 PCIe offers excellent performance, it lacks the NVLink/NVSwitch fabric of the SXM variant (an optional NVLink bridge connects only pairs of cards), which can be a limitation for applications requiring high-speed inter-GPU communication. Additionally, its power consumption may necessitate upgrades to power delivery systems in some data centers, and availability can be constrained by high demand and production limits.

Expert Insight

The H100 represents a strategic leap in AI compute. When comparing cloud providers, consider not just the hourly rate but also the interconnect bandwidth (InfiniBand/NVLink) and regional availability, which can significantly impact total cost of ownership for large-scale training.

Glossary Terms

FP32 TFLOPS
VRAM
TDP
Cores
Information updated daily. Cloud pricing subject to vendor availability.