NVIDIA · March 2022

H100 PCIe

The NVIDIA H100 PCIe is a high-performance GPU designed for data centers, targeting AI, machine learning, and high-performance computing workloads. Built on the Hopper architecture, it offers significant improvements in performance and efficiency over its Ampere predecessors. The PCIe variant is optimized for PCIe-based systems, providing deployment flexibility across a wide range of server configurations.

H100 PCIe
VRAM: 80 GB
FP32: 51 TFLOPS
CUDA Cores: 14,592

Provider Marketplace

Cheapest: starting from $1.99/hour
Best Value: starting from $2.39/hour
Enterprise Choice: starting from $2.39/hour

All Cloud Providers

3 options available

Civo (Cheapest) · On-Demand · Global Availability · $1.99/hour (estimated)
RunPod (Best Value) · On-Demand · Global Availability · $2.39/hour (estimated)
RunPod (Enterprise Choice) · On-Demand · Global Availability · $2.39/hour (estimated)

Compute Performance

FP64: 26 TFLOPS
FP32: 51 TFLOPS
TF32: 51 TFLOPS (Dense), 101 TFLOPS (Sparse)
FP16: 101 TFLOPS (Dense), 202 TFLOPS (Sparse)
BF16: 101 TFLOPS (Dense), 202 TFLOPS (Sparse)
FP8: 202 TFLOPS (Dense), 404 TFLOPS (Sparse)
INT8: 202 TOPS (Dense), 404 TOPS (Sparse)
INT4: 404 TOPS (Dense), 808 TOPS (Sparse)
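Peak figures like these are rarely sustained in practice. A rough way to sanity-check real matmul throughput is to time a large GEMM yourself; the sketch below uses PyTorch on a single GPU, and the matrix size and iteration count are illustrative assumptions, not a standard benchmark.

```python
# Minimal sketch: time a large FP16 matmul to estimate sustained
# Tensor Core throughput on one GPU. Sizes are arbitrary assumptions.
import time
import torch

def measured_tflops(n=8192, iters=50, dtype=torch.float16):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    flops = 2 * n**3 * iters          # 2*n^3 FLOPs per n x n matmul
    return flops / elapsed / 1e12

if __name__ == "__main__":
    print(f"~{measured_tflops():.1f} TFLOPS (FP16 dense, measured)")
```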

Architecture

Microarchitecture: Hopper
Process Node: TSMC 4N
Die Size: 814 mm²
Transistors: 80B
Compute Units: 114 SMs
Tensor Cores: 456 (4th Gen)
RT Cores: None
Matrix Engine: Transformer Engine (FP8/FP16/BF16)
Base Clock: 1035 MHz
Boost Clock: 1770 MHz
Transformer Engine: Yes (Gen 1)
Sparse Acceleration: Supported (2:4 structured sparsity)
Dynamic Precision: Supported (FP8/FP16/BF16/TF32)
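The Transformer Engine mixes FP8 with higher-precision formats on a per-layer basis. A hedged sketch of what that looks like in practice, assuming NVIDIA's `transformer_engine` package is installed; the layer sizes below are illustrative:

```python
# Hedged sketch of the FP8 Transformer Engine path; assumes
# `transformer_engine` is installed (pip install transformer-engine).
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID)  # E4M3 fwd / E5M2 bwd
layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)                      # GEMM runs through FP8 Tensor Cores
```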

Memory & VRAM

Memory Type: HBM2e
Total Capacity: 80 GB
Bandwidth: 2.0 TB/s
Bus Width: 5120-bit
HBM Stacks: 5
ECC Support: Yes (Inline)
Unified Memory: Yes (CUDA Unified Memory)
Compression: Not Published
NUMA Awareness: Not Published
Memory Pooling: Not Supported
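To see what 80 GB buys in practice, a quick back-of-the-envelope check of whether a model fits in VRAM; the parameter count, cache size, and overhead factor below are assumptions, not measurements:

```python
# Rough sketch: does a model fit in the 80 GB of HBM2e?
def fits_in_vram(params_b, bytes_per_param=2, kv_cache_gb=10,
                 overhead=1.2, vram_gb=80):
    weights_gb = params_b * bytes_per_param   # 1e9 params * bytes / 1e9
    total_gb = (weights_gb + kv_cache_gb) * overhead
    return total_gb, total_gb <= vram_gb

total, ok = fits_in_vram(params_b=30)         # 30B params in FP16/BF16
print(f"~{total:.0f} GB needed; fits: {ok}")
# 30B at 2 bytes ~= 60 GB of weights; with cache and overhead (~84 GB)
# it no longer fits, which is why quantization or sharding is common.
```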

Connectivity & Scaling

Interconnect: PCIe
Generation: PCIe Gen 5
IB Bandwidth: 64 GB/s (bi-directional, per GPU)
PCIe Interface: PCIe Gen 5 x16
CXL Support: Not Published
Topology: PCIe switch or CPU root complex
Max GPUs/Node: 4
Scale-Out: Yes (via InfiniBand NDR/XDR or RoCE v2)
GPUDirect RDMA: Yes
P2P Memory: Yes (via PCIe BAR1; limited compared to NVLink)
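Peer-to-peer access over PCIe can be verified from PyTorch; a minimal sketch, assuming a node with at least two visible CUDA devices:

```python
# Minimal sketch: check PCIe peer-to-peer access between two GPUs.
import torch

if torch.cuda.device_count() >= 2:
    p2p = torch.cuda.can_device_access_peer(0, 1)
    print(f"GPU0 -> GPU1 peer access: {p2p}")
    # Over PCIe, P2P copies are capped by link bandwidth
    # (~64 GB/s on Gen 5 x16), well below NVLink-class fabrics.
    src = torch.randn(1 << 20, device="cuda:0")
    dst = src.to("cuda:1")   # routed via P2P when enabled, else via host
```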

Virtualization

MIG Support: Supported
MIG Partitions: 7 instances (max)
SR-IOV: Not Supported
vGPU Readiness: Supported (NVIDIA vGPU)
K8s Readiness: Certified (NVIDIA GPU Operator)
GPU Sharing: MIG, Time-Slicing, MPS, vGPU
Virt Efficiency: Near bare-metal (vendor claim)
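MIG state can be queried programmatically; a hedged sketch using the `nvidia-ml-py` bindings (import name `pynvml`), assuming an NVIDIA driver is present:

```python
# Hedged sketch: query whether MIG mode is enabled on GPU 0.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
current, pending = pynvml.nvmlDeviceGetMigMode(handle)
print(f"MIG enabled: {current == pynvml.NVML_DEVICE_MIG_ENABLE}")
pynvml.nvmlShutdown()
```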

Power & Efficiency

TDP: 350 W
Peak Power: 350-400 W
Idle Power: 40-60 W
Perf / Watt: ≈0.29 TFLOPS/W (101 TFLOPS FP16 dense / 350 W, theoretical peak)
PSU Required: N/A
Connectors: 1x 16-pin PCIe (CEM5) power connector + PCIe slot
Thermal Limits: Max GPU temperature 85°C
Efficiency: N/A
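For transparency, the perf-per-watt arithmetic above in a few lines of Python, using the spec-sheet figures (theoretical peak only):

```python
# Perf/watt from the spec-sheet figures above; theoretical peak only.
tflops_fp16_dense = 101
tdp_watts = 350
print(f"{tflops_fp16_dense / tdp_watts:.2f} TFLOPS/W")   # ~0.29
```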

Physical Design

Form Factor: PCIe card
FHFL: Full Height, Full Length
Slot Width: Dual slot
Dimensions: 267 mm x 112 mm
Weight: 1.5–1.8 kg
Cooling: Passive
Rack Density: Standard PCIe server GPU density

Thermals & Cooling

Airflow: Requires front-to-back chassis airflow (minimum airflow spec not published)
Temp Range: 0°C to 45°C (ambient)
Throttling: Thermal-based clock reduction at Tjunction limit
Noise Level: Not applicable (passive module)
Liquid Cooling: No (air-cooled)
DC Heat: High (rack-scale deployment recommended)
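Because the card is passively cooled, monitoring temperature against the 85°C limit is worth automating; a minimal telemetry sketch, again assuming `nvidia-ml-py`:

```python
# Minimal telemetry sketch: confirm the passive card is getting
# enough chassis airflow by watching temperature and power draw.
import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
power = pynvml.nvmlDeviceGetPowerUsage(h) / 1000          # mW -> W
print(f"GPU temp: {temp} C (throttles near 85 C), power: {power:.0f} W")
pynvml.nvmlShutdown()
```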

Software Ecosystem

CUDA: CUDA 12.x supported
ROCm: Not Supported
oneAPI: Not Supported
PyTorch: Officially supported
TensorFlow: Officially supported
JAX: Supported via CUDA backend
HuggingFace: Optimized (CUDA kernels available)
Triton Server: Supported
Docker: Official container images available
Compiler Stack: Mature CUDA compiler stack
Kernel Optim: Upstream Linux kernel support for NVIDIA datacenter GPUs documented
Driver Stability: Enterprise-grade stability
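A quick sanity check that the software stack sees a Hopper-class device (compute capability 9.0) and opts in to TF32 for FP32 matmuls; a minimal PyTorch sketch:

```python
# Verify the stack sees a Hopper-class GPU and enable TF32 matmuls.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), f"sm_{major}{minor}")
assert (major, minor) >= (9, 0), "expected Hopper (sm_90) or newer"

# Opt in to TF32 for FP32 matmuls (already default for cuDNN convs).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```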

Server & Deployment

OEM Availability: Tier-1 OEMs (Dell, HPE, Supermicro)
Preconfigured: 2U/4U universal GPU servers
DGX/HGX: Not the core of a DGX system; typically used in PCIe configurations
Rack-Scale: InfiniBand scale-out for high-performance computing clusters
Edge Deploy: Limited suitability for edge deployments due to higher TDP
Ref Architectures: NVIDIA MGX, OVX

System Compatibility

CPU Pairing: Dual-socket Intel Xeon Scalable or AMD EPYC 7003/9004 class recommended
NUMA: Standard NUMA behavior
Required PCIe: PCIe Gen 5 x16 recommended
Motherboard: Full-length, double-width PCIe Gen 5 x16 slot required
Rack Power: Contact vendor for rack power planning
BIOS Limits: Resizable BAR and Above 4G decoding required; SR-IOV support Not Published
CXL Ready: No CXL memory expansion
OS Compat: RHEL, Ubuntu LTS, and Windows Server supported
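Whether the card actually negotiated Gen 5 x16 can be confirmed at runtime; a hedged sketch using `nvidia-ml-py` (a Gen 4 slot will report generation 4 here):

```python
# Hedged sketch: confirm the negotiated PCIe link generation and width.
import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(h)
width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(h)
print(f"PCIe link: Gen {gen} x{width}")
pynvml.nvmlShutdown()
```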

Benchmarks & Throughput

Structured Sparsity

Supported (up to 2x vs dense)
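PyTorch (2.1+) exposes this 2:4 pattern through semi-structured sparse tensors; a hedged sketch, with an illustrative mask and sizes chosen to match what the sparse kernels accept (FP16 weights, dimensions in multiples of 64):

```python
# Hedged sketch of 2:4 semi-structured sparsity in PyTorch 2.1+.
import torch
import torch.nn as nn
from torch.sparse import to_sparse_semi_structured

linear = nn.Linear(4096, 4096).half().cuda().eval()
# Zero 2 of every 4 weights to satisfy the 2:4 pattern (toy mask;
# real workflows prune by magnitude instead).
mask = torch.tensor([1, 1, 0, 0], device="cuda").tile(4096, 1024).bool()
linear.weight = nn.Parameter(
    to_sparse_semi_structured(linear.weight.masked_fill(~mask, 0)))

x = torch.rand(4096, 4096, device="cuda").half()
with torch.inference_mode():
    y = linear(x)                 # runs on the sparse Tensor Core path
```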

Transformer Throughput

Supported (Transformer Engine)

Multi-GPU Scalability

Scaling Efficiency

Single GPU: The H100 PCIe offers high single-GPU efficiency, with PCIe Gen 5 providing up to 64 GB/s for data transfer.
2-GPU: Scaling between two GPUs is limited by PCIe lane contention, with a maximum of 64 GB/s bandwidth per GPU.
4-GPU: Four-GPU scaling is constrained by PCIe bandwidth, leading to diminishing returns as more GPUs contend for the same PCIe lanes.
8-GPU: Scaling to eight GPUs is further limited by PCIe bandwidth, with significant contention and reduced efficiency compared to NVLink configurations.
64+ GPU: At scales of 64 GPUs or more, InfiniBand or Ethernet overhead becomes significant, requiring careful network topology design to minimize latency and maximize throughput.

Scaling Characteristics

Cross-Node Latency: Minimized with GPUDirect RDMA support, allowing efficient data transfer across nodes in a distributed training setup.
Network Bottlenecks: The primary bottleneck is the absence of an NVLink fabric (an optional NVLink bridge links only pairs of cards), leading to reliance on PCIe bandwidth and potential VRAM pressure in large models.
Parallelism: Supports data, model, pipeline, and tensor parallelism, and is compatible with frameworks like DeepSpeed and Megatron for efficient distributed training; a minimal data-parallel sketch follows below.
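The sketch below shows the kind of data-parallel setup these frameworks build on, launched with `torchrun --nproc_per_node=4`; sizes and process counts are illustrative. NCCL routes the gradient all-reduce over PCIe P2P within a node and over InfiniBand/RoCE with GPUDirect RDMA across nodes when available.

```python
# Minimal DistributedDataParallel sketch (launch with torchrun).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(rank)

model = torch.nn.Linear(4096, 4096).cuda()
model = DDP(model, device_ids=[rank])

x = torch.randn(32, 4096, device="cuda")
model(x).sum().backward()             # gradient all-reduce happens here
dist.destroy_process_group()
```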

Workload Readiness

LLM Training

The H100 PCIe, based on the Hopper architecture, is well suited to training large language models. With multi-node scalability, models into the hundreds of billions of parameters can be trained when sharded across many GPUs, aided by its 80 GB of VRAM and GPUDirect RDMA scale-out.

LLM Inference

The H100 PCIe is highly efficient for inference tasks, offering excellent token-per-second performance and sufficient KV cache headroom, making it ideal for deploying large-scale language models.

Vision Training

With its advanced Tensor Cores and high memory bandwidth, the H100 PCIe excels in vision training tasks, providing significant speedups for large-scale image classification and object detection models.

Diffusion Models

The H100 PCIe is well-suited for diffusion models, benefiting from its high computational throughput and memory capacity, enabling efficient training and inference of complex generative models.

Multimodal AI

The H100 PCIe's architecture supports multimodal AI tasks effectively, leveraging its Tensor Cores for processing diverse data types and large datasets, making it ideal for applications like image-text models.

Reinforcement Learning

The H100 PCIe offers excellent performance for reinforcement learning, with its high throughput and ability to handle complex simulations and large state spaces efficiently.

HPC / Simulation

The H100 PCIe provides robust support for HPC simulations, with strong FP64 performance, making it suitable for scientific and engineering applications requiring high precision.

Scientific Computing

The H100 PCIe excels in scientific computing tasks, offering high double-precision performance and memory bandwidth, ideal for complex simulations and data analysis.

Edge Inference

The H100 PCIe is less suited for edge inference due to its higher power consumption and larger form factor, making it more appropriate for data center deployments.

Real-Time Serving

The H100 PCIe is highly capable for real-time AI serving, with its low latency and high throughput, making it ideal for deploying AI models in production environments.

Fine-Tuning

The H100 PCIe is highly efficient for full fine-tuning tasks, thanks to its large VRAM and advanced architecture, allowing for the fine-tuning of large models with minimal overhead.

LoRA Efficiency

The H100 PCIe is also efficient for LoRA fine-tuning; its 80 GB of VRAM leaves ample headroom for parameter-efficient training of adapters on large base models.
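A hedged sketch of LoRA fine-tuning with Hugging Face's `peft` library; the base model and hyperparameters below are placeholders, not recommendations:

```python
# Hedged LoRA sketch using the `peft` library; gpt2 is a stand-in model.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                    target_modules=["c_attn"])   # gpt2 attention proj
model = get_peft_model(model, config)
model.print_trainable_parameters()   # tiny fraction of the base weights
```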

Market Authority

MLPerf Ranking

The NVIDIA H100 PCIe is officially listed in MLPerf Training v3.0 and Inference v3.1 results, with performance data published by NVIDIA and partners.

Cloud Adoption

H100-class GPUs were generally available on Google Cloud, Microsoft Azure, and Amazon Web Services (AWS) as of late 2023, though the hyperscalers primarily deploy the SXM variant; the PCIe card is more common among specialist GPU clouds.

Supercomputer Usage

H100-class GPUs power supercomputers such as NVIDIA Eos, though such flagship systems use the SXM variant; the PCIe card appears mainly in enterprise and academic clusters rather than in flagship supercomputers.

Research Citations

The H100 PCIe is cited in peer-reviewed papers and arXiv preprints from 2023 onward, particularly in large language model and HPC research.

Community Benchmarks

Community benchmarks for H100 PCIe are available on sites like MLPerf, Hugging Face forums, and independent blogs, though most public benchmarks focus on the SXM variant.

GitHub Support

Official support for H100 PCIe is present in major deep learning frameworks (PyTorch, TensorFlow, JAX) and libraries (NVIDIA cuDNN, CUDA 12.x), with explicit references in GitHub repositories and release notes.

Enterprise Cases

NVIDIA and partners (e.g., Dell, HPE) have published case studies highlighting H100 PCIe deployments in enterprise AI and HPC workloads.

Key Strengths

The H100 PCIe excels at AI training and inference, offering substantial performance gains in deep learning workloads due to its advanced tensor cores and high memory bandwidth. It is also well-suited for scientific simulations and data analytics, providing a versatile solution for complex computational tasks.

Limitations

While the H100 PCIe offers excellent performance, it lacks the NVLink/NVSwitch fabric of the SXM variant (an optional NVLink bridge connects only pairs of cards), which can be a limitation for applications requiring high-speed inter-GPU communication. Additionally, its power consumption may necessitate upgrades to power delivery systems in some data centers, and availability can be constrained by high demand and production limits.

Expert Insight

The H100 represents a strategic leap in AI compute. When comparing cloud providers, consider not just the hourly rate but also the interconnect bandwidth (InfiniBand/NVLink) and regional availability, which can significantly impact total cost of ownership for large-scale training.

Glossary Terms

FP32 TFLOPS
VRAM
TDP
Cores
Information updated daily. Cloud pricing subject to vendor availability.