NVIDIA · Q2 2023

HGX Rubin NVL8

The NVIDIA HGX Rubin NVL8 is a high-performance eight-GPU baseboard designed for datacenter AI training and high-performance computing workloads. It is built on NVIDIA's Rubin architecture, which brings significant advances in compute capability and memory bandwidth. The NVL8 variant is optimized for large-scale deployments, linking eight GPUs in a single NVLink domain for exceptional scalability and efficiency.

HGX Rubin NVL8
VRAM: 192 GB
FP32: 236 TFLOPS
CUDA Cores: 16,896

Provider Marketplace

Cheapest: from $2.00/hour
Best Value: from $2.48/hour
Enterprise Choice: from $50.44/hour

All Cloud Providers

5 options available (estimated on-demand pricing, global availability):

$2.00/hour
$2.48/hour
$2.95/hour
$3.29/hour
$50.44/hour

Compute Performance

FP64: 118 TFLOPS
FP32: 236 TFLOPS
TF32: 944 TFLOPS dense, 1888 TFLOPS sparse
FP16: 1888 TFLOPS dense, 3776 TFLOPS sparse
BF16: 1888 TFLOPS dense, 3776 TFLOPS sparse
FP8: 3776 TFLOPS dense, 7552 TFLOPS sparse
INT8: 3776 TOPS dense, 7552 TOPS sparse
INT4: 7552 TOPS dense, 15104 TOPS sparse
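
These throughput figures only bound what a kernel can reach; whether a given workload is compute- or bandwidth-limited follows from roofline arithmetic. A minimal sketch, using the dense FP16/FP8 numbers above and the 8.0 TB/s bandwidth quoted in the memory section (all spec-sheet estimates, not measurements):

```python
# Back-of-envelope roofline math from the spec sheet's own numbers.
PEAK_BW_BYTES = 8.0e12          # memory bandwidth, bytes/s (memory section)
PEAK_FP16 = 1888e12             # dense FP16 throughput, FLOP/s
PEAK_FP8 = 3776e12              # dense FP8 throughput, FLOP/s

# Arithmetic intensity (FLOP per byte moved) needed to be compute-bound:
ridge_fp16 = PEAK_FP16 / PEAK_BW_BYTES   # ~236 FLOP/byte
ridge_fp8 = PEAK_FP8 / PEAK_BW_BYTES     # ~472 FLOP/byte

def attainable_tflops(intensity_flop_per_byte, peak_flops):
    """Classic roofline: min(peak compute, bandwidth x intensity)."""
    return min(peak_flops, PEAK_BW_BYTES * intensity_flop_per_byte) / 1e12

print(f"FP16 ridge point: {ridge_fp16:.0f} FLOP/byte")
# A memory-bound kernel (e.g. an elementwise op at ~0.25 FLOP/byte) gets
# nowhere near peak:
print(f"Elementwise kernel: {attainable_tflops(0.25, PEAK_FP16):.1f} TFLOPS")
```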

Architecture

Microarchitecture: Rubin
Process Node: TSMC 4NP
Die Size:
Transistors:
Compute Units:
Tensor Cores:
RT Cores:
Matrix Engine: Transformer Engine (FP8/FP16/BF16)
Base Clock:
Boost Clock:
Transformer Engine: Yes (Gen 3)
Sparse Acceleration: Supported (2:4 structured sparsity)
Dynamic Precision: Supported (FP4/FP6/FP8/FP16/BF16)
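
The Transformer Engine rows above are exercised in software through NVIDIA's transformer_engine package. A hedged sketch of the FP8 path, assuming the package's PyTorch integration (module and context-manager names follow the upstream library; recipe defaults vary by release):

```python
import torch
import transformer_engine.pytorch as te

# TE drop-in layer; FP8 GEMMs prefer dims divisible by 16.
layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(32, 4096, device="cuda")

# Inside fp8_autocast, supported TE modules run their GEMMs in FP8 with
# scaling factors managed by the library's default recipe.
with te.fp8_autocast(enabled=True):
    y = layer(x)
print(y.shape, y.dtype)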

Memory & VRAM

Memory Type: HBM3e
Total Capacity: 192 GB
Bandwidth: 8.0 TB/s
Bus Width: 6144-bit
HBM Stacks: 6
ECC Support: Yes (inline)
Unified Memory: Yes (CUDA Unified Memory)
Compression:
NUMA Awareness:
Memory Pooling: Yes (NVLink memory pooling)
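
A quick way to reason about the 192 GB pool is bytes-per-parameter arithmetic. A minimal sketch; the model sizes and precisions are illustrative assumptions, not measured figures:

```python
# Weight footprint vs. the 192 GB capacity quoted above.
CAPACITY_GB = 192

def weights_gb(n_params_billion, bytes_per_param):
    """Weight footprint in GB: billions of params x bytes each."""
    return n_params_billion * bytes_per_param

for n_b in (70, 180):
    print(f"{n_b}B params: {weights_gb(n_b, 2):.0f} GB @FP16, "
          f"{weights_gb(n_b, 1):.0f} GB @FP8 "
          f"(of {CAPACITY_GB} GB, before activations and KV cache)")
```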

Connectivity & Scaling

Interconnect: NVLink Switch
Generation: NVLink 5
NVLink Bandwidth: 1.8 TB/s per GPU
PCIe Interface: PCIe Gen 5 x16
CXL Support:
Topology: Fully connected NVLink domain via NVLink Switch
Max GPUs/Node: 8
Scale-Out: Yes
GPUDirect RDMA: Yes
P2P Memory: Yes
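
P2P access and the NVLink fabric can be sanity-checked from PyTorch. A rough probe, assuming a node with at least two visible GPUs and a CUDA build of PyTorch; this is not a calibrated benchmark, and figures vary by platform:

```python
import torch

assert torch.cuda.device_count() >= 2, "needs at least two visible GPUs"
print("P2P 0->1:", torch.cuda.can_device_access_peer(0, 1))

# Timed device-to-device copy as a crude bandwidth probe; NVLink paths should
# land far above PCIe-class numbers.
x = torch.empty(1024**3, dtype=torch.uint8, device="cuda:0")  # 1 GiB
torch.cuda.synchronize()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
y = x.to("cuda:1", non_blocking=True)
end.record()
torch.cuda.synchronize()
ms = start.elapsed_time(end)
print(f"1 GiB copy: {ms:.2f} ms -> {1.0 / (ms / 1e3):.1f} GiB/s")
```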

Virtualization

MIG Support: Supported
MIG Partitions: 7 instances (max)
SR-IOV: Not supported
vGPU Readiness: Supported (NVIDIA vGPU)
K8s Readiness: Certified (NVIDIA GPU Operator)
GPU Sharing: MIG, time-slicing, MPS, vGPU
Virt Efficiency: Near bare-metal (vendor claim)
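
MIG partitioning is typically driven through nvidia-smi. A hedged sketch of that workflow from Python: profile IDs are device- and driver-specific, so they are listed rather than hard-coded, and the commands require admin privileges on an idle GPU:

```python
import subprocess

def run(cmd):
    print("$", cmd)
    subprocess.run(cmd, shell=True, check=True)

run("nvidia-smi -i 0 -mig 1")   # enable MIG mode on GPU 0 (may require a reset)
run("nvidia-smi mig -lgip")     # list the GPU-instance profiles this device offers
# Create instances from a profile ID reported above (placeholder <ID>);
# -C also creates the matching compute instances:
# run("nvidia-smi mig -cgi <ID> -C")
```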

Power & Efficiency

TDP: 1200-1400 W per GPU (estimated for each NVL GPU in the HGX Rubin NVL8 configuration)
Peak Power: 11,000-12,000 W (system-level, for eight GPUs plus supporting components)
Idle Power: 1800-2200 W (system-level, estimated)
Perf/Watt: 2.5-3.5 TFLOPS FP8 per W (system-level, estimated)
PSU Required: N/A
Connectors: Busbar (rack-level DC distribution; no standard PCIe/12VHPWR connectors)
Thermal Limits: 35-40°C inlet air (typical data center spec; liquid cooling recommended for full performance)
Efficiency: N/A
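
The perf/watt band can be cross-checked against the compute table's own numbers. A one-liner sketch using the spec sheet's estimates only:

```python
# Dense FP8 throughput (compute table) over the estimated per-GPU TDP band.
dense_fp8_tflops = 3776
for tdp_w in (1200, 1400):
    print(f"{dense_fp8_tflops / tdp_w:.2f} TFLOPS FP8 per W at {tdp_w} W")
# Prints ~3.15 and ~2.70, consistent with the 2.5-3.5 band quoted above.
```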

Physical Design

Form Factor: HGX baseboard (8x NVIDIA Rubin NVL GPUs, SXM modules)
FHFL: N/A
Slot Width: N/A
Dimensions: 445 mm x 410 mm x 70 mm
Weight: 18-22 kg
Cooling: Direct liquid cooling (DLC)
Rack Density: Designed for high-density multi-GPU server integration (8 GPUs per 4U server)

Thermals & Cooling

Airflow: N/A (direct-to-chip liquid cooling)
Temp Range: 0°C to 45°C
Throttling: Thermal clock reduction at the Tjunction limit
Noise Level: N/A (passive module)
Liquid Cooling: Direct-to-chip liquid cooling
DC Heat: High (rack-scale deployment recommended)

Software Ecosystem

CUDA: CUDA 12.x supported
ROCm: Not supported
oneAPI: Not supported
PyTorch: Officially supported
TensorFlow: Officially supported
JAX: Supported via CUDA backend
HuggingFace: Optimized (CUDA kernels available)
Triton Server: Supported
Docker: Official container images available
Compiler Stack: Mature CUDA compiler stack
Kernel Optim: Standard driver-based support
Driver Stability: Enterprise-grade stability
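
A short environment check covers most of the stack above. A sketch assuming a CUDA build of PyTorch:

```python
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))
    print("BF16 supported:", torch.cuda.is_bf16_supported())
    print("Visible GPUs:", torch.cuda.device_count())
```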

Server & Deployment

OEM Availability: Tier-1 OEMs (Dell, HPE, Supermicro)
Preconfigured: 4U 8-GPU systems
DGX/HGX: Core of an HGX system
Rack-Scale: NVLink Switch System, InfiniBand scale-out
Edge Deploy: Not suitable for edge deployment due to high TDP
Ref Architectures: NVIDIA MGX, SuperPOD

System Compatibility

CPU Pairing: Integrated with platform CPU (HGX/DGX architecture)
NUMA: Platform-specific NUMA topology; memory locality is critical for optimal performance
Required PCIe: N/A (SXM/OAM)
Motherboard: Platform-specific (HGX/NVL baseboard)
Rack Power: Contact vendor for rack power planning
BIOS Limits:
CXL Ready: Not supported
OS Compat: RHEL and Ubuntu LTS supported; Windows Server support not published

Benchmarks & Throughput

Structured Sparsity

Supported (up to 2x vs dense)
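
The 2:4 pattern means two non-zeros are kept in every group of four consecutive weights. A numpy sketch of the pruning step only; the speedup itself comes from the GPU's sparse tensor-core path, not from this code:

```python
import numpy as np

def prune_2_4(w):
    """Zero the two smallest-magnitude weights in each group of four."""
    w = w.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(w), axis=1)[:, :2]   # two smallest per group
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(-1)

w = np.random.randn(16).astype(np.float32)
print(prune_2_4(w))   # exactly two non-zeros per consecutive group of four
```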

Transformer Throughput

Supported (Transformer Engine)

Multi-GPU Scalability

Scaling Efficiency

Single GPU: The HGX Rubin NVL8 is optimized for high efficiency, with NVLink providing direct high-bandwidth connections between GPUs.
2-GPU: Near-linear scaling over the NVLink fabric, allowing efficient data transfer between the two GPUs.
4-GPU: Continues near-linear scaling with NVSwitch, minimizing latency and maximizing bandwidth across GPUs.
8-GPU: Maintains near-linear scaling across all 8 GPUs, leveraging NVSwitch for optimal interconnect performance.
64+ GPU: Scalability is affected by InfiniBand/Ethernet overhead, but multi-rail networking and GPUDirect RDMA help mitigate latency.

Scaling Characteristics

Cross-Node Latency: Low latency via GPUDirect RDMA and InfiniBand, ensuring efficient cross-node communication.
Network Bottlenecks: Potential bottlenecks include VRAM pressure and host-to-device transfer limits, though NVLink mitigates many interconnect issues.
Parallelism: Supports data, model, pipeline, and tensor parallelism; compatible with frameworks like DeepSpeed and Megatron.
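
The parallelism modes above all start from the same process-group setup. A minimal DDP skeleton for the 8-GPU NVLink domain, assuming a CUDA build of PyTorch and a launch via torchrun --nproc_per_node=8; the model and data are placeholders:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK and the rendezvous env vars; NCCL then picks
    # NVLink/NVSwitch paths inside the node automatically.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda()   # placeholder model
    model = DDP(model)                           # gradient sync over NVLink

    x = torch.randn(32, 4096, device="cuda")     # placeholder batch
    loss = model(x).square().mean()
    loss.backward()                              # all-reduce fires here
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```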

Workload Readiness

LLM Training

Built on the Rubin architecture, the HGX Rubin NVL8 is well suited to training large language models of 400B+ parameters in multi-node setups, thanks to its high VRAM capacity and advanced interconnects.
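
The multi-node claim follows from simple memory arithmetic. A sketch using the common ~16 bytes/parameter rule of thumb for mixed-precision Adam (weights, gradients, and optimizer state, before activations); the figures below are that rule of thumb, not vendor numbers:

```python
params = 400e9
bytes_per_param = 16                 # bf16 weights+grads + fp32 Adam state
state_gb = params * bytes_per_param / 1e9
per_system_gb = 8 * 192              # one NVL8 system's pooled VRAM

print(f"Weights + optimizer state: ~{state_gb / 1e3:.1f} TB")
print(f"NVL8 systems needed (state only): ~{state_gb / per_system_gb:.1f}")
```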

LLM Inference

Optimized for high-throughput inference with advanced tensor cores, capable of sustaining high token-per-second rates and providing ample KV-cache headroom for large models.
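
KV-cache headroom is easy to estimate from model shape. A sketch with a hypothetical 70B-class configuration (80 layers, 8 KV heads via GQA, head dimension 128); none of these shapes come from the spec sheet:

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per=2):
    """KV-cache size in GB; the factor of 2 covers both K and V."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per / 1e9

# ~86 GB at FP16 for batch 8 at 32k context, against the 192 GB pool.
print(kv_cache_gb(80, 8, 128, seq_len=32_768, batch=8))
```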

Vision Training

Highly effective for vision training, leveraging advanced tensor cores and large VRAM to handle complex models and datasets efficiently.

Diffusion Models

Well-suited for training and inference of diffusion models, benefiting from high computational throughput and memory bandwidth.

Multimodal AI

Excellent for multimodal AI tasks, combining high computational power and memory capacity to process diverse data types simultaneously.

Reinforcement Learning

Ideal for reinforcement learning workloads, offering fast computation and high memory bandwidth to support complex simulations and model updates.

HPC / Simulation

Strong performance in HPC simulations with robust FP64 support, making it suitable for scientific and engineering simulations requiring high precision.

Scientific Computing

Highly capable for scientific computing, with excellent performance in both FP32 and FP64 operations across a wide range of research workloads.

Edge Inference

Not suitable for edge inference due to its high TDP and large form factor; better suited to data center environments.

Real-Time Serving

Capable of real-time AI serving with low latency and high throughput, leveraging its advanced architecture and tensor cores.

Fine-Tuning

Highly efficient for full fine-tuning of large models due to its substantial VRAM and computational power.

LoRA Efficiency

Efficient for LoRA fine-tuning, providing sufficient resources for parameter-efficient training methods.
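
The parameter-efficiency claim follows from LoRA's adapter arithmetic: a rank-r update to a d_out x d_in matrix adds only r * (d_in + d_out) parameters. A sketch with illustrative shapes:

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters added by one rank-r LoRA adapter."""
    return r * (d_in + d_out)

d = 8192                         # hypothetical hidden size
full = d * d                     # one dense projection's parameter count
adapter = lora_params(d, d, r=16)
print(f"adapter/full = {adapter / full:.4%}")   # a fraction of a percent
```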

Market Authority

Key Strengths

This platform excels at large-scale AI training and inference, offering superior performance in deep learning frameworks. Its architecture is optimized for high throughput and low latency, making it ideal for complex simulations and scientific computing. The NVL8's scalability and efficiency make it a standout choice for demanding datacenter applications.

Limitations

While the HGX Rubin NVL8 offers exceptional performance, its high power requirements and need for advanced cooling solutions can be a trade-off for some deployments. Additionally, its availability may be limited due to high demand and production constraints, potentially impacting procurement timelines for large-scale projects.

Expert Insight

The HGX Rubin NVL8 represents a strategic leap in AI compute. When comparing cloud providers, consider not just the hourly rate but also the interconnect bandwidth (InfiniBand/NVLink) and regional availability, which can significantly impact total cost of ownership for large-scale training.
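
As a worked example of that point, here is a simple cost comparison over the marketplace rates listed above. The run length is hypothetical, and hourly price alone ignores the throughput differences that interconnect and region introduce:

```python
# Marketplace rates from the pricing section above, in $/GPU-hour.
rates = {"cheapest": 2.00, "best value": 2.48, "enterprise": 50.44}
hours = 24 * 14                  # a hypothetical two-week training run

for name, rate in rates.items():
    print(f"{name:>11}: ${rate * hours * 8:,.0f} for 8 GPUs x {hours} h")
```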

Glossary Terms

FP32 TFLOPS
VRAM
TDP
Cores

Information updated daily. Cloud pricing subject to vendor availability.