NVIDIA · Q2 2023

HGX Rubin NVL8

The NVIDIA HGX Rubin NVL8 is a high-performance eight-GPU baseboard designed for datacenter AI training and high-performance computing workloads. It is built on NVIDIA's Rubin architecture, which brings significant advances in compute capability and memory bandwidth. The NVL8 variant is optimized for large-scale deployments, linking eight GPUs in a single NVLink domain for exceptional scalability and efficiency.

HGX Rubin NVL8
VRAM: 192 GB
FP32: 236 TFLOPS
CUDA Cores: 16,896

Provider Marketplace

Cheapest: from $2.00/hour
Best Value: from $2.48/hour
Enterprise Choice: from $50.44/hour

All Cloud Providers

5 options available (estimated on-demand pricing, global availability):

$2.00/hour
$2.48/hour
$2.95/hour
$3.29/hour
$50.44/hour

Compute Performance

FP64: 118 TFLOPS
FP32: 236 TFLOPS
TF32: 944 TFLOPS dense, 1888 TFLOPS sparse
FP16: 1888 TFLOPS dense, 3776 TFLOPS sparse
BF16: 1888 TFLOPS dense, 3776 TFLOPS sparse
FP8: 3776 TFLOPS dense, 7552 TFLOPS sparse
INT8: 3776 TOPS dense, 7552 TOPS sparse
INT4: 7552 TOPS dense, 15104 TOPS sparse
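
These throughput figures only bound what a kernel can reach; whether a given workload is compute- or bandwidth-limited follows from roofline arithmetic. A minimal sketch, using the dense FP16/FP8 numbers above and the 8.0 TB/s bandwidth quoted in the memory section (all spec-sheet estimates, not measurements):

```python
# Back-of-envelope roofline math from the spec sheet's own numbers.
PEAK_BW_BYTES = 8.0e12          # memory bandwidth, bytes/s (memory section)
PEAK_FP16 = 1888e12             # dense FP16 throughput, FLOP/s
PEAK_FP8 = 3776e12              # dense FP8 throughput, FLOP/s

# Arithmetic intensity (FLOP per byte moved) needed to be compute-bound:
ridge_fp16 = PEAK_FP16 / PEAK_BW_BYTES   # ~236 FLOP/byte
ridge_fp8 = PEAK_FP8 / PEAK_BW_BYTES     # ~472 FLOP/byte

def attainable_tflops(intensity_flop_per_byte, peak_flops):
    """Classic roofline: min(peak compute, bandwidth x intensity)."""
    return min(peak_flops, PEAK_BW_BYTES * intensity_flop_per_byte) / 1e12

print(f"FP16 ridge point: {ridge_fp16:.0f} FLOP/byte")
# A memory-bound kernel (e.g. an elementwise op at ~0.25 FLOP/byte) gets
# nowhere near peak:
print(f"Elementwise kernel: {attainable_tflops(0.25, PEAK_FP16):.1f} TFLOPS")
```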

Architecture

Microarchitecture: Rubin
Process Node: TSMC 4NP
Die Size:
Transistors:
Compute Units:
Tensor Cores:
RT Cores:
Matrix Engine: Transformer Engine (FP8/FP16/BF16)
Base Clock:
Boost Clock:
Transformer Engine: Yes (Gen 3)
Sparse Acceleration: Supported (2:4 structured sparsity)
Dynamic Precision: Supported (FP4/FP6/FP8/FP16/BF16)
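
The Transformer Engine rows above are exercised in software through NVIDIA's transformer_engine package. A hedged sketch of the FP8 path, assuming the package's PyTorch integration (module and context-manager names follow the upstream library; recipe defaults vary by release):

```python
import torch
import transformer_engine.pytorch as te

# TE drop-in layer; FP8 GEMMs prefer dims divisible by 16.
layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(32, 4096, device="cuda")

# Inside fp8_autocast, supported TE modules run their GEMMs in FP8 with
# scaling factors managed by the library's default recipe.
with te.fp8_autocast(enabled=True):
    y = layer(x)
print(y.shape, y.dtype)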

Memory & VRAM

Memory Type: HBM3e
Total Capacity: 192 GB
Bandwidth: 8.0 TB/s
Bus Width: 6144-bit
HBM Stacks: 6
ECC Support: Yes (inline)
Unified Memory: Yes (CUDA Unified Memory)
Compression:
NUMA Awareness:
Memory Pooling: Yes (NVLink memory pooling)
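
A quick way to reason about the 192 GB pool is bytes-per-parameter arithmetic. A minimal sketch; the model sizes and precisions are illustrative assumptions, not measured figures:

```python
# Weight footprint vs. the 192 GB capacity quoted above.
CAPACITY_GB = 192

def weights_gb(n_params_billion, bytes_per_param):
    """Weight footprint in GB: billions of params x bytes each."""
    return n_params_billion * bytes_per_param

for n_b in (70, 180):
    print(f"{n_b}B params: {weights_gb(n_b, 2):.0f} GB @FP16, "
          f"{weights_gb(n_b, 1):.0f} GB @FP8 "
          f"(of {CAPACITY_GB} GB, before activations and KV cache)")
```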

Connectivity & Scaling

Interconnect: NVLink Switch
Generation: NVLink 5
NVLink Bandwidth: 1.8 TB/s per GPU
PCIe Interface: PCIe Gen 5 x16
CXL Support:
Topology: Fully connected NVLink domain via NVLink Switch
Max GPUs/Node: 8
Scale-Out: Yes
GPUDirect RDMA: Yes
P2P Memory: Yes
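
P2P access and the NVLink fabric can be sanity-checked from PyTorch. A rough probe, assuming a node with at least two visible GPUs and a CUDA build of PyTorch; this is not a calibrated benchmark, and figures vary by platform:

```python
import torch

assert torch.cuda.device_count() >= 2, "needs at least two visible GPUs"
print("P2P 0->1:", torch.cuda.can_device_access_peer(0, 1))

# Timed device-to-device copy as a crude bandwidth probe; NVLink paths should
# land far above PCIe-class numbers.
x = torch.empty(1024**3, dtype=torch.uint8, device="cuda:0")  # 1 GiB
torch.cuda.synchronize()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
y = x.to("cuda:1", non_blocking=True)
end.record()
torch.cuda.synchronize()
ms = start.elapsed_time(end)
print(f"1 GiB copy: {ms:.2f} ms -> {1.0 / (ms / 1e3):.1f} GiB/s")
```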

Virtualization

MIG Support: Supported
MIG Partitions: 7 instances (max)
SR-IOV: Not supported
vGPU Readiness: Supported (NVIDIA vGPU)
K8s Readiness: Certified (NVIDIA GPU Operator)
GPU Sharing: MIG, time-slicing, MPS, vGPU
Virt Efficiency: Near bare-metal (vendor claim)
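
MIG partitioning is typically driven through nvidia-smi. A hedged sketch of that workflow from Python: profile IDs are device- and driver-specific, so they are listed rather than hard-coded, and the commands require admin privileges on an idle GPU:

```python
import subprocess

def run(cmd):
    print("$", cmd)
    subprocess.run(cmd, shell=True, check=True)

run("nvidia-smi -i 0 -mig 1")   # enable MIG mode on GPU 0 (may require a reset)
run("nvidia-smi mig -lgip")     # list the GPU-instance profiles this device offers
# Create instances from a profile ID reported above (placeholder <ID>);
# -C also creates the matching compute instances:
# run("nvidia-smi mig -cgi <ID> -C")
```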

Power & Efficiency

TDP: 1200-1400 W per GPU (estimated for each NVL GPU in the HGX Rubin NVL8 configuration)
Peak Power: 11,000-12,000 W (system-level, for eight GPUs plus supporting components)
Idle Power: 1800-2200 W (system-level, estimated)
Perf/Watt: 2.5-3.5 TFLOPS FP8 per W (system-level, estimated)
PSU Required: N/A
Connectors: Busbar (rack-level DC distribution; no standard PCIe/12VHPWR connectors)
Thermal Limits: 35-40°C inlet air (typical data center spec; liquid cooling recommended for full performance)
Efficiency: N/A
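
The perf/watt band can be cross-checked against the compute table's own numbers. A one-liner sketch using the spec sheet's estimates only:

```python
# Dense FP8 throughput (compute table) over the estimated per-GPU TDP band.
dense_fp8_tflops = 3776
for tdp_w in (1200, 1400):
    print(f"{dense_fp8_tflops / tdp_w:.2f} TFLOPS FP8 per W at {tdp_w} W")
# Prints ~3.15 and ~2.70, consistent with the 2.5-3.5 band quoted above.
```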

Physical Design

Form Factor: HGX baseboard (8x NVIDIA Rubin NVL GPUs, SXM modules)
FHFL: N/A
Slot Width: N/A
Dimensions: 445 mm x 410 mm x 70 mm
Weight: 18-22 kg
Cooling: Direct liquid cooling (DLC)
Rack Density: Designed for high-density multi-GPU server integration (8 GPUs per 4U server)

Thermals & Cooling

Airflow: N/A (direct-to-chip liquid cooling)
Temp Range: 0°C to 45°C
Throttling: Thermal clock reduction at the Tjunction limit
Noise Level: N/A (passive module)
Liquid Cooling: Direct-to-chip liquid cooling
DC Heat: High (rack-scale deployment recommended)

Software Ecosystem

CUDA: CUDA 12.x supported
ROCm: Not supported
oneAPI: Not supported
PyTorch: Officially supported
TensorFlow: Officially supported
JAX: Supported via CUDA backend
HuggingFace: Optimized (CUDA kernels available)
Triton Server: Supported
Docker: Official container images available
Compiler Stack: Mature CUDA compiler stack
Kernel Optim: Standard driver-based support
Driver Stability: Enterprise-grade stability
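
A short environment check covers most of the stack above. A sketch assuming a CUDA build of PyTorch:

```python
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))
    print("BF16 supported:", torch.cuda.is_bf16_supported())
    print("Visible GPUs:", torch.cuda.device_count())
```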

Server & Deployment

OEM Availability: Tier-1 OEMs (Dell, HPE, Supermicro)
Preconfigured: 4U 8-GPU systems
DGX/HGX: Core of an HGX system
Rack-Scale: NVLink Switch System, InfiniBand scale-out
Edge Deploy: Not suitable for edge deployment due to high TDP
Ref Architectures: NVIDIA MGX, SuperPOD

System Compatibility

CPU Pairing: Integrated with platform CPU (HGX/DGX architecture)
NUMA: Platform-specific NUMA topology; memory locality is critical for optimal performance
Required PCIe: N/A (SXM/OAM)
Motherboard: Platform-specific (HGX/NVL baseboard)
Rack Power: Contact vendor for rack power planning
BIOS Limits:
CXL Ready: Not supported
OS Compat: RHEL and Ubuntu LTS supported; Windows Server support not published

Benchmarks & Throughput

Structured Sparsity

Supported (up to 2x vs dense)
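
The 2:4 pattern means two non-zeros are kept in every group of four consecutive weights. A numpy sketch of the pruning step only; the speedup itself comes from the GPU's sparse tensor-core path, not from this code:

```python
import numpy as np

def prune_2_4(w):
    """Zero the two smallest-magnitude weights in each group of four."""
    w = w.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(w), axis=1)[:, :2]   # two smallest per group
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(-1)

w = np.random.randn(16).astype(np.float32)
print(prune_2_4(w))   # exactly two non-zeros per consecutive group of four
```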

Transformer Throughput

Supported (Transformer Engine)

Multi-GPU Scalability

Scaling Efficiency

Single GPU: The HGX Rubin NVL8 is optimized for high efficiency, with NVLink providing direct high-bandwidth connections between GPUs.
2-GPU: Near-linear scaling over the NVLink fabric, allowing efficient data transfer between the two GPUs.
4-GPU: Continues near-linear scaling with NVSwitch, minimizing latency and maximizing bandwidth across GPUs.
8-GPU: Maintains near-linear scaling across all 8 GPUs, leveraging NVSwitch for optimal interconnect performance.
64+ GPU: Scalability is affected by InfiniBand/Ethernet overhead, but multi-rail networking and GPUDirect RDMA help mitigate latency.

Scaling Characteristics

Cross-Node Latency: Low latency via GPUDirect RDMA and InfiniBand, ensuring efficient cross-node communication.
Network Bottlenecks: Potential bottlenecks include VRAM pressure and host-to-device transfer limits, though NVLink mitigates many interconnect issues.
Parallelism: Supports data, model, pipeline, and tensor parallelism; compatible with frameworks like DeepSpeed and Megatron.
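
The parallelism modes above all start from the same process-group setup. A minimal DDP skeleton for the 8-GPU NVLink domain, assuming a CUDA build of PyTorch and a launch via torchrun --nproc_per_node=8; the model and data are placeholders:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK and the rendezvous env vars; NCCL then picks
    # NVLink/NVSwitch paths inside the node automatically.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda()   # placeholder model
    model = DDP(model)                           # gradient sync over NVLink

    x = torch.randn(32, 4096, device="cuda")     # placeholder batch
    loss = model(x).square().mean()
    loss.backward()                              # all-reduce fires here
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```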

Workload Readiness

LLM Training

Built on the Rubin architecture, the HGX Rubin NVL8 is well suited to training large language models of 400B+ parameters in multi-node setups, thanks to its high VRAM capacity and advanced interconnects.
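
The multi-node claim follows from simple memory arithmetic. A sketch using the common ~16 bytes/parameter rule of thumb for mixed-precision Adam (weights, gradients, and optimizer state, before activations); the figures below are that rule of thumb, not vendor numbers:

```python
params = 400e9
bytes_per_param = 16                 # bf16 weights+grads + fp32 Adam state
state_gb = params * bytes_per_param / 1e9
per_system_gb = 8 * 192              # one NVL8 system's pooled VRAM

print(f"Weights + optimizer state: ~{state_gb / 1e3:.1f} TB")
print(f"NVL8 systems needed (state only): ~{state_gb / per_system_gb:.1f}")
```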

LLM Inference

Optimized for high-throughput inference with advanced tensor cores, capable of sustaining high token-per-second rates and providing ample KV-cache headroom for large models.
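
KV-cache headroom is easy to estimate from model shape. A sketch with a hypothetical 70B-class configuration (80 layers, 8 KV heads via GQA, head dimension 128); none of these shapes come from the spec sheet:

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per=2):
    """KV-cache size in GB; the factor of 2 covers both K and V."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per / 1e9

# ~86 GB at FP16 for batch 8 at 32k context, against the 192 GB pool.
print(kv_cache_gb(80, 8, 128, seq_len=32_768, batch=8))
```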

Vision Training

Highly effective for vision training, leveraging advanced tensor cores and large VRAM to handle complex models and datasets efficiently.

Diffusion Models

Well-suited for training and inference of diffusion models, benefiting from high computational throughput and memory bandwidth.

Multimodal AI

Excellent for multimodal AI tasks, combining high computational power and memory capacity to process diverse data types simultaneously.

Reinforcement Learning

Ideal for reinforcement learning workloads, offering fast computation and high memory bandwidth to support complex simulations and model updates.

HPC / Simulation

Strong performance in HPC simulations with robust FP64 support, making it suitable for scientific and engineering simulations requiring high precision.

Scientific Computing

Highly capable for scientific computing, with excellent performance in both FP32 and FP64 operations across a wide range of research workloads.

Edge Inference

Not suitable for edge inference due to its high TDP and large form factor; better suited to data center environments.

Real-Time Serving

Capable of real-time AI serving with low latency and high throughput, leveraging its advanced architecture and tensor cores.

Fine-Tuning

Highly efficient for full fine-tuning of large models due to its substantial VRAM and computational power.

LoRA Efficiency

Efficient for LoRA fine-tuning, providing sufficient resources for parameter-efficient training methods.
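
The parameter-efficiency claim follows from LoRA's adapter arithmetic: a rank-r update to a d_out x d_in matrix adds only r * (d_in + d_out) parameters. A sketch with illustrative shapes:

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters added by one rank-r LoRA adapter."""
    return r * (d_in + d_out)

d = 8192                         # hypothetical hidden size
full = d * d                     # one dense projection's parameter count
adapter = lora_params(d, d, r=16)
print(f"adapter/full = {adapter / full:.4%}")   # a fraction of a percent
```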

Market Authority

Key Strengths

This platform excels at large-scale AI training and inference, offering superior performance in deep learning frameworks. Its architecture is optimized for high throughput and low latency, making it ideal for complex simulations and scientific computing. The NVL8's scalability and efficiency make it a standout choice for demanding datacenter applications.

Limitations

While the HGX Rubin NVL8 offers exceptional performance, its high power requirements and need for advanced cooling solutions can be a trade-off for some deployments. Additionally, its availability may be limited due to high demand and production constraints, potentially impacting procurement timelines for large-scale projects.

Expert Insight

The HGX Rubin NVL8 represents a strategic leap in AI compute. When comparing cloud providers, consider not just the hourly rate but also the interconnect bandwidth (InfiniBand/NVLink) and regional availability, which can significantly impact total cost of ownership for large-scale training.
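
As a worked example of that point, here is a simple cost comparison over the marketplace rates listed above. The run length is hypothetical, and hourly price alone ignores the throughput differences that interconnect and region introduce:

```python
# Marketplace rates from the pricing section above, in $/GPU-hour.
rates = {"cheapest": 2.00, "best value": 2.48, "enterprise": 50.44}
hours = 24 * 14                  # a hypothetical two-week training run

for name, rate in rates.items():
    print(f"{name:>11}: ${rate * hours * 8:,.0f} for 8 GPUs x {hours} h")
```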

Glossary Terms

FP32 TFLOPS
VRAM
TDP
Cores

Information updated daily. Cloud pricing subject to vendor availability.