NVIDIA HGX B300

The NVIDIA HGX B300 is a high-performance computing platform designed for AI training, inference, and scientific computing workloads. It is part of NVIDIA's HGX series of baseboards, which are tailored for datacenter environments requiring massive parallel processing power. Built on NVIDIA's Blackwell architecture, the B300 offers significant gains in performance and efficiency over previous generations.

HGX B300

VRAM: 192 GB
FP32: 180 TFLOPS
CUDA Cores: 8192

Provider Marketplace


All Cloud Providers

2 options available
Vultr (Cheapest): On-Demand, Global Availability, estimated $0.00/month
Vultr: On-Demand, Global Availability, estimated $0.00/hour

Compute Performance

FP64: 45 TFLOPS
FP32: 180 TFLOPS
TF32: 360 TFLOPS (Dense), 720 TFLOPS (Sparse)
FP16: 720 TFLOPS (Dense), 1440 TFLOPS (Sparse)
BF16: 720 TFLOPS (Dense), 1440 TFLOPS (Sparse)
FP8: 1440 TFLOPS (Dense), 2880 TFLOPS (Sparse)
INT8: 2880 TOPS (Dense), 5760 TOPS (Sparse)
INT4: 5760 TOPS (Dense), 11520 TOPS (Sparse)
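The table follows two simple patterns: halving the bit-width roughly doubles dense throughput, and 2:4 structured sparsity doubles it again. A quick sanity check of those relationships, using the figures listed above:

```python
# Dense peak throughput per precision (TFLOPS for float, TOPS for int),
# taken from the Compute Performance table above.
dense = {"FP32": 180, "TF32": 360, "FP16": 720, "BF16": 720,
         "FP8": 1440, "INT8": 2880, "INT4": 5760}

def sparse_peak(dense_rate: float) -> float:
    """2:4 structured sparsity skips half the multiply-accumulates,
    doubling the peak rate over dense math."""
    return 2 * dense_rate

# FP8 sparse peak matches the 2880 TFLOPS listed in the table.
print(sparse_peak(dense["FP8"]))  # 2880
```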

Architecture

Microarchitecture: Blackwell
Process Node: TSMC 4NP
Die Size: Dual-die (total ~1140 mm²)
Transistors: 208B (dual-die)
Compute Units: 288 SMs
Tensor Cores: 5th Gen, 1152 Tensor Cores
RT Cores: N/A
Matrix Engine: Transformer Engine (FP8/FP16/BF16)
Base Clock: N/A
Boost Clock: N/A
Transformer Engine: Yes (Gen 2)
Sparse Acceleration: Supported (2:4 structured sparsity)
Dynamic Precision: Supported (FP4/FP6/FP8/FP16/BF16/TF32)
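The 2:4 structured sparsity that the Tensor Cores accelerate requires that in every contiguous group of four weights at most two are nonzero. A minimal, framework-free sketch of that validity check (illustrative only; in practice pruning tools in frameworks such as PyTorch produce this layout):

```python
def is_2_4_sparse(weights) -> bool:
    """Return True if every contiguous group of 4 values has at most
    2 nonzeros, the layout accelerated as 2:4 structured sparsity."""
    if len(weights) % 4 != 0:
        return False
    return all(
        sum(1 for w in weights[i:i + 4] if w != 0) <= 2
        for i in range(0, len(weights), 4)
    )

print(is_2_4_sparse([0.5, 0.0, 0.0, -1.2, 0.0, 0.3, 0.7, 0.0]))  # True
print(is_2_4_sparse([0.5, 0.1, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0]))   # False
```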

Memory & VRAM

Memory Type: HBM3e
Total Capacity: 192 GB
Bandwidth: 8 TB/s
Bus Width: 6144-bit
HBM Stacks: 6
ECC Support: Yes (Inline)
Unified Memory: Yes (CUDA Unified Memory)
Compression: N/A
NUMA Awareness: N/A
Memory Pooling: NVLink memory pooling supported
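A back-of-the-envelope check of what 192 GB of HBM3e per GPU means in practice: the upper bound on model parameters that fit, per precision, counting weights only (activations and KV cache reduce this further):

```python
HBM_BYTES = 192 * 10**9  # 192 GB of HBM3e per GPU

def max_params_billion(bytes_per_param: float, hbm_bytes: int = HBM_BYTES) -> float:
    """Upper bound on parameters (in billions) that fit in HBM, weights only."""
    return hbm_bytes / bytes_per_param / 1e9

for fmt, b in [("FP16/BF16", 2), ("FP8", 1), ("FP4", 0.5)]:
    print(f"{fmt}: ~{max_params_billion(b):.0f}B parameters (weights only)")
```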

Connectivity & Scaling

Interconnect: NVLink
Generation: NVLink 5
Interconnect Bandwidth: 1.8 TB/s per GPU
PCIe Interface: PCIe Gen 5 x16 per GPU via baseboard
CXL Support: N/A
Topology: Fully connected NVLink mesh (all-to-all)
Max GPUs/Node: 8
Scale-Out: Yes (InfiniBand NDR, RoCE v2)
GPUDirect RDMA: Yes
P2P Memory: Yes
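In the fully connected all-to-all topology, each of the 8 GPUs reaches every peer in a single hop, so the number of distinct pairwise NVLink paths grows as n·(n-1)/2. A quick sketch:

```python
def all_to_all_links(n_gpus: int) -> int:
    """Number of distinct GPU pairs in a fully connected (all-to-all) topology."""
    return n_gpus * (n_gpus - 1) // 2

print(all_to_all_links(8))  # 28 pairwise paths in an 8-GPU HGX node
```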

Virtualization

MIG Support: Supported
MIG Partitions: 10 instances (max)
SR-IOV: Not Supported
vGPU Readiness: Supported (NVIDIA vGPU)
K8s Readiness: Certified (NVIDIA GPU Operator)
GPU Sharing: MIG, Time-Slicing, MPS, vGPU
Virtualization Efficiency: Near bare-metal (vendor claim)
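MIG carves one physical GPU into isolated instances, each with a dedicated memory slice. A hypothetical helper (the function name and slice sizes are illustrative, not NVIDIA's API) that checks whether a requested set of instance memory sizes fits within the 192 GB on one GPU:

```python
TOTAL_MEM_GB = 192  # per-GPU HBM3e capacity from the Memory section

def fits_on_gpu(requested_gb, total_gb=TOTAL_MEM_GB) -> bool:
    """True if the requested MIG instance memory slices fit on one GPU."""
    return sum(requested_gb) <= total_gb

print(fits_on_gpu([24, 24, 48, 96]))  # True: exactly 192 GB
print(fits_on_gpu([96, 96, 24]))      # False: oversubscribed
```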

Power & Efficiency

TDP: N/A
Peak Power: N/A
Idle Power: N/A
Perf/Watt: N/A
PSU Required: Busbar-powered rack (N/A)
Connectors: Direct busbar connection (N/A)
Thermal Limits: Designed for liquid cooling; typical inlet temp 35°C, max 40°C
Efficiency: System-level efficiency depends on rack and facility; not officially disclosed

Physical Design

Form Factor: HGX B300 baseboard (8x SXM5 modules)
FHFL: N/A
Slot Width: N/A
Dimensions: 445 x 410 mm
Weight: N/A
Cooling: Direct liquid cooling (DLC)
Rack Density: Designed for high-density GPU compute nodes in 4U or 6U server chassis

Thermals & Cooling

Airflow: Server chassis airflow required (not published)
Temp Range: N/A
Throttling: Thermal-based clock reduction at Tjunction limit
Noise Level: Not applicable (passive module)
Liquid Cooling: Yes (direct liquid cooling)
DC Heat: High (rack-scale deployment recommended)

Software Ecosystem

CUDA: CUDA 12.x supported
ROCm: Not supported
oneAPI: Not supported
PyTorch: Officially supported
TensorFlow: Officially supported
JAX: Supported via CUDA backend
HuggingFace: Optimized (CUDA kernels available)
Triton Server: Supported
Docker: Official container images available
Compiler Stack: Mature CUDA compiler stack
Kernel Optimization: Upstream Linux support for datacenter GPUs documented
Driver Stability: Enterprise-grade stability

Server & Deployment

OEM Availability: Tier-1 OEMs: Dell, HPE, Supermicro
Preconfigured: 4U 8-GPU systems
DGX/HGX: Core of an HGX baseboard
Rack-Scale: NVLink Switch System, InfiniBand scale-out
Edge Deploy: Not suitable for edge deployment due to high power draw
Ref Architectures: NVIDIA MGX, SuperPOD

System Compatibility

CPU Pairing: Integrated with platform CPU (HGX/DGX architecture)
NUMA: Platform-specific NUMA topology; memory locality is critical for optimal performance
Required PCIe: Not applicable (SXM/OAM)
Motherboard: Platform-specific (HGX/NVL baseboard)
Rack Power: Contact vendor for rack power planning
BIOS Limits: N/A
CXL Ready: Not supported
OS Compat: RHEL, Ubuntu LTS, and Windows Server supported

Benchmarks & Throughput

Structured Sparsity

Supported (up to 2x vs dense)

Transformer Throughput

Supported (Transformer Engine)

Multi-GPU Scalability

Scaling Efficiency

Single GPU: Optimal performance with full utilization of GPU resources.
2-GPU: Near-linear scaling over NVLink with minimal overhead.
4-GPU: Efficient scaling with NVLink, leveraging NVSwitch for high bandwidth.
8-GPU: Near-linear scaling up to 8 GPUs with NVSwitch, maximizing NVLink bandwidth.
64+ GPU: Scalability impacted by InfiniBand/Ethernet overhead, mitigated by GPUDirect RDMA and multi-rail networking.
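Scaling efficiency is conventionally measured as achieved speedup divided by ideal linear speedup. A sketch using hypothetical throughput numbers (for illustration only, not measured B300 figures):

```python
def scaling_efficiency(throughput_n: float, throughput_1: float, n_gpus: int) -> float:
    """Achieved speedup relative to linear scaling (1.0 = perfectly linear)."""
    return (throughput_n / throughput_1) / n_gpus

# Hypothetical tokens/sec figures, for illustration only.
print(f"{scaling_efficiency(7_600, 1_000, 8):.0%}")   # prints 95%: one NVLink node
print(f"{scaling_efficiency(54_000, 1_000, 64):.0%}") # prints 84%: across InfiniBand
```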

Scaling Characteristics

Cross-Node Latency: Low latency with GPUDirect RDMA, optimized for distributed training.
Network Bottlenecks: Potential bottleneck at the host-to-device (PCIe) path when NVLink is not used; otherwise limited by network bandwidth.
Parallelism: Supports data, model, pipeline, and tensor parallelism with frameworks like DeepSpeed and Megatron.

Workload Readiness

LLM Training

Built on the Blackwell architecture, the HGX B300 is suitable for training models of 400B+ parameters in multi-node setups, thanks to its 192 GB of HBM3e per GPU and high-bandwidth NVLink interconnect.
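The 400B+ claim can be sanity-checked with standard mixed-precision training memory math: roughly 16 bytes per parameter (FP16 weights plus FP32 master weights and Adam optimizer moments), before activations. A rough sketch:

```python
import math

def min_gpus_for_training(params_b: float, hbm_gb: int = 192,
                          bytes_per_param: int = 16) -> int:
    """Lower bound on GPUs needed to hold weights + optimizer state
    (FP16 weights, FP32 master copy, Adam moments ~= 16 B/param),
    ignoring activations and communication buffers."""
    total_gb = params_b * 1e9 * bytes_per_param / 1e9
    return math.ceil(total_gb / hbm_gb)

print(min_gpus_for_training(400))  # 34: model + optimizer state alone
```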

LLM Inference

Optimized for high throughput inference with advanced Tensor cores, providing excellent token-per-second performance and ample KV cache for large models.
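The "ample KV cache" point can be quantified: the KV cache grows as 2 (K and V) × layers × KV heads × head dimension × sequence length × batch × bytes per element. A sketch using hypothetical model dimensions (a 70B-class model with grouped-query attention, not a specific published model):

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=1):
    """KV cache footprint in GB; bytes_per_elem=1 assumes an FP8 cache."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# Hypothetical 70B-class model: 80 layers, 8 KV heads (GQA), head_dim 128.
print(f"{kv_cache_gb(80, 8, 128, seq_len=32_768, batch=8):.1f} GB")  # 42.9 GB
```

Even at 32K context and batch 8, the cache fits comfortably alongside FP8 weights in 192 GB.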

Vision Training

Highly capable for vision training tasks, leveraging its architecture's advanced Tensor cores and large VRAM to efficiently handle large datasets and complex models.

Diffusion Models

Well-suited for diffusion models, offering high computational throughput and memory bandwidth to manage the iterative processes involved in such models.

Multimodal AI

The architecture supports multimodal AI tasks effectively, with strong parallel processing capabilities and sufficient memory to handle diverse data types simultaneously.

Reinforcement Learning

Excellent for reinforcement learning, providing fast computation and large memory capacity to support complex environments and large-scale simulations.

HPC / Simulation

Strong FP64 performance makes it ideal for HPC simulations, offering the precision and computational power needed for scientific and engineering applications.

Scientific Computing

Highly efficient for scientific computing tasks, with robust double precision capabilities and high memory bandwidth to support intensive calculations.

Edge Inference

Not optimal for edge inference due to high power consumption and large form factor, better suited for data center environments.

Real-Time Serving

Capable of real-time AI serving with low latency and high throughput, leveraging its architecture's advanced processing capabilities.

Fine-Tuning

Highly efficient for full fine-tuning of large models, thanks to its substantial VRAM and advanced architecture.

LoRA Efficiency

Efficient for LoRA applications, providing sufficient computational resources and memory to handle parameter-efficient tuning methods.
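LoRA's parameter efficiency is easy to quantify: a rank-r adapter on a d_out × d_in weight matrix trains r·(d_in + d_out) parameters instead of d_in·d_out. A quick sketch with an illustrative 8192-wide projection:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by a rank-r LoRA adapter on one weight matrix."""
    return rank * (d_in + d_out)

full = 8192 * 8192                        # full fine-tune of one projection
lora = lora_params(8192, 8192, rank=16)   # rank-16 adapter on the same matrix
print(f"LoRA trains {lora / full:.3%} of the full matrix")  # ~0.4%
```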

Market Authority

Key Strengths

The HGX B300 excels at large-scale AI training and inference tasks, offering unparalleled performance for deep learning models. Its architecture is optimized for high throughput and low latency, making it ideal for scientific simulations and complex data analytics. The platform's scalability and efficiency set it apart from alternatives.

Limitations

While the HGX B300 offers exceptional performance, its high power consumption and cooling requirements may limit its use in smaller or less equipped datacenters. Additionally, its availability may be constrained by supply chain factors, and its cost can be prohibitive for smaller organizations.

Expert Insight

The HGX B300 represents a strategic leap in AI compute. When comparing cloud providers, consider not just the hourly rate but also the interconnect bandwidth (InfiniBand/NVLink) and regional availability, which can significantly affect total cost of ownership for large-scale training.
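The interconnect point can be made concrete: a cheaper hourly rate loses if weaker interconnect lowers scaling efficiency and stretches wall-clock training time. A sketch with hypothetical provider numbers (rates and efficiencies are invented for illustration):

```python
def training_cost(hourly_rate: float, base_hours: float, efficiency: float) -> float:
    """Total cost when lower scaling efficiency stretches wall-clock time."""
    return hourly_rate * (base_hours / efficiency)

# Hypothetical: provider A is pricier per hour but has better interconnect.
print(training_cost(60.0, 1000, 0.95))  # provider A
print(training_cost(50.0, 1000, 0.70))  # provider B: cheaper per hour, dearer overall
```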

Glossary Terms

FP32 TFLOPS
VRAM
TDP
Cores
Information updated daily. Cloud pricing subject to vendor availability.