NVIDIA HGX B300

The NVIDIA HGX B300 is a high-performance computing platform designed for AI training, inference, and scientific computing workloads. It is part of NVIDIA's HGX series of baseboards, which are tailored for datacenter environments requiring massive parallel processing power. Built on NVIDIA's Blackwell architecture, the B300 offers significant gains in performance and efficiency over previous generations.

HGX B300

VRAM: 192 GB
FP32: 180 TFLOPS
CUDA Cores: 8192

Provider Marketplace


All Cloud Providers

2 options available
Vultr (Cheapest): On-Demand, Global Availability, estimated $0.00/month
Vultr: On-Demand, Global Availability, estimated $0.00/hour

Compute Performance

FP64: 45 TFLOPS
FP32: 180 TFLOPS
TF32: 360 TFLOPS (Dense), 720 TFLOPS (Sparse)
FP16: 720 TFLOPS (Dense), 1440 TFLOPS (Sparse)
BF16: 720 TFLOPS (Dense), 1440 TFLOPS (Sparse)
FP8: 1440 TFLOPS (Dense), 2880 TFLOPS (Sparse)
INT8: 2880 TOPS (Dense), 5760 TOPS (Sparse)
INT4: 5760 TOPS (Dense), 11520 TOPS (Sparse)
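The table follows two simple patterns: halving the bit-width roughly doubles dense throughput, and 2:4 structured sparsity doubles it again. A quick sanity check of those relationships, using the figures listed above:

```python
# Dense peak throughput per precision (TFLOPS for float, TOPS for int),
# taken from the Compute Performance table above.
dense = {"FP32": 180, "TF32": 360, "FP16": 720, "BF16": 720,
         "FP8": 1440, "INT8": 2880, "INT4": 5760}

def sparse_peak(dense_rate: float) -> float:
    """2:4 structured sparsity skips half the multiply-accumulates,
    doubling the peak rate over dense math."""
    return 2 * dense_rate

# FP8 sparse peak matches the 2880 TFLOPS listed in the table.
print(sparse_peak(dense["FP8"]))  # 2880
```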

Architecture

Microarchitecture: Blackwell
Process Node: TSMC 4NP
Die Size: Dual-die (total ~1140 mm²)
Transistors: 208B (dual-die)
Compute Units: 288 SMs
Tensor Cores: 5th Gen, 1152 Tensor Cores
RT Cores: N/A
Matrix Engine: Transformer Engine (FP8/FP16/BF16)
Base Clock: N/A
Boost Clock: N/A
Transformer Engine: Yes (Gen 2)
Sparse Acceleration: Supported (2:4 structured sparsity)
Dynamic Precision: Supported (FP4/FP6/FP8/FP16/BF16/TF32)
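The 2:4 structured sparsity that the Tensor Cores accelerate requires that in every contiguous group of four weights at most two are nonzero. A minimal, framework-free sketch of that validity check (illustrative only; in practice pruning tools in frameworks such as PyTorch produce this layout):

```python
def is_2_4_sparse(weights) -> bool:
    """Return True if every contiguous group of 4 values has at most
    2 nonzeros, the layout accelerated as 2:4 structured sparsity."""
    if len(weights) % 4 != 0:
        return False
    return all(
        sum(1 for w in weights[i:i + 4] if w != 0) <= 2
        for i in range(0, len(weights), 4)
    )

print(is_2_4_sparse([0.5, 0.0, 0.0, -1.2, 0.0, 0.3, 0.7, 0.0]))  # True
print(is_2_4_sparse([0.5, 0.1, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0]))   # False
```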

Memory & VRAM

Memory Type: HBM3e
Total Capacity: 192 GB
Bandwidth: 8 TB/s
Bus Width: 6144-bit
HBM Stacks: 6
ECC Support: Yes (Inline)
Unified Memory: Yes (CUDA Unified Memory)
Compression: N/A
NUMA Awareness: N/A
Memory Pooling: NVLink memory pooling supported
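A back-of-the-envelope check of what 192 GB of HBM3e per GPU means in practice: the upper bound on model parameters that fit, per precision, counting weights only (activations and KV cache reduce this further):

```python
HBM_BYTES = 192 * 10**9  # 192 GB of HBM3e per GPU

def max_params_billion(bytes_per_param: float, hbm_bytes: int = HBM_BYTES) -> float:
    """Upper bound on parameters (in billions) that fit in HBM, weights only."""
    return hbm_bytes / bytes_per_param / 1e9

for fmt, b in [("FP16/BF16", 2), ("FP8", 1), ("FP4", 0.5)]:
    print(f"{fmt}: ~{max_params_billion(b):.0f}B parameters (weights only)")
```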

Connectivity & Scaling

Interconnect: NVLink
Generation: NVLink 5
Interconnect Bandwidth: 1.8 TB/s per GPU
PCIe Interface: PCIe Gen 5 x16 per GPU via baseboard
CXL Support: N/A
Topology: Fully connected NVLink mesh (all-to-all)
Max GPUs/Node: 8
Scale-Out: Yes (InfiniBand NDR, RoCE v2)
GPUDirect RDMA: Yes
P2P Memory: Yes
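In the fully connected all-to-all topology, each of the 8 GPUs reaches every peer in a single hop, so the number of distinct pairwise NVLink paths grows as n·(n-1)/2. A quick sketch:

```python
def all_to_all_links(n_gpus: int) -> int:
    """Number of distinct GPU pairs in a fully connected (all-to-all) topology."""
    return n_gpus * (n_gpus - 1) // 2

print(all_to_all_links(8))  # 28 pairwise paths in an 8-GPU HGX node
```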

Virtualization

MIG Support: Supported
MIG Partitions: 10 instances (max)
SR-IOV: Not Supported
vGPU Readiness: Supported (NVIDIA vGPU)
K8s Readiness: Certified (NVIDIA GPU Operator)
GPU Sharing: MIG, Time-Slicing, MPS, vGPU
Virtualization Efficiency: Near bare-metal (vendor claim)
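MIG carves one physical GPU into isolated instances, each with a dedicated memory slice. A hypothetical helper (the function name and slice sizes are illustrative, not NVIDIA's API) that checks whether a requested set of instance memory sizes fits within the 192 GB on one GPU:

```python
TOTAL_MEM_GB = 192  # per-GPU HBM3e capacity from the Memory section

def fits_on_gpu(requested_gb, total_gb=TOTAL_MEM_GB) -> bool:
    """True if the requested MIG instance memory slices fit on one GPU."""
    return sum(requested_gb) <= total_gb

print(fits_on_gpu([24, 24, 48, 96]))  # True: exactly 192 GB
print(fits_on_gpu([96, 96, 24]))      # False: oversubscribed
```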

Power & Efficiency

TDP: N/A
Peak Power: N/A
Idle Power: N/A
Perf/Watt: N/A
PSU Required: Busbar-powered rack (N/A)
Connectors: Direct busbar connection (N/A)
Thermal Limits: Designed for liquid cooling; typical inlet temp 35°C, max 40°C
Efficiency: System-level efficiency depends on rack and facility; not officially disclosed

Physical Design

Form Factor: HGX B300 baseboard (8x SXM5 modules)
FHFL: N/A
Slot Width: N/A
Dimensions: 445 x 410 mm
Weight: N/A
Cooling: Direct liquid cooling (DLC)
Rack Density: Designed for high-density GPU compute nodes in 4U or 6U server chassis

Thermals & Cooling

Airflow: Server chassis airflow required (not published)
Temp Range: N/A
Throttling: Thermal-based clock reduction at Tjunction limit
Noise Level: Not applicable (passive module)
Liquid Cooling: Yes (direct liquid cooling)
DC Heat: High (rack-scale deployment recommended)

Software Ecosystem

CUDA: CUDA 12.x supported
ROCm: Not supported
oneAPI: Not supported
PyTorch: Officially supported
TensorFlow: Officially supported
JAX: Supported via CUDA backend
HuggingFace: Optimized (CUDA kernels available)
Triton Server: Supported
Docker: Official container images available
Compiler Stack: Mature CUDA compiler stack
Kernel Optimization: Upstream Linux support for datacenter GPUs documented
Driver Stability: Enterprise-grade stability

Server & Deployment

OEM Availability: Tier-1 OEMs: Dell, HPE, Supermicro
Preconfigured: 4U 8-GPU systems
DGX/HGX: Core of an HGX baseboard
Rack-Scale: NVLink Switch System, InfiniBand scale-out
Edge Deploy: Not suitable for edge deployment due to high power draw
Ref Architectures: NVIDIA MGX, SuperPOD

System Compatibility

CPU Pairing: Integrated with platform CPU (HGX/DGX architecture)
NUMA: Platform-specific NUMA topology; memory locality is critical for optimal performance
Required PCIe: Not applicable (SXM/OAM)
Motherboard: Platform-specific (HGX/NVL baseboard)
Rack Power: Contact vendor for rack power planning
BIOS Limits: N/A
CXL Ready: Not supported
OS Compat: RHEL, Ubuntu LTS, and Windows Server supported

Benchmarks & Throughput

Structured Sparsity

Supported (up to 2x vs dense)

Transformer Throughput

Supported (Transformer Engine)

Multi-GPU Scalability

Scaling Efficiency

Single GPU: Optimal performance with full utilization of GPU resources.
2-GPU: Near-linear scaling over NVLink with minimal overhead.
4-GPU: Efficient scaling with NVLink, leveraging NVSwitch for high bandwidth.
8-GPU: Near-linear scaling up to 8 GPUs with NVSwitch, maximizing NVLink bandwidth.
64+ GPU: Scalability impacted by InfiniBand/Ethernet overhead, mitigated by GPUDirect RDMA and multi-rail networking.
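Scaling efficiency is conventionally measured as achieved speedup divided by ideal linear speedup. A sketch using hypothetical throughput numbers (for illustration only, not measured B300 figures):

```python
def scaling_efficiency(throughput_n: float, throughput_1: float, n_gpus: int) -> float:
    """Achieved speedup relative to linear scaling (1.0 = perfectly linear)."""
    return (throughput_n / throughput_1) / n_gpus

# Hypothetical tokens/sec figures, for illustration only.
print(f"{scaling_efficiency(7_600, 1_000, 8):.0%}")   # prints 95%: one NVLink node
print(f"{scaling_efficiency(54_000, 1_000, 64):.0%}") # prints 84%: across InfiniBand
```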

Scaling Characteristics

Cross-Node Latency: Low latency with GPUDirect RDMA, optimized for distributed training.
Network Bottlenecks: Potential bottleneck at the host-to-device (PCIe) path when NVLink is not used; otherwise limited by network bandwidth.
Parallelism: Supports data, model, pipeline, and tensor parallelism with frameworks like DeepSpeed and Megatron.

Workload Readiness

LLM Training

Built on the Blackwell architecture, the HGX B300 is suitable for training models of 400B+ parameters in multi-node setups, thanks to its 192 GB of HBM3e per GPU and high-bandwidth NVLink interconnect.
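The 400B+ claim can be sanity-checked with standard mixed-precision training memory math: roughly 16 bytes per parameter (FP16 weights plus FP32 master weights and Adam optimizer moments), before activations. A rough sketch:

```python
import math

def min_gpus_for_training(params_b: float, hbm_gb: int = 192,
                          bytes_per_param: int = 16) -> int:
    """Lower bound on GPUs needed to hold weights + optimizer state
    (FP16 weights, FP32 master copy, Adam moments ~= 16 B/param),
    ignoring activations and communication buffers."""
    total_gb = params_b * 1e9 * bytes_per_param / 1e9
    return math.ceil(total_gb / hbm_gb)

print(min_gpus_for_training(400))  # 34: model + optimizer state alone
```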

LLM Inference

Optimized for high throughput inference with advanced Tensor cores, providing excellent token-per-second performance and ample KV cache for large models.
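The "ample KV cache" point can be quantified: the KV cache grows as 2 (K and V) × layers × KV heads × head dimension × sequence length × batch × bytes per element. A sketch using hypothetical model dimensions (a 70B-class model with grouped-query attention, not a specific published model):

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=1):
    """KV cache footprint in GB; bytes_per_elem=1 assumes an FP8 cache."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# Hypothetical 70B-class model: 80 layers, 8 KV heads (GQA), head_dim 128.
print(f"{kv_cache_gb(80, 8, 128, seq_len=32_768, batch=8):.1f} GB")  # 42.9 GB
```

Even at 32K context and batch 8, the cache fits comfortably alongside FP8 weights in 192 GB.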

Vision Training

Highly capable for vision training tasks, leveraging its architecture's advanced Tensor cores and large VRAM to efficiently handle large datasets and complex models.

Diffusion Models

Well-suited for diffusion models, offering high computational throughput and memory bandwidth to manage the iterative processes involved in such models.

Multimodal AI

The architecture supports multimodal AI tasks effectively, with strong parallel processing capabilities and sufficient memory to handle diverse data types simultaneously.

Reinforcement Learning

Excellent for reinforcement learning, providing fast computation and large memory capacity to support complex environments and large-scale simulations.

HPC / Simulation

Strong FP64 performance makes it ideal for HPC simulations, offering the precision and computational power needed for scientific and engineering applications.

Scientific Computing

Highly efficient for scientific computing tasks, with robust double precision capabilities and high memory bandwidth to support intensive calculations.

Edge Inference

Not optimal for edge inference due to high power consumption and large form factor, better suited for data center environments.

Real-Time Serving

Capable of real-time AI serving with low latency and high throughput, leveraging its architecture's advanced processing capabilities.

Fine-Tuning

Highly efficient for full fine-tuning of large models, thanks to its substantial VRAM and advanced architecture.

LoRA Efficiency

Efficient for LoRA applications, providing sufficient computational resources and memory to handle parameter-efficient tuning methods.
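LoRA's parameter efficiency is easy to quantify: a rank-r adapter on a d_out × d_in weight matrix trains r·(d_in + d_out) parameters instead of d_in·d_out. A quick sketch with an illustrative 8192-wide projection:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by a rank-r LoRA adapter on one weight matrix."""
    return rank * (d_in + d_out)

full = 8192 * 8192                        # full fine-tune of one projection
lora = lora_params(8192, 8192, rank=16)   # rank-16 adapter on the same matrix
print(f"LoRA trains {lora / full:.3%} of the full matrix")  # ~0.4%
```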

Market Authority

Key Strengths

The HGX B300 excels at large-scale AI training and inference tasks, offering unparalleled performance for deep learning models. Its architecture is optimized for high throughput and low latency, making it ideal for scientific simulations and complex data analytics. The platform's scalability and efficiency set it apart from alternatives.

Limitations

While the HGX B300 offers exceptional performance, its high power consumption and cooling requirements may limit its use in smaller or less equipped datacenters. Additionally, its availability may be constrained by supply chain factors, and its cost can be prohibitive for smaller organizations.

Expert Insight

The HGX B300 represents a strategic leap in AI compute. When comparing cloud providers, consider not just the hourly rate but also the interconnect bandwidth (InfiniBand/NVLink) and regional availability, which can significantly affect total cost of ownership for large-scale training.
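The interconnect point can be made concrete: a cheaper hourly rate loses if weaker interconnect lowers scaling efficiency and stretches wall-clock training time. A sketch with hypothetical provider numbers (rates and efficiencies are invented for illustration):

```python
def training_cost(hourly_rate: float, base_hours: float, efficiency: float) -> float:
    """Total cost when lower scaling efficiency stretches wall-clock time."""
    return hourly_rate * (base_hours / efficiency)

# Hypothetical: provider A is pricier per hour but has better interconnect.
print(training_cost(60.0, 1000, 0.95))  # provider A
print(training_cost(50.0, 1000, 0.70))  # provider B: cheaper per hour, dearer overall
```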

Glossary Terms

FP32 TFLOPS
VRAM
TDP
Cores
Information updated daily. Cloud pricing subject to vendor availability.