NVIDIA · Q3 2023

L40S NVL

The NVIDIA L40S NVL is a high-performance datacenter GPU designed for AI, machine learning, and high-performance computing workloads. Built on the Ada Lovelace architecture, it offers improved performance and efficiency over previous generations. Targeted at enterprise and cloud environments, it provides strong capabilities for large-scale AI model training and inference, making it a key component in modern AI infrastructure.

L40S NVL
VRAM: 48 GB
FP32: 91.6 TFLOPS
CUDA Cores: 18,176
TDP: 350 W

Provider Marketplace

Cheapest: from $1.58/hour
Best Value: from $1.58/hour
Enterprise Choice: from $1.58/hour

All Cloud Providers

1 option available
Atlantic.Net (Cheapest) · On-Demand · Global Availability
$1.58/hour (estimated cost)
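As a sanity check on the listed rate, a rough on-demand monthly estimate can be derived from the hourly price (assuming ~730 hours/month, no committed-use or reserved discounts):

```python
hourly = 1.58            # listed on-demand rate, $/hour
hours_per_month = 730    # ~24 h x 365 days / 12 months

on_demand = hourly * hours_per_month  # ≈ $1,153/month if left running 24/7
```

Actual spend depends on utilization; shutting instances down between jobs changes this dramatically.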

Compute Performance

FP64: 1.4 TFLOPS
FP32: 91.6 TFLOPS
TF32: 183.2 TFLOPS (Sparse), 91.6 TFLOPS (Dense)
FP16: 366.4 TFLOPS (Sparse), 183.2 TFLOPS (Dense)
BF16: 366.4 TFLOPS (Sparse), 183.2 TFLOPS (Dense)
FP8: Supported (Transformer Engine)
INT8: 733 TOPS (Sparse), 366 TOPS (Dense)
INT4: Not Supported
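The headline FP32 figure follows directly from the core count and clock: each CUDA core can retire one fused multiply-add (2 FLOPs) per cycle at peak. A minimal sketch, assuming a ~2.52 GHz boost clock:

```python
def peak_fp32_tflops(cuda_cores: int, boost_clock_ghz: float) -> float:
    # Each CUDA core retires one FMA (2 FLOPs) per cycle at peak.
    return cuda_cores * 2 * boost_clock_ghz / 1e3

tflops = peak_fp32_tflops(18_176, 2.52)  # ≈ 91.6 TFLOPS
```

This is a theoretical peak; sustained throughput depends on occupancy, memory traffic, and power limits.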

Architecture

Microarchitecture: Ada Lovelace
Process Node: TSMC 4N
Die Size: 608 mm²
Transistors: 76.3B
Compute Units: 142 SMs
Tensor Cores: 4th Gen, 568
RT Cores: 3rd Gen, 142
Matrix Engine: Transformer Engine (FP8/FP16/BF16)
Base Clock: 1,350 MHz
Boost Clock: 2,520 MHz
Transformer Engine: Yes
Sparse Acceleration: Supported (2:4 structured sparsity)
Dynamic Precision: Supported (FP8/FP16/BF16/TF32)

Memory & VRAM

Memory Type: GDDR6 with ECC
Total Capacity: 48 GB
Bandwidth: 864 GB/s
Bus Width: 384-bit
ECC Support: Yes (Inline)
Unified Memory: Yes (CUDA Unified Memory)
Compression
NUMA Awareness
Memory Pooling: Not supported (no NVLink)
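Peak memory bandwidth is just bus width (in bytes) times per-pin data rate. As an illustration, a 384-bit GDDR6 bus running at an assumed 18 Gbps per pin:

```python
def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    # Peak bandwidth = bus width in bytes x per-pin data rate.
    return bus_width_bits / 8 * data_rate_gbps

bw = peak_bandwidth_gb_s(384, 18.0)  # 864.0 GB/s
```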

Connectivity & Scaling

Interconnect: PCIe Gen 4 x16 (no NVLink or NVSwitch)
PCIe Interface: PCIe Gen 4 x16
CXL Support
Topology: PCIe peer-to-peer within the node; InfiniBand/RoCE v2 across nodes
Max GPUs/Node: 4
Scale-Out: Yes (InfiniBand NDR at up to 400 Gb/s per link, or RoCE v2)
GPUDirect RDMA: Yes
P2P Memory: Yes

Virtualization

MIG Support: Not Supported (MIG is limited to A100/A30/H100-class GPUs)
MIG Partitions: N/A
SR-IOV: Not Supported
vGPU Readiness: Supported (NVIDIA vGPU)
K8s Readiness: Certified (NVIDIA GPU Operator)
GPU Sharing: Time-Slicing, MPS, vGPU
Virt Efficiency: Near bare-metal (vendor claim)

Power & Efficiency

TDP: 350 W
Peak Power: 350–400 W
Idle Power: 40–60 W
Perf / Watt: ~1.05 TFLOPS FP16 (sparse) per W (366.4 TFLOPS / 350 W)
PSU Required: N/A
Connectors: 1x PCIe 16-pin (12VHPWR) or 2x PCIe 8-pin
Thermal Limits: Max GPU temperature 85°C
Efficiency: N/A
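For capacity planning, the TDP translates into a rough energy bill. A sketch assuming an illustrative $0.12/kWh rate and a card held at TDP all month:

```python
def monthly_power_cost_usd(tdp_w: float, usd_per_kwh: float = 0.12,
                           hours: float = 730) -> float:
    # Energy (kWh) x electricity price; assumes sustained full load.
    return tdp_w / 1000 * hours * usd_per_kwh

cost = monthly_power_cost_usd(350)  # ≈ $30.7/month per GPU at full load
```

Cooling overhead (PUE) typically adds 10–50% on top of the GPU's own draw.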

Physical Design

Form Factor: Dual-slot PCIe add-in card
FHFL: Yes (full height, full length)
Slot Width: Dual-slot
Dimensions: 267 mm x 112 mm
Weight: 1.8–2.2 kg
Cooling: Passive heatsink (server chassis airflow; OEM dependent)
Rack Density: Optimized for multi-GPU 2U/4U servers and high-density rack deployments

Thermals & Cooling

Airflow: Passive; relies on server chassis front-to-back airflow
Temp Range
Throttling: Standard thermal protection
Noise Level: Not Applicable (passive card)
Liquid Cooling: Not standard (OEM dependent)
DC Heat: High (plan rack-level cooling accordingly)

Software Ecosystem

CUDA: CUDA 12.x supported
ROCm: Not Supported
oneAPI: Not Supported
PyTorch: Officially supported
TensorFlow: Officially supported
JAX: Supported via CUDA backend
HuggingFace: Optimized (CUDA kernels available)
Triton Server: Supported
Docker: Official container images available
Compiler Stack: Mature CUDA compiler stack
Kernel Optim: Standard driver-based support
Driver Stability: Enterprise-grade stability

Server & Deployment

OEM Availability: Dell, HPE, Supermicro (Tier-1 OEMs)
Preconfigured: 2U/4U universal GPU servers
DGX/HGX: Not offered as part of DGX systems or HGX baseboards
Rack-Scale: InfiniBand scale-out
Edge Deploy: Suitable for edge deployments with moderate TDP considerations
Ref Architectures: NVIDIA MGX, OVX

System Compatibility

CPU Pairing: Standard x86/Arm server CPUs; no proprietary baseboard required
NUMA: Platform-specific NUMA topology; pin workloads to the GPU's local CPU socket for best memory locality
Required PCIe: PCIe Gen 4 x16 slot per GPU
Motherboard: Server platforms with PCIe Gen 4 x16 slots
Rack Power: Contact vendor for rack power planning
BIOS Limits
CXL Ready: Not Supported
OS Compat: RHEL, Ubuntu LTS, and Windows Server supported

Benchmarks & Throughput

Structured Sparsity

Supported (up to 2x vs dense)
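The 2:4 structured-sparsity format requires at most two non-zero values in every group of four consecutive weights; hardware can then skip the zeros for up to 2x throughput. A small illustrative checker (not the cuSPARSELt API):

```python
def is_2_4_sparse(weights):
    """True if every contiguous group of 4 values has at most 2 non-zeros."""
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        if sum(1 for w in group if w != 0) > 2:
            return False
    return True

is_2_4_sparse([0.5, 0, -1.2, 0,  0, 0.3, 0, 0.9])  # True: 2 non-zeros per group
```

Models are typically pruned to this pattern and then fine-tuned to recover accuracy.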

Transformer Throughput

Supported (Transformer Engine)

Multi-GPU Scalability

Scaling Efficiency

Single GPU: The L40S NVL operates efficiently as a standalone unit, leveraging its full PCIe bandwidth.
2-GPU: Scaling between two GPUs is limited by PCIe lane contention, with roughly 32 GB/s per direction over PCIe Gen 4 x16.
4-GPU: Scaling across four GPUs is further constrained by PCIe bandwidth, leading to diminishing returns as contention increases.
8-GPU: Without NVLink or NVSwitch, eight-GPU scaling is significantly limited by PCIe bandwidth, resulting in sub-linear speedups.
64+ GPU: At large scale, InfiniBand or Ethernet overhead becomes significant; efficient scaling requires optimized network configurations.

Scaling Characteristics

Cross-Node Latency: Cross-node communication is supported via GPUDirect RDMA, but latency depends on network topology and configuration.
Network Bottlenecks: The primary bottleneck is the host-to-device PCIe bridge, since the card lacks NVLink.
Parallelism: Supports data, model, pipeline, and tensor parallelism; compatible with frameworks such as DeepSpeed and Megatron-LM for distributed training.
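To see why PCIe-only scaling goes sub-linear, a back-of-envelope ring all-reduce model helps (assuming an illustrative ~25 GB/s effective per-direction PCIe Gen 4 bandwidth):

```python
def ring_allreduce_seconds(buffer_bytes: float, n_gpus: int,
                           link_gb_s: float) -> float:
    # A ring all-reduce moves 2*(n-1)/n of the buffer over each link,
    # so total time is gated by the slowest link in the ring.
    traffic = 2 * (n_gpus - 1) / n_gpus * buffer_bytes
    return traffic / (link_gb_s * 1e9)

# 1 GB of FP16 gradients, 4 GPUs, ~25 GB/s effective PCIe Gen 4 per direction:
t = ring_allreduce_seconds(1e9, 4, 25)  # ≈ 0.06 s of pure communication per step
```

On NVLink-class links (hundreds of GB/s) the same transfer would take an order of magnitude less, which is the scaling gap the section describes.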

Workload Readiness

LLM Training

The L40S NVL, based on the Ada Lovelace architecture, is well suited to large language model training thanks to its 48 GB of VRAM: a multi-GPU node can handle models up to roughly 70B parameters when optimizer state is sharded (e.g., ZeRO or FSDP). For 400B+ models, multi-node configurations are required.
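A rough sizing rule behind these recommendations: mixed-precision Adam needs about 16 bytes per parameter (2 for FP16 weights, 2 for gradients, 12 for FP32 master weights plus two optimizer moments), before activations. A sketch:

```python
import math

def training_gib(params_billions: float, bytes_per_param: int = 16) -> float:
    # Weights + gradients + optimizer state only; activations excluded.
    return params_billions * 1e9 * bytes_per_param / 2**30

def min_gpus(params_billions: float, vram_gib: int = 48) -> int:
    # Assumes perfect sharding across GPUs, so this is a floor, not a plan.
    return math.ceil(training_gib(params_billions) / vram_gib)

min_gpus(70)  # 70B params need ~1,043 GiB of state, i.e. 22+ 48 GB GPUs
```

Real jobs need headroom for activations and fragmentation, so practical counts run higher.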

LLM Inference

Highly efficient for inference tasks with excellent token-per-second throughput, thanks to 4th-gen Tensor cores and ample VRAM for KV cache management.
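The KV cache is what ample VRAM buys at inference time: keys and values for every layer, head, and token must stay resident. A sketch using an assumed 7B-class shape (32 layers, 32 KV heads of dim 128, FP16):

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    # K and V tensors per layer: 2 x kv_heads x head_dim x seq_len x batch.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 2**30

kv_cache_gib(32, 32, 128, 4096, 1)  # 2.0 GiB per 4k-token sequence
```

At batch 16 that grows to ~32 GiB, which is why cache size often caps concurrency before compute does.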

Vision Training

Optimized for vision training tasks with significant improvements in throughput and efficiency due to Ada Lovelace architecture and enhanced Tensor cores.

Diffusion Models

Well-suited for diffusion models, leveraging high VRAM and Tensor core capabilities to accelerate training and inference processes.

Multimodal AI

Capable of handling multimodal AI tasks efficiently, benefiting from the architecture's support for diverse data types and operations.

Reinforcement Learning

Effective for reinforcement learning workloads, offering fast computation and high throughput for complex simulations and model updates.

HPC / Simulation

Limited FP64 support; not ideal for HPC simulations requiring high double-precision performance, but can handle mixed-precision tasks efficiently.

Scientific Computing

Suitable for scientific computing tasks that can leverage mixed-precision calculations, but not optimal for those requiring extensive FP64 precision.

Edge Inference

Not ideal for edge inference due to higher power consumption and larger form factor, better suited for data center environments.

Real-Time Serving

Excellent for real-time AI serving, providing low latency and high throughput with advanced Tensor cores and Ada Lovelace architecture.

Fine-Tuning

Highly efficient for full fine-tuning tasks, leveraging high VRAM and advanced architecture to handle large model updates.

LoRA Efficiency

Efficient for LoRA fine-tuning, benefiting from lower VRAM requirements and optimized Tensor core performance.
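The VRAM saving comes from training only low-rank factors: for a d_in x d_out weight matrix, LoRA trains r x (d_in + d_out) parameters instead of d_in x d_out. A sketch with assumed 7B-class dimensions:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # A: (d_in x r) and B: (r x d_out) low-rank factors per adapted matrix.
    return rank * (d_in + d_out)

full = 4096 * 4096                  # one attention projection, 16.8M params
lora = lora_params(4096, 4096, 16)  # 131,072 trainable params, ~0.8% of full
```

Since optimizer state is only kept for the adapter weights, total training memory shrinks accordingly.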

Market Authority

Key Strengths

The L40S NVL excels in AI and machine learning workloads, particularly in training large neural networks and performing complex inference tasks. Its architecture provides significant performance improvements in FP16 and INT8 operations, making it ideal for deep learning applications. The GPU's high memory bandwidth and capacity also support data-intensive tasks, setting it apart from alternatives.

Limitations

While the L40S NVL offers strong performance, its power consumption and price are high, which may not suit every budget. Availability can be constrained by high demand in the AI and HPC sectors. Users should also plan adequate cooling to manage its thermal output effectively.

Expert Insight

The L40S represents a strategic leap in AI compute. When comparing cloud providers, consider not just the hourly rate, but also the interconnect bandwidth (InfiniBand/NVLink) and regional availability which can significantly impact total cost of ownership for large-scale training.

Glossary Terms

FP32 TFLOPS
VRAM
TDP
Cores
Information updated daily. Cloud pricing subject to vendor availability.