NVIDIA · August 2023

L40S

PCIe Gen4 x16

The NVIDIA L40S is a high-performance datacenter GPU designed for AI, machine learning, and graphics-intensive workloads. It is part of the Ada Lovelace architecture, offering significant improvements in performance and efficiency over previous generations. Targeted at enterprise and cloud environments, the L40S excels in delivering accelerated computing power for demanding applications.

L40S PCIe Gen4 x16
VRAM: 48 GB
FP32: 91.6 TFLOPS
CUDA Cores: 18,176
TDP: 350 W

Compute Performance

FP64: 1.4 TFLOPS
FP32: 91.6 TFLOPS
TF32: 183.2 TFLOPS (Sparse), 91.6 TFLOPS (Dense)
FP16: 366.1 TFLOPS (Sparse), 183.1 TFLOPS (Dense)
BF16: 366.1 TFLOPS (Sparse), 183.1 TFLOPS (Dense)
FP8: Supported (4th-gen Tensor Cores with Transformer Engine)
INT8: 733.2 TOPS (Sparse), 366.6 TOPS (Dense)
INT4: Not Supported
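As a sanity check, the headline FP32 figure follows from CUDA cores × 2 FLOPs per cycle (one FMA) × boost clock. A minimal sketch, assuming the commonly cited 2,520 MHz boost clock for the L40S:

```python
# Peak FP32 throughput: one FMA (2 FLOPs) per CUDA core per clock.
CUDA_CORES = 18_176
BOOST_CLOCK_HZ = 2.52e9  # assumed 2,520 MHz boost clock

peak_fp32_tflops = CUDA_CORES * 2 * BOOST_CLOCK_HZ / 1e12
print(f"{peak_fp32_tflops:.1f} TFLOPS")  # ~91.6 TFLOPS
```

The same cores × ops × clock arithmetic underlies the Tensor Core figures, scaled by the per-SM matrix throughput of each precision.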

Architecture

Microarchitecture: Ada Lovelace
Process Node: TSMC 4N
Die Size: 609 mm²
Transistors: 76.3B
Compute Units: 142 SMs
Tensor Cores: 4th Gen, 568 Tensor Cores
RT Cores: 3rd Gen, 142 RT Cores
Matrix Engine: Tensor Core (FP8/FP16/BF16/INT8)
Base Clock: 1,350 MHz
Boost Clock: 2,520 MHz
Transformer Engine: Yes (FP8, 4th-gen Tensor Cores)
Sparse Acceleration: Supported (2:4 structured sparsity)
Dynamic Precision: Supported (FP8/FP16/BF16/TF32)
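The 2:4 structured-sparsity scheme keeps the two largest-magnitude values in every group of four weights, letting the Tensor Cores skip the zeroed pair. A hypothetical pure-Python sketch of the pruning pattern (illustrative only, not NVIDIA's implementation):

```python
def prune_2_4(weights):
    """Zero the two smallest-magnitude entries in each group of 4 (2:4 sparsity)."""
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two smallest-magnitude values in this group of four.
        drop = sorted(range(len(group)), key=lambda j: abs(group[j]))[:2]
        pruned.extend(0.0 if j in drop else w for j, w in enumerate(group))
    return pruned

row = [0.9, -0.1, 0.05, -0.8, 0.2, 0.3, -0.7, 0.01]
print(prune_2_4(row))  # [0.9, 0.0, 0.0, -0.8, 0.0, 0.3, -0.7, 0.0]
```

Exactly 50% of the entries become zero in a fixed pattern, which is what allows the hardware to achieve up to the 2x "sparse" rates listed above.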

Memory & VRAM

Memory Type: GDDR6
Total Capacity: 48 GB
Bandwidth: 864 GB/s
Bus Width: 384-bit
HBM Stacks: N/A (GDDR6, no HBM)
ECC Support: Yes (Inline)
Unified Memory: Yes (CUDA Unified Memory)
Compression: Not Published
NUMA Awareness: Not Published
Memory Pooling: Not Supported
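The 864 GB/s figure is consistent with the 384-bit bus: bandwidth = bus width × per-pin data rate / 8. A quick check, assuming 18 Gbps GDDR6 pins:

```python
BUS_WIDTH_BITS = 384
PIN_RATE_GBPS = 18  # assumed GDDR6 data rate per pin

bandwidth_gbs = BUS_WIDTH_BITS * PIN_RATE_GBPS / 8
print(f"{bandwidth_gbs:.0f} GB/s")  # 864 GB/s
```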

Connectivity & Scaling

Interconnect: PCIe
Generation: PCIe Gen4
I/O Bandwidth: ~32 GB/s per direction (~64 GB/s bidirectional)
PCIe Interface: Gen4 x16
CXL Support: Not Supported
Topology: PCIe switch or direct PCIe peer-to-peer
Max GPUs/Node: 8 (system dependent)
Scale-Out: Yes, via PCIe-based networking (InfiniBand, RoCE v2)
GPUDirect RDMA: Yes
P2P Memory: Yes, via PCIe BAR1
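The quoted PCIe figure can be derived from lane count and signaling rate: Gen4 runs at 16 GT/s per lane with 128b/130b line encoding. A sketch of the arithmetic:

```python
LANES = 16
GT_PER_S = 16            # PCIe Gen4 signaling rate per lane
ENCODING = 128 / 130     # 128b/130b line-encoding overhead

# Per-direction raw bandwidth in GB/s (before protocol overhead).
per_direction = LANES * GT_PER_S * ENCODING / 8
print(f"{per_direction:.1f} GB/s per direction")  # ~31.5 GB/s
```

Real-world throughput is a few percent lower still once TLP/DLLP protocol overhead is accounted for.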

Virtualization

MIG Support: Not Supported
MIG Partitions: N/A
SR-IOV: Not Supported
vGPU Readiness: Supported (NVIDIA vGPU)
K8s Readiness: Certified (NVIDIA GPU Operator)
GPU Sharing: Time-Slicing, vGPU, MPS
Virt Efficiency: Near bare-metal (vendor claim)
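Since the L40S lacks MIG, sharing a card among multiple Kubernetes pods is typically done via time-slicing through the NVIDIA GPU Operator's device-plugin config. A sketch of that config (the replica count is an illustrative choice; consult the GPU Operator documentation for the authoritative schema):

```yaml
# Example time-slicing config for the NVIDIA device plugin (illustrative).
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4   # expose each physical L40S as 4 schedulable GPUs
```

Note that time-slicing provides no memory isolation between workloads, unlike MIG on MIG-capable GPUs.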

Power & Efficiency

TDP: 350 W
Peak Power: 320-350 W
Idle Power: 35-45 W
Perf / Watt: ~0.26 TFLOPS FP32 per watt (91.6 TFLOPS at 350 W)
PSU Required: Not Published
Connectors: 1x 16-pin PCIe (CEM5)
Thermal Limits: Max GPU temperature 85°C
Efficiency: N/A

Physical Design

Form Factor: PCIe Gen4 x16
FHFL: Full Height, Full Length
Slot Width: Dual slot
Dimensions: 267 mm x 112 mm
Weight: 1.5–1.8 kg
Cooling: Passive
Rack Density: Standard PCIe server GPU density

Thermals & Cooling

Airflow: Requires front-to-back chassis airflow
Temp Range: 0°C to 45°C
Throttling: Thermal-based clock reduction at Tjunction limit
Noise Level: Not Applicable (passive module)
Liquid Cooling: Not available (air-cooled only)
DC Heat: High (rack-scale deployment recommended)

Software Ecosystem

CUDA: CUDA 12.x supported
ROCm: Not Supported
oneAPI: Not Supported
PyTorch: Officially supported
TensorFlow: Officially supported
JAX: Supported via CUDA backend
HuggingFace: Optimized (CUDA kernels available)
Triton Server: Supported
Docker: Official container images available
Compiler Stack: Mature CUDA compiler stack
Kernel Optim: Upstream Linux support via NVIDIA datacenter drivers
Driver Stability: Enterprise-grade stability

Server & Deployment

OEM Availability: Tier-1 OEMs (Dell, HPE, Supermicro)
Preconfigured: 2U/4U universal GPU servers
DGX/HGX: Not offered in DGX systems or on HGX baseboards
Rack-Scale: InfiniBand scale-out
Edge Deploy: Deployable at the edge, though the 350 W TDP and passive cooling require server-class chassis airflow
Ref Architectures: NVIDIA MGX, OVX

System Compatibility

CPU Pairing: Dual-socket Intel Xeon Scalable or AMD EPYC 7003/9004 class recommended
NUMA: Standard NUMA behavior
Required PCIe: PCIe Gen4 x16
Motherboard: Full-length, double-width PCIe Gen4 x16 slot required
Rack Power: Contact vendor for rack power planning
BIOS Limits: Resizable BAR and Above 4G Decoding recommended; SR-IOV support Not Published
CXL Ready: No CXL memory expansion
OS Compat: RHEL, Ubuntu LTS, and Windows supported

Benchmarks & Throughput

Structured Sparsity

Supported (up to 2x vs dense)

Transformer Throughput

Supported (Transformer Engine)

Multi-GPU Scalability

Scaling Efficiency

Single GPU: The L40S offers high efficiency for single-GPU workloads, with the full ~32 GB/s per-direction bandwidth of PCIe Gen4 x16 available to the device.
2-GPU: Scaling to two GPUs is feasible but limited by PCIe lane contention, with potential bottlenecks in P2P communication.
4-GPU: Four-GPU scaling is constrained by PCIe Gen4 bandwidth, with diminishing returns from increased contention and limited P2P bandwidth.
8-GPU: Eight-GPU scaling is significantly limited by PCIe bandwidth, yielding sub-linear gains due to contention and the lack of NVLink.
64+ GPU: At large scales, InfiniBand or Ethernet overhead becomes significant, and PCIe limitations further compound scaling inefficiencies.

Scaling Characteristics

Cross-Node Latency: Cross-node communication is supported via GPUDirect RDMA, but latency is higher than on NVLink-class systems because all traffic traverses PCIe and the NIC.
Network Bottlenecks: The primary bottleneck is the host PCIe bridge and the lack of NVLink, which limits P2P bandwidth and can increase VRAM pressure.
Parallelism: Supports Data, Model, Pipeline, and Tensor Parallelism, compatible with frameworks like DeepSpeed and Megatron.

Workload Readiness

LLM Training

With 48 GB of VRAM per GPU and a PCIe Gen4 x16 interface (no NVLink), the L40S is best suited to single-node training of small and mid-sized models. Training models in the tens of billions of parameters requires sharding optimizer state across many GPUs (e.g., ZeRO or FSDP), typically in multi-node configurations over InfiniBand or RoCE.
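The practical model-size limit can be estimated with the usual mixed-precision rule of thumb of ~16 bytes per parameter (FP16 weights and gradients plus FP32 Adam states), before activations. A sketch, treating the 16 bytes/param figure as an assumption:

```python
import math

BYTES_PER_PARAM = 16  # assumed: fp16 weights (2) + fp16 grads (2) + fp32 Adam states (12)
VRAM_GB = 48          # per L40S

def min_gpus(params_billion: float) -> int:
    """Lower bound on GPUs needed to hold weights + optimizer states (no activations)."""
    needed_gb = params_billion * BYTES_PER_PARAM  # 1e9 params * bytes / 1e9 = GB
    return math.ceil(needed_gb / VRAM_GB)

for size in (7, 13, 70):
    print(f"{size}B params -> >= {min_gpus(size)} GPUs (states only)")
```

A 70B model needs roughly 1.1 TB for weights and optimizer states alone, i.e. two dozen L40S cards at minimum, which is why multi-node sharding is the realistic configuration at that scale.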

LLM Inference

Highly efficient for LLM inference with strong token-per-second performance, thanks to 4th-gen Tensor cores and ample VRAM for KV cache management.
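A rough sizing formula shows how the 48 GB of VRAM translates into KV-cache headroom: size = 2 (K and V) × layers × kv_heads × head_dim × seq_len × batch × bytes per element. A sketch with illustrative 7B-class model shapes (assumptions, not vendor data):

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per=2):
    """KV-cache size in GB: K and V (x2) per layer, per head, per position."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per / 1e9

# Illustrative 7B-class shapes: 32 layers, 32 KV heads, head_dim 128, fp16 cache.
print(f"{kv_cache_gb(32, 32, 128, seq_len=4096, batch=8):.1f} GB")
```

At those shapes a batch of 8 requests at 4K context consumes ~17 GB of cache, leaving room for the weights of a quantized or fp16 7B model within the 48 GB budget.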

Vision Training

Optimized for vision training tasks with Ada Lovelace architecture, providing excellent throughput and efficiency for large-scale image datasets.

Diffusion Models

Well-suited for diffusion models due to its high computational throughput and advanced tensor core capabilities, enabling efficient model training and inference.

Multimodal AI

Capable of handling multimodal AI workloads effectively, leveraging its robust architecture and VRAM to manage complex data types and models.

Reinforcement Learning

Supports reinforcement learning tasks with high parallelism and fast computation, benefiting from Ada Lovelace's architectural enhancements.

HPC / Simulation

Limited FP64 performance typical of GPUs not specifically designed for HPC, but can still support some HPC simulations with mixed precision.

Scientific Computing

While not optimized for FP64-heavy tasks, it can handle scientific computing workloads that benefit from mixed precision and parallel processing.

Edge Inference

Not ideal for edge inference due to higher power consumption and larger form factor, better suited for data center deployments.

Real-Time Serving

Excellent for real-time AI serving with low latency and high throughput, supported by advanced tensor cores and fast memory access.

Fine-Tuning

Highly efficient for full fine-tuning tasks, leveraging its large VRAM and compute capabilities to handle extensive model updates.

LoRA Efficiency

Efficient for LoRA fine-tuning, benefiting from lower VRAM requirements and the GPU's ability to perform rapid, iterative updates.
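The VRAM advantage comes from the trainable-parameter count: a rank-r LoRA adapter on a d_out × d_in weight matrix trains r × (d_in + d_out) parameters instead of d_in × d_out. A quick comparison (the hidden size and rank are illustrative):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable params for a rank-r adapter: A (rank x d_in) + B (d_out x rank)."""
    return rank * (d_in + d_out)

d = 4096      # illustrative hidden size
full = d * d  # full fine-tuning of one projection matrix
lora = lora_params(d, d, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full / lora:.0f}x")
```

Because only the adapter parameters need gradients and optimizer states, the ~16 bytes/param training overhead applies to a tiny fraction of the model, which is what makes single-GPU LoRA fine-tuning of mid-sized models practical in 48 GB.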

Market Authority

Cloud Adoption

Google Cloud publicly confirmed L4 adoption, but not L40S; no public confirmation for L40S by AWS, Azure, or other hyperscalers as of June 2024

Research Citations

Very limited; a handful of preprints and technical reports mention L40S, but not widespread in peer-reviewed literature

Community Benchmarks

Some independent benchmarks published on forums (e.g., ServeTheHome, Reddit) and vendor blogs, but no standardized or large-scale community benchmarks

GitHub Support

Minimal; a few repositories reference L40S in configuration files or README, but no major open-source frameworks list explicit L40S optimization or support

Enterprise Cases

NVIDIA has published select customer spotlights (e.g., for digital twin and visualization workloads), but no detailed, independently verified enterprise case studies

Key Strengths

The L40S excels in AI training and inference, high-performance computing, and rendering tasks. Its advanced architecture and enhanced tensor cores make it particularly effective for deep learning workloads, offering superior performance and efficiency. The GPU's capabilities in real-time ray tracing and graphics rendering also make it a strong choice for visual computing applications.

Limitations

While the L40S offers impressive performance, it may be overkill for less demanding applications, leading to underutilization. Its high power consumption and cooling requirements can be a consideration for energy-conscious deployments. Availability may be limited initially due to high demand and production constraints.

Expert Insight

The L40S represents a strategic middle ground in AI compute. When comparing cloud providers, consider not just the hourly rate, but also interconnect topology (PCIe peer-to-peer layout, InfiniBand or RoCE scale-out) and regional availability, which can significantly impact total cost of ownership for large-scale training.
