NVIDIA · August 2023

L40S

PCIe Gen4 x16

The NVIDIA L40S is a high-performance datacenter GPU designed for AI, machine learning, and graphics-intensive workloads. It is part of the Ada Lovelace architecture, offering significant improvements in performance and efficiency over previous generations. Targeted at enterprise and cloud environments, the L40S excels in delivering accelerated computing power for demanding applications.

L40S PCIe Gen4 x16
VRAM: 48 GB
FP32: 91.6 TFLOPS
CUDA Cores: 18,176
TDP: 350 W

Compute Performance

FP64: 1.4 TFLOPS
FP32: 91.6 TFLOPS
TF32: 183.2 TFLOPS (Sparse), 91.6 TFLOPS (Dense)
FP16: 366.1 TFLOPS (Sparse), 183.1 TFLOPS (Dense)
BF16: 366.1 TFLOPS (Sparse), 183.1 TFLOPS (Dense)
FP8: Supported (4th-gen Tensor Cores with Transformer Engine)
INT8: 733.2 TOPS (Sparse), 366.6 TOPS (Dense)
INT4: Not Supported
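As a sanity check, the headline FP32 figure follows from CUDA cores × 2 FLOPs per cycle (one FMA) × boost clock. A minimal sketch, assuming the commonly cited 2,520 MHz boost clock for the L40S:

```python
# Peak FP32 throughput: one FMA (2 FLOPs) per CUDA core per clock.
CUDA_CORES = 18_176
BOOST_CLOCK_HZ = 2.52e9  # assumed 2,520 MHz boost clock

peak_fp32_tflops = CUDA_CORES * 2 * BOOST_CLOCK_HZ / 1e12
print(f"{peak_fp32_tflops:.1f} TFLOPS")  # ~91.6 TFLOPS
```

The same cores × ops × clock arithmetic underlies the Tensor Core figures, scaled by the per-SM matrix throughput of each precision.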

Architecture

Microarchitecture: Ada Lovelace
Process Node: TSMC 4N
Die Size: 609 mm²
Transistors: 76.3B
Compute Units: 142 SMs
Tensor Cores: 4th Gen, 568 Tensor Cores
RT Cores: 3rd Gen, 142 RT Cores
Matrix Engine: Tensor Core (FP8/FP16/BF16/INT8)
Base Clock: 1,350 MHz
Boost Clock: 2,520 MHz
Transformer Engine: Yes (FP8, 4th-gen Tensor Cores)
Sparse Acceleration: Supported (2:4 structured sparsity)
Dynamic Precision: Supported (FP8/FP16/BF16/TF32)
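The 2:4 structured-sparsity scheme keeps the two largest-magnitude values in every group of four weights, letting the Tensor Cores skip the zeroed pair. A hypothetical pure-Python sketch of the pruning pattern (illustrative only, not NVIDIA's implementation):

```python
def prune_2_4(weights):
    """Zero the two smallest-magnitude entries in each group of 4 (2:4 sparsity)."""
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two smallest-magnitude values in this group of four.
        drop = sorted(range(len(group)), key=lambda j: abs(group[j]))[:2]
        pruned.extend(0.0 if j in drop else w for j, w in enumerate(group))
    return pruned

row = [0.9, -0.1, 0.05, -0.8, 0.2, 0.3, -0.7, 0.01]
print(prune_2_4(row))  # [0.9, 0.0, 0.0, -0.8, 0.0, 0.3, -0.7, 0.0]
```

Exactly 50% of the entries become zero in a fixed pattern, which is what allows the hardware to achieve up to the 2x "sparse" rates listed above.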

Memory & VRAM

Memory Type: GDDR6
Total Capacity: 48 GB
Bandwidth: 864 GB/s
Bus Width: 384-bit
HBM Stacks: N/A (GDDR6, no HBM)
ECC Support: Yes (Inline)
Unified Memory: Yes (CUDA Unified Memory)
Compression: Not Published
NUMA Awareness: Not Published
Memory Pooling: Not Supported
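The 864 GB/s figure is consistent with the 384-bit bus: bandwidth = bus width × per-pin data rate / 8. A quick check, assuming 18 Gbps GDDR6 pins:

```python
BUS_WIDTH_BITS = 384
PIN_RATE_GBPS = 18  # assumed GDDR6 data rate per pin

bandwidth_gbs = BUS_WIDTH_BITS * PIN_RATE_GBPS / 8
print(f"{bandwidth_gbs:.0f} GB/s")  # 864 GB/s
```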

Connectivity & Scaling

Interconnect: PCIe
Generation: PCIe Gen4
I/O Bandwidth: ~32 GB/s per direction (~64 GB/s bidirectional)
PCIe Interface: Gen4 x16
CXL Support: Not Supported
Topology: PCIe switch or direct PCIe peer-to-peer
Max GPUs/Node: 8 (system dependent)
Scale-Out: Yes, via PCIe-based networking (InfiniBand, RoCE v2)
GPUDirect RDMA: Yes
P2P Memory: Yes, via PCIe BAR1
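The quoted PCIe figure can be derived from lane count and signaling rate: Gen4 runs at 16 GT/s per lane with 128b/130b line encoding. A sketch of the arithmetic:

```python
LANES = 16
GT_PER_S = 16            # PCIe Gen4 signaling rate per lane
ENCODING = 128 / 130     # 128b/130b line-encoding overhead

# Per-direction raw bandwidth in GB/s (before protocol overhead).
per_direction = LANES * GT_PER_S * ENCODING / 8
print(f"{per_direction:.1f} GB/s per direction")  # ~31.5 GB/s
```

Real-world throughput is a few percent lower still once TLP/DLLP protocol overhead is accounted for.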

Virtualization

MIG Support: Not Supported
MIG Partitions: N/A
SR-IOV: Not Supported
vGPU Readiness: Supported (NVIDIA vGPU)
K8s Readiness: Certified (NVIDIA GPU Operator)
GPU Sharing: Time-Slicing, vGPU, MPS
Virt Efficiency: Near bare-metal (vendor claim)
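Since the L40S lacks MIG, sharing a card among multiple Kubernetes pods is typically done via time-slicing through the NVIDIA GPU Operator's device-plugin config. A sketch of that config (the replica count is an illustrative choice; consult the GPU Operator documentation for the authoritative schema):

```yaml
# Example time-slicing config for the NVIDIA device plugin (illustrative).
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4   # expose each physical L40S as 4 schedulable GPUs
```

Note that time-slicing provides no memory isolation between workloads, unlike MIG on MIG-capable GPUs.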

Power & Efficiency

TDP: 350 W
Peak Power: 320-350 W
Idle Power: 35-45 W
Perf / Watt: ~0.26 TFLOPS FP32 per watt (91.6 TFLOPS at 350 W)
PSU Required: Not Published
Connectors: 1x 16-pin PCIe (CEM5)
Thermal Limits: Max GPU temperature 85°C
Efficiency: N/A

Physical Design

Form Factor: PCIe Gen4 x16
FHFL: Full Height, Full Length
Slot Width: Dual slot
Dimensions: 267 mm x 112 mm
Weight: 1.5–1.8 kg
Cooling: Passive
Rack Density: Standard PCIe server GPU density

Thermals & Cooling

Airflow: Requires front-to-back chassis airflow
Temp Range: 0°C to 45°C
Throttling: Thermal-based clock reduction at Tjunction limit
Noise Level: Not Applicable (passive module)
Liquid Cooling: Not available (air-cooled only)
DC Heat: High (rack-scale deployment recommended)

Software Ecosystem

CUDA: CUDA 12.x supported
ROCm: Not Supported
oneAPI: Not Supported
PyTorch: Officially supported
TensorFlow: Officially supported
JAX: Supported via CUDA backend
HuggingFace: Optimized (CUDA kernels available)
Triton Server: Supported
Docker: Official container images available
Compiler Stack: Mature CUDA compiler stack
Kernel Optim: Upstream Linux support via NVIDIA datacenter drivers
Driver Stability: Enterprise-grade stability

Server & Deployment

OEM Availability: Tier-1 OEMs (Dell, HPE, Supermicro)
Preconfigured: 2U/4U universal GPU servers
DGX/HGX: Not offered in DGX systems or on HGX baseboards
Rack-Scale: InfiniBand scale-out
Edge Deploy: Deployable at the edge, though the 350 W TDP and passive cooling require server-class chassis airflow
Ref Architectures: NVIDIA MGX, OVX

System Compatibility

CPU Pairing: Dual-socket Intel Xeon Scalable or AMD EPYC 7003/9004 class recommended
NUMA: Standard NUMA behavior
Required PCIe: PCIe Gen4 x16
Motherboard: Full-length, double-width PCIe Gen4 x16 slot required
Rack Power: Contact vendor for rack power planning
BIOS Limits: Resizable BAR and Above 4G Decoding recommended; SR-IOV support Not Published
CXL Ready: No CXL memory expansion
OS Compat: RHEL, Ubuntu LTS, and Windows supported

Benchmarks & Throughput

Structured Sparsity

Supported (up to 2x vs dense)

Transformer Throughput

Supported (Transformer Engine)

Multi-GPU Scalability

Scaling Efficiency

Single GPU: The L40S offers high efficiency for single-GPU workloads, with the full ~32 GB/s per-direction bandwidth of PCIe Gen4 x16 available to the device.
2-GPU: Scaling to two GPUs is feasible but limited by PCIe lane contention, with potential bottlenecks in P2P communication.
4-GPU: Four-GPU scaling is constrained by PCIe Gen4 bandwidth, with diminishing returns from increased contention and limited P2P bandwidth.
8-GPU: Eight-GPU scaling is significantly limited by PCIe bandwidth, yielding sub-linear gains due to contention and the lack of NVLink.
64+ GPU: At large scales, InfiniBand or Ethernet overhead becomes significant, and PCIe limitations further compound scaling inefficiencies.

Scaling Characteristics

Cross-Node Latency: Cross-node communication is supported via GPUDirect RDMA, but latency is higher than on NVLink-class systems because all traffic traverses PCIe and the NIC.
Network Bottlenecks: The primary bottleneck is the host PCIe bridge and the lack of NVLink, which limits P2P bandwidth and can increase VRAM pressure.
Parallelism: Supports Data, Model, Pipeline, and Tensor Parallelism, compatible with frameworks like DeepSpeed and Megatron.

Workload Readiness

LLM Training

With 48 GB of VRAM per GPU and a PCIe Gen4 x16 interface (no NVLink), the L40S is best suited to single-node training of small and mid-sized models. Training models in the tens of billions of parameters requires sharding optimizer state across many GPUs (e.g., ZeRO or FSDP), typically in multi-node configurations over InfiniBand or RoCE.
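The practical model-size limit can be estimated with the usual mixed-precision rule of thumb of ~16 bytes per parameter (FP16 weights and gradients plus FP32 Adam states), before activations. A sketch, treating the 16 bytes/param figure as an assumption:

```python
import math

BYTES_PER_PARAM = 16  # assumed: fp16 weights (2) + fp16 grads (2) + fp32 Adam states (12)
VRAM_GB = 48          # per L40S

def min_gpus(params_billion: float) -> int:
    """Lower bound on GPUs needed to hold weights + optimizer states (no activations)."""
    needed_gb = params_billion * BYTES_PER_PARAM  # 1e9 params * bytes / 1e9 = GB
    return math.ceil(needed_gb / VRAM_GB)

for size in (7, 13, 70):
    print(f"{size}B params -> >= {min_gpus(size)} GPUs (states only)")
```

A 70B model needs roughly 1.1 TB for weights and optimizer states alone, i.e. two dozen L40S cards at minimum, which is why multi-node sharding is the realistic configuration at that scale.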

LLM Inference

Highly efficient for LLM inference with strong token-per-second performance, thanks to 4th-gen Tensor cores and ample VRAM for KV cache management.
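A rough sizing formula shows how the 48 GB of VRAM translates into KV-cache headroom: size = 2 (K and V) × layers × kv_heads × head_dim × seq_len × batch × bytes per element. A sketch with illustrative 7B-class model shapes (assumptions, not vendor data):

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per=2):
    """KV-cache size in GB: K and V (x2) per layer, per head, per position."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per / 1e9

# Illustrative 7B-class shapes: 32 layers, 32 KV heads, head_dim 128, fp16 cache.
print(f"{kv_cache_gb(32, 32, 128, seq_len=4096, batch=8):.1f} GB")
```

At those shapes a batch of 8 requests at 4K context consumes ~17 GB of cache, leaving room for the weights of a quantized or fp16 7B model within the 48 GB budget.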

Vision Training

Optimized for vision training tasks with Ada Lovelace architecture, providing excellent throughput and efficiency for large-scale image datasets.

Diffusion Models

Well-suited for diffusion models due to its high computational throughput and advanced tensor core capabilities, enabling efficient model training and inference.

Multimodal AI

Capable of handling multimodal AI workloads effectively, leveraging its robust architecture and VRAM to manage complex data types and models.

Reinforcement Learning

Supports reinforcement learning tasks with high parallelism and fast computation, benefiting from Ada Lovelace's architectural enhancements.

HPC / Simulation

Limited FP64 performance typical of GPUs not specifically designed for HPC, but can still support some HPC simulations with mixed precision.

Scientific Computing

While not optimized for FP64-heavy tasks, it can handle scientific computing workloads that benefit from mixed precision and parallel processing.

Edge Inference

Not ideal for edge inference due to higher power consumption and larger form factor, better suited for data center deployments.

Real-Time Serving

Excellent for real-time AI serving with low latency and high throughput, supported by advanced tensor cores and fast memory access.

Fine-Tuning

Highly efficient for full fine-tuning tasks, leveraging its large VRAM and compute capabilities to handle extensive model updates.

LoRA Efficiency

Efficient for LoRA fine-tuning, benefiting from lower VRAM requirements and the GPU's ability to perform rapid, iterative updates.
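The VRAM advantage comes from the trainable-parameter count: a rank-r LoRA adapter on a d_out × d_in weight matrix trains r × (d_in + d_out) parameters instead of d_in × d_out. A quick comparison (the hidden size and rank are illustrative):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable params for a rank-r adapter: A (rank x d_in) + B (d_out x rank)."""
    return rank * (d_in + d_out)

d = 4096      # illustrative hidden size
full = d * d  # full fine-tuning of one projection matrix
lora = lora_params(d, d, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full / lora:.0f}x")
```

Because only the adapter parameters need gradients and optimizer states, the ~16 bytes/param training overhead applies to a tiny fraction of the model, which is what makes single-GPU LoRA fine-tuning of mid-sized models practical in 48 GB.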

Market Authority

Cloud Adoption

Google Cloud publicly confirmed L4 adoption, but not L40S; no public confirmation for L40S by AWS, Azure, or other hyperscalers as of June 2024

Research Citations

Very limited; a handful of preprints and technical reports mention L40S, but not widespread in peer-reviewed literature

Community Benchmarks

Some independent benchmarks published on forums (e.g., ServeTheHome, Reddit) and vendor blogs, but no standardized or large-scale community benchmarks

GitHub Support

Minimal; a few repositories reference L40S in configuration files or README, but no major open-source frameworks list explicit L40S optimization or support

Enterprise Cases

NVIDIA has published select customer spotlights (e.g., for digital twin and visualization workloads), but no detailed, independently verified enterprise case studies

Key Strengths

The L40S excels in AI training and inference, high-performance computing, and rendering tasks. Its advanced architecture and enhanced tensor cores make it particularly effective for deep learning workloads, offering superior performance and efficiency. The GPU's capabilities in real-time ray tracing and graphics rendering also make it a strong choice for visual computing applications.

Limitations

While the L40S offers impressive performance, it may be overkill for less demanding applications, leading to underutilization. Its high power consumption and cooling requirements can be a consideration for energy-conscious deployments. Availability may be limited initially due to high demand and production constraints.

Expert Insight

The L40S represents a strategic middle ground in AI compute. When comparing cloud providers, consider not just the hourly rate, but also interconnect topology (PCIe peer-to-peer layout, InfiniBand or RoCE scale-out) and regional availability, which can significantly impact total cost of ownership for large-scale training.
