NVIDIA · August 2023

L40S

The NVIDIA L40S is a high-performance GPU designed for data-center environments, targeting AI workloads, graphics rendering, and virtualization. Built on the Ada Lovelace architecture, it offers improved performance and efficiency over previous generations and is tailored for enterprise applications, with robust support for AI and graphics-intensive tasks.

VRAM: 48 GB
FP32: 91.6 TFLOPS
CUDA Cores: 18,176

Provider Marketplace

Cheapest: from $1.67/hour
Best Value: from $1.67/hour
Enterprise Choice: from $1.67/hour
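For budgeting against the marketplace rates above, a minimal cost sketch (compute only; billing granularity, storage, and egress vary by provider, and the $1.67/hour figure is the listed "starting from" price):

```python
# Rough on-demand cost sketch for renting L40S GPUs at the listed rate.
HOURLY_RATE = 1.67  # USD/hour, the marketplace "starting from" price above

def rental_cost(hours: float, gpus: int = 1) -> float:
    """Return the on-demand compute cost in USD (compute only)."""
    return round(HOURLY_RATE * hours * gpus, 2)

print(rental_cost(24))      # one GPU for one day
print(rental_cost(730, 4))  # four GPUs for ~one month (730 hours)
```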

All Cloud Providers

1 option available

Vultr (Cheapest) · On-Demand · Global availability
Estimated cost: $1.67/hour

Compute Performance

FP64: 1.4 TFLOPS
FP32: 91.6 TFLOPS
TF32: 183.2 TFLOPS (Sparse), 91.6 TFLOPS (Dense)
FP16: 366.1 TFLOPS (Sparse), 183.1 TFLOPS (Dense)
BF16: 366.1 TFLOPS (Sparse), 183.1 TFLOPS (Dense)
FP8: 733.2 TFLOPS (Sparse), 366.6 TFLOPS (Dense)
INT8: 733.2 TOPS (Sparse), 366.6 TOPS (Dense)
INT4: Not Supported

Architecture

Microarchitecture: Ada Lovelace
Process Node: TSMC 4N
Die Size: 609 mm²
Transistors: 76.3B
Compute Units: 142 SMs
Tensor Cores: 4th Gen, 568 Tensor Cores
RT Cores: 3rd Gen, 142 RT Cores
Matrix Engine: Transformer Engine (FP8/FP16/BF16)
Base Clock: 1290 MHz
Boost Clock: 1980 MHz
Transformer Engine: Yes (Gen 4 Tensor Cores)
Sparse Acceleration: Supported (2:4 structured sparsity)
Dynamic Precision: Supported (FP8/FP16/BF16/TF32)

Memory & VRAM

Memory Type: GDDR6
Total Capacity: 48 GB
Bandwidth: 864 GB/s
Bus Width: 384-bit
HBM Stacks: None (GDDR6, not HBM)
ECC Support: Yes (Inline)
Unified Memory: Yes (CUDA Unified Memory)
Compression: Not Published
NUMA Awareness: Not Published
Memory Pooling: Not Supported
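A back-of-envelope check of which models fit in the 48 GB of GDDR6. Byte sizes per parameter are standard (FP16/BF16 = 2, FP32 = 4, INT8 = 1); the ~2 GB reserve for CUDA context and activations is an illustrative assumption, not a measured figure:

```python
# Does a model's weight footprint fit in the L40S's 48 GB of VRAM?
VRAM_GB = 48
RESERVE_GB = 2  # assumed headroom for CUDA context/activations

def weights_fit(params_billion: float, bytes_per_param: int) -> bool:
    weights_gb = params_billion * bytes_per_param  # 1e9 params * N bytes = N GB
    return weights_gb <= VRAM_GB - RESERVE_GB

print(weights_fit(7, 2))    # 7B model in FP16: 14 GB -> fits
print(weights_fit(70, 2))   # 70B model in FP16: 140 GB -> does not fit
```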

Connectivity & Scaling

Interconnect: PCIe
Generation: PCIe Gen 4
PCIe Bandwidth: 64 GB/s (bi-directional per card)
PCIe Interface: PCIe Gen 4 x16
CXL Support: Not Supported
Topology: PCIe peer-to-peer
Max GPUs/Node: 8 (OEM-certified systems)
Scale-Out: Yes (via PCIe/InfiniBand/RoCE)
GPUDirect RDMA: Yes
P2P Memory: Yes (PCIe BAR1, limited)
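The 64 GB/s bi-directional figure follows from PCIe Gen4 arithmetic: 16 GT/s per lane across 16 lanes. The marketing number uses the raw 2 GB/s/lane rate; after 128b/130b line coding the effective per-direction rate is slightly lower:

```python
# Deriving the PCIe Gen4 x16 bandwidth figure quoted above.
GT_PER_S = 16          # PCIe Gen4 transfer rate per lane
LANES = 16             # x16 slot
ENCODING = 128 / 130   # 128b/130b line-coding overhead

raw_per_dir = GT_PER_S * LANES / 8    # GB/s per direction, ignoring encoding
eff_per_dir = raw_per_dir * ENCODING  # GB/s per direction, after encoding

print(f"raw: {raw_per_dir:.0f} GB/s per direction ({2 * raw_per_dir:.0f} GB/s bi-directional)")
print(f"effective: {eff_per_dir:.1f} GB/s per direction")
```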

Virtualization

MIG Support: Not Supported
MIG Partitions: N/A
SR-IOV: Not Supported
vGPU Readiness: Supported (NVIDIA vGPU)
K8s Readiness: Certified (NVIDIA GPU Operator)
GPU Sharing: Time-Slicing, vGPU, MPS
Virt Efficiency: Near bare-metal (vendor claim)

Power & Efficiency

TDP: 350 W
Peak Power: 350-400 W
Idle Power: 40-60 W
Perf / Watt: ~0.26 TFLOPS FP32/W (91.6 TFLOPS ÷ 350 W)
PSU Required: N/A
Connectors: 1x 16-pin (12VHPWR) or 2x 8-pin PCIe
Thermal Limits: Max GPU temperature 85°C
Efficiency: N/A
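A quick sanity check of FP32 performance per watt from the spec values above (91.6 TFLOPS peak FP32 at the 350 W TDP); real efficiency depends on workload and clocks:

```python
# FP32 perf-per-watt from the rated peak throughput and TDP.
FP32_TFLOPS = 91.6
TDP_W = 350

tflops_per_watt = FP32_TFLOPS / TDP_W
print(f"{tflops_per_watt:.3f} TFLOPS FP32 per watt")
```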

Physical Design

Form Factor: PCIe Gen4 x16 add-in card
FHFL: Full Height, Full Length (FHFL)
Slot Width: Dual slot
Dimensions: 267 mm x 112 mm
Weight: 1.8–2.2 kg
Cooling: Passive
Rack Density: Designed for high-density GPU servers

Thermals & Cooling

Airflow: Requires front-to-back chassis airflow
Temp Range: 0°C to 45°C (ambient)
Throttling: Thermal-based clock reduction at Tjunction limit
Noise Level: Not Applicable (passive module)
Liquid Cooling: Not Supported (air-cooled only)
DC Heat: High (rack-scale deployment recommended)

Software Ecosystem

CUDA: CUDA 12.x supported
ROCm: Not Supported
oneAPI: Not Supported
PyTorch: Officially supported
TensorFlow: Officially supported
JAX: Supported via CUDA backend
HuggingFace: Optimized (CUDA kernels available)
Triton Server: Supported
Docker: Official container images available
Compiler Stack: Mature CUDA compiler stack
Kernel Optim: Standard driver-based support
Driver Stability: Enterprise-grade stability

Server & Deployment

OEM Availability: Tier-1 OEMs: Dell, HPE, Supermicro
Preconfigured: 2U/4U universal GPU servers
DGX/HGX: Not typically part of DGX or HGX systems
Rack-Scale: InfiniBand scale-out
Edge Deploy: Suitable for data-center deployments; TDP and form factor limit edge use
Ref Architectures: NVIDIA MGX, OVX

System Compatibility

CPU Pairing: Dual-socket Intel Xeon Scalable or AMD EPYC 7003/9004 class recommended
NUMA: Standard NUMA behavior
Required PCIe: PCIe Gen 4 x16 (Gen 5 x16 compatible)
Motherboard: Full-length, double-width PCIe x16 slot required; confirm mechanical and power support
Rack Power: Contact vendor for rack power planning
BIOS Limits: Resizable BAR and Above 4G Decoding required; SR-IOV support not published
CXL Ready: No CXL memory expansion
OS Compat: RHEL and Ubuntu LTS supported; Windows supported

Benchmarks & Throughput

Structured Sparsity

Supported (up to 2x vs dense)
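The "up to 2x" figure comes from the 2:4 pattern: in every group of four weights, the two smallest-magnitude values are pruned to zero, and the tensor cores skip the zeros (e.g., 91.6 TFLOPS dense TF32 becomes 183.2 TFLOPS sparse in the table above). A minimal sketch of the pruning pattern:

```python
# 2:4 structured sparsity: keep the two largest-magnitude weights in each
# group of four, zero the rest. Hardware then skips the zeroed elements.
def prune_2_4(group):
    """Apply the 2:4 pattern to a group of four weights."""
    keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
    return [w if i in keep else 0.0 for i, w in enumerate(group)]

print(prune_2_4([0.9, -0.1, 0.05, -0.7]))  # the two small weights are zeroed
```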

Transformer Throughput

Supported (Transformer Engine)

Multi-GPU Scalability

Scaling Efficiency

Single GPU: As a PCIe card, a single L40S operates within the ~32 GB/s-per-direction limit of its PCIe Gen4 x16 link.
2-GPU: Scaling across two L40S GPUs is limited by PCIe lane contention, with potential bottlenecks in peer-to-peer bandwidth.
4-GPU: Four-GPU scaling is further constrained by PCIe bandwidth, with diminishing returns as GPUs are added.
8-GPU: Eight-GPU configurations are significantly limited by PCIe bandwidth and lane contention, yielding sub-linear speedups.
64+ GPU: At large scale, InfiniBand or Ethernet overhead becomes significant, and careful network configuration is required.

Scaling Characteristics

Cross-Node Latency: Cross-node communication is supported via GPUDirect RDMA; latency depends on network configuration and bandwidth.
Network Bottlenecks: The primary bottleneck is the host-to-device PCIe bridge (no NVLink), compounded by VRAM pressure in large models.
Parallelism: Supports data, model, pipeline, and tensor parallelism; compatible with frameworks like DeepSpeed and Megatron-LM.
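To see why PCIe becomes the bottleneck in data-parallel training, a rough ring all-reduce timing sketch. Each GPU moves roughly 2·(N−1)/N times the gradient payload per step; the 25 GB/s effective rate is an assumed achievable fraction of the PCIe Gen4 link, not a benchmark result:

```python
# Ring all-reduce time estimate for gradient synchronization over PCIe.
def allreduce_seconds(grad_gb: float, n_gpus: int, bw_gbps: float = 25.0) -> float:
    """Time to all-reduce grad_gb gigabytes across n_gpus at bw_gbps GB/s."""
    traffic = 2 * (n_gpus - 1) / n_gpus * grad_gb  # GB moved per GPU (ring)
    return traffic / bw_gbps

# FP16 gradients for a 7B-parameter model are ~14 GB:
print(f"{allreduce_seconds(14, 4):.2f} s of communication per step over PCIe")
```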

Workload Readiness

LLM Training

The L40S, based on the Ada Lovelace architecture, is well suited to training small and mid-size models. With 48 GB of VRAM per card, a multi-GPU single-node setup (e.g., 8x L40S, 384 GB aggregate) can handle models up to roughly the 70B class using sharded optimizers or offloading; 400B+ models require multi-node configurations.

LLM Inference

Highly efficient for inference tasks with strong token-per-second performance, leveraging 4th-gen Tensor cores and ample VRAM for KV cache management.
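KV-cache pressure is what the "ample VRAM" claim is about: per token, each transformer layer stores a key and a value vector of the hidden size. A sizing sketch, where the 7B-class shape (32 layers, hidden size 4096) is an illustrative assumption:

```python
# FP16 KV-cache footprint for a decoder-only transformer.
def kv_cache_gb(layers: int, hidden: int, tokens: int, batch: int,
                bytes_per_el: int = 2) -> float:
    """Cache size in GB: 2 tensors (K and V) per layer per token."""
    return 2 * layers * hidden * tokens * batch * bytes_per_el / 1e9

# Eight concurrent 4k-token contexts on a 7B-class model:
print(f"{kv_cache_gb(32, 4096, 4096, 8):.1f} GB of KV cache")
```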

Vision Training

Excellent for vision training tasks, benefiting from Ada Lovelace's architecture enhancements and substantial VRAM, supporting large batch sizes and complex models.

Diffusion Models

Well-suited for diffusion models, offering fast training and inference capabilities due to its high computational throughput and advanced Tensor cores.

Multimodal AI

Capable of handling multimodal AI workloads efficiently, thanks to its robust architecture and large memory bandwidth, enabling seamless integration of diverse data types.

Reinforcement Learning

Effective for reinforcement learning, providing rapid model updates and environment interactions due to its high compute power and memory efficiency.

HPC / Simulation

Limited FP64 support typical of Ada Lovelace architecture, making it less ideal for double-precision HPC simulations but still viable for mixed-precision tasks.

Scientific Computing

Suitable for scientific computing tasks that can leverage mixed-precision calculations, though not optimal for those requiring extensive double-precision computations.

Edge Inference

Not ideal for edge inference due to higher TDP and larger form factor, better suited for data center deployments.

Real-Time Serving

Highly capable for real-time AI serving, with low latency and high throughput enabled by advanced Tensor cores and efficient architecture.

Fine-Tuning

Highly efficient for full fine-tuning tasks, leveraging its large VRAM to accommodate extensive model parameters and gradients.
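The VRAM demand of full fine-tuning is easy to tally: with Adam in mixed precision you hold FP16 weights and gradients plus FP32 master weights and two FP32 optimizer moments, about 16 bytes per parameter before activations. A minimal sketch of that accounting:

```python
# Mixed-precision Adam training state, in bytes per parameter:
# FP16 weights (2) + FP16 grads (2) + FP32 master weights (4) + m (4) + v (4)
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4

def train_state_gb(params_billion: float) -> float:
    """GB of weight/gradient/optimizer state, excluding activations."""
    return params_billion * BYTES_PER_PARAM

print(train_state_gb(1))  # ~16 GB per billion parameters
print(train_state_gb(3))  # a 3B model already fills a 48 GB card
```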

LoRA Efficiency

Efficient for LoRA fine-tuning, allowing for parameter-efficient training with reduced VRAM requirements, making it versatile for various model sizes.
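The "reduced VRAM requirements" come from LoRA's low-rank factorization: each adapted weight matrix W (d_out x d_in) is frozen and only a rank-r update B·A is trained. The layer shape below (4096x4096, rank 8) is an illustrative assumption:

```python
# Trainable-parameter count for a LoRA adapter on one weight matrix.
def lora_params(d_out: int, d_in: int, r: int) -> int:
    """B is d_out x r, A is r x d_in; only these are trained."""
    return d_out * r + r * d_in

full = 4096 * 4096
lora = lora_params(4096, 4096, 8)
print(f"full: {full:,}  lora: {lora:,}  reduction: {full // lora}x")
```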

Market Authority

Cloud Adoption

NVIDIA confirms the L40S is available on Google Cloud and Oracle Cloud Infrastructure

Research Citations

Limited; a handful of preprints and technical reports reference L40S, but not widespread in peer-reviewed literature

Community Benchmarks

Some independent benchmarks published by Lambda Labs and select cloud providers

GitHub Support

Initial support in major deep learning frameworks (PyTorch, TensorFlow) via CUDA compatibility; no widespread L40S-specific optimizations

Enterprise Cases

NVIDIA and partners (e.g., Dell, Supermicro) have published solution briefs and customer references highlighting L40S in enterprise AI and visualization workloads

Key Strengths

The L40S excels in AI training and inference, offering significant performance improvements for deep learning models. It is also highly effective for graphics rendering and virtualization, making it a versatile choice for mixed workloads in datacenters.

Limitations

While the L40S offers impressive performance, it may come at a higher cost compared to other GPUs in its class. Availability might be limited due to high demand, and users should ensure their systems can accommodate its power and cooling requirements.

Expert Insight

The L40S represents a strategic addition to the AI compute landscape. When comparing cloud providers, consider not just the hourly rate but also interconnect bandwidth (PCIe topology within a node, InfiniBand or RoCE between nodes; the L40S has no NVLink) and regional availability, which can significantly impact total cost of ownership for large-scale training.

Glossary Terms

FP32 TFLOPS
VRAM
TDP
Cores
Information updated daily. Cloud pricing subject to vendor availability.