NVIDIA · August 2023

L40S

The NVIDIA L40S is a high-performance GPU designed for data-center environments, targeting AI workloads, graphics rendering, and virtualization. Built on the Ada Lovelace architecture, it offers improved performance and efficiency over previous generations and is tailored for enterprise applications, with robust support for AI and graphics-intensive tasks.

VRAM: 48 GB
FP32: 91.6 TFLOPS
CUDA Cores: 18,176

Provider Marketplace

Cheapest: from $1.67/hour
Best Value: from $1.67/hour
Enterprise Choice: from $1.67/hour
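For budgeting against the marketplace rates above, a minimal cost sketch (compute only; billing granularity, storage, and egress vary by provider, and the $1.67/hour figure is the listed "starting from" price):

```python
# Rough on-demand cost sketch for renting L40S GPUs at the listed rate.
HOURLY_RATE = 1.67  # USD/hour, the marketplace "starting from" price above

def rental_cost(hours: float, gpus: int = 1) -> float:
    """Return the on-demand compute cost in USD (compute only)."""
    return round(HOURLY_RATE * hours * gpus, 2)

print(rental_cost(24))      # one GPU for one day
print(rental_cost(730, 4))  # four GPUs for ~one month (730 hours)
```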

All Cloud Providers

1 option available

Vultr (Cheapest) · On-Demand · Global availability
Estimated cost: $1.67/hour

Compute Performance

FP64: 1.4 TFLOPS
FP32: 91.6 TFLOPS
TF32: 183.2 TFLOPS (Sparse), 91.6 TFLOPS (Dense)
FP16: 366.1 TFLOPS (Sparse), 183.1 TFLOPS (Dense)
BF16: 366.1 TFLOPS (Sparse), 183.1 TFLOPS (Dense)
FP8: 733.2 TFLOPS (Sparse), 366.6 TFLOPS (Dense)
INT8: 733.2 TOPS (Sparse), 366.6 TOPS (Dense)
INT4: Not Supported

Architecture

Microarchitecture: Ada Lovelace
Process Node: TSMC 4N
Die Size: 609 mm²
Transistors: 76.3B
Compute Units: 142 SMs
Tensor Cores: 4th Gen, 568 Tensor Cores
RT Cores: 3rd Gen, 142 RT Cores
Matrix Engine: Transformer Engine (FP8/FP16/BF16)
Base Clock: 1290 MHz
Boost Clock: 1980 MHz
Transformer Engine: Yes (Gen 4 Tensor Cores)
Sparse Acceleration: Supported (2:4 structured sparsity)
Dynamic Precision: Supported (FP8/FP16/BF16/TF32)

Memory & VRAM

Memory Type: GDDR6
Total Capacity: 48 GB
Bandwidth: 864 GB/s
Bus Width: 384-bit
HBM Stacks: None (GDDR6, not HBM)
ECC Support: Yes (Inline)
Unified Memory: Yes (CUDA Unified Memory)
Compression: Not Published
NUMA Awareness: Not Published
Memory Pooling: Not Supported
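A back-of-envelope check of which models fit in the 48 GB of GDDR6. Byte sizes per parameter are standard (FP16/BF16 = 2, FP32 = 4, INT8 = 1); the ~2 GB reserve for CUDA context and activations is an illustrative assumption, not a measured figure:

```python
# Does a model's weight footprint fit in the L40S's 48 GB of VRAM?
VRAM_GB = 48
RESERVE_GB = 2  # assumed headroom for CUDA context/activations

def weights_fit(params_billion: float, bytes_per_param: int) -> bool:
    weights_gb = params_billion * bytes_per_param  # 1e9 params * N bytes = N GB
    return weights_gb <= VRAM_GB - RESERVE_GB

print(weights_fit(7, 2))    # 7B model in FP16: 14 GB -> fits
print(weights_fit(70, 2))   # 70B model in FP16: 140 GB -> does not fit
```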

Connectivity & Scaling

Interconnect: PCIe
Generation: PCIe Gen 4
PCIe Bandwidth: 64 GB/s (bi-directional per card)
PCIe Interface: PCIe Gen 4 x16
CXL Support: Not Supported
Topology: PCIe peer-to-peer
Max GPUs/Node: 8 (OEM-certified systems)
Scale-Out: Yes (via PCIe/InfiniBand/RoCE)
GPUDirect RDMA: Yes
P2P Memory: Yes (PCIe BAR1, limited)
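The 64 GB/s bi-directional figure follows from PCIe Gen4 arithmetic: 16 GT/s per lane across 16 lanes. The marketing number uses the raw 2 GB/s/lane rate; after 128b/130b line coding the effective per-direction rate is slightly lower:

```python
# Deriving the PCIe Gen4 x16 bandwidth figure quoted above.
GT_PER_S = 16          # PCIe Gen4 transfer rate per lane
LANES = 16             # x16 slot
ENCODING = 128 / 130   # 128b/130b line-coding overhead

raw_per_dir = GT_PER_S * LANES / 8    # GB/s per direction, ignoring encoding
eff_per_dir = raw_per_dir * ENCODING  # GB/s per direction, after encoding

print(f"raw: {raw_per_dir:.0f} GB/s per direction ({2 * raw_per_dir:.0f} GB/s bi-directional)")
print(f"effective: {eff_per_dir:.1f} GB/s per direction")
```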

Virtualization

MIG Support: Not Supported
MIG Partitions: N/A
SR-IOV: Not Supported
vGPU Readiness: Supported (NVIDIA vGPU)
K8s Readiness: Certified (NVIDIA GPU Operator)
GPU Sharing: Time-Slicing, vGPU, MPS
Virt Efficiency: Near bare-metal (vendor claim)

Power & Efficiency

TDP: 350 W
Peak Power: 350-400 W
Idle Power: 40-60 W
Perf / Watt: ~0.26 TFLOPS FP32/W (91.6 TFLOPS ÷ 350 W)
PSU Required: N/A
Connectors: 1x 16-pin (12VHPWR) or 2x 8-pin PCIe
Thermal Limits: Max GPU temperature 85°C
Efficiency: N/A
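A quick sanity check of FP32 performance per watt from the spec values above (91.6 TFLOPS peak FP32 at the 350 W TDP); real efficiency depends on workload and clocks:

```python
# FP32 perf-per-watt from the rated peak throughput and TDP.
FP32_TFLOPS = 91.6
TDP_W = 350

tflops_per_watt = FP32_TFLOPS / TDP_W
print(f"{tflops_per_watt:.3f} TFLOPS FP32 per watt")
```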

Physical Design

Form Factor: PCIe Gen4 x16 add-in card
FHFL: Full Height, Full Length (FHFL)
Slot Width: Dual slot
Dimensions: 267 mm x 112 mm
Weight: 1.8–2.2 kg
Cooling: Passive
Rack Density: Designed for high-density GPU servers

Thermals & Cooling

Airflow: Requires front-to-back chassis airflow
Temp Range: 0°C to 45°C (ambient)
Throttling: Thermal-based clock reduction at Tjunction limit
Noise Level: Not Applicable (passive module)
Liquid Cooling: Not Supported (air-cooled only)
DC Heat: High (rack-scale deployment recommended)

Software Ecosystem

CUDA: CUDA 12.x supported
ROCm: Not Supported
oneAPI: Not Supported
PyTorch: Officially supported
TensorFlow: Officially supported
JAX: Supported via CUDA backend
HuggingFace: Optimized (CUDA kernels available)
Triton Server: Supported
Docker: Official container images available
Compiler Stack: Mature CUDA compiler stack
Kernel Optim: Standard driver-based support
Driver Stability: Enterprise-grade stability

Server & Deployment

OEM Availability: Tier-1 OEMs: Dell, HPE, Supermicro
Preconfigured: 2U/4U universal GPU servers
DGX/HGX: Not typically part of DGX or HGX systems
Rack-Scale: InfiniBand scale-out
Edge Deploy: Suitable for data-center deployments; TDP and form factor limit edge use
Ref Architectures: NVIDIA MGX, OVX

System Compatibility

CPU Pairing: Dual-socket Intel Xeon Scalable or AMD EPYC 7003/9004 class recommended
NUMA: Standard NUMA behavior
Required PCIe: PCIe Gen 4 x16 (Gen 5 x16 compatible)
Motherboard: Full-length, double-width PCIe x16 slot required; confirm mechanical and power support
Rack Power: Contact vendor for rack power planning
BIOS Limits: Resizable BAR and Above 4G Decoding required; SR-IOV support not published
CXL Ready: No CXL memory expansion
OS Compat: RHEL and Ubuntu LTS supported; Windows supported

Benchmarks & Throughput

Structured Sparsity

Supported (up to 2x vs dense)
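The "up to 2x" figure comes from the 2:4 pattern: in every group of four weights, the two smallest-magnitude values are pruned to zero, and the tensor cores skip the zeros (e.g., 91.6 TFLOPS dense TF32 becomes 183.2 TFLOPS sparse in the table above). A minimal sketch of the pruning pattern:

```python
# 2:4 structured sparsity: keep the two largest-magnitude weights in each
# group of four, zero the rest. Hardware then skips the zeroed elements.
def prune_2_4(group):
    """Apply the 2:4 pattern to a group of four weights."""
    keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
    return [w if i in keep else 0.0 for i, w in enumerate(group)]

print(prune_2_4([0.9, -0.1, 0.05, -0.7]))  # the two small weights are zeroed
```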

Transformer Throughput

Supported (Transformer Engine)

Multi-GPU Scalability

Scaling Efficiency

Single GPU: As a PCIe card, a single L40S operates within the ~32 GB/s-per-direction limit of its PCIe Gen4 x16 link.
2-GPU: Scaling across two L40S GPUs is limited by PCIe lane contention, with potential bottlenecks in peer-to-peer bandwidth.
4-GPU: Four-GPU scaling is further constrained by PCIe bandwidth, with diminishing returns as GPUs are added.
8-GPU: Eight-GPU configurations are significantly limited by PCIe bandwidth and lane contention, yielding sub-linear speedups.
64+ GPU: At large scale, InfiniBand or Ethernet overhead becomes significant, and careful network configuration is required.

Scaling Characteristics

Cross-Node Latency: Cross-node communication is supported via GPUDirect RDMA; latency depends on network configuration and bandwidth.
Network Bottlenecks: The primary bottleneck is the host-to-device PCIe bridge (no NVLink), compounded by VRAM pressure in large models.
Parallelism: Supports data, model, pipeline, and tensor parallelism; compatible with frameworks like DeepSpeed and Megatron-LM.
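To see why PCIe becomes the bottleneck in data-parallel training, a rough ring all-reduce timing sketch. Each GPU moves roughly 2·(N−1)/N times the gradient payload per step; the 25 GB/s effective rate is an assumed achievable fraction of the PCIe Gen4 link, not a benchmark result:

```python
# Ring all-reduce time estimate for gradient synchronization over PCIe.
def allreduce_seconds(grad_gb: float, n_gpus: int, bw_gbps: float = 25.0) -> float:
    """Time to all-reduce grad_gb gigabytes across n_gpus at bw_gbps GB/s."""
    traffic = 2 * (n_gpus - 1) / n_gpus * grad_gb  # GB moved per GPU (ring)
    return traffic / bw_gbps

# FP16 gradients for a 7B-parameter model are ~14 GB:
print(f"{allreduce_seconds(14, 4):.2f} s of communication per step over PCIe")
```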

Workload Readiness

LLM Training

The L40S, based on the Ada Lovelace architecture, is well suited to training small and mid-size models. With 48 GB of VRAM per card, a multi-GPU single-node setup (e.g., 8x L40S, 384 GB aggregate) can handle models up to roughly the 70B class using sharded optimizers or offloading; 400B+ models require multi-node configurations.

LLM Inference

Highly efficient for inference tasks with strong token-per-second performance, leveraging 4th-gen Tensor cores and ample VRAM for KV cache management.
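KV-cache pressure is what the "ample VRAM" claim is about: per token, each transformer layer stores a key and a value vector of the hidden size. A sizing sketch, where the 7B-class shape (32 layers, hidden size 4096) is an illustrative assumption:

```python
# FP16 KV-cache footprint for a decoder-only transformer.
def kv_cache_gb(layers: int, hidden: int, tokens: int, batch: int,
                bytes_per_el: int = 2) -> float:
    """Cache size in GB: 2 tensors (K and V) per layer per token."""
    return 2 * layers * hidden * tokens * batch * bytes_per_el / 1e9

# Eight concurrent 4k-token contexts on a 7B-class model:
print(f"{kv_cache_gb(32, 4096, 4096, 8):.1f} GB of KV cache")
```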

Vision Training

Excellent for vision training tasks, benefiting from Ada Lovelace's architecture enhancements and substantial VRAM, supporting large batch sizes and complex models.

Diffusion Models

Well-suited for diffusion models, offering fast training and inference capabilities due to its high computational throughput and advanced Tensor cores.

Multimodal AI

Capable of handling multimodal AI workloads efficiently, thanks to its robust architecture and large memory bandwidth, enabling seamless integration of diverse data types.

Reinforcement Learning

Effective for reinforcement learning, providing rapid model updates and environment interactions due to its high compute power and memory efficiency.

HPC / Simulation

Limited FP64 support typical of Ada Lovelace architecture, making it less ideal for double-precision HPC simulations but still viable for mixed-precision tasks.

Scientific Computing

Suitable for scientific computing tasks that can leverage mixed-precision calculations, though not optimal for those requiring extensive double-precision computations.

Edge Inference

Not ideal for edge inference due to higher TDP and larger form factor, better suited for data center deployments.

Real-Time Serving

Highly capable for real-time AI serving, with low latency and high throughput enabled by advanced Tensor cores and efficient architecture.

Fine-Tuning

Highly efficient for full fine-tuning tasks, leveraging its large VRAM to accommodate extensive model parameters and gradients.
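The VRAM demand of full fine-tuning is easy to tally: with Adam in mixed precision you hold FP16 weights and gradients plus FP32 master weights and two FP32 optimizer moments, about 16 bytes per parameter before activations. A minimal sketch of that accounting:

```python
# Mixed-precision Adam training state, in bytes per parameter:
# FP16 weights (2) + FP16 grads (2) + FP32 master weights (4) + m (4) + v (4)
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4

def train_state_gb(params_billion: float) -> float:
    """GB of weight/gradient/optimizer state, excluding activations."""
    return params_billion * BYTES_PER_PARAM

print(train_state_gb(1))  # ~16 GB per billion parameters
print(train_state_gb(3))  # a 3B model already fills a 48 GB card
```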

LoRA Efficiency

Efficient for LoRA fine-tuning, allowing for parameter-efficient training with reduced VRAM requirements, making it versatile for various model sizes.
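The "reduced VRAM requirements" come from LoRA's low-rank factorization: each adapted weight matrix W (d_out x d_in) is frozen and only a rank-r update B·A is trained. The layer shape below (4096x4096, rank 8) is an illustrative assumption:

```python
# Trainable-parameter count for a LoRA adapter on one weight matrix.
def lora_params(d_out: int, d_in: int, r: int) -> int:
    """B is d_out x r, A is r x d_in; only these are trained."""
    return d_out * r + r * d_in

full = 4096 * 4096
lora = lora_params(4096, 4096, 8)
print(f"full: {full:,}  lora: {lora:,}  reduction: {full // lora}x")
```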

Market Authority

Cloud Adoption

NVIDIA confirms the L40S is available on Google Cloud and Oracle Cloud Infrastructure

Research Citations

Limited; a handful of preprints and technical reports reference L40S, but not widespread in peer-reviewed literature

Community Benchmarks

Some independent benchmarks published by Lambda Labs and select cloud providers

GitHub Support

Initial support in major deep learning frameworks (PyTorch, TensorFlow) via CUDA compatibility; no widespread L40S-specific optimizations

Enterprise Cases

NVIDIA and partners (e.g., Dell, Supermicro) have published solution briefs and customer references highlighting L40S in enterprise AI and visualization workloads

Key Strengths

The L40S excels in AI training and inference, offering significant performance improvements for deep learning models. It is also highly effective for graphics rendering and virtualization, making it a versatile choice for mixed workloads in datacenters.

Limitations

While the L40S offers impressive performance, it may come at a higher cost compared to other GPUs in its class. Availability might be limited due to high demand, and users should ensure their systems can accommodate its power and cooling requirements.

Expert Insight

The L40S represents a strategic addition to the AI compute landscape. When comparing cloud providers, consider not just the hourly rate but also interconnect bandwidth (PCIe topology within a node, InfiniBand or RoCE between nodes; the L40S has no NVLink) and regional availability, which can significantly impact total cost of ownership for large-scale training.

Glossary Terms

FP32 TFLOPS
VRAM
TDP
Cores
Information updated daily. Cloud pricing subject to vendor availability.