NVIDIA · Q3 2023

L40S NVL

The NVIDIA L40S NVL is a high-performance datacenter GPU designed for AI, machine learning, and high-performance computing workloads. Built on the Ada Lovelace architecture, it offers improved performance and efficiency over previous generations. Targeted at enterprise and cloud environments, it provides strong capabilities for large-scale AI model training and inference, making it a key component in modern AI infrastructure.

L40S NVL
VRAM: 48 GB
FP32: 91.6 TFLOPS
CUDA Cores: 18,176
TDP: 350 W

Provider Marketplace

Cheapest: from $1.58/hour
Best Value: from $1.58/hour
Enterprise Choice: from $1.58/hour

All Cloud Providers

1 option available
Atlantic.Net (Cheapest) · On-Demand · Global Availability
$1.58/hour (estimated cost)
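As a sanity check on the listed rate, a rough on-demand monthly estimate can be derived from the hourly price (assuming ~730 hours/month, no committed-use or reserved discounts):

```python
hourly = 1.58            # listed on-demand rate, $/hour
hours_per_month = 730    # ~24 h x 365 days / 12 months

on_demand = hourly * hours_per_month  # ≈ $1,153/month if left running 24/7
```

Actual spend depends on utilization; shutting instances down between jobs changes this dramatically.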

Compute Performance

FP64: 1.4 TFLOPS
FP32: 91.6 TFLOPS
TF32: 183.2 TFLOPS (Sparse), 91.6 TFLOPS (Dense)
FP16: 366.4 TFLOPS (Sparse), 183.2 TFLOPS (Dense)
BF16: 366.4 TFLOPS (Sparse), 183.2 TFLOPS (Dense)
FP8: Supported (Transformer Engine)
INT8: 733 TOPS (Sparse), 366 TOPS (Dense)
INT4: Not Supported
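The headline FP32 figure follows directly from the core count and clock: each CUDA core can retire one fused multiply-add (2 FLOPs) per cycle at peak. A minimal sketch, assuming a ~2.52 GHz boost clock:

```python
def peak_fp32_tflops(cuda_cores: int, boost_clock_ghz: float) -> float:
    # Each CUDA core retires one FMA (2 FLOPs) per cycle at peak.
    return cuda_cores * 2 * boost_clock_ghz / 1e3

tflops = peak_fp32_tflops(18_176, 2.52)  # ≈ 91.6 TFLOPS
```

This is a theoretical peak; sustained throughput depends on occupancy, memory traffic, and power limits.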

Architecture

Microarchitecture: Ada Lovelace
Process Node: TSMC 4N
Die Size: 608 mm²
Transistors: 76.3B
Compute Units: 142 SMs
Tensor Cores: 4th Gen, 568
RT Cores: 3rd Gen, 142
Matrix Engine: Transformer Engine (FP8/FP16/BF16)
Base Clock: 1,350 MHz
Boost Clock: 2,520 MHz
Transformer Engine: Yes
Sparse Acceleration: Supported (2:4 structured sparsity)
Dynamic Precision: Supported (FP8/FP16/BF16/TF32)

Memory & VRAM

Memory Type: GDDR6 with ECC
Total Capacity: 48 GB
Bandwidth: 864 GB/s
Bus Width: 384-bit
ECC Support: Yes (Inline)
Unified Memory: Yes (CUDA Unified Memory)
Compression
NUMA Awareness
Memory Pooling: Not supported (no NVLink)
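Peak memory bandwidth is just bus width (in bytes) times per-pin data rate. As an illustration, a 384-bit GDDR6 bus running at an assumed 18 Gbps per pin:

```python
def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    # Peak bandwidth = bus width in bytes x per-pin data rate.
    return bus_width_bits / 8 * data_rate_gbps

bw = peak_bandwidth_gb_s(384, 18.0)  # 864.0 GB/s
```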

Connectivity & Scaling

Interconnect: PCIe Gen 4 x16 (no NVLink or NVSwitch)
PCIe Interface: PCIe Gen 4 x16
CXL Support
Topology: PCIe peer-to-peer within the node; InfiniBand/RoCE v2 across nodes
Max GPUs/Node: 4
Scale-Out: Yes (InfiniBand NDR at up to 400 Gb/s per link, or RoCE v2)
GPUDirect RDMA: Yes
P2P Memory: Yes

Virtualization

MIG Support: Not Supported (MIG is limited to A100/A30/H100-class GPUs)
MIG Partitions: N/A
SR-IOV: Not Supported
vGPU Readiness: Supported (NVIDIA vGPU)
K8s Readiness: Certified (NVIDIA GPU Operator)
GPU Sharing: Time-Slicing, MPS, vGPU
Virt Efficiency: Near bare-metal (vendor claim)

Power & Efficiency

TDP: 350 W
Peak Power: 350–400 W
Idle Power: 40–60 W
Perf / Watt: ~1.05 TFLOPS FP16 (sparse) per W (366.4 TFLOPS / 350 W)
PSU Required: N/A
Connectors: 1x PCIe 16-pin (12VHPWR) or 2x PCIe 8-pin
Thermal Limits: Max GPU temperature 85°C
Efficiency: N/A
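For capacity planning, the TDP translates into a rough energy bill. A sketch assuming an illustrative $0.12/kWh rate and a card held at TDP all month:

```python
def monthly_power_cost_usd(tdp_w: float, usd_per_kwh: float = 0.12,
                           hours: float = 730) -> float:
    # Energy (kWh) x electricity price; assumes sustained full load.
    return tdp_w / 1000 * hours * usd_per_kwh

cost = monthly_power_cost_usd(350)  # ≈ $30.7/month per GPU at full load
```

Cooling overhead (PUE) typically adds 10–50% on top of the GPU's own draw.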

Physical Design

Form Factor: Dual-slot PCIe add-in card
FHFL: Yes (full height, full length)
Slot Width: Dual-slot
Dimensions: 267 mm x 112 mm
Weight: 1.8–2.2 kg
Cooling: Passive heatsink (server chassis airflow; OEM dependent)
Rack Density: Optimized for multi-GPU 2U/4U servers and high-density rack deployments

Thermals & Cooling

Airflow: Passive; relies on server chassis front-to-back airflow
Temp Range
Throttling: Standard thermal protection
Noise Level: Not Applicable (passive card)
Liquid Cooling: Not standard (OEM dependent)
DC Heat: High (plan rack-level cooling accordingly)

Software Ecosystem

CUDA: CUDA 12.x supported
ROCm: Not Supported
oneAPI: Not Supported
PyTorch: Officially supported
TensorFlow: Officially supported
JAX: Supported via CUDA backend
HuggingFace: Optimized (CUDA kernels available)
Triton Server: Supported
Docker: Official container images available
Compiler Stack: Mature CUDA compiler stack
Kernel Optim: Standard driver-based support
Driver Stability: Enterprise-grade stability

Server & Deployment

OEM Availability: Dell, HPE, Supermicro (Tier-1 OEMs)
Preconfigured: 2U/4U universal GPU servers
DGX/HGX: Not offered as part of DGX systems or HGX baseboards
Rack-Scale: InfiniBand scale-out
Edge Deploy: Suitable for edge deployments with moderate TDP considerations
Ref Architectures: NVIDIA MGX, OVX

System Compatibility

CPU Pairing: Standard x86/Arm server CPUs; no proprietary baseboard required
NUMA: Platform-specific NUMA topology; pin workloads to the GPU's local CPU socket for best memory locality
Required PCIe: PCIe Gen 4 x16 slot per GPU
Motherboard: Server platforms with PCIe Gen 4 x16 slots
Rack Power: Contact vendor for rack power planning
BIOS Limits
CXL Ready: Not Supported
OS Compat: RHEL, Ubuntu LTS, and Windows Server supported

Benchmarks & Throughput

Structured Sparsity

Supported (up to 2x vs dense)
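The 2:4 structured-sparsity format requires at most two non-zero values in every group of four consecutive weights; hardware can then skip the zeros for up to 2x throughput. A small illustrative checker (not the cuSPARSELt API):

```python
def is_2_4_sparse(weights):
    """True if every contiguous group of 4 values has at most 2 non-zeros."""
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        if sum(1 for w in group if w != 0) > 2:
            return False
    return True

is_2_4_sparse([0.5, 0, -1.2, 0,  0, 0.3, 0, 0.9])  # True: 2 non-zeros per group
```

Models are typically pruned to this pattern and then fine-tuned to recover accuracy.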

Transformer Throughput

Supported (Transformer Engine)

Multi-GPU Scalability

Scaling Efficiency

Single GPU: The L40S NVL operates efficiently as a standalone unit, leveraging its full PCIe bandwidth.
2-GPU: Scaling between two GPUs is limited by PCIe lane contention, with roughly 32 GB/s per direction over PCIe Gen 4 x16.
4-GPU: Scaling across four GPUs is further constrained by PCIe bandwidth, leading to diminishing returns as contention increases.
8-GPU: Without NVLink or NVSwitch, eight-GPU scaling is significantly limited by PCIe bandwidth, resulting in sub-linear speedups.
64+ GPU: At large scale, InfiniBand or Ethernet overhead becomes significant; efficient scaling requires optimized network configurations.

Scaling Characteristics

Cross-Node Latency: Cross-node communication is supported via GPUDirect RDMA, but latency depends on network topology and configuration.
Network Bottlenecks: The primary bottleneck is the host-to-device PCIe bridge, since the card lacks NVLink.
Parallelism: Supports data, model, pipeline, and tensor parallelism; compatible with frameworks such as DeepSpeed and Megatron-LM for distributed training.
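To see why PCIe-only scaling goes sub-linear, a back-of-envelope ring all-reduce model helps (assuming an illustrative ~25 GB/s effective per-direction PCIe Gen 4 bandwidth):

```python
def ring_allreduce_seconds(buffer_bytes: float, n_gpus: int,
                           link_gb_s: float) -> float:
    # A ring all-reduce moves 2*(n-1)/n of the buffer over each link,
    # so total time is gated by the slowest link in the ring.
    traffic = 2 * (n_gpus - 1) / n_gpus * buffer_bytes
    return traffic / (link_gb_s * 1e9)

# 1 GB of FP16 gradients, 4 GPUs, ~25 GB/s effective PCIe Gen 4 per direction:
t = ring_allreduce_seconds(1e9, 4, 25)  # ≈ 0.06 s of pure communication per step
```

On NVLink-class links (hundreds of GB/s) the same transfer would take an order of magnitude less, which is the scaling gap the section describes.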

Workload Readiness

LLM Training

The L40S NVL, based on the Ada Lovelace architecture, is well suited to large language model training thanks to its 48 GB of VRAM: a multi-GPU node can handle models up to roughly 70B parameters when optimizer state is sharded (e.g., ZeRO or FSDP). For 400B+ models, multi-node configurations are required.
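A rough sizing rule behind these recommendations: mixed-precision Adam needs about 16 bytes per parameter (2 for FP16 weights, 2 for gradients, 12 for FP32 master weights plus two optimizer moments), before activations. A sketch:

```python
import math

def training_gib(params_billions: float, bytes_per_param: int = 16) -> float:
    # Weights + gradients + optimizer state only; activations excluded.
    return params_billions * 1e9 * bytes_per_param / 2**30

def min_gpus(params_billions: float, vram_gib: int = 48) -> int:
    # Assumes perfect sharding across GPUs, so this is a floor, not a plan.
    return math.ceil(training_gib(params_billions) / vram_gib)

min_gpus(70)  # 70B params need ~1,043 GiB of state, i.e. 22+ 48 GB GPUs
```

Real jobs need headroom for activations and fragmentation, so practical counts run higher.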

LLM Inference

Highly efficient for inference tasks with excellent token-per-second throughput, thanks to 4th-gen Tensor cores and ample VRAM for KV cache management.
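The KV cache is what ample VRAM buys at inference time: keys and values for every layer, head, and token must stay resident. A sketch using an assumed 7B-class shape (32 layers, 32 KV heads of dim 128, FP16):

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    # K and V tensors per layer: 2 x kv_heads x head_dim x seq_len x batch.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 2**30

kv_cache_gib(32, 32, 128, 4096, 1)  # 2.0 GiB per 4k-token sequence
```

At batch 16 that grows to ~32 GiB, which is why cache size often caps concurrency before compute does.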

Vision Training

Optimized for vision training tasks with significant improvements in throughput and efficiency due to Ada Lovelace architecture and enhanced Tensor cores.

Diffusion Models

Well-suited for diffusion models, leveraging high VRAM and Tensor core capabilities to accelerate training and inference processes.

Multimodal AI

Capable of handling multimodal AI tasks efficiently, benefiting from the architecture's support for diverse data types and operations.

Reinforcement Learning

Effective for reinforcement learning workloads, offering fast computation and high throughput for complex simulations and model updates.

HPC / Simulation

Limited FP64 support; not ideal for HPC simulations requiring high double-precision performance, but can handle mixed-precision tasks efficiently.

Scientific Computing

Suitable for scientific computing tasks that can leverage mixed-precision calculations, but not optimal for those requiring extensive FP64 precision.

Edge Inference

Not ideal for edge inference due to higher power consumption and larger form factor, better suited for data center environments.

Real-Time Serving

Excellent for real-time AI serving, providing low latency and high throughput with advanced Tensor cores and Ada Lovelace architecture.

Fine-Tuning

Highly efficient for full fine-tuning tasks, leveraging high VRAM and advanced architecture to handle large model updates.

LoRA Efficiency

Efficient for LoRA fine-tuning, benefiting from lower VRAM requirements and optimized Tensor core performance.
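The VRAM saving comes from training only low-rank factors: for a d_in x d_out weight matrix, LoRA trains r x (d_in + d_out) parameters instead of d_in x d_out. A sketch with assumed 7B-class dimensions:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # A: (d_in x r) and B: (r x d_out) low-rank factors per adapted matrix.
    return rank * (d_in + d_out)

full = 4096 * 4096                  # one attention projection, 16.8M params
lora = lora_params(4096, 4096, 16)  # 131,072 trainable params, ~0.8% of full
```

Since optimizer state is only kept for the adapter weights, total training memory shrinks accordingly.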

Market Authority

Key Strengths

The L40S NVL excels in AI and machine learning workloads, particularly in training large neural networks and performing complex inference tasks. Its architecture provides significant performance improvements in FP16 and INT8 operations, making it ideal for deep learning applications. The GPU's high memory bandwidth and capacity also support data-intensive tasks, setting it apart from alternatives.

Limitations

While the L40S NVL offers strong performance, its power consumption and price are high, which may not suit every budget. Availability can be constrained by high demand in the AI and HPC sectors. Users should also plan adequate cooling to manage its thermal output effectively.

Expert Insight

The L40S represents a strategic leap in AI compute. When comparing cloud providers, consider not just the hourly rate, but also the interconnect bandwidth (InfiniBand/NVLink) and regional availability which can significantly impact total cost of ownership for large-scale training.

Glossary Terms

FP32 TFLOPS
VRAM
TDP
Cores
Information updated daily. Cloud pricing subject to vendor availability.