Amazon FSx for Lustre increases throughput to GPU instances by up to 12x

Published

Today, we are announcing support for Elastic Fabric Adapter (EFA) and NVIDIA GPUDirect Storage (GDS) on Amazon FSx for Lustre. EFA is a network interface for Amazon EC2 instances that makes it possible to run applications requiring high levels of inter-node communications at scale. GDS is a technology that creates a direct data path between local or remote storage and GPU memory. With these enhancements, Amazon FSx for Lustre with EFA/GDS support provides up to 12 times higher (up to 1200 Gbps) per-client throughput compared to the previous FSx for Lustre version.

You can use FSx for Lustre to build and run the most performance demanding applications, such as deep learning training, drug discovery, financial modeling, and autonomous vehicle development. As datasets grow and new technologies emerge, you can adopt increasingly powerful GPU and HPC instances such as Amazon EC2 P5, Trn1, and Hpc7a. Until now, when accessing FSx for Lustre file systems, the use of traditional TCP networking limited throughput to 100 Gbps for individual client instances. This adoption is driving the need for FSx for Lustre file systems to provide the performance necessary to optimally utilize the increasing network bandwidth of these cutting-edge EC2 instances when accessing large datasets.

With EFA and GDS support in FSx for Lustre, you can now achieve up to 1,500 Gbps throughput per client instance (fifteen times more throughput than previously) when using P5 GPU instances and NVIDIA CUDA in your applications.

With this new capability, you can fully utilize the network bandwidth of the most powerful compute instances and accelerate your machine learning (ML) and HPC workloads. EFA enhances performance by bypassing the operating system and using the AWS Scalable Reliable Datagram (SRD) protocol to optimize data transfer. GDS further improves performance by enabling direct data transfer between the file system and GPU memory, bypassing the CPU and eliminating redundant memory copies.

Let’s see how this works in practice.

Creating an Amazon FSx for Lustre file system with EFA enabled
To get started, in the Amazon FSx console, I choose Create file system and then Amazon FSx for Lustre.

I enter a name for the file system. In the Deployment and storage type section, I select Persistent, SSD and the new with EFA enabled option. I select 1000 MB/s/TiB in the Throughput per unit of storage section. With these settings, I enter 4.8 TiB for Storage capacity, which is the minimum supported with these settings.

Console screenshot.

For networking, I use the default virtual private cloud (VPC) and an EFA-enabled security group. I leave all other options to their default values.

Console screenshot.

I review all the options and proceed to create the file system. After a few minutes, the file system is ready to be used.

Mounting an Amazon FSx for Lustre file system with EFA enabled from an Amazon EC2 instance
In the Amazon EC2 console, I choose Launch instance, enter a name for the instance, and select the Ubuntu Amazon Machine Image (AMI). For Instance type, I select trn1.32xlarge.

Console screenshot.

In Network settings, I edit the default settings and select the same subnet used by the FSx Lustre file system. In Firewall (security groups), I select three existing security groups: the EFA-enabled security group used by the FSx for Lustre file system, the default security group, and a security group that provides Secure Shell (SSH) access.

Console screenshot.

In Advanced network configuration, I select ENA and EFA as Interface type. Without this setting, the instance would use traditional TCP networking and the connection with the FSx for Lustre file system would still be limited to 100 Gbps in throughput.

Console screenshot.

To have more throughput, I can add more EFA network interfaces, depending on the instance type.

I launch the instance and, when the instance is ready, I connect using EC2 Instance Connect and follow the instructions for installing the Lustre client in the FSx for Lustre User Guide and configuring EFA clients.

Then, I follow the instructions for mounting an FSx for Lustre file system from an EC2 instance.

I create a folder to use as mount point:

sudo mkdir -p /fsx

I select the file system in the FSx console and lookup the DNS name and Mount name. Using these values, I mount the file system:

sudo mount -t lustre -o relatime,flock file_system_dns_name@tcp:/mountname /fsx

EFA is automatically used when you access an EFA-enabled file system from client instances that support EFA and are using Lustre version 2.15 or higher.

Things to know
EFA and GDS support is available today with no additional cost on new Amazon FSx for Lustre file systems in all AWS Regions where persistent 2 is offered. FSx for Lustre automatically uses EFA when customers access an EFA-enabled file system from client instances that support EFA, without requiring any additional configuration. For a list of EC2 client instances that support EFA, see supported instance types in the Amazon EC2 User Guide. This network specifications table describes network bandwidths and EFA support for instance types in the accelerated computing category.

To use EFA-enabled instances with FSx for Lustre file systems, you must use Lustre 2.15 clients on Ubuntu 22.04 with kernel 6.8 or higher.

Note that your client instances and your file systems must be located in the same subnet within your Amazon Virtual Private Cloud (Amazon VPC) connection.

GDS is automatically supported on EFA-enabled file systems. To use GDS with your FSx for Lustre file systems, you need the NVIDIA Compute Unified Device Architecture (CUDA) package, the open source NVIDIA driver, and the NVIDIA GPUDirect Storage Driver installed on your client instance. These packages come preinstalled on the AWS Deep Learning AMI. You can then use your CUDA-enabled application to use GPUDirect storage for data transfer between your file system and GPUs.

When planning your deployment, note that EFA-enabled file systems have larger minimum storage capacity increments than file systems that are not EFA-enabled. For instance, if you choose the 1,000 MB/s/TiB throughput tier, the minimum storage capacity for EFA-enabled file systems starts at 4.8 TiB as compared to 1.2TB for FSx for Lustre file systems not enabling EFA. If you’re looking to migrate your existing workloads, you can use AWS DataSync to move your data from an existing file system to a new one that supports EFA and GDS.

For maximum flexibility, FSx for Lustre maintains compatibility with both EFA and non-EFA workloads. When accessing an EFA-enabled file system, traffic from non-EFA client instances automatically flows over traditional TCP/IP networking using Elastic Network Adapter (ENA), allowing seamless access for all workloads without any additional configuration.

To learn more about EFA and GDS support on FSx for Lustre, including detailed setup instructions and best practices, visit the Amazon FSx for Lustre documentation. Get started today and experience the fastest storage performance available for your GPU instances in the cloud.

Danilo

Update 11/27: post updated to reflect 12x throughput

from AWS News Blog https://aws.amazon.com/blogs/aws/amazon-fsx-for-lustre-unlocks-full-network-bandwidth-and-gpu-performance/

Leave a comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.