AWS Promises AI Training at Half the Cost

AWS says its new Trn1 instances will reduce the cost of AI and ML implementation.

Amazon’s Seattle headquarters. (Source: Amazon.)

Earlier this week, Amazon Web Services (AWS) announced the availability of its Amazon Elastic Compute Cloud (EC2) Trn1 instances. AWS claims that training AI and machine learning models on Trn1 instances will cost half as much as on comparable GPU-based instances, allowing engineers to cut training time, iterate on models faster and improve accuracy when building applications such as natural language processing, speech and image recognition, semantic search, recommendation engines, forecasting and fraud detection.

EC2 Trn1 instances are powered by AWS-designed Trainium chips, which Amazon says are purpose-built for high-performance machine learning training.

Trn1 instances feature:

  • Up to 16 AWS Trainium accelerators
  • Up to 800 Gbps of networking bandwidth
  • NeuronLink, a high-speed, intra-instance interconnect for faster training
  • Up to 8 TB of local NVMe SSD storage
  • Supported data types: FP32, TF32, BF16, FP16, and configurable FP8

Trn1 instances are available in the US East (N. Virginia) and US West (Oregon) regions, with other regions coming soon. They can be purchased as On-Demand Instances, with Savings Plans, as Reserved Instances or as Spot Instances, and are also accessible through other AWS services including Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS) and AWS Batch.
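For readers who want to experiment, here is a minimal sketch of launching a Trn1 instance on demand with boto3, AWS's Python SDK. The AMI ID and key-pair name are placeholders; a real launch would typically use a Neuron-enabled AWS Deep Learning AMI in one of the supported regions, and the same instances can instead be reached through SageMaker, EKS, ECS or Batch.

```python
import boto3

# Minimal sketch: launch one Trn1 instance on demand.
# The AMI ID and key-pair name below are placeholders for illustration only.
ec2 = boto3.client("ec2", region_name="us-east-1")  # US East (N. Virginia)

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI ID
    InstanceType="trn1.32xlarge",     # 16 Trainium accelerators per instance
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",            # placeholder key pair
)
print(response["Instances"][0]["InstanceId"])
```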

As engineers are tasked with building smarter, more complex products, the need to train AI and machine learning models keeps growing. And as those models themselves become more complex, training them becomes more expensive and time-consuming. AWS also notes that, with the AWS Neuron SDK, customers can get started on Trn1 with only minor changes to their existing code.
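As a rough illustration of what those "minor changes" can look like, here is a minimal PyTorch training-loop sketch that targets Trainium through the torch-xla interface installed as part of the Neuron SDK's torch-neuronx setup. The model, data and hyperparameters are invented for illustration; the Trainium-specific changes are essentially the XLA device and the optimizer step call.

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm  # available via the Neuron SDK's torch-neuronx setup

# Key change from a stock GPU script: target the XLA device exposed by the
# Neuron runtime instead of "cuda".
device = xm.xla_device()

model = nn.Linear(128, 10).to(device)      # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for step in range(10):
    inputs = torch.randn(32, 128).to(device)        # synthetic batch
    labels = torch.randint(0, 10, (32,)).to(device)

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    # barrier=True marks the step so the accumulated XLA graph executes.
    xm.optimizer_step(optimizer, barrier=True)
```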

“Over the years we have seen machine learning go from a niche technology used by the largest enterprises to a core part of many of our customers’ businesses, and we expect machine learning training will rapidly make up a large portion of their compute needs,” said David Brown, vice president of Amazon EC2 at AWS, in a company press release. “AWS Trainium is our second-generation machine learning chip purpose built for high-performance training. Trn1 instances powered by AWS Trainium will help our customers reduce their training time from months to days, while being more cost efficient.”

The new instances have already made a difference in the development of tools that should turn an engineer’s head. Eric Steinberger, co-founder and CEO of Magic, a product and research company that develops AI for the workplace, had nothing but praise for the new offering.

“Training large autoregressive transformer-based models is an essential component of our work,” said Steinberger in the news release. “AWS Trainium-powered Trn1 instances are designed specifically for these workloads, offering near-infinite scalability, fast inter-node networking, and advanced support for 16-bit and 8-bit data types. Trn1 instances will help us train large models faster, at a lower cost.”

Written by

Shawn Wasserman

For over 10 years, Shawn Wasserman has informed, inspired and engaged the engineering community through online content. As a senior writer at WTWH media, he produces branded content to help engineers streamline their operations via new tools, technologies and software. While a senior editor at Engineering.com, Shawn wrote stories about CAE, simulation, PLM, CAD, IoT, AI and more. During his time as the blog manager at Ansys, Shawn produced content featuring stories, tips, tricks and interesting use cases for CAE technologies. Shawn holds a master’s degree in Bioengineering from the University of Guelph and an undergraduate degree in Chemical Engineering from the University of Waterloo.