AWS introduces serverless options for big data analytics in the cloud.
Flexibility is the new buzzword when it comes to analytics at scale. As companies identify new ways to integrate analytics into their businesses and technical decision-making, data science and engineering teams want the flexibility to introduce end-to-end data pipelines when needed, without worrying about coordinating shared infrastructure. Many AI and ML-based analytics use highly variable data, and it can be easier for analytics teams to troubleshoot and try new tasks without the operational overhead involved with shared systems.
To meet this demand, Amazon Web Services (AWS) announced in 2021 that it would be introducing serverless options for some of its top services: Amazon EMR, Amazon MSK, and Amazon Redshift. In mid-July 2022, the company finally announced that the serverless products are generally available to the public. The key to these new offerings is their flexibility and independence from shared infrastructure, making it easier and more cost-effective for engineers to modernize their analytics pipelines without worrying about capacity planning.
“With these new serverless options, customers can run even the most variable and intermittent analytics workloads and expand the use of analytics throughout their organizations without worrying about provisioning or scaling capacity—or incurring excess cost,” said Swami Sivasubramanian, vice president of Database, Analytics, and Machine Learning at AWS.
The new serverless solutions join AWS’s existing serverless analytics offerings: the business intelligence tool Amazon QuickSight and the data integration service AWS Glue.
What Is Serverless Analytics?
Serverless analytics relies on the sharing economy, just like Uber or Airbnb. But instead of paying for a single car ride or a short lodging stay, a company pays for an individual analytics job execution.
Others have described serverless computing as being a bit like a car rental service. With a car rental service, you rent a vehicle to get to your destination, whether you need to drive for 10 minutes or 10 hours. Although you will drive the vehicle, you don’t need to pay for the car to be built or for its maintenance. You simply pay for the gas and the time you use the vehicle.
Serverless analytics works in much the same way. You determine what analytics job you want to run and pay for each workload that you ultimately execute. Some serverless solutions don’t even let clients define the resource they want to use for a workload. The biggest benefit with this approach is streamlining the analytics process, so that you no longer need to manage capacity. Instead, serverless services ensure that all jobs are executed in a highly available manner.
This differs from traditional analytics workflows, where data scientists and engineers need to manage and execute an array of workloads within on-premises resources, ensuring that capacity is not exceeded. With the recent shift to cloud-based storage and analytics, many companies are hoping to focus their time and effort on running increasingly complex analytics and interpreting insights from their data, as opposed to managing IT infrastructure. In many respects, serverless analytics has been touted as one of the key strategies for companies to realize the full potential of the cloud.
Amazon first offered serverless computing to its customers in 2014 with the introduction of AWS Lambda. With Lambda, customers no longer needed to provision or manage their compute resources, making it easier to scale and manage highly variable workloads. In recent years, comparable serverless offerings have emerged from most major cloud providers, including IBM Cloud Functions, Azure Functions, and Google Cloud Functions. Now, AWS is extending serverless options to some of its most popular services to make it even easier for companies to develop serverless analytics pipelines.
A Suite of Serverless Options from AWS
Building on the success of AWS Lambda, AWS seems determined to make serverless analytics as accessible as possible. Now, it is introducing serverless options for three of its big data analytics services.
First, Amazon EMR Serverless will allow engineers to run analytics using Apache Spark, Hive, or other open-source big data frameworks without needing to modify or manage any underlying infrastructure. With this solution, engineers define the framework they want to run. The serverless platform then automatically provisions, manages, and scales whatever compute and memory resources are needed. Analytics jobs can be submitted through the Amazon EMR API, the AWS Command Line Interface (CLI), or an integrated development environment such as Amazon EMR Studio.
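As a rough illustration of submitting a job through the API, the sketch below assembles a request for the EMR Serverless StartJobRun operation and, in a separate function, sends it with boto3. It assumes a Spark application has already been created. The application ID, role ARN, S3 paths, and the helper names `build_spark_job_request` and `submit_spark_job` are hypothetical placeholders, not values from the announcement.

```python
def build_spark_job_request(application_id, execution_role_arn,
                            script_uri, script_args=()):
    """Assemble keyword arguments for the EMR Serverless StartJobRun call."""
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "jobDriver": {
            "sparkSubmit": {
                # Location of the PySpark script (or JAR) to run.
                "entryPoint": script_uri,
                # Arguments passed through to the script itself.
                "entryPointArguments": list(script_args),
            }
        },
    }


def submit_spark_job(request, region_name="us-east-1"):
    """Submit the job and return its run ID (needs boto3 and AWS credentials)."""
    # Imported lazily so the request builder can be used without AWS access.
    import boto3

    client = boto3.client("emr-serverless", region_name=region_name)
    response = client.start_job_run(**request)
    return response["jobRunId"]


# Placeholder identifiers for illustration only; none of these resources exist.
request = build_spark_job_request(
    application_id="00example1234567",
    execution_role_arn="arn:aws:iam::123456789012:role/example-emr-serverless-role",
    script_uri="s3://example-bucket/scripts/wordcount.py",
    script_args=["s3://example-bucket/output/"],
)
```

Note that the request names only the script and an execution role. It carries no instance types or cluster sizes, which is the capacity planning the service takes over.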
Next is Amazon MSK Serverless, a new serverless option for Amazon Managed Streaming for Apache Kafka (Amazon MSK). It can be used for real-time data ingestion and streaming from IoT devices or anywhere else data is continuously generated. The serverless version will also automatically provision, manage, and scale clusters to support unpredictable real-time workloads without requiring any kind of capacity management. To use it, analytics teams can create a serverless cluster in the Amazon MSK console and connect new or existing Apache Kafka clients to it.
Finally, Amazon Redshift Serverless will help engineers analyze petabytes of data without worrying about managing clusters. Anyone already using Redshift clusters can also move them to the serverless option through the console or API without changing any of their existing applications.
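For a sense of what querying through the API can look like, here is a minimal sketch that runs a SQL statement against a serverless workgroup using the Redshift Data API. The helper name `run_query` and the workgroup and database names in the usage note are hypothetical. The client is passed in so that the function can be exercised without AWS access.

```python
import time

# Statuses after which the Data API will not change a statement's state again.
TERMINAL_STATES = {"FINISHED", "FAILED", "ABORTED"}


def run_query(client, workgroup, database, sql, poll_seconds=1.0):
    """Run a SQL statement on a Redshift Serverless workgroup and return its rows.

    `client` is expected to behave like boto3.client("redshift-data").
    """
    # WorkgroupName targets a serverless workgroup rather than a provisioned cluster.
    statement_id = client.execute_statement(
        WorkgroupName=workgroup, Database=database, Sql=sql
    )["Id"]

    # The Data API is asynchronous, so poll until the statement settles.
    while True:
        status = client.describe_statement(Id=statement_id)["Status"]
        if status in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)

    if status != "FINISHED":
        raise RuntimeError(f"statement {statement_id} ended with status {status}")

    # Only the first page of results is returned; pagination is omitted for brevity.
    return client.get_statement_result(Id=statement_id)["Records"]
```

With boto3 installed and credentials configured, a call might look like `run_query(boto3.client("redshift-data"), "example-workgroup", "dev", "SELECT 1")`. The sketch assumes the statement returns a result set, as a SELECT does.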
Together, the three services will expand serverless options within AWS and solidify the company’s reputation in serverless analytics.
Why Do We Need Serverless Analytics?
In its press release, AWS highlighted that many engineers are already satisfied with the existing EMR, MSK, and Redshift platforms thanks to their ability to be finely tuned to meet specific needs. However, many data scientists and engineers are now working with highly variable workloads that require a lot of manual input to execute. AWS appears to be targeting these customers with its new serverless solutions, offering to manage the underlying infrastructure so that companies can focus on running and interpreting diverse workloads, no matter their scale.
One of the big draws of these solutions will be the pay-per-use pricing, which will help companies of all sizes scale their analytics and manage volatility without modifying their existing infrastructure. For anyone working with rapidly changing workloads or unpredictable datasets, the real selling point seems to be this flexibility, along with no longer needing in-house expertise in managing compute resources.
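The pay-per-use trade-off can be made concrete with a little arithmetic. The sketch below compares a bill for capacity that runs all month with a bill for only the hours a workload actually runs, and computes the usage level at which the two are equal. All rates and hours are hypothetical illustration values, not AWS prices, and the function names are invented for this example.

```python
def always_on_cost(hourly_rate, hours_in_period):
    """Cost of capacity that is billed for the whole period, busy or idle."""
    return hourly_rate * hours_in_period


def pay_per_use_cost(hourly_rate, hours_used):
    """Cost when only the hours spent running workloads are billed."""
    return hourly_rate * hours_used


def break_even_hours(always_on_rate, per_use_rate, hours_in_period):
    """Hours of use at which pay-per-use costs the same as always-on capacity."""
    return always_on_rate * hours_in_period / per_use_rate


# Hypothetical figures: a 30-day month and made-up hourly rates, with the
# per-use rate set higher than the always-on rate to keep the comparison fair.
HOURS_IN_MONTH = 30 * 24
ALWAYS_ON_RATE = 1.0
PER_USE_RATE = 1.5

fixed = always_on_cost(ALWAYS_ON_RATE, HOURS_IN_MONTH)
bursty = pay_per_use_cost(PER_USE_RATE, hours_used=40)
threshold = break_even_hours(ALWAYS_ON_RATE, PER_USE_RATE, HOURS_IN_MONTH)
```

Under these assumed numbers, a workload that runs for 40 hours a month costs far less on a per-use basis than on capacity billed around the clock, while one that runs for more hours than `threshold` would be cheaper on always-on capacity.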
AWS highlighted several companies that are already using its serverless solutions to streamline and simplify their data workloads. For example, David Ortiz, the senior manager of engineering at Amobee, an advertising company, described how the company successfully utilizes the flexibility of Amazon EMR to scale its resources up or down based on workloads. But some of the company’s infrequent and heavy workloads disrupted its existing clusters. Instead of needing to create and manage new clusters for these jobs, the serverless solution lets Amobee use the CPU and memory resources it requires for these heavy workloads without modifying its infrastructure.
Glas Data, in contrast, is a data management company that services the agricultural industry and is currently using the serverless Amazon MSK platform. The company describes its real-time data generation as highly variable, with analyses and alerts requiring a huge range of computing capacity.
“This workload variability makes it difficult to predict which action will be taken at what time, causing us to monitor and adjust capacity constantly to avoid unexpected capacity constraints. Amazon MSK Serverless automatically scales capacity up and down based on workload requirements, removing the system administration overhead and freeing us up to develop our solution without worrying about memory and storage constraints or incurring excess costs,” said Robert Sanders, CTO and founder at Glas Data.
Finally, Huron, a consulting firm, uses Amazon Redshift Serverless to reduce data engineering latency and help the company get through its backlog of workloads. Harry Gollakota, a data engineer at Huron, highlighted that this lets his team spend more time gaining insights from their data, as opposed to just managing and running jobs.
Through these and other client testimonies, it becomes clear that the analytics industry is really investing in these solutions for their flexibility and independence from infrastructure. With highly variable, real-time data, a serverless solution can let companies focus on insights instead of managing capacity or dealing with operational overhead. It will be interesting to see how Google responds to this new suite of AWS serverless services and what the competition may look like for the serverless market within a few years.