New — Amazon Athena for Apache Spark

When Jeff Barr first announced Amazon Athena in 2016, it changed my perspective on interacting with data. With Amazon Athena, I can interact with my data in just a few steps—starting from creating a table in Athena, loading data using connectors, and querying using the ANSI SQL standard.

Over time, various industries, such as financial services, healthcare, and retail, have needed to run more complex analyses for a variety of formats and sizes of data. To facilitate complex data analysis, organizations adopted Apache Spark. Apache Spark is a popular, open-source, distributed processing system designed to run fast analytics workloads for data of any size.

However, building the infrastructure to run Apache Spark for interactive applications is not easy. Customers need to provision, configure, and maintain that infrastructure in addition to building their applications. They also need to tune resources carefully to avoid slow application starts and idle costs.

Introducing Amazon Athena for Apache Spark
Today, I’m pleased to announce Amazon Athena for Apache Spark. With this feature, we can run Apache Spark workloads, use Jupyter Notebook as the interface to perform data processing on Athena, and programmatically interact with Spark applications using Athena APIs. We can start Apache Spark in under a second without having to manually provision the infrastructure.

Here’s a quick preview:

How It Works
Because Amazon Athena for Apache Spark is serverless, customers can perform interactive data exploration to gain insights without provisioning or maintaining resources to run Apache Spark. With this feature, customers can now build Apache Spark applications using the notebook experience directly from the Athena console or programmatically using APIs.

The following figure explains how this feature works:

On the Athena console, you can now run Spark applications with Python using Jupyter notebooks. In a notebook, customers can query data from various sources, perform multiple calculations, and visualize data using Spark applications without context switching.

Amazon Athena integrates with AWS Glue Data Catalog, which helps customers work with any data source in AWS Glue Data Catalog, including data in Amazon S3. This opens possibilities for customers to build applications that analyze and visualize data, explore data, and prepare data sets for machine learning pipelines.

As I demonstrated in the demo preview section, the workgroup running the Apache Spark engine takes under a second to initialize resources for interactive workloads. To make this possible, Amazon Athena for Apache Spark uses Firecracker, a lightweight micro-virtual machine technology, which allows for instant startup time and eliminates the need to maintain warm pools of resources. This benefits customers who want to perform interactive data exploration to get insights without having to prepare resources to run Apache Spark.

Get Started with Amazon Athena for Apache Spark
Let’s see how we can use Amazon Athena for Apache Spark. In this post, I will explain step by step how to get started with this feature.

The first step is to create a workgroup. In the context of Athena, a workgroup helps us to separate workloads between users and applications.

To create a workgroup, from the Athena dashboard, select Create workgroup.

On the next page, I give the name and description for this workgroup.

On the same page, I can choose Apache Spark as the engine for Athena. I also need to specify a service role with appropriate permissions to be used inside a Jupyter notebook. Then, I check Turn on example notebook, which makes it easy for me to get started with Apache Spark inside Athena. I also have the option to encrypt Jupyter notebooks with a key managed by Athena or with a key I have configured in AWS Key Management Service (AWS KMS).

After that, I need to define an Amazon Simple Storage Service (Amazon S3) bucket to store calculation results from the Jupyter notebook. Once I’m sure of all the configurations for this workgroup, I just have to select Create workgroup.

Now, I can see the workgroup already created in Athena.

To see the details of this workgroup, I can select the link from the workgroup. Since I checked Turn on example notebook when creating this workgroup, I have a Jupyter notebook to help me get started. Amazon Athena also gives me the flexibility to import existing notebooks from my laptop with Import file or to create new notebooks from scratch by selecting Create notebook.

When I select the Jupyter notebook example, I can start building my Apache Spark application.

When I run a Jupyter notebook, it automatically creates a session in the workgroup. Subsequently, each time I run a calculation inside the Jupyter notebook, all results are recorded in the session. This way, Athena provides me with full information to review each calculation by selecting Calculation ID, which takes me to the Calculation details page. Here, I can review the Code and also Results for the calculation.

In the session, I can adjust the Coordinator size and Executor size, with 1 data processing unit (DPU) by default. A DPU consists of 4 vCPU and 16 GB of RAM. Increasing the number of DPUs allows me to process tasks faster if I have complex calculations.
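To make the sizing concrete, here is a small sketch of the arithmetic, using only the figures above (4 vCPU and 16 GB of RAM per DPU). The function name and the example session shape are illustrative and not part of any Athena API.

```python
# Figures from the post: one DPU provides 4 vCPU and 16 GB of RAM.
VCPU_PER_DPU = 4
MEMORY_GB_PER_DPU = 16


def session_capacity(coordinator_dpus: int, executor_dpus: int, executor_count: int) -> dict:
    """Return total DPUs, vCPUs, and memory for a hypothetical session shape."""
    total_dpus = coordinator_dpus + executor_dpus * executor_count
    return {
        "dpus": total_dpus,
        "vcpus": total_dpus * VCPU_PER_DPU,
        "memory_gb": total_dpus * MEMORY_GB_PER_DPU,
    }


# Example: a 1-DPU coordinator with four 1-DPU executors (illustrative numbers).
print(session_capacity(coordinator_dpus=1, executor_dpus=1, executor_count=4))
```

With the default size of 1 DPU each, every additional executor adds 4 vCPU and 16 GB of RAM to the session.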

Programmatic API Access
In addition to using the Athena console, I can also use programmatic access to interact with the Spark application inside Athena. For example, I can create a workgroup with the create-work-group command, create a notebook with create-notebook, and start a notebook session with start-session.

Using programmatic access is useful when I need to execute commands such as building reports or computing data without having to open the Jupyter notebook.
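As a rough sketch of what the workgroup step might look like from Python, the helper below only builds the request payload and does not call AWS. It assumes the request shape of the Athena CreateWorkGroup API (Name, Description, and a Configuration with EngineVersion, ExecutionRole, and ResultConfiguration). The engine version string, the role ARN, and the bucket are placeholders I chose for illustration, so check them against the documentation before use.

```python
import json


def spark_workgroup_request(name, description, execution_role_arn, output_location,
                            engine_version="PySpark engine version 3"):
    """Build keyword arguments for a Spark-enabled workgroup (hypothetical helper).

    The default engine_version string is an assumption; verify the value that
    your Region currently accepts.
    """
    return {
        "Name": name,
        "Description": description,
        "Configuration": {
            "EngineVersion": {"SelectedEngineVersion": engine_version},
            # Service role used inside the notebook, as chosen in the console steps.
            "ExecutionRole": execution_role_arn,
            # S3 location for calculation results.
            "ResultConfiguration": {"OutputLocation": output_location},
        },
    }


# Placeholder values for illustration only.
request = spark_workgroup_request(
    name="demo-spark-workgroup",
    description="Workgroup for Spark notebooks",
    execution_role_arn="arn:aws:iam::111122223333:role/ExampleAthenaSparkRole",
    output_location="s3://example-bucket/athena-spark-results/",
)
print(json.dumps(request, indent=2))
```

If boto3 is installed and credentials are configured, the payload could then be passed along as `boto3.client("athena").create_work_group(**request)`.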

With the Jupyter notebook I created earlier, I can start a session by running the following command with the AWS CLI:

$ aws athena start-session \
    --work-group <WORKGROUP_NAME> \
    --engine-configuration '{"CoordinatorDpuSize": 1, "MaxConcurrentDpus": 20, "DefaultExecutorDpuSize": 1, "AdditionalConfigs": {"NotebookId": "<NOTEBOOK_ID>"}}' \
    --notebook-version "Jupyter 1" \
    --description "Starting session from CLI"

{
    "SessionId":"<SESSION_ID>",
    "State":"CREATED"
}
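The same request can be sketched in Python. This is a minimal example, assuming an Athena client object such as the one returned by `boto3.client("athena")`, whose `start_session` method mirrors the CLI flags above. The helper name and its defaults are my own, and the notebook version flag is left out for brevity.

```python
def start_spark_session(athena_client, work_group, notebook_id=None,
                        coordinator_dpus=1, executor_dpus=1, max_concurrent_dpus=20,
                        description="Starting session from Python"):
    """Start a session and return (session_id, state). Hypothetical helper."""
    engine_configuration = {
        "CoordinatorDpuSize": coordinator_dpus,
        "MaxConcurrentDpus": max_concurrent_dpus,
        "DefaultExecutorDpuSize": executor_dpus,
    }
    # Attach the session to an existing notebook only when an ID is given.
    if notebook_id is not None:
        engine_configuration["AdditionalConfigs"] = {"NotebookId": notebook_id}

    response = athena_client.start_session(
        WorkGroup=work_group,
        EngineConfiguration=engine_configuration,
        Description=description,
    )
    return response["SessionId"], response["State"]
```

Taking the client as a parameter keeps the helper easy to exercise with a stub object before pointing it at a real account.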

Then, I can run a calculation using the start-calculation-execution API.

$ aws athena start-calculation-execution \
    --session-id "<SESSION_ID>" \
    --description "Demo" \
    --code-block "print(5+6)"

{
    "CalculationExecutionId":"<CALCULATION_EXECUTION_ID>",
    "State":"CREATING"
}
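Because the call returns while the calculation is still in the CREATING state, a script usually has to wait for it to finish. The sketch below polls for a final state. It assumes a boto3-style Athena client with a `get_calculation_execution_status` method that returns the state under `Status.State`, and it treats COMPLETED, FAILED, and CANCELED as the final states. The helper name, poll interval, and timeout are illustrative choices.

```python
import time

# Assumed final states for a calculation execution.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELED"}


def wait_for_calculation(athena_client, calculation_execution_id,
                         poll_seconds=1.0, timeout_seconds=300.0):
    """Poll until the calculation reaches a final state, then return that state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = athena_client.get_calculation_execution_status(
            CalculationExecutionId=calculation_execution_id
        )
        state = response["Status"]["State"]
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"calculation {calculation_execution_id} still in state {state}"
            )
        time.sleep(poll_seconds)
```

A caller would pass the CalculationExecutionId from the response above and check that the returned state is COMPLETED before reading any results.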

In addition to passing code inline, I can use the --code-block flag to pass input from a Python file using the following command:

$ aws athena start-calculation-execution \
    --session-id "<SESSION_ID>" \
    --description "Demo" \
    --code-block file://<PYTHON_FILE>

{
    "CalculationExecutionId":"<CALCULATION_EXECUTION_ID>",
    "State":"CREATING"
}
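A Python counterpart to the file-based command might look like the following sketch. It assumes a boto3-style Athena client whose `start_calculation_execution` method accepts SessionId, Description, and CodeBlock, matching the CLI flags. The helper name is hypothetical.

```python
from pathlib import Path


def submit_python_file(athena_client, session_id, python_file, description="Demo"):
    """Read a local Python file and submit its contents as a calculation."""
    # The file contents are sent as the code block, like file:// in the CLI.
    code = Path(python_file).read_text(encoding="utf-8")
    response = athena_client.start_calculation_execution(
        SessionId=session_id,
        Description=description,
        CodeBlock=code,
    )
    return response["CalculationExecutionId"]
```

This keeps report or batch code in version-controlled files while still running it in an existing session.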

Pricing and Availability
Amazon Athena for Apache Spark is available today in the following AWS Regions: US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland). You are charged for the compute you use, measured in data processing unit (DPU) hours. For more information, see the Amazon Athena pricing page.
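As a back-of-the-envelope illustration of DPU-hour billing, the sketch below multiplies DPUs by hours by a per-DPU-hour rate. The rate in the example is a placeholder and not an actual Athena price, and the function does not model billing granularity or minimum charges.

```python
def estimate_compute_cost(dpus: float, hours: float, price_per_dpu_hour: float) -> float:
    """Estimate compute cost as DPUs x hours x price per DPU-hour."""
    if min(dpus, hours, price_per_dpu_hour) < 0:
        raise ValueError("inputs must be non-negative")
    return dpus * hours * price_per_dpu_hour


# Placeholder rate for illustration only; look up the real per-DPU-hour
# price for your Region on the pricing page.
PLACEHOLDER_PRICE_PER_DPU_HOUR = 1.00

# Example: 5 DPUs in use for half an hour at the placeholder rate.
print(estimate_compute_cost(dpus=5, hours=0.5,
                            price_per_dpu_hour=PLACEHOLDER_PRICE_PER_DPU_HOUR))
```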

To get started with this feature, see Amazon Athena for Apache Spark to learn more from the documentation, understand the pricing, and follow the step-by-step walkthrough.

Happy building,

— Donnie

from AWS News Blog https://aws.amazon.com/blogs/aws/new-amazon-athena-for-apache-spark/
