Exploring Amazon Athena


A Technical Review

One of the most recent additions to the AWS Big Data toolbelt is Amazon Athena. As with the Lambda service, Athena moves closer to a pure pay-per-use model. You no longer need a cluster of servers, or even a single EC2 instance, to perform ad-hoc queries on your data. Instead, Athena charges only for the data scanned during the query, and items such as DDL statements and failed queries incur no charge at all.

Amazon Athena is built on Presto, an open-source distributed SQL query engine originally developed by Facebook. Unlike the full version of Presto, Amazon Athena does not require you to run your own cluster, such as Presto on EMR. One of the main differences between Athena and the full version of Presto is that you are creating a read-only set of tables over data located in S3 buckets. The Athena client is built right into the AWS console, so you can get up and running with ease. If your team is making heavier use of the Athena service for ad-hoc analysis of the data you are storing, there is also a JDBC driver that can be added to a number of SQL clients.

Setting up the driver is not a complex task, although it does require an AWS Access Key ID and Secret Access Key as connection parameters, so have both keys available during setup. To begin using the service, you create one or more external tables using Hive DDL statements.
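As a rough sketch, a Hive DDL statement for a CSV dataset in S3 might look like the following. The table name, columns, and bucket path here are placeholders for illustration, not a real dataset:

```sql
-- Hypothetical example: a read-only external table over CSV files in S3.
-- The table name, columns, and S3 location are placeholders for your own data.
CREATE EXTERNAL TABLE IF NOT EXISTS orders (
  order_id   STRING,
  customer   STRING,
  amount     DOUBLE,
  order_date STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://your-bucket/orders/';
```

Note that the DDL only registers metadata pointing at the S3 location; no data is loaded or copied, which is why DDL statements cost nothing to run.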

Athena supports a wide variety of data formats, including CSV, JSON, ORC, and Parquet. If the last two formats aren't familiar to you yet, I recommend becoming acquainted with them: ORC and Parquet are columnar formats. Because Athena charges for the data it scans, storing your data in a columnar format not only gives much better query performance but also reduces cost, since a query reads only the columns it needs rather than every row in full. To convert plain text, CSV, JSON, or XML files to ORC or Parquet, you will first need to spin up an EMR cluster. Once the cluster is up, write the files out in the new format using your tool of choice, such as Spark.
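If you prefer SQL over Spark code, one way to perform the conversion is with a pair of Hive statements run on the EMR cluster. This is only a sketch; it assumes a raw CSV table named `orders` already exists, and all table and bucket names are placeholders:

```sql
-- Hypothetical sketch: converting a raw CSV table to Parquet using Hive on EMR.
-- Assumes a source table named orders; names and locations are placeholders.
CREATE EXTERNAL TABLE IF NOT EXISTS orders_parquet (
  order_id   STRING,
  customer   STRING,
  amount     DOUBLE,
  order_date STRING
)
STORED AS PARQUET
LOCATION 's3://your-bucket/orders-parquet/';

-- Read the raw rows and rewrite them as Parquet files in the new location.
INSERT OVERWRITE TABLE orders_parquet
SELECT order_id, customer, amount, order_date
FROM orders;
```

Once the Parquet files land in S3, you can point an Athena external table at the new location and query the columnar copy instead of the raw text.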

Athena also supports Apache web logs. Even if you're just beginning to explore Big Data and only have access to your web logs, you can quickly start pushing those logs to an S3 bucket and begin analyzing trends. If you are wondering how to move the logs into S3, the complementary Kinesis Firehose service can handle that for you.

As more of your web logs are pushed to the S3 bucket, trends will start to emerge from the data itself. You can find the number of distinct browsers, the top client IP addresses, and even the times of day when your site sees the most traffic. As you build these queries in Athena, keep in mind that there is also a way to show off your findings in a more graphical format.
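The kinds of questions above translate into short ANSI SQL queries. These are hypothetical examples that assume an Athena table named `apache_logs` with `client_ip` and `user_agent` columns; adjust the names to match the DDL you created for your own logs:

```sql
-- Hypothetical queries over an Athena table of Apache access logs.
-- Table and column names (apache_logs, client_ip, user_agent) are assumptions.

-- Top ten client IP addresses by request count
SELECT client_ip, COUNT(*) AS requests
FROM apache_logs
GROUP BY client_ip
ORDER BY requests DESC
LIMIT 10;

-- Number of distinct browsers (user agents) seen
SELECT COUNT(DISTINCT user_agent) AS distinct_browsers
FROM apache_logs;
```

Because each query is billed by data scanned, queries like these become cheaper as well as faster if the log table is backed by Parquet or ORC rather than raw text.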

An exciting feature of Athena is its integration with Amazon QuickSight, AWS's visualization service. Once connected, QuickSight takes the ANSI SQL queries your team built in Athena and presents the results in a number of interactive charts. Athena is a great tool for teams that are beginning a data lake build, or that have data collected in S3 but don't want the complexity or cost of spinning up a full Elastic MapReduce (EMR) cluster. As wonderful as Athena is from both a pricing and an ease-of-querying perspective, it does have a few bumps, though they are easily resolved by an experienced Big Data team.

What do you think about Amazon Athena? Do you plan to use it?

Always be in the Know, Subscribe to the Relus Cloud Blog!