Athena 

Next let's dive into Amazon Athena.

It's an interactive query engine for your data in an S3 data lake, and

it's completely serverless, which is really interesting.

So what is Athena? Its official definition is an interactive query service for S3.

So basically it's a SQL interface to your data being stored in some S3 data lake.

There's no need for you to load your data from S3 into Athena.

The data actually stays in S3, and Athena just knows how to interpret that data and query it interactively.

Under the hood, it's using Presto.

You might remember Presto from our Elastic MapReduce lectures; Athena is basically a highly customized

and preconfigured Presto instance, with a nice little user interface on top of it.

But the great thing about Athena is that it's completely serverless so you do not need to manage any servers.

You don't need to provision servers, and you don't even have to think about how it works.

You just think about using it really.

Athena supports a wide variety of data formats that might reside in your S3 Buckets.

Those include, and these might be important to remember: CSV,

JSON, ORC, Parquet, and Avro.

And this is as good a time as any to get into how these different formats differ from each other,

because depending on your application, a different format might make sense.

So the good thing about CSV (comma-separated values) and its cousin TSV (tab-separated values)

is that they're human-readable.

You can look at them in a text editor and make sense out of the data that's in there very easily.

It's just one line per row of data.

Every line contains a bunch of fields separated by commas.

Easy peasy right.

Same thing with JSON.

JSON has a bit more structure to it, in that you can have more hierarchical data inside of it, but it's

still human-readable; it's still one row per document, and each document is something you can look

at and understand.

But if you want to do things at scale and do things more efficiently, you should be looking

at non-human-readable formats such as ORC and Parquet.

ORC and Parquet are both examples of columnar formats that are also splittable.

So instead of organizing data by rows it's organizing it by each column.

So if you have an application that's doing queries on your data based on specific columns, let's

say a use case where you're always looking up records by, I don't know, user ID or something,

that columnar storage makes it very efficient to go and retrieve the data for specific column values.

The other advantage of ORC and Parquet is that they are splittable.

So even though they're organized and compressed in very clever ways, they're still splittable.

So in a big data setting those files can still be split into chunks that can be distributed across an

entire cluster.

So you can still have a massive ORC or Parquet file and still have the ability for your cluster

to split that data up and process it across different instances.

Avro is also an example of a splittable file format, but it is not columnar, and it is not human-readable.

Instead it uses a more row-based storage, for cases where you might be looking at an entire row's worth of data at

a time.

Generally speaking, though, it is still splittable, and it's important to remember that these are all splittable

formats, because that can be important on the exam.
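To make the row-versus-columnar distinction concrete, here's a toy Python sketch (this is not how Athena is implemented, and the table and its data are made up) that counts how many bytes a query would have to scan in each layout when it only needs one column:

```python
# Toy model: the same table stored row-wise (like CSV) vs. column-wise
# (like ORC/Parquet). We count bytes "scanned" for a query that only
# needs the user_id column. All names and data here are made up.

rows = [
    {"user_id": "u1", "name": "Alice", "bio": "x" * 100},
    {"user_id": "u2", "name": "Bob",   "bio": "y" * 100},
    {"user_id": "u3", "name": "Carol", "bio": "z" * 100},
]

# Row-oriented layout: a query must read every field of every row.
row_scan = sum(len(str(v)) for r in rows for v in r.values())

# Column-oriented layout: a query on user_id reads only that column.
columns = {k: [r[k] for r in rows] for k in rows[0]}
col_scan = sum(len(v) for v in columns["user_id"])

print(row_scan, col_scan)  # the columnar layout scans far fewer bytes
```

The columnar layout only touches the one column it needs, which is exactly why these formats end up scanning, and therefore costing, so much less.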

All right let's put that little diversion aside and get back to Athena.

So another important point is that Athena doesn't really care whether your data in S3 is structured, semi-structured,

or unstructured; it can work with Glue and the Glue Data Catalog to impart structure on that

data and make it something that you can query with a SQL command.

Some example uses given by Amazon include the ad hoc querying of web logs; Athena is presented

as a better way to query web log data in S3 than, say, Elasticsearch.

You can also use it for querying staging data before loading it into Redshift.

So maybe you have a bunch of data being dumped into S3, and you want to transform it and load it into

a big Redshift data warehouse.

But maybe you want to play around with that data and investigate it and see what it looks like beforehand.

Athena might be a good way of sort of getting a bigger picture of what that data looks like before you

actually commit it into a data warehouse.

It's also useful for looking at other logs besides web logs in S3, including CloudTrail logs, CloudFront

logs, VPC Flow Logs, and Elastic Load Balancing logs; whatever you have, if it's in S3, Athena can query it.
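To give a flavor of what such an ad hoc query looks like, here's a hypothetical Athena SQL statement over a web-log table, held in a Python string you could paste into the query editor (the table name and columns are illustrative assumptions; in practice they'd come from your Glue Data Catalog):

```python
# A hypothetical ad hoc query over web logs sitting in S3. The table
# "weblogs" and its columns are made-up examples, not a real schema.
query = """
SELECT status, COUNT(*) AS hits
FROM weblogs
WHERE year = '2023'
GROUP BY status
ORDER BY hits DESC
LIMIT 10
"""
print(query.strip())
```

Standard SQL like this is all Athena needs; there's no cluster to spin up before you run it.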

It also offers integration with notebook tools like Jupyter, Zeppelin, and RStudio, because you

can just treat it like a database.

It has ODBC and JDBC interfaces so you can treat Athena like any other relational database that you might

interact with or integrate with.

That also includes QuickSight.

You can also integrate Amazon's QuickSight visualization tool with Athena, using Athena as sort of

the glue (well, "glue" is a poor choice of word here, because we're actually using the Glue service for this too)

between your unstructured data in S3 and a more structured visualization or analysis

tool such as QuickSight.



So speaking of Glue, let's talk about how Athena integrates with AWS Glue to actually impart structure

on your unstructured data in S3, so that it can query it like a database.

So again, here's a brief recap of how Glue works.

You might have a Glue crawler populating the Glue Data Catalog for your S3 data; it looks at

what's stored in S3 and tries to extract column and table definitions out of it for you, and you can

use the Glue console to refine those definitions as needed.

Once you have a Glue Data Catalog published for your S3 data, Athena will see it automatically, and

it can build a table from it automatically as well.

So any time Athena sees something in your Glue Data Catalog in your account, it's going to make a table

for that for you, so you can query it just like you would any other SQL database.

And it's not just Athena that can use that Glue Data Catalog either; it allows other analytics

tools to visualize or analyze that data as well.

For example: RDS, Redshift, Redshift Spectrum, EMR, and any application that's compatible with an Apache Hive

metastore, because remember, the Glue Data Catalog can be used as a Hive metastore too.
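The table definition the crawler publishes is essentially a Hive-style external table pointing at your S3 location. As a rough sketch (the bucket, table, and columns are all hypothetical examples), it looks something like this DDL, held here in a Python string:

```python
# Hive-style external table DDL of the kind Athena sees via the Glue
# Data Catalog. The bucket, table, and columns are hypothetical.
ddl = """
CREATE EXTERNAL TABLE weblogs (
    ip     STRING,
    ts     STRING,
    status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/weblogs/'
"""
print(ddl.strip())
```

Note that the table is just metadata over the S3 location; dropping it never touches the underlying data.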

But in this example we're using it to expose that table definition to Amazon Athena, and with Athena

integrated with the AWS Glue Data Catalog, that allows you to create a unified metadata repository

across various services, crawl data to discover schemas, populate your catalog with new and modified table

and partition definitions, and maintain schema versioning, all under the hood.

And Athena just sits on top of that and provides a SQL interface to that underlying Glue structure.

The cost model is very simple, because it's a serverless application: you just pay as you go for the activity

you actually consume. Currently it charges you five dollars per terabyte scanned, which is pretty generous.

It's important to remember that successful or cancelled queries do count toward that scanned total, so you

do get charged for those, but failed queries are not charged; those are free.

There is also no charge for any DDL operations such as CREATE, ALTER, or DROP.

Those are free as well.

Now, a very important point: you can save a lot of money with Athena by converting your data to a

columnar format, like the ORC and Parquet formats we talked about.

Not only does that give you better performance for applications that typically query a

small number of columns, it can also save you 30 to 90 percent in cost.

That's because it allows Athena to selectively read only the required columns when doing a query and

processing your data.

So if you're only accessing certain columns from your queries by having a columnar format you're reducing

the amount of data that you actually have to scan with Athena.

And remember, you are charged by the terabyte scanned.

So by reducing that scanning you win.

So remember: Athena works best with columnar formats.

They can save you a lot of money and give you better performance as well.

Examples of columnar formats include ORC and Parquet.
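The cost arithmetic is easy to sketch. Assuming the five-dollars-per-terabyte-scanned rate quoted above, and a simplified table of ten equally sized columns where queries only ever need two of them (a toy model; real savings depend on column sizes and compression):

```python
PRICE_PER_TB = 5.00          # the per-terabyte-scanned rate quoted above

def athena_cost(tb_scanned):
    """Athena bills only for the data a query actually scans."""
    return tb_scanned * PRICE_PER_TB

dataset_tb = 1.0
row_cost = athena_cost(dataset_tb)            # CSV: must scan everything
col_cost = athena_cost(dataset_tb * 2 / 10)   # Parquet: 2 of 10 columns

savings = 1 - col_cost / row_cost
print(row_cost, col_cost, savings)  # 5.0 vs 1.0 -> an 80% saving
```

An 80 percent reduction falls right inside the 30-to-90-percent range mentioned above.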

All right.

And in addition to Athena, of course, Glue and S3 have their own charges as well.

Athena just sits on top of Glue to get the table definition for your S3 data, and your data is still being

stored in S3.

So there are separate charges for Glue and S3 on top of what you pay for Athena.

Also, I should point out that partitioning your data can help Athena reduce costs too.

If you have your data structured in S3 in different partitions, such as by date or hour, queries

that are restricted to a given partition will scan less data.

So in addition to using a columnar format, partitioning your data within S3 can also save you money

with Athena.
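Partition pruning can be sketched the same way. With Hive-style key=value prefixes in S3 (the paths and sizes below are made up), a query filtered on the partition key only has to scan objects under the matching prefix:

```python
# Hive-style partitioned layout in S3 (hypothetical paths and sizes).
# A query with WHERE dt = '2023-01-02' only needs the matching prefix.
objects = {
    "logs/dt=2023-01-01/part-0.parquet": 500,   # sizes in MB, made up
    "logs/dt=2023-01-02/part-0.parquet": 500,
    "logs/dt=2023-01-03/part-0.parquet": 500,
}

def scanned_mb(partition):
    """Sum the sizes of only the objects in the requested partition."""
    prefix = "logs/dt=" + partition + "/"
    return sum(size for key, size in objects.items() if key.startswith(prefix))

print(scanned_mb("2023-01-02"))  # 500 MB instead of the full 1500
```

One third of the data scanned means one third of the bill for that query.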

Let's talk about security too; that's always important.

Athena offers many different ways of securing your traffic.

One is access control: who can actually get to Athena and query your information in the first

place? You can use IAM, access control lists, and S3 bucket policies to restrict access to that information.

The IAM policies relevant here are the AmazonAthenaFullAccess and AWSQuicksightAthenaAccess

managed policies.

You can also encrypt your results if they are sensitive; they are stored at rest in a staging

directory in S3, and that can be encrypted in several different ways.

Again, these are important to remember, guys: you can encrypt your query results in S3 with server-side

encryption using an S3-managed key, which is called SSE-S3.

You can also do server-side encryption of your results using a KMS key, which is called SSE-KMS, or

you can encrypt them on the client side with a KMS key as well, which is called CSE-KMS.

Depending on your security requirements, one or more of these may make

sense.

So think about how that data is flowing where it's being stored and how you want it to be secured.
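In the Athena API, these choices show up as the EncryptionOption in a query's result configuration. Here's a sketch of how the three settings are shaped (the output bucket and KMS key ARN are placeholders, and no AWS call is actually made):

```python
# Result-encryption settings as they'd appear in the ResultConfiguration
# of an Athena StartQueryExecution request. The bucket and KMS key ARN
# below are placeholders, not real resources.
def result_config(option, kms_key=None):
    enc = {"EncryptionOption": option}   # SSE_S3, SSE_KMS, or CSE_KMS
    if kms_key:
        enc["KmsKey"] = kms_key          # needed for the KMS-based options
    return {
        "OutputLocation": "s3://my-example-results-bucket/",
        "EncryptionConfiguration": enc,
    }

sse_s3 = result_config("SSE_S3")
sse_kms = result_config("SSE_KMS",
                        "arn:aws:kms:us-east-1:111122223333:key/example")
```

SSE-S3 needs no key of your own, while the SSE-KMS and CSE-KMS options also take the KMS key to use.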

You can also have cross-account access defined in an S3 bucket policy.

So it is possible to use Athena against an S3 bucket that is not owned by your account:

if that other S3 bucket has a policy that grants access to your account, an Athena

console in one account can access a data lake stored in another account.
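The cross-account grant lives in the data bucket's policy in the owning account. A minimal sketch of such a policy (the account ID and bucket name are placeholders) built as a Python dict:

```python
import json

# Minimal S3 bucket policy letting another account (111122223333, a
# placeholder) read a data-lake bucket so its Athena queries can scan it.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-example-datalake",      # for ListBucket
            "arn:aws:s3:::my-example-datalake/*",    # for GetObject
        ],
    }],
}
print(json.dumps(policy, indent=2))
```

The querying account still needs its own IAM permissions for Athena and S3; the bucket policy is only the owner's half of the handshake.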

You can set that up.

As far as in-transit security goes, you can use TLS (Transport Layer Security) for all the traffic

going between Athena and S3; that can be set up too.

It's always important to remember the security aspects, guys.

That's a huge part of the exam.

Finally, let's talk about anti-patterns. Again, these come straight out of the AWS Big Data

whitepaper: things they don't want you to use Athena for.

One is highly formatted reports or visualization; at the end of the day, Athena is just a SQL query engine.

If you want nicely formatted output for visualizing things with charts and graphs,

well, that's what QuickSight is for, and we'll be getting to that shortly.

Also, if you want to do ETL (extract, transform, and load) operations, Athena is generally

not the best tool to use; Glue ETL is an alternative there, and you can also use

Apache Spark or what have you for larger-scale tasks.

So that's Athena in a nutshell.

Let's go play with it, shall we?
