
This is useful for anyone who’s working with or recently set up AWS RDS. I’ll only cover working with the Console rather than the CLI. This works for most of the big-name relational OLTP databases:

  • MariaDB
  • MySQL
  • PostgreSQL

Note:

Before I dive in: AWS works very hard to make sure every service is secure. Part of that security is making sure users and groups of users can’t interfere with products or services that aren’t applicable to them. To do that, Identity and Access Management (IAM) is used to manage which products and services individuals and groups of individuals can access. If a policy is not enabled for a certain user or group, then by default they are denied the ability to interact with those products and services. You may need to enable certain policies before progressing further.

Use Case

Let’s say you have one of the aforementioned databases hosted on AWS RDS. You want to query that data directly without leaving AWS, use it for analytics (think OLAP or ETL/ELT), or share it with some other service or partner. Enter RDS Snapshots.

“Amazon RDS creates a storage volume snapshot of your DB instance, backing up the entire DB instance and not just individual databases. Creating this DB snapshot on a Single-AZ DB instance results in a brief I/O suspension that can last from a few seconds to a few minutes, depending on the size and class of your DB instance. For MariaDB, MySQL, Oracle, and PostgreSQL, I/O activity is not suspended on your primary during backup for Multi-AZ deployments, because the backup is taken from the standby. For SQL Server, I/O activity is suspended briefly during backup for Multi-AZ deployments.” – You can learn more about snapshots here.

The nice thing about snapshots is they let you capture your RDS data at a single point in time. You can set up recurring snapshots to run weekly or daily, whatever suits your purposes, and you can pick what times you want a snapshot to run. Perfect for ETL or analytics. However, the snapshots themselves can’t be queried directly. Side note: if you’ve set up Aurora you can query your data directly – you don’t need to do snapshots.

Athena

A quick overview of Athena. It makes analyzing data directly from S3 using standard SQL incredibly simple. It comes with the benefits most serverless products have: no need to worry about scaling, setting up infrastructure, or managing it. It’s secure, leveraging AWS IAM for access, and highly available. It’s easy to query, leveraging Presto for fast SQL analytics. When combined with AWS Glue you can enrich datasets using other data stores or federated queries. Of course, no data science discussion would be complete without talking about Machine Learning: Athena makes it easy to connect to SageMaker and invoke your models right there.

Here are the main steps:

  1. Generate a snapshot
  2. Export snapshot to S3
  3. Access your data with Athena

Prerequisites:

Before we begin, taking care of the following will make your life easier:

  • Make sure you have all the appropriate privileges to interact with RDS and S3
  • Select the appropriate Availability Zone (AZ)
  • Create an S3 bucket for your snapshots (step 2) – see the bucket sketch just after this list
  • Create an IAM role (step 2)
  • Set up encryption with a KMS key (step 2)
  • Create an S3 bucket for your queries (step 3)
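If you prefer to script the bucket setup rather than click through the Console, a minimal boto3 sketch might look like the following. The bucket names and region are placeholders – use your own.

```python
# Minimal sketch: create the two S3 buckets (snapshot exports and query results).
# Bucket names and region are placeholders – substitute your own.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

for bucket in ["my-rds-snapshot-exports", "my-athena-query-results"]:
    s3.create_bucket(Bucket=bucket)
    # Note: outside us-east-1 you must also pass
    # CreateBucketConfiguration={"LocationConstraint": "<your-region>"}.
```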

Generate a snapshot

Generating a snapshot is easy. Type RDS into the search bar, select Databases on the left-hand menu, select the database you want to take a snapshot of, then select the Actions dropdown and click Take snapshot. Give the snapshot a name – I like to use {database}-{date}. This will trigger a manual snapshot. Wait for it to finish before progressing.
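The Console route above is all you need, but if you want to automate this step, here’s a rough boto3 sketch. The instance and snapshot identifiers are placeholders.

```python
# Minimal sketch: take a manual snapshot and wait for it to become available.
# Identifiers below are placeholders.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_snapshot(
    DBInstanceIdentifier="my-database",                 # your RDS instance
    DBSnapshotIdentifier="my-database-2021-01-01",      # {database}-{date}
)

# Block until the snapshot is available before exporting it.
rds.get_waiter("db_snapshot_available").wait(
    DBSnapshotIdentifier="my-database-2021-01-01"
)
```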

Export snapshot to S3

While you’re in the Amazon RDS module, select Snapshots. From there you should see the manual snapshot you just named. Select the snapshot, select Actions, and click Export to Amazon S3.

You now need to fill out configurations for the export:

  • Specify a name for the export. I like to go with {snapshot-name}-export.
  • Select All or Partial
    • If you select Partial you will need to know how the tables in your databases are constructed (i.e., schema.table)
  • It’s good practice to add a prefix (e.g., “demo/test/snapshots”); this is the location within the bucket
  • Select the appropriate IAM role
    • If you don’t have one, you can create one, which will generate the appropriate policy specifications
  • Select Encryption
    • You probably already have an RDS key from when you created your database, but if you don’t you can generate one using AWS Key Management Service (KMS).

Then select Export to Amazon S3. This process can take some time.
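As a rough scripted equivalent, the export can also be kicked off with boto3’s start_export_task. Every ARN, bucket, and name below is a placeholder for whatever you set up in the prerequisites.

```python
# Minimal sketch: export the snapshot to S3. All ARNs, names, and the prefix are placeholders.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.start_export_task(
    ExportTaskIdentifier="my-database-2021-01-01-export",   # {snapshot-name}-export
    SourceArn="arn:aws:rds:us-east-1:123456789012:snapshot:my-database-2021-01-01",
    S3BucketName="my-rds-snapshot-exports",
    S3Prefix="demo/test/snapshots",                          # location within the bucket
    IamRoleArn="arn:aws:iam::123456789012:role/rds-s3-export-role",
    KmsKeyId="arn:aws:kms:us-east-1:123456789012:key/your-key-id",
    # ExportOnly=["mydb.mytable"],  # uncomment for a Partial export (schema.table)
)
```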

Access your data with Athena

In order to leverage Athena, we need to use AWS Glue. We’re going to use a Crawler to catalog the exported data in S3 so Athena can query it (a scripted sketch follows the steps below).

Steps:

  • Within the AWS Console, head over to the search bar and type in Crawler
  • Click Add crawler
  • Type in a crawler name
  • Select Data store and crawl all folders
    • Or specify certain folders
  • Select S3 for the data store
  • Find the path where the exported S3 snapshot is and select that
    • Add another data store or select “No” and hit next
  • Update, choose, or create an IAM role
    • Note: This IAM role will need to have access to the RDS key you set up for your database to decrypt the data in S3. You can add it to the list of users by going to AWS KMS, selecting add, and finding the new or existing IAM role.
  • Select the frequency at which the crawler should run
  • Specify a name for the database that will be created by the crawler
  • Click Finish
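If you’d rather define the crawler in code, here’s a minimal boto3 sketch of the same setup. The crawler name, role, Glue database, and S3 path are placeholders, and the role still needs access to your KMS key as noted above.

```python
# Minimal sketch: create and run a Glue crawler over the exported snapshot data.
# Names, the role, and the S3 path are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="rds-snapshot-crawler",
    Role="AWSGlueServiceRole-rds-snapshots",        # must be able to read S3 and use the KMS key
    DatabaseName="rds_snapshot_db",                 # database the crawler will create/populate
    Targets={"S3Targets": [{"Path": "s3://my-rds-snapshot-exports/demo/test/snapshots/"}]},
    # Schedule="cron(0 6 * * ? *)",                 # optional: run daily instead of on demand
)

glue.start_crawler(Name="rds-snapshot-crawler")
```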

After this, a functional database is in your possession to query using Athena; we just need to select it. Before you begin, you’ll need to set up a place on S3 to save your query results. Create a new bucket and point your queries to that bucket.

Now you are all set. Head over to Athena, select query editor, and select the Database you just created.
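If you want to run queries programmatically rather than in the editor, a minimal boto3 sketch might look like this. The database, table, and results bucket names are placeholders carried over from the steps above.

```python
# Minimal sketch: run a query against the crawled database and point results at the query bucket.
# Database, table, and bucket names are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT * FROM my_table LIMIT 10;",
    QueryExecutionContext={"Database": "rds_snapshot_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
print(response["QueryExecutionId"])  # results land in the output bucket as CSV
```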
