This is useful for anyone who’s working with or recently setup AWS RDS. I’ll only consider working with the Console instead of the CLI. This works for most of the big name relational OLTP databases:
Before I dive in, AWS works very hard to make sure every service is secure. Part of that security is making sure the right users and/or groups of users are not interfering with products or services which are not applicable to them. In order to do that, Identity and Access Management (IAM) is used to manage individuals and/or groups of individuals access to products and services. To that end, if a policy is not enabled for a certain user or group then by default, they are denied that ability to interact with those products and services. You may need to enable certain policies before progressing further.
Let’s say you have one of the aforementioned databases and it’s hosted on AWS RDS. You want to query that data directly without having to leave AWS, or you want to query that data for analytics (think OLAP or ETL/ELT), or share that data with some other service or partner. Enter RDS Snapshots.
“Amazon RDS creates a storage volume snapshot of your DB instance, backing up the entire DB instance and not just individual databases. Creating this DB snapshot on a Single-AZ DB instance results in a brief I/O suspension that can last from a few seconds to a few minutes, depending on the size and class of your DB instance. For MariaDB, MySQL, Oracle, and PostgreSQL, I/O activity is not suspended on your primary during backup for Multi-AZ deployments, because the backup is taken from the standby. For SQL Server, I/O activity is suspended briefly during backup for Multi-AZ deployments.” – You can learn more about snapshots here.
The nice thing about snapshots is they let you take a picture of your RDS data at a single point in time. You can setup recurring snapshots to run weekly or daily, whatever suits your purposes, and you can pick at what times you want a snapshot to run. Perfect for ETL or analytics. However, these snapshots themselves are useless since they’re not accessible for querying. Side note, if you’ve setup Aurora you can directly query your data – you don’t need to do snapshots.
A quick overview of Athena. It makes analyzing data directly from S3 using standard SQL incredibly simple. It comes with certain benefits that most serverless products have, such as, no need to worry about scaling, setting up infrastructure or management thereof. It’s secure, leveraging AWS IAM for access, and highly available. It’s easy to query, leveraging Presto for speed and data analytics. When combined with AWS Glue you can enrich datasets using other data stores or federated queries. Of course, no data science discussion would be complete without talking about Machine Learning. Athena makes it easy to connect to SageMaker and invoke your models right there.
Before we begin this will make your life easier:
Generating a snapshot is easy. Type RDS into the search bar, select Databases on the left-hand menu, select the database you want to take a snapshot of, then select the Actions dropdown, click Take snapshot. Give the snapshot a name – I like to do the {database}-{date}. This will trigger a manual snapshot. Wait for it finish before progressing.
While you’re in the Amazon RDS module, select Snapshots. From there you should see the manual snapshot you just named. Select the snapshot, select Actions, and click Export to Amazon S3.
You now need to fill out configurations for the export:
Then select export to Amazon S3. This process can take some time.
In order to leverage Athena, we need to use AWS Glue. We’re going to use a Crawler to extract data from S3 and put it into Athena for us to query.
Steps:
After this a functional database is now in your possetion to query using Athena, we just need to select it. Before you begin, you’ll need to enable a place on S3 to save your queries. Create a new bucket and point your queries to that bucket.
Now you are all set. Head over to Athena, select query editor, and select the Database you just created.