Amazon S3 - Amazon's 'Sorta' Simple Storage Solution

By: Ben English | February 5, 2015

Ever since Amazon released S3 (Simple Storage Service) in 2006, it has revolutionized the way companies store and analyze data. However, there are a few ways a developer can get tripped up using this otherwise simple service.

One major perk of using S3 is that data can be seamlessly transferred to and from HDFS, which, when combined with Amazon's Elastic MapReduce (EMR), makes Big Data analytics extremely accessible. However, since S3 is not the typical POSIX filesystem that many developers have come to expect, it can lead to some unexpected behavior.

Below are some of the issues -- along with some solutions -- that we've come across here at AWeber.

Timestamps In S3 Object Names

Like all good engineers, I try to adhere to the ISO-8601 date format whenever possible. It's easy to parse, it sorts well, and it provides consistent date/time information. (In case you need a reminder, the ISO-8601 format looks like this: '2015-01-16T02:46:38+00:00'.) With this format, it's fairly common practice to name log files something like 'server_access_logs_2015-01-16T02:00:00.gz' to provide a unique name.

While this works fine when you initially store in S3, it becomes a problem when you attempt to analyze the data in those log files using EMR. The problem stems from an unresolved issue in Apache Hadoop that deals with ':' characters in the file name. Here’s an example of what this might look like:

Suppose you have an S3 bucket -- call it 's3://example-log-bucket' for the sake of this example. In this bucket there are three objects (files), named something like this:
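
    log_2015-01-16T01:00:00.gz
    log_2015-01-16T02:00:00.gz
    log_2015-01-17T01:00:00.gz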

If you wanted to parse these logs and aggregate metrics using Apache Pig for all logs on 2015-01-16, you would normally write something like:
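
    -- Bucket and object names are the hypothetical ones from the example above.
    logs = LOAD 's3://example-log-bucket/log_2015-01-16*'
           USING TextLoader() AS (line:chararray);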

Due to the colons in the filenames, however, this will fail with an exception along these lines (the exact message varies by Hadoop version):
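
    java.net.URISyntaxException: Relative path in absolute URI: log_2015-01-16T01:00:00.gz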

Now suppose you are a clever engineer and decide to rename the objects you are analyzing, replacing the colon character with a dash. Since we're only interested in logs from 1/16, we'll rename those and leave the logs from 1/17 with their colon characters intact:
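
    log_2015-01-16T01-00-00.gz
    log_2015-01-16T02-00-00.gz
    log_2015-01-17T01:00:00.gz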

Again, we try a Pig load statement to load all of the log data from the 16th:
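
    logs = LOAD 's3://example-log-bucket/log_2015-01-16*'
           USING TextLoader() AS (line:chararray);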

Surprisingly, this will again fail with the same error, and this time it will reference log_2015-01-17T01:00:00.gz as an invalid file -- even though log_2015-01-17T01:00:00.gz doesn't match the pattern in the load statement. When you use the '*' wildcard, Pig scans all of the objects in the bucket and attempts to validate them.

If you are actively streaming data to S3, it may not be possible to rename all of the objects in a bucket before you do analysis. The objects must either be referenced without using the '*' wildcard or moved to another bucket.

One workaround is to use the ‘?’ wildcard instead, which represents a single character:
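
    -- Same hypothetical names as above; each '?' matches exactly one character
    -- (a digit, a ':' or a '-'), so no literal colon appears in the path.
    logs = LOAD 's3://example-log-bucket/log_2015-01-16T????????.gz'
           USING TextLoader() AS (line:chararray);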

Periods, S3 Buckets and SSL Hostname Verification

As software engineers, we commonly name things like 'com.aweber.widget'. The period separator between words makes compound names easier to read and keeps us from using characters that might be invalid, like <space> or '/'. So naturally, when naming S3 buckets, it's common practice to name them the same way.

Bucket names like 'com.aweber.bucket' are technically valid, but they may pose some particularly annoying issues when accessing them over HTTPS.

The problem lies with the standard way in which S3 buckets are referenced. Generally, they are referred to like this: https://<bucket_name>.s3.amazonaws.com. When using an HTTPS connection, the hostname must match Amazon's wildcard certificate, *.s3.amazonaws.com. Because a wildcard in a certificate only covers a single DNS label, com-aweber-bucket.s3.amazonaws.com will match, but com.aweber.bucket.s3.amazonaws.com will not.

One workaround is to turn off SSL hostname verification altogether, but that can leave you exposed. A better option: Amazon provides an alternative (albeit less aesthetically pleasing) method of referencing S3 buckets, called Path Style, which looks like this: https://s3.amazonaws.com/com.aweber.bucket.

When using Path Style references, you must also specify the AWS region where your bucket resides in order to connect. If you're getting a '301 Moved Permanently' response, the missing region is probably the issue. In Python, using the boto library, you can specify a region and Path Style like this:
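
    import boto.s3
    from boto.s3.connection import OrdinaryCallingFormat

    # us-west-2 is just an example -- use the region your bucket actually
    # lives in. OrdinaryCallingFormat forces path-style addressing, so the
    # bucket name never becomes part of the SSL hostname.
    conn = boto.s3.connect_to_region(
        'us-west-2',
        calling_format=OrdinaryCallingFormat(),
    )
    bucket = conn.get_bucket('com.aweber.bucket')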

Or, if you prefer the easier and more aesthetically pleasing route, just use dashes in your bucket names instead.

Configuring Flume for S3 Server Side Encryption

Apache Flume is a distributed service for collecting, aggregating, and moving large amounts of log data. Oftentimes, the endpoint in a Flume implementation is an S3 bucket. And like any good company that wants to protect its data, you'll probably want that data encrypted.

Fortunately, Amazon recently added server-side encryption for data at rest in S3. With server-side encryption, Amazon encrypts your data before writing it to disk in S3 and decrypts it when you download it. What isn't entirely clear, however, is how to put the two together.

After some deep digging, we came up with a configuration snippet that will allow you to stream data to S3 using Apache Flume with server-side encryption. Add the property below to your core-site.xml file (this assumes the Flume HDFS sink writes to S3 through Hadoop's s3n filesystem) and all of the data streamed through Flume will be encrypted:
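
    <!-- Assumption: the Flume HDFS sink writes to S3 via Hadoop's s3n filesystem.
         AES256 requests SSE-S3 (Amazon-managed keys) for every object written. -->
    <property>
      <name>fs.s3n.server-side-encryption-algorithm</name>
      <value>AES256</value>
    </property>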

The Storage Solution Your Data Deserves

As a data scientist, I'm a huge fan of Amazon S3 and many of the related Amazon Web Services. While there are some frustrations -- implementing server-side encryption, naming S3 buckets with periods, dealing with colons in object names -- the benefits far outweigh them.

Here at AWeber, S3 has transformed the way we store and manage data. It has allowed us to retain more data in an accessible format, which in turn allows us to analyze and derive insights that were previously inaccessible.

Ultimately, this allows us more time to spend providing remarkable experiences to our customers. And at the end of the day, that’s what matters most.

Ben English
@bdenglish

Ben is a Data Scientist at AWeber and a self-diagnosed data addict.