What is it?
- Cloud storage for private and public data
- Highly available, durable, large scale storage
- Data can be uploaded or downloaded from the AWS Console or through REST-based APIs
- Data can be made publicly or privately available through an HTTP endpoint. The endpoint can be secured with a signed URL
Usages
- Data sharing on the Web
- Application Backend
- Content management (documents, pictures, media)
- Application logs
- Backups and archives
- Web Content (Images, JavaScript)
- Documentation - versioning and security
- Bootstrapping - Store scripts that drive EC2 instances
- Can be used to host static websites (Route 53 points to S3 buckets)
- S3 integrates with CloudFront. Serve contents of S3 from global edge locations (cached at edge locations). Download (static content) and Streaming (serve media files from buckets to players).
Storage Model
- Data is stored as objects in buckets
- Buckets are like folders
- An object can be any data type
- An object key is unique within a bucket; slashes can be used within object keys to mimic a filing system
- Bucket names must be globally unique, so the full S3 namespace (bucket name + S3 URL + object key) uniquely identifies an object
- An object is stored within a single region
- An object is replicated in multiple availability zones within a region and is hence eventually consistent
- New objects - immediately consistent
- Updates - eventually consistent
- Deletes - eventually consistent
- S3 provides various levels of durability and availability - Standard (high durability, high availability), Reduced Redundancy (medium durability, high availability), Glacier (high durability, low availability; 3-5 hour restore time)
- S3 creates a partition based on the first letter of the object key. By changing/managing the key names (e.g. reversing the string so that the first letter is different for various keys) we can spread keys among partitions
- S3 supports server side encryption (encrypts data at rest; S3 manages the keys). Can specify whether to encrypt content or not by checking the box for a bucket or specify in the PUT header for an object
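A minimal sketch of that last point, assuming the Python boto3 SDK; the bucket and key names are placeholders:

    import boto3

    s3 = boto3.client("s3")

    # Request server-side encryption (S3-managed keys) for this object via the PUT header.
    # "my-bucket" and "reports/2016.csv" are hypothetical names.
    s3.put_object(
        Bucket="my-bucket",
        Key="reports/2016.csv",
        Body=b"col1,col2\n1,2\n",
        ServerSideEncryption="AES256",
    )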
Addressing Scheme
Normal Addressing
s3-{region}.amazonaws.com/{bucket-name}/{object-key}
or
{bucket-name}.s3-{region}.amazonaws.com/{object-key}
Website Addressing
{bucket-name}.s3-website-{region}.amazonaws.com
Versioning
- Can be enabled at the bucket level
- Keeps old copies of objects
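A sketch of enabling versioning on a bucket with boto3 (the bucket name is a placeholder):

    import boto3

    s3 = boto3.client("s3")

    # Turn on versioning for the whole bucket; S3 then keeps old copies of overwritten objects.
    s3.put_bucket_versioning(
        Bucket="my-bucket",  # hypothetical bucket
        VersioningConfiguration={"Status": "Enabled"},
    )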
Security
By default all buckets are private.
Bucket Policies
- Specify access at bucket level. Provides fine grained control.
- Specify the permission, resource to which the permission applies, and the users that have that permission (no IAM required).
- The policy is not reusable since it is tied to a specific bucket.
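As one possible illustration, a bucket policy granting public read on a bucket's objects, applied with boto3; the bucket name and policy contents are assumptions, not a recommendation:

    import json
    import boto3

    s3 = boto3.client("s3")

    # Example policy: allow anyone to GET objects in this (hypothetical) bucket.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::my-bucket/*",
        }],
    }

    # The policy is tied to this specific bucket and is not reusable elsewhere.
    s3.put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(policy))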
IAM
- Role-based access control. Provides fine grained control.
- First define what is allowed for a bucket (the policy) and then specify who is covered under that policy. Policy is reusable.
- For example, create a policy for the bucket (e.g. PutObject right to an S3 resource). Assign policy to role or user/group in IAM.
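A sketch of that IAM flow with boto3; the policy, user, and bucket names are hypothetical:

    import json
    import boto3

    iam = boto3.client("iam")

    # 1. Define what is allowed: PutObject on a given bucket.
    doc = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::my-bucket/*",
        }],
    }
    created = iam.create_policy(
        PolicyName="AllowPutToMyBucket",  # hypothetical policy name
        PolicyDocument=json.dumps(doc),
    )

    # 2. Specify who is covered: attach the reusable policy to a user (or group/role).
    iam.attach_user_policy(
        UserName="uploader",  # hypothetical user
        PolicyArn=created["Policy"]["Arn"],
    )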
ACL
- Coarse grained control. Apply access control to a specific S3 bucket or object level (does not support a wildcard specification of an S3 resource)
- Specify the action and who is allowed to perform that action.
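For example, a canned ACL applied to a single object with boto3 (bucket and key names are placeholders):

    import boto3

    s3 = boto3.client("s3")

    # Coarse-grained control: make one specific object publicly readable.
    s3.put_object_acl(
        Bucket="my-bucket",      # hypothetical bucket
        Key="images/logo.png",   # hypothetical object
        ACL="public-read",
    )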
Time-bombed URLs
- Can also generate a time-bombed (pre-signed) URL to an object (e.g. using the Python boto API)
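A minimal sketch using boto3 (the successor to the boto API mentioned above); bucket and key names are placeholders:

    import boto3

    s3 = boto3.client("s3")

    # Generate a time-bombed (pre-signed) URL that expires after one hour.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-bucket", "Key": "private/report.pdf"},
        ExpiresIn=3600,  # seconds
    )
    print(url)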
Metadata
System generated
Date, size, checksum, server side encryption enabled or not, object version id, delete marker, storage class, redirects for location (useful for hosting static website or redirecting to a different version of object)
User generated
key-value pairs
x-amz-meta-{your key name}
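A sketch of attaching user-generated metadata on upload with boto3; each entry comes back as an x-amz-meta-{key} header on GET (the names below are hypothetical):

    import boto3

    s3 = boto3.client("s3")

    # Metadata entries are returned as x-amz-meta-{key} headers when the object is fetched.
    s3.put_object(
        Bucket="my-bucket",
        Key="docs/spec.txt",
        Body=b"...",
        Metadata={"author": "jane", "review-status": "draft"},
    )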
Data Lifecycle Management
- Automatic deletion (specify after how many days to automatically delete an object)
- Archive to Glacier (specify after how many days to automatically change the storage class to Glacier)
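A sketch of a lifecycle rule with boto3 that archives objects to Glacier after 30 days and deletes them after 365; the bucket name and key prefix are assumptions:

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="my-bucket",  # hypothetical bucket
        LifecycleConfiguration={
            "Rules": [{
                "ID": "archive-then-delete-logs",
                "Filter": {"Prefix": "logs/"},  # hypothetical prefix
                "Status": "Enabled",
                # Change storage class to Glacier after 30 days...
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                # ...and delete the object after 365 days.
                "Expiration": {"Days": 365},
            }],
        },
    )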
Best Practices
- To distribute data evenly across S3 resources, whenever possible, use a random prefix when naming keys. This ensures that a single machine/disk/partition does not become a bottleneck.
- Amazon CloudFront can be placed in front of S3 to increase throughput of GETs and PUTs
- When uploading large data, use parallel threads and multipart upload
- When reading large data objects, use parallel threads and specify the range of data to read (see the sketch after this list)
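A sketch of the last two points using boto3's transfer manager, which performs parallel multipart uploads, plus a ranged GET; the file, bucket, and key names are placeholders:

    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")

    # Upload in 16 MB parts using up to 10 parallel threads.
    config = TransferConfig(
        multipart_threshold=16 * 1024 * 1024,
        multipart_chunksize=16 * 1024 * 1024,
        max_concurrency=10,
    )
    s3.upload_file("backup.tar.gz", "my-bucket", "backups/backup.tar.gz", Config=config)

    # Read only the first 1 MB of a large object; issuing several such requests
    # with different ranges in parallel threads speeds up large downloads.
    part = s3.get_object(
        Bucket="my-bucket",
        Key="backups/backup.tar.gz",
        Range="bytes=0-1048575",
    )
    data = part["Body"].read()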