What is it?
- Cloud storage for private and public data
- Highly available, durable, large scale storage
- Data can be uploaded or downloaded from the AWS Console or through REST-based APIs
- Data can be made publicly or privately available through an HTTP endpoint. The endpoint can be secured with a signed URL
Usages
- Data sharing on the Web
- Application Backend
- Content management (documents, pictures, media)
- Application logs
- Backups and archives
- Web Content (Images, JavaScript)
- Documentation - versioning and security
- Bootstrapping - Store scripts that drive EC2 instances
- Can be used to host static websites (Route 53 points to S3 buckets)
- S3 integrates with CloudFront. Serve contents of S3 from global edge locations (cached at edge locations). Download (static content) and Streaming (serve media files from buckets to players).
Storage Model
- Data is stored as objects in buckets
- Buckets are like folders
- An object can be any data type
- An object key is unique within a bucket; slashes can be used within object keys to mimic a filing system
- Bucket names must be globally unique, so the full S3 namespace (bucket name + S3 URL + object key) uniquely identifies an object
- An object is stored within a single region
- An object is replicated in multiple availability zones within a region and is hence eventually consistent
- New objects - immediately consistent
- Updates - eventually consistent
- Deletes - eventually consistent
- S3 provides various levels of durability and availability - Standard (high durability, high availability), Reduced Redundancy (medium durability, high availability), Glacier (high durability, low availability; 3-5 hour restore time)
- S3 creates a partition based on the first letter of the object key. By changing/managing the key names (e.g. reversing the string so that the first letter is different for various keys) we can spread keys among partitions
- S3 supports server side encryption (encrypts data at rest; S3 manages the keys). Can specify whether to encrypt content or not by checking the box for a bucket or specify in the PUT header for an object
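A minimal sketch of that last point, assuming the Python boto3 SDK; the bucket and key names are placeholders:

    import boto3

    s3 = boto3.client("s3")

    # Request server-side encryption (S3-managed keys) for this object via the PUT header.
    # "my-bucket" and "reports/2016.csv" are hypothetical names.
    s3.put_object(
        Bucket="my-bucket",
        Key="reports/2016.csv",
        Body=b"col1,col2\n1,2\n",
        ServerSideEncryption="AES256",
    )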
Addressing Scheme
Normal Addressing
s3-{region}.amazonaws.com/{bucket-name}/{object-key}
or
{bucket-name}.s3-{region}.amazonaws.com/{object-key}
Website Addressing
{bucket-name}.s3-website-{region}.amazonaws.com
Versioning
- Can be enabled at the bucket level
- Keeps old copies of objects
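A sketch of enabling versioning on a bucket with boto3 (the bucket name is a placeholder):

    import boto3

    s3 = boto3.client("s3")

    # Turn on versioning for the whole bucket; S3 then keeps old copies of overwritten objects.
    s3.put_bucket_versioning(
        Bucket="my-bucket",  # hypothetical bucket
        VersioningConfiguration={"Status": "Enabled"},
    )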
Security
By default all buckets are private.
Bucket Policies
- Specify access at bucket level. Provides fine grained control.
- Specify the permission, resource to which the permission applies, and the users that have that permission (no IAM required).
- The policy is not reusable since it is tied to a specific bucket.
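As one possible illustration, a bucket policy granting public read on a bucket's objects, applied with boto3; the bucket name and policy contents are assumptions, not a recommendation:

    import json
    import boto3

    s3 = boto3.client("s3")

    # Example policy: allow anyone to GET objects in this (hypothetical) bucket.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::my-bucket/*",
        }],
    }

    # The policy is tied to this specific bucket and is not reusable elsewhere.
    s3.put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(policy))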
IAM
- Role-based access control. Provides fine grained control.
- First define what is allowed for a bucket (the policy) and then specify who is covered under that policy. Policy is reusable.
- For example, create a policy for the bucket (e.g. PutObject right to an S3 resource). Assign policy to role or user/group in IAM.
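A sketch of that IAM flow with boto3; the policy, user, and bucket names are hypothetical:

    import json
    import boto3

    iam = boto3.client("iam")

    # 1. Define what is allowed: PutObject on a given bucket.
    doc = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::my-bucket/*",
        }],
    }
    created = iam.create_policy(
        PolicyName="AllowPutToMyBucket",  # hypothetical policy name
        PolicyDocument=json.dumps(doc),
    )

    # 2. Specify who is covered: attach the reusable policy to a user (or group/role).
    iam.attach_user_policy(
        UserName="uploader",  # hypothetical user
        PolicyArn=created["Policy"]["Arn"],
    )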
ACL
- Coarse grained control. Apply access control to a specific S3 bucket or object level (does not support a wildcard specification of an S3 resource)
- Specify the action and who is allowed to perform that action.
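For example, a canned ACL applied to a single object with boto3 (bucket and key names are placeholders):

    import boto3

    s3 = boto3.client("s3")

    # Coarse-grained control: make one specific object publicly readable.
    s3.put_object_acl(
        Bucket="my-bucket",      # hypothetical bucket
        Key="images/logo.png",   # hypothetical object
        ACL="public-read",
    )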
Time-bombed URLs
- Can also generate a time-bombed (pre-signed) URL to an object (e.g. using the Python boto API)
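A minimal sketch using boto3 (the successor to the boto API mentioned above); bucket and key names are placeholders:

    import boto3

    s3 = boto3.client("s3")

    # Generate a time-bombed (pre-signed) URL that expires after one hour.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-bucket", "Key": "private/report.pdf"},
        ExpiresIn=3600,  # seconds
    )
    print(url)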
Metadata
System generated
Date, size, checksum, server side encryption enabled or not, object version id, delete marker, storage class, redirects for location (useful for hosting static website or redirecting to a different version of object)
User generated
key-value pairs
x-amz-meta-{your key name}
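A sketch of attaching user-generated metadata on upload with boto3; each entry comes back as an x-amz-meta-{key} header on GET (the names below are hypothetical):

    import boto3

    s3 = boto3.client("s3")

    # Metadata entries are returned as x-amz-meta-{key} headers when the object is fetched.
    s3.put_object(
        Bucket="my-bucket",
        Key="docs/spec.txt",
        Body=b"...",
        Metadata={"author": "jane", "review-status": "draft"},
    )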
Data Lifecycle Management
- Automatic deletion (specify after how many days to automatically delete an object)
- Archive to Glacier (specify after how many days to automatically change the storage class to Glacier)
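A sketch of a lifecycle rule with boto3 that archives objects to Glacier after 30 days and deletes them after 365; the bucket name and key prefix are assumptions:

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="my-bucket",  # hypothetical bucket
        LifecycleConfiguration={
            "Rules": [{
                "ID": "archive-then-delete-logs",
                "Filter": {"Prefix": "logs/"},  # hypothetical prefix
                "Status": "Enabled",
                # Change storage class to Glacier after 30 days...
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                # ...and delete the object after 365 days.
                "Expiration": {"Days": 365},
            }],
        },
    )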
Best Practices
- To distribute data evenly across S3 resources, whenever possible, use a random prefix when naming keys. This ensures that a single machine/disk/partition does not become a bottleneck.
- Amazon CloudFront can be placed in front of S3 to increase throughput of GETs and PUTs
- When uploading large data, use parallel threads and multipart upload
- When reading large data objects, use parallel threads and specify the range of data to read (see the sketch after this list)
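A sketch of the last two points using boto3's transfer manager, which performs parallel multipart uploads, plus a ranged GET; the file, bucket, and key names are placeholders:

    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")

    # Upload in 16 MB parts using up to 10 parallel threads.
    config = TransferConfig(
        multipart_threshold=16 * 1024 * 1024,
        multipart_chunksize=16 * 1024 * 1024,
        max_concurrency=10,
    )
    s3.upload_file("backup.tar.gz", "my-bucket", "backups/backup.tar.gz", Config=config)

    # Read only the first 1 MB of a large object; issuing several such requests
    # with different ranges in parallel threads speeds up large downloads.
    part = s3.get_object(
        Bucket="my-bucket",
        Key="backups/backup.tar.gz",
        Range="bytes=0-1048575",
    )
    data = part["Body"].read()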