Redshift comprises one or more nodes; together these nodes form a Redshift cluster. Cluster management is controlled through AWS Identity and Access Management (IAM). Multiple databases can be created in a single cluster.
Database and table permissions are handled just like any other Postgres database.
Redshift uses a variety of techniques to facilitate efficient storage and querying of petabytes of data.
- Redshift does not use indexes. Primary keys can be specified but are not enforced. Primary keys are used for query optimization only. This makes the writes to Redshift incredibly fast.
- To compensate for the absence of indexes, Redshift employs a massively parallel processing (MPP) architecture, parallelizing and distributing SQL operations across the nodes in a cluster. Redshift has a leader node that receives each query. Based on the query, the leader node requests data from 1..n compute nodes. The nodes run the query steps in parallel and return results to the leader node, which passes the final result back to the client.
- Column data can be compressed automatically, or the administrator can specify the most suitable compression encoding for each column.
- Redshift is a columnar store. A single column can be projected from a table without having to read any other column in each table row.
- A parameter group defines global settings for all the databases created in a cluster.
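As a quick illustration of the columnar-storage and compression points above, a table can declare a compression encoding per column. The table and column names below are hypothetical, and in practice the encodings would be chosen with ANALYZE COMPRESSION rather than by hand:

create table event_log (
    event_id    bigint        encode delta,
    event_time  timestamp     encode delta32k,
    event_type  varchar(32)   encode bytedict,
    payload     varchar(256)  encode lzo
);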
Tools
- SQL Manager Lite tool from EMS Software
- SQL Workbench
- PostgreSQL ODBC driver
- Npgsql is the .NET data provider for PostgreSQL
- For cluster management, use the AWS CLI
Best Practices - Cluster
- Create at least two nodes per cluster. If you use a single-node cluster and if the node fails, the only possible data recovery is from the last snapshot of the cluster.
- Be aware of the maintenance window. Once a week, a 30-minute window is specified when Redshift patches are applied. You can pick the time and day of week for this maintenance window.
- Take a snapshot of the cluster before performing any maintenance
- Make sure you can back up and restore the cluster. The cluster is available (not read-only) during a restore; it is read-only during a resize.
Best Practices - Security
- Use a VPC security group that allows TCP traffic only on port 5439 (the Redshift port). Require SSL for access.
- Security at database level works like any other relational database:
- Create user groups.
- Grant permissions to these groups.
- Associate users with groups.
- Only provide SELECT permission to report users.
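A minimal group-based setup along these lines might look like the following; all names and the password are placeholders:

create group report_users;
create user report_joe password '<password>';   -- placeholder; must meet Redshift password rules
alter group report_users add user report_joe;
grant select on all tables in schema public to group report_users;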
Joins
- Nested Loop (worst)
- Hash Join
- Merge Join (best; used when the joining rows reside on the same slice and are sorted on the join columns)
Best Practices - Loading Large Data Set
- The COPY command is the fastest way to load data. Data can be loaded from S3, DynamoDB, or remote hosts over SSH (see the COPY sketch at the end of this list).
- Do not use one big file; split it into multiple files to utilize the hardware better.
- When preparing a data load file, escape the delimiter in text fields. Replace CRLF/LF with a different character. If \ is used as the ESCAPE character, be sure to escape \ in text fields (\\). When exporting from another database, remove non-printable characters from the data (for example, explicitly specify code page 1252 in the SQL Server BCP utility).
- Use the NOLOAD keyword with a COPY command to validate the data in the input files before actually loading the data.
- Load a sample of the data (a few million rows) with COMPUPDATE and STATUPDATE on, and see which compression encodings Redshift chooses. Then explicitly specify the column encodings in the CREATE TABLE statement, dropping and recreating the table with the new encodings.
- If possible, presort data in the big data file (use the same as the table sort key).
- When loading time series data, use time as the sort key and pre-sort the data file before load to speed up load.
- Split the big data load file into smaller files; Redshift can load files in parallel. If a single file is too large, the whole process may abort abruptly without a clear error (currently a single file containing over 250 million rows fails to load).
- When loading data from S3 into Redshift, be aware that S3 is eventually consistent. If a large number of files are being uploaded to S3, they might become visible in an order different from the upload order. Use a manifest file to specify exactly which files to load and to handle retries.
- Compress each file before uploading to S3
- Turn COMPUPDATE and STATUPDATE off during the load
- Run ANALYZE after the load to update system statistics; without the update, query optimization suffers.
- Check distribution of rows amongst slices after load (see query below)
- Don't clear and reload
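The COPY sketch referenced above, following the practices in this list. The bucket, prefix, table name, IAM role, and delimiter are placeholders; adjust them to your data:

-- validate the input files first without loading any rows
copy <tablename>
from 's3://<bucket>/<prefix>.manifest'
iam_role 'arn:aws:iam::<account-id>:role/<redshift-role>'
manifest gzip delimiter '|' escape
noload;

-- actual load with compression analysis and stats updates turned off
copy <tablename>
from 's3://<bucket>/<prefix>.manifest'
iam_role 'arn:aws:iam::<account-id>:role/<redshift-role>'
manifest gzip delimiter '|' escape
compupdate off
statupdate off;

-- refresh planner statistics after the load
analyze <tablename>;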
Best Practices - Schema Management
- Design schema that is optimal for storage and query of data.
- Define public views for end users; this shields them from changes in the underlying schema.
- Run the ANALYZE and VACUUM commands on a regular basis, especially after you insert, update, or delete a significant number of rows in a table.
- Use distribution keys that evenly distribute data across nodes while keeping data that needs to be joined together on the same node.
- Sort key: use the attribute you will most often filter on.
- Distribution key: use the attribute on which you will be joining data. For example, if tables A and B will be joined on a column COL1, set COL1 as the distribution key for BOTH tables so that the rows to be joined stay on the same node (see the CREATE TABLE sketch after this list).
- Use the sort keys when joining tables, if possible.
- For lookup tables, DISTSTYLE ALL can be specified to make a copy of the table available in the first slice of each node (helps JOINs)
- It is slices, rather than nodes, that determine the degree of parallelism and therefore performance.
- Use EXPLAIN to understand the query plan. Query performance is logged in system tables.
- Do not use Select *. Instead, specify desired columns explicitly (Redshift is a columnar database).
- Instead of deleting from or vacuuming an existing table, it may be quicker to create a new table containing only the desired data from the existing table.
- Workload Management - queries can be routed to different queues to meet expected run times
- Can write UDFs in Python
- Run ANALYZE COMPRESSION weekly
- Run VACUUM weekly
- Don't compress sort keys
- Don't create columns wider than needed (e.g., oversized VARCHARs); it wastes memory.
- Interleaved sort: sorts by each column individually and combines them via an internal function into a single block mapping. Use the INTERLEAVED keyword in CREATE TABLE; run VACUUM REINDEX to re-optimize the interleaved sort.
- Use caching layer for dashboard
- Use multiple queues for queries
- Redshift Console - can view query plan graphically
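The CREATE TABLE sketch referenced above. Table and column names are hypothetical and only illustrate the distribution-key, sort-key, DISTSTYLE ALL, and interleaved-sort choices discussed in this list:

create table orders (
    order_id    bigint,
    customer_id integer,
    order_date  date
)
distkey (customer_id)   -- the column both large tables are joined on
sortkey (order_date);   -- the column most queries filter on

create table order_lines (
    order_line_id bigint,
    customer_id   integer,
    quantity      integer
)
distkey (customer_id);  -- co-locates rows that join on customer_id

-- small lookup table copied to every node
create table country_codes (
    code varchar(2),
    name varchar(64)
)
diststyle all;

-- interleaved sort key; re-optimize periodically with VACUUM REINDEX
create table events (
    user_id  integer,
    event_dt date,
    category varchar(32)
)
interleaved sortkey (user_id, event_dt);

vacuum reindex events;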
Connectivity
JDBC dbUrl
jdbc:redshift://<clusterurl>:5439/<schemaname>;AuthMech=PREFER
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

// requires the Amazon Redshift (or PostgreSQL) JDBC driver on the classpath
Properties props = new Properties();
props.setProperty("user", <username>);
props.setProperty("password", <pwd>);
Connection conn = DriverManager.getConnection(<dbUrl>, props);
Escape single quotes in SQL with \.
Example: \'2016-03-10\'
In Java String variable, the value should be \\'2016-03-10\\'
ODBC
Driver={Amazon Redshift (x86)}; Server=<clusterurl>; Database=<schemaname>; UID=<username>; PWD=<password>; Port=5439; sslmode=allow
Useful queries
http://github.com/awslabs/amazon-redshift-utils
Database Table attributes
-- note the compression encoding applied to each column
select "column", type, encoding, distkey, sortkey
select "column", type, encoding, distkey, sortkey
from pg_table_def where tablename = '<tablename>';
Data Usage
select
trim(pgdb.datname) as Database,
trim(pgn.nspname) as Schema,
trim(a.name) as Table,
b.mbytes,
a.rows
from (
select db_id, id, name, sum(rows) as rows
from stv_tbl_perm a
group by db_id, id, name
) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (
select tbl, count(*) as mbytes
from stv_blocklist
group by tbl
) b on a.id = b.tbl
order by mbytes desc, a.db_id, a.name;
Data Distribution
-- how the data for a particular table is distributed amongst slices
select slice, col, num_values, minvalue, maxvalue
from svv_diskusage
where name='<tablename>' and col =0
order by slice, col;
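A complementary skew check, assuming the svv_table_info system view is available in your cluster version:

-- tables with the most uneven row distribution across slices
select "table", diststyle, tbl_rows, skew_rows
from svv_table_info
order by skew_rows desc;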
Database object information
- pg_table_def - table and column definitions
- stv_tbl_perm - information about tables including temporary tables
- stv_locks - current table updates
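A quick check against stv_locks for in-progress table updates; the column list below is an assumed subset, so verify it against your cluster's version of the view:

select table_id, last_update, lock_owner, lock_owner_pid, lock_status
from stv_locks
order by last_update;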
Dependencies
SELECT * FROM find_depend;
Create temporary tables
-- temporary tables are session-specific (global temporary tables are not supported)
CREATE TEMPORARY TABLE a_temp_table (like existing_table_name);
-- Create table from another table with data
select * into new_table from existing_table;
Users and Groups
SELECT groname, usename FROM pg_group, pg_user WHERE usesysid = any(grolist);
select * from pg_user;
SELECT * from pg_tables;
SELECT tableowner FROM pg_tables
GROUP BY tableowner
order BY tableowner;
SELECT * FROM pg_views;
SELECT VIEWOWNER FROM pg_views
GROUP BY viewowner
order BY viewowner;
Use ALTER command to change owner of a table or view (only admin can change owner).
Permissions
GRANT SELECT ON [view or table name] TO GROUP endusers;
GRANT ALL ON [view or table name] TO joe;
Queues and Queries
select * from wlm_queue_state_vw;

create view WLM_QUERY_STATE_VW as
select query, (service_class-5) as queue, slot_count, trim(wlm_start_time) as start_time, trim(state) as state, trim(queue_time) as queue_time, trim(exec_time) as exec_time
from stv_wlm_query_state;
select * from wlm_query_state_vw;
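To route a session's queries to a particular WLM queue and temporarily claim more memory slots; the query group name is a placeholder that must exist in your WLM configuration:

set query_group to '<wlm-query-group>';
set wlm_query_slot_count to 3;
-- run the heavy query here
reset wlm_query_slot_count;
reset query_group;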
Running Queries
SELECT pid, trim(user_name), starttime, substring(query,1,20)
FROM stv_recents
WHERE status='Running';
-- Queries in progress
SELECT slice, query, trim(label) querylabel, pid,
starttime, substring(text,1,20) querytext
FROM stv_inflight;
-- Queries whose memory usage spilled to disk
SELECT query, step, rows, workmem, label, is_diskbased
FROM svl_query_summary
WHERE is_diskbased='t' order by workmem;
Query Performance
EXPLAIN SELECT u.gender, g.gamename, count(*) AS purchases
FROM user_profile AS u, games AS g, order_line AS ol
GROUP BY u.gender, g.gamename
ORDER BY purchases DESC;
S3 Copy Audit
- stl_s3client - time taken to load
- stl_load_errors - errors encountered during load
- stl_loaderror_detail - details of load errors
- STV_load_state - current state of copy command and completion percentage
- STL_tr_conflict - for troubleshooting locking issues
- STL_file_scan - which nodes were accessed during copy
- STL_load_commits - query, filename, number of rows, slice affected by the load
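A hedged example of auditing the most recent COPY in the current session via STL_load_commits (column subset assumed):

select query, trim(filename) as filename, lines_scanned, status, curtime
from stl_load_commits
where query = pg_last_copy_id()
order by filename;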
Maintenance
- svl_vacuum_percentage - how much space can be reclaimed. use VACUUM command (options: FULL, SORT ONLY, DELETE ONLY)
- stv_tbl_perm - row count
- SVV_diskusage - storage allocation for tables and databases at the block level
- select sum(used-tossed) as used, sum(capacity) as capacity from stv_partitions -- disk space used
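A typical maintenance pass might look like the following; the table name is a placeholder:

-- see how much space a vacuum could reclaim
select * from svl_vacuum_percentage;
-- reclaim space and re-sort rows, then refresh statistics
vacuum full <tablename>;
analyze <tablename>;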
Monitoring
- Select * from stl_ddltext where userid > 1 order by pid, starttime, sequence -- who ran a DDL query
- STL_wlm_error - errors
- STL_wlm_query - queries
- STV_wlm_classification_config - current WLM configuration values
- STV_wlm_query_queue_state - current queue status
- STV_wlm_query_state - state of queries
- stv_recents - running queries
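To see how long completed queries waited in a queue versus executed (times in STL_wlm_query are in microseconds):

select query, service_class, slot_count,
       total_queue_time / 1000000 as queue_seconds,
       total_exec_time / 1000000 as exec_seconds
from stl_wlm_query
order by total_queue_time desc
limit 10;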
Recent Connections and Queries
select * from stl_connection_log where remotehost like '52%' order by recordtime desc;
select * from stl_query order by endtime desc limit 10;
select * from stl_load_errors order by starttime desc;
select * from stl_error where userid!=0 order by recordtime desc;