Database Design

Anatomy of a modern Client-Server Database
Modern Client-Server (CS) Databases, such as Oracle and MS SQL Server, share various design elements. At the core of it all, Databases manage large amounts of data stored in one or more disk files. They also facilitate ad hoc querying of that data. The following are the common design elements of a modern Client-Server Database.
Data Files
CS Databases use one or more physical data files on disk to store information in a manner optimized for updates and lookups. For example, a database may store raw data in one disk file and an index to that data in another file to optimize access. The raw data may be stored sequentially whereas the indexed data may be stored in sorted order to speed up access. CS Databases may also use low-level, raw disk I/O instead of OS-provided I/O functions to further enhance performance. Oracle stores its data in files with a .DBF extension, whereas MS SQL Server uses the .MDF extension for its data files and .LDF for its log files. Databases also use log files to provide transactional capabilities.
Managing Process
Oracle refers to the Managing Process as the Service or the Instance. Oracle also identifies this process with a unique identifier called the System Identifier (SID). MS SQL Server refers to the Managing Process as an Instance. We can think of this Managing Process as the Database Server.
CS Databases usually run on a machine other than the client machine. It follows, then, that a separate, stand-alone process must manage the data files. A separate process provides isolation for database operations, insulating them from a misbehaving Client Application and vice versa. On the Windows platform, the Managing Process is implemented as a Windows Service. Specific to a Managing Process are in-memory data structures used to cache recently or commonly used data. Oracle calls this memory area the System Global Area (SGA).
An important subsystem of the Managing Process is the SQL Query Processor and Optimizer. The objective of this subsystem is to understand and process data definition and data manipulation requests from the Client Application in an efficient manner. The Managing Process also enforces the ACID qualities of a database (a short transaction sketch follows the list):
  1.  Atomicity - Either all or none of the tasks in a transaction are performed
  2. Consistency - The database remains in a consistent state regardless of successful or unsuccessful transactions
  3. Isolation - Results of operations during a transaction must remain invisible to other transactions and queries
  4. Durability - Once a transaction is completed, the resultant changes persist
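For illustration, the following T-SQL sketch (the Account table and its columns are hypothetical) shows atomicity at work: either both UPDATE statements take effect, or neither does.

-- Transfer funds between two hypothetical accounts as a single atomic unit
BEGIN TRANSACTION;
BEGIN TRY
    UPDATE Account SET Balance = Balance - 100 WHERE Id = 1;
    UPDATE Account SET Balance = Balance + 100 WHERE Id = 2;
    COMMIT TRANSACTION;   -- both changes become permanent together (durability)
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION; -- on any error, neither change takes effect (atomicity)
END CATCH;
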
Communication Channel
The Client Application communicates with a CS database over a network. The CS Database provides a communication channel over which it can exchange data with a Client Application.
  
A Client Application needs the following information to connect to an MS SQL database:  
  • Name of the Host (IP address or Name of the Server hosting the Managing Process)
  • Network Protocol used by the Database for communication such as TCP/IP
  • Port number used by the Managing Process (defaults to 1433 in case of TCP/IP)
  • Instance name. An Initial Catalog may also be specified. A Catalog can be thought of as a Logical Database or Schema. A Catalog is a container for SQL Database objects such as Tables, Views and Stored Procedures. If no Initial Catalog is specified, the connection uses the login's default database (typically master). The Catalog dictates which Database objects are initially visible to the Client Application. Database objects in other Catalogs may still be accessed, subject to the object permissions available to the Client Application (see the example below).
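For example, once connected, T-SQL can report the current Catalog and reach objects in other Catalogs by qualifying their names. The database and table names below are illustrative only.

-- Show the Catalog (database) the connection is currently using
SELECT DB_NAME() AS CurrentCatalog;

-- Switch the connection to a different Catalog
USE SalesDb;

-- Read from a table in another Catalog without switching, subject to permissions
SELECT TOP 10 * FROM ReportingDb.dbo.DailySummary;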
  
A Client Application needs the following information to connect to an Oracle database: 
  • Name of the Host (IP address or Name of the Server hosting the Managing Process)
  • Network Protocol used by the Database for communication such as TCP/IP
  • Port number used by the Managing Process (defaults to 1521 in case of TCP/IP)
  • Service Name or SID (System ID). In Oracle there is no Initial Catalog per se as there is with MS SQL Server. In Oracle, the set of SQL Database objects visible to a Client Application corresponds to the SQL objects owned by the authenticated User of the Client Application. In other words, for Oracle, User and Schema are synonymous. Oracle Schemas loosely correspond to MS SQL Catalogs. In the case of both databases, however, user permissions determine what is accessible to the authenticated user (see the example below).
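For example, an authenticated Oracle user can reference objects in another schema by qualifying the object name, or change how unqualified names are resolved for the current session. The schema and table names below are illustrative only.

-- Read a table owned by another schema, subject to permissions
SELECT * FROM HR.EMPLOYEES;

-- Resolve unqualified object names against a different schema for this session
ALTER SESSION SET CURRENT_SCHEMA = HR;
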
Authentication
The Database either uses the authentication services provided by the Operating System or maintains its own list of usernames (logins) and passwords in its System tables for authentication purposes. The former is called Windows Authentication in MS SQL Server and OS Authentication in Oracle; the latter is called SQL Server Authentication in MS SQL Server and database (password) authentication in Oracle.

 
Before the Database allows a Client Application to connect, it authenticates the identity of the party using the Client Application. The Client Application presents a username (login) and/or password to the Database to prove its identity. Alternatively, depending upon the Database authentication scheme in effect, the identity of the Client Application is derived from the network identity of the party using the Client Application (Windows Authentication).
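As a sketch, a SQL Server Authentication login and the corresponding database user might be created as follows; the names and password are placeholders.

-- Create a server-level login held in SQL Server's system tables
CREATE LOGIN AppLogin WITH PASSWORD = 'Str0ng!Passw0rd';

-- Map the login to a user inside a particular database (catalog)
USE SalesDb;
CREATE USER AppUser FOR LOGIN AppLogin;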

 
Authorization
Before a database responds to any data queries or updates, the Database determines the permissions granted to the party using the Client Application. The Database stores such permissions in its System tables. An example is the permission to SELECT rows from a particular TABLE. A Database allows permissions to be aggregated into Roles that can be granted to Users of a Database. The Managing Process is responsible for limiting the operations of the Client Application to the permissions granted to its authenticated User.
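A minimal T-SQL sketch of aggregating permissions into a Role and granting it to a user; the object and user names are hypothetical.

-- Aggregate permissions into a role
CREATE ROLE OrderReader;
GRANT SELECT ON dbo.Orders TO OrderReader;
GRANT SELECT ON dbo.OrderLine TO OrderReader;

-- Grant the role to a database user
ALTER ROLE OrderReader ADD MEMBER AppUser;
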
The Art of Database Schema Design 
  1. Identify entities. For example, for an ordering system entities may include customer, order, and order lines.
  2. Create a table for each entity. For now, just define a couple of columns for each entity. For example, create a Customer table for the customer entity with two columns: an auto-incrementing integer primary key column called Id, and a varchar(50) column called Lastname. (A T-SQL sketch covering steps 2 through 10 appears after this list.)
  3. Define the rest of the columns. After creating all the tables, we are in a better position to decide which columns go in which table. Create the rest of the columns for each table. Don't dwell too much on the data type selection for each column at this point.
  4. Review and modify data types for all columns. Now that we have defined all the tables and their columns, we are in a better position to choose appropriate and consistent data types across all tables.
  5. Apply NULL constraints. Review all columns and place NOT NULL constraint on desired columns.
  6. Apply UNIQUE constraints. Review all columns and place UNIQUE constraint on desired columns.
  7. Specify DEFAULT VALUEs. Review all columns and provide default values for desired columns.
  8. Add Foreign Key columns to tables. Review all tables and remove any duplicate columns. Identify the relationships between tables and add Foreign Key columns.
  9. Define Referential Integrity Constraints. Place FOREIGN KEY constraints on the Foreign Key columns.
  10. Define desired CASCADING update and delete behavior for related tables.
  11. Write scripts to populate tables that contain permanent data such as lookup lists.
  12. Define INDEXes on multiple columns to enforce composite UNIQUE constraints (if any).
  13. Define INDEXes for desired columns to improve performance.
  14. Write scripts to populate tables with sample data to verify the correctness of the schema.
  15. Write scripts to execute typical SQL requests expected from a database client. Verify the manipulated data.
  16. Iterate over the above steps several times. Draw or generate an Entity Relationship diagram to visually inspect the schema.
  17. Implement Views, Triggers, and Stored Procedures to make the schema more amenable to database users and/or applications.
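The following T-SQL sketch illustrates steps 2 through 10 for the hypothetical customer and order entities (Order is a reserved word, so the table is named CustomerOrder here); the column choices are illustrative only.

CREATE TABLE Customer
(
  Id int IDENTITY(1, 1) PRIMARY KEY CLUSTERED,              -- surrogate primary key (step 2)
  Lastname varchar(50) NOT NULL,                            -- NOT NULL constraint (step 5)
  Email varchar(100) NOT NULL UNIQUE,                       -- UNIQUE constraint (step 6)
  CreatedOn datetime NOT NULL DEFAULT (CURRENT_TIMESTAMP)   -- DEFAULT value (step 7)
);

CREATE TABLE CustomerOrder
(
  Id int IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
  CustomerId int NOT NULL,                                  -- Foreign Key column (step 8)
  OrderDate datetime NOT NULL DEFAULT (CURRENT_TIMESTAMP),
  CONSTRAINT FK_CustomerOrder_Customer
    FOREIGN KEY (CustomerId) REFERENCES Customer (Id)       -- referential integrity (step 9)
    ON DELETE CASCADE                                       -- cascading delete (step 10)
);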

 
 What is Database Normalization?
Database Normalization is the process of reorganizing database tables in which columns (and, consequently, the accompanying data) are moved to different tables to avoid anomalies when inserting, updating and deleting data. Anomalies occur because, in a database that is not normalized, the same data may repeat in multiple rows and unrelated data may be placed in the same table. As a result, performing a delete may remove unintended data, updates may not reach all instances of the duplicated data, and extraneous inserts may occur.
We normalize a database to keep its data in a consistent state. As a side benefit, the process of database normalization also helps with the correct identification and modeling of business entities.
During the normalization process, columns are extracted from one table and placed into a new table. However, when the two tables are joined using a common column (primary/foreign key), the same result set as before is obtained. This normalization helps remove duplicate data from the database and aims to keep the same data in one place (single table row). As a result, any inserts, updates and deletes do not compromise consistency and integrity of the database.
The rules for normalizing a database are encapsulated in Normal Forms: 1NF, 2NF, 3NF, BCNF, 4NF and 5NF.
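As a small illustration (the table and column names are hypothetical), the customer name that repeats on every order row below is moved into its own table; joining the two tables on CustomerId reproduces the original result set.

-- Before normalization: the customer's name repeats on every one of that customer's orders
CREATE TABLE OrderFlat
(
  OrderId int PRIMARY KEY,
  CustomerName varchar(50) NOT NULL,
  OrderDate datetime NOT NULL
);

-- After normalization: the name is stored in exactly one place
CREATE TABLE NCustomer
(
  CustomerId int PRIMARY KEY,
  CustomerName varchar(50) NOT NULL
);

CREATE TABLE NOrder
(
  OrderId int PRIMARY KEY,
  CustomerId int NOT NULL REFERENCES NCustomer (CustomerId),
  OrderDate datetime NOT NULL
);

-- Joining the two tables yields the same result set as the flat table
SELECT O.OrderId, C.CustomerName, O.OrderDate
FROM NOrder O JOIN NCustomer C ON C.CustomerId = O.CustomerId;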

Primary and Foreign Keys
Primary and foreign keys help us create relationships between tables in a relational database.

Primary Key 
  • The primary key of a table is a data value in a single table column or a concatenation of data values in multiple table columns
  • Primary key values are unique. That is, no two rows in the table have the same primary key values
  • Primary key values are never NULL for any row in the table
  • Primary key values uniquely identify a row in a table and, therefore, the primary key values never change over time
  • All of the above qualities must hold for business data column(s) used for primary keys
  • Because business data columns rarely satisfy all of the above over time, in practice a surrogate, auto-incrementing integer column is added to the table to serve as the primary key. Since a primary key is typically used as a foreign key in other tables, an integer surrogate key provides an efficient mechanism for implementing such relationships
Foreign Key 
  • Foreign key is a column in a table
  • In practice, foreign key column is of integer data type
  • The values in the foreign key column are always primary key values in another, related table, except when business rules allow the foreign key column to be NULL some of the time
  • Foreign keys enable enforcement of referential integrity constraints, meaning that, when present, the foreign key values always refer to a unique row in another table
Example
  • Table: InvoiceLine (InvoiceLineId, InvoiceId, ProductId, ProductQuantity)
  • Table: Product (ProductId, ProductName)
  • (InvoiceId + ProductId) is the natural primary key for the InvoiceLine table. InvoiceLineId is the surrogate primary key for the InvoiceLine table
  • ProductId is the primary key for the Product table
  • ProductId in the InvoiceLine table is a foreign key that points to the Product table (see the DDL sketch below)
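A T-SQL sketch of this example; the data types are assumptions.

CREATE TABLE Product
(
  ProductId int IDENTITY(1, 1) PRIMARY KEY,                 -- primary key of the Product table
  ProductName varchar(50) NOT NULL
);

CREATE TABLE InvoiceLine
(
  InvoiceLineId int IDENTITY(1, 1) PRIMARY KEY,             -- surrogate primary key
  InvoiceId int NOT NULL,
  ProductId int NOT NULL REFERENCES Product (ProductId),    -- foreign key pointing to Product
  ProductQuantity int NOT NULL,
  CONSTRAINT UQ_InvoiceLine UNIQUE (InvoiceId, ProductId)   -- the natural key remains unique
);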
 
Building a Data Warehouse using Fact and Dimension tables
To build a data warehouse, the data from an operational/transactional database is massaged into another database schema (the Data Warehouse) that is optimized for ease of querying and speed of data retrieval. The process of taking the data from an operational/transactional database and populating it into the Data Warehouse is called ETL - Extract, Transform and Load.

 
One popular approach to creating a Data Warehouse is the use of Fact and Dimension tables. The schema so created is referred to as a Star Schema. The schema consists of a few Fact tables, and each Fact table has multiple Dimension tables. To visualize it, the Fact table is the nucleus of the star and the Dimension tables are the rays shooting out from it.

 
Fact Table
The Fact table contains the primary information of interest related to a business entity or process. That is, the Fact table(s) contain the information that you want to analyze or measure. This is typically numeric transaction data related to a business entity or a process. For example, in the case of sales at a grocery store, for each sale, the Fact table may contain the total amount of the sale and the number of items in the sales transaction. In the case of a sporting event, the Fact table may have rows with a flag to denote attendance or non-attendance of all season ticket holders for all games in a season.

 Dimension Table
The Dimension tables provide context to the information contained in the Fact table. A Dimension table defines an aspect of what is being measured or analyzed in the Fact table. For example, in the case of sales at a grocery store, different dimensions of a sale amount or of the number of items in the sale may include the location of the store, the brands of the products sold, and the day of the week the sale is made. In the case of a sporting event, the Dimension tables may hold information about the age, gender and driving distance to the stadium corresponding to attendance or non-attendance of season ticket holders for all games in the season.

By joining the Fact table with one or more Dimension tables, the primary information in the Fact table can be analyzed by filtering, labeling or grouping data contained in the Dimension tables. Generally, an aggregate SQL function is applied to a column in the Fact table and grouping and filtering is applied to columns in the Dimension tables.

Let's analyze season ticket holder attendance for a major league baseball team. Say we want to analyze the attendance pattern of season ticket holders at the end of the season.

 
Operational/Transactional Database Tables
T_BallParkEntry (Date, SeasonTicketHolderId)
Every entry of every season ticket holder into the stadium  is logged in this table

 T_SeasonTicketHolder (Id, Name, Gender, DOB, Address, City, State, ZipCode)
Contains information about all season ticket holders

Data Warehouse Tables
Fact Table: F_Attendance (GenderId, AgeId, DrivingDistanceId, AttendedFlag)
The "fact" we are trying to analyze is whether a season ticket holder attended the game. We want to analyze attendance along several dimensions such as  gender, age and the number of miles driven to the stadium.

Dimension Table: D_Gender (Id, GenderName)
Male or Female

Dimension Table:  D_Age (Id, YearsOld)
Age of the attendee

Dimension Table: D_DrivingDistance (Id, MilesDriven)
Driving distance of the season ticket holder to the stadium 
Note: 
  1. There is a table corresponding to each dimension of attendance - Gender, Age and Driving Distance
  2. The Fact table has a composite primary key spanning multiple columns corresponding to the ids of the dimension tables (GenderId, AgeId, DrivingDistanceId); a DDL sketch of these tables follows this note
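A T-SQL sketch of these warehouse tables; the data types are assumptions.

CREATE TABLE D_Gender          (Id int PRIMARY KEY, GenderName varchar(10) NOT NULL);
CREATE TABLE D_Age             (Id int PRIMARY KEY, YearsOld int NOT NULL);
CREATE TABLE D_DrivingDistance (Id int PRIMARY KEY, MilesDriven int NOT NULL);

CREATE TABLE F_Attendance
(
  GenderId          int NOT NULL REFERENCES D_Gender (Id),          -- links a fact row to its gender dimension
  AgeId             int NOT NULL REFERENCES D_Age (Id),             -- links a fact row to its age dimension
  DrivingDistanceId int NOT NULL REFERENCES D_DrivingDistance (Id), -- links a fact row to its distance dimension
  AttendedFlag      bit NOT NULL                                    -- 1 = attended, 0 = missed
);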
ETL Process
At the end of the baseball season, an ETL operation is performed to load the operational data into the data warehouse, as outlined below (a set-based T-SQL sketch follows the outline):

 For Each Game Day in the Season
  For Each Season Ticket Holder
     Look in T_BallParkEntry table to determine if the Ticket Holder attended the game
     Get Ticket Holder's Gender from the T_SeasonTicketHolder table
     Calculate Ticket Holder's Age from the T_SeasonTicketHolder table
     Calculate Ticket Holder's Driving Distance using the ZipCode in
             the T_SeasonTicketHolder table
     Insert a row in the F_Attendance table (Insert rows in D_Gender,
             D_Age and D_DrivingDistance tables, as necessary)    
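Assuming a hypothetical T_GameDay(GameDate) table that lists every home game, a hypothetical T_ZipDistance(ZipCode, MilesDriven) lookup that maps a ZIP code to a driving distance, and dimension tables already populated with the distinct values, a set-based T-SQL version of the loop above might look like this sketch (the age calculation is deliberately approximate):

INSERT INTO F_Attendance (GenderId, AgeId, DrivingDistanceId, AttendedFlag)
SELECT  g.Id,
        a.Id,
        d.Id,
        CASE WHEN e.SeasonTicketHolderId IS NULL THEN 0 ELSE 1 END
FROM    T_GameDay gd
CROSS JOIN T_SeasonTicketHolder s                                   -- every ticket holder for every game day
JOIN    D_Gender g          ON g.GenderName  = s.Gender
JOIN    D_Age a             ON a.YearsOld    = DATEDIFF(YEAR, s.DOB, gd.GameDate)
JOIN    T_ZipDistance z     ON z.ZipCode     = s.ZipCode
JOIN    D_DrivingDistance d ON d.MilesDriven = z.MilesDriven
LEFT JOIN T_BallParkEntry e ON e.SeasonTicketHolderId = s.Id
                           AND e.Date = gd.GameDate;                -- no entry row means the game was missed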

 
Analysis
At the end of the ETL process, we have a data warehouse of all attendances (and non-attendances) by season ticket holders during a season. Note that the tables in the data warehouse contain no private season ticket holder information that could identify a particular season ticket holder. Our objective is to analyze the attendance pattern of the entire group of season ticket holders, not any particular season ticket holder.

 
Now that our data warehouse is set up, we can issue queries to do our attendance analysis.

For example, what is the correlation between the Driving Distance, Gender, Age and missing games?

SELECT D.MilesDriven, E.YearsOld, G.GenderName, COUNT(A.AttendedFlag) AS NotAttended
FROM F_Attendance A
JOIN D_DrivingDistance D ON D.Id = A.DrivingDistanceId
JOIN D_Age E             ON E.Id = A.AgeId
JOIN D_Gender G          ON G.Id = A.GenderId
WHERE A.AttendedFlag = 0
GROUP BY D.MilesDriven, E.YearsOld, G.GenderName;

 
This query may yield a result such as follows:

MilesDriven  YearsOld  GenderName  NotAttended
5            25        Female      2
5            25        Male        3
5            30        Female      2
5            30        Male        3
10           25        Female      2
10           25        Male        3
10           30        Female      12
10           30        Male        1

Each row above shows games missed along the dimensions of Driving Distance to the stadium, Age and Gender. It shows that 30-year-old females who have to drive 10 miles to the stadium are the most likely to miss games.

By writing a complex query, the above information can be gleaned directly from the Operational/Transactional database. However, as evidenced by the simplicity of the query above, it is a lot easier for a data analyst to work with the data warehouse. Moreover, when only the data warehouse is used, the private information of season ticket holders is not exposed. Last but not least, the Operational/Transactional database is not overloaded by data analysis queries.

  
  
Writing table row modification history using T-SQL Triggers
Objective
We want to record every change made to rows in a given table. In the example below, the Product table contains information about products. Every change to a row in the Product table is recorded in the ProductHistory table using database triggers.

Code Example
-- Drop tables
IF OBJECT_ID('Product','U') IS NOT NULL
  DROP TABLE Product;
IF OBJECT_ID('ProductHistory','U') IS NOT NULL
  DROP TABLE ProductHistory;

 -- Create tables
CREATE TABLE Product
(
  Id int IDENTITY(100, 1) PRIMARY KEY CLUSTERED,
  Name varchar(50) NOT NULL
);

 CREATE TABLE ProductHistory (
 Id INT IDENTITY (1,1) PRIMARY KEY CLUSTERED NOT NULL,
 ProductId INT,
 Name VarChar(50),
 ChangeType TINYINT DEFAULT(0) NOT NULL,
 CreatedOn DATETIME DEFAULT(CURRENT_TIMESTAMP) NOT NULL
);

GO
-- Create trigger that fires when a row is INSERTed into the Product table
-- Writes the inserted values to the ProductHistory table, with change type 0 (INSERT)
CREATE TRIGGER WriteProductHistoryOnInsert
ON Product
AFTER INSERT
AS
INSERT INTO ProductHistory (ProductId, Name, ChangeType)
SELECT Id, Name, 0 FROM Inserted;
GO

 
-- Create trigger that fires when a row is UPDATEd in the Product table
-- Writes the changed values to the ProductHistory table, with change type 1 (UPDATE)
CREATE TRIGGER WriteProductHistoryOnUpdate
ON Product
AFTER UPDATE
AS
INSERT INTO ProductHistory (ProductId, Name, ChangeType)
SELECT Id, Name, 1 FROM Inserted;
GO

  
-- Create trigger that fires when a row is DELETEd from the Product table
-- Writes the values of the deleted row to the ProductHistory table, with change type 2 (DELETE)
CREATE TRIGGER WriteProductHistoryOnDelete
ON Product
AFTER DELETE
AS
INSERT INTO ProductHistory (ProductId, Name, ChangeType)
SELECT Id, Name, 2 FROM Deleted;
GO

 
-- The following queries show no data in these tables yet
SELECT * FROM Product;
SELECT * FROM ProductHistory;

 
-- Insert a product - fires the WriteProductHistoryOnInsert trigger
INSERT INTO Product (Name) VALUES ('Prod 1');

 
-- We see the new row inserted in BOTH tables
SELECT * FROM Product;
SELECT * FROM ProductHistory;

 
-- Insert another product - fires the WriteProductHistoryOnInsert trigger
INSERT INTO Product (Name) VALUES ('Prod 2');

 
-- We see the new row inserted in BOTH tables
SELECT * FROM Product;
SELECT * FROM ProductHistory;

  
-- Update the first product
Update Product SET Name='Prod ONE' WHERE Name='Prod 1';

 
-- NO new row in Product table (just an update of name to 'Prod ONE')
SELECT * FROM Product;
-- New row for the updated value in history
SELECT * FROM ProductHistory;

 
-- Delete the second product
DELETE FROM Product WHERE Name='Prod 2';

 
-- Row is removed from the Product table
SELECT * FROM Product;
-- Deletion is recorded in the history table
SELECT * FROM ProductHistory;

 
T-SQL: Checking if a table exists
-- Drop authentication table only if it exists in the schema
IF OBJECT_ID('Authentication','U') IS NOT NULL
   DROP TABLE Authentication;

 
Paging Query Result Set in T-SQL
Problem
We have a large number of products in the Product table. We want to retrieve a "page" of these products per query, for example, to display them on a web page. We want the paged result set to be sorted in the order of Product Name.

 Create Table
CREATE TABLE Product
(
  Id int IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
  Name varchar(50) NOT NULL
);

 Populate Table
-- Insert some data in rAnDom order
INSERT INTO Product (Name) VALUES ('MMM');
INSERT INTO Product (Name) VALUES ('ZZZ');
INSERT INTO Product (Name) VALUES ('AAA');
INSERT INTO Product (Name) VALUES ('BBB');
INSERT INTO Product (Name) VALUES ('QQQ');
INSERT INTO Product (Name) VALUES ('RRR');
INSERT INTO Product (Name) VALUES ('SSS');
INSERT INTO Product (Name) VALUES ('TTT');
INSERT INTO Product (Name) VALUES ('UUU');
INSERT INTO Product (Name) VALUES ('VVV');
INSERT INTO Product (Name) VALUES ('WWW');
INSERT INTO Product (Name) VALUES ('XXX');
INSERT INTO Product (Name) VALUES ('YYY');
INSERT INTO Product (Name) VALUES ('CCC');
INSERT INTO Product (Name) VALUES ('DDD');
INSERT INTO Product (Name) VALUES ('EEE');
INSERT INTO Product (Name) VALUES ('FFF');
INSERT INTO Product (Name) VALUES ('NNN');
INSERT INTO Product (Name) VALUES ('OOO');
INSERT INTO Product (Name) VALUES ('PPP');
INSERT INTO Product (Name) VALUES ('GGG');
INSERT INTO Product (Name) VALUES ('HHH');
INSERT INTO Product (Name) VALUES ('III');
INSERT INTO Product (Name) VALUES ('JJJ');
INSERT INTO Product (Name) VALUES ('KKK');
INSERT INTO Product (Name) VALUES ('LLL');

 
Issue Paged Queries 
-- Query Data using Paging
DECLARE @StartRow INT
DECLARE @EndRow INT

  
-- query rows 1 through 10 in ascending order of product name
SET @StartRow = 1
SET @EndRow = 10

 SELECT    Id, Name  FROM
    (SELECT P.Id, P.Name, ROW_NUMBER()
     OVER(ORDER BY P.Name) AS RowNumber
        FROM Product P) ProductPage
    WHERE  RowNumber >= @StartRow AND RowNumber <= @EndRow
    ORDER BY Name

 
Result:
55 AAA
56 BBB
66 CCC
67 DDD
68 EEE
69 FFF
73 GGG
74 HHH
75 III
76 JJJ

  
-- query rows 11 through 20 in ascending order of product name
SET @StartRow = 11
SET @EndRow = 20

 
SELECT    Id, Name  FROM
    (SELECT P.Id, P.Name, ROW_NUMBER()
     OVER(ORDER BY P.Name) AS RowNumber
        FROM Product P) ProductPage
    WHERE  RowNumber >= @StartRow AND RowNumber <= @EndRow
    ORDER BY Name

 
Result:
77 KKK
78 LLL
53 MMM
70 NNN
71 OOO
72 PPP
57 QQQ
58 RRR
59 SSS
60 TTT

   
-- query rows 21 through 30 in ascending order of product name
SET @StartRow = 21
SET @EndRow = 30

  
SELECT    Id, Name  FROM
    (SELECT P.Id, P.Name, ROW_NUMBER()
     OVER(ORDER BY P.Name) AS RowNumber
        FROM Product P) ProductPage
    WHERE  RowNumber >= @StartRow AND RowNumber <= @EndRow
    ORDER BY Name

  
Result:
61 UUU
62 VVV
63 WWW
64 XXX
65 YYY
54 ZZZ

   
-- query rows 31 through 40 in ascending order of product name
SET @StartRow = 31
SET @EndRow = 40

  
SELECT    Id, Name  FROM
    (SELECT P.Id, P.Name, ROW_NUMBER()
     OVER(ORDER BY P.Name) AS RowNumber
        FROM Product P) ProductPage
    WHERE  RowNumber >= @StartRow AND RowNumber <= @EndRow
    ORDER BY Name 
   
Result:
  No rows
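On SQL Server 2012 and later, the same paging can be expressed more directly with OFFSET and FETCH, which avoids the ROW_NUMBER() subquery. The sketch below returns the second page of 10 products in ascending order of product name.

-- Page rows 11 through 20 in ascending order of product name
DECLARE @PageNumber INT = 2;
DECLARE @PageSize   INT = 10;

SELECT Id, Name
FROM Product
ORDER BY Name
OFFSET (@PageNumber - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;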

  
Tips for Agile Database Development
If you believed the hype ten years ago, we would all be working with object-oriented databases by now. The big advantage of an object-oriented database is that your application's domain model and the storage model are the same. For various technical and commercial reasons, relational databases still power most of our enterprise-level systems, and this is not about to change. The implication for the development community is that there continues to be an impedance mismatch between the relational model of the database and the object-oriented semantics of the application domain model. This divide is so profound that our popular system development patterns strictly separate the business layer from the data layer. This dichotomy translates into a significant amount of work for a development team, such as creating database schemas, tuning database performance, mapping the domain classes to data access classes, and maintaining all of the above throughout the system's lifetime. Just as we apply agile practices to the business layer, we must apply similar principles to the data layer, of course with a few twists. Here are some tips on how to maintain the database layer in an agile ecosystem:
  • Version your database. This may simply mean maintaining an integer value in a table in your database schema to indicate the version of the schema. Over time the database schema will evolve and there will inevitably be a mismatch between the application domain code and the schema it is running against. By checking the supported database version, the application domain code can spot these issues up front. (A minimal sketch appears after this list.)
  • Check version before accessing database. Make the database access code smart to check the version of the database before performing any operations. If an older version of the application needs to work with a newer schema of the database, write a compatibility layer that makes the database appear like the old schema to the older version of the application. Placing all data access code in one place helps in this scenario. 
  • Create database upgrade and migration scripts. Create and maintain incremental database scripts that migrate a database schema from any earlier released version to any later schema version, up to the very latest version. In order to write good Unit Tests, we need to be able to create any released version of the database. These scripts generally accomplish two tasks: update the database schema and migrate the data. 
  • Put database access code in one place. Put all database access code in one place and have the application domain code talk to this database access code. That is, do not disperse your SQL Statements all over the place. Furthermore, create a well-defined interface for the database access code. This will facilitate creation of a Mock database access code layer that will be instrumental in creating Unit Tests for the application domain code. 
  • Use source control. Implement source control over the Database Update scripts just like you would over any other kind of source code. This is essential to traceability of the database changes made. 
  • Unit test database access code. Write robust Unit Tests for the database access code. These tests are separate from the Unit Tests for the application domain code. The purpose of these tests is to verify the CRUD (Create, Read, Update and Delete) behavior of your database access code. 
  • Performance test database access code. Write performance tests that push large amounts of data through the database access code to spot inefficiencies.  
  • Create database utilities and scripts. Create tools and scripts to allow testers and developers to create any released version of the database. A particularly useful script is one that allows us to copy or move a database in its entirety from one machine to another. If a client reports a problem, an identical environment can then be easily created for diagnosis. These tools and scripts not only create the appropriate database schema but also populate the appropriate seed data in the tables. 
  • Integrate very often. Make database related activities part of the continuous integration cycle, just like your application domain code.
  • Automate as much as you can. This includes writing automated tests, setting up a continuous integration server, and writing simple tools and scripts to recreate a database schema with the appropriate configuration/seed data. What cannot be used easily does not usually get used. 
  • Write a utility for database health check. Write a utility library to verify the structure of a given database. Even though we versioned our database, it is possible that some table is missing a column, a constraint is non-existent or a trigger is not present. A utility to verify the validity of a database at the database object level (tables, views, stored procedures, etc.) is invaluable for diagnosing problems in testing and at client sites. 
  • Version compatibility matrix. Keep a compatibility matrix of what version of application works with what version of the database. Make it freely and widely available to development and onsite teams for early diagnosis of any database related issues.
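
As a minimal sketch of the first tip, a one-row version table and the check an upgrade script or the application might perform could look like the following; the names are illustrative only.

-- One-row table recording the current schema version
CREATE TABLE SchemaVersion
(
  Version   int NOT NULL,
  AppliedOn datetime NOT NULL DEFAULT (CURRENT_TIMESTAMP)
);

INSERT INTO SchemaVersion (Version) VALUES (1);

-- An upgrade script bumps the version as its last step
-- UPDATE SchemaVersion SET Version = 2, AppliedOn = CURRENT_TIMESTAMP;

-- Application or script code reads the version before touching the schema
SELECT Version FROM SchemaVersion;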