Sunday, April 13, 2014

Why Big Data?

In recent years, the following significant developments have brought about the notion of big data:
  1. Computational power has grown exponentially.
  2. Data storage has become plentiful.
  3. Many new ways of collecting data have surfaced. Not only has the Internet become ubiquitous, but the ways to access it have multiplied. 
Personal computers, tablets, smartphones, appliances, cars, wireless cameras, physical sensors (weather, radiation, traffic, electron microscopes, etc.), and numerous other devices are connected to the Internet. These devices communicate and stream large amounts of data over the Internet to servers, and to each other.

Collectively, this constitutes big data.

Why does the data get so big? 

Here's a trivial example. Suppose a web page is visited by 1 million users in a 24-hour period. On average, a user interacts with the page for 5 minutes and clicks 30 times on various parts of the page during that interaction. We want to know what captures users' attention, so we record mouse clicks for all users, then aggregate and analyze the click data to find which parts of the page users gravitate to. The data points collected come to roughly 30 * 1,000,000 = 30,000,000 for a single day. If we do this for several days, the amount of data really adds up.
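
To put rough numbers on this, here is a minimal back-of-the-envelope sketch in Python. The 200-byte record size is an assumption for illustration (say, a timestamp, user id, x/y coordinates, and an element id), not a figure from the example.

```python
# Back-of-the-envelope estimate of how quickly click data accumulates.
USERS_PER_DAY = 1_000_000   # visitors in a 24-hour period
CLICKS_PER_USER = 30        # average clicks per visit
BYTES_PER_CLICK = 200       # assumed size of one click record

clicks_per_day = USERS_PER_DAY * CLICKS_PER_USER
bytes_per_day = clicks_per_day * BYTES_PER_CLICK

print(f"Clicks per day: {clicks_per_day:,}")                  # 30,000,000
print(f"Data per day:   {bytes_per_day / 1e9:.1f} GB")        # 6.0 GB
print(f"Data per year:  {bytes_per_day * 365 / 1e12:.2f} TB") # 2.19 TB
```

Even with these modest assumptions, a single page generates terabytes of raw click data per year.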

Big data is largely unstructured data

With big data, the normal course is to collect the data first and ask questions later. In the previous example, we collected a large amount of mouse-click data. Once this data is collected, we can ask many different kinds of questions by slicing and dicing it. For example, we can calculate the mean time between mouse clicks to estimate how much time a user spends on a particular page.
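
As a sketch of that kind of slicing and dicing, the snippet below computes the mean time between clicks from one user's session; the timestamps are made-up sample data.

```python
# Mean time between mouse clicks for one user's session (sample data).
# Click timestamps in seconds since the start of the session.
click_times = [0.0, 3.2, 7.5, 8.1, 15.0, 22.4]

# Elapsed time between each pair of consecutive clicks.
gaps = [b - a for a, b in zip(click_times, click_times[1:])]

mean_gap = sum(gaps) / len(gaps)
print(f"Mean time between clicks: {mean_gap:.2f} s")  # 4.48 s
```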

The volume and the raw nature of the data (mouse clicks) make traditional fixed-schema relational databases unsuitable for this kind of collection. NoSQL databases are a more natural fit for accumulating and working with big data because they do not require a fixed schema up front and are designed to scale out horizontally as volume grows. Tools such as Hadoop can then be run against this vast repository of data to summarize it for further analysis and to draw conclusions.
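
To illustrate the kind of summarization Hadoop performs, here is a sketch of a mapper and reducer in the style of Hadoop Streaming, which counts clicks per page element. The input format (one tab-separated click record per line, with the element id in the last field) is an assumption made for this example.

```python
# mapper.py -- Hadoop Streaming-style mapper (sketch).
# Assumes each input line is a tab-separated click record whose
# last field is the id of the page element that was clicked.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    element_id = fields[-1]
    print(f"{element_id}\t1")  # emit (element, 1) for each click
```

```python
# reducer.py -- sums the counts emitted by the mapper (sketch).
# Hadoop Streaming delivers mapper output sorted by key, so all
# counts for one element arrive consecutively.
import sys
from itertools import groupby

def parse(stream):
    for line in stream:
        key, count = line.rstrip("\n").split("\t")
        yield key, int(count)

for element_id, pairs in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
    total = sum(count for _, count in pairs)
    print(f"{element_id}\t{total}")
```

Hadoop runs many copies of the mapper in parallel across the cluster, then routes each element's counts to a reducer, which is what makes this pattern practical at the 30-million-clicks-a-day scale of the earlier example.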

Volume, Velocity, Variety

These are the three V's of big data.

Volume: The amount of data to be collected and analyzed is very large.

Velocity: The speed at which this data comes at us is very high. Great amounts of storage and computing power are needed to process this streaming data before it becomes irrelevant.

Variety: The data can be structured or unstructured. The challenge of big data analytics is to coalesce a variety of data inputs to draw useful conclusions and to come up with actionable recommendations.

The promise of big data

The volume of data recorded, accumulated, and streamed by all these devices is huge. But we now have equally enormous computing power and storage with which to keep and analyze this data in fields such as genomics, astronomy, the physical and chemical sciences, law enforcement, security, and social media.

Analysis of big data that may contain millions or even billions of data points can yield results and trends that could not be predicted even with the best predictive models. Machine learning algorithms, statistical techniques, or even simple visualization of the data as charts can provide deep insight into the trends and behaviors behind it.