Sunday, August 15, 2010

MapReduce

MapReduce is a programming model for processing large data sets. It comprises two steps:
  1. Use a Map function to break the original (large) data set down into key/value pairs. These are the intermediate key/value pairs.
  2. Use a Reduce function to combine the intermediate key/value pairs and arrive at the final answer.
MapReduce lends itself very well to parallel processing: many compute nodes can work independently on different segments of the data during both the Map and Reduce steps.
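
To make the flow concrete, here is a minimal single-machine sketch of that contract in Java. The MiniMapReduce name and the run signature are my own illustration, not any framework's API; the sketch also makes visible the step that sits between Map and Reduce, where the intermediate pairs are grouped by key.

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

// A single-machine model of the MapReduce contract. MiniMapReduce and
// run are illustrative names, not part of any framework's API.
public class MiniMapReduce {

    // Runs map over every input, groups the intermediate pairs by key,
    // then runs reduce once per distinct key.
    static <I, K, V, R> Map<K, R> run(
            List<I> inputs,
            Function<I, List<Map.Entry<K, V>>> map,
            BiFunction<K, List<V>, R> reduce) {
        // Map step: produce intermediate key/value pairs, grouped by key.
        Map<K, List<V>> grouped = new HashMap<>();
        for (I input : inputs) {
            for (Map.Entry<K, V> pair : map.apply(input)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        // Reduce step: combine the list of values collected for each key.
        Map<K, R> result = new HashMap<>();
        for (Map.Entry<K, List<V>> entry : grouped.entrySet()) {
            result.put(entry.getKey(),
                       reduce.apply(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy run: classify numbers as "even" or "odd" and count each class.
        Function<Integer, List<Map.Entry<String, Integer>>> classify =
            n -> Collections.singletonList(
                new AbstractMap.SimpleEntry<>(n % 2 == 0 ? "even" : "odd", 1));
        BiFunction<String, List<Integer>, Integer> countValues =
            (key, values) -> values.size();
        System.out.println(run(Arrays.asList(1, 2, 3, 4, 5), classify, countValues));
        // Prints something like {even=2, odd=3}
    }
}

A real framework runs the map calls and the reduce calls on different machines and moves the grouped pairs between them; the grouping loop above stands in for that "shuffle" phase.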

Example

Objective

Produce a word count from a large text corpus

Map Step

  • Iterate through all documents in the corpus. For each document:
    • Parse the document into words
    • Emit an intermediate key/value pair of (word, 1) for each word found

Reduce Step

  • Sum the counts for each intermediate key (word) to arrive at the total count for that word in the corpus (see the sketch below)
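
The sketch below is a self-contained, single-process rendering of both steps in plain Java; the WordCountLocal class name and the tiny hard-coded corpus are stand-ins for a real, file-backed corpus.

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountLocal {
    public static void main(String[] args) {
        // Stand-in corpus; a real job would read documents from disk or HDFS.
        List<String> corpus = Arrays.asList(
            "the quick brown fox",
            "the quick dog");

        // Map step: emit an intermediate (word, 1) pair for every word
        // in every document.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String doc : corpus) {
            for (String word : doc.split("\\s+")) {
                intermediate.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Reduce step: sum the counts for each intermediate key (word).
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }

        System.out.println(counts);
        // e.g. {the=2, quick=2, brown=1, fox=1, dog=1}
    }
}
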
Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.
See: http://hadoop.apache.org/mapreduce/
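
For reference, the same word count looks roughly like this when written against Hadoop's org.apache.hadoop.mapreduce API. This sketch follows the canonical WordCount example shipped with Hadoop and assumes a Hadoop 2.x-era client; details vary between releases.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: for each line of input, emit (word, 1) for every word.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: sum the counts the framework has grouped under each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Registering the reducer as a combiner lets each map node pre-sum its own (word, 1) pairs before they cross the network, which is safe here because addition is associative and commutative.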