Sunday, August 15, 2010

MapReduce

MapReduce is a programming model for processing large data sets. It comprises two steps:
  1. Use a Map function to break the original (large) data set down into key/value pairs. These are the intermediate key/value pairs.
  2. Use a Reduce function to combine the intermediate key/value pairs and arrive at the final answer.
MapReduce lends itself very well to parallel processing: many compute nodes can work independently on different segments of the data during both the Map and Reduce steps.
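
To make the flow concrete, here is a minimal single-machine sketch of that contract in Java. The MiniMapReduce name and the run signature are my own illustration, not any framework's API; the sketch also makes visible the step that sits between Map and Reduce, where the intermediate pairs are grouped by key.

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

// A single-machine model of the MapReduce contract. MiniMapReduce and
// run are illustrative names, not part of any framework's API.
public class MiniMapReduce {

    // Runs map over every input, groups the intermediate pairs by key,
    // then runs reduce once per distinct key.
    static <I, K, V, R> Map<K, R> run(
            List<I> inputs,
            Function<I, List<Map.Entry<K, V>>> map,
            BiFunction<K, List<V>, R> reduce) {
        // Map step: produce intermediate key/value pairs, grouped by key.
        Map<K, List<V>> grouped = new HashMap<>();
        for (I input : inputs) {
            for (Map.Entry<K, V> pair : map.apply(input)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        // Reduce step: combine the list of values collected for each key.
        Map<K, R> result = new HashMap<>();
        for (Map.Entry<K, List<V>> entry : grouped.entrySet()) {
            result.put(entry.getKey(),
                       reduce.apply(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy run: classify numbers as "even" or "odd" and count each class.
        Function<Integer, List<Map.Entry<String, Integer>>> classify =
            n -> Collections.singletonList(
                new AbstractMap.SimpleEntry<>(n % 2 == 0 ? "even" : "odd", 1));
        BiFunction<String, List<Integer>, Integer> countValues =
            (key, values) -> values.size();
        System.out.println(run(Arrays.asList(1, 2, 3, 4, 5), classify, countValues));
        // Prints something like {even=2, odd=3}
    }
}

A real framework runs the map calls and the reduce calls on different machines and moves the grouped pairs between them; the grouping loop above stands in for that "shuffle" phase.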

Example

Objective

Produce a word count from a large text corpus

Map Step

  • Iterate through all documents in the corpus. For each document:
    • Parse the document into words
    • Emit an intermediate key/value pair of (word, 1) for each word found

Reduce Step

  • Sum the counts for each intermediate key (word) to arrive at the total count for that word in the corpus (see the sketch below)
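
The sketch below is a self-contained, single-process rendering of both steps in plain Java; the WordCountLocal class name and the tiny hard-coded corpus are stand-ins for a real, file-backed corpus.

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountLocal {
    public static void main(String[] args) {
        // Stand-in corpus; a real job would read documents from disk or HDFS.
        List<String> corpus = Arrays.asList(
            "the quick brown fox",
            "the quick dog");

        // Map step: emit an intermediate (word, 1) pair for every word
        // in every document.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String doc : corpus) {
            for (String word : doc.split("\\s+")) {
                intermediate.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Reduce step: sum the counts for each intermediate key (word).
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }

        System.out.println(counts);
        // e.g. {the=2, quick=2, brown=1, fox=1, dog=1}
    }
}
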
Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.
See: http://hadoop.apache.org/mapreduce/
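
For reference, the same word count looks roughly like this when written against Hadoop's org.apache.hadoop.mapreduce API. This sketch follows the canonical WordCount example shipped with Hadoop and assumes a Hadoop 2.x-era client; details vary between releases.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: for each line of input, emit (word, 1) for every word.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: sum the counts the framework has grouped under each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Registering the reducer as a combiner lets each map node pre-sum its own (word, 1) pairs before they cross the network, which is safe here because addition is associative and commutative.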