Dinesh on Java: MapReduce Flow Chart Sample Example

This MapReduce tutorial walks through a sample MapReduce job, with a flow chart, to show how MapReduce processes a job.

A SIMPLE EXAMPLE FOR WORD COUNT

We have a large collection of text documents in a folder. (To give a feel for the size: 1,000 documents, each with an average of 1 million words.)
What we need to calculate:
- The frequency of each distinct word across the documents.

How would you solve this using a simple Java program?

How many lines of code will you write?

How long will the program take to run?
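
For comparison, a single-machine answer to these questions might look like the minimal sketch below. The class name `LocalWordCount` and the helper `countWords` are hypothetical, introduced only for illustration. The sketch reads every file in a folder one after another in a single JVM, which is the part that stops scaling at the sizes described above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical single-JVM word count, shown for comparison only.
final class LocalWordCount {

    // Counts whitespace-separated tokens in the given lines.
    static Map<String, Integer> countWords(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                counts.merge(tokenizer.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Folder to scan; defaults to the current directory (an assumption).
        Path folder = Paths.get(args.length > 0 ? args[0] : ".");
        Map<String, Integer> total = new TreeMap<>();
        try (Stream<Path> files = Files.list(folder)) {
            for (Path file : files.filter(Files::isRegularFile).collect(Collectors.toList())) {
                // One file at a time, fully loaded into memory, on one machine.
                countWords(Files.readAllLines(file))
                        .forEach((word, count) -> total.merge(word, count, Integer::sum));
            }
        }
        total.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

The code itself is short, but the run time grows with the total input size because nothing is processed in parallel.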

A MapReduce program addresses these problems in just a few lines of code. The map and reduce functions below show how it works on a large dataset.

MAP FUNCTION

A map function operates on every key-value pair of the input data and applies the transformation logic provided in the map function.
A map function always emits key-value pairs as output:
Map(Key1, Value1) --> List(Key2, Value2)
A map function transformation is similar to a row-level function in standard SQL.
For each file, the map function is:
- Read each line from the input file
  - Tokenize the line to get each word
    - Emit (word, 1) for every word found

The emitted (word, 1) pairs form the list that is the output of the mapper.
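
The map step described above can be sketched in plain Java, without Hadoop, to make the Map(Key1, Value1) --> List(Key2, Value2) shape concrete. The class `MapSketch` and its `map` method are hypothetical names used only for this illustration; the input key is assumed to be the line's position in the file and is ignored.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical plain-Java illustration of the map step (no Hadoop involved).
final class MapSketch {

    // Map(Key1, Value1) --> List(Key2, Value2): one (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> output = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            output.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return output;
    }

    public static void main(String[] args) {
        // prints [hi=1, how=1, are=1, you=1]
        System.out.println(map(0L, "hi how are you"));
    }
}
```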

So who ensures that the file is distributed and that each line of the file is passed to a map function? The Hadoop framework takes care of this, so there is no need to worry about the distributed system.

REDUCE FUNCTION

A reduce function takes the list of values for every key and transforms the data based on the (aggregation) logic provided in the reduce function:
Reduce(Key2, List(Value2)) --> List(Key3, Value3)
A reduce function is similar to an aggregate function in standard SQL.

The List(Key, Value) output from the mapper is shuffled and sorted by key.
The pairs are then grouped by key to create the list of values for each key.
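
The shuffle, sort, and group-by-key step can be pictured with the following minimal sketch. It is a single-JVM illustration only, not how Hadoop moves data between machines; the class `ShuffleSketch` and the method `groupByKey` are hypothetical names. A sorted map stands in for the sort, and each key collects its list of values.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for shuffle, sort, and group by key.
final class ShuffleSketch {

    // Turns List(Key2, Value2) into Key2 -> List(Value2), with keys sorted.
    static SortedMap<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapperOutput = new ArrayList<>();
        mapperOutput.add(new SimpleEntry<>("how", 1));
        mapperOutput.add(new SimpleEntry<>("is", 1));
        mapperOutput.add(new SimpleEntry<>("how", 1));
        // prints {how=[1, 1], is=[1]}
        System.out.println(groupByKey(mapperOutput));
    }
}
```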

The reduce function is:
- Read each key (word) and the list of values (1, 1, 1, ...) associated with it
  - For each key, add up the list of values to calculate the sum
    - Emit (word, sum) for every word found
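
As a plain-Java sketch of the reduce step listed above (no Hadoop involved; `ReduceSketch` and `reduce` are hypothetical names), the function below receives one key with its grouped values and emits a single (word, sum) pair. For word count the output list happens to contain exactly one pair per key, so the sketch returns that pair directly.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Map;

// Hypothetical plain-Java illustration of the reduce step.
final class ReduceSketch {

    // Reduce(Key2, List(Value2)) --> (Key3, Value3): sums the values for one key.
    static Map.Entry<String, Integer> reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return new SimpleEntry<>(key, sum);
    }

    public static void main(String[] args) {
        // prints how=3
        System.out.println(reduce("how", Arrays.asList(1, 1, 1)));
    }
}
```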

So who ensures the shuffle, sort, group by, and so on? Again, the Hadoop framework takes care of this.

MAP FUNCTION FOR WORD COUNT

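import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The class name WordCountMapper is illustrative.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken()); // reuse the Text object for each token
            context.write(word, one);        // emit (word, 1)
        }
    }
}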

REDUCE FUNCTION FOR WORD COUNT

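import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// The class name WordCountReducer is illustrative.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get(); // add up the 1s emitted for this word
        }
        context.write(key, new IntWritable(sum)); // emit (word, sum)
    }
}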

ANATOMY OF A MAPREDUCE PROGRAM

FLOW CHART OF A MAPREDUCE PROGRAM
Suppose we have a file of about 200 MB with the following content:

-----------file.txt (200 MB)-----------
hi how are you
how is your job                  (64 MB) Split 1
---------------------------------------
how is your family
how is your brother              (64 MB) Split 2
---------------------------------------
how is your sister
what is the time now             (64 MB) Split 3
---------------------------------------
what is the strength of hadoop   (8 MB)  Split 4
---------------------------------------

The file above is divided into 4 splits: three splits of 64 MB each and a fourth split of 8 MB.
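
The arithmetic behind these four splits can be checked with a small sketch. This is simplified illustration code, not Hadoop's actual split logic; `SplitSketch` and `splitSizes` are hypothetical names, and sizes are plain numbers in MB.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that cuts a file size into fixed-size splits.
final class SplitSketch {

    // Returns the size of each split, in the same unit as the inputs.
    static List<Long> splitSizes(long fileSize, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Long> sizes = new ArrayList<>();
        long remaining = fileSize;
        while (remaining > 0) {
            long size = Math.min(splitSize, remaining);
            sizes.add(size);
            remaining -= size;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // 200 MB file with 64 MB splits: prints [64, 64, 64, 8]
        System.out.println(splitSizes(200, 64));
    }
}
```

Each split is handed to its own map task, so this file would be processed by four mappers.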

Input File Formats:
----------------------------
1. TextInputFormat
2. KeyValueTextInputFormat
3. SequenceFileInputFormat
4. SequenceFileAsTextInputFormat
------------------------------
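
The input format decides what key and value the map function receives. With TextInputFormat, the key is the byte offset of a line in the file and the value is the line itself, which is why the word count mapper takes a LongWritable key and a Text value. The sketch below only simulates that record shape in plain Java; `TextRecordsSketch` and `toRecords` are hypothetical names, and it assumes UTF-8 text with single-byte "\n" line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical simulation of (byte offset, line) records; not a Hadoop class.
final class TextRecordsSketch {

    // Maps each line's starting byte offset to the line's text.
    static Map<Long, String> toRecords(String content) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            records.put(offset, line);
            // Advance past the line's bytes plus one byte for the newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        // prints {0=hi how are you, 15=how is your job}
        System.out.println(toRecords("hi how are you\nhow is your job\n"));
    }
}
```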

Let's look at the following figure to understand the MapReduce process.