Hadoop: How to collect output of Reduce into a Java HashMap - hadoop

I'm using Hadoop to compute co-occurrence similarity between words. I have a file that consists of co-occurring word pairs that looks like:
a b
a c
b c
b d
I'm using a Graph based approach that treats words as nodes and co-occurring words have an edge between them. My algorithm needs to compute the degree of all nodes. I've successfully written a Map-Reduce job to compute the total degree which outputs the following:
a 2
b 3
c 2
d 1
Currently, the output is written back to a file, but what I want instead is to capture the result into, say, a java.util.HashMap. I then want to use this HashMap in another Reduce job to compute the final similarity.
Here are my questions:
Is it possible to capture the results of a reduce job in memory (List, Map)? If so, how?
Is this the best approach? If not, how should I deal with it?

There are two possibilities: either you read the data in your map/reduce task from the distributed file system, or you add it directly to the distributed cache. I just googled the distributed cache size, and it can be controlled:
"The local.cache.size parameter controls the size of the
DistributedCache. By default, it’s set to 10 GB."
Link to cloudera blog
So if you add the output of your first job to the distributed cache of the second, you should be fine, I think. Tens of thousands of entries are nowhere near the gigabyte range.
Adding a file to the distributed cache goes as follows:
TO READ in your mapper:
Path[] uris = DistributedCache.getLocalCacheFiles(context.getConfiguration());
String patternsFile = uris[0].toString();
BufferedReader in = new BufferedReader(new FileReader(patternsFile));
TO ADD to the DistributedCache:
DistributedCache.addCacheFile(new URI(file), job.getConfiguration());
while setting up your second job.
Let me know if this does the trick.
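For completeness, here is a rough, untested sketch of how the second job's mapper could load the cached degree file into a java.util.HashMap in setup(). It assumes the first job wrote tab-separated word/degree lines and uses the same DistributedCache API as above; the class name is only a placeholder:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SimilarityMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, Integer> degrees = new HashMap<String, Integer>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Load the degree file produced by the first job from the distributed cache.
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");          // word <TAB> degree
                degrees.put(parts[0], Integer.parseInt(parts[1]));
            }
        } finally {
            in.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 'degrees' now holds every node's degree; emit whatever the
        // similarity computation of the second job needs here.
    }
}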

Related

How mapper and reducer tasks are assigned

When executing an MR job, Hadoop divides the input data into N splits and then starts N corresponding map tasks to process them separately.
1. How is the data divided (split into different InputSplits)?
2. How is a split scheduled (how is it decided which TaskTracker machine the map task handling the split should run on)?
3. How is the divided data read?
4. How are reduce tasks assigned?
In Hadoop 1.x?
In Hadoop 2.x?
The two questions are related, so I am asking them together; you can answer whichever part you know best.
Thanks in advance.
Data is stored in HDFS blocks of a predefined size and read by various RecordReader types, which use byte scanners and know how many bytes to read in order to determine when an InputSplit needs to be returned.
A good exercise to understand it better is to implement your own RecordReader and create small and large files containing one small record, one large record, and many records. In the many-records case, try splitting a record across two blocks; that test case should behave the same as one large record spanning two blocks.
The number of reduce tasks can be set by the client of the MapReduce job.
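For example, a minimal sketch of setting the reduce count on the client side (the job name and count are arbitrary; the map count simply follows from the input splits):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobSetupExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "example-job");
        // The client decides how many reduce tasks run; the number of map
        // tasks is derived from the splits computed by the InputFormat.
        job.setNumReduceTasks(4);
        // ... set mapper/reducer classes, input/output paths, then submit.
    }
}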
As of Hadoop 2 + YARN, that image is outdated

Processing and splitting large data using Hadoop Map reduce?

I have a large amount of data in text files (1,000,000 lines). Each line has 128 columns.
Now I am trying to build a k-d tree with this data. I want to use MapReduce for the calculations.
Brute-force approach for my problem:
1) Write a MapReduce job to find the variance of each column and select the column with the highest variance.
2) Taking (column name, variance value) as input, write another MapReduce job to split the input data into two parts: one part contains all the rows with a value less than the input value for the given column, and the other part contains all the rows with a greater value.
3) For each part, repeat step 1 and step 2, and continue the process until you are left with 500 values in each part.
The (column name, variance value) pair forms a single node of my tree, so with the brute-force approach a tree of height 10 requires running 1024 MapReduce jobs.
My questions:
1) Is there any way I can improve the efficiency by running fewer MapReduce jobs?
2) I am reading the same data every time. Is there any way I can avoid that?
3) Are there other frameworks like Pig, Hive, etc. that are efficient for this kind of task?
4) Are there any frameworks I could use to save the data into a data store and easily retrieve it?
Please help...
Why don't you try using Apache Spark (https://spark.apache.org/) here? This seems like a perfect use case for Spark.
With an MR job per node of the tree you have on the order of 2^n jobs (where n is the height of the tree), which is bad given YARN's per-job overhead. But with simple programming tricks you can bring it down to about n jobs.
Here are some ideas:
Add an extra partition column in front of your key; this column is the nodeID (each node in your tree has a unique ID). This creates independent data flows and ensures that keys from different branches of the tree do not mix and that all variances are calculated in the context of their nodeID, in waves, one layer of nodes at a time. It removes the need for an MR job per node with very little change in the code and brings the job count down from 2^n to about n (see the sketch after this list);
The data is not sorted around the split value, and while splitting, elements from the parent list have to travel to their destination child lists, so there will be network traffic between the cluster nodes. Thus caching the whole data set on a cluster of multiple machines might not give significant improvements;
After calculating a couple of levels of the tree, certain nodeIDs may contain a number of rows that fits in the memory of a mapper or reducer; you could then continue processing that sub-tree entirely in memory and avoid costly MR jobs. This reduces the number of MR jobs, or the amount of data, as the processing gets closer to the bottom of the tree;
Another optimisation is to write a single MR job whose mapper splits the data around the selected value of each node, writes the parts via MultipleOutputs, and emits keys carrying the child nodeIDs of the next tree level to the reducer, which calculates the variance of the columns within the child lists. Of course the very first run has no splitting value, but all subsequent runs will have multiple split values, one for each child nodeID.
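Here is a hypothetical, untested sketch of the nodeID-prefix idea from the first bullet: each row carries the ID of the tree node it currently belongs to, and the mapper keys its output by (nodeID, column) so a single job computes per-column statistics for every node of the current level:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LevelVarianceMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text row, Context context)
            throws IOException, InterruptedException {
        String[] fields = row.toString().split("\t");
        String nodeId = fields[0];        // assumed: nodeID prepended to each row
        for (int col = 1; col < fields.length; col++) {
            // Grouping by (nodeID, column index) keeps branches separate; the
            // reducer can aggregate sum and sum-of-squares per group to get
            // the variance of each column within each node.
            context.write(new Text(nodeId + ":" + col), new Text(fields[col]));
        }
    }
}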

Where to add specific function in the Map/Reduce Framework

I have a general question about the MapReduce framework.
I have a task that can be separated into several partitions. For each partition, I need to run a computation-intensive algorithm.
According to the MapReduce framework, it seems that I have two choices:
Run the algorithm in the Map stage, so that in the Reduce stage there is no work to be done except collecting the results of each partition from the Map stage and summarizing them.
In the Map stage, just divide the data and send the partitions to the Reduce stage. In the Reduce stage, run the algorithm first, and then collect and summarize the results from each partition.
Correct me if I misunderstand.
I am a beginner and may not understand MapReduce very well; I only have basic parallel computing concepts.
You're actually really confused. In a broad and general sense, the map portion takes the task and divides it among some n nodes. Each of those n nodes receives a fraction of the whole task and does something with its piece. Once they have finished computing some steps on their data, the reduce operation reassembles the results.
The REAL power of map-reduce is how scalable it is.
Given a dataset D running on a map-reduce cluster m with n nodes under it, each node is mapped a 1/D-sized piece of the task. Then the cluster m with its n nodes reduces those pieces into a single element. Now, take a node q to itself be a cluster with p nodes under it. If m assigns q a 1/D piece, q can map that 1/D further into (1/D)/p pieces across its own p nodes. Those nodes then reduce the data back to q, and q can supply its result to its neighbors in m.
Make sense?
In MapReduce, you have a Mapper and a Reducer. You also have a Partitioner and a Combiner.
Hadoop's distributed file system (HDFS) partitions (or splits, you might say) a file into blocks of BLOCK SIZE. These partitioned blocks are placed on different nodes. So, when a job is submitted to the MapReduce framework, it divides that job such that there is a Mapper for every input split (for now, let's say an input split is a partitioned block). Since these blocks are distributed onto different nodes, the Mappers also run on different nodes.
In the Map stage,
The file is divided into records by the RecordReader; the definition of a record is controlled by the InputFormat we choose. Every record is a key-value pair.
The map() of our Mapper is run for every such record. The output of this step is again key-value pairs.
The output of our Mapper is partitioned using the Partitioner we provide, or the default HashPartitioner. Here, by partitioning, I mean deciding which key and its corresponding values go to which Reducer (if there is only one Reducer, it is of no use anyway).
Optionally, you can also combine/minimize the output that is sent to the reducer by using a Combiner. Note that the framework does not guarantee the number of times a Combiner will be called; it is only an optimization.
This is where your algorithm on the data is usually written. Since these tasks run in parallel, the Map stage is a good candidate for computation-intensive work.
After all the Mappers finish running on all nodes, the intermediate data, i.e. the data at the end of the Map stage, is copied to the corresponding reducers.
In the Reduce stage, the reduce() of our Reducer is run on each record of data from the Mappers. Here a record comprises a key and its corresponding values, not necessarily just one value. This is where you generally run your summarization/aggregation logic.
When you write your MapReduce job you usually think about what can be done on each record of data in both the Mapper and Reducer. A MapReduce program can just contain a Mapper with map() implemented and a Reducer with reduce() implemented. This way you can focus more on what you want to do with the data and not bother about parallelizing. You don't have to worry about how the job is split, the framework does that for you. However, you will have to learn about it sooner or later.
I would suggest you go through Apache's MapReduce tutorial or Yahoo's Hadoop tutorial for a good overview. I personally like Yahoo's explanation of Hadoop, but Apache's details are good and their explanation using the word count program is very nice and intuitive.
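For reference, the structure described above boils down to something like the classic word count; the following is only an illustrative skeleton, not the tutorial's exact code:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map() is called once per input record (a line) and emits (word, 1) pairs.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// reduce() is called once per key with all of its values and emits the total.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Because summing partial counts is associative, the same reducer class can also be registered as the Combiner.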
Also, for
I have a task, which can be separated into several partitions. For
each partition, I need to run a computing intensive algorithm.
The Hadoop distributed file system has the data split onto multiple nodes, and the MapReduce framework assigns a task to every split. So, in Hadoop, the processing goes and executes where the data resides. You cannot directly define the number of map tasks to run; the data does. You can, however, specify/control the number of reduce tasks.
I hope I have comprehensively answered your question.

Conceptual questions about map reduce

I've been doing a lot of reading about Map Reduce and I had the following questions that I can't seem to find the answers to:
Everyone points to the word-count example. But why do we need the map-reduce paradigm to count words in a really big corpus? I'm not sure why having one machine read from a really huge stream and maintain the word counts all in memory is worse than having a number of connected machines split the counting task amongst themselves and aggregate it again. Finally, at the end, there will still be one place where all the counts are maintained, right?
Are mapper and reducer machines physically different? Or can the mapping and reducing happen on the same machine?
Suppose my stream is the following three sentences:
a b c
b c d
b c
So, the word-count mapper will generate key-value pairs as:
a 1
b 1
c 1
b 1
c 1
d 1
b 1
c 1
And now it will pass these key value pairs to the next stage, right? I have the following questions:
- Is this next stage the reducer?
- Can a mapper send the first b 1 and second b 1 tuples to different nodes? If yes, do the counts get aggregated in the next phase? If no, why not? Wouldn't that be counterintuitive?
Finally, at the end of a map-reduce job, the final output is all aggregated on a single machine, right? If yes, doesn't this make the entire process computationally too expensive?
Word count is the easiest example to explain, which is why you see it so often. It has become the "Hello World" example for the Hadoop framework.
Yes, Map and Reduce can run on the same machine or on different machines. Reduce starts only after all maps complete.
All occurrences of the same key go to the same reducer.
(So the answer to your question "Can a mapper send the first b 1 and second b 1 tuples to different nodes?" is NO.)
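This is enforced by the partitioner: with the default HashPartitioner, every occurrence of a key hashes to the same reduce task no matter which mapper emitted it. A tiny illustration, assuming 4 reducers:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class PartitionDemo {
    public static void main(String[] args) {
        // Default routing: (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks,
        // so every "b 1" pair from every mapper lands on the same reduce task.
        HashPartitioner<Text, IntWritable> p = new HashPartitioner<Text, IntWritable>();
        int target = p.getPartition(new Text("b"), new IntWritable(1), 4);
        System.out.println("all 'b' pairs go to reducer " + target);
    }
}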
Also, it's not right to say the entire processing is expensive.
The Map-Reduce paradigm can process/solve/analyze problems that were almost impossible to process on a single machine (that is the reason it's called BIG data).
And with MapReduce this is now possible on commodity (read: cheaper) hardware; that is why it is widely accepted.
The Map-Reduce (MR) paradigm was created by Google, and Google uses it for word count (or, in their special case, for building inverted indices, which is conceptually pretty similar). You can use MR for many things (and people try), but it isn't always really useful. In fact, many companies use MR for a special version of word count: when Spotify analyses their logs and reports which songs were listened to how often, it is basically word count, just with TB of logs.
The end result doesn't land on only one machine in Hadoop, but again in HDFS, which is distributed. And then you can perform another MR algorithm on that result, ...
In Hadoop you have different kinds of nodes, but as far as we have tested MR, all nodes were storing data as well as performing Map and Reduce jobs. The reason for performing the Map and Reduce jobs directly on the machines where the data is stored is locality and therefore lower network traffic. You can afterwards combine the reduced results and reduce them again.
For instance when Machine 1 has
a b c
and Machine 2 has
b c d
b c
then Machine 2 would Map and Reduce its data and only send
b 2
c 2
d 1
over the wire. However, Machine 2 actually wouldn't send the data anywhere; this result would rather be saved as a preliminary result in HDFS, where other machines can access it.
This was specific to Hadoop; I think it helps to understand the Map-Reduce paradigm when you also look at other usage scenarios. The NoSQL databases Couchbase and CouchDB use Map-Reduce to create views. This means that you can analyse data and compute sums, min, max, counts, ... These MR jobs run on all the nodes of such a database cluster and the results are stored in the database again, all of this without Hadoop and HDFS.

Data sharing in Hadoop Map Reduce chaining

Is it possible to share a value between a reducer and the following mapper in a chain?
Or is it possible to store the output of the first reducer in memory so that the second mapper can access it from memory?
The problem is:
I have written a chained MapReduce job like Map1 -> Reduce1 -> Map2 -> Reduce2.
Map1 and Map2 read the same input file.
Reduce1 derives a value, say 'X', as its output.
I need 'X' and the input file for Map2.
How can we do this without reading the output file of Reduce1?
Is it possible to store 'X' in memory so that Mapper 2 can access it?
Each job is independent of the others, so without storing the output in an intermediate location it's not possible to share the data across jobs.
FYI, in the MapReduce model the map tasks don't talk to each other, and the same is true for the reduce tasks. Apache Giraph, which runs on Hadoop, uses communication between the mappers of the same job for iterative algorithms that would otherwise require the same job to be run again and again.
I'm not sure about the algorithm being implemented and why MR, but every MR algorithm can also be implemented in BSP. Here is a paper comparing BSP with MR; some algorithms perform better in BSP than in MR. Apache Hama is an implementation of the BSP model, the way Apache Hadoop is an implementation of MR.
If the number of distinct rows produced by Reducer1 is small (say 10,000 (id, price) tuples), two-stage processing is preferred: you can load the results of the first map/reduce into memory in each Map2 mapper and filter the input data there. That way no unneeded data is transferred over the network and all data is processed locally. With the use of combiners the amount of data can be reduced even further.
In the case of a huge number of distinct rows, it looks like you need to read the data twice.
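A minimal, untested sketch of that in-memory approach: Map2's setup() reads Reducer1's output straight from HDFS into a HashMap (the path below is only an example, and the file is assumed to contain tab-separated key/value lines):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class Reduce1Output {

    // Call this from Map2's setup(); every mapper then has 'X' (and anything
    // else Reducer1 emitted) available in memory while scanning the input file.
    public static Map<String, String> load(Configuration conf) throws IOException {
        Map<String, String> values = new HashMap<String, String>();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/output/reduce1/part-r-00000");   // example path
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String[] kv = line.split("\t", 2);
                values.put(kv[0], kv[1]);
            }
        } finally {
            in.close();
        }
        return values;
    }
}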
