I've been doing a lot of reading about MapReduce and I have the following questions that I can't seem to find the answers to:
Everyone points to the word-count example. But why do we need the MapReduce paradigm to count words in a really big corpus? I'm not sure why having one machine read from a really huge stream and maintain all the word counts in memory is worse than having a number of connected machines split the counting task amongst themselves and aggregate it again. Finally, at the end, there will still be one place where all the counts are maintained, right?
Are mapper and reducer machines physically different? Or can the mapping and reducing happen on the same machine?
Suppose my stream is the following three sentences:
a b c
b c d
b c
So, the word-count mapper will generate key-value pairs as:
a 1
b 1
c 1
b 1
c 1
d 1
b 1
c 1
And now it will pass these key-value pairs to the next stage, right? I have the following questions:
- Is this next stage the reducer?
- Can a mapper send the first b 1 and second b 1 tuples to different nodes? If yes, then do the counts get aggregated in the next phase? If no, then why not? Wouldn't that be counterintuitive?
Finally, at the end of a MapReduce job, the final output is all aggregated at a single machine, right? If so, doesn't this make the entire process computationally too expensive?
Word count is the easiest example to explain, which is why you see it so often. It has become the "Hello World" example for the Hadoop framework.
Yes, map and reduce tasks can run on the same machine or on different machines. The reduce phase starts only after all map tasks complete.
All occurrences of the same key go to the same reducer.
(So the answer to your question "Can a mapper send the first b 1 and second b 1 tuples to different nodes?" is no.)
Also, it's not right to say the entire process is expensive.
The MapReduce paradigm can process/solve/analyze problems that were almost impossible to process on a single machine (the reason it's called BIG data).
And with MapReduce this is now possible on commodity (read: cheaper) hardware; that is why it is so widely accepted.
The Map-Reduce (MR) paradigm was created by Google, and Google's own use case is essentially word count (or, in their special case, building inverted indices, but that is pretty similar conceptually). You can use MR for many other things (and people try to), but it often isn't really useful for them. In fact, many companies use MR for a specialized version of word count: when Spotify analyses its logs and reports which songs were listened to how often, it is basically word count, just with terabytes of logs.
The end result doesn't land on only one machine in Hadoop, but again in HDFS, which is distributed. And then you can run another MR algorithm on that result, and so on.
In Hadoop you have different kinds of nodes, but as far as we have tested MR, all nodes were storing data as well as running map and reduce tasks. The reason for running the map and reduce tasks directly on the machine where the data is stored is data locality, and therefore lower network traffic. You can afterwards combine the reduced results and reduce them again.
For instance when Machine 1 has
a b c
and Machine 2 has
b c d
b c
Then Machine 2 would map and reduce its data and only send
b 2
c 2
d 1
over the wire. However, Machine 2 actually wouldn't send the data anywhere; this result would rather be saved as a preliminary result in HDFS, where other machines can access it.
That was specific to Hadoop; I think it helps to understand the Map-Reduce paradigm when you also look at other usage scenarios. The NoSQL databases Couchbase and CouchDB use Map-Reduce to create views. This means you can analyse data and compute sums, min, max, counts, and so on. These MR jobs run on all the nodes of such a database cluster, the results are stored in the database again, and all of this works without Hadoop and HDFS.
Related
I am curious whether Spark first reads the entire file into memory and only then starts processing it (applying transformations and actions), or whether it reads the first chunk of a file, applies the transformations to it, reads the second chunk, and so on.
Is there any difference between Spark and Hadoop in this respect? I read that Spark keeps the entire file in memory most of the time, while Hadoop does not. But what about the initial step, when we read the file for the first time and map the keys?
Thanks
I think a fair characterisation would be this:
Both Hadoop (or more accurately MapReduce) and Spark use the same underlying filesystem HDFS to begin with.
During the Mapping phase both will read all data and actually write the map result to disk so that it can be sorted and distributed between nodes via the Shuffle logic.
Both of them do in fact try and cache the data just mapped in memory in addition to spilling it to disk for the Shuffle to do its work.
The difference here though is that Spark is a lot more efficient in this process, trying to optimally align the node chosen for a specific computation with the data already cached on a certain node.
Since Spark also does something called lazy evaluation, its memory use is very different from Hadoop's, as a result of planning computation and caching simultaneously.
In the steps of a word-count job, Hadoop does this:
Map all the words to 1.
Write all those mapped pairs of (word, 1) to a single file in HDFS (single file could still span multiple nodes on the distributed HDFS) (this is the shuffle phase)
Sort the rows of (word, 1) in that shared file (this is the sorting phase)
Have the reducers read sections (partitions) from that shared file that now contains all the words sorted and sum up all those 1s for every word.
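(For reference, here is a minimal sketch of steps 1 and 4 as a Hadoop Mapper and Reducer, using the standard org.apache.hadoop.mapreduce API; the class names are just illustrative.)

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Step 1: map every word to 1.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit (word, 1)
        }
    }
}

// Step 4: after the shuffle/sort, sum up all the 1s for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);     // emit (word, total count)
    }
}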
Spark on the other hand will go the other way around:
It figures that, like in Hadoop, it is probably most efficient to have all those words summed up via separate reducer runs, so it decides, according to some factors, that it wants to split the job into x parts and then merge them into the final result.
So it knows that words will have to be sorted which will require at least part of them in memory at a given time.
After that it evaluates that such a sorted list will require all words mapped to (word, 1) pairs to start the calculation.
It works through step 3, then step 2, then step 1.
Now the trick relative to Hadoop is that in step 3 it knows which in-memory cached items it will need in step 2, and in step 2 it already knows how these parts (mostly K-V pairs) will be needed in the final step 1.
This allows Spark to plan the execution of jobs very efficiently by caching data it knows will be needed in later stages of the job. Hadoop, working from the beginning (mapping) to the end without explicitly looking ahead into the following stages, simply cannot use memory this efficiently, and hence doesn't waste resources keeping the large chunks in memory that Spark would keep. Unlike Spark, it just doesn't know whether all the pairs produced in a map phase will be needed in the next step.
The fact that Spark appears to keep the whole dataset in memory is hence not something Spark actively does, but rather a result of the way Spark is able to plan the execution of a job.
On the other hand, Spark may actually be able to keep fewer things in memory in a different kind of job. Counting the number of distinct words is a good example here, in my opinion.
Here Spark would plan ahead and immediately drop a repeated word from the cache/memory when encountering it during the mapping, while Hadoop would go ahead and waste memory on shuffling the repeated words too (I acknowledge there are a million ways to make Hadoop do this as well, but it's not out of the box; likewise, there are ways of writing your Spark job in unfortunate ways that break these optimisations, but it's not so easy to fool Spark here :)).
I hope this helps you understand that the memory use is just a natural consequence of the way Spark works, not something actively aimed at and not something strictly required by Spark. It is also perfectly capable of repeatedly spilling data back to disk between steps of the execution when memory becomes an issue.
For more insight into this I recommend learning about the DAG scheduler in Spark from here to see how this is actually done in code.
You'll see that it always follows the pattern of working out where what data is and will be cached before figuring out what to calculate where.
Spark uses lazy iterators to process data and can spill data to disk if necessary. It doesn't read all data in memory.
The difference compared to Hadoop is that Spark can chain multiple operations together.
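To illustrate that chaining, here is a minimal word-count sketch using the Spark 2.x Java RDD API (the paths and app name are made up). All the transformations are lazy; nothing is read or computed until the final action is called, and Spark plans the whole chain as one job:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // textFile, flatMap, mapToPair and reduceByKey are all lazy transformations
        JavaRDD<String> lines = sc.textFile("hdfs:///input/corpus.txt");
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<String, Integer>(word, 1))
                .reduceByKey((a, b) -> a + b);

        // The action triggers the whole chained plan; data is streamed through the
        // pipeline and spilled to disk if it does not fit in memory
        counts.saveAsTextFile("hdfs:///output/wordcounts");

        sc.stop();
    }
}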
What do you think the answer to Question 4 mentioned on this site should be?
Is the given answer right or wrong?
QUESTION: 4
In the standard word-count MapReduce algorithm, why might using a combiner reduce the overall job running time?
A. Because combiners perform local aggregation of word counts, thereby allowing the mappers to process input data faster.
B. Because combiners perform local aggregation of word counts, thereby reducing the number of mappers that need to run.
C. Because combiners perform local aggregation of word counts, and then transfer that data to reducers without writing the intermediate data to disk.
D. Because combiners perform local aggregation of word counts, thereby reducing the number of key-value pairs that need to be shuffled across the network to the reducers.
Answer: A
and
QUESTION: 3
What happens in a MapReduce job when you set the number of reducers to one?
A. A single reducer gathers and processes all the output from all the mappers. The output is written in as many separate files as there are mappers.
B. A single reducer gathers and processes all the output from all the mappers. The output is written to a single file in HDFS.
C. Setting the number of reducers to one creates a processing bottleneck, and since the number of reducers as specified by the programmer is used as a reference value only, the MapReduce runtime provides a default setting for the number of reducers.
D. Setting the number of reducers to one is invalid, and an exception is thrown.
Answer: A
From my understanding, the answers to the above questions are:
Question 4: D
Question 3: B
UPDATE
You have user profile records in your OLTP database that you want to join with weblogs you have already ingested into HDFS. How will you obtain these user records?
Options
A. HDFS commands
B. Pig load
C. Sqoop import
D. Hive
Answer: B
and for the updated question I am doubtful between B and C
EDIT
Right Answer: Sqoop.
As far as I understand, both of the given answers are wrong.
I haven't worked much with the Combiner, but everywhere I have seen it described as working on the outputs of the Mapper. The answer to Question 4 should be D.
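To make that concrete, here is a minimal sketch of how a combiner is typically wired into the word-count driver (assuming the TokenizerMapper and IntSumReducer classes from the standard word-count example). The combiner runs on each mapper's local output and collapses duplicate keys, so far fewer (word, 1) pairs have to be shuffled across the network to the reducers, which is what option D describes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

// In the job driver: reuse the reducer class as the combiner for local aggregation.
Job job = Job.getInstance(new Configuration(), "word count");
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);   // local aggregation of map output
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);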
Again, from practical experience, I've found that the number of output files is always equal to the number of Reducers, so the answer to Question 3 should be B. This may not be the case when using MultipleOutputs, but that's not common.
Finally, I think Apache won't lie about MapReduce (exceptions do occur :)). The answers to both questions are available on their wiki page; have a look.
By the way, I liked the "100% Pass-Guaranteed or Your Money Back!!!" quote on the link you provided ;-)
EDIT
I'm not sure about the question in the update section since I have little knowledge of Pig & Sqoop. But certainly the same can be achieved using Hive, by creating external tables on the HDFS data and then joining.
UPDATE
After comments from user milk3422 & the owner, I did some searching and found out that my assumption of Hive being the answer to the last question is wrong, since another OLTP database is involved. The proper answer should be C, as Sqoop is designed to transfer data between HDFS and relational databases.
The answers for questions 4 and 3 seem correct to me. For question 4 it's quite justifiable, because when a combiner is used the map output is kept in a collection and processed locally first, and the buffer is flushed when full. To justify this I will add this link: http://wiki.apache.org/hadoop/HadoopMapReduce
It clearly states there why the combiner speeds up the process.
I also think the answer to question 3 is correct, because in general that's the basic configuration followed by default. To justify that I will add another informative link: https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/mapreduce-types
I have a general question about the Map/Reduce framework.
I have a task, which can be separated into several partitions. For each partition, I need to run a computation intensive algorithm.
Then, according to the MAP/Reduce Framework, it seems that I have two choices:
Run the algorithm in the Map stage, so that in the Reduce stage there is no work needed to be done except collecting the results of each partition from the Map stage and doing the summarization
In the Map stage, just divide and send the partitions (with data) to the Reduce stage. In the Reduce stage, run the algorithm first, and then collect and summarize the results from each partition.
Correct me if I misunderstand.
I am a beginner, so I may not understand Map/Reduce very well; I only have basic parallel computing concepts.
You're actually really confused. In a broad and general sense, the map portion takes the task and divides it among some n nodes. Each of those n nodes receives a fraction of the whole task and does something with its piece. When they have finished computing their steps on the data, the reduce operation reassembles the results.
The REAL power of map-reduce is how scalable it is.
Given a dataset D processed on a map-reduce cluster m with n nodes under it, each node is mapped a 1/n share of D. Then the cluster m with its n nodes reduces those pieces into a single result. Now, take one of those nodes, q, to itself be a cluster with p nodes under it. If m assigns q its share of D, q can in turn map that share into p smaller pieces across its own nodes. Then q's nodes can reduce the data back to q, and q can supply its result to its neighbours in m.
Make sense?
In MapReduce, you have a Mapper and a Reducer. You also have a Partitioner and a Combiner.
Hadoop includes a distributed file system (HDFS) that partitions (or splits, you might say) a file into blocks of BLOCK SIZE. These partitioned blocks are placed on different nodes. So, when a job is submitted to the MapReduce framework, it divides that job such that there is a Mapper for every input split (for now, let's say an input split is a partitioned block). Since these blocks are distributed onto different nodes, these Mappers also run on different nodes.
In the Map stage,
The file is divided into records by the RecordReader; the definition of a record is controlled by the InputFormat that we choose. Every record is a key-value pair.
The map() of our Mapper is run for every such record. The output of this step is again a set of key-value pairs.
The output of our Mapper is partitioned using the Partitioner that we provide, or the default HashPartitioner. Here, by partitioning, I mean deciding which key and its corresponding values go to which Reducer (if there is only one Reducer, it's of no use anyway); see the sketch after this list.
Optionally, you can also combine/minimize the output that is being sent to the reducer. You can use a Combiner to do that. Note that the framework does not guarantee the number of times a Combiner will be called; it is only an optimization.
This is where your algorithm on the data is usually written. Since these tasks run in parallel, the Map stage is a good candidate for computation-intensive work.
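As promised above, this is roughly what the partitioning step looks like if you wrote it yourself. This is a hypothetical custom Partitioner for (word, count) pairs; the built-in HashPartitioner does essentially the same thing:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which Reducer each (word, count) pair is sent to. Because the decision
// depends only on the key, all values for the same word end up on the same Reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}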
After all the Mappers finish running on all nodes, the intermediate data (i.e., the data at the end of the Map stage) is copied to its corresponding reducers.
In the Reduce stage, the reduce() of our Reducer is run on each record of data from the Mappers. Here a record comprises a key and its corresponding values, not necessarily just one value. This is where you generally run your summarization/aggregation logic.
When you write your MapReduce job you usually think about what can be done on each record of data in both the Mapper and Reducer. A MapReduce program can just contain a Mapper with map() implemented and a Reducer with reduce() implemented. This way you can focus more on what you want to do with the data and not bother about parallelizing. You don't have to worry about how the job is split, the framework does that for you. However, you will have to learn about it sooner or later.
I would suggest you go through Apache's MapReduce tutorial or Yahoo's Hadoop tutorial for a good overview. I personally like Yahoo's explanation of Hadoop, but Apache's details are good and their explanation using the word-count program is very nice and intuitive.
Also, for
I have a task, which can be separated into several partitions. For
each partition, I need to run a computing intensive algorithm.
The Hadoop distributed file system has the data split onto multiple nodes, and the MapReduce framework assigns a map task to every split. So, in Hadoop, the processing goes and executes where the data resides. You cannot define the number of map tasks to run; the data does. You can, however, specify/control the number of reduce tasks, as in the sketch below.
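A minimal sketch of that part of the job driver (the class and job names are made up; the Mapper/Reducer setup is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "my job");
        // The number of map tasks follows from the input splits and is not set here.
        // The number of reduce tasks, however, can be chosen explicitly:
        job.setNumReduceTasks(4);
        // ... set the Mapper/Reducer classes and input/output paths, then submit the job.
    }
}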
I hope I have comprehensively answered your question.
I'm using Hadoop to compute co-occurrence similarity between words. I have a file that consists of co-occurring word pairs that looks like:
a b
a c
b c
b d
I'm using a Graph based approach that treats words as nodes and co-occurring words have an edge between them. My algorithm needs to compute the degree of all nodes. I've successfully written a Map-Reduce job to compute the total degree which outputs the following:
a 2
b 3
c 2
d 1
Currently, the output is written back to a file, but what I want instead is to capture the result into, say, a java.util.HashMap. I then want to use this HashMap in another Reduce job to compute the final similarity.
Here are my questions:
Is it possible to capture the results of a reduce job in memory (List, Map)? If so, how?
Is this the best approach? If not, how should I deal with this?
There are two possibilities: either you read the data in your map/reduce task from the distributed file system, or you add it directly to the distributed cache. I just googled the distributed cache size, and it can be controlled:
"The local.cache.size parameter controls the size of the
DistributedCache. By default, it’s set to 10 GB."
Link to cloudera blog
So if you add the output of your first job to the distributed cache of the second you should be fine I think. Tens of thousands of entries are nowhere near the gigabyte range.
Adding a file to the distributed cache goes as follows:
To read it in your mapper:
// Local paths of the files that were added to the DistributedCache
Path[] uris = DistributedCache.getLocalCacheFiles(context.getConfiguration());
String patternsFile = uris[0].toString();
BufferedReader in = new BufferedReader(new FileReader(patternsFile));
To add it to the DistributedCache:
// file is the (HDFS) path of the data you want to distribute, e.g. the first job's output
DistributedCache.addCacheFile(new URI(file), job.getConfiguration());
while setting up your second job.
Let me know if this does the trick.
I have a data set that has approximately 1 billion data points. There are about 46 million unique data points I want to extract from this.
I want to use Hadoop to extract the unique values, but keep getting "Out of Memory" and Java heap size errors on Hadoop - at the same time, I am able to run this fairly easily on a single box using a Python Set (hashtable, if you will.)
I am using a fairly simple algorithm to extract these unique values: I am parsing the 1 billion lines in my map and outputting lines that look like this:
UniqValueCount:I a
UniqValueCount:I a
UniqValueCount:I b
UniqValueCount:I c
UniqValueCount:I c
UniqValueCount:I d
and then running the "aggregate" reducer to get the results, which should look like this for the above data set:
I 4
This works well for a small set of values, but when I run this for the 1 billion data points (which have 46 million keys, as I mentioned) the job fails.
I'm running this on Amazon's Elastic Map Reduce, and even if I use six m2.4xlarge nodes (their maximum memory nodes at 68.4 GB each) the job fails with the "out of memory" errors.
But I am able to extract the unique values using Python code with a Set data structure (hash table) on one single m1.large (a much smaller box with 8 GB of memory). I am confused that the Hadoop job fails, since 46 million uniques should not take up that much memory.
What could be going wrong? Am I using the UniqValueCount wrong?
You're probably getting the memory error in the shuffle; remember that Hadoop sorts the keys before starting the reducers. The sort itself is not necessary for most apps, but Hadoop uses it as a way to aggregate all the values belonging to a key.
For your example, your mappers will end up writing a lot of times the same values, while you only care about how many uniques you have for a given key. Here is what you're doing right now:
Mapper output:
I -> a
I -> a
I -> a
I -> a
I -> b
I -> a
I -> b
Reducer input:
I -> [a, a, a, a, b, a, b]
Reducer output:
I -> 2
But you really don't need to write a five times or b twice in this case; sending each once would be enough, since you only care about uniques. So instead of counting the uniques in the reducer, you could directly remove a lot of overhead by making sure you only send each value once:
Mapper output:
I -> a
I -> b
Reducer input:
I -> [a, b]
Reducer output:
I -> 2
This would effectively reduce the network traffic, and the shuffle will be much simpler since there will be fewer records to sort.
You could do this in two ways:
Add a combiner in your job that will run just after the mapper but before the reducer, and will only keep uniques before sending to the reducer.
Modify your mapper to keep track of what it has already sent, and not send a value it has already emitted before (a sketch of this follows below).
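Here is a minimal sketch of the second option (the class name is made up, and it assumes the set of distinct values seen by one map task fits in that task's memory; for simplicity it emits plain (key, value) pairs rather than the aggregate package's prefixed format):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UniqueValueMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Values this map task has already emitted; duplicates are skipped so they
    // never enter the shuffle.
    private final Set<String> seen = new HashSet<>();
    private final Text outKey = new Text("I");
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String v = value.toString();
        if (seen.add(v)) {               // true only the first time this value is seen
            outValue.set(v);
            context.write(outKey, outValue);
        }
    }
}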