Combiners, Reducers and Ecosystem Projects in Hadoop - hadoop

What do you think the answer to Question 4 from this site should be?
Is the given answer right or wrong?
QUESTION: 4
In the standard word count MapReduce algorithm, why might using a combiner reduce the overall job running time?
A. Because combiners perform local aggregation of word counts, thereby allowing the mappers to process input data faster.
B. Because combiners perform local aggregation of word counts, thereby reducing the number of mappers that need to run.
C. Because combiners perform local aggregation of word counts, and then transfer that data to reducers without writing the intermediate data to disk.
D. Because combiners perform local aggregation of word counts, thereby reducing the number of key-value pairs that need to be shuffled across the network to the reducers.
Answer:A
and
QUESTION: 3
What happens in a MapReduce job when you set the number of reducers to one?
A. A single reducer gathers and processes all the output from all the mappers. The output is written to as many separate files as there are mappers.
B. A single reducer gathers and processes all the output from all the mappers. The output is written to a single file in HDFS.
C. Setting the number of reducers to one creates a processing bottleneck, and since the number of reducers as specified by the programmer is used as a reference value only, the MapReduce runtime provides a default setting for the number of reducers.
D. Setting the number of reducers to one is invalid, and an exception is thrown.
Answer:A
From my understanding, the answers to the above questions should be:
Question 4: D
Question 3: B
UPDATE
You have user profile records in your OLTP database that you want to join with weblogs you have already ingested into HDFS. How will you obtain these user records?
Options
A. HDFS commands
B. Pig load
C. Sqoop import
D. Hive
Answer:B
For the updated question, I am torn between B and C.
EDIT
Right Answer: Sqoop.

As far as I understand, both of the given answers are wrong.
I haven't worked much with combiners, but everywhere I've seen them described as operating on the output of the mappers. The answer to Question 4 should be D.
Again, from practical experience, I've found that the number of output files is always equal to the number of reducers, so the answer to Question 3 should be B. This may not be the case when using MultipleOutputs, but that's not common.
Finally, I think Apache won't lie about MapReduce (exceptions do occur :) ). The answers to both questions are available on their wiki page; have a look.
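To make both points concrete, here is a sketch of the standard word-count job in the Java MapReduce API, showing where the combiner plugs in and what a single reducer implies for the output. It mirrors the canonical example and is meant as an illustration, not authoritative proof:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // one (word, 1) pair per token
          }
        }
      }

      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The combiner runs on each mapper's local output, pre-summing counts so that
        // far fewer (word, count) pairs are shuffled across the network -- option D.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        // A single reducer gathers everything and writes one part-r-00000 file in HDFS -- option B.
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }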
By the way, I liked the "100% Pass-Guaranteed or Your Money Back!!!" quote on the link you provided ;-)
EDIT
I'm not sure about the question in the update section, since I have little knowledge of Pig and Sqoop. But the same can certainly be achieved with Hive by creating external tables over the HDFS data and then joining them.
UPDATE
After comments from user milk3422 and the owner, I did some searching and found out that my assumption of Hive being the answer to the last question is wrong, since another OLTP database is involved. The proper answer should be C, as Sqoop is designed to transfer data between HDFS and relational databases.

The answers for questions 4 and 3 seem correct to me. For question 4 it is quite justifiable because, when using a combiner, the map output is kept in a collection and processed first, and the buffer is flushed when full. To justify this I will add this link: http://wiki.apache.org/hadoop/HadoopMapReduce
It clearly states why a combiner speeds up the process.
I also think the answer to question 3 is correct because, in general, that is the basic behaviour followed by default. To justify that I will add another informative link: https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/mapreduce-types

Related

Try to confirm my understanding of HBase and MapReduce behavior

I'm trying to do some processing on my HBase dataset, but I'm pretty new to HBase and the Hadoop ecosystem.
I would like to get some feedback from this community, to see if my understanding of HBase and the MapReduce operation on it is correct.
Some background here:
1. We have an HBase table that is about 1 TB in size and exceeds 100 million records.
2. It has 3 region servers, and each region server contains about 80 regions, making 240 regions in total.
3. The records in the table should be fairly uniformly distributed across the regions, from what I know.
What I'm trying to achieve is to filter rows based on some column values and export those rows to the HDFS filesystem, or something like that.
For example, we have a column named "type" that might contain the value 1, 2 or 3. I would like to have 3 distinct HDFS files (or directories, as data on HDFS is partitioned) that hold the records of type 1, 2 and 3 respectively.
From what I can tell, MapReduce seems like a good approach to attack these kinds of problems.
I've done some research and experiments and can get the result I want, but I'm not sure I understand the behavior of the HBase TableMapper and Scan, and it's crucial for our code's performance, as our dataset is really large.
To simplify the issue, I'll take the official RowCounter implementation as an example, and I would like to confirm my understanding of it is correct.
So my questions about HBase with MapReduce are:
In the simplest form of RowCounter (without any optional arguments), it is actually a full table scan. HBase iterates over all records in the table and emits each row to the map method of RowCounterMapper. Is this correct?
The TableMapper will divide the work based on how many regions we have in the table. For example, if we have only 1 region in our HBase table, there will be only 1 map task, which effectively runs as a single thread and does not utilize any of the parallel processing of our Hadoop cluster?
If the above is correct, is it possible to configure HBase to spawn multiple tasks per region, so that when we run RowCounter on a table with only 1 region it still gets 10 or 20 tasks and counts the rows in parallel?
Since TableMapper also depends on the Scan operation, I would also like to confirm my understanding of the Scan operation and its performance.
If I use setStartRow / setEndRow to limit the scope of my dataset, then because the rowkey is indexed this does not hurt performance, since it does not trigger a full table scan.
In our case, we might need to filter our data based on its modified time, so we might use scan.setTimeRange() to limit the scope of the dataset. My question is: since HBase does not index the timestamp, will this scan become a full table scan, with no advantage over just filtering by timestamp in our MapReduce job itself?
Finally, we have had some discussion on how we should do this export, and we have the following two approaches, but we're not sure which one is better.
Using the MapReduce approach described above. But we are not sure whether the parallelism will be bound by how many regions the table has, i.e. the concurrency never exceeds the region count, and we cannot improve performance unless we increase the number of regions.
Maintaining a rowkey list in a separate place (perhaps on HDFS), using Spark to read the file, and then fetching the records with simple Get operations. All the concurrency occurs on the Spark / Hadoop side.
I would appreciate suggestions from this community about which solution is better. Thanks.
It seems like you have a very small cluster. Scalability also depends on the number of region servers (RS), so merely increasing the number of regions in the table without increasing the number of region servers won't really help speed up the job. I think 80 regions/RS for that table is decent enough.
I am assuming you are going to use TableInputFormat; it works by running one mapper per region and performs server-side filtering on the basis of the scan object. I agree that scanning with TableInputFormat is the optimal approach for exporting a large amount of data from HBase, but scalability and performance are not simply proportional to the number of regions. There are many other factors: the number of region servers, the RAM and disk on each one, and the uniformity of the data distribution are some of them.
In general, I would go with #1, since you just need to prepare a Scan object and HBase will take care of the rest.
#2 is more cumbersome, since you need to maintain the rowkey state outside HBase.
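For what it's worth, here is a hedged sketch of approach #1 as a map-only job over TableInputFormat. The table name ("mytable"), column family/qualifier ("cf:type"), filter value and output path are placeholders, and the exact filter constructor differs slightly between HBase versions:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class HBaseTypeExport {

      // One mapper per region: each mapper only sees the Results that survived the
      // server-side filter, and writes "rowkey <TAB> type" lines to HDFS.
      public static class ExportMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable row, Result result, Context context)
            throws IOException, InterruptedException {
          byte[] type = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("type"));
          context.write(new Text(Bytes.toString(row.get())),
                        new Text(type == null ? "" : Bytes.toString(type)));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "export-type-1");
        job.setJarByClass(HBaseTypeExport.class);

        Scan scan = new Scan();
        scan.setCaching(500);          // fetch more rows per RPC for a long scan
        scan.setCacheBlocks(false);    // don't pollute the block cache from an MR scan
        // Server-side filter: only rows whose cf:type equals "1" reach the mapper.
        scan.setFilter(new SingleColumnValueFilter(
            Bytes.toBytes("cf"), Bytes.toBytes("type"),
            CompareFilter.CompareOp.EQUAL, Bytes.toBytes("1")));
        // Optional narrowing discussed in the question; a time range is checked per
        // cell but is not indexed, so the regions are still walked end to end.
        // scan.setTimeRange(startMillis, endMillis);

        TableMapReduceUtil.initTableMapperJob(
            "mytable", scan, ExportMapper.class, Text.class, Text.class, job);
        job.setNumReduceTasks(0);      // map-only: no shuffle, output goes straight to HDFS
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/export/type=1"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }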

How mapper and reducer tasks are assigned

When executing an MR job, Hadoop divides the input data into N splits and then starts N corresponding map tasks to process them separately.
1. How is the data divided (split into different InputSplits)?
2. How is a split scheduled (i.e., how is it decided which TaskTracker machine should run the map task that handles a given split)?
3. How is the divided data read?
4. How are reduce tasks assigned?
In Hadoop 1.x
In Hadoop 2.x
These questions are related, so I asked them together; you can answer whichever part you know best.
Thanks in advance.
Data is stored and read in HDFS blocks of a predefined size, and it is read by various RecordReader types that use byte scanners and know how many bytes to read in order to determine when an InputSplit needs to be returned.
A good exercise to understand it better is to implement your own RecordReader and create small and large files of one small record, one large record, and many records. In the many records case, you try to split a record across two blocks, but that test case should be the same as one large record over two blocks.
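To make that exercise concrete, here is a minimal skeleton of a custom RecordReader. It simply delegates to the built-in LineRecordReader (the class name MyRecordReader is made up), but the overridden methods are where a real implementation decides how many bytes belong to the current record and when the split is exhausted:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    public class MyRecordReader extends RecordReader<LongWritable, Text> {
      private final LineRecordReader delegate = new LineRecordReader();

      @Override
      public void initialize(InputSplit split, TaskAttemptContext context)
          throws IOException, InterruptedException {
        // The split only carries (path, offset, length, hosts); the bytes are read
        // here, and a reader may read past the split end to finish the last record.
        delegate.initialize(split, context);
      }

      @Override
      public boolean nextKeyValue() throws IOException, InterruptedException {
        return delegate.nextKeyValue();   // false once this split's records are consumed
      }

      @Override
      public LongWritable getCurrentKey() {
        return delegate.getCurrentKey();  // byte offset of the current line
      }

      @Override
      public Text getCurrentValue() {
        return delegate.getCurrentValue();
      }

      @Override
      public float getProgress() throws IOException {
        return delegate.getProgress();
      }

      @Override
      public void close() throws IOException {
        delegate.close();
      }
    }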
The number of reduce tasks can be set by the client submitting the MapReduce job.
As of Hadoop 2 + YARN, that image is outdated.

What is the benefit of the reducers in Hadoop?

I don't see the value of reducers in Hadoop in the following scenarios:
The map tasks generate unique keys (because we can merge the map and reduce functionality together)
The output size of the Map Tasks is too big (This will exhaust the memory if we wait for the reducers to begin the work)
If we have any functionality that doesn't need grouping and sorting of the keys
Please correct me if I am wrong.
And if someone could give me a real example of the benefits of reducers and when they should be used, I would appreciate it.
A reducer is beneficial (or required) when you need to do operations like aggregation or grouping.
FYI: the reducer is meant for grouping the different values for a key that come from different mappers. So for a use case that does not require grouping or aggregation, there is no point in using a reducer (you can set the number of reducers to zero, meaning a map-only job).
One quick use case I can think of: you want to randomly split a big file into multiple part files. In this case you supply the big file (let's say 100 GB) to a map-only job. Each map reads a chunk of the file and writes it out as a part file.
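A hedged sketch of that map-only split job (the input and output paths are placeholders; the identity Mapper just passes records through):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class MapOnlySplit {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only split");
        job.setJarByClass(MapOnlySplit.class);
        job.setMapperClass(Mapper.class);   // identity mapper: record in, record out
        job.setNumReduceTasks(0);           // zero reducers => no shuffle, no sort
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // Note: with TextInputFormat the keys are byte offsets, which the identity
        // mapper passes through; a real splitter would emit NullWritable keys instead.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/data/bigfile"));         // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/data/bigfile-parts")); // placeholder
        // Each map task writes its own part-m-NNNNN file in the output directory.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }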

Does the Hadoop job submitter take record boundaries into account while calculating splits? [duplicate]

This question already has answers here:
How does Hadoop process records split across block boundaries?
(6 answers)
Closed 8 years ago.
This question is NOT a duplicate of:
How does Hadoop process records split across block boundaries?
I have one question regarding the input split calculation. As per the Hadoop guide:
1) the InputSplits respect record boundaries
2) At the same time, it says that splits are calculated by the job submitter, which I assume runs on the client side. [Anatomy of a MapReduce Job Run - Classic MRv1]
Does this mean that:
(a) the job submitter reads blocks to calculate input splits? If this is the case, won't it be very inefficient and defeat the very purpose of Hadoop?
Or
(b) the job submitter just calculates splits that are merely an estimate based upon block sizes and locations, and then it becomes the InputFormat's and RecordReader's responsibility, running under the mapper, to fetch records that cross the block (host) boundary?
Thanks
(a) the job submitter reads blocks to calculate input splits? If this is the case, won't it be very inefficient and defeat the very purpose of Hadoop?
I don't think so. The job submitter should only read block information from the NameNode and then merely do the calculation, which should not use much computing resource.
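As an illustration, this small snippet uses the public FileSystem API to show that split calculation only needs block metadata from the NameNode, not the file contents (the path is a placeholder):

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/input.txt"));  // placeholder path
        // Metadata-only call answered by the NameNode: offsets, lengths and hosts.
        // This is roughly the information the submitter combines with the min/max
        // split sizes; no file data is transferred to the client.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
          System.out.println(b.getOffset() + "+" + b.getLength()
              + " on " + Arrays.toString(b.getHosts()));
        }
      }
    }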
(b) the job submitter just calculates splits that are merely an estimate based upon block sizes and locations, and then it becomes the InputFormat's and RecordReader's responsibility, running under the mapper, to fetch records that cross the block (host) boundary?
I am not sure how accurate the submitter's calculation is, but the split size is calculated based on the configured minimum and maximum split size as well as the block size, using this equation:
max(minimumSplitSize, min(maximumSplitSize, blockSize))
All of these values can be set by users. For example, the minimum split size can be 1 and the maximum can be the largest long value (9223372036854775807).
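Written out as code, that rule looks like the sketch below. It is a restatement of the equation above, not Hadoop's internal source; the corresponding configuration knobs are mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize:

    public final class SplitSize {
      static long computeSplitSize(long minimumSplitSize, long maximumSplitSize, long blockSize) {
        return Math.max(minimumSplitSize, Math.min(maximumSplitSize, blockSize));
      }

      public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;  // a 128 MB HDFS block
        // Defaults (minsize = 1, maxsize = Long.MAX_VALUE) give split size == block size.
        System.out.println(computeSplitSize(1L, Long.MAX_VALUE, blockSize));
        // Raising the minimum above the block size forces larger (and fewer) splits.
        System.out.println(computeSplitSize(256L * 1024 * 1024, Long.MAX_VALUE, blockSize));
      }
    }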
Correct - records in an InputFormat are a logical concept. This means that, as developers writing map-reduce code, we don't need to handle the case where a record is split across two different input splits. The record reader is in charge of reading the missing part via a remote read. This may cause some overhead, but it is usually slight.

Query related to Hadoop's map-reduce

Scenario:
I have a subset of a database and a data warehouse, and I have brought both of them onto HDFS.
I want to analyse results based on the subset and the data warehouse.
(In short, for each record in the subset I have to scan each and every record in the data warehouse.)
Question:
I want to do this task using the MapReduce paradigm, but I don't understand how to take both files as input to the mapper, or how to handle both files in the map phase of MapReduce.
Please suggest some ideas so that I'm able to do this.
Check Section 3.5 (Relational Joins) in Data-Intensive Text Processing with MapReduce for map-side joins, reduce-side joins and memory-backed joins. In any case, the MultipleInputs class is used to have multiple mappers process different files in a single job.
FYI, you could use Apache Sqoop to import the DB into HDFS.
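For illustration, here is a hedged sketch of a reduce-side join driven by MultipleInputs; the paths, the tag letters and the assumption that the join key is the first tab-separated field are all invented for the example:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ReduceSideJoin {

      // Mapper for the database subset: assume the key is the first tab-separated field.
      public static class SubsetMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          String[] fields = value.toString().split("\t", 2);
          ctx.write(new Text(fields[0]), new Text("S\t" + value));   // tag 'S' = subset
        }
      }

      // Mapper for the warehouse records, same key position assumed.
      public static class WarehouseMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          String[] fields = value.toString().split("\t", 2);
          ctx.write(new Text(fields[0]), new Text("W\t" + value));   // tag 'W' = warehouse
        }
      }

      // All records sharing a key meet here, regardless of which file they came from.
      public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
          for (Text v : values) {
            ctx.write(key, v);   // real code would buffer one side and combine the two
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reduce-side join");
        job.setJarByClass(ReduceSideJoin.class);
        MultipleInputs.addInputPath(job, new Path("/input/subset"),
            TextInputFormat.class, SubsetMapper.class);
        MultipleInputs.addInputPath(job, new Path("/input/warehouse"),
            TextInputFormat.class, WarehouseMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/output/joined"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }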
Some time ago I wrote a Hadoop MapReduce job for one of my classes. I was scanning several IMDb databases and producing merged information about actors (basically, the name, the biography and the films an actor appeared in were in different databases). I think you can use the same approach I used for my homework:
I wrote a separate MapReduce job turning every database file into the same format, placing a two-letter prefix in front of every row it produced so I could tell them apart: 'BI' (biography), 'MV' (movies) and so on. Then I used all of these produced files as input for my last MapReduce job, which processed them and grouped them in the desired way.
I am not even sure you need that much work if you are really going to scan every line of the data warehouse. Maybe in that case you can just do the scan either in the map or the reduce phase (depending on what additional processing you want to do), but my suggestion assumes that you actually need to filter the data warehouse based on the subset. If the latter, my suggestion might work for you.
