hadoop - definitive guide - why is a block in hdfs so large - hadoop

I came across the following paragraph in the Definitive Guide (HDFS Concepts - Blocks) and could not understand it.
Map tasks in MapReduce normally operate on one block at a time, so if you have too few tasks (fewer than nodes in the cluster), your jobs will run slower than they could otherwise.
I am wondering how jobs would be slower when there are fewer tasks than nodes in the cluster. Say there are 1000 nodes in the cluster and 3 tasks (by tasks I mean blocks, since each block is sent to a node for a single task). Won't the time it takes to get the result always be less than in a scenario with, say, 1000 nodes and 1000 tasks?
I am not convinced by the paragraph in the Definitive Guide.

The paragraph you quoted from the book basically says "utilize as many nodes as you can." The comparison is for the same total input: split into 3 huge blocks, each of the 3 tasks has to process a third of the data, whereas split into 1000 blocks, each task processes only a thousandth of it. If you have 1000 nodes and only 3 blocks (and therefore 3 tasks), only 3 nodes work on your job and the other 997 nodes do nothing for it. If you have 1000 nodes and 1000 tasks, and each of these 1000 nodes holds some part of your data, all 1000 nodes will be utilized for your job. You also take advantage of data locality, since each node will first work on its local data.
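
To make the comparison concrete, here is a minimal sketch. It assumes equally sized tasks, one task per node at a time, no scheduling overhead, and the same total input in both scenarios. The function name and the figure of 3000 work units are made up for illustration.

```python
import math

def schedule(num_nodes, num_tasks, total_work_units):
    """Return (busy_nodes, waves, relative_runtime) for equally sized tasks."""
    # Nodes without a task stay idle, so at most num_tasks nodes do any work.
    busy_nodes = min(num_nodes, num_tasks)
    # If there are more tasks than nodes, they run in several rounds ("waves").
    waves = math.ceil(num_tasks / num_nodes)
    # The same input is divided among the tasks, so fewer tasks means bigger tasks.
    work_per_task = total_work_units / num_tasks
    return busy_nodes, waves, waves * work_per_task

print(schedule(1000, 3, 3000))     # (3, 1, 1000.0)
print(schedule(1000, 1000, 3000))  # (1000, 1, 3.0)
```

With 3 tasks, each one carries a third of the input and 997 nodes sit idle. With 1000 tasks, every node gets a small piece and the whole job finishes much sooner in this simplified model.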

Related

How to work out how many mappers are needed for a MapReduce job

Below I have a question that gives us this information.
Suppose the program presented in 2a) will be executed on a dataset of 200 million
recorded inspections, collecting 2000 days of data. In total there are 1,000,000 unique
establishments. The total input size is 1 Terabyte. The cluster has 100 worker nodes
(all of them idle), and HDFS is configured with a block size of 128MB.
Using that information, provide a reasoned answer to the following questions. State
any assumptions you feel necessary when presenting your answer.
Here, I'm asked to answer these questions.
1) How many worker nodes will be involved during the execution of the Map and Reduce
tasks of the job?
2) How many times does the map method run on each physical worker?
3) How many input splits are processed at each node?
4) How many times will the reduce method be invoked at each reducer?
Can someone verify my answers are correct?
Q1) I'm basically working out how many mappers I need. My working is 1TB (input size) divided by the block size (128MB).
1TB / 128MB = 7812.5. Since 7812.5 mappers are needed and we only have 100 worker nodes, all 100 nodes are going to be used, correct?
Q2) From Q1 I figured out 7812.5 mappers are needed, so each map method will be run 7812.5 (round up to 7813) times on each physical worker.
Q3) The input splits are the same as the number of mappers, so there will be 7813 splits.
Q4) Since I'm told there are 1,000,000 unique values and the default number of reducers is 2, the reduce method will run 500,000 times on each reducer.
Can someone go through my reasoning and see if I'm correct? Thanks
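
One rough way to check the arithmetic is the sketch below. It rests on several assumptions: one input split per HDFS block, map() called once per input record, reduce() called once per distinct key (taking the 1,000,000 establishments as the keys, which depends on program 2a and is not shown here), and work spread evenly over nodes and reducers. The reducer count of 2 is the figure from the question, not something verified here. The function name is made up for illustration.

```python
import math

# Decimal convention used in the question (1 TB = 1,000,000 MB).
# With binary units (1 TiB = 1,048,576 MiB) the split count would be 8192.
MB_PER_TB = 1_000_000

def job_estimates(input_mb, block_mb, nodes, records, unique_keys, reducers):
    """Back-of-the-envelope counts for a MapReduce job under even distribution."""
    splits = math.ceil(input_mb / block_mb)  # one split (map task) per block
    return {
        "input_splits": splits,
        "nodes_used_for_map": min(nodes, splits),
        "splits_per_node": splits / nodes,
        "map_calls_per_node": records / nodes,       # map() runs once per record
        "reduce_calls_per_reducer": unique_keys / reducers,  # reduce() runs once per key
    }

est = job_estimates(1 * MB_PER_TB, 128, 100, 200_000_000, 1_000_000, reducers=2)
for name, value in est.items():
    print(name, value)
```

Under these assumptions there are 7813 splits in total, which is about 78 per node rather than 7813 per node, and the map method runs roughly 2,000,000 times on each worker, because it is invoked per record and not per split.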

Why do some worker nodes use more system CPU while running a Spark application?

I have 1 master node and 4 worker nodes. I set up the cluster using Ambari, and all monitoring metrics are collected from its dashboard. Spark runs on top of Hadoop, so I have YARN and HDFS. I ran a very simple word count script and found that one of the worker nodes did most of the work. The word count job is divided into 149 tasks, and 98 of them were done by one node.
Here is my code for counting words
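val file = sc.textFile("/data/2gdata.txt") // read the file from HDFS
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _) // split into words and sum the counts per word
counts.collect() // run the job and return the results to the driver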
[Figure: event timeline and CPU usage for each worker node]
[Figure: Aggregated Metrics by Executor]
Each task has the same input size. I assumed each would take a similar time, around 30 seconds, to count the words in its piece of the input file, but some tasks took more than 10 minutes.
I noticed that the nodes doing less work spent more CPU on system operations, shown as the blue area in the first graph. The worker that ran more tasks spent more CPU on user (application) work.
I am wondering what kinds of system operations a Spark application requires. Why do three of the worker nodes spend more CPU on system work? I also enabled spark.speculation, but the stragglers are only killed after 10 minutes and performance didn't improve. Moreover, the stragglers are node_local, so I assume this issue is not related to HDFS replication. (There are 3 replicas within the rack.)
Thank you very much.
Even if the input size is the same for each task, some tasks may process more data than others during the shuffle and reduce phase. Data skew like this can cause higher CPU cost.
Repartitioning the data in between may improve performance.
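
The effect of skew can be sketched with a toy model of hash partitioning. This is plain Python rather than Spark, the word list is a made-up example, and crc32 stands in for whatever hash function the real partitioner uses. Spark's reduceByKey also combines values inside each map task before the shuffle, so the real imbalance may be smaller than in this simplified model.

```python
import zlib
from collections import Counter

def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition under hash partitioning."""
    sizes = Counter()
    for key in keys:
        # crc32 is used instead of hash() so the result is stable between runs.
        sizes[zlib.crc32(key.encode("utf-8")) % num_partitions] += 1
    return [sizes[p] for p in range(num_partitions)]

# Hypothetical skewed input: one very frequent word plus many rare ones.
words = ["the"] * 9000 + [f"word{i}" for i in range(1000)]
sizes = partition_sizes(words, 4)
print(sizes)
```

All records with the same key go to the same partition, so whichever partition receives the frequent word holds at least 9000 of the 10000 records, and the task that processes it runs far longer than the others.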

Hadoop, uneven load between machines

I have a cluster of 4 machines that I need to run a benchmark against.
I decided to use Terasort as the benchmark.
However, when I run the benchmark, only one of the four machines is under load, while the other three are completely idle.
If I run the test again, a different machine is completely under load while the other three are idle.
When I create the dataset with Teragen everything works just fine, and the load is evenly distributed across all four machines.
What could be wrong in this configuration?
Thanks
I assume your cluster is laid out as 4 nodes (1 NameNode, 1 Secondary NameNode, 2 DataNodes).
The flow starts with the NameNode, and the JobTracker schedules the job's tasks on the TaskTrackers that hold the data blocks.
The usage of the DataNodes depends on a few factors, such as the replication factor, the number of mappers and the number of blocks.
If there are many blocks, they will be spread evenly across all the DataNodes of your cluster. If the replication factor is 2, the blocks will be available on both DataNodes, so both can run the mappers that process those blocks.
If a file has two blocks, two mappers can run simultaneously on the DataNodes and use the resources properly.
In your case, it seems the block size is the problem. Try reducing it so that there are at least 2 blocks, which should increase utilization and therefore performance.
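
The relationship between block size and parallelism can be sketched as follows. It assumes one mapper per block and at most one mapper per DataNode at a time. The 100 MB file size is a hypothetical figure, since the actual input size is not given in the question.

```python
import math

def max_parallel_mappers(file_mb, block_mb, data_nodes):
    """Return (number_of_blocks, nodes_that_can_be_busy) for a single file."""
    blocks = math.ceil(file_mb / block_mb)
    # A node can only be busy if there is a block (and so a mapper) for it.
    return blocks, min(blocks, data_nodes)

print(max_parallel_mappers(100, 128, 2))  # (1, 1)
print(max_parallel_mappers(100, 64, 2))   # (2, 2)
```

With a block size larger than the file there is a single block and only one node can work on it. Halving the block size yields two blocks and lets both DataNodes run a mapper.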
Hadoop can be tuned to your needs with the settings below.
dfs.replication in hdfs-site.xml
dfs.block.size in hdfs-site.xml
Good luck!

Hadoop Terasort unstable benchmark results

I have a Cloudera Hadoop cluster and I'm running Terasort benchmarks, but the results are very unstable, ranging from 105 to 150 minutes. Sometimes I've seen it replicating more than usual or doing a lot of garbage collection, but other times those were pretty much the same.
I don't know the reason for the unstable results; any hint or recommendation would be very welcome :)
I run the benchmarks as follows:
I've chosen the number of map and reduce tasks following this guide: http://wiki.apache.org/hadoop/HowManyMapsAndReduces
Speculative execution of map and reduce tasks is off.
Generating dataset:
10,000,000,000 rows of 100 bytes ~= 953,674 MB
Block size = 128 MB
Number of map tasks = 3725, computed as (number-of-rows * row-size) / (block-size * 2). I multiply by 2 because the map tasks were otherwise too short, about 7 seconds each.
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar teragen -Ddfs.replication=3 -Dmapred.map.tasks=3725 10000000000 /terasort-in
Running terasort:
num-of-worker-nodes = 4
num-of-cores-per-node = 8
Reduce tasks = 56 ( 1.75 * num-of-worker-nodes * num-of-cores-per-node )
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar terasort -Ddfs.replication=1 -Dmapred.reduce.tasks=56 /terasort-in /terasort-out
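
The task counts passed to the two commands above can be reproduced with a few lines of arithmetic. This is only a check of the numbers stated in the question. The variable names are made up, and MB is taken to mean 1024 * 1024 bytes, which is what matches the 953,674 figure.

```python
rows = 10_000_000_000
row_bytes = 100
block_mb = 128
MIB = 1024 * 1024

# Dataset size in MB: 10^12 bytes / 2^20.
dataset_mb = rows * row_bytes / MIB

# The question doubles the block size in the divisor to get longer map tasks.
map_tasks = int(dataset_mb / (block_mb * 2))

worker_nodes = 4
cores_per_node = 8
reduce_tasks = int(1.75 * worker_nodes * cores_per_node)

print(int(dataset_mb), map_tasks, reduce_tasks)  # 953674 3725 56
```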
The service and role distribution among nodes is as follows:
6 nodes, each with 8 cores, 16 GB RAM and 2 hard disks, running just HDFS and MapReduce:
1st node, just master roles: NameNode, Cloudera management services.
2nd node, just master roles: JobTracker, SecondaryNameNode.
3rd to 6th nodes, just worker roles: TaskTracker, DataNode.
I use the 2nd node as the client because it is the one with the lowest load.
Please tell me if you need any configuration property value or detail.
Update: After Chris White's answer I tried to reduce the amount of polling between the JobTracker and the TaskTrackers by using just 1 worker and very few map and reduce tasks. Now the benchmarks are pretty stable :)
There are many factors that you need to take into consideration when looking at performance.
This could be a polling problem combined with the small number of processing slots you have available.
The Task Trackers poll the running tasks periodically to determine if they have finished, and the Job Tracker also polls the Task Trackers. With your ~3700 map tasks (if I've read your question correctly), a difference of about 1 second in polling times per task could account for the ~hour you are seeing in timing differences.
If you had a larger cluster with more processing slots, I imagine this number would become more stable, but no MR job will ever have a constant running time. There are too many polling and other external timings (JVM start-up time, for example) that can change the overall runtime.
What do the data locality counters say for both jobs? If one job had considerably more data-local tasks than another, then I would expect it to run faster too.
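
The polling estimate can be put into numbers with a small sketch. It is a simplified model in which each task picks up a fixed extra delay and delays of tasks running at the same time overlap. The function name is made up, and the figure of 32 concurrent slots (4 workers with 8 cores each) is an assumption, since the actual slot configuration is not given.

```python
def accumulated_delay_minutes(num_tasks, delay_per_task_s, concurrent_slots=1):
    """Extra wall-clock minutes if every task is delayed by delay_per_task_s."""
    # Tasks run in rounds of concurrent_slots; only one delay per round adds up.
    waves = -(-num_tasks // concurrent_slots)  # ceiling division
    return waves * delay_per_task_s / 60

print(round(accumulated_delay_minutes(3725, 1.0), 2))      # 62.08
print(round(accumulated_delay_minutes(3725, 1.0, 32), 2))  # 1.95
```

The roughly one hour figure corresponds to the delays adding up one after another. If many tasks run concurrently the delays overlap, so polling alone may explain less of the variation than the serial estimate suggests.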

How does Hadoop dfs.replication work?

I have a 2-node Hadoop setup (one node is both master and slave, the other is a slave) and 4 input files of 1 GB each.
When I set dfs.replication to 2, the entire data set is copied to both nodes, which is understandable. But my question is: how do I see improved performance (almost twice as fast) over a single-node setup, since in the 2-node case MapReduce will still run over the complete data set on both systems, with the added overhead of channeling the output of 2 mappers to the reducers?
Also, when I set the replication to 1, the entire data set exists only on the master node, which is also understandable since it avoids Ethernet overhead. But even in this case I see a performance improvement compared to the single-node setup, which I find confusing. Since MapReduce runs on local data, this scenario should essentially be the same as a single-node setup with one MapReduce program running on the master node over the entire data set.
Can someone help me understand what I am missing here?
Thanks
Pawan
Pawan,
In the two-node case the MapReduce job will not run on the entire data set on each node. MapReduce operates on HDFS blocks, which are 64 MB or more depending on your configuration. Each 1 GB file is split into blocks that are distributed across the cluster nodes. Some of these blocks are processed on node 1 and the others on node 2, with no duplication. The replication factor only increases the availability of the data and the tolerance to node failures. It does not duplicate tasks.
As a result, from the processing perspective the data is split between node 1 and node 2 and processed in parallel. This means that if you are using your processing power fully and correctly, you are theoretically doubling your speed.
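
As a rough illustration with the numbers from the question, assuming 64 MB blocks, one map task per block and an even spread of tasks over the nodes (the function name is made up):

```python
import math

def blocks_per_node(num_files, file_mb, block_mb, nodes):
    """Return (total_blocks, blocks_processed_per_node)."""
    total_blocks = num_files * math.ceil(file_mb / block_mb)
    # Each block is processed by exactly one map task, whatever the
    # replication factor is, so replication does not appear in this formula.
    return total_blocks, total_blocks / nodes

print(blocks_per_node(4, 1024, 64, 1))  # (64, 64.0)
print(blocks_per_node(4, 1024, 64, 2))  # (64, 32.0)
```

The total stays at 64 map tasks in both cases. With two nodes each one handles about half of them, which is where the near doubling in speed comes from.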
Cheers
Rags
