Hadoop, uneven load between machines

I have a cluster of 4 machines that I need to run a benchmark against.
I decided to use TeraSort for the benchmark.
However, when I run the benchmark, only one of the four machines is under load, while the other three are completely idle.
If I run the test again, a different machine is fully loaded while the other three are idle.
When I create the dataset with TeraGen everything works just fine; the load is evenly distributed across all four machines.
What could be wrong with this configuration?
Thanks

I hope your cluster is laid out properly as 4 nodes (1 NameNode, 1 Secondary NameNode, 2 DataNodes).
The process starts at the NameNode, and the JobTracker schedules the job on the TaskTrackers that host the data blocks.
How much the DataNodes are used depends on a few factors, such as the replication factor, the number of mappers, and the number of blocks.
If there are many blocks, they will be placed fairly evenly across all the DataNodes in your cluster. If the replication factor is 2, each block will be available on both DataNodes, so either of them can run the mappers that process those blocks.
If a file has two blocks, two mappers can run simultaneously on the DataNodes and make proper use of the resources.
In your case, the block size seems to be the problem. Try reducing it so that there are at least 2 blocks; utilization will be higher, and so will performance.
Hadoop can be tuned to your needs with the settings below:
dfs.replication in hdfs-site.xml
dfs.block.size in hdfs-site.xml
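If you don't want to change the cluster-wide defaults, you can also override both values from the client that writes the data. A minimal sketch, assuming the classic dfs.block.size property name and placeholder paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SmallBlockUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Smaller blocks -> more splits -> more mappers spread across the nodes.
            conf.set("dfs.block.size", String.valueOf(32L * 1024 * 1024)); // 32 MB
            conf.set("dfs.replication", "2");
            FileSystem fs = FileSystem.get(conf);
            // Files written through this client pick up the overridden settings.
            fs.copyFromLocalFile(new Path("/tmp/terasort-input"),      // placeholder local path
                                 new Path("/benchmarks/terasort-input")); // placeholder HDFS path
        }
    }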
Good luck !!!

Related

Why doesn't my Hadoop MapReduce job run faster even when I add nodes to the cluster?

So I ran WordCount over 50 MB of data on my Hadoop cluster. I ran the test on 5 different cluster sizes, from a single-node cluster up to a 5-node cluster. The thing is, the execution time barely changes; there is only a 1-2 minute difference between runs. Shouldn't adding nodes to a cluster provide more resources and make the job run faster?
I expected the execution time to be much faster with each node added, but the results show otherwise.
The nodes I use have 2 GB of RAM and 2 cores.
I haven't changed anything regarding containers in yarn-site.xml or the map/reduce allocation.mb settings in mapred-site.xml.
You need to test it with a bigger amount of data.
YARN will allocate a map container for each HDFS block of data. The default HDFS block size is usually 64 MB, so your test file probably fits in a single HDFS block.
A container is the minimum slice of computation that YARN will assign to a node. In the worst case for your test, the job needs only one container for the map phase and another for the reduce phase. Two containers usually fit on a single node, so adding more nodes doesn't give you more speed.
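To see why 50 MB can't scale, you can do the block arithmetic yourself. A rough sketch, using only the sizes mentioned in the question and answer:

    public class SplitCount {
        public static void main(String[] args) {
            long fileSize = 50L * 1024 * 1024;    // the 50 MB WordCount input
            long blockSize = 64L * 1024 * 1024;   // default HDFS block size from the answer
            // One map container per HDFS block, rounded up.
            long mapTasks = (fileSize + blockSize - 1) / blockSize;
            System.out.println("Expected map containers: " + mapTasks); // prints 1
        }
    }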

How does the Hadoop framework decide which node runs a map task?

As per my understanding, files stored in HDFS are divided into blocks, and each block is replicated to multiple nodes (3 by default). How does the Hadoop framework choose, out of all the nodes holding a replica of a particular block, the one on which to run a map task?
As far as I know, there will be as many map tasks as there are blocks.
See the manual here.
Usually, the framework chooses nodes close to the input block to reduce the network bandwidth used by the map task.
That's all I know.
In MapReduce 1 it depends on how many map tasks are already running on the DataNode that hosts a replica, because the number of map slots per node is fixed in MR1. In MR2 there are no fixed slots, so it depends on the number of tasks already running on that node.
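If you want to look at the locality information the scheduler works from, each InputSplit reports the hosts that store its data. A small sketch using the standard TextInputFormat (the input path is just a placeholder):

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class PrintSplitHosts {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration());
            TextInputFormat.addInputPath(job, new Path("/data/input")); // placeholder path
            // getLocations() lists the nodes holding a replica of each split's data;
            // the scheduler prefers to run the map task on one of those hosts.
            for (InputSplit split : new TextInputFormat().getSplits(job)) {
                System.out.println(split + " -> " + Arrays.toString(split.getLocations()));
            }
        }
    }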

Can a slave node have multiple blocks of the same file in Hadoop?

Say I have a Hadoop cluster where one node is the master node and the other is a data node. The slave node is an 8-core machine, just to make sure there are enough cores to process jobs in parallel. Can I still split the file into, say, 3 blocks and have the slave node store all three blocks separately? In other words, "if we want to utilize all the slave nodes in a Hadoop cluster", is there a 1:1 relation between the number of slave nodes and the maximum number of blocks of a file? If yes, then how would MapReduce work in such a case? Will the master node fire three map tasks at the slave node, with each mapper picking up one block?
My question can be seen in a different way: if we have a 1 GB file on a cluster with 3 data nodes, how do the 64 MB blocks get divided, and how are they distributed between the three nodes?
The second question seems to be more understandable for me so I will take that first.
From HDFS Perspective:
With a 64 MB block size, a 1 GB file consists of 16 blocks. Blocks are stored somewhat randomly across the DataNodes if you have more DataNodes than the replication factor, but you can expect a roughly even distribution between the nodes, provided you do not load the data from one of the DNs. If you do, that DN will hold a replica of every block, and the other DNs will hold the remaining replicas distributed more or less evenly (still randomly placed). So yes, if a file consists of 16 blocks and you have only 3 DNs with a replication factor of 3, all 3 DNs will hold all 16 blocks.
From YARN's perspective when you run the MapReduce job:
YARN tries to find a container for a mapper on a node that has the data locally; there is a configurable wait time for a free container on such nodes before YARN starts the mapper on a node that does not have the data.
YARN does not rely on physical cores directly; you can configure the number of virtual cores and the amount of memory a container uses, and based on these values YARN determines how many containers a NodeManager can run.
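For reference, the per-container requests can be set per job, while the per-node capacities live in yarn-site.xml. A sketch, assuming the standard YARN/MapReduce property names and example values only:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ContainerSizingDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Per-mapper container request: memory in MB and virtual cores (example values).
            conf.setInt("mapreduce.map.memory.mb", 1024);
            conf.setInt("mapreduce.map.cpu.vcores", 1);
            // The node capacity itself (yarn.nodemanager.resource.memory-mb /
            // yarn.nodemanager.resource.cpu-vcores) has to be set in yarn-site.xml
            // on the NodeManagers, not per job.
            Job job = Job.getInstance(conf, "container-sizing-demo");
            // ... configure input/output and submit as usual ...
        }
    }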
Further reading on YARN tuning on Cloudera Engineering blog
However:
From the first part of the question, as I understand it, you want to achieve parallelism by choosing the block size that splits your data files.
MapReduce does not care about HDFS blocks; it has its own abstraction for splitting the input, called an InputSplit. InputSplits are fed to the mappers by the InputFormat. InputSplits also define where the split's data is available locally, so that YARN can find a container on a node that has the split on its local storage. I suggest checking the API and the available implementations of InputFormat; they will most likely suit your needs, but if not, you can still write your own implementation and specify it via the job configuration.
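For example, if all you want is more parallel mappers without touching the HDFS block size, the stock FileInputFormat already lets you cap the split size from the driver. A rough sketch, with placeholder job name and path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "split-size-demo");
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path("/benchmarks/input")); // placeholder
            // Cap each split at 32 MB: a single 64 MB block now yields two splits,
            // and therefore two mappers, regardless of how the blocks are stored.
            FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
            // ... set mapper/reducer classes and the output path as usual,
            // then job.waitForCompletion(true);
        }
    }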

Confusion about how Hadoop splits work

We are Hadoop newbies. We realize that Hadoop is meant for processing big data, and that a Cartesian product is extremely expensive. However, we are running some experiments with a Cartesian-product job similar to the one in the MapReduce Design Patterns book, except with a reducer that calculates the average of all intermediate results (including only the upper half of A*B, so the total is A*B/2).
Our setup: a 3-node cluster, block size = 64 MB; we tested different data set sizes ranging from 5000 points (130 KB) to 10000 points (260 KB).
Observations:
1- All map tasks run on a single node, sometimes the master machine and other times one of the slaves, but never on more than one machine. Is there a way to force Hadoop to distribute the splits, and therefore the map tasks, among the machines? Based on what factors does Hadoop decide which machine will process the map tasks (in our case it once decided on the master, and another time on a slave)?
2- In all cases where we test the same job on different data sizes, we get 4 map tasks. Where does the number 4 come from? Since our data size is less than the block size, why do we have 4 splits instead of 1?
3- Is there a way to see more information about the exact splits for a running job?
Thanks in advance
What version of Hadoop are you using? I am going to assume a later version that uses YARN.
1) Hadoop should distribute the map tasks across your cluster automatically and not favor any specific nodes. It will place a map task as close to the data as possible, i.e. it will choose a NodeManager on the same host as a DataNode hosting a block. If such a NodeManager isn't available, it will just pick a node to run your task. This means you should see all of your slave nodes running tasks when your job is launched. There may be other factors blocking Hadoop from using a node, such as the NodeManager being down, or there not being enough memory to start up a JVM on a specific node.
2) Is your file size slightly above 64 MB? Even one byte over 67,108,864 bytes will create two splits. The CartesianInputFormat first computes the cross product of all the blocks in your data set. Having a file that spans two blocks will create four splits -- A1xB1, A1xB2, A2xB1, A2xB2. Try a smaller file and see if you still get four splits.
3) You can see the running job in the UI of your ResourceManager. http://resourcemanager-host:8088 will open the main page (jobtracker-host:50030 for MRv1), and you can navigate to your running job from there to see the individual tasks. If you want more specifics on what the input format is doing, add some log statements to CartesianInputFormat's getSplits method and re-run your code to see what is going on.
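For question 3, instead of patching the input format you can also dump the splits from the driver before submitting. A sketch against the old mapred API that the book's CartesianInputFormat targets (configure the JobConf exactly as in your Cartesian driver first):

    import java.util.Arrays;
    import org.apache.hadoop.mapred.InputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;

    public class DumpSplits {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(DumpSplits.class);
            // ... same input format and input paths as the Cartesian-product job ...
            InputFormat<?, ?> format = conf.getInputFormat();
            // Print every split the job would get, with its size and preferred hosts.
            for (InputSplit split : format.getSplits(conf, conf.getNumMapTasks())) {
                System.out.println(split + " length=" + split.getLength()
                        + " hosts=" + Arrays.toString(split.getLocations()));
            }
        }
    }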

How does Hadoop dfs.replication work?

I have a 2-node Hadoop setup (one node is the master/slave and the other is a slave) and 4 input files, each 1 GB in size.
When I set dfs.replication to 2, the entire data is copied to both nodes, which is understandable. But my question is: how do I see an improved performance (almost twice as good) over a single-node setup, since in the 2-node case MapReduce will still run over the complete data set on both systems, along with the added overhead of channeling the inputs from 2 mappers to the reducers?
Also, when I set the replication to 1, the entire data exists only on the master node, which is also understandable, to avoid Ethernet overhead. But even in this case I see a performance improvement compared to the single-node setup, which I find confusing. Since MapReduce runs on local data sets, this scenario should essentially be similar to a single-node setup, with one MapReduce program running on the master node over the entire data set, shouldn't it?
Can someone help me understand what I am missing here?
Thanks
Pawan
Pawan,
In the two-node case the MapReduce job will not run over the entire dataset on each node. MapReduce operates on HDFS blocks, which will be 64 MB or larger depending on your configuration. Your 1 GB files are split into blocks and distributed across the cluster nodes; some of these blocks are processed on node 1 and the others on node 2, with no duplication. The replication factor only increases the availability of data and the tolerance towards node failures; it does not duplicate the tasks.
As a result, from the processing perspective the data is split between node 1 and node 2 and processed in parallel. This means that if you are utilizing your processing power fully and correctly, you are theoretically doubling your speed.
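If you want to confirm where the blocks of your 1 GB files actually landed, the FileSystem API can list the hosts per block. A small sketch with a placeholder path:

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/input/file1")); // placeholder path
            // One entry per block: the offset into the file plus the DataNodes holding a replica.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset=" + block.getOffset()
                        + " hosts=" + Arrays.toString(block.getHosts()));
            }
        }
    }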
Cheers
Rags
