How to set hadoop cluster priority? - hadoop

I am starting to learn Hadoop. I have a hadoop server and it connects with 3 clusters node. If I run a MapReduce job it works well. I need to set the priority for these clusters.
For example
node1, node2, node3 are my cluster which is connect with my hadoop server. Here If I run the MR job, It will split and assign job like the above priority for every time. Is it possible?
Because the cluster nodes have different memory capacity. So I need to set high memory node will handle the Job first.

It's not possible to 'weight' certain servers based on capacity. However, each server can have a configuration to match it's memory, processor count, etc.
For example, if one server has 16 cores and another has 8 cores, you can configure the first server to run 12 tasks simultaneously and the second to only run 6. The same idea with memory.

Related

How do we allocate different number of reducers of heterogeneous clusters?

Our system has a cluster of 5 hosts (e.g., data node or computer slaves…). Now, I want allocate different number of reducers of these hosts because 1 host is slow. . We are using Hadoop Yarn. The resource manager (so called Job tracker in MapReduce1) always allocate evenly number of reducers of to 5 hosts. Is there anyway that I can limit number of reducers of a specific host? For example, what I want is that a job with 40 reducers, 4 fast computers have 36 reducers (e.g., 9 reducers each host), the slow computer has only 4 reducers.
It is entirely possible and a common phenomenon to have heterogenous systems in a hadoop cluster. Typically, as the cluster keeps becoming larger and hence is scaling horizontally, new nodes of different configurations get added to the cluster.
In such scenarios, in order to have configurations applicable to a specific node or to a group of nodes, we need to modify the configurations accordingly on those hosts.
For example, in case of Hortonworks Data Platform where the cluster is managed through Ambari, the concept of host config groups can be leveraged for this purpose.
Please see the below link for further information:
https://docs.hortonworks.com/HDPDocuments/Ambari-2.1.1.0/bk_Ambari_Users_Guide/content/_using_host_config_groups.html
Also see the below link, where the discussion is about increasing the number of YARN containers at a node level. It remains the same in your case as well, which is the opposite of the use case discussed there:
How to increase the number of containers in nodemanager in YARN
Another useful link:
http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/

Can a slave node have multiple blocks of the same file in hadoop?

Say I have a hadoop cluster where one node is the Master node and the other is a Data node. The slave node is an 8-core machine just to make sure there are enough cores to process jobs parallelly. Can i still split the file into say 3 blocks and have the slave node store all the three blocks separately on it. In other words, "if we want to utilize all the slave nodes in a hadoop cluster", then is there a 1:1 relation between number of slave nodes and the maximum number of blocks of a file? If yes, then in such a case how would the map-reduce work. Will the master node fire three map jobs to the slave node and have each mapper pick up each block on the slave node?
My question can be seen in a different way. If we have a 1GB file on a cluster with 3 data nodes then how do the 64 MB blocks get divided and how are they distributed between the three nodes?
The second question seems to be more understandable for me so I will take that first.
From HDFS Perspective:
With 64MB block size a 1GB file consists from 16 blocks, blocks are being stored somewhat randomly between DataNodes, if you have more from them as the replication factor, but you can expect an even distribution between the nodes, if you do not load the data from one of the DNs. If you do, that DN will hold a replica from all the blocks, and other DNs will hold the remaining replicas distributed sort of evenly (still randomly placed). So yes, if you have a file consists from 16 blocks, and only 3 DN with a replication factor of 3 all 3 DNs will hold all 16 blocks for example.
From YARN's perspective when you run the MapReduce job:
YARN tries to find a container on a node for a mapper that has the data locally, there is a configurable wait time for a free container on such nodes before YARN starts up the mapper on a node that does not have the data.
YARN does not rely on physical cores directly, you can configure the number of virtual cores and the amount of memory a container uses, and based on these values YARN will allocate the amount of available containers in a NodeManager.
Further reading on YARN tuning on Cloudera Engineering blog
However:
From the first part of the question as I understand you want to achieve paralellism by defining the block size to split your data files.
MapReduce does not care about HDFS blocks, it has its own abstraction to split the input, it is called InputSplit. InputSplits are feeded to the mappers, by the InputFormat. Also InputSplits are defining the place where the split is available locally so that YARN can find a container that is on a node that has the split on local data storage. I suggest to check the API, and the available implementations of InputFormat, as they most likely suit your needs, however if they are not, then you can still write your own implementation, and specify it via the job configuration.

Hadoop, uneven load between machines

I have a cluster of 4 machines that I need to run a benchmark against.
I decide to use Terasort to benchmark.
However, when I run the benchmark, only one out of four machine is under load, while the other three are completely idle.
If I run the test another time, a different machine would be completely under load while the other three would be idle.
When I create the dataset with Teragen everything works just fine, the load is evenly distributed between all the four machine.
What can be wrong in this configuration ?
Thanks
I hope your cluster is distributed properly as 4 nodes (1 name node , 1 secondary name node, 2 data nodes)
The process flow happens like it starts with name-node and job tracker will schedule the job for the task trackers which has the data blocks.
The usage of data-nodes depends on few factors like number of replication, number of mappers and number of blocks.
If The number of blocks are many, it will be placed evenly in all the data nodes of your cluster. If the replication factor is 2, then the blocks will be available in both the data nodes. So both can run the mappers which deal with those blocks
If you have two blocks for a file and two mappers will run simultaneously in the data nodes and utilize the resources properly.
In your case, it seems block size is the problem. Try to reduce it. so there should be at least 2 blocks which makes utilization will be more and so is the performance.
Hadoop can be tuned as per your need with the below settings.
dfs.replication in hdfs-site.xml
dfs.block.size in hdfs-site.xml
Good luck !!!

Hadoop - Difference between single large node and multiple small node

I am new in hadoop. I am wondering are there any different between single node and multi-node, if both have same computing power.
For example, there is one server with 4 cores CPU and 32 GB RAM to setup a single node hadoop. On the other hand, there are four servers with one core CPE (same clock rate with the "big" server) and 8GB RAM to setup a 4 node hadoop cluster. Which setup would be better?
It is always better to have more number of data nodes instead of one powerful node. Basically we need to configure Name node with high configuration and data nodes can have less configurations.
So, if you have single node, all the demons will be running in the same node with single JVM, where as if you have multiple nodes the demons can run parallelly (sequentially) with multiple JVMs.

In Hadoop can we control the number of nodes per job programatically?

I am running a job timing analysis. I have a pre configured cluster with 8 nodes. I want to run a given job with 8 nodes, 6 nodes , 4 nodes and 2 nodes respectively and note down the corresponding run times. Is there a way i can do this programatically, i.e by using appropriate settings in the Job configuration in Java code ?
There are a couple of ways. Would prefer in the same order.
exclude files can be used to not allow some of the task trackers/data nodes connect to the job tracker/ name node. Check this faq. The properties to be used are mapreduce.jobtracker.hosts.exclude.filename and dfs.hosts.exclude. Note than once the files have been changed, the name node and the job tracker have to be refreshed using the mradmin and dfsadmin commands with the refreshNodes option and it might take some time for the cluster to settle because data blocks have to be moved from the excluded nodes.
Another way is to stop the task tracker on the nodes. Then the map/reduce tasks will not be scheduled on that node. But, the data will still be fetched from all the data nodes. So, the data nodes also need to be stopped. Make sure that the name node gets out of safe mode and the replication factor is also set properly (with 2 data nodes, the replication factor can't be 3).
A Capacity Scheduler can also be used to limit the usage of a cluster by a particular job. But, when resources are free/idle then the scheduler will allocate resources beyond capacity for better utilization of the cluster. I am not sure if this can be stopped.
Well are you good with scripting ? If so play around with start scripts of the daemons. Since this is an experimental setup, I think restarting hadoop for each experiment should be fine.

Resources