Hadoop's poor scheduling of tasks

I am running some MapReduce jobs on Hadoop. The mappers generate the data themselves and hence do not depend on HDFS block placement. To test my system I am using two slave nodes and one master node, and I am doing my testing on Hadoop 2.0 with YARN.
There is something about Hadoop's behaviour that bothers me. I have configured it to run 8 map tasks, but unfortunately Hadoop launches all 8 map tasks on one node while the other node sits almost idle. There are 4 reducers, and it does not balance those either. This results in really poor performance when it happens.
I have set these properties in mapred-site.xml on both the job tracker and the task tracker nodes:
<property>
<name>mapreduce.tasktracker.map.tasks.maximum</name>
<value>2</value>
</property>
<property>
<name>mapreduce.tasktracker.reduce.tasks.maximum</name>
<value>2</value>
</property>
Can someone explain whether this problem can be solved, or why such a problem exists with Hadoop?

Don't think of mappers/reducers as mapping one-to-one onto servers. What sounds like it is happening is that your system sees the load is so low that there is no need to launch reducers across the cluster; it is trying to avoid the network overhead of transferring files from the master to the slave nodes.
Think of the number of mappers and reducers as how many concurrent threads you will allow your cluster to run. This is important when determining how much memory to allocate for each mapper/reducer.
To force an even distribution you could try allocating enough memory to each mapper/reducer that it effectively requires a whole node, as sketched below. For example, with 4 nodes and 8 mappers, force each mapper to claim 50% of the RAM on a node. I am not sure this will work exactly as expected; Hadoop's own load balancing is good in theory, but it may not look that way in small-data situations.
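A minimal sketch of that idea, assuming a YARN cluster whose NodeManagers each offer 16 GB to containers (the property names are standard YARN/MR2 settings; the values are purely illustrative):
<!-- yarn-site.xml: memory each NodeManager offers to containers (assumed 16 GB) -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>16384</value>
</property>
<!-- mapred-site.xml: request roughly half a node per task -->
<property>
<name>mapreduce.map.memory.mb</name>
<value>8192</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>8192</value>
</property>
With requests that large the scheduler cannot fit more than two map tasks on a node, so the remaining tasks are forced onto the other node whether or not the data is local.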

Related

Control number of mappers on each node in cluster

I have a very small 2-node Hadoop/HBase cluster on which I am executing MapReduce jobs. I use Hadoop 2.5.2. I have 32 GB free for MapReduce on each node (the nodes have 64 GB of memory each), with the configuration in yarn-site.xml as follows:
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>32768</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>15</value>
</property>
My resource requirement is 2 GB for each mapper/reducer that gets executed; I have configured this in mapred-site.xml. Given these configurations, with a total of about 64 GB of memory and 30 vcores, I see about 31 mappers or 31 reducers executing in parallel.
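(For reference, the 2 GB-per-task requirement would typically be expressed in mapred-site.xml roughly as below; the asker's exact settings are not shown in the question. The figure of about 31 parallel tasks is consistent with 64 GB / 2 GB = 32 containers, one of which is occupied by the MapReduce ApplicationMaster.)
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>2048</value>
</property>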
While all this is fine, there is one part I am trying to figure out: the number of mappers or reducers executing in parallel is not the same on both nodes; one of the nodes runs a higher number of tasks than the other. Why does this happen? Can it be controlled, and if so, how?
I suppose YARN does not treat these as the resources of an individual node but rather as resources of the cluster, and spawns the tasks wherever it can in the cluster. Is this understanding correct? If not, what is the correct explanation for this behaviour during an MR execution?

HDFS and redundancy

I'm planning a data processing pipeline. My scenario is this:
A user uploads data to a server
This data should be distributed to one (and only one) node in my cluster. There is no distributed computing, just picking whichever node currently has the least to do.
The data processing pipeline gets its data from some kind of distributed job engine. And here is (finally) my question: many job engines rely on HDFS to work on the data, but since this data is processed on one node only, I would rather avoid distributing it. My understanding is that HDFS keeps the data redundant, though I could not find any information on whether this means that all data on HDFS is available on all nodes, or whether the data mostly stays on the node where it is processed (locality).
For I/O reasons it would be a concern in my usage scenario if the data on HDFS were made completely redundant.
You can go with Hadoop (MapReduce + HDFS) to solve your problem.
You can tell HDFS to store as many copies of each block as you want; see the dfs.replication property below. Set this value to 1 if you want only one copy.
conf/hdfs-site.xml - on the master and all slave machines:
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
It is not necessary for HDFS to copy the data to each and every node.
Hadoop works on the principle of 'move code to data'. Since moving code (usually a few MB) demands far less network bandwidth than moving data measured in GBs or TBs, you do not need to worry about data locality or network bandwidth; Hadoop takes care of it.

Hadoop, uneven load between machines

I have a cluster of 4 machines that I need to run a benchmark against.
I decided to use Terasort for the benchmark.
However, when I run the benchmark, only one of the four machines is under load while the other three are completely idle.
If I run the test another time, a different machine is fully loaded while the other three sit idle.
When I create the data set with Teragen everything works just fine; the load is evenly distributed across all four machines.
What could be wrong with this configuration?
Thanks
I assume your cluster is laid out properly as 4 nodes (1 name node, 1 secondary name node, 2 data nodes).
The process flow starts with the name node, and the job tracker schedules the job onto the task trackers that hold the data blocks.
The usage of the data nodes depends on a few factors such as the replication factor, the number of mappers and the number of blocks.
If there are many blocks, they will be placed evenly across all the data nodes of your cluster. If the replication factor is 2, the blocks will be available on both data nodes, so both can run the mappers that deal with those blocks.
If you have two blocks for a file, two mappers will run simultaneously on the data nodes and utilize the resources properly.
In your case it seems the block size is the problem. Try reducing it so that there are at least 2 blocks; utilization will then be higher, and so will performance.
Hadoop can be tuned as per your need with the below settings.
dfs.replication in hdfs-site.xml
dfs.block.size in hdfs-site.xml
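For example, a smaller block size (to get more blocks and hence more parallel mappers) and a replication factor matching your two data nodes could be sketched in hdfs-site.xml like this (illustrative values only, not tuned for your data set):
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.block.size</name> <!-- dfs.blocksize in newer releases -->
<value>67108864</value> <!-- 64 MB, i.e. half of the 128 MB default in Hadoop 2 -->
</property>
Note that a block size change only applies to files written afterwards, so the Teragen data set would have to be regenerated for it to take effect.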
Good luck !!!

Controlling and monitoring the number of simultaneous map/reduce tasks in YARN

I have a Hadoop 2.2 cluster deployed on a small number of powerful machines. I have a constraint to use YARN as the framework, which I am not very familiar with.
How do I control the number of actual map and reduce tasks that will run in parallel? Each machine has many CPU cores (12-32) and enough RAM. I want to utilize them maximally.
How can I monitor that my settings actually led to a better utilization of the machine? Where can I check how many cores (threads, processes) were used during a given job?
Thanks in advance for helping me melt these machines :)
1.
In MR1, the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties dictated how many map and reduce slots each TaskTracker had.
These properties no longer exist in YARN. Instead, YARN uses yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, which control the amount of memory and CPU available on each node, shared by both maps and reduces.
Essentially:
YARN has no TaskTrackers, just generic NodeManagers. Hence there is no longer a separation between map slots and reduce slots; everything depends on the amount of memory in use/demanded.
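A rough sketch of how the two sides fit together (the values are illustrative, not a recommendation for your hardware):
<!-- yarn-site.xml: what each NodeManager offers -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>49152</value> <!-- 48 GB offered to containers -->
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>24</value>
</property>
<!-- mapred-site.xml: what each map/reduce task requests -->
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>2048</value>
</property>
With these numbers each node can run roughly 49152 / 2048 = 24 concurrent containers; the vcore limit only comes into play if the scheduler is configured to account for CPU as well as memory.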
2.
Using the web UIs you can get a lot of monitoring/admin kind of info:
NameNode - http://<namenode-host>:50070/
Resource Manager - http://<resourcemanager-host>:8088/
In addition Apache Ambari is meant for this:
http://ambari.apache.org/
And Hue for interfacing with the Hadoop/YARN cluster in many ways:
http://gethue.com/
There is a good guide on YARN configuration from Hortonworks.
You can analyze your job in the Job History Server, which is usually found on port 19888. Ambari and Ganglia are also very good for measuring cluster utilization.
I have the same problem.
In order to increase the number of mappers, it is recommended to reduce the size of the input split (each input split is processed by a mapper, and hence by a container), but I don't know how to do that.
Indeed, Hadoop 2.2/YARN does not seem to take any of the following settings into account:
<property>
<name>mapreduce.input.fileinputformat.split.minsize</name>
<value>1</value>
</property>
<property>
<name>mapreduce.input.fileinputformat.split.maxsize</name>
<value>16777216</value>
</property>
<property>
<name>mapred.min.split.size</name>
<value>1</value>
</property>
<property>
<name>mapred.max.split.size</name>
<value>16777216</value>
</property>
best

Understanding Hadoop Simulator Mumak

Recently I was trying to understand the working of Mumak (see, e.g., MAPREDUCE-728)
It basically takes a job trace and a topology trace and simulates Hadoop.
I couldn't understand how it assigns splits across nodes.
What does Mumak mean by a local map task and a non-local map task?
In MapReduce there is the notion of "locality" which signifies how "far away" a task is running from the data it is working on. The best locality is running a task on a node that contains the data it needs. The second best locality is a node in the same rack as a node containing the data, etc...
Mumak has the ability to slow down the tasks scheduled on non-local nodes by using the following settings in your configuration file:
<property>
<name>mumak.scale.racklocal</name>
<value>1.5</value>
<description>Scaling factor for task attempt runtime of rack-local over
node-local</description>
</property>
<property>
<name>mumak.scale.rackremote</name>
<value>1.8</value>
<description>Scaling factor for task attempt runtime of rack-remote over
node-local</description>
</property>
