Concurrent "single-threaded" hadoop executions: where is the bottleneck?


I am running a compute-intensive, hadoop-based, map-reduce application. I have configured hadoop to use as few threads as possible, but multiple concurrent deployments lead to an increase of the execution time of the application.
I cannot find the cause of this increase in execution time, so there must be a bottleneck that I have not discovered and/or a configuration parameter that I have missed.
Testbed
My testbed consists of 3 Dell PowerEdge R630, each with an Intel Xeon E5-2630v3: 8 cores, 2 threads/core. These machines are located in the same 10 Gbps cluster, interconnected by the same switch. These will be referred to as M1, M2, M3.
Hadoop Configuration
I am running hadoop-1.2.1 on java-1.6.0-openjdk-amd64. I have configured hadoop to use the smallest possible number of threads. Here is my mapred-site.xml configuration:
<configuration>
<property>
<name>mapred.map.tasks</name>
<value>1</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>1</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>1</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>1</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>10.0.0.1:9001</value>
</property>
<property>
<name>mapred.map.tasks.speculative.execution</name>
<value>false</value>
</property>
<property>
<name>mapred.reduce.tasks.speculative.execution</name>
<value>false</value>
</property>
<property>
<name>tasktracker.http.threads</name>
<value>2</value>
</property>
<property>
<name>mapred.reduce.parallel.copies</name>
<value>2</value>
</property>
</configuration>
Deployment
The actual deployment takes place on containers, spawned via nova-docker. In each deployment I am spawning 3 containers, C1, C2 and C3, with 1 container per physical machine. Let's assume that C1 is spawned on M1, C2 on M2, C3 on M3.
In particular:
One container, C1, acts as the "Master"; it runs the Namenode and the Jobtracker services.
The other two containers, C2 and C3, act as the "Slaves"; they run the Datanode and the Tasktracker services.
I have run this experiment twice:
One concurrent deployment
Two concurrent deployments
"Two concurrent" deployments means that there are two identical deployments, running concurrently. To further clarify, when two deployments are running, there are six containers present:
- C1a and C1b on M1
- C2a and C2b on M2
- C3a and C3b on M3
C1a, C2a and C3a belong to the same map-reduce execution and communicate with each other, as expected. Same goes for the containers C1b, C2b and C3b, respectively.
Execution time
Both cases (1 concurrent deployment, 2 concurrent deployments) were run 10 times to get a good sample. Comparing the execution times with 1 and 2 concurrent deployments, the execution time rises by 6.72% with 2 concurrent deployments.
Issue
My question is: why is the execution time longer when running two concurrent deployments, even though I have configured hadoop to use as few threads as possible? In particular:
Could I be PCIe-bottlenecked or CPU-bottlenecked? (see below)
Have I missed something else in configuring hadoop to use as few threads as possible?
Is hadoop using more threads than the ones I am aware of, that could be congesting the CPU or another resource?
I have already investigated the following:
Bandwidth consumption: we are definitely not network-bottlenecked. The network can sustain up to 10 Gbps, the application does not consume more than 400-500 Mbps on average, and nobody else is using the cluster.
PCIe: I have already measured the PCIe bandwidth to investigate whether I am bottlenecked there. I have opened a related question on Superuser to ask whether my readings indicate a congested PCI or not.
CPU utilization: please see the next section.
CPU metrics
I installed the PCM tools to measure the CPU utilization during the executions. These tools were installed on one of the physical machines that hosts the slave containers (Datanode, Tasktracker).
I measured the utilization for cores in active state for the following cases:
Idle (labeled "0 tenants")
1 concurrent deployment (labeled "1 tenant")
2 concurrent deployments (labeled "2 tenants")
As is evident, the CPU utilization for 1 and 2 concurrent deployments is similar, although it is slightly higher on average for 1 deployment. Therefore, CPU utilization does not seem to be an issue; what could I be missing?
Please let me know in the comments whether I could provide any additional information.

To answer my own question, the eventual bottleneck is I/O bandwidth when writing to disk. With the help of iotop I measured the writing speed:
And with dd I measured the maximum writing speed:
# dd if=/dev/zero of=diskbench bs=1G count=1 conv=fdatasync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 7.38756 s, 145 MB/s
The writing speed stays fairly constant at around 10 MB/sec, quite often reaching 120-160 MB/sec. A natural question would be "why do we have continuous writing to the disk?" It's how hadoop works: the mappers write their intermediate output to the local disk, not to HDFS, as has been discussed here.
Therefore, since the mappers write continuously to the local hard disk, a bottleneck is expected there when multiple hadoop executions are run, even if we have CPU processing power to spare.
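As a rough illustration of this reasoning, the following Python sketch compares the combined write demand of several concurrent deployments against the measured sequential write ceiling. The 145 MB/s ceiling comes from the dd run and the 10 and 160 MB/s figures from the iotop observations. The assumption that bursts from separate deployments coincide and add up linearly is mine and was not measured.

```python
# Back-of-the-envelope check: do concurrent deployments exceed the disk's write ceiling?
# Assumption (not measured): bursts from different deployments coincide and add linearly.

DISK_MAX_MBPS = 145.0   # sequential write ceiling reported by dd
STEADY_MBPS = 10.0      # typical per-deployment write rate seen in iotop
BURST_MBPS = 160.0      # upper end of the per-deployment bursts seen in iotop


def write_demand(deployments: int, per_deployment_mbps: float) -> float:
    """Aggregate write demand on one physical machine, in MB/s."""
    return deployments * per_deployment_mbps


def is_saturated(deployments: int, per_deployment_mbps: float,
                 disk_max_mbps: float = DISK_MAX_MBPS) -> bool:
    """True when the aggregate demand exceeds what the disk can sustain."""
    return write_demand(deployments, per_deployment_mbps) > disk_max_mbps


if __name__ == "__main__":
    for n in (1, 2):
        print(n, "deployment(s):",
              "steady saturated =", is_saturated(n, STEADY_MBPS),
              "| burst saturated =", is_saturated(n, BURST_MBPS))
```

Under these assumptions the steady 10 MB/sec rate never saturates the disk, while the bursts are already at or above the ceiling, so any slowdown would come from the burst phases competing with each other.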

Related

Control number of mappers on each node in cluster

I have a very small 2-node Hadoop-HBase cluster on which I am executing MapReduce jobs. I use Hadoop 2.5.2. Each node has 64 GB of memory, of which 32 GB is available to MapReduce, with the following configuration in yarn-site.xml:
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>32768</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>15</value>
</property>
My resource requirements are 2 GB for each mapper/reducer that gets executed. I have configured this in mapred-site.xml. Given these configurations, with a cluster total of about 64 GB of memory and 30 vcores, I see about 31 mappers or 31 reducers executing in parallel.
While all this is fine, there is one part that I am trying to figure out. The number of mappers or reducers executing in parallel is not the same on both nodes; one of the nodes has a higher number of tasks than the other. Why does this happen? Can this be controlled? If so, how?
I suppose YARN does not see this as the resources of a node but rather as the resources of a cluster, and spawns the tasks wherever it can in the cluster. Is this understanding correct? If not, what is the correct explanation for this behaviour during an MR execution?
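A minimal sketch of the arithmetic behind the observed parallelism. It assumes that containers are sized by memory only (which is what the Capacity Scheduler's default resource calculator does) and that the MapReduce ApplicationMaster occupies one slot of the same 2 GB size. Both are assumptions about this cluster, not facts stated in the question.

```python
# Estimate how many 2 GB tasks can run at once on the cluster described above.
# Assumptions: memory-only scheduling, and one container consumed by the AM.

def containers_per_node(node_memory_mb: int, container_mb: int) -> int:
    """Number of whole containers that fit into one NodeManager's memory."""
    return node_memory_mb // container_mb


def cluster_task_slots(nodes: int, node_memory_mb: int, container_mb: int,
                       am_containers: int = 1) -> int:
    """Containers left for map/reduce tasks once the AM has taken its share."""
    total = nodes * containers_per_node(node_memory_mb, container_mb)
    return total - am_containers


if __name__ == "__main__":
    print(containers_per_node(32768, 2048))    # containers per node
    print(cluster_task_slots(2, 32768, 2048))  # task containers cluster-wide
```

With the values from the question this gives 16 containers per node and 31 task containers overall, which matches the observed count. It would also leave the node hosting the ApplicationMaster with one task fewer than the other node.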

MapReduce Application Master using complete node, no other task running on that node

I have deployed a 6-node Hadoop cluster (1 master and 5 slaves) on Amazon EC2 using t2.medium instances (each instance has 4 GB of RAM and two virtual cores). In yarn-site.xml, I have added these two properties:
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>3072</value>
</property>
This allows the NodeManager to use a maximum of 3 GB of RAM on each node, leaving 1 GB of RAM for the system.
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1536</value>
</property>
This sets the minimum RAM used by each container. I chose these values so that at most two containers run in parallel on each node.
But when I run the program, no other task (container) runs on the slave node where the MapReduce Application Master (MRAppMaster) is running, although the Application Master uses only about 10% of the total RAM (i.e. 400-500 MB).
On the other slave nodes, two containers run on each node. This degrades performance, because the program is compute-intensive and all the processing power of the node where the Application Master runs goes unused (the Application Master uses only about 0-1% of the processing power).
Ideally, there should also be one task container running on the node where MRAppMaster is running.
Can anyone please help me with this?
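One way to reason about this situation is to work out how much memory the scheduler actually reserves for each request. The sketch below assumes that requests are rounded up to a multiple of yarn.scheduler.minimum-allocation-mb and that the ApplicationMaster requests 1536 MB (the stock default of yarn.app.mapreduce.am.resource.mb, as far as I know). The task request sizes are hypothetical, since the question does not state them.

```python
# How much room is left next to the ApplicationMaster on a 3072 MB node?
# Assumptions: requests are rounded up to multiples of the minimum allocation,
# and the AM asks for 1536 MB. Task request sizes are hypothetical.

def normalize(request_mb: int, min_alloc_mb: int) -> int:
    """Round a request up to the next multiple of the minimum allocation."""
    multiples = -(-request_mb // min_alloc_mb)  # ceiling division
    return max(multiples, 1) * min_alloc_mb


def free_after_am(node_mb: int, am_request_mb: int, min_alloc_mb: int) -> int:
    """Memory still available on the node once the AM container is placed."""
    return node_mb - normalize(am_request_mb, min_alloc_mb)


def task_fits(task_request_mb: int, free_mb: int, min_alloc_mb: int) -> bool:
    """True if the normalized task request fits into the remaining memory."""
    return normalize(task_request_mb, min_alloc_mb) <= free_mb


if __name__ == "__main__":
    free = free_after_am(3072, 1536, 1536)
    print("free next to the AM:", free, "MB")
    for request in (1024, 2048):  # hypothetical task request sizes
        print(request, "MB task fits:", task_fits(request, free, 1536))
```

Under these assumptions a task requesting up to 1536 MB would still fit next to the ApplicationMaster, while anything larger is rounded up to 3072 MB and cannot be placed on that node. Checking the effective mapreduce.map.memory.mb and the AM request against this arithmetic may help narrow down the cause.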

Oozie job is stuck in the Running state

I have a simple job workflow which executes a mapreduce job as a shell action. After submitting the job, its status becomes Running and it stays there but never ends. The mapreduce cluster shows that there are two jobs running: one belongs to the shell application launcher and one is the actual mapreduce job. However, the mapreduce job is shown as UNASSIGNED and its progress is zero (which means it has not been started yet).
Interestingly, when I kill the oozie job, the mapreduce job actually starts running and completes successfully. It looks like the shell launcher is blocking it.
P.S. It is a simple workflow and there is no start or end date that may cause it to wait.

Please consider the case below according to your memory resources.
The number of containers depends on the number of blocks. If you have 2 GB of data with a 512 MB block size, YARN creates 4 maps and 1 reduce. When running mapreduce, we should follow some rules for submitting the job (this applies to small clusters).
You should configure the properties below according to your RAM, disk and cores.
<property>
<description>The minimum allocation for every container request at the RM,
in MBs. Memory requests lower than this won't take effect,
and the specified value will get allocated at minimum.</description>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>512</value>
</property>
<property>
<description>The maximum allocation for every container request at the RM,
in MBs. Memory requests higher than this won't take effect,
and will get capped to this value.</description>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
Also set the Java heap size according to the memory resources. Once the above properties are set appropriately in yarn-site.xml, the mapreduce job should run successfully and efficiently.

When a job is stuck in the "UNASSIGNED" state, it usually means the resource manager (RM) can't allocate a container to the job.
Check the capacity configured for the user and queue. Giving them more capacity should help.
With Hadoop 2.7 and capacity scheduler, specifically, the following properties need to be examined:
yarn.scheduler.capacity.<queue-path>.capacity
yarn.scheduler.capacity.<queue-path>.user-limit-factor
yarn.scheduler.capacity.maximum-applications / yarn.scheduler.capacity.<queue-path>.maximum-applications
yarn.scheduler.capacity.maximum-am-resource-percent / yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent
See more details on those properties at Hadoop: Capacity Scheduler - Queue Properties.

Controlling and monitoring number of simultaneous map/reduce tasks in YARN

I have a Hadoop 2.2 cluster deployed on a small number of powerful machines. I am constrained to use YARN as the framework, which I am not very familiar with.
How do I control the number of actual map and reduce tasks that will run in parallel? Each machine has many CPU cores (12-32) and enough RAM. I want to utilize them maximally.
How can I monitor that my settings actually led to a better utilization of the machine? Where can I check how many cores (threads, processes) were used during a given job?
Thanks in advance for helping me melt these machines :)

1.
In MR1, the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties dictated how many map and reduce slots each TaskTracker had.
These properties no longer exist in YARN. Instead, YARN uses yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, which control the amount of memory and CPU on each node; both are available to maps and reduces alike.
Essentially:
YARN has no TaskTrackers, but just generic NodeManagers. Hence, there's no more Map slots and Reduce slots separation. Everything depends on the amount of memory in use/demanded
2.
Using the web UI you can get a lot of monitoring and admin information:
NameNode - http://:50070/
Resource Manager - http://:8088/
In addition Apache Ambari is meant for this:
http://ambari.apache.org/
And Hue for interfacing with the Hadoop/YARN cluster in many ways:
http://gethue.com/

There is a good guide on YARN configuration from Hortonworks.
You can analyze your job in the Job History server, which is usually found on port 19888. Ambari and Ganglia are also very good for measuring cluster utilization.

I have the same problem.
To increase the number of mappers, it is recommended to reduce the size of the input split (each input split is processed by one mapper, and therefore one container). I don't know how to do it;
indeed, hadoop 2.2 / yarn does not take any of the following settings into account:
<property>
<name>mapreduce.input.fileinputformat.split.minsize</name>
<value>1</value>
</property>
<property>
<name>mapreduce.input.fileinputformat.split.maxsize</name>
<value>16777216</value>
</property>
<property>
<name>mapred.min.split.size</name>
<value>1</value>
</property>
<property>
<name>mapred.max.split.size</name>
<value>16777216</value>
</property>
best
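For reference, here is a short Python sketch of how FileInputFormat is generally understood to derive the split size from these settings: the block size is capped by the maximum split size and then raised to at least the minimum split size. The function names are mine, and the count is an estimate only, because the real implementation lets the last split run slightly over. Input formats that do not extend FileInputFormat may ignore these properties altogether.

```python
# Estimate the number of input splits (and therefore mappers) for one file.
# Sketch of the FileInputFormat rule: max(min_size, min(max_size, block_size)).
# The function names here are illustrative, not Hadoop API names.

MIB = 1024 * 1024
GIB = 1024 * MIB


def compute_split_size(block_size: int, min_size: int, max_size: int) -> int:
    """Split size in bytes, given the block size and the min/max split settings."""
    return max(min_size, min(max_size, block_size))


def estimated_splits(file_size: int, block_size: int,
                     min_size: int = 1, max_size: int = 2**63 - 1) -> int:
    """Approximate number of splits for a single splittable file."""
    split = compute_split_size(block_size, min_size, max_size)
    return -(-file_size // split)  # ceiling division


if __name__ == "__main__":
    # Default settings: one split per 512 MiB block of a 2 GiB file.
    print(estimated_splits(2 * GIB, 512 * MIB))
    # With the 16 MiB maximum from the settings above.
    print(estimated_splits(2 * GIB, 512 * MIB, min_size=1, max_size=16777216))
```

If the job's mapper count does not change as this arithmetic predicts, one thing to check is whether the settings reach the job configuration at all, for example because they were set on the cluster side rather than on the submitting client.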

Container is running beyond memory limits

In Hadoop v1, I assigned 7 mapper and reducer slots, each with a size of 1 GB, and my mappers and reducers ran fine. My machine has 8 GB of memory and 8 processors.
Now with YARN, when I run the same application on the same machine, I get a container error.
By default, I have these settings:
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>8192</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>8192</value>
</property>
It gave me this error:
Container [pid=28920,containerID=container_1389136889967_0001_01_000121] is running beyond virtual memory limits. Current usage: 1.2 GB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container.
I then tried to set the memory limits in mapred-site.xml:
<property>
<name>mapreduce.map.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4096</value>
</property>
But I am still getting an error:
Container [pid=26783,containerID=container_1389136889967_0009_01_000002] is running beyond physical memory limits. Current usage: 4.2 GB of 4 GB physical memory used; 5.2 GB of 8.4 GB virtual memory used. Killing container.
I'm confused about why the map task needs this much memory. In my understanding, 1 GB of memory is enough for my map/reduce task. Why does the task use more memory as I assign more memory to the container? Is it because each task gets more splits? I feel it would be more efficient to decrease the container size a little and create more containers, so that more tasks run in parallel. The problem is: how can I make sure each container won't be assigned more splits than it can handle?

You should also properly configure the maximum memory allocations for MapReduce. From this HortonWorks tutorial:
[...]
Each machine in our cluster has 48 GB of RAM. Some of this RAM should be reserved for Operating System usage. On each node, we’ll assign 40 GB RAM for YARN to use and keep 8 GB for the Operating System.
For our example cluster, we have the minimum RAM for a Container
(yarn.scheduler.minimum-allocation-mb) = 2 GB. We’ll thus assign 4 GB
for Map task Containers, and 8 GB for Reduce tasks Containers.
In mapred-site.xml:
mapreduce.map.memory.mb: 4096
mapreduce.reduce.memory.mb: 8192
Each Container will run JVMs for the Map and Reduce tasks. The JVM
heap size should be set to lower than the Map and Reduce memory
defined above, so that they are within the bounds of the Container
memory allocated by YARN.
In mapred-site.xml:
mapreduce.map.java.opts: -Xmx3072m
mapreduce.reduce.java.opts: -Xmx6144m
The above settings configure the upper limit of the physical RAM that
Map and Reduce tasks will use.
To sum it up:
In YARN, you should use the mapreduce configs, not the mapred ones. EDIT: This comment is not applicable anymore now that you've edited your question.
What you are configuring is actually how much you want to request, not what is the max to allocate.
The max limits are configured with the java.opts settings listed above.
Finally, you may want to check this other SO question that describes a similar problem (and solution).
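The relationship between the container size and the JVM heap in the quoted tutorial can be sketched in a few lines of Python. The 0.75 factor is inferred from the tutorial's own numbers (3072 of 4096 MB, 6144 of 8192 MB). It is a rule of thumb rather than a value Hadoop enforces, and the helper name is mine.

```python
# Derive a -Xmx value from a container size, leaving headroom for non-heap memory.
# The 0.75 fraction is inferred from the tutorial's example, not mandated by Hadoop.

def heap_opts(container_mb: int, heap_fraction: float = 0.75) -> str:
    """Return a JVM -Xmx option sized as a fraction of the container memory."""
    heap_mb = int(container_mb * heap_fraction)
    return f"-Xmx{heap_mb}m"


if __name__ == "__main__":
    print("mapreduce.map.java.opts:", heap_opts(4096))
    print("mapreduce.reduce.java.opts:", heap_opts(8192))
```

The remaining quarter of the container is headroom for JVM memory outside the heap (thread stacks, metaspace or permgen, native buffers), which also counts towards the container's physical memory limit.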

There is a check at the YARN level on the ratio of virtual to physical memory usage.
The issue is not only that the VM doesn't have sufficient physical memory; it is also that the virtual memory usage is higher than expected for the given physical memory.
Note: This happens on CentOS/RHEL 6 due to its aggressive allocation of virtual memory.
It can be resolved in either of two ways:
Disable the virtual memory usage check by setting yarn.nodemanager.vmem-check-enabled to false;
Increase the VM:PM ratio by setting yarn.nodemanager.vmem-pmem-ratio to a higher value.
References :
https://issues.apache.org/jira/browse/HADOOP-11364
http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/
Add the following properties to yarn-site.xml:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
<description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>
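To make the numbers in the error messages concrete, the following Python sketch recomputes both limits. It relies on the virtual memory limit being the container's physical memory multiplied by yarn.nodemanager.vmem-pmem-ratio, whose stock default is 2.1. The function name and the dictionary layout are mine.

```python
# Recompute the limits that appear in the "running beyond ... memory limits" messages.
# vmem limit = physical container size * yarn.nodemanager.vmem-pmem-ratio (default 2.1).

def check_container(pmem_used_gb: float, pmem_limit_gb: float,
                    vmem_used_gb: float, ratio: float = 2.1) -> dict:
    """Report the virtual memory limit and which limits the container exceeds."""
    vmem_limit_gb = pmem_limit_gb * ratio
    return {
        "vmem_limit_gb": vmem_limit_gb,
        "over_physical": pmem_used_gb > pmem_limit_gb,
        "over_virtual": vmem_used_gb > vmem_limit_gb,
    }


if __name__ == "__main__":
    # First error in the question: 1.2 GB of 1 GB physical, 2.2 GB of 2.1 GB virtual.
    print(check_container(1.2, 1.0, 2.2))
    # Second error: 4.2 GB of 4 GB physical, 5.2 GB of 8.4 GB virtual.
    print(check_container(4.2, 4.0, 5.2))
    # Same usage as the first error, but with the ratio raised to 4.
    print(check_container(1.2, 1.0, 2.2, ratio=4.0))
```

Raising the ratio to 4 moves the virtual limit for a 1 GB container to 4 GB, so the virtual memory check would pass, but the physical limit is still exceeded in both cases from the question. That limit is governed by the container size, not by the ratio.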

I had a really similar issue using HIVE in EMR. None of the extant solutions worked for me -- ie, none of the mapreduce configurations worked for me; and neither did setting yarn.nodemanager.vmem-check-enabled to false.
However, what ended up working was setting tez.am.resource.memory.mb, for example:
hive -hiveconf tez.am.resource.memory.mb=4096
Another setting to consider tweaking is yarn.app.mapreduce.am.resource.mb

I can't comment on the accepted answer, due to low reputation. However, I would like to add, this behavior is by design. The NodeManager is killing your container. It sounds like you are trying to use hadoop streaming which is running as a child process of the map-reduce task. The NodeManager monitors the entire process tree of the task and if it eats up more memory than the maximum set in mapreduce.map.memory.mb or mapreduce.reduce.memory.mb respectively, we would expect the Nodemanager to kill the task, otherwise your task is stealing memory belonging to other containers, which you don't want.

While working with spark in EMR I was having the same problem and setting maximizeResourceAllocation=true did the trick; hope it helps someone. You have to set it when you create the cluster. From the EMR docs:
aws emr create-cluster --release-label emr-5.4.0 --applications Name=Spark \
--instance-type m3.xlarge --instance-count 2 --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole --configurations https://s3.amazonaws.com/mybucket/myfolder/myConfig.json
Where myConfig.json should say:
[
{
"Classification": "spark",
"Properties": {
"maximizeResourceAllocation": "true"
}
}
]

We also faced this issue recently. If the issue is related to mapper memory, there are a couple of things I would suggest checking.
Check whether the combiner is enabled. If it is, the reduce logic has to be run on all the records (the output of the mapper), and this happens in memory. Depending on your application, you need to check whether enabling the combiner helps or not. The trade-off is between the bytes transferred over the network and the time, memory and CPU taken by the reduce logic on 'X' records.
If you feel that the combiner is not of much value, just disable it.
If you need the combiner and 'X' is a huge number (say millions of records), then consider changing your split logic (for the default input formats, use a smaller block size; normally 1 block = 1 split) to map fewer records to a single mapper.
Check the number of records processed in a single mapper. Remember that all these records need to be sorted in memory (the output of the mapper is sorted). Consider setting mapreduce.task.io.sort.mb (the stock Apache default is 100 MB) to a higher value if needed. It is set in mapred-site.xml.
If none of the above helps, try running the mapper logic as a standalone application and profile it with a profiler (like JProfiler) to see where the memory is being used. This can give you very good insights.

Running YARN on the Windows Subsystem for Linux with Ubuntu, I got the error "running beyond virtual memory limits, Killing container".
I resolved it by disabling the virtual memory check in yarn-site.xml:
<property> <name>yarn.nodemanager.vmem-check-enabled</name> <value>false</value> </property>

I am practicing Hadoop programs (version hadoop3). I installed Linux in VirtualBox, where only very limited memory is allocated to the Linux guest at installation time.
After setting the following memory limit properties in mapred-site.xml and restarting HDFS and YARN, my program worked.
<property>
<name>mapreduce.map.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4096</value>
</property>

I haven't personally checked, but hadoop-yarn-container-virtual-memory-understanding-and-solving-container-is-running-beyond-virtual-memory-limits-errors sounds very reasonable.
I solved the issue by changing yarn.nodemanager.vmem-pmem-ratio to a higher value, and I would agree that:
Another less recommended solution is to disable the virtual memory check by setting yarn.nodemanager.vmem-check-enabled to false.
