Hadoop 2: why are there two Linux processes for each map or reduce task?

We are trying to migrate our jobs from Hadoop 1.0.3 to Hadoop 2 (Hadoop 2.8.1, a single-node cluster, to be precise). We are using YARN to manage our map-reduce jobs. One of the differences we have noticed is the presence of two Linux processes for each map or reduce task that is scheduled for execution. For example, for any of our reduce tasks, we find these two processes executing:
hadoop 124692 124690 0 12:33 ? 00:00:00 /bin/bash -c /opt/java/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx5800M -XX:-UsePerfData -Djava.io.tmpdir=/tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/application_1510651062679_0001/container_1510651062679_0001_01_000278/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/opt/hadoop/hadoop-2.8.1/logs/userlogs/application_1510651062679_0001/container_1510651062679_0001_01_000278 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog -Dyarn.app.mapreduce.shuffle.logger=INFO,shuffleCLA -Dyarn.app.mapreduce.shuffle.logfile=syslog.shuffle -Dyarn.app.mapreduce.shuffle.log.filesize=0 -Dyarn.app.mapreduce.shuffle.log.backups=0 org.apache.hadoop.mapred.YarnChild 192.168.101.29 33929 attempt_1510651062679_0001_r_000135_0 278 1>/opt/hadoop/hadoop-2.8.1/logs/userlogs/application_1510651062679_0001/container_1510651062679_0001_01_000278/stdout 2>/opt/hadoop/hadoop-2.8.1/logs/userlogs/application_1510651062679_0001/container_1510651062679_0001_01_000278/stderr
hadoop 124696 124692 74 12:33 ? 00:10:30 /opt/java/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx5800M -XX:-UsePerfData -Djava.io.tmpdir=/tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/application_1510651062679_0001/container_1510651062679_0001_01_000278/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/opt/hadoop/hadoop-2.8.1/logs/userlogs/application_1510651062679_0001/container_1510651062679_0001_01_000278 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog -Dyarn.app.mapreduce.shuffle.logger=INFO,shuffleCLA -Dyarn.app.mapreduce.shuffle.logfile=syslog.shuffle -Dyarn.app.mapreduce.shuffle.log.filesize=0 -Dyarn.app.mapreduce.shuffle.log.backups=0
The second process is a child of the first one. All in all, we see that the overall number of processes during our job execution is much higher than it was with Hadoop 1.0.3, where only one process was executing for each map or reduce task.
a) Could this be a reason for the job running considerably more slowly than it does with Hadoop 1.0.3?
b) Are these two processes the intended way this works?
Thank you in advance for your advice.

On closer inspection you will find:
Pid 124692 is /bin/bash
Pid 124696 is /opt/java/bin/java
/bin/bash is the container process, which spawns the Java process within the enclosed environment (CPU and RAM restricted to the container).
You can think of this as a virtual machine inside which you can run your own process. The virtual machine and the process running inside it have a parent-child relationship.
Please read about Linux containers in detail to learn more.
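As a quick sanity check, you can inspect this parent-child relationship yourself on the NodeManager host. A minimal sketch, assuming a Linux machine with pstree and ps available (124692 is the bash PID from the listing above; substitute your own):
# Show the process tree rooted at the bash container launcher
pstree -p 124692
# List the direct children of the launcher; you should see exactly one java (YarnChild) process
ps -o pid,ppid,cmd --ppid 124692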

Related

Cloudera quickstart CDH 5.15 cluster is RUNNING slow

I have a Cloudera quickstart CDH 5.15 cluster that is very slow.
When I run a simple Hadoop command like "hadoop fs -ls", it takes almost 20 seconds,
but when I run a local command like "ls", it is very fast. Please help me with this.
The quickstart VM requires 6-8 GB of RAM to work reliably.
But the JVM startup for any hadoop command is going to be much, much slower than built-in shell commands that do something similar. There is no way around that fact.
If you want the Hadoop ls command to be quicker, it would be beneficial to set up an actual distributed cluster with adequate memory for the Namenode process, which is what ls contacts.
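You can observe the JVM-startup overhead directly by timing the two commands. A minimal sketch, run from a shell on the VM (the path / is just an example):
# Local listing: returns almost instantly
time ls /
# HDFS listing: pays JVM startup plus a NameNode round trip on every invocation
time hadoop fs -ls /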

Spark on Hadoop YARN - executor missing

I have a cluster of 3 macOS machines running Hadoop and Spark-1.5.2 (though with Spark-2.0.0 the same problem exists). With 'yarn' as the Spark master URL, I am running into a strange issue where tasks are only allocated to 2 of the 3 machines.
Based on the Hadoop dashboard (port 8088 on the master) it is clear that all 3 nodes are part of the cluster. However, any Spark job I run only uses 2 executors.
For example, here is the "Executors" tab from a lengthy run of the JavaWordCount example:
"batservers" is the master. There should be an additional slave, "batservers2", but it's just not there.
Why might this be?
Note that none of my YARN or Spark (or, for that matter, HDFS) configurations are unusual, except provisions for giving the YARN resource- and node-managers extra memory.
Remarkably, all it took was a detailed look at the spark-submit help message to discover the answer:
YARN-only:
...
--num-executors NUM Number of executors to launch (Default: 2).
If I specify --num-executors 3 in my spark-submit command, the 3rd node is used.
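For example, a minimal sketch of such a submission (the examples jar path and input file are placeholders; adjust them for your Spark install):
spark-submit --master yarn \
--num-executors 3 \
--class org.apache.spark.examples.JavaWordCount \
/path/to/spark-examples.jar input.txt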

Each Data Node in a Hadoop cluster is constantly reading disk

Some time ago I found out that each of our data nodes is constantly reading from disk at an accumulated speed of ~10M/s. I found this out with the iotop utility.
What I've done so far to diagnose it:
I tried to stop different services on the cluster, but the reading only stops when I stop the HDFS service completely
I checked the logs of a data node, but can only see some HDFS_WRITE operations happening every 1-2 minutes, nothing about reading data. I checked during idle time, of course
Some info on our system:
we're using a CDH distro, 5.8 now, but the problem started several versions ago
no running jobs in YARN at that moment
the issue has been present for several months, 24/7, and it wasn't there before
My prime suspect for now is some auditing process in CDH. Unfortunately, I couldn't find any good documentation on administering these processes.
Here is information on a data node process from the ps -ef output:
hdfs 58093 6398 10 Oct11 ? 02:56:30 /usr/lib/jvm/java-8-oracle/bin/java -Dproc_datanode -Xmx1000m -Dhdfs.audit.logger=INFO,RFAAUDIT -Dsecurity.audit.logger=INFO,RFAS -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/var/log/hadoop-hdfs -Dhadoop.log.file=hadoop-cmf-hdfs-DATANODE-hadoop-worker-03.srv.mycompany.org.log.out -Dhadoop.home.dir=/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hadoop -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Djava.library.path=/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -server -Xms1073741824 -Xmx1073741824 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:OnOutOfMemoryError=/usr/lib/cmf/service/common/killparent.sh -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.datanode.DataNode
I'll be really grateful for any clues on this issue.
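One way to narrow this down (not from the thread above, just a common diagnostic approach) is to identify the busy thread and map it to a DataNode thread name. A sketch, assuming Linux with iotop and jstack available; 58093 is the DataNode PID from the listing, and the TID 58123 is purely illustrative:
# Sample active I/O threads in batch mode (run as root) and note the busy java TID
iotop -o -b -n 3 | grep java
# Convert that TID to hex (58123 -> e30b here)
printf '%x\n' 58123
# Find the matching thread name in the DataNode's thread dump
sudo -u hdfs jstack 58093 | grep -i 'nid=0xe30b'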

Pig job gets killed on Amazon EMR

I have been trying to run a pig job with multiple steps on Amazon EMR. Here are the details of my environment:
Number of nodes: 20
AMI Version: 3.1.0
Hadoop Distribution: 2.4.0
The pig script has multiple steps and spawns a long-running map-reduce job with both a map phase and a reduce phase. After running for some time (sometimes an hour, sometimes three or four), the job is killed. The information on the resource manager for the job is:
Kill job received from hadoop (auth:SIMPLE) at
Job received Kill while in RUNNING state.
Obviously, I did not kill it :)
My question is: how do I go about identifying what exactly happened? How do I diagnose the issue? Which log files should I look at (and what should I grep for)? Any pointers to where the appropriate log files live would be greatly appreciated. I am new to YARN/Hadoop 2.0.
There can be a number of reasons. Enable debugging on your cluster and look in the stderr logs for more information.
aws emr create-cluster --name "Test cluster" --ami-version 3.9 --log-uri s3://mybucket/logs/ \
--enable-debugging --applications Name=Hue Name=Hive Name=Pig
More details here:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-debugging.html
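With debugging enabled, the step and container logs land under the --log-uri bucket, so you can pull them down and search for the kill event. A sketch, assuming the bucket from the command above and a placeholder cluster ID:
# Copy the cluster's logs locally (j-XXXXXXXX is a placeholder)
aws s3 sync s3://mybucket/logs/j-XXXXXXXX/ ./emr-logs/
# Search stderr/syslog for the kill and its surrounding context
grep -r -i "kill" ./emr-logs/ | less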

CDH4.4: Restarting HDFS and MapReduce from shell

I'm trying to automate stopping, formatting and starting HDFS and MapReduce services on a Cloudera Hadoop 4.4 cluster, using a bash script.
It's easy to kill HDFS and MapReduce processes using "pkill -U hdfs && pkill -U mapred", but how can I start those processes again, without using the Cloudera Manager GUI?
Well, apparently CM has a pretty sweet API.
Check it out here:
http://cloudera.github.io/cm_api/
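For example, a minimal sketch of restarting the services over the CM REST API with curl (host, credentials, cluster name, service names, and the API version are all placeholders; check what your CM release supports):
# Restart HDFS via the Cloudera Manager API
curl -X POST -u admin:admin \
"http://cm-host:7180/api/v6/clusters/Cluster1/services/hdfs1/commands/restart"
# Same pattern for MapReduce
curl -X POST -u admin:admin \
"http://cm-host:7180/api/v6/clusters/Cluster1/services/mapreduce1/commands/restart"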
