Is the mapred process in Hadoop multi-threaded?

I've configured our Hadoop cluster with mapred_map_tasks_max set to 6 and, as expected, I see 6 mapred processes running when kicking off Pig jobs.
I am, however, a bit surprised to see the CPU usage on some of these individual processes exceed 100%, sometimes reaching 1000%+. Does MapReduce default to multiple threads? Could this be something with Pig itself?
All I could find online was some information about a setting (mapred.map.runner.class), but this doesn't appear to be set to a multithreaded runner in any way.
Thanks.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2630 mapred 20 0 53.4g 2.8g 12m S 218.1 4.5 1:17.32 java
2553 mapred 20 0 53.4g 2.8g 12m S 110.7 4.5 1:25.07 java
2636 mapred 20 0 53.4g 2.8g 12m S 110.4 4.5 1:11.58 java
2437 mapred 20 0 53.5g 5.6g 12m S 108.1 8.8 3:46.52 java
2353 mapred 20 0 53.5g 5.2g 12m S 101.1 8.3 3:35.27 java
2239 mapred 20 0 53.5g 5.8g 12m S 82.6 9.3 3:54.47 java

It is possible with Hadoop to use a multithreaded mapper (see http://kickstarthadoop.blogspot.com/2012/02/enable-multiple-threads-in-mapper-aka.html). As far as I know, Pig doesn't support multithreaded jobs (although you can call Pig servers from multiple threads... https://issues.apache.org/jira/browse/PIG-240).
That said, Pig will by default run multiple mappers/reducers on the same host, one mapper/reducer per available core.
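For reference, a minimal sketch of enabling a multithreaded mapper in a plain MapReduce job (not through Pig), along the lines of the linked post. MyMapper is a hypothetical mapper class and the thread count is an assumption, not a recommendation:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "multithreaded-map-sketch");
job.setMapperClass(MultithreadedMapper.class);            // wrapper that runs the real mapper in several threads
MultithreadedMapper.setMapperClass(job, MyMapper.class);  // the actual map logic (hypothetical class)
MultithreadedMapper.setNumberOfThreads(job, 4);           // threads per map task (assumed value)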

Related

Hadoop multinode cluster too slow. How do I increase speed of data processing?

I have a 6 node cluster - 5 DNs and 1 NN. All have 32 GB RAM. All slaves have 8.7 TB HDDs; the NN has a 1.1 TB HDD. Here is the link to my core-site.xml, hdfs-site.xml, yarn-site.xml.
After running an MR job, I checked my RAM usage, which is shown below:
Namenode:
free -g
total used free shared buff/cache available
Mem: 31 7 15 0 8 22
Swap: 31 0 31
Datanodes:
Slave1:
free -g
total used free shared buff/cache available
Mem: 31 6 6 0 18 24
Swap: 31 3 28
Slave2:
total used free shared buff/cache available
Mem: 31 2 4 0 24 28
Swap: 31 1 30
Likewise, the other slaves have similar RAM usage. Even when only a single job is running, any other submitted jobs enter the ACCEPTED state and wait for the first job to finish before they start.
Here is the ps output for the JAR that I submitted to execute the MR job:
/opt/jdk1.8.0_77//bin/java -Dproc_jar -Xmx1000m
-Dhadoop.log.dir=/home/hduser/hadoop/logs -Dyarn.log.dir=/home/hduser/hadoop/logs
-Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log
-Dyarn.home.dir= -Dyarn.id.str= -Dhadoop.root.logger=INFO,console
-Dyarn.root.logger=INFO,console -Dyarn.policy.file=hadoop-policy.xml
-Dhadoop.log.dir=/home/hduser/hadoop/logs -Dyarn.log.dir=/home/hduser/hadoop/logs
-Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log
-Dyarn.home.dir=/home/hduser/hadoop -Dhadoop.home.dir=/home/hduser/hadoop
-Dhadoop.root.logger=INFO,console -Dyarn.root.logger=INFO,console
-classpath --classpath of jars
org.apache.hadoop.util.RunJar abc.jar abc.mydriver2 /raw_data /mr_output/02
Are there any settings I can change/add to allow multiple jobs to run simultaneously and speed up the current data processing? I am using Hadoop 2.5.2. The cluster is in a PROD environment and I cannot take it down to upgrade the Hadoop version.
EDIT 1: I started a new MR job with 362 GB of data and still the RAM usage is around 8 GB, with 22 GB of RAM free. Here is my job submission command -
nohup yarn jar abc.jar def.mydriver1 /raw_data /mr_output/01 &
Here is some more information :
18/11/22 14:09:07 INFO input.FileInputFormat: Total input paths to process : 130363
18/11/22 14:09:10 INFO mapreduce.JobSubmitter: number of splits:130372
Are there additional memory parameters we can use when submitting the job to get more efficient memory usage?
I believe you can edit mapred-default.xml.
The params you are looking for are:
mapreduce.job.running.map.limit
mapreduce.job.running.reduce.limit
0 (probably what it is set to at the moment) means unlimited.
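For illustration only, a sketch of how these properties could be set. Site-specific overrides normally go in mapred-site.xml rather than mapred-default.xml, and the numbers below are placeholder assumptions (check that your Hadoop version supports these properties), not tuned values:
<!-- mapred-site.xml (illustrative values) -->
<property>
  <name>mapreduce.job.running.map.limit</name>
  <value>100</value> <!-- max concurrently running map tasks per job; 0 = unlimited -->
</property>
<property>
  <name>mapreduce.job.running.reduce.limit</name>
  <value>20</value> <!-- max concurrently running reduce tasks per job; 0 = unlimited -->
</property>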
Looking at your memory, 32 GB per machine seems too small.
What CPUs/cores do you have? I would expect a quad CPU / 16 cores minimum per machine.
Based on your yarn-site.xml, your yarn.scheduler.minimum-allocation-mb setting of 10240 is too high. This effectively means you have at best 18 vcores available. That might be the right setting for a cluster with tons of memory, but for 32 GB it's way too large. Drop it to 1 or 2 GB.
Remember, HDFS block sizes are what each mapper typically consumes, so 1-2 GB of memory for 128 MB of data sounds much more reasonable. The added benefit is you could have up to 180 vcores available, which could process jobs up to 10x faster than 18 vcores.
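For illustration, the corresponding yarn-site.xml change might look like this (1 GB is an assumed value for a 32 GB node, not something taken from the poster's configuration):
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value> <!-- was 10240; a smaller minimum lets YARN pack more containers per node -->
</property>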
To give you an idea of how a 4-node cluster with 32 cores and 128 GB RAM per node is set up:
For Tez: divide RAM by cores to get the max Tez container size.
So in my case: 128/32 = 4 GB.

Flink 1.6 bucketing sink HDFS files stuck in .in-progress

I am writing a Kafka data stream to a bucketing sink in an HDFS path. Kafka gives out string data; I am using FlinkKafkaConsumer010 to consume from Kafka.
-rw-r--r-- 3 ubuntu supergroup 4097694 2018-10-19 19:16 /streaming/2018-10-19--19/_part-0-1.in-progress
-rw-r--r-- 3 ubuntu supergroup 3890083 2018-10-19 19:16 /streaming/2018-10-19--19/_part-1-1.in-progress
-rw-r--r-- 3 ubuntu supergroup 3910767 2018-10-19 19:16 /streaming/2018-10-19--19/_part-2-1.in-progress
-rw-r--r-- 3 ubuntu supergroup 4053052 2018-10-19 19:16 /streaming/2018-10-19--19/_part-3-1.in-progress
This happens only when I use some mapping function to manipulate the stream data on the fly. If I write the stream directly to HDFS, it works fine. Any idea why this might be happening? I am using Flink 1.6.1, Hadoop 3.1.1 and Oracle JDK 1.8.
A little bit late for this question, but I also experienced a similar issue.
I have a case class Address
case class Address(val i: Int)
and I read the source from a collection containing a number of Address instances, for example
env.fromCollection(Seq(new Address(...), ...))
...
val customAvroFileSink = StreamingFileSink
  .forBulkFormat(
    new Path("/tmp/data/"),
    ParquetAvroWriters.forReflectRecord(classOf[Address]))
  .build()
...
xxx.addSink(customAvroFileSink)
With checkpointing enabled, my Parquet files would also end up stuck in-progress.
I found that Flink finished the program before a checkpoint was triggered, so my results were never fully flushed to disk. After I changed the checkpoint interval to a smaller value, the Parquet files were no longer left in-progress.
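As a rough sketch of that change in the Java API (the 10-second interval is just an assumed example, not a recommended value):
// org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(10000L); // checkpoint every 10 seconds so pending part files get finalized sooner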
This scenario generally happens when checkpointing is disabled.
Could you check the checkpointing settings while running the job with the mapping function? It looks like you have checkpointing enabled only for the job that writes directly to HDFS.
I had a similar issue and enabling checkpointing and changing the state backend from the default MemoryStateBackend to FsStateBackend worked. In my case, checkpointing failed because MemoryStateBackend had a maxStateSize that was too small such that the state of one of the operations could not fit in memory.
// Uses org.apache.flink.runtime.state.filesystem.FsStateBackend and java.time.Duration.
StateBackend stateBackend = new FsStateBackend("file:///home/ubuntu/flink_state_backend");
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment()
    .enableCheckpointing(Duration.ofSeconds(60).toMillis())
    .setStateBackend(stateBackend);

Hadoop 2: why are there two linux processes for each map or reduce task?

We are trying to migrate our jobs to Hadoop 2 (Hadoop 2.8.1, single node cluster, to be precise) from Hadoop 1.0.3. We are using YARN to manage our map-reduce jobs. One of the differences that we have noticed is the presence of two Linux processes for each map or reduce task that is planned for execution. For example, for any of our reduce tasks, we find these two executing processes:
hadoop 124692 124690 0 12:33 ? 00:00:00 /bin/bash -c /opt/java/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx5800M -XX:-UsePerfData -Djava.io.tmpdir=/tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/application_1510651062679_0001/container_1510651062679_0001_01_000278/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/opt/hadoop/hadoop-2.8.1/logs/userlogs/application_1510651062679_0001/container_1510651062679_0001_01_000278 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog -Dyarn.app.mapreduce.shuffle.logger=INFO,shuffleCLA -Dyarn.app.mapreduce.shuffle.logfile=syslog.shuffle -Dyarn.app.mapreduce.shuffle.log.filesize=0 -Dyarn.app.mapreduce.shuffle.log.backups=0 org.apache.hadoop.mapred.YarnChild 192.168.101.29 33929 attempt_1510651062679_0001_r_000135_0 278 1>/opt/hadoop/hadoop-2.8.1/logs/userlogs/application_1510651062679_0001/container_1510651062679_0001_01_000278/stdout 2>/opt/hadoop/hadoop-2.8.1/logs/userlogs/application_1510651062679_0001/container_1510651062679_0001_01_000278/stderr
hadoop 124696 124692 74 12:33 ? 00:10:30 /opt/java/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx5800M -XX:-UsePerfData -Djava.io.tmpdir=/tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/application_1510651062679_0001/container_1510651062679_0001_01_000278/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/opt/hadoop/hadoop-2.8.1/logs/userlogs/application_1510651062679_0001/container_1510651062679_0001_01_000278 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog -Dyarn.app.mapreduce.shuffle.logger=INFO,shuffleCLA -Dyarn.app.mapreduce.shuffle.logfile=syslog.shuffle -Dyarn.app.mapreduce.shuffle.log.filesize=0 -Dyarn.app.mapreduce.shuffle.log.backups=0
The second process is a child of the first one. All in all, we see that the overall number of processes during our job execution is much higher than it was with Hadoop 1.0.3, where only one process was running for each map or reduce task.
a) Could this be a reason for the job running quite a bit slower than it does with Hadoop 1.0.3?
b) Are these two processes the intended way it all works?
Thank you in advance for your advice.
On a closer check you will find:
PID 124692 is /bin/bash
PID 124696 is /opt/java/bin/java
/bin/bash is the container launcher process, which spawns the Java process within the enclosed environment (CPU and RAM restricted to the container).
You can think of this as a virtual machine inside which you run your own process; the virtual machine and the process running inside it have a parent-child relationship.
Please read about Linux containers in detail to learn more.
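For example, you can confirm the parent-child relationship directly with ps, using the PIDs from the listing above:
ps -o pid,ppid,cmd -p 124692,124696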

Spark Program running very slow on cluster

I am trying to run my PySpark job on a cluster with 2 nodes and 1 master (all have 16 GB RAM). I ran my Spark job with the command below.
spark-submit --master yarn --deploy-mode cluster --name "Pyspark"
--num-executors 40 --executor-memory 2g CD.py
However, my code runs very slowly; it takes almost 1 hour to parse 8.2 GB of data.
Then I tried to change the configuration in YARN. I changed the following properties:
yarn.scheduler.increment-allocation-mb = 2 GiB
yarn.scheduler.minimum-allocation-mb = 2 GiB
yarn.scheduler.maximum-allocation-mb = 2 GiB
After making these changes, my Spark job still runs very slowly, taking more than 1 hour to parse 8.2 GB of files.
Could you please try the configuration below:
spark.executor.memory 5g
spark.executor.cores 5
spark.executor.instances 3
spark.driver.cores 2
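For illustration, the same settings could also be passed straight on the submit command (script name as in the question; the values simply mirror the suggested configuration, not tuned numbers):
spark-submit --master yarn --deploy-mode cluster --name "Pyspark" \
  --num-executors 3 --executor-memory 5g --executor-cores 5 \
  --driver-cores 2 CD.py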

How to clean up Hadoop MapReduce memory usage?

I want to ask: say, for example, I have 10 MB of free memory on each node after I run start-all.sh, i.e., after starting the NameNode, DataNode, Secondary NameNode, etc. After I've run a Hadoop MapReduce job, why does the free memory decrease to, say, 5 MB, even though the MapReduce job has finished?
How can I get back to 10 MB of free memory? Thanks all.
Maybe you can try the Linux command for clearing cached memory:
echo 3 > /proc/sys/vm/drop_caches
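If you do try it, a common pattern is to flush dirty pages first and run it with root privileges; note this only drops the page cache and dentry/inode caches, which the kernel would reclaim on demand anyway:
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches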
