Hadoop 2.6.0 how to configure job to have less running mappers - hadoop

I have some jobs that are long-time running but not that important.. some cleaning/indexing jobs... If I am running the job right now I am blocking rest of jobs. all new jobs stay in queue in pending mode..
What I would like is: reduce number of running mappers for a specific job. Can I do that?

Related

hadoop jobs in deadlock with pyspark and oozie

I am trying to run pyspark on yarn with oozie, after submitting the workflow, there are 2 jobs in the hadoop job queue, one is the oozie job , which is with the application type "map reduce", and another job triggered by the previous one, with application type "Spark", while the first job is running, the second job remains in 'accepted" status. here comes the problem, while the first job is waiting for the second job to finish to proceed, the second is waiting for the first one to finish to run, I may be stuck in a dead lock, how could I get rid of this trouble, is there anyway the hadoop job with application type "mapreduce" run parallel with other jobs of different application type?
Any advice is appreciated, thanks!
Please check the value for property into Yarn scheduler configuration. I guess you need to increase it to something like .9 or so.
Property: yarn.scheduler.capacity.maximum-am-resource-percent
You would need to start Yarn, MapReduce and Oozie after updating the property.
More info: Setting Application Limits.

Yarn capacity-scheduler Parallelize

Does capacity-scheduler in yarn run app in parallel on the same queue for the same user.
For example:If we have 2 hive CLI on 2 terminals with same user, and the same query is started on both, do they execute on the default queue in parallel or sequentially.
Currently, the UI shows 1 running, and 1 in pending state:
Is there a way to run it in parallel?
Yarn capacity scheduler run jobs in FIFO manner for the jobs submitted in the same queue. For example if both the hive cli's got submitted for default queue then which ever able to secure resources first will get into running state and other will wait(only if enough resources are not present in the queue).
If you want parallel execution
1) you can run other job in different queue.You can define the queue name while launching job on yarn.
2) You need to define resources in a manner so that both job can get resources as desired.

YARN Queue Can't Run More Than One Spark Job

I can run several jobs (MapReduce, Hive) in one queue. But if I run a Spark/Spark Streaming job, every job added after that will be in ACCEPTED state but not RUNNING. Only after I kill the Spark job the other job will be RUNNING.
I tried to create a different queue for Spark and non Spark jobs, they work as expected but this is not what I want.
My questions:
1. Is this YARN or Spark config issue?
2. What is the right config to solve that issue?
Any helps will be appreciated, thanks.

Hadoop on Mesos uses only one node?

I have successfully set up Mesos 0.22.1 cluster on 5 nodes. I can run Marathon and Chronos tasks on all slave nodes. Now I’m trying to run Hadoop jobs using Mesos Scheduler. I have followed very good tutorial and I could run wordcount test job. But when I try to run some larger job (loading data from Kafka to HDFS using Camus) job is running without the errors, but uses only one node with one task tracker, though it has in total 30 map jobs, and my nodes configured to run 2 map jobs in parallel.
What am I missing? Shouldn’t Jobtracker split task to run in parallel on all available nodes using 2 Map slots on eash node?
And what is strange - on Jobtracker webpage cluster summary reports only 1 available node. Is it correct behavior?
Any ideas are greatly appreciated!

Hadoop removes MapReduce history when it is restarted

I am carrying out several Hadoop tests using TestDFSIO and TeraSort benchmark tools. I am basically testing with different amount of datanodes in order to assess the linearity of the processing capacity and datanode scalability.
During the above mentioned process, I have obviously had to restart several times all Hadoop environment. Every time I restarted Hadoop, all MapReduce jobs are removed and the job counter starts again from "job_2013*_0001". For comparison reasons, it is very important for me to keep all the MapReduce jobs up that I have previously launched. So, my question is:
¿How can I avoid Hadoop removes all MapReduce-job history after it is restarted?
¿Is there some property to control job removing after Hadoop environment restarting?
Thanks!
the MR job history logs are not deleted right way after you restart hadoop, the new job will be counted from *_0001 and only new jobs which are started after hadoop restart will be displayed on resource manager web portal though. In fact, there are 2 log related settings from yarn default:
# this is where you can find the MR job history logs
yarn.nodemanager.log-dirs = ${yarn.log.dir}/userlogs
# this is how long the history logs will be retained
yarn.nodemanager.log.retain-seconds = 10800
and the default ${yarn.log.dir} is defined in $HADOOP_HONE/etc/hadoop/yarn-env.sh.
YARN_LOG_DIR="$HADOOP_YARN_HOME/logs"
BTW, similar settings could also be found in mapred-env.sh if you are use Hadoop 1.X

Resources