Yarn capacity-scheduler Parallelize - hadoop

Does the YARN capacity scheduler run applications in parallel on the same queue for the same user?
For example: if we have 2 Hive CLIs open on 2 terminals under the same user, and the same query is started on both, do they execute on the default queue in parallel or sequentially?
Currently, the UI shows 1 application running and 1 in the pending state.
Is there a way to run them in parallel?

The YARN capacity scheduler runs jobs submitted to the same queue in FIFO order. For example, if both Hive CLIs submit to the default queue, whichever secures resources first goes into the running state and the other waits (but only when the queue does not have enough resources for both).
If you want parallel execution:
1) You can run the other job in a different queue. You can specify the queue name while launching the job on YARN (see the sketch below).
2) You need to size the queues so that both jobs can get the resources they need.
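For illustration, here is a minimal sketch of that setup, assuming the Capacity Scheduler is configured through capacity-scheduler.xml; the second queue name (hive2) and the capacity split are placeholders, not values from the question:

    <!-- capacity-scheduler.xml: split capacity between two queues -->
    <property>
      <name>yarn.scheduler.capacity.root.queues</name>
      <value>default,hive2</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.default.capacity</name>
      <value>60</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.hive2.capacity</name>
      <value>40</value>
    </property>

    -- in the second Hive CLI, route its MapReduce jobs to the other queue
    set mapreduce.job.queuename=hive2;

With this layout one CLI stays on default and the other runs on hive2, so neither blocks the other as long as each queue has enough capacity for its own job.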

Related

Multiple instances of a partitioned spring batch job

I have a Spring Batch partitioned job. The job is always started with a unique set of parameters, so it is always a new job.
My remoting fabric is JMS with request/response queues configured for communication between the masters and slaves.
One instance of this partitioned job processes files in a given folder. Master step gets the file names from the folder and submits the file names to the slaves; each slave instance processes one of the files.
The job works fine.
Recently, I started to execute multiple instances (completely separate JVMs) of this job to process files from multiple folders. So I essentially have multiple master steps running but the same set of slaves.
I sometimes notice the following behavior at random: the slaves finish their work, but the master keeps spinning, thinking the slaves are still doing something. The step status shows successful in the job repository, but at the job level the status is STARTING with an exit code of UNKNOWN.
All masters share the set of request/response queues; one queue for requests and one for responses.
Is this a supported configuration? Can multiple master steps sharing the same set of queues run concurrently? Based on the behavior above, I suspect the responses coming back from the workers are going to the wrong master.

Hadoop 2.6.0: how to configure a job to have fewer running mappers

I have some jobs that are long-running but not that important, such as cleaning/indexing jobs. If I run one of them right now, it blocks the rest of the jobs: all new jobs stay in the queue in the pending state.
What I would like is to reduce the number of running mappers for a specific job. Can I do that?
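One way to get roughly that effect (a sketch only, assuming the Capacity Scheduler is in use; the queue name maintenance and the percentages are placeholders) is to submit the low-priority job to a small, capped queue so it can never occupy more than a fraction of the cluster:

    <!-- capacity-scheduler.xml: a small queue for background jobs -->
    <property>
      <name>yarn.scheduler.capacity.root.queues</name>
      <value>default,maintenance</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.maintenance.capacity</name>
      <value>10</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.maintenance.maximum-capacity</name>
      <value>20</value>
    </property>

    # submit the cleaning/indexing job to the capped queue
    # (jar and class names are placeholders; -D works when the driver uses ToolRunner)
    hadoop jar cleanup-job.jar com.example.CleanupJob -Dmapreduce.job.queuename=maintenance in out

This caps the job's overall resource share rather than the mapper count directly, which in practice limits how many of its mappers can run at once.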

Number of Application Masters in a MapReduce job? And MapReduce processing steps in YARN

I know that there is only one Resource Manager in a Hadoop cluster.
From my understanding, there should be only one Application Master per cluster as well. Is that right? Following is my understanding of how a MapReduce job is run in YARN. Please correct me if my understanding is not right.
Application execution sequence of steps on YARN:
The client submits a job to the Resource Manager (RM). The RM runs on the master node. There is only one RM across the cluster to manage the resources. The Resource Manager is a daemon process.
The RM will go to HDFS through the Name Node.
The RM spins up an Application Master (AM). The AM will reach HDFS through the Name Node. It will create a mapper matrix. This is the mapper phase, e.g. whether Block 1 is available on Name Node 5 or 6.
Based on the mapper matrix information, the AM sends requests to individual Node Managers (NM) to run a particular task for each block. The NM runs on a slave node.
Each NM sends a request to the RM to get a container. A container executes an application-specific process with a constrained set of resources (memory, CPU, etc.).
The mapper task runs in the container and sends heartbeats to the Application Master. The AM also sends heartbeats to the RM.
After all the mapper processes are done, the AM builds another matrix for the reducer tasks.
After all the reducer tasks are completed, the AM sends the results to the RM.
The RM lets the client know the results and kills the AM.
The Application Master can get stuck. That is why it sends heartbeats to the Resource Manager.
Thanks much
Nath
The other steps look fine.
The RM spins up an Application Master (AM). The AM will reach HDFS through the Name Node. It will create a mapper matrix. This is the mapper phase, e.g. whether Block 1 is available on Name Node 5 or 6.
Slight correction here: the AM itself can only execute inside a container. So first the RM asks a Node Manager on some node to start a container, and only then is the AM launched inside that container, not before. So there will be a container dedicated to the AM.
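As a quick way to see this on a running cluster, the YARN CLI can list an application's attempts and the containers each attempt holds; the AM lives in one of those containers (the IDs below are made-up placeholders):

    # list the attempts for a running application
    yarn applicationattempt -list application_1500000000000_0001

    # list the containers held by that attempt; one of them is dedicated to the AM
    yarn container -list appattempt_1500000000000_0001_000001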

What happens to orphaned Yarn Child processes?

Hadoop YARN launches instances of YarnChild in child VMs to execute the actual tasks. Those tasks communicate with their ApplicationMaster (AM) through the umbilical interface.
My question is: what happens if the AM dies and the Resource Manager (RM) fails to bring it back up (say, due to some code defect in the AM)? In such a case, the child tasks would (a) notice the absence of the AM via missed heartbeats and then (b) go to the RM to get the new AM location, which in this case they will not get. So what happens to these orphaned tasks? I have a scenario where I would like to terminate them. Is that the default behavior, and does their NodeManager (NM) terminate them?
From Hadoop: The Definitive Guide, Chapter 6, Failures, "Failures in YARN":
After a crash, a new resource manager instance is brought up (by an admin), and it recovers from the saved state. The state consists of the node managers in the system, as well as the running applications. Note that tasks are not part of the resource manager's state, as they are managed by the application.
Also, it is said that the resource manager is designed to be able to recover from crashes.
All child tasks related to that particular Application Master would be in a halted state. The Hadoop admin should either restart the Application Master or kill the application. The NodeManager does not terminate the failed Application Master.
If you want to kill an application, you can use the yarn application -kill <application_id> command. It will kill all running and queued jobs under the application.
If you want to kill a particular task in YARN, you can use hadoop job -kill-task <task-attempt-id>.
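Put together, the workflow looks roughly like this (the application and attempt IDs below are made-up placeholders):

    # find the application that lost its AM
    yarn application -list -appStates RUNNING

    # kill the whole application, including everything running or queued under it
    yarn application -kill application_1500000000000_0002

    # or kill a single MapReduce task attempt
    hadoop job -kill-task attempt_1500000000000_0002_m_000003_0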

How to use custom pool assignment for FairScheduler in Hadoop?

I am trying to take advantage of multiple pools in the FairScheduler, but all my jobs are submitted by a single agent process and therefore all belong to the same user.
I have set mapred.fairscheduler.poolnameproperty to scheduler.pool.name, and then in each job I set "scheduler.pool.name" to the specific pool from pools.xml that I want to use for that job.
I can see on the job configuration web page that both properties have the expected values, and the scheduler web page shows all the pools I am trying to use. However, all jobs are still running in the pool %username%, where username is the name of the user that submitted all the jobs.
I am running Hadoop version 0.20.1 from the Cloudera distribution.
Any ideas how to make my jobs run in a pool that does not depend on the name of the user who submitted the job?
It looks like restarting the JobTracker alone was not sufficient for the new configuration to take effect. After restarting all TaskTrackers and the JobTracker, pool assignment works as expected.
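For reference, a minimal sketch of the configuration described above; the pool name indexing, the limits, and the job jar/class names are placeholders:

    <!-- mapred-site.xml: tell the FairScheduler which job property names the pool -->
    <property>
      <name>mapred.fairscheduler.poolnameproperty</name>
      <value>scheduler.pool.name</value>
    </property>

    <!-- pools.xml (the FairScheduler allocation file) -->
    <allocations>
      <pool name="indexing">
        <maxRunningJobs>2</maxRunningJobs>
        <weight>1.0</weight>
      </pool>
    </allocations>

    # per job, set the pool through the chosen property
    # (requires the driver to use ToolRunner/GenericOptionsParser)
    hadoop jar my-job.jar com.example.MyJob -Dscheduler.pool.name=indexing in out

As the answer notes, both the JobTracker and all TaskTrackers had to be restarted before this assignment took effect.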
