How to use custom pool assignment for FairScheduler in Hadoop?

I am trying to take advantage of multiple pools in the FairScheduler, but all my jobs are submitted by a single agent process and therefore all belong to the same user.
I have set mapred.fairscheduler.poolnameproperty to scheduler.pool.name and then in each job I set "scheduler.pool.name" to a specific pool from pools.xml that I want to use for that job.
I can see on the job configuration web page that both properties have the expected values, and the scheduler web page shows all the pools I am trying to use. However, all jobs are still running in the pool %username%, where username is the name of the user that submitted the jobs.
I am running hadoop version 0.20.1 from Cloudera distribution.
Any ideas how to make my jobs run in a pool that does not depend on the name of the user who submitted the job?

It looks like restarting the JobTracker was not sufficient to apply the new configuration. After restarting all TaskTrackers and the JobTracker, pool assignment works as expected.
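For reference, a minimal sketch of the setup described above on the 0.20-era API; the pool name "analytics" and the class name are made up for illustration, and the mapred.fairscheduler.poolnameproperty setting is shown as a comment because it lives in mapred-site.xml on the JobTracker:

    // mapred-site.xml on the JobTracker (tells the FairScheduler which job
    // property carries the pool name):
    //   <property>
    //     <name>mapred.fairscheduler.poolnameproperty</name>
    //     <value>scheduler.pool.name</value>
    //   </property>

    import org.apache.hadoop.mapred.JobConf;

    public class PoolAssignmentExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // The FairScheduler reads the property named by
            // mapred.fairscheduler.poolnameproperty from each submitted job's
            // configuration and uses its value as the pool name.
            conf.set("scheduler.pool.name", "analytics");
            // ... set mapper/reducer classes and input/output paths,
            // then submit with JobClient.runJob(conf).
        }
    }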

Related

Restrict Spark job in local mode

Is there any way to restrict users from executing spark-submit with the deploy mode set to local? If I permit users to execute jobs in local mode, my YARN cluster will become underutilized.
I have configured YARN as the cluster manager to schedule Spark jobs.
I have checked the Spark configs and did not find any parameter that allows only a specific deploy mode. A user can override the default deploy mode while submitting Spark jobs to the cluster.
You can incentivize and facilitate using YARN by setting the spark.master key to yarn in your conf/spark-defaults.conf file. If your configuration points to the proper master, users will deploy their jobs on YARN by default.
I'm not aware of any way to completely bar your users from using a master, especially if it's under their control (as is the case for local). What you can do, if you control the Spark installation, is modify the existing spark-shell/spark-submit launch script to detect whether a user is trying to explicitly use local as a master and prevent this from happening. Alternatively, you could have your own script that checks for and prevents any local session from being opened and then runs spark-shell/spark-submit normally.
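If you go the wrapper-script route, a rough sketch might look like the following; the install path and the exact patterns to reject are assumptions you would adapt to your installation. Combined with spark.master set to yarn in conf/spark-defaults.conf, users who do not specify a master at all will land on YARN.

    #!/bin/bash
    # Hypothetical wrapper placed ahead of the real spark-submit in users' PATH.
    # It rejects an explicit local master and forwards everything else unchanged.
    REAL_SPARK_SUBMIT=/opt/spark/bin/spark-submit   # assumed install location

    for arg in "$@"; do
      case "$arg" in
        local|local\[*\]|--master=local*)
          echo "local mode is not allowed on this cluster; use --master yarn" >&2
          exit 1
          ;;
      esac
    done

    exec "$REAL_SPARK_SUBMIT" "$@"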

Yarn capacity-scheduler Parallelize

Does the capacity scheduler in YARN run applications in parallel on the same queue for the same user?
For example: if we have two Hive CLIs on two terminals with the same user, and the same query is started on both, do they execute on the default queue in parallel or sequentially?
Currently, the UI shows one application running and one in the pending state.
Is there a way to run it in parallel?
The YARN capacity scheduler runs jobs in FIFO order when they are submitted to the same queue. For example, if both Hive CLIs submit to the default queue, whichever one secures resources first gets into the running state and the other waits (but only if the queue does not have enough resources for both).
If you want parallel execution:
1) You can run the other job in a different queue. You can specify the queue name while launching the job on YARN (see the sketch below).
2) You need to define resources in a manner so that both jobs can get the resources they need.
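For example, if a second queue (here called "etl", purely illustrative) has already been defined in capacity-scheduler.xml, one of the Hive sessions can be pointed at it before running the query:

    -- In one of the Hive CLI sessions, before running the query
    -- (MapReduce execution engine):
    set mapreduce.job.queuename=etl;

    -- If the query runs on Tez, the equivalent property is:
    set tez.queue.name=etl;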

Difference between job, application, task, task attempt logs in Hadoop, Oozie

I'm running an Oozie job with multiple actions, and there's a part I could not make work. In the process of troubleshooting, I'm overwhelmed by the amount of logs.
In the YARN UI (yarn.resourcemanager.webapp.address in yarn-site.xml, normally on port 8088), there are the application_<app_id> logs.
In the Job History Server (yarn.log.server.url in yarn-site.xml, ours on port 19888), there are the job_<job_id> logs. (These job logs should also show up in Hue's Job Browser, right?)
In Hue's Oozie workflow editor, there are the task and task_attempt logs (not sure whether they're the same; everything is a mixed-up soup to me already), which redirect to the Job Browser if you click here and there.
Can someone explain the difference between these things from a Hadoop/Oozie architectural standpoint?
P.S.
I've also seen container_<container_id> in the logs. You might as well include this in your explanation in relation to the things above.
In terms of YARN, the programs that are being run on a cluster are called applications. In terms of MapReduce they are called jobs. So, if you are running MapReduce on YARN, job and application are the same thing (if you take a close look, job ids and application ids are the same).
A MapReduce job consists of several tasks (each is either a map or a reduce task). Every launch of a task is a task attempt; if a task fails, it is launched again, possibly on another node, as a new attempt.
Container is a YARN term for a unit of resource allocation. For example, a MapReduce task attempt runs inside a single container.
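To make the correspondence concrete, all of these IDs share the same cluster timestamp and sequence number; the numbers below are made up:

    application_1417023563311_0042            YARN application (ResourceManager / YARN UI)
    job_1417023563311_0042                    the same MapReduce job (Job History Server)
    task_1417023563311_0042_m_000003          one map task of that job
    attempt_1417023563311_0042_m_000003_0     the first attempt of that task
    container_1417023563311_0042_01_000005    a container allocated to that application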

I am not sure whether the application is running on just the master or the whole cluster for Spark on EC2

I am using Spark 1.1.1. I followed the instructions given at https://spark.apache.org/docs/1.1.1/ec2-scripts.html and have a cluster of one master node and one worker running on EC2.
I have made a jar of the application and rsynced it to the slaves. When I run the application using spark-submit with a deploy mode of client, the application works. However, when I do so with deploy mode cluster, it gives me an error saying it cannot find the jar on the worker. The permission of the jar is 755 on both the master and the worker.
I am not sure whether, when I run the application with deploy-mode=client, the application is using the workers. I don't think it is, since the worker URL does not show any completed jobs. It does, however, show failed jobs during deploy-mode=cluster.
Am I doing something wrong? Thank you for your help.
You can check if executors are assigned to the application on the /executors page on port 4040 (e.g. http://localhost:4040/executors/). If you only see <driver> then you are not using the worker. If you see one line for <driver> and one other line (with ID 0, unless it has restarted), then the worker is also providing an executor to your application. Here you can also see how many tasks it has completed for your application, and other stats.
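For what it's worth, the two submission modes discussed above would look roughly like this against a standalone EC2 master; the host name, class name, and jar locations are placeholders:

    # client mode: the driver runs where spark-submit is invoked,
    # so a jar on the local filesystem is enough
    ./bin/spark-submit --master spark://<master-hostname>:7077 \
        --deploy-mode client --class com.example.MyApp /home/ec2-user/myapp.jar

    # cluster mode: the driver is launched on a worker, so the jar must be
    # reachable from that worker (same path on every node, or an HDFS/S3 URL)
    ./bin/spark-submit --master spark://<master-hostname>:7077 \
        --deploy-mode cluster --class com.example.MyApp hdfs:///jars/myapp.jar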

How can I find out the IPs of slave nodes where a map reduce task is currently running or about to run for a given job?

I want to find out the IPs of the slave nodes where map reduce tasks are currently running, or are about to run, for a given job.
Is there any method to do this?
Thanks in advance.
For any job, you can view the list of running tasks through the JobTracker web UI - this will detail the nodes on which the tasks are running.
As for where tasks are about to run - this is not necessarily decided in advance. As slots become available on a node, the job scheduler (there are a number of schedulers which behave differently depending on your needs) identifies a task of some job which will run on that node (based upon a number of criteria, hopefully honoring data locality where it can) and instructs the task tracker on that node to run that specific task.
Programmatically, look at the javadocs for the JobClient class; it should be able to acquire information about the running tasks and their node names (you'll probably need to do a DNS lookup to get the actual IPs, I imagine).
Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:
http://localhost:50030/ – web UI for MapReduce job tracker(s)
http://localhost:50060/ – web UI for task tracker(s)
http://localhost:50070/ – web UI for HDFS name node(s)
Thanks to @Chris:
Programmatically, look at the javadocs for the JobClient class; it should be able to acquire information about the running tasks and their node names.
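A rough sketch of the programmatic route via the old-API JobClient; the job ID is a placeholder and error handling is omitted. Note that the completion events only report hosts for attempts that have finished; for tasks still in flight, the JobTracker web UI is the more direct route.

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;
    import org.apache.hadoop.mapred.TaskCompletionEvent;

    public class TaskHosts {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf();            // picks up the cluster config from the classpath
            JobClient client = new JobClient(conf);

            // Placeholder job id; in practice you would pass the real one in.
            RunningJob job = client.getJob(JobID.forName("job_201101011234_0001"));

            // Each completion event carries the HTTP address of the task tracker
            // that ran the attempt; resolve its host part to get the IP.
            for (TaskCompletionEvent event : job.getTaskCompletionEvents(0)) {
                System.out.println(event.getTaskAttemptId() + " ran on " + event.getTaskTrackerHttp());
            }
        }
    }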
