How to kill the youngest task in Hadoop Fair Scheduler - hadoop

I have very interesting use-case.
I’m running Apache Hadoop distribution latest version, with yarn.
The use-case is long computational jobs, that most work is compute intensive inside mapper part. I’m using fair scheduler for fair multi-user resource usage. Since task is long computation, I’m looking for a way to hint scheduler to kill the youngest task. Is that possible to configure Fair Scheduler to kill the youngest tasks?
If no such configuration, is that possible to inherit from fair scheduler and add such functionality?

Yes, you can totally modify the class to build your own.
Once you've built it and added it to the classpath you'd need to alter the configuration to use it similar to switching to capacity scheduler:
Configuration
Setting up ResourceManager to use CapacityScheduler
To configure the ResourceManager to use the CapacityScheduler, set the following property in the conf/yarn-site.xml:
Property Value
yarn.resourcemanager.scheduler.class org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler

Related

Airflow setup for high availability

How to deploy apache airflow (formally known as airbnb's airflow) scheduler in high availability?
I am not asking about the backend DB or RabbitMQ that should obviously be deployed in high availability configuration.
My main focus is the scheduler - is there something special needs to be done?
After a bit digging I found that it is not safe to run multiple schedulers simoultanously, this means that out of the box - airflow schedulers are not safe to use in high availablity environments.
The airflow team are planning to solve this issue by adding a lock mechanism on the DAG data structure, but this is not implemented yet (I checked by running 2 schedulers and saw that they schedule the same dag instances which is not good).
This is described here:
https://groups.google.com/forum/#!topic/airbnb_airflow/-1wKa3OcwME
I did found a way to workaround this high availalbilty issue by wrapping the schedulers with my own code and use cluster tools for leader election (I personanlly use consul for this purpose). This way only the elected master is running the scheduler and when the master is down the slave replaces him.
Please consider this when u use airflow in high availabilty environments since out of the box, airflow scheduler is currently not suitable for this (unless you solve this issue yourself).
Edit - an alternative approach to the master slave solution is to use a cluster manager/scheduler to make sure that only one airflow scheduler instance is always available. This approach relies on the self healing abilities of the cluster manager u have. For example both mesos and nomad supports this kind of configuration (I presonally chose nomad for its simplicity).
My personal experience was to follow the instructions I found for some best practices; that is to restart the scheduler every 10 runs ( -N 10 ) and use this software when possible:
https://github.com/teamclairvoyant/airflow-scheduler-failover-controller
I also use a DAG which pings a monitoring system to be sure that the scheduler has not gone away.
In my scenario, I have 2 schedulers (on 2 separate docker swarms), with the standby cluster scheduler turned off (using docker swarm service scale=0). I needed to make sure the primary scheduler had stopped fully before I started up the standby scheduler. What I found was that having 2 running schedulers (even for a brief time period) resulted in an occasional DAG scheduled to run on both clusters leading to duplicate reports generated from two different cluster zone.

Schedule a trigger for a job that is excecuted on every node in a cluster

I'm wondering if there is a simple workaround/hack for quartz of triggering a job that is excecuted on every node in a cluster.
My situation:
My application is caching some things and is running in a cluster with no distributed-cache. Now I have situations where I want to refresh the caches on all nodes triggered by a job.
As you have found out, Quartz always picks up a random instance to execute a scheduled job and this cannot be easily changed unless you want to hack its internals.
Probably the easiest way to achieve what you describe would be to implement some sort of a coordinator (or master) job that will be aware of all Quartz instances in the cluster and will "manually" trigger execution of the cache-sync job on every single node. The master job can easily do it via the RMI, or JMX APIs exposed by Quartz.
You may want to check this somewhat similar question.

Does oozie provide any performance optimizations in terms of I/O?

Since oozie is a workflow engine for Hadoop platform, does it improve the performance of execution of a DAG dependencies of MapReduce jobs?
I mean, since the output of one MapReduce job is given as input to the next MapReduce job in the DAG, does oozie provides any mechanism for storing the intermediate results in memory and thus saving I/O.
Or is it just a workflow manager, that coordinates a series of dependent MapReduce?
Want to know how internally oozie works?
It is just a workflow manager. It doesn't change how, say, MapReduce works even though it runs M/R jobs.
What you are describing is much more like what Apache Spark does. I'm not aware that Oozie integrates directly with Spark yet, but, it can't possibly be difficult or far off.
It is "just a workflow manager, that coordinates a series of MapReduce" jobs. It uses the same mechanisms to execute jobs as using the command line.

How I can find out IPs of slave nodes where currently map reduce task is running or about to run for a given Job?

I want to find out IPs of slave nodes where currently map reduce job is running or about to run for a given Job.
Is there any method to do this ?
Thanks in Advance.
For any job, you can view the list of running tasks through the Job Scheduler Web UI - this will detail the nodes on which the task is running.
As for where tasks are about to run - this is not neccessarily decided in advance. As slots become available on a node, the Job Scheduler (there are a number which behave differently depending on your needs) identifies a job task which will run on that node (based upon a number of criteria, hopefully honoring data locality where it can) and instructs the task tracker on that node to run the specific task.
Programatically, look at the javadocs for the JobClient class, it should be able to acquire information about the running tasks, and their node names (you'll probably need to do a DNS lookup to get the actual IPs i imagine)
Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:
http://localhost:50030/ – web UI for MapReduce job tracker(s)
http://localhost:50060/ – web UI for task tracker(s)
http://localhost:50070/ – web UI for HDFS name node(s)
Thanks to #Chris..
Programatically, look at the javadocs for the JobClient class, it should be able to acquire information about the running tasks, and their node names.

Hadoop Job Scheduling query

I am a beginner to Hadoop.
As per my understanding, Hadoop framework runs the Jobs in FIFO order (default scheduling).
Is there any way to tell the framework to run the job at a particular time?
i.e Is there any way to configure to run the job daily at 3PM like that?
Any inputs on this greatly appreciated.
Thanks, R
What about calling the job from external java schedule framework, like Quartz? Then you can run the job as you want.
you might consider using Oozie (http://yahoo.github.com/oozie/). It allows (beside other things):
Frequency execution: Oozie workflow specification supports both data
and time triggers. Users can specify execution frequency and can wait
for data arrival to trigger an action in the workflow.
It is independent of any other Hadoop schedulers and should work with any of them, so probably nothing in you Hadoop configuration will change.
How about having a script to execute your Hadoop job and then using at command to execute at some specified time.if you want the job to run regularly, you could setup a cron job to execute your script.
I'd use a commercial scheduling app if Cron does not cut it and/or a custom workflow solution. We use a solution called jams but keep in mind it's .net-oriented.

Resources