How to know if preemption is happening on the YARN fair share scheduler?

Is there any way to know for sure if the preemption mechanism has been triggered on YARN?
In the YARN Resource Manager or the logs maybe?

If your log level is set to INFO, you should see preemption messages in the YARN ResourceManager logs, produced by this code:
// Warn application about containers to be killed
for (RMContainer container : containers) {
  FSAppAttempt app = scheduler.getSchedulerApp(
      container.getApplicationAttemptId());
  LOG.info("Preempting container " + container +
      " from queue " + app.getQueueName());
  app.trackContainerForPreemption(container);
}
at https://github.com/apache/hadoop/.../scheduler/fair/FSPreemptionThread.java#L143.
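For example, such a line might look like this hypothetical one (the container ID and queue name are made up, but the format follows from the LOG.info call above):
Preempting container container_1546344000000_0001_01_000002 from queue root.default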
And if you have the log level set to DEBUG, you will also see this:
if (LOG.isDebugEnabled()) {
  LOG.debug(
      "allocate: post-update" + " applicationAttemptId=" + appAttemptId
          + " #ask=" + ask.size() + " reservation= " + application
          .getCurrentReservation());
  LOG.debug("Preempting " + preemptionContainerIds.size()
      + " container(s)");
}
at https://github.com/apache/hadoop/.../scheduler/fair/FairScheduler.java#937
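Hypothetically, the corresponding DEBUG lines would look something like this (the IDs and counts are made up, but the format follows from the LOG.debug calls above):
allocate: post-update applicationAttemptId=appattempt_1546344000000_0001_000001 #ask=3 reservation= <memory:0, vCores:0>
Preempting 2 container(s)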

Related

Airflow UI link in SlackAPIPostOperator?

I'm using SlackAPIPostOperator in Airflow to send Slack messages on task failures.
I wondered if there's a smart way to add a link to the airflow UI logs page of the failed task to the slack message.
The following is an example I want to achieve:
http://myserver-uw1.myaws.com:8080/admin/airflow/graph?execution_date=...&arrange=LR&root=&dag_id=MyDAG&_csrf_token=mytoken
The current message is:
def slack_failed_task(context):
    failed_alert = SlackAPIPostOperator(
        task_id='slack_failed',
        channel="#mychannel",
        token="...",
        text=':red_circle: Failure on: ' +
             str(context['dag']) +
             '\nRun ID: ' + str(context['run_id']) +
             '\nTask: ' + str(context['task_instance']))
    return failed_alert.execute(context=context)
You can build the URL to the UI with the config value base_url under the [webserver] section, and then use Slack's message format <http://example.com|stuff> for links.
from airflow import configuration

def slack_failed_task(context):
    link = '<{base_url}/admin/airflow/log?dag_id={dag_id}&task_id={task_id}&execution_date={execution_date}|logs>'.format(
        base_url=configuration.get('webserver', 'BASE_URL'),
        dag_id=context['dag'].dag_id,
        task_id=context['task_instance'].task_id,
        execution_date=context['ts'])  # equal to context['execution_date'].isoformat()
    failed_alert = SlackAPIPostOperator(
        task_id='slack_failed',
        channel="#mychannel",
        token="...",
        text=':red_circle: Failure on: ' +
             str(context['dag']) +
             '\nRun ID: ' + str(context['run_id']) +
             '\nTask: ' + str(context['task_instance']) +
             '\nSee ' + link + ' to debug')
    return failed_alert.execute(context=context)
We can also do this using the log_url attribute of the TaskInstance:
def slack_failed_task(context):
    failed_alert = SlackAPIPostOperator(
        task_id='slack_failed',
        channel="#mychannel",
        token="...",
        text=':red_circle: Failure on: ' +
             str(context['dag']) +
             '\nRun ID: ' + str(context['run_id']) +
             '\nTask: ' + str(context['task_instance']) +
             '\nLogs: <{url}|to Airflow UI>'.format(url=context['task_instance'].log_url)
    )
    return failed_alert.execute(context=context)
This attribute has been available since at least Airflow 1.10.4.
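For completeness, here is a minimal sketch of how such a function is typically wired up as a failure callback (the operator, task id, and callable below are placeholders):

from airflow.operators.python_operator import PythonOperator

task = PythonOperator(
    task_id='my_task',
    python_callable=my_callable,            # placeholder for your task logic
    on_failure_callback=slack_failed_task,  # Airflow passes the task context on failure
    dag=dag,
)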

MQCMD_INQUIRE_CLUSTER_Q_MGR pcf request not returning cluster information

Isn't MQCMD_INQUIRE_CLUSTER_Q_MGR equivalent to the runmqsc DISPLAY CLUSQMGR(*) command? The following is the output from that command:
display clusqmgr(*)
4 : display clusqmgr(*)
AMQ8441: Display Cluster Queue Manager details.
CLUSQMGR(QM_FR1) CHANNEL(TO.QM_FR1)
CLUSTER(CLUSTER1)
AMQ8441: Display Cluster Queue Manager details.
CLUSQMGR(QM_FR2) CHANNEL(TO.QM_FR2)
CLUSTER(CLUSTER1)
AMQ8441: Display Cluster Queue Manager details.
CLUSQMGR(QM_PR1) CHANNEL(TO.QM_PR1)
CLUSTER(CLUSTER1)
AMQ8441: Display Cluster Queue Manager details.
CLUSQMGR(QM_PR2) CHANNEL(TO.QM_PR2)
CLUSTER(CLUSTER1)
AMQ8441: Display Cluster Queue Manager details.
CLUSQMGR(QM_PR3) CHANNEL(TO.QM_PR3)
CLUSTER(CLUSTER1)
AMQ8441: Display Cluster Queue Manager details.
CLUSQMGR(QM_PR3) CHANNEL(TO.QM_PR3)
CLUSTER(CLUSTER1)
I was expecting a similar response from PCF with the code I have supplied, but I don't get this information.
I have tried the following code, but it does not return cluster information.
PCFMessageAgent agent = new PCFMessageAgent(queueManager);
agent.setCheckResponses(false);
PCFMessage[] responses;
PCFMessage request = new PCFMessage(MQConstants.MQCMD_INQUIRE_CLUSTER_Q_MGR);
request.addParameter(MQConstants.MQCA_CLUSTER_Q_MGR_NAME, queueManager);
responses = agent.send(request);
String clusterName = (String)responses[0].getParameterValue(MQConstants.MQCA_CLUSTER_NAME);
String clusterInfo = (String)responses[0].getParameterValue(MQConstants.MQIACF_CLUSTER_INFO);
logger.info("Cluster Name [" + clusterName + "]");
logger.info("Cluster Information [" + clusterInfo + "]");
The last line prints null.
So the question is: how do I get this information using PCF? (The output above is from a full repository queue manager.)
The following code displays the required information:
// Re-using the MQCMD_INQUIRE_CLUSTER_Q_MGR request built above
responses = agent.send(request);
for (int i = 0; i < responses.length; i++) {
    System.out.println("Cluster Queue manager [" + (String) responses[i].getParameterValue(MQConstants.MQCA_CLUSTER_Q_MGR_NAME) + "]");
    System.out.println("Cluster Name [" + (String) responses[i].getParameterValue(MQConstants.MQCA_CLUSTER_NAME) + "]");
    System.out.println("Cluster Channel [" + (String) responses[i].getParameterValue(MQConstants.MQCACH_CHANNEL_NAME) + "]");
}
The output looks like this:
Cluster Queue manager [QM1 ]
Cluster Name [CLUS1 ]
Cluster Channel [TO.QM1 ]

Set heartbeat via command line in hadoop

Is it possible to set the NodeManager heartbeat parameter via the command line in Hadoop? How?
Alternatively, is it possible to modify such a parameter without restarting the cluster?
The parameter I am interested in managing is yarn.resourcemanager.nodemanagers.heartbeat-interval-ms, defined in yarn-default.xml.
You cannot set the parameter yarn.resourcemanager.nodemanagers.heartbeat-interval-ms (the heartbeat interval in milliseconds for every NodeManager in the cluster) from the command line.
You can change this parameter in yarn-site.xml, but then you need to restart the services.
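For reference, a minimal sketch of what the yarn-site.xml entry would look like (the 3000 ms value is only an illustration):
<property>
  <name>yarn.resourcemanager.nodemanagers.heartbeat-interval-ms</name>
  <value>3000</value>
</property>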
The reason is that this parameter is read only once, when the ResourceTrackerService is started inside the ResourceManager. The heartbeat interval is then returned to the NodeManager as part of the NodeHeartbeatResponse:
// Heartbeat response
NodeHeartbeatResponse nodeHeartBeatResponse = YarnServerBuilderUtils
    .newNodeHeartbeatResponse(lastNodeHeartbeatResponse.getResponseId() + 1,
        NodeAction.NORMAL, null, null, null, null, nextHeartBeatInterval);
The nextHeartBeatInterval value in the call above is read in the serviceInit() method of the ResourceTrackerService:
nextHeartBeatInterval =
    conf.getLong(YarnConfiguration.RM_NM_HEARTBEAT_INTERVAL_MS,
        YarnConfiguration.DEFAULT_RM_NM_HEARTBEAT_INTERVAL_MS);
if (nextHeartBeatInterval <= 0) {
  throw new YarnRuntimeException("Invalid Configuration. "
      + YarnConfiguration.RM_NM_HEARTBEAT_INTERVAL_MS
      + " should be larger than 0.");
}
Also, the value of yarn.resourcemanager.nodemanagers.heartbeat-interval-ms (default 1000) should be less than the value of yarn.nm.liveness-monitor.expiry-interval-ms (default 600000), which determines how long to wait until a NodeManager is considered dead.
The check for this is in the validateConfigs() method of the ResourceManager:
// validate expireIntvl >= heartbeatIntvl
long expireIntvl = conf.getLong(YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS,
    YarnConfiguration.DEFAULT_RM_NM_EXPIRY_INTERVAL_MS);
long heartbeatIntvl =
    conf.getLong(YarnConfiguration.RM_NM_HEARTBEAT_INTERVAL_MS,
        YarnConfiguration.DEFAULT_RM_NM_HEARTBEAT_INTERVAL_MS);
if (expireIntvl < heartbeatIntvl) {
  throw new YarnRuntimeException("Nodemanager expiry interval should be no"
      + " less than heartbeat interval, "
      + YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS + "=" + expireIntvl
      + ", " + YarnConfiguration.RM_NM_HEARTBEAT_INTERVAL_MS + "="
      + heartbeatIntvl);
}

Server Not Found Exception in wsadmin

I have a Jython script to create a server, deploy an application, and then start the server.
But I am getting the following exception while running it:
WASX7017E: Exception received while running file "myfile.py"; exception information: javax.management.MBeanException[[ com.ibm.websphere.management.exception.AdminException: Server, SERVERNAME, not found.
Here is the entire code... http://snipt.org/BMaf4
Update: Here is the entire log: http://snipt.org/BNZ1
I can't figure out where I am going wrong.
But when I issue a "start all servers" via wsadminlib, the server gets started.
The problem might be in the way you are deploying and/or configuring your application.
# Your script sample
# Deploy the WAR
APP_PATH = APP_HOME + '/MYAPP/CycleWAR/war/MYAPPCycle.war'
ARGS_LIST = '.....'
AdminApp.install(APP_PATH, ARGS_LIST)
You need to add another parameter to your arguments list ARGS_LIST: args = "[-server " + SERVER_NAME + "]"
# Since 6.0.X version
# Server Deployment
args = "[-server " + serverName + "]"
# Cluster Deployment
# args = "[-cluster " + clusterName + "]"
AdminApp.install(applicationFilePath, args)
After the server configurations...
# Sync the nodes
Sync1 = AdminControl.completeObjectName("type=NodeSync,node=" + NODE_NAME + ",*")
print "Getting Sync Info.. " + "\n" + Sync1
AdminControl.invoke(Sync1, 'sync')
print NODE_NAME + " Sync Completed.. "
We have to sync the nodes first before trying to start the server.
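Once the sync has completed, a minimal sketch of starting the server from the same script (SERVER_NAME and NODE_NAME are the variables assumed above; AdminControl.startServer is the standard wsadmin call):
# Start the newly created server on its node once the configuration is in sync
AdminControl.startServer(SERVER_NAME, NODE_NAME)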

Tracking Hadoop job status via web interface? (Exposing Hadoop to internal clients in the company)

I want to develop a website that will allow analysts within the company to run Hadoop jobs (chosen from a set of defined jobs) and see their jobs' status/progress.
Is there an easy way to do this (get running job statuses etc.) via Ruby/Python?
How do you expose your Hadoop cluster to internal clients in your company?
I have found one way to get information about jobs from the JobTracker. This is the code:
Configuration conf = new Configuration();
conf.set("mapred.job.tracker", "URL");
JobClient client = new JobClient(new JobConf(conf));
JobStatus[] jobStatuses = client.getAllJobs();
for (JobStatus jobStatus : jobStatuses) {
    // Find the finish time of the last map or reduce task of this job
    long lastTaskEndTime = 0L;
    TaskReport[] mapReports = client.getMapTaskReports(jobStatus.getJobID());
    for (TaskReport r : mapReports) {
        if (lastTaskEndTime < r.getFinishTime()) {
            lastTaskEndTime = r.getFinishTime();
        }
    }
    TaskReport[] reduceReports = client.getReduceTaskReports(jobStatus.getJobID());
    for (TaskReport r : reduceReports) {
        if (lastTaskEndTime < r.getFinishTime()) {
            lastTaskEndTime = r.getFinishTime();
        }
    }
    client.getSetupTaskReports(jobStatus.getJobID());
    client.getCleanupTaskReports(jobStatus.getJobID());
    System.out.println("JobID: " + jobStatus.getJobID().toString() +
        ", username: " + jobStatus.getUsername() +
        ", startTime: " + jobStatus.getStartTime() +
        ", endTime: " + lastTaskEndTime +
        ", Duration: " + (lastTaskEndTime - jobStatus.getStartTime()));
}
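If you would rather do this from Python, here is a minimal sketch against the YARN ResourceManager REST API. Note this applies to YARN-era clusters rather than the old JobTracker, and the host, port, and use of the requests library are assumptions:

import requests

# Ask the ResourceManager for all currently running applications.
# 8088 is the usual RM web UI port; adjust the host for your cluster.
rm_url = 'http://resourcemanager.example.com:8088'
resp = requests.get(rm_url + '/ws/v1/cluster/apps', params={'states': 'RUNNING'})
resp.raise_for_status()

# The RM returns {"apps": null} when nothing is running.
apps = resp.json()['apps']
for app in (apps['app'] if apps else []):
    print(app['id'], app['user'], app['state'], app['progress'])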
Since the "beta 2" version of Cloudera's Hadoop Distribution you can, almost with no effort, use Hadoop User Experience (HUE), which was earlier called Cloudera Desktop, and it has grown enormously since then. It comes with a job designer, a Hive interface, and much more. You should definitely check it out before deciding to build your own application.
Maybe a good place to start would be to take a look at Cloudera Desktop. It provides a web interface for cluster administration and job development tasks. It's free to download.
There is nothing like this that ships with Hadoop. It should be trivial to build this functionality. Some of it is available via the JobTracker's web pages, and some you will have to build yourself.
