Measure Hadoop job time using JobControl - hadoop

I used to launch my Hadoop job with the following
long start = new Date().getTime();
boolean status = job.waitForCompletion(true);
long end = new Date().getTime();
This way I could measure the time taken by the job directly in my code, once it ended.
Now I have to use the JobControl in order to express dependencies between my jobs:
JobControl jobControl = new JobControl("MyJob");
jobControl.addJob(job1);
jobControl.addJob(job2);
job3.addDependingJob(job2);
jobControl.addJob(job3);
jobControl.run();
However, once jobControl.run() has been called, the code never goes any further, so I cannot add code that polls jobControl.getState() for the completion of the jobs.
How can I measure the time taken by a job using JobControl?

JobControl has no nice functionality to allow you to hook and get this information. You have some (potentially painful) options to try:
Start JobControl.run() in a separate thread, and in your main thread, poll the JobControl.getXXXJobs() methods to track when jobs change state (see the sketch after this list)
Look into using the Job End Notification URL hook, but this will require you to start a 'server' in your client to receive the notification events, and then try to work backwards from when a job ends
Extend the JobControl and jobcontrol.Job objects to track when a job changes state and add methods to query the start / end times
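For example, a minimal sketch of the first option, assuming the jobControl built in the question; the thread name and the 500 ms poll interval are arbitrary:
long start = System.currentTimeMillis();

Thread controlThread = new Thread(jobControl, "JobControl-runner");
controlThread.setDaemon(true);
controlThread.start();

while (!jobControl.allFinished()) {
    // You can also inspect the successful/failed job lists here to time individual jobs.
    try {
        Thread.sleep(500); // poll interval, arbitrary
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
    }
}

long end = System.currentTimeMillis();
System.out.println("All jobs finished in " + (end - start) + " ms");

jobControl.stop(); // let the control thread exit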

Related

Stop a specific process in Python ProcessPoolExecutor, or share state between them

This is my code
from concurrent.futures import ProcessPoolExecutor
import logging

def long_stage_task(node, deployment_folder_name, stage_s3_bucket):
    global workers
    logging.info("starting....")
    work = StageOS(node, deployment_folder_name, stage_s3_bucket)  # StageOS is a class
    work.stagestart()  # class method

executor = ProcessPoolExecutor(5)
executor.submit(long_stage_task, i, deployment_folder_name, stage_s3_bucket)
Now how can I stop a particular process/PID?
Is there any way to pass globals or shared state between them? I don't see anything in the docs.
https://docs.python.org/3/library/concurrent.futures.html
You could pass to the workers a list of Events and set them when you want the worker to stop. This implies your long_stage_task function periodically checks its own Event.
If what you are after is stopping a task which is taking too long, you can take a look at pebble. It allows you to set timeouts for function calls as well as to cancel ongoing tasks.
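For the Event approach, a rough sketch; the node names are placeholders, the real staging work from the question's StageOS class is elided, and note that with ProcessPoolExecutor the events have to come from a multiprocessing.Manager so they can be shared with worker processes:
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager
import time

def long_stage_task(node, stop_event):
    # Periodically check the event and bail out when it is set.
    while not stop_event.is_set():
        # ... do one chunk of the staging work here ...
        time.sleep(1)

if __name__ == "__main__":
    manager = Manager()  # Manager events can be shared with pool workers
    executor = ProcessPoolExecutor(5)
    stop_events = {}
    for node in ["node1", "node2"]:  # placeholder node names
        event = manager.Event()
        stop_events[node] = event
        executor.submit(long_stage_task, node, event)

    # Later, to stop the task running for a particular node:
    stop_events["node1"].set()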

Spring Batch: A job instance already exists

OK, I know this has been asked before, but I still can't find a definite answer to my question, which is this: I am using Spring Batch to export data to a SOLR search server. It needs to run every minute so I can export all the updates. The first execution passes OK, but the second one complains with:
2014-10-02 20:37:00,022 [defaultTaskScheduler-1] ERROR: catching
org.springframework.batch.core.repository.JobInstanceAlreadyCompleteException: A job instance already exists and is complete for parameters={catalogVersionPK=3378876823725152,
type=UPDATE}. If you want to run this job again, change the parameters.
at org.springframework.batch.core.repository.support.SimpleJobRepository.createJobExecution(SimpleJobRepository.java:126)
at
Of course I can add a date-time parameter to the job like this:
.addLong("time", System.getCurrentTimeMillis())
and then the job can be run more than once. However, I also want to query the last execution of the job, so I have code like this:
DateTime endTime = new DateTime(0);
JobExecution je = jobRepository.getLastJobExecution("searchExportJob", new JobParametersBuilder().addLong("catalogVersionPK", catalogVersionPK).addString("type", type).toJobParameters());
if (je != null && je.getEndTime() != null) {
endTime = new DateTime(je.getEndTime());
}
and this returns nothing, because I didn't provide the time parameter. So it seems like I can either run the job once and get the last execution time, or run it multiple times and not get the last execution time. I am really stuck :(
Assumption
Spring Batch uses some tables to store each job executed along with its parameters.
If you run the job twice with the same parameters, the second one fails, because a job instance is identified by the job name and its parameters.
Solution 1
You could use the JobExecution returned when launching a new Job.
JobExecution execution = jobLauncher.run(job, new JobParameters());
.....
// Use a JobExecutionDao to retrieve the JobExecution by ID
JobExecution ex = jobExecutionDao.getJobExecution(execution.getId());
Solution 2
You could implement a custom JobExecutionDao and perform a custom query to find your JobExecution in the BATCH_JOB_EXECUTION table.
See the Spring reference documentation here.
I hope my answer is helpful to you.
Use the JobExplorer as suggested by Luca Basso Ricci.
Because you do not know the job parameters, you need to look it up by instance:
Look for the last instance of the job named searchExportJob
Look for the last execution of that instance
This way you use only the Spring Batch API.
// We can request count 1 because job instances are ordered by instance id descending,
// so we know the first one returned is the last instance.
List<JobInstance> instances = jobExplorer.getJobInstances("searchExportJob", 0, 1);
JobInstance lastInstance = instances.get(0);
List<JobExecution> jobExecutions = jobExplorer.getJobExecutions(lastInstance);
// JobExecutions are ordered by execution id descending, so the first
// result is the last execution.
JobExecution je = jobExecutions.get(0);
if (je != null && je.getEndTime() != null) {
    endTime = new DateTime(je.getEndTime());
}
Note this code only works for Spring Batch 2.2.x and above; in 2.1.x the API was somewhat different.
There is another interface you can use: JobExplorer
From its javadoc:
Entry point for browsing executions of running or historical jobs and
steps. Since the data may be re-hydrated from persistent storage, it
may not contain volatile fields that would have been present when the
execution was active
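If you do not already have a JobExplorer bean available, here is a minimal sketch of wiring one up by hand, assuming a JDBC-backed job repository and an existing dataSource (JobExplorerFactoryBean lives in org.springframework.batch.core.explore.support):
// Build a JobExplorer against the Spring Batch metadata tables.
// Assumes `dataSource` points at the database holding BATCH_JOB_EXECUTION etc.
JobExplorerFactoryBean factoryBean = new JobExplorerFactoryBean();
factoryBean.setDataSource(dataSource);
factoryBean.afterPropertiesSet();
JobExplorer jobExplorer = factoryBean.getObject();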
If you are debugging your batch job and terminate it before it completes, you will get this error when you try to start it again.
To start it again, either update the name of your job so that it creates another execution id, or update the tables below:
BATCH_JOB_EXECUTION
BATCH_STEP_EXECUTION
You need to update the STATUS and END_TIME columns with non-null values.
Create a new job RunId every time.
If your code creates the Job object using a JobBuilderFactory, then the snippet below is useful for this problem:
return jobBuilderFactory
        .get("someJobName")
        .incrementer(new RunIdIncrementer()) // solution lies here - creates a new run id every time
        .flow(                               // and here
                stepBuilderFactory
                        .get("someTaskletStepName")
                        .tasklet(tasklet)            // you can replace this with a step
                        .allowStartIfComplete(true)  // makes the step run even if it completed in the last run
                        .build())
        .end()
        .build();

How to relate tasks back to the machines they were run on in Hadoop

I am working on a Hadoop project (currently using hadoop 1.2.1) where I need to keep track of task runtime information and which machines are performing tasks well. I am able to get task progress using the following:
RunningJob runningJob = JobClient.runJob(conf);
JobStatus jobStatus = runningJob.getJobStatus();
From here I can get a JobTracker and get map task progress:
TaskReport[] mapTaskReports = tracker.getMapTaskReports();
But now that I have the task reports, I am not sure how to know which machines these tasks are/were running on. Is there any machine-identifying information that I can retrieve (machine name, IP address, etc.) and relate back to these task reports?
NOTE: I need to be able to do this mapping while a job is still in progress, so I can make decisions based on whether certain machines are performing poorly for certain tasks.
EDIT: I think that the TaskTracker object may have what I want, with its getHostName() method, but I am not sure how to get an instance of it. The TaskTracker constructor takes in a JobConf object, but it doesn't seem to specify which machine it will get it from, as each machine running a task for the job will have its own instance of the TaskTracker.
RunningJob has an API called getTaskCompletionEvents(), which returns a TaskCompletionEvent array. Using a TaskCompletionEvent we can get the HTTP address of the TaskTracker.
Please try the code below; this is sample code and not tested:
TaskCompletionEvent[] events = runningJob.getTaskCompletionEvents(0);
for (TaskCompletionEvent event : events) {
    System.out.println(event.getTaskTrackerHttp()); // host:port format
}

Get sidekiq to execute a job immediately

At the moment, I have a sidekiq job like this:
class SyncUser
  include Sidekiq::Worker

  def perform(user_id)
    # do stuff
  end
end
I am placing a job on the queue like this:
SyncUser.perform_async user.id
This all works of course but there is a bit of a lag between calling perform_async and the job actually getting executed.
Is there anything else I can do to tell sidekiq to execute the job immediately?
There are two questions here.
If you want to execute a job immediately, in the current context you can use:
SyncUser.new.perform(user.id)
If you want to decrease the delay between asynchronous work being scheduled and when it's executed in the sidekiq worker, you can decrease the poll_interval setting:
Sidekiq.configure_server do |config|
  config.poll_interval = 2
end
The poll_interval setting controls how frequently worker backends check the queue for jobs. With a free worker, the average time between a job being scheduled and executed will be poll_interval / 2.
Use the .perform_inline method:
SyncUser.perform_inline(user.id)
If you also need to perform nested jobs, you can use Sidekiq::Testing.inline! in your production console
require 'sidekiq/testing'
Sidekiq::Testing.inline!
SyncUser.perform_inline(user.id)
For those who are using Sidekiq via the Active Job framework, you can do
SyncUser.perform_now(user.id)

Spring scheduler (Quartz): how to pauseAll schedulers and reschedule jobs before resumeAll

The need is this: first I pauseAll schedulers, and before resumeAll I want to reschedule the jobs (I mean change the trigger expressions) and make them resume with THE NEW trigger expressions, not the former ones.
Is it possible to reschedule a job while the scheduler is paused? In other words, is it OK to do the following?
scheduler.pauseAll(); // pause first
scheduler.rescheduleJob(...); // reschedule while it is paused??
scheduler.resumeAll(); // resume All with the new job-trigger expressions as above
(I cannot test the exact scenario right now because of restrictions in the project structure; I need time to build, test, and adapt it to the project.)
Thanks in advance.
I figured out today that it is possible to reschedule jobs even while the scheduler is paused. Thus, I got it done with the following piece of code, as I mentioned above:
scheduler.pauseAll(); // pause first
scheduler.rescheduleJob(...); // reschedule while it is paused
scheduler.resumeAll(); // resume All
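For reference, here is a concrete shape of that reschedule call with the Quartz 2.x API; the trigger key and cron expression below are placeholders, not values from the question:
scheduler.pauseAll();

// Replace the existing trigger with one that carries the new cron expression.
TriggerKey key = TriggerKey.triggerKey("myTrigger", "myGroup"); // placeholder key
Trigger newTrigger = TriggerBuilder.newTrigger()
        .withIdentity(key)
        .withSchedule(CronScheduleBuilder.cronSchedule("0 0/10 * * * ?")) // new expression
        .build();
scheduler.rescheduleJob(key, newTrigger); // works even while paused

scheduler.resumeAll();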
