I'm using LoadLeveler to submit jobs on an IBM/BlueGene architecture. I read the IBM documentation and also gave Google a try, but I cannot find how to do the following, which I expect should be possible:
One can use the queue keyword to tell LoadLeveler that a new job step is described, so I could do something like
first_step
queue
second_step
queue
but what I fail to find is a way that does something like
loop job_id = 1,10
do_job_with_given_job_id
end
Do I have to write a "normal" shell script that in turn calls a LoadLeveler script a number of times, or is there some built-in loop mechanism? I know that other job managers can do this.
When this comes up, we normally just recommend writing a shell script which generates the job submission script or scripts; that's what I do for my own jobs. Do these steps have dependencies on each other?
Also, just out of curiosity, which schedulers/resource managers can queue multiple jobs within a loop in a submission script? Not the PBS-based ones...
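A minimal sketch of the "generate the submission scripts from a shell script" approach suggested above. The executable name ./do_job and the file names are placeholders, and the keywords shown are only a minimal subset; a real BlueGene site will almost certainly need more (job type, class, wall clock limit and so on). Submission via llsubmit only happens if the command is actually available.

```shell
#!/bin/bash
# Sketch: write one LoadLeveler job command file per job id, then submit it.
# Assumptions: ./do_job is a hypothetical executable that takes the id as
# its argument; site-specific keywords are deliberately left out.

outdir=$(mktemp -d)

for job_id in 1 2 3 4 5 6 7 8 9 10; do
    cmdfile="$outdir/job_$job_id.cmd"

    # Each file is an independent single-step job.
    cat > "$cmdfile" <<EOF
# @ job_name = job_$job_id
# @ output = job_$job_id.out
# @ error = job_$job_id.err
# @ queue
./do_job $job_id
EOF

    if command -v llsubmit >/dev/null 2>&1; then
        llsubmit "$cmdfile"
    else
        echo "llsubmit not found; generated $cmdfile only"
    fi
done
```

If the steps depend on each other, separate job files like these are probably not the right shape, and a single file with several queue statements may fit better.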
I've got Jenkins job A that triggers job B and afterwards executes a shell script.
Inside this shell script of Jenkins job A I want to use a variable set by Jenkins Job B.
How can I do this?
This can be accomplished in many ways. One way would be to configure Job A to have a build step, that triggers Job B, and fetches variables in a document after Job B has finished. Then Job A can read those variables and use them in later steps.
There are several things to consider here though. First of all this requires Job B to finish before Job A can/should continue, so if you are thinking of parallel job execution this isn't ideal. Secondly, when dealing with env variables you will need a plugin to make variables available outside of the build step (exporting isn't enough), check out the EnvInject plugin. And thirdly, if job configuration is becoming complex, there probably is a better way of doing it. With Jenkinsfile and previously pipelining plugins, Job orchestration has improved a lot, and passing parameters around and such is much easier in this new, shiny world. That being said, here is an example of something that works like what you are asking about.
Job A
As a build step, trigger Job B, and let Job A halt while Job B finishes
As the next build step, copy an artifact from another build (Job B latest stable), using the Copy Artifact Plugin.
Do something with the file, for example just print its content; it is now accessible in Job A.
Job B
Export a variable and save it to a file
Archive the written file and make it accessible to Job A.
This isn't pretty, but it works at least.
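The shell side of the steps above could look roughly like this. It is only a local simulation: two temporary directories stand in for the two job workspaces, a plain cp stands in for the archive and Copy Artifact steps, and the variable name BUILD_LABEL and file name job_b.properties are made up for illustration.

```shell
#!/bin/bash
# Sketch: pass a value from "Job B" to "Job A" through a small key=value file.
# Assumptions: BUILD_LABEL and job_b.properties are hypothetical names; the
# cp below replaces what Jenkins would do via archiving plus Copy Artifact.

workspace_b=$(mktemp -d)
workspace_a=$(mktemp -d)

# --- Job B: compute a value and save it to a file that gets archived ---
BUILD_LABEL="release-1.2"
echo "BUILD_LABEL=$BUILD_LABEL" > "$workspace_b/job_b.properties"

# --- Stand-in for "archive in Job B, copy artifact into Job A" ---
cp "$workspace_b/job_b.properties" "$workspace_a/"

# --- Job A: read the value back without executing the file as code ---
label_from_b=$(grep '^BUILD_LABEL=' "$workspace_a/job_b.properties" | cut -d= -f2-)
echo "Job A sees BUILD_LABEL=$label_from_b"
```

Parsing the file with grep and cut, rather than sourcing it, avoids running whatever else might end up in the artifact.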
P.s. I'd recommend checking out the Jenkinsfile (https://jenkins.io/doc/book/pipeline/jenkinsfile/) options, it simplifies a lot after the initial learning curve.
Apologies if this question sounds basic, I'm totally new to Hadoop environment.
What am I looking for?
In my case, there are jobs scheduled to run every day, and I want to export the list of failed jobs to an Excel sheet each day.
How do I view the workflow jobs?
Currently I use the Oozie web console to view the jobs and I don't have/see an option to export. Also,
I was not able to find this information from the Oozie documentation.
However, I found that jobs can be listed using commands like
$ oozie jobs -oozie http://localhost:8080/oozie -localtime -len 2 -filter status=RUNNING
Where am I stuck?
I want to filter the failed jobs for a given date and would want to export it as csv/excel data.
@YoungHobbit was right to point at that post, which is very similar to this one; his answer was dead on target when it comes to extracting the entire list of jobs that have run on a specific day with the Oozie CLI (command-line interface).
Just don't forget to specify an "unbounded" reply, e.g. -len 999999999, to avoid side effects (the default is to show only the first 100 matches, which may be way too low if you run a lot of frequent jobs).
The trick is that you can make a more complex filter such as
"startCreatedTime=2016-06-28T00:00Z;endcreatedtime=2016-06-28T10:00Z;status=FAILED"
... but you cannot request jobs that have FAILED or have been KILLED or have been SUSPENDED (which may result from a temporary YARN or HDFS outage) or are still suspiciously RUNNING (because a sub-workflow is SUSPENDED for instance).
So your best choice is to get the whole list, then filter out all jobs that have SUCCEEDED, with a plain old grep -- as suggested in another answer.
Then you will also need a complex sed or awk script to break down the ugly CLI output into a well-formed CSV. Ouch!
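A rough sketch of the "get the whole list, then drop SUCCEEDED" step only; the CSV conversion is not attempted here because it depends on the exact CLI output layout. The server URL is a placeholder, the time window is the one from the filter above, and the output file name not_succeeded.txt is made up.

```shell
#!/bin/bash
# Sketch: list all jobs in a time window and keep those that did not succeed.
# Assumptions: OOZIE_URL is a placeholder; TIME_FILTER reuses the window
# from the text; header lines of the CLI output pass through untouched.

OOZIE_URL="http://localhost:8080/oozie"
TIME_FILTER="startCreatedTime=2016-06-28T00:00Z;endcreatedtime=2016-06-28T10:00Z"

# Remove every line containing SUCCEEDED as a whole word.
# "|| true" keeps the exit status clean when nothing is left over.
drop_succeeded() {
    grep -vw 'SUCCEEDED' || true
}

if command -v oozie >/dev/null 2>&1; then
    oozie jobs -oozie "$OOZIE_URL" -localtime -len 999999999 \
        -filter "$TIME_FILTER" | drop_succeeded > not_succeeded.txt
else
    echo "oozie CLI not found; nothing fetched"
fi
```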
Now, you have an alternative to the Oozie CLI: the Oozie REST API (old Cloudera tutorial here, reference for Oozie V4.2 here) lets you query the Oozie server with any programming language that provides...
an HTTP client
and a way to parse JSON messages (using plain old regular expressions, if nothing else is available)
The logic would be basically the same -- fetch the list of all jobs in the desired time window, ignore SUCCEEDED jobs, parse the others to generate a CSV record, dump into a CSV file.
But your program would be more robust, since it would be based on structured JSON input.
One more thing: if you are familiar with Microsoft VBA, you can even use an Excel macro to build the report dynamically, in a self-service way. No need to bother with an intermediate CSV file.
A certain number of jobs need to be executed in sequence, such that the result of one job is the input to another. There's also a loop in one part of the job chain. Currently, I'm running this sequence using wait for completion, but I'm going to start this sequence from a web service, so I don't want to get stuck waiting for a response. I want to start the sequence and return.
How can I do that, considering that the jobs depend on each other?
The typical approach I follow is to use an Oozie workflow to chain the sequence of jobs, passing the dependent inputs to each one accordingly.
I used a shell script to invoke the Oozie job.
I am not sure about loops within an Oozie workflow, but the link below describes a way to implement them. I hope it helps.
http://zapone.org/bernadette/2015/01/05/how-to-loop-in-oozie-using-sub-workflow/
Apart from this, the JobControl class is also a good option if the jobs need to run in sequence, and it requires less effort to implement. Looping would be easy, since it would be done entirely in Java code.
http://gandhigeet.blogspot.com/2012/12/hadoop-mapreduce-chaining.html
https://cloudcelebrity.wordpress.com/2012/03/30/how-to-chain-multiple-mapreduce-jobs-in-hadoop/
I have read the documentation so I know the difference.
My question, however, is: is there any risk in using .submit instead of .waitForCompletion if I want to run several Hadoop jobs on a cluster in parallel?
I mostly use Elastic Map Reduce.
When I tried doing so, I noticed that only the first job was being executed.
If your aim is to run jobs in parallel, then there is certainly no risk in using job.submit(). The main reason job.waitForCompletion exists is that its method call returns only when the job has finished, and it returns the job's success or failure status, which can be used to decide whether further steps should be run.
Now, getting back to you seeing only the first job being executed: this is because, by default, Hadoop schedules jobs in FIFO order. You can certainly change this behaviour. Read more here.
I would like to run a script when all of the jobs that I have sent to a server are done.
For example, I send
ssh server 'for i in config*; do qsub ./run 1 "$i"; done'
And I get back a list of the jobs that were started. I would like to automatically start another script on the server to process the output from these jobs once all are completed.
I would appreciate any advice that would help me avoid the following inelegant solution:
If I save each of the 1000 job IDs from the above call in a separate file, I could check the contents of each file against the current list of running jobs, i.e. the output from a call to:
ssh server qstat
I would only need to check every half hour, but I would imagine that there is a better way.
It depends a bit on what job scheduler you are using and what version, but there's another approach that can be taken too if your results-processing can also be done on the same queue as the job.
One very handy way of managing lots of related jobs in more recent versions of torque (and with grid engine, and others) is to launch the individual jobs as a job array (cf. http://docs.adaptivecomputing.com/torque/4-1-4/Content/topics/commands/qsub.htm#-t). This requires mapping the individual runs to numbers somehow, which may or may not be convenient; but if you can do it for your jobs, it does greatly simplify managing them: you can qsub them all in one line, and you can qdel or qhold them all at once (while still having the capability to deal with jobs individually).
If you do this, then you could submit an analysis job which had a dependency on the array of jobs which would only run once all of the jobs in the array were complete: (cf. http://docs.adaptivecomputing.com/torque/4-1-4/Content/topics/commands/qsub.htm#dependencyExamples). Submitting the job would look like:
qsub analyze.sh -W depend=afterokarray:427[]
where analyze.sh had the script to do the analysis, and 427 would be the job id of the array of jobs you launched. (The [] means only run after all are completed). The syntax differs for other schedulers (eg, SGE/OGE) but the ideas are the same.
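Put together, the array submission and the dependent analysis job might be sketched as below. This assumes a Torque-style qsub; run.sh and analyze.sh are placeholder script names, the range 1-1000 is only an example, and whether the id printed by qsub can be used as-is in the dependency (or needs its server suffix trimmed) may vary by installation. Inside run.sh, Torque exposes the array index in PBS_ARRAYID, which is how each run would pick its config file.

```shell
#!/bin/bash
# Sketch: submit a job array, then an analysis job that waits for all of it.
# Assumptions: Torque-style qsub; run.sh and analyze.sh are hypothetical
# scripts; nothing is submitted unless qsub and both scripts exist.

# Build the -W argument for an array job id such as "427[]".
array_dependency() {
    printf 'depend=afterokarray:%s' "$1"
}

if command -v qsub >/dev/null 2>&1 && [ -f run.sh ] && [ -f analyze.sh ]; then
    # qsub prints the id of the new array job on stdout.
    array_id=$(qsub -t 1-1000 run.sh)
    qsub -W "$(array_dependency "$array_id")" analyze.sh
else
    echo "dry run: qsub -W $(array_dependency '427[]') analyze.sh"
fi
```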
Getting this right can take some doing, and certainly Tristan's approach has the advantage of being simple and working with any scheduler; but if you'll be doing a lot of this, learning to use job arrays may be worth your time.
Something you might consider is having each job script just touch a filename in a dedicated folder like $i.jobdone, and in your master script, you could simply use ls *.jobdone | wc -l to test for the right number of jobs done.
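A small sketch of that marker-file idea, simulated locally. The directory, the expected count of 3, and the loop that creates the markers are stand-ins; in practice each job script would end with its own touch, and the polling interval would be much longer.

```shell
#!/bin/bash
# Sketch: wait until the expected number of *.jobdone marker files exists.
# Assumptions: done_dir and expected are illustrative; the touch loop below
# stands in for the real jobs, each of which would create its own marker.

done_dir=$(mktemp -d)
expected=3

# Stand-in for the jobs finishing.
for i in 1 2 3; do
    touch "$done_dir/$i.jobdone"
done

# Count marker files with a glob, so an empty directory yields 0, not an error.
count_done() {
    local n=0 f
    for f in "$1"/*.jobdone; do
        [ -e "$f" ] && n=$((n + 1))
    done
    echo "$n"
}

# Master script: poll until every job has left its marker.
until [ "$(count_done "$done_dir")" -ge "$expected" ]; do
    sleep 1
done
echo "all $expected jobs done; start post-processing here"
```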
You can use wait to stop execution until all your jobs are done. You can even collect all the exit statuses and other running statistics (time it took, count of jobs done at the time, whatever) if you cycle around waiting for specific ids.
I'd write a small C program to do the waiting and collecting (if you have permissions to upload and run executables), but you can easily use the bash wait built-in for roughly the same purpose, albeit with less flexibility.
Edit: small example.
#!/bin/bash
...
waitfor=''
for i in tasks; do
task &
waitfor="$waitfor $!"
done
wait $waitfor
...
If you run this script in the background, it won't bother you, and whatever comes after the wait line will run when your jobs are over.
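To also collect the exit statuses mentioned earlier, one can wait on each process id in turn instead of all at once. A minimal sketch, where the subshells with fixed exit codes are stand-ins for real tasks:

```shell
#!/bin/bash
# Sketch: start tasks in the background, then wait for each pid and record
# its exit status. The "( exit N )" subshells are placeholders for real work.

pids=()
for code in 0 3 0; do
    ( exit "$code" ) &
    pids+=("$!")
done

statuses=()
failed=0
for pid in "${pids[@]}"; do
    # wait returns the exit status of the process it waited for.
    if wait "$pid"; then
        status=0
    else
        status=$?
    fi
    statuses+=("$status")
    if [ "$status" -ne 0 ]; then
        failed=$((failed + 1))
    fi
done

echo "exit statuses: ${statuses[*]}; failed: $failed"
```

Note that this only works for processes started by the same shell; it does not apply to jobs handed off to a scheduler with qsub.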