Job with multiple tasks on different servers - ruby

I need a Job with multiple tasks that are run on different machines, one after another (not simultaneously). While the current job is running, another identical job can arrive in the queue, but it should not be started until the previous one has finished. I came up with this 'solution', which might not be the best, but it gets the job done :). I just have one problem.
I figured out I would need a JobQueue (either MongoDB or Redis) with the following structure:
{
  hostname: 'host where to execute the task',
  running: false,
  task: 'current task number',
  tasks: [
    { task_id: 1, commands: 'run these commands', hostname: 'aaa' },
    { task_id: 2, commands: 'another command', hostname: 'bbb' }
  ]
}
Hosts:
search for jobs with the same hostname and running == false
execute the task that is set in that job (marking the job as running while it works)
upon finishing, set running = false, check whether there are any more tasks to perform, increment the task number, and set the hostname to the machine from the next task
Because jobs can accumulate, imagine a situation where jobs are queued for one host like this: A, B, A.
Since I have to run all the jobs for the specified machine, how do I avoid starting the third job (the second A) while the first A is still running?

{
  _id : ObjectId("xxxx"), // unique, generated by MongoDB, indexed, sortable
  hostname: 'host where to execute the task',
  running: false,
  task: 'current task number',
  tasks: [
    { task_id: 1, commands: 'run these commands', hostname: 'aaa' },
    { task_id: 2, commands: 'another command', hostname: 'bbb' }
  ]
}
The question is how the next available "worker" knows whether it's safe to start the next job on a particular host.
You probably need some sort of sortable (indexed) field to indicate the arrival order of the jobs. If you are using MongoDB, you can let it generate _id, which will already be unique, indexed, and in time order, since its first four bytes are a timestamp.
You can now query to see if there is a job to run for a particular host like so:
// pseudo code - shell syntax, not actual code
var jobToRun = db.queue.findOne({hostname: <myHostName>}, {}, {sort: {_id: 1}});
if (jobToRun != null && jobToRun.running == false) {
    var myJob = db.queue.findAndModify({
        query: {_id: jobToRun._id, running: false},
        update: {$set: {running: true}}
    });
    if (myJob == null) print("Someone else already grabbed it");
    else {
        /* now we know that we updated this and we can run it */
    }
} else { /* sleep and try again */ }
What this does is check for the oldest/earliest job for a specific host, then look at whether that job is running. If it is, do nothing (sleep and try again). Otherwise, try to "lock" it by doing a findAndModify on _id and running: false, setting running to true. If a document comes back, this process succeeded with the update and can now start the work. Since two threads can be trying to do this at the same time, getting back null means the document was already switched to running by another thread, so we wait and try again.
I would advise storing a timestamp somewhere to indicate when a job started "running", so that if a worker dies without completing a task it can be "found" - otherwise it will be "blocking" all the jobs behind it for the same host.
What I described works for a queue where you remove the job once it is finished, rather than setting running back to false. If you do set running back to false so that other "tasks" can be done, then you will probably also be updating the tasks array to indicate what has been done.
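For completeness, here is a minimal sketch of the same grab-the-oldest-job pattern in Python with pymongo; the database/collection names (jobdb, queue) and the helper name try_grab_job are made up for illustration, not part of the setup above:

# Sketch of the findOne + findAndModify locking pattern with pymongo.
from pymongo import MongoClient, ASCENDING
from pymongo.collection import ReturnDocument

client = MongoClient()
queue = client.jobdb.queue  # hypothetical database and collection names

def try_grab_job(my_hostname):
    # Oldest job first: MongoDB ObjectIds sort in creation-time order.
    job_to_run = queue.find_one({"hostname": my_hostname}, sort=[("_id", ASCENDING)])
    if job_to_run is None or job_to_run["running"]:
        return None  # nothing queued, or the head of this host's queue is busy
    # Atomic "lock": only one worker's update can match running == False.
    return queue.find_one_and_update(
        {"_id": job_to_run["_id"], "running": False},
        {"$set": {"running": True}},
        return_document=ReturnDocument.AFTER,
    )

If try_grab_job returns a document, the caller owns the job and can run it; if it returns None, another worker won the race (or the host is busy) and the caller should sleep and retry, exactly as in the shell pseudocode.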


How to switch a scheduled task from one host to another in a cluster, when the host that contains the scheduled task is down, in MarkLogic?

I scheduled a task on a bootstrap node in MarkLogic, but there is a chance that the host will somehow go down. In that case, how do I switch that task to another host in the cluster?
Note: I have to schedule the task on only a single host of the cluster at a time.
The options for assigning scheduled tasks are currently to set a specific host, or to leave the assignment empty and have the task execute on all hosts.
So, if you want to ensure that the task is still executed in the event of a host failure, you could leave the host assignment empty and add logic inside the task to determine which host should execute the code; on all other hosts it becomes a no-op.
One way to achieve that is to add code to the task that evaluates whether xdmp:host() is the host with the open Security forest (this assumes you have an HA-replica forest for your Security DB to ensure availability, but it could be achieved with any database):
xquery version "1.0-ml";
let $status := xdmp:database("Security") => xdmp:database-forests() => xdmp:forest-status()
where $status/*:host-id/data() eq xdmp:host()
return
  (: this will only execute on the host with the open Security forest :)
  "execute task logic"

Access denied calling EnumJobs

I'm trying to get the domain username for jobs in a printer queue on Windows Server 2012 R2 Standard. The code snippet below is in Delphi. OpenPrinter and EnumJobs are part of the Windows Spooler API.
Update! Setting maxJobs to a higher multiple of 4 allows more jobs in the queue to be enumerated, e.g. maxJobs=8 allows two jobs, but not three; maxJobs=12 allows three jobs.
Solved! It looks like I can just ignore the return value of EnumJobs and simply check whether the number of jobs it returns (the last argument in the call) is > 0. This seems to work fine for all the situations listed below, including printing via a share.
const
  maxJobs = 4;
var
  h : THandle;
  jia : array [1..maxJobs] of JOB_INFO_1;
  jiz, jic : DWord; // bytes needed for jia, count of jobs returned
begin
  if OpenPrinter('DocTest', h, nil) then
  begin
    if EnumJobs(h, 0, maxJobs, 1, @jia, SizeOf(jia), jiz, jic) then
    [...]
EnumJobs returns true or false depending on the conditions listed below. Whenever it returns false in one of the following situations, the error message I retrieve is "System Error. Code: 5. Access is denied".
Clearly a permissions problem. I have assigned Print, Manage this Printer, and Manage documents to Everyone in the printer security settings. All jobs were submitted after those settings were assigned. My program runs in a session logged in as the domain administrator.
EnumJobs returns TRUE if I print a job from the same session I'm running this program in, and there's only one job in the queue. (See Update above for change)
EnumJobs returns TRUE if I print from another session on the server (it has terminal services installed) as any user, and there's only one job in the queue. (See Update above for change)
EnumJobs returns FALSE if there is more than one job in the queue. It doesn't matter if the jobs are for the same user or not. (See Update above for change)
EnumJobs returns FALSE if I print a job from another server to the printer share. Both servers are in the same domain. It doesn't matter which user prints the job, including the domain administrator.
What's going on here, in particular getting an access denied when enumerating more than (maxJobs / 4) job(s) at a time?
Ignore the return value of EnumJobs and inspect the out argument pcReturned to see if it's greater than 0. This indicates the number of print jobs found.
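The same check can be sketched in Python with pywin32, which wraps the same Spooler API but sizes the buffer internally, so the fixed-size-array problem above does not arise (the printer name 'DocTest' is reused from the question):

# Hedged sketch: enumerate jobs and print who submitted each one.
import win32print

handle = win32print.OpenPrinter("DocTest")
try:
    # Level 1 returns JOB_INFO_1 records as dicts; an empty result means
    # no jobs, mirroring the "pcReturned > 0" check from the answer above.
    jobs = win32print.EnumJobs(handle, 0, -1, 1)  # 0 = first job, -1 = all
    for job in jobs:
        print(job["JobId"], job["pUserName"], job["pDocument"])
finally:
    win32print.ClosePrinter(handle)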

Get status of a task Elasticsearch for a long running update query

Assume I have a long-running update query where I am updating ~200k to 500k documents, perhaps even more. Why I need to update so many documents is beyond the scope of the question.
Since the client times out (I use the official ES Python client), I would like to have a way to check the status of the bulk update request without having to use enormous timeout values.
For a short request, the response of the request can be used. Is there a way I can get the response of a long request as well, or can I assign a name or ID to a request so as to reference it later?
For a request which is running, I can use the tasks API to get the information.
But for the other statuses - completed/failed - how do I get them?
If I try to access a task which is already completed, I get resource not found.
P.S. I am using update_by_query for the update.
With the task id you can look up the task directly:
GET /_tasks/taskId:1
The advantage of this API is that it integrates with wait_for_completion=false to transparently return the status of completed tasks. If the task is completed and wait_for_completion=false was set on it, then it'll come back with a results or an error field. The cost of this feature is the document that wait_for_completion=false creates at .tasks/task/${taskId}. It is up to you to delete that document.
From here https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update-by-query.html#docs-update-by-query-task-api
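For the OP's setup with the official Python client, the whole flow might look like the following sketch (the index name, query, and script body are placeholders):

# Start the update without holding the connection open: with
# wait_for_completion=False, ES returns a task id immediately.
from elasticsearch import Elasticsearch

es = Elasticsearch()
resp = es.update_by_query(
    index="my-index",  # placeholder index name
    body={
        "query": {"match_all": {}},
        "script": {"source": "ctx._source.counter += 1", "lang": "painless"},
    },
    wait_for_completion=False,
)
task_id = resp["task"]  # e.g. "NCvmGYS-RsW2X8JxEYumgA:1204320"

# Poll the tasks API; once completed, the stored result (or error) is there.
status = es.tasks.get(task_id=task_id)
if status["completed"]:
    print(status["response"])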
My use case went like this: I needed to do an update_by_query and I used Painless as the script language. At first (when testing) I did a reindex, then I tried the update_by_query functionality (they resemble each other a lot). I made a request to the task API (the operation hadn't finished, of course) and saw the task being executed. When it finished, I ran a query, and the data in the fields I was manipulating had disappeared. The script itself worked, since I used the same script with the reindex API and everything went as it should have. I didn't investigate further for lack of time, but... yeah, test thoroughly...
I find GET /_tasks/taskId:1 confusing. It should be
GET http://localhost:9200/_tasks/taskId
A taskId looks something like this: NCvmGYS-RsW2X8JxEYumgA:1204320.
Here is my explanation of this topic.
To check a task, you need to know its taskId.
A task id is a string that consists of a node_id, a colon, and a task_sequence_number. An example is taskId = NCvmGYS-RsW2X8JxEYumgA:1204320, where node_id = NCvmGYS-RsW2X8JxEYumgA and task_sequence_number = 1204320. Some people, including myself, thought taskId = 1204320, but that's not how the Elasticsearch developers understand it at this moment.
A taskId can be found in two ways.
wait_for_completion=false. When you send a request to ES with this parameter, the response will be {"task" : "NCvmGYS-RsW2X8JxEYumgA:1204320"}. Then you can check the status of that task like this: GET http://localhost:9200/_tasks/NCvmGYS-RsW2X8JxEYumgA:1204320
GET http://localhost:9200/_tasks?detailed=false&actions=*/delete/byquery. This example returns the status of all tasks with action = delete_by_query. If you know there is only one such task running on ES, you can find your taskId in the response of all running tasks.
After you know the taskId, you can get the status of the task with:
GET /_tasks/taskId
Notice you can only check the status of a task while it is running, unless the task was started with wait_for_completion=false.
One more detail: wait_for_completion is true by default. Based on my understanding, tasks run with wait_for_completion=true are "in-memory" only. You can still check the status of such a task while it's running, but it's completely gone after it completes or is cancelled, meaning a status check will return a resource_not_found_exception. Tasks run with wait_for_completion=false are stored in an ES system index, .tasks. You can still check their status after they finish. However, you might want to delete the task document from the .tasks index after you are done with it, to save some space. The deletion request looks like this:
DELETE http://localhost:9200/.tasks/task/NCvmGYS-RsW2X8JxEYumgA:1204320
You will receive a resource_not_found_exception if the taskId is not present (for example, if you deleted the task document twice, or if you tried to delete the document of an in-memory task run with wait_for_completion=true).
About this confusing taskId thing, I made a pull request (https://github.com/elastic/elasticsearch/pull/31122) to help clarify the Elasticsearch documentation. Unfortunately, they rejected it. Ugh.
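If you did not capture the task id from the original response, a small sketch with the Python client can recover it from the list of running tasks (the actions wildcard follows the *byquery pattern used in the docs):

# List running *_by_query tasks; the task ids are the keys of each
# node's "tasks" object, in node_id:task_sequence_number form.
from elasticsearch import Elasticsearch

es = Elasticsearch()
tasks = es.tasks.list(detailed=False, actions="*byquery")
for node_id, node in tasks["nodes"].items():
    for task_id in node["tasks"]:
        print(task_id)  # e.g. NCvmGYS-RsW2X8JxEYumgA:1204320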

Algorithm for checking timeouts

I have a finite set of tasks that need to be completed by clients. Clients get assigned a task on connection and keep getting new tasks after they finish the previous one. Each task needs to be completed by 3 unique clients. This makes sure that clients do not give wrong results for the tasks.
However, I don't want a client to take longer than 3000 ms. As some tasks are dependent on each other, this could stall progress.
The problem is that I'm having trouble checking the timeouts of tasks - which should be done when no free tasks are available.
At the moment each task has a property called assignedClients, which looks as follows:
assignedClients: [
  { client: Client, start: Date, completed: true },
  { client: Client, start: Date, completed: true },
  { client: Client, start: Date, completed: false }
]
All tasks (roughly 1000) are stored in a single array. Basically, when a client needs a new task, the pseudo-code looks like this:
function onTaskRequest(client):
  for (task in tasks):
    if (task.assignedClients.length < 3):
      assignClientToTask(task, client)
      return
  // so no available tasks; look for timed-out assignments
  for (task in tasks):
    for (assigned in task.assignedClients):
      if (assigned.completed === false && Time.now() - assigned.start > 3000):
        removeOldClientFromAssignedClients(task, assigned)
        assignClientToTask(task, client)
        return
But this seems very inefficient. Is there an algorithm that is more effective?
What you want to do is store tasks in a priority queue (often implemented as a heap), keyed by when they become available, oldest first. When a client needs a new task, you just peek at the top of the queue: if anything can be scheduled at all, it can be scheduled on that task.
When a task is first inserted, it is given the current time as its priority. When a task's client list fills up, you reinsert it with a priority equal to the expiry time of the oldest client that grabbed it.
If you're using a heap, all operations should be no worse than O(log(n)), compared to your current O(n) implementation.
Your data structure looks like JSON, in which case https://github.com/adamhooper/js-priority-queue is the first JavaScript priority-queue implementation that turned up when I searched Google. Your pseudocode looks like Python, in which case https://docs.python.org/3/library/heapq.html is in the standard library. If you can't find an implementation in your language, https://en.wikipedia.org/wiki/Heap_(data_structure) should help you figure out how to implement one.
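To make the idea concrete, here is a minimal sketch with Python's heapq; it tracks only an availability time per task, the names (push_task, on_task_request, TIMEOUT) are invented, and the three-unique-clients bookkeeping is left out:

# Heap entries are (available_at, task_id); the earliest-available task
# is always on top, so both peek and reschedule are cheap.
import heapq
import time

TIMEOUT = 3.0  # the question's 3000 ms limit, in seconds
heap = []

def push_task(task_id):
    heapq.heappush(heap, (time.time(), task_id))  # available right now

def on_task_request():
    if not heap:
        return None
    available_at, task_id = heap[0]  # peek at the earliest-available task
    if available_at > time.time():
        return None  # even the oldest task isn't assignable yet
    heapq.heappop(heap)
    # Reinsert with the assignment's expiry time, so the task becomes
    # grabbable again automatically if this client times out.
    heapq.heappush(heap, (time.time() + TIMEOUT, task_id))
    return task_id

Peek is O(1) and push/pop are O(log n), which is where the answer's O(log(n)) bound comes from.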

Quartz.NET: CronTrigger on an IStatefulJob instance *delays instead of skips* when the schedule matures while the job is running

Greetings, your friendly neighborhood Quartz.NET n00b is back!
I have a Windows Service running IStatefulJob instances on a Quartz.NET CronTrigger-based schedule... The CRON string used to schedule the job: "0 0/1 * * * ? *"
Everything works great. However, if I have a job that is set to run, say, at the :00 mark of every minute, and that job happens to run for MORE than a minute, I notice that the subsequent job runs IMMEDIATELY after the running job finishes, rather than waiting until its next scheduled run, effectively "queuing up" instead of merely skipping ahead to the next scheduled run.
I put a CronTrigger MisfireInstruction of DONOTHING on the trigger, but exactly the same thing happens when a job overruns its next scheduled execution.
How do I get an IStatefulJob instance to merely SKIP a scheduled execution trigger if it is currently running, rather than have it delayed until the first execution completes?
I explicitly set trigger.MisfireInstruction = MisfireInstruction.CronTrigger.DoNothing;
...But instead of "doing nothing", for a job scheduled to run every minute that takes 90 seconds to complete, I see the following execution log:
Job runs at 9:00:00am, finishes at 9:01:30am <- job runs for 1:30
Job runs at 9:01:30am, finishes at 9:03:00am <- subsequent job that should have run at 9:01:00
Job runs at 9:04:00am, finishes at 9:05:30am <- shouldn't this one have run at 9:03:00?
Job runs at 9:05:30am, finishes at 9:07:00am <- subsequent job that should have run at 9:05:00
Job runs at 9:08:00am, finishes at 9:09:30am <- shouldn't this have run at 9:07:00?
... it seems to run correctly the first time, on the minute... it delays for 30 seconds while the 90-second job execution finishes, and then, instead of waiting until the NEXT full minute, it EXECUTES IMMEDIATELY at the 30-second mark... Doubly odd, it then finishes that SECOND job on the minute mark, but waits until the NEXT minute mark to execute again instead of running back-to-back...
It pretty much seems to work correctly EVERY OTHER RUN, when it is not starting on the :30 marks...
What's the best way to get a job not to delay/queue, but to just SKIP until it is idle and the next schedule matures?
EDIT: I tried going back to IJob instead of IStatefulJob, using the same DONOTHING trigger misfire instruction, but the job executes EVERY MINUTE despite the prior execution still being active. I can't seem to get it to skip a scheduled run if it is currently running with either IJob or IStatefulJob...
EDIT #2: I think my triggers are NEVER misfiring, which is why DoNothing as a misfire instruction is useless... Given that's the case, I guess I need another mechanism to detect whether a job instance of a schedule is running, to make the job SKIP its next execution until the following scheduled time rather than delaying it until the first instance completes...
EDIT #3: I tried adding an element to the IStatefulJob JobDataMap called "IsRunning"... I set it to TRUE when the execute sequence starts and return it to FALSE after job completion. Before executing, the job checks the element, which is persisted between runs, and quits prematurely (logging "JOB SKIPPED!") if it finds it TRUE... Unfortunately this doesn't work, for probably obvious reasons: if the jobs run following the schedule listed above, the job is never SIMULTANEOUSLY running alongside itself, since the run is delayed until the previous one ends, so this check is useless. According to the documentation, going back to IJob from IStatefulJob would not help here either, as the JobDataMap is only persisted between runs in the stateful job type...
I still haven't solved how to SKIP a scheduled job instead of delaying it until its current iteration completes... If anyone has ideas, you're a lifesaver! :)
It should be caused by the misfireThreshold of the RAMJobStore (http://quartznet.sourceforge.net/apidoc/topic2722.html):
The time span by which a trigger must have missed its next-fire-time, in order for it to be considered "misfired" and thus have its misfire instruction applied.
It is 60 seconds by default, so a job isn't considered "misfired" until it is late by more than the misfireThreshold value.
To resolve the problem, just decrease this threshold (the code below sets it to 1 ms):
...
properties["quartz.jobStore.misfireThreshold"] = "1";
...
schedulerFactory = new StdSchedulerFactory(properties);
It should resolve the issue.
