Google App Engine Task Queue - performance

I want to run 50 tasks. All these tasks execute the same piece of code. Only difference will be the data. Which will be completed faster ?
a. Queuing up 50 tasks in a queue
b. Queuing up 5 tasks each in 10 different queue
Is there any ideal number of tasks that can be queued up in 1 queue before using another queue ?

The rate at which tasks are executed depends on two factors: the number of instances your app is running on, and the execution rate of the queue the tasks are on.
The maximum task queue execution rate is now 100 per queue per second, so that's not likely to be a limiting factor - so there's no harm in adding them to the same queue. In any case, sharding between queues for more execution rate is at best a hack. Queues are designed for functional separation, not as a performance measure.

The bursting rate of task queues is controlled by the bucket size. If there is a token in the queue's bucket the task should run immediately. So if you have:
queue:
- name: big_queue
rate: 50/s
bucket_size: 50
And haven't queue any tasks in a second all tasks should start right away.
see http://code.google.com/appengine/docs/python/config/queue.html#Queue_Definitions for more information.
Splitting the tasks into different queues will not improve the response time unless the bucket hadn't had enough time to completely fill with tokens.

I'd add another factor into the mix- concurrency. If you have slow running (more than 30 seconds or so) tasks, then AppEngine seems to struggle to scale up the correct number of instances to deal with the requests (seems to max out about 7-8 for me).
As of SDK 1.4.3, there's a setting in your queue.xml and your appengine-web.config you can use to tell AppEngine that each instance can handle more than one task at a time:
<threadsafe>true</threadsafe> (in appengine-web.xml)
<max-concurrent-requests>10</max-concurrent-requests> (in queue.xml)
This solved all my problems with tasks executing too slowly (despite setting all other queue params to the maximum)
More Details (http://blog.crispyfriedsoftware.com)

Queue up 50 tasks and set your queue to process 10 at a time or whatever you would like if they can run independently of each other. I see a similar problem and I just run 10 tasks at a time to process the 3300 or so that I need to run. It takes 45 minutes or so to process all of them but the CPU time used is negligible surprisingly.

Related

Does the Windows scheduler sometimes fail to preempt a running thread immediately to let a higher-priority thread run?

My application operates on pairs of long vectors - say it adds them together to produce a vector result. Its rules state that it must completely finish with one pair before it can be given another. I would like to use multiple threads to speed things up. I am running Windows 10.
I created an OpenMP parallel for construct and divided the vector among all the threads of the team. All threads start, all threads run pretty fast, so the multithreading is effective.
But the speedup is slight, and the reason is that some of the time, one of the worker threads takes way longer than usual. I have instrumented the operation, and I see that sometimes the worker threads take a long time to start - delay varies from 20 microseconds on average to dozens of milliseconds depending on system load. The master thread does not show this delay.
That makes me think that the scheduler is taking some time to start the worker threads. The master thread is already running, so it doesn't have to wait to be started.
But here is the nub of the question: raising the priority of the process doesn't make any difference. I can raise it to high priority or even realtime priority, and I still see that startup of the worker threads is often delayed. It looks like the Windows scheduler is not fully preemptive, and sometimes lets a lower-priority thread run when a higher-priority one is eligible. Can anyone confirm this?
I have verified that the worker threads are created with the default OS priority, namely the base priority of the class of the master process. This should be higher that the priority of any running thread, I think. Or is it normal for there to be some thread with realtime priority that might be blocking my workers? I don't see one with Task Manager.
I guess one last possibility is that the task switch might take 20 usec. Is that plausible?
I have a 4-core system without hyperthreading.

Spring batch multithreading: throttle-limit impact

I have a multi-threaded Step configured with a threadpool with a corePoolSize of 48 threads (it's a big machine) but I did not configure the throttle-limit.
I am wondering if I have been under utilizaing the machine because of this.
The Spring Batch documentation says that throttle-limit is the max amount of concurrent tasks that can run at one time and the default is 4.
I can see in jconsole that in fact there are 48 threads created and they seem to be executing (I can also see that in my logs).
But, so, even though I can see the 48 threads created, does the throttle-limit of 4 mean that only 4 of those 48 threads are indeed executing work concurrently?
Thank you in advance.
Yes, your understanding is correct i.e. only threads equal to throttle limit be doing work concurrently.
In your case, since its a thread - pool , any four threads could be chosen randomly to do the work and rest of threads will remain idle but since threads get rotated for those four tasks, it will give an impression that all threads are doing work concurrently.
corePoolSize simply indicates the number of threads to be started and maintained during job run but that doesn't mean that all are running concurrently what it means that you are trying to avoid thread creation overhead etc during job run.
You have not shared any code or job structure so its hard to point any more specifics.
Hope it helps !!

What is a good way to design and build a task scheduling system with lots of recurring tasks?

Imagine you're building something like a monitoring service, which has thousands of tasks that need to be executed in given time interval, independent of each other. This could be individual servers that need to be checked, or backups that need to be verified, or just anything at all that could be scheduled to run at a given interval.
You can't just schedule the tasks via cron though, because when a task is run it needs to determine when it's supposed to run the next time. For example:
schedule server uptime check every 1 minute
first time it's checked the server is down, schedule next check in 5 seconds
5 seconds later the server is available again, check again in 5 seconds
5 seconds later the server is still available, continue checking at 1 minute interval
A naive solution that came to mind is to simply have a worker that runs every second or so, checks all the pending jobs and executes the ones that need to be executed. But how would this work if the number of jobs is something like 100 000? It might take longer to check them all than it is the ticking interval of the worker, and the more tasks there will be, the higher the poll interval.
Is there a better way to design a system like this? Are there any hidden challenges in implementing this, or any algorithms that deal with this sort of a problem?
Use a priority queue (with the priority based on the next execution time) to hold the tasks to execute. When you're done executing a task, you sleep until the time for the task at the front of the queue. When a task comes due, you remove and execute it, then (if its recurring) compute the next time it needs to run, and insert it back into the priority queue based on its next run time.
This way you have one sleep active at any given time. Insertions and removals have logarithmic complexity, so it remains efficient even if you have millions of tasks (e.g., inserting into a priority queue that has a million tasks should take about 20 comparisons in the worst case).
There is one point that can be a little tricky: if the execution thread is waiting until a particular time to execute the item at the head of the queue, and you insert a new item that goes at the head of the queue, ahead of the item that was previously there, you need to wake up the thread so it can re-adjust its sleep time for the item that's now at the head of the queue.
We encountered this same issue while designing Revalee, an open source project for scheduling triggered callbacks. In the end, we ended up writing our own priority queue class (we called ours a ScheduledDictionary) to handle the use case you outlined in your question. As a free, open source project, the complete source code (C#, in this case) is available on GitHub. I'd recommend that you check it out.

Heroku: Why does increasing number of worker dynos increase processing time of job?

I'm seeing strange behavior when scaling Delayed::Job workers on Heroku.
I have a few thousand jobs that are all basically identical. When I assign 1 worker dyno to that queue, each job completes in about 4s.
When I scale the number of workers to 2, processing time averages 8s per job
When I scale the number of workers to 10, average processing time per job increases to above 30s per job.
I would not expect processing time per job to increase when scaling the number of workers.
As it is currently behaving, there is no way to scale up the number of workers to "churn through" a backlog of jobs, as the increase in processing time offsets any gains in having more workers.
Has anyone else seen this behavior and (more importantly) know how to resolve the issue?
Do you have any metrics on the database processing time? Seems possible that the bottleneck could be in the database engine and so no matter how many workers you have, you'd still be locked up there...

Hadoop Fairschduler doesn't utilize all map slots

Running a 12-node hadoop cluster with total 48 map-slots available. Submitting bunch of jobs, but never see all map slots being utilized. Maximum number of busy slots is floating around 30-35, but never close to 48. Why?
Here's the configuration of fairscheduler.
<?xml version="1.0"?>
<allocations>
<pool name="big">
<minMaps>10</minMaps>
<minReduces>10</minReduces>
<maxRunningJobs>3</maxRunningJobs>
</pool>
<pool name="medium">
<minMaps>10</minMaps>
<minReduces>10</minReduces>
<maxRunningJobs>3</maxRunningJobs>
<weight>3.0</weight>
</pool>
<pool name="small">
<minMaps>20</minMaps>
<minReduces>20</minReduces>
<maxRunningJobs>20</maxRunningJobs>
<weight>100.0</weight>
</pool>
</allocations>
The idea is that jobs in small queue should always have a priority, the next important queue is 'medium' and the less important is 'big'. Sometimes I see jobs in medium or big queue starve although there are more map slots available that are not used.
I think that the issue can be caused because the maxRunningJobs option is not taken into account while computing shares for jobs. I think that parameter is handled after slots (from the exceeding job) has been already assigned to a tasktracker. That is happening every n seconds from the UpdateThread.update()-> update Runability() method from FairScheduler class. I suppose that in your case after some time jobs from “medium” and “big” pool gets a bigger deficit than jobs from the “small” pool, that means that the next task will be scheduled from the job in medium or big pool. When the task is scheduled the restriction of maxRunningJobs take place and puts the exceeding jobs into a non runnable state. The same thing appears on the following update.
This is just my guess after looking after some source of fscheduler. If you can I would probably try to remove maxRunningJobs from the config and see how the scheduler behaves without that limitation and if it takes all of your slots..
Weigths for the pools in my oppinion seems to be to high. Weigh of 100 would mean that this pool should get 100x more slots than the default pool. I would try to lower this number by few factors if you want to have fair sharing between your pools. Otherwise jobs from others pools will be launched just when they will meet their deficit (it is calculated from the running tasks and minShare)
Another option why jobs are starving is maybe because of delay scheduling that is included in the fsched with the aim of improving computation locality? This can be probably improved by increasing a repclication factor but I do not think this is your case..
some docs on the fairscheduler..
The starvation probably occurs because the priority of the small pool is really really high (2^100 more than big 2^97 more than medium). When all the jobs are are ordered by priority and you have waiting jobs in the small pool. The next job in that pool needs 20 slots and it has higher priority than anything else so the open slots just wait there until a currently running job will free them. there are no "unneeded slots" to divide to other priorities
see highlights from the implementation notes of the fair schedulere:
"The fair shares are calculated by dividing the capacity of the
cluster among runnable jobs according to a "weight" for each job. By
default the weight is based on priority, with each level of priority
having 2x higher weight than the next (for example, VERY_HIGH has 4x
the weight of NORMAL). However, weights can also be based on job sizes
and ages, as described in the Configuring section. For jobs that are
in a pool, fair shares also take into account the minimum guarantee
for that pool. This capacity is divided among the jobs in that pool
according again to their weights."
Finally, when limits on a user's running jobs or a pool's running jobs
are in place, we choose which jobs get to run by sorting all jobs in
order of priority and then submit time, as in the standard Hadoop
scheduler. Any jobs that fall after the user/pool's limit in this
ordering are queued up and wait idle until they can be run. During
this time, they are ignored from the fair sharing calculations and do
not gain or lose deficit (their fair share is set to zero).

Resources