Ruby - Timeout when the Queue is not decreasing - ruby

I have a program that uses multiple threads and a queue. The threads shrink the queue (whenever an item is processed successfully), but there are certain items (I don't know which ones up front) that simply cannot be processed, so they never leave the queue.
So say I have a queue of size 700, it goes down to 670, 340, 20...and then it gets stuck there.
Is there some way to tell Ruby to do something in case the queue size stays the same for more than X seconds?
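One possible approach is a watchdog that polls the queue size and reacts once the size has stayed unchanged for X seconds. The sketch below is only an illustration: `watch_queue`, `stall_after` and `poll_every` are made-up names, and it assumes that "stuck" means the size stays constant (it would also fire if producers and consumers happened to balance exactly).

```ruby
# Hypothetical helper, not part of Ruby's standard library.
# Polls queue.size and yields the stuck size once it has not changed for
# `stall_after` seconds. Returns :drained if the queue empties first,
# otherwise :stalled.
def watch_queue(queue, stall_after:, poll_every: 0.5)
  last_size = queue.size
  last_change = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  until queue.empty?
    sleep poll_every
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    size = queue.size
    if size != last_size
      # Progress was made: remember the new size and restart the stall timer.
      last_size = size
      last_change = now
    elsif now - last_change >= stall_after
      yield size if block_given?
      return :stalled
    end
  end
  :drained
end
```

The watchdog could run in the main thread while the workers run in their own threads, for example `watch_queue(queue, stall_after: 30) { |n| warn "queue stuck at #{n} items" }`.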


Efficient polling algorithm to poll a queue count. Poll more often as the number gets closer to the target

I have a loop that polls a queue waiting for the queue to reach a target value. Once the queue reaches that target value then I initiate another action.
At the moment I have set the polling interval to 5 seconds, but I would like to vary the polling frequency depending on how far I am from the target. Say my target value is 10000 messages on the queue and I start polling. The first count is, say, 5 items. I sleep for 5 seconds, poll again, and the count is 20 items, and so on. What I would like instead is for the first poll, which returns a count of 5, to be followed by a longer sleep because I am so far from the target value. As I get close to the target I want to poll more often, so that as I approach a count of 10000 my delay is perhaps a second.
What would be an efficient algorithm for implementing that?
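Two simple ways to compute such a delay are sketched below, as an illustration rather than a definitive answer. The function names and the 1 to 30 second bounds are assumptions. The first scales the delay linearly with the remaining distance to the target; the second assumes you can measure the fill rate and sleeps for half the estimated time to reach the target.

```ruby
# Delay shrinks linearly from max_delay (queue empty) to min_delay (target reached).
def poll_delay(count, target, min_delay: 1.0, max_delay: 30.0)
  return min_delay if target <= 0
  remaining = [target - count, 0].max
  fraction = [remaining.to_f / target, 1.0].min
  min_delay + (max_delay - min_delay) * fraction
end

# Alternative: sleep for half the estimated time to reach the target,
# based on an observed fill rate (items per second), clamped to the bounds.
def rate_based_delay(count, target, rate_per_sec, min_delay: 1.0, max_delay: 30.0)
  return max_delay if rate_per_sec <= 0
  eta = [target - count, 0].max / rate_per_sec.to_f
  (eta / 2).clamp(min_delay, max_delay)
end
```

The rate-based variant adapts better when the queue fills in bursts, because the delay follows the measured speed rather than only the distance.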

Find an algorithm to dynamically optimize the size of the queue

Current state of the world.
I have a fixed size Queue of items.
Consumers get items from the Queue at different rates. The rate changes over time, e.g. 10 requests per minute today, 20 requests per minute tomorrow, and so on.
Items are put on the Queue to keep up with the desired size of the Queue.
The rate of putting items into the queue is also not constant over time. It can vary.
Find a queue size that ensures the queue is never empty.
With a fixed-size queue, we would need to keep the queue very large to reduce the chance of it running dry.
So a better approach would be to change the size of the queue dynamically, so that based on the rate of the consumers and the rate of the producers we find an optimal queue size such that:
We avoid running out of items on the Queue
We avoid wasting items by keeping too many of them on the queue.
Maybe the ideal size is one that keeps the Queue 25% away from being empty.
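One way to make this concrete, as a rough sketch only: smooth the observed consumption rate with an exponential moving average and size the queue to cover the time it takes to refill it, plus a safety margin. Everything here is an assumption for illustration: the `QueueSizer` class, the `lead_time_sec` parameter (how long a refill takes), the smoothing factor `alpha`, and reading the 25% figure as a safety margin.

```ruby
# Hypothetical sizing helper: target = smoothed rate * refill lead time * (1 + safety).
class QueueSizer
  def initialize(lead_time_sec:, safety: 0.25, alpha: 0.2, min_size: 1)
    @lead_time_sec = lead_time_sec
    @safety = safety
    @alpha = alpha        # weight given to the newest rate sample
    @min_size = min_size
    @rate = nil           # smoothed consumption rate, items per second
  end

  # Feed the number of items consumed during the last measurement interval.
  def observe(consumed, interval_sec)
    sample = consumed.to_f / interval_sec
    @rate = @rate.nil? ? sample : @alpha * sample + (1 - @alpha) * @rate
    self
  end

  # Queue size that should survive one refill period at the smoothed rate.
  def target_size
    return @min_size if @rate.nil?
    [(@rate * @lead_time_sec * (1 + @safety)).ceil, @min_size].max
  end
end
```

A lower `alpha` reacts more slowly to spikes but gives a steadier target; the producer side would then refill toward `target_size` instead of a fixed number.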

how to optimally use nifi wait processor

I am currently creating a flow in which I will be merging the results of 10K HTTP responses. I have a couple of questions (please refer to the image below; my questions are numbered as in the image).
1) As queue is becoming too long, is it ok to put "concurrent task" as 10 for invokeHTTP? what should drive this? # of cores on the server?
2) wait is showing quite a big number, is this just # of bytes it is writing? or is this using that much memory? if this is just a write, then I might be ok...but if it is some internal queue, then soon I may run out of memory?
does it make sense to reduce this number? by increasing "Run Schedule" from 0 to say 20 sec?
3) What exactly is "Back Pressure Data Size Threshold"? The value is set to 1 GB. Does it mean that if the size of the flow files in the queue exceeds that, NiFi will start dropping them? Or will it somehow stop the upstream processor from processing?
1) Yes, increasing concurrent tasks on InvokeHTTP would probably make sense. I wouldn't jump right to 10, but would test increasing from 1 to 2, 2 to 3, etc. until it seems to be working better. Concurrent tasks is the number of threads that can concurrently execute the processor. The total number of threads for your NiFi instance is defined in the controller settings (top-right menu) under Timer Driven threads; you should set the timer driven threads based on the number of CPUs/cores you have.
2) The stats on the processor are totals for the last 5 mins, so "In" is the total size of all the flow files that have come in to the processor in the last 5 mins. You can see "Out" is almost the same # which means almost all the flow files in have also been transferred out.
3) Back-pressure stops the upstream processor from executing until the back pressure threshold is reduced. The data size threshold is saying "when the total size of all flow files in the queue exceeds 1GB, then stop executing the upstream processor so that no more data enters the queue while the downstream processor works on the queue". In the case of a self-loop connection, I think back-pressure won't stop the processor from executing otherwise it will end up in a dead-lock where it can't produce more data but also can't work off the queue. In any case, data is never dropped unless you set flow file expiration on the queue.

Suggestion for Oracle AQ dequeue approach

I have a need to dequeue messages coming from an Oracle Queue on a continuous basis.
As far as I can tell, we can dequeue the messages in two ways: either through the asynchronous auto-notification approach, or through a manual polling process where one can dequeue one message at a time.
I can't go for the asynchronous notification feature, as the number of messages received could go up to 1000 within 5 minutes during peak hours and
I do not want to overload the database by spawning multiple callback procedures in the background.
With the manual polling process, I can create a one-time scheduler job that runs 24*7 and calls a stored procedure that dequeues the messages in a loop in WAIT mode (effectively listening for a message).
The problem with this approach is that
1) the scheduler job runs continuously and occupies one permanent job slot
2) the stored procedure does not EXIT as it runs in a loop waiting for messages.
Are there any alternative/better solutions where I do not need to have a job/procedure running continuously looking for messages?
Can I use the auto-notification approach to get a notification for the very first message, unsubscribe the subscriber, dequeue the remaining messages, and
subscribe to the queue again when there are no more messages? Is this a safe approach, and will I lose any messages between unsubscribing and subscribing again?
BTW, We use Oracle 10gR2 database, so I can't use PURGE ON NOTIFICATION option.
Appreciate your expert solution!!
You're right, it's not a good idea to use auto-notification for a high-volume queue.
At one client I've seen a one-time scheduler job which runs 24*7, it seems to work reasonably well, and they can enqueue a special "STOP" message (which goes to the top of the queue) that it listens for and stops processing messages.
However, generally I'd lean towards a job that runs regularly (e.g. once per minute, or whatever granularity is suitable for you) which would dequeue all the messages. I'd put the dequeue in a loop with a loop counter and a "maximum messages" limiter based on the maximum number of messages you'd expect in a 1-minute period. The job would keep processing messages until (a) there are no more messages in the queue, or (b) the maximum limit has been reached.
You can then set the schedule for the job based on the maximum delay you want to see between an enqueue and a dequeue. E.g. if it doesn't matter if a message isn't processed within 5 minutes, you could set the job to run once every 5 minutes.
The maximum limit needs to be quite a high figure - e.g. 10x or 100x the expected maximum number - otherwise a spike could flood your queue and it might not keep up. The idea of the maximum limit is to ensure that the job never runs forever. This should give ops enough time to detect a problem with the queue (e.g. if some rogue process is flooding the queue with bogus messages).
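The bounded loop described above can be sketched as follows. This is not Oracle AQ code: an in-memory Ruby `Queue` stands in for the real queue, the non-blocking `pop(true)` stands in for a no-wait dequeue, and `drain` and `max_messages` are hypothetical names.

```ruby
# One scheduled run: dequeue until the queue is empty or the limit is hit.
# Returns how many messages were processed and why the run stopped.
def drain(queue, max_messages:)
  processed = 0
  while processed < max_messages
    begin
      message = queue.pop(true) # non-blocking; raises ThreadError when empty
    rescue ThreadError
      return { processed: processed, reason: :empty }
    end
    yield message
    processed += 1
  end
  # The limit guarantees that a single run can never loop forever.
  { processed: processed, reason: :limit_reached }
end
```

A run that ends with `:limit_reached` is a signal worth logging, since it suggests the queue is being filled faster than expected.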

Searching an algorithm similar to producer-consumer

I would like to ask if someone has an idea of the best (fastest) algorithm for the following scenario:
X processes generate a list of very large files. Each process generates one file at a time
Y processes are being notified that a file is ready. Each Y process has its own queue to collect the notifications
At a given time, 1 X process will notify 1 Y process through a Load Balancer that uses the round-robin algorithm
Each file has a size and naturally, bigger files will keep both X and Y more busy
Once a file gets on a Y process it would be impractical to remove it and move it to another Y process.
I can't think of other limitations at the moment.
Disadvantages to this approach
sometimes X falls behind (files are no longer pushed). This is not really caused by the queueing system; no matter how I change it, X will still have slow and good periods.
sometimes Y falls behind (a lot of files gather in the queues). Again, the same applies as before.
1 Y process is busy with a very large file. It also has several small files in its queue that could be taken on by other Y processes.
The notification itself goes over HTTP and is sometimes unreliable. Notifications fail and debugging has not revealed anything.
There are some more details that would help to see the picture more clearly.
Y processes are DB threads/jobs
X processes are web apps
Once files reach the X processes, these would also burn resources from the DB side by querying it. It has an impact on the producing part
Now I considered the following approach:
X will produce files like it has before but will not notify Y. It will hold a buffer (table) to populate the file list
Y will constantly search for files in the buffer and retrieve them itself and store them in its own queue.
Now would this change be practical? Like I said, each Y process has its own queue, it doesn't seem to be efficient to keep it anymore. If so, then I'm still undecided on the next bit:
How to decide which files to fetch
I've read through the knapsack problem and I think that has application if I would have the entire list of files from the beginning which I don't. Actually, I do have the list and the size of each file but I wouldn't know when each file would be ready to be taken.
I've gone through the producer-consumer problem but that centers around a fixed buffer and optimising that but in this scenario the buffer is unlimited and I don't really care if it is large or small.
The next best option would be a greedy approach where each Y process locks on the smallest file and takes it. At first it does appear to be the fastest approach and I'm currently building a simulation to verify that but a second opinion would be fantastic.
Just to be sure that everyone gets the big picture, I'm linking here a fast-done diagram.
Jobs are independent from Processes. They will run at a speed and process how many files are possible.
When a Job finishes with a file it will send a HTTP request to the LB
Each process queues requests (files) coming from the LB
The LB works on a round robin rule
The current LB idea is not good
The load balancer as you've described it is a bad idea because it's needlessly required to predict the future, which you are saying is impossible. Moreover, round-robin is a terrible scheduling strategy when jobs have varying lengths.
Just have consumers notify the LB when they're idle. When a new job arrives from a producer, it selects the first idle consumer and sends the job there. When there are no idle consumers, the producer's job is queued in the LB waiting for a free consumer to appear.
This way consumers will always be optimally busy.
You say "Having one queue to serve 100 apps (for example) would be inefficient." This is a huge leap of intuition that's probably wrong. A work queue that's only handling file names can be fast. You need it only to be 100 times faster (because you imply there are 100 consumers) than the average "very large file" handling operation. File handling normally takes tenths of a second or seconds. A queue handler based on, say, an Apache mod or Redis (to pick two examples at random) could pretty easily serve 10,000 requests per second. That is a factor of 10 away from being a bottleneck.
If you select from idle consumers on a FIFO basis, the behavior will be round-robin when all jobs are equal length.
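The idle-notification scheme can be sketched with two FIFO lists, one for idle consumers and one for waiting jobs. This is a minimal single-threaded illustration; the `Dispatcher` class and its method names are hypothetical, and a real load balancer would need locking and failure handling.

```ruby
# Jobs go to the first idle consumer; if none is idle, they wait in the LB.
class Dispatcher
  attr_reader :assignments

  def initialize
    @idle = []         # FIFO of consumer ids that reported idle
    @pending = []      # FIFO of jobs waiting for a free consumer
    @assignments = []  # log of [consumer_id, job] pairs, for illustration
  end

  # Called when a consumer reports that it has nothing to do.
  def consumer_idle(id)
    if @pending.empty?
      @idle << id
    else
      assign(id, @pending.shift)
    end
  end

  # Called when a producer announces a new job (e.g. a file name).
  def submit(job)
    if @idle.empty?
      @pending << job
    else
      assign(@idle.shift, job)
    end
  end

  private

  def assign(id, job)
    @assignments << [id, job]
  end
end
```

Because a consumer only receives work after it reports idle, a consumer busy with a very large file never accumulates small files behind it.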
If the LB absolutely cannot queue work
Then let Ty(t) be the total future time needed to complete the work in the queue of consumer y at the current epoch t. The LB's goal is to make Ty(t) values equal for all y and t. This is the ideal.
To get as close as possible to the ideal, it needs an internal model to compute these Ty(t) values. When a new job arrives from a producer at epoch t, it finds the consumer y with the minimum Ty(t) value, assigns the job to this y, and adjusts the model accordingly. This is a variation of the "least time remaining" scheduling strategy, which is optimal for this situation.
The model must inevitably be an approximation. The quality of the approximation will determine its usefulness.
A standard approach (e.g. from OS scheduling) is to maintain a pair [t, T]_y for each consumer y, where T is the estimate of Ty(t) that was computed at the past epoch t. At a later epoch t+d, we can then estimate Ty(t+d) as max(T-d, 0). The max is needed because for d > T the estimated job time has expired, so the consumer should have finished.
The LB uses whatever information it can get to update the model. Examples are estimates of the time a job will require (from your description, probably based on file size and other characteristics), notification that the consumer has actually finished a job (the LB decreases T by the estimated duration of the completed job and updates t), assignment of a new job (the LB increases T by the estimated duration of the new job and updates t), and intermediate progress updates of estimated time remaining from consumers during long jobs.
If the information available to the LB is detailed, you will want to replace the total time T in the [t, T]_y pair with a more complete model of the work queued at y: for example a list of estimated job durations, where the head of the list is the one currently being executed.
The more accurate the LB model, the less likely a consumer will starve when work is available, which is what you are trying to avoid.
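A minimal sketch of such a model is shown below, assuming only job-duration estimates are available. It stores one [t, T] pair per consumer and takes the remaining work at a later epoch as the stored estimate minus the elapsed time, floored at zero. The `LoadModel` class and its method names are hypothetical, and completion notifications and progress updates are left out.

```ruby
# Tracks, per consumer, the epoch t of the last update and the estimated
# remaining work T at that epoch; assigns each new job to the least loaded one.
class LoadModel
  def initialize(consumer_ids)
    @state = consumer_ids.to_h { |id| [id, [0.0, 0.0]] } # id => [t, T]
  end

  # Estimated remaining work of consumer `id` at epoch `now`.
  def remaining(id, now)
    t, total = @state.fetch(id)
    [total - (now - t), 0.0].max
  end

  # Picks the consumer with the least estimated remaining work and adds the job.
  def assign(estimated_duration, now)
    id = @state.keys.min_by { |k| remaining(k, now) }
    @state[id] = [now, remaining(id, now) + estimated_duration]
    id
  end
end
```

Epochs and durations are plain numbers in the same unit (seconds here); how good the assignments are depends entirely on how good the duration estimates are.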
