I have a Ruby daemon that selects 100 records from a database and performs a task on them.
To make it faster, I usually run 3 instances of the same daemon, and each one selects different data using MySQL LIMIT and OFFSET.
The problem is that sometimes a task is performed 2 or 3 times on the same record.
So I think relying on LIMIT and OFFSET alone is not enough, since 2 or more daemons can select the same data at the same time.
How can I do this safely, so that no 2 instances select the same data?
Daemon 1 => selects records from 1 to 100
Daemon 2 => selects records from 101 to 200
Daemon 3 => selects records from 201 to 300
Rather than rolling your own solution, you might want to look at existing solutions for processing background jobs, like Resque (a personal favorite). With Resque, you would enqueue a job for each of your rows using whatever trigger makes sense in your application (it's hard to say without any context), for example a link on your website. At all times you would keep X workers running (three in your case), and Resque does the queue management work for you. Resque uses Redis as a backend, so it supports atomic push/pop out of the box (no more double processing).
Resque also comes with a very intuitive and easy to use web interface for monitoring your jobs and workers.
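If you would rather keep the LIMIT/OFFSET daemons instead of introducing a queue, another common pattern is to have each instance atomically claim its rows before reading them. Here is a minimal sketch using Python with SQLite purely for illustration (the records table and claimed_by column are hypothetical; with MySQL/InnoDB you would run the same UPDATE-then-SELECT inside a transaction and let row locks keep the claims disjoint):

```python
import sqlite3

# Hypothetical schema: a work table with a claimed_by column (NULL = unclaimed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT, claimed_by TEXT)")
conn.executemany("INSERT INTO records (payload) VALUES (?)",
                 [(f"row{i}",) for i in range(300)])
conn.commit()

def claim_batch(conn, worker_id, size=100):
    # Atomically mark a batch as ours, then read back only our rows.
    # The UPDATE only matches unclaimed rows, so two workers can never
    # end up holding the same record.
    with conn:
        conn.execute(
            "UPDATE records SET claimed_by = ? WHERE id IN "
            "(SELECT id FROM records WHERE claimed_by IS NULL LIMIT ?)",
            (worker_id, size),
        )
    return conn.execute(
        "SELECT id, payload FROM records WHERE claimed_by = ?", (worker_id,)
    ).fetchall()

batch1 = claim_batch(conn, "daemon-1")
batch2 = claim_batch(conn, "daemon-2")
# The two batches are guaranteed disjoint.
assert not {r[0] for r in batch1} & {r[0] for r in batch2}
```

Because the claim is a single atomic UPDATE over unclaimed rows, no interleaving of daemon timing can hand the same record to two instances.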
Related
I need to create tens of millions of jobs.
I have tried for-loops and Bus::batch([]), and unfortunately creating the jobs takes longer than the 10 servers/workers need to process them. That means the workers have to wait until jobs show up in the database (Redis etc.). Using redis-benchmark I confirmed that Redis is not the problem.
Anyway... is there a way to create jobs in BULK (not batch)? I'm thinking of something like:
INSERT INTO ... () VALUES (), (), (), (), ...
Creating several million jobs in a for-loop or in a batch seems to be very slow, probably because it's always just 1 query at a time rather than a single bulk insert.
I would be very grateful for any help!
Writing millions of records will be slow under any conditions. I'd recommend maximizing your queue performance with several methods:
Create one job that creates all the other jobs, if possible.
Use QUEUE_CONNECTION=redis for your queues, since Redis stores data in RAM, which is as fast as it gets.
Create your jobs after the response has already been sent, so job creation doesn't block the request.
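To illustrate the bulk-insert idea from the question: the win comes from inserting many rows per statement/transaction instead of issuing one query per job. A rough sketch with Python and SQLite standing in for the jobs table (the table layout is made up; in Laravel you could build the payload rows yourself and insert them in chunks through the query builder instead of dispatching jobs one by one):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, queue TEXT, payload TEXT)")

def enqueue_bulk(conn, payloads, chunk=1000):
    # Insert jobs in chunks, all inside a single transaction, instead
    # of one INSERT + commit per job.
    rows = [("default", json.dumps(p)) for p in payloads]
    with conn:
        for i in range(0, len(rows), chunk):
            conn.executemany(
                "INSERT INTO jobs (queue, payload) VALUES (?, ?)",
                rows[i:i + chunk],
            )

enqueue_bulk(conn, [{"user_id": i} for i in range(10_000)])
count = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
# count == 10000
```

Note that bulk-inserted rows bypass whatever serialization your framework's dispatcher normally does, so you have to write payloads in exactly the shape your workers expect.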
I'm building a web-based CRON service using DynamoDB and Lambda. While I don't currently have the following problem, I'm curious about how I could solve it if it arises.
The architecture works like this:
Lambda A - query for all tasks that should occur in the current minute
Lambda A - for each task, increment a counter on the document
Lambda B - listen for the stream event for each document and run the actual CRON task
As far as I can tell, Lambda B should be scalable - AWS should run as many instances as needed to process all the stream events (I think).
But for Lambda A, say I have 1 billion documents that need to be processed each minute.
When I query for each minute's tasks, the Lambda will need to make multiple requests in order to fetch & update all the documents.
How could I architect the system such that all the documents get processed in < 60 seconds?
You're right, Lambda A would have to do a monster scan/query which wouldn't scale.
One way to architect this would be to partition your cron items so that you can invoke multiple lambdas in parallel (i.e. fan out the work) instead of just one (Lambda A), with each invocation handling a partition (or set of partitions) instead of the whole thing.
How you achieve this depends on what your current primary key looks like and how else you expect to query these items. Here's one solution:
cronID | rangeKey | jobInfo | counter
1001 | 72_2020-05-05T13:58:00 | foo | 4
1002 | 99_2020-05-05T14:05:00 | bar | 42
1003 | 01_2020-05-05T14:05:00 | baz | 0
1004 | 13_2020-05-05T14:10:00 | blah | 2
1005 | 42_2020-05-05T13:25:00 | 42 | 99
I've added a random prefix (00-99) to the rangeKey, so you can have different lambdas query different sets of items in parallel based on that prefix.
In this example you could invoke 100 lambdas each minute (the "Lambda A" types), with each handling a single prefix set. Or you could have say 5 lambdas, with each handling a range of 20 prefixes. You could even dynamically scale the number of lambda invocations up and down depending on load, without having to update the prefixes in your data in your table.
Since these lambdas are basically the same, you could just invoke lambda A the required number of times, injecting the appropriate prefix(es) for each one as a config.
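The fan-out above can be sketched as follows. This assumes the 00-99 prefix scheme from the example table; make_range_key and prefix_ranges are illustrative helper names, not part of any AWS API:

```python
import random

def make_range_key(timestamp):
    # Random shard prefix 00-99 spreads items evenly across the key space.
    return f"{random.randint(0, 99):02d}_{timestamp}"

def prefix_ranges(num_workers, num_prefixes=100):
    # Split the prefixes into contiguous chunks, one chunk per worker,
    # so each "Lambda A" invocation queries a disjoint slice of items.
    base, extra = divmod(num_prefixes, num_workers)
    ranges, start = [], 0
    for w in range(num_workers):
        size = base + (1 if w < extra else 0)
        ranges.append([f"{p:02d}" for p in range(start, start + size)])
        start += size
    return ranges

# 5 workers -> each handles 20 prefixes; scale workers up or down
# without touching the stored keys.
assignments = prefix_ranges(5)
assert [len(r) for r in assignments] == [20, 20, 20, 20, 20]
```

Since the prefixes live in the data but the assignment lives in the invoker, scaling from 5 lambdas to 100 is purely a config change.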
EDIT
Re the 1MB page limit in your comment: you'll get a LastEvaluatedKey back if your query results were truncated. Your lambda can execute queries in a loop, passing the LastEvaluatedKey value back as ExclusiveStartKey until you've retrieved all the result pages.
You'll still need to watch the running time (and catch errors so you can retry, since this is not atomic), but fanning out your lambdas as above will keep the running time down if you fan out widely enough.
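That pagination loop can be sketched generically. query_page below is a stand-in for the actual DynamoDB Query call (with boto3 it would be a table.query(...) passing ExclusiveStartKey); the paging contract is the part that matters:

```python
def fetch_all(query_page):
    # query_page(start_key) must return a dict shaped like a DynamoDB
    # Query response: {"Items": [...], "LastEvaluatedKey": <key>},
    # where LastEvaluatedKey is absent on the final page.
    items, start_key = [], None
    while True:
        page = query_page(start_key)
        items.extend(page["Items"])
        start_key = page.get("LastEvaluatedKey")
        if start_key is None:
            break
    return items

# Fake three-page result set to exercise the loop.
pages = {
    None: {"Items": [1, 2], "LastEvaluatedKey": "k1"},
    "k1": {"Items": [3, 4], "LastEvaluatedKey": "k2"},
    "k2": {"Items": [5]},
}
assert fetch_all(lambda k: pages[k]) == [1, 2, 3, 4, 5]
```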
I'm not sure about your project, but it looks like what you're asking is already covered in the AWS DynamoDB documentation:
When you create a new provisioned table in Amazon DynamoDB, you must specify its provisioned throughput capacity. This is the amount of read and write activity that the table can support. DynamoDB uses this information to reserve sufficient system resources to meet your throughput requirements.
You can create an on-demand mode table instead so that you don't have to manage any capacity settings for servers, storage, or throughput. DynamoDB instantly accommodates your workloads as they ramp up or down to any previously reached traffic level. If a workload's traffic level hits a new peak, DynamoDB adapts rapidly to accommodate the workload.
For more information
You can optionally allow DynamoDB auto scaling to manage your table's throughput capacity. However, you still must provide initial settings for read and write capacity when you create the table. DynamoDB auto scaling uses these initial settings as a starting point, and then adjusts them dynamically in response to your application's requirements.
As your application data and access requirements change, you might need to adjust your table's throughput settings. If you're using DynamoDB auto scaling, the throughput settings are automatically adjusted in response to actual workloads. You can also use the UpdateTable operation to manually adjust your table's throughput capacity. You might decide to do this if you need to bulk-load data from an existing data store into your new DynamoDB table. You could create the table with a large write throughput setting and then reduce this setting after the bulk data load is complete.
You specify throughput requirements in terms of capacity units: the amount of data your application needs to read or write per second. You can modify these settings later, if needed, or enable DynamoDB auto scaling to modify them automatically.
I hope this helps.
I started using Spring a few months ago and I have a question about transactions. I have a Java method inside my Spring Batch job that first does a select to get the first 100 rows with status 'NOT COMPLETED', then updates the selected rows to change their status to 'IN PROGRESS'. Since I'm processing around 10 million records, I want to run multiple instances of my batch job, and each instance has multiple threads.
For a single instance, to make sure two threads don't fetch the same set of records, I have made my method synchronized. But if I run multiple instances of the batch job (multiple JVMs), there is a high probability that the same set of records will be fetched by both instances, even if I use optimistic or pessimistic locking or "select for update", since we cannot lock records during selection. The example below shows this: Transaction 1 fetched 100 records, and meanwhile Transaction 2 fetched the same 100 records. If I enable locking, Transaction 2 waits until Transaction 1 has updated and committed, but then Transaction 2 performs the same update again.
Is there any way in Spring to make Transaction 2's select operation wait until Transaction 1's select has completed?
Transaction1              Transaction2
fetch 100 records
                          fetch 100 records
update 100 records
commit
                          update 100 records
                          commit
@Transactional
public synchronized List<Student> processStudentRecords() {
    List<Student> students = getNotCompletedRecords();
    if (students != null && !students.isEmpty()) {
        updateStatusToInProgress(students);
    }
    return students;
}
Note: I cannot perform the update first and then the select. I would appreciate it if an alternative approach were suggested.
Transaction synchronization should be left to the database server and not managed at the application level. From the database server point of view, no matter how many JVMs (threads) you have, those are concurrent database clients asking for read/write operations. You should not bother yourself with such concerns.
What you should do though is try to minimize contention as much as possible in the design of your solution, for example, by using the (remote) partitioning technique.
if I run multiple instances of my batch job (multiple JVMs), there is a high probability that the same set of records will be fetched by both instances even if I use optimistic or pessimistic locking or "select for update", since we cannot lock records during selection
Partitioning the data removes all these problems by design. If you give each instance its own set of data to work on, there is no chance that a worker will select the same set of records as another worker. Michael gave a detailed example in this answer: https://stackoverflow.com/a/54889092/5019386.
(Logical) partitioning, however, will not solve the contention problem, since all workers still read/write from/to the same table; but that's the nature of the problem you are trying to solve. What I'm saying is that you don't need to start locking/unlocking the table in your design; leave this to the database. Some database servers like Oracle can write data of the same table to different partitions on disk to optimize concurrent access (which might help if you use partitioning), but again, that's Oracle's business, not Spring's (or any other framework's).
Not everybody can afford Oracle, so I would look for a solution at the conceptual level. I have successfully used the following "pseudo" physical partitioning approach for a problem similar to yours:
Step 1 (in serial): copy/partition the unprocessed data into temporary tables
Step 2 (in parallel): run multiple workers on these tables instead of the source table with millions of rows.
Step 3 (in serial): copy/update processed data back to the original table
Step 2 removes the contention problem. Usually, the cost of Step 1 + Step 3 is negligible compared to Step 2 (and even more so compared to running Step 2 in serial). This works well when the processing is the bottleneck.
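The three steps can be sketched end to end. This is a toy illustration with Python and SQLite (all table names are invented); the point is that in Step 2 each worker touches only its own partition table, which is what removes the contention:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO source (status) VALUES (?)",
                 [("NOT COMPLETED",)] * 10)

WORKERS = 2

# Step 1 (serial): spread unprocessed rows across one temp table per worker.
for w in range(WORKERS):
    conn.execute(f"CREATE TABLE part_{w} (id INTEGER PRIMARY KEY)")
    conn.execute(
        f"INSERT INTO part_{w} SELECT id FROM source "
        "WHERE status = 'NOT COMPLETED' AND id % ? = ?",
        (WORKERS, w),
    )

# Step 2 (parallel in real life): each worker reads only its own table,
# so workers never compete for rows in the source table.
for w in range(WORKERS):
    for (row_id,) in conn.execute(f"SELECT id FROM part_{w}"):
        pass  # do the actual processing here

# Step 3 (serial): write the results back to the original table.
for w in range(WORKERS):
    conn.execute(
        f"UPDATE source SET status = 'COMPLETED' "
        f"WHERE id IN (SELECT id FROM part_{w})"
    )

remaining = conn.execute(
    "SELECT COUNT(*) FROM source WHERE status = 'NOT COMPLETED'"
).fetchone()[0]
assert remaining == 0
```

The modulo split is just one way to partition; any deterministic, disjoint assignment of rows to workers gives the same guarantee.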
Hope this helps.
I have an application containing a Java EE timer that runs on 2 clustered nodes. The task of the timer is to fetch batches of 100 records from a common table (MySQL database) and process them. Since the timer runs on both nodes, both process the same records. How can I configure it so that the 1st timer takes the 1st batch of 100 records and the 2nd timer takes the next batch of 100 records?
I have tried putting a pessimistic write lock on the query and setting the max results to 100 in the Java code, but the performance is poor.
Is there any way to handle this scenario with better performance?
I don't use Advanced Queuing at all, but the number of AQ$_PLSQL_NTFNnnnn scheduler jobs keeps growing.
Currently there are 8 such jobs, and because of them I need to raise the maximum number of simultaneously running jobs.
About 2 months ago a limit of 10 was fine; currently the limit is 15, and because of those 8 "unnecessary" (at least for me) jobs I need to increase it to 20 or even 25.
So, what are they used for? Can I just drop/disable them?
UPD: I've increased the number of simultaneous jobs to 25, and overnight the number of AQ... jobs rose to 25. Is this a joke?!
It sounds to me like something is using AQ somewhere in your database.
I googled around a bit, and there is some possibly useful information here: http://www.ora600.be/node/5547 (mostly the hidden parameter _srvntfn_max_concurrent_jobs, which apparently limits the total number of jobs running for notifications).
Information seems to be hard to come by, but apparently notifications go into the table sys.alert_qt, so you could try having a look in there to see what appears.
You could also have a look in the ALL_QUEUES and other ALL_QUEUE* tables to see if there are any queues on your database you are not aware of.
I am assuming you are using Oracle 11gR1 or 11gR2?
When using a PL/SQL callback function to process an AQ queue, we have seen these jobs being generated. You can query this view to find any registered subscriptions:
select * from dba_subscr_registrations;
More about AQ PL/SQL Callback