What's the better approach with Oracle DBMS_SCHEDULER:
keeping a job scheduled (but disabled) all the time, and enabling and running it when needed, or
creating the job, running it, and dropping it each time?
I have a table X, and whenever a record gets submitted to that table I need a job to process that record.
We may or may not have record insertions all the time.
Keeping this in mind, which is better?
Processing rows as they appear in a table in an asynchronous process can be done in a number of different ways; choose the one that suits you:
Add a trigger to the table which creates a one-off job to process the row using DBMS_JOB. This is suitable if the volume of data being inserted to the table is quite low, and you don't want your job running all the time. The advantage of DBMS_JOB is that the job will not start until the insert is committed; if it is rolled back, the job is also rolled back so it doesn't run. The disadvantage is that if there is a sustained spike of activity, all the jobs created will crowd out any other jobs that are running.
Create a single job using DBMS_SCHEDULER which runs regularly, polls the table for new records and processes them. This method would need a column on the table that it can update to mark each record as "processed". For example, add a VARCHAR2(1) flag which is set to 'Y' on insert and set to NULL by the job after processing. You could add an index to that flag which will only store entries for unprocessed rows (so it will be small and fast). This method is much more efficient, especially for large data volumes, because each run of the job can effectively process large chunks of data in bulk at a time.
Use Oracle Advanced Queueing. http://docs.oracle.com/cd/E11882_01/server.112/e11013/aq_intro.htm#ADQUE0100
For (1), a separate job is created for each record inserted into the table. You don't need to create the jobs yourself; the trigger does that. You do need to monitor them, however; if one fails, you would need to investigate and re-run it manually.
For (2), you just create one job and let it run regularly. If one record fails, it can be picked up by the next iteration of the job. I would process each record in a separate transaction so the failure of one record doesn't affect the processing of the other records still in the queue.
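For example, a sketch of (1) might look like the following, assuming a numeric ID column on X and a PROCESS_X_RECORD procedure that does the actual work (both names are illustrative):

CREATE OR REPLACE TRIGGER x_after_insert
  AFTER INSERT ON x
  FOR EACH ROW
DECLARE
  l_job BINARY_INTEGER;
BEGIN
  -- The job is submitted inside the inserting transaction, so it only
  -- becomes runnable if the insert commits; a rollback discards it too.
  DBMS_JOB.SUBMIT(
    job  => l_job,
    what => 'process_x_record(' || :new.id || ');');  -- hypothetical procedure
END;
/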
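A minimal sketch of (2), assuming X has a numeric ID, an UNPROCESSED flag column as described above, and the same hypothetical PROCESS_X_RECORD procedure:

ALTER TABLE x ADD (unprocessed VARCHAR2(1) DEFAULT 'Y');
-- A single-column B-tree index stores no entry when the column is NULL,
-- so this index only ever holds the unprocessed rows and stays small.
CREATE INDEX x_unprocessed_ix ON x (unprocessed);

BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'PROCESS_X_JOB',
    job_type        => 'PLSQL_BLOCK',
    job_action      => q'[
      BEGIN
        FOR r IN (SELECT id FROM x WHERE unprocessed = 'Y') LOOP
          process_x_record(r.id);                        -- hypothetical procedure
          UPDATE x SET unprocessed = NULL WHERE id = r.id;
          COMMIT;                                        -- one transaction per record
        END LOOP;
      END;]',
    repeat_interval => 'FREQ=MINUTELY;INTERVAL=5',
    enabled         => TRUE);
END;
/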
For (3), you still create a job like (2) but instead of reading the table it pulls requests off a queue.
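A bare-bones sketch of the queue plumbing for (3); the queue names are made up, and a real design would typically use an object type payload rather than RAW:

BEGIN
  DBMS_AQADM.CREATE_QUEUE_TABLE(
    queue_table        => 'x_evt_qt',
    queue_payload_type => 'RAW');
  DBMS_AQADM.CREATE_QUEUE(
    queue_name  => 'x_evt_q',
    queue_table => 'x_evt_qt');
  DBMS_AQADM.START_QUEUE('x_evt_q');
END;
/

-- The producer (for example a trigger on X) enqueues the new row's key;
-- the message only becomes visible once the surrounding transaction commits.
DECLARE
  l_opts  DBMS_AQ.ENQUEUE_OPTIONS_T;
  l_props DBMS_AQ.MESSAGE_PROPERTIES_T;
  l_msgid RAW(16);
BEGIN
  DBMS_AQ.ENQUEUE(
    queue_name         => 'x_evt_q',
    enqueue_options    => l_opts,
    message_properties => l_props,
    payload            => UTL_RAW.CAST_TO_RAW('42'),  -- hypothetical key of the new record
    msgid              => l_msgid);
END;
/

The consumer job then calls DBMS_AQ.DEQUEUE in a loop, blocking until a message arrives, and processes each key it receives.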
I am inserting/updating data into a table. The database system does not provide an "upsert" functionality, so I am using a staging table for the insert, followed by a merge into the "final" table, and finally I truncate the staging table.
This leads to a race condition: if new data is inserted into the staging table between the merge and the truncate, that data is lost.
How can I make sure this does not happen?
I have tried to model this via Wait/Notify, but that is not a clean solution either: the queue for the "Put Data into staging table" PutDatabaseRecord processor could still be filling while the "MergeVertica for Insert/Update" ExecuteSQL processor executes.
I would use a MonitorActivity processor with a 60 or 30 second threshold, using the Inactive output and with "Continually Send Messages" set to "false".
Route the success relationship of the SQL inserts into the staging table into your MonitorActivity; that way, if no activity is seen in the last X seconds, it will trigger a flowfile that starts your merge process.
Download the template from https://codeshare.io/aJNNkn
I started using Spring a few months ago and I have a question on transactions. I have a Java method inside my Spring Batch job which first does a select to get the first 100 rows with status 'NOT COMPLETED' and then updates the selected rows to change the status to 'IN PROGRESS'. Since I'm processing around 10 million records, I want to run multiple instances of my batch job, and each instance has multiple threads. For a single instance, to make sure two threads are not fetching the same set of records, I have made my method synchronized. But if I run multiple instances of the batch job (multiple JVMs), there is a high probability that the same set of records will be fetched by both instances, even if I use optimistic or pessimistic locking or SELECT FOR UPDATE, since we cannot lock records during selection. Below is an example: Transaction 1 has fetched 100 records and meanwhile Transaction 2 also fetched 100 records; if I enable locking, Transaction 2 waits until Transaction 1 has updated and committed, but then Transaction 2 performs the same update again.
Is there any way in Spring to make Transaction 2's select operation wait until Transaction 1's select is completed?
Transaction1                 Transaction2
fetch 100 records
                             fetch 100 records
update 100 records
commit
                             update 100 records
                             commit
@Transactional
public synchronized List<Student> processStudentRecords() {
    // fetch the first 100 records with status 'NOT COMPLETED'
    List<Student> students = getNotCompletedRecords();
    if (students != null && !students.isEmpty()) {
        // mark the fetched records as 'IN PROGRESS'
        updateStatusToInProgress(students);
    }
    return students;
}
Note: I cannot perform the update first and then the select. I would appreciate it if an alternative approach could be suggested.
Transaction synchronization should be left to the database server and not managed at the application level. From the database server point of view, no matter how many JVMs (threads) you have, those are concurrent database clients asking for read/write operations. You should not bother yourself with such concerns.
What you should do though is try to minimize contention as much as possible in the design of your solution, for example, by using the (remote) partitioning technique.
if I run multiple instances of my batch job (multiple JVMs), there is high probability that same set of records might be fetched by both the instances even if I use "optimistic" or "pessimistic lock" or "select for update" since we cannot lock records during selection
Partitioning data will by design remove all these problems. If you give each instance a set of data to work on, there is no chance that a worker would select the same set of records as another worker. Michael gave a detailed example in this answer: https://stackoverflow.com/a/54889092/5019386.
(Logical) partitioning, however, will not solve the contention problem, since all workers would read/write from/to the same table, but that's the nature of the problem you are trying to solve. What I'm saying is that you don't need to start locking/unlocking the table in your design; leave this to the database. Some database servers like Oracle can write data of the same table to different partitions on disk to optimize concurrent access (which might help if you use partitioning), but again that's Oracle's business, not Spring's (or any other framework's).
Not everybody can afford Oracle, so I would look for a solution at the conceptual level. I have successfully used the following solution ("pseudo" physical partitioning) for a problem similar to yours:
Step 1 (in serial): copy/partition unprocessed data into temporary tables.
Step 2 (in parallel): run multiple workers on these tables instead of the source table with millions of rows.
Step 3 (in serial): copy/update processed data back to the original table
Step 2 removes the contention problem. Usually, the cost of (Step 1 + Step 3) is negligible compared to Step 2 (and even more so if Step 2 were done in serial). This works well if the processing is the bottleneck.
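For illustration, a hedged SQL sketch of those three steps, assuming a source table BIG_TABLE with a numeric primary key ID, a STATUS column, and two worker instances (all names are invented):

-- Step 1 (serial): split unprocessed rows into one work table per worker.
CREATE TABLE big_table_part_0 AS
  SELECT * FROM big_table WHERE status = 'NOT COMPLETED' AND MOD(id, 2) = 0;
CREATE TABLE big_table_part_1 AS
  SELECT * FROM big_table WHERE status = 'NOT COMPLETED' AND MOD(id, 2) = 1;

-- Step 2 (parallel): each worker instance reads and updates only its own
-- work table, so no two workers can ever pick up the same rows.

-- Step 3 (serial): write the results back to the original table.
MERGE INTO big_table t
USING (SELECT * FROM big_table_part_0
       UNION ALL
       SELECT * FROM big_table_part_1) p
ON (t.id = p.id)
WHEN MATCHED THEN UPDATE SET t.status = p.status;

DROP TABLE big_table_part_0;
DROP TABLE big_table_part_1;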
Hope this helps.
I have a db job running daily that manages to process 10,000 rows, out of a table of 3,500,000 rows, in three hours.
Tuning the main cursor's select statement can only save me 30 minutes, but I need to reduce the job running time from 3 hours to 10-15 minutes.
I have to state that there is only the main loop for the cursor, and for each record there are calls to external systems in order to get or send data, so this is overhead I cannot control. The time for each record to be processed after it is fetched is a little less than a second, and that is not acceptable...
Is there something I could do? All ideas are more than welcome!
IMHO, you could submit a job for each call to the external system, or try to run the calls in parallel; maybe you can use Advanced Queuing (AQ). To explain: enqueue each selected row, and let the calls to the external system be processed from the queue.
You may try to process rows in parallel.
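For example (an Oracle-specific sketch, not necessarily matching your job), DBMS_PARALLEL_EXECUTE can split the table into chunks and run a hypothetical PROCESS_ROWS(start_rowid, end_rowid) procedure, which performs the external calls for one chunk, across several job slaves:

BEGIN
  DBMS_PARALLEL_EXECUTE.CREATE_TASK('process_external_calls');
  DBMS_PARALLEL_EXECUTE.CREATE_CHUNKS_BY_ROWID(
    task_name   => 'process_external_calls',
    table_owner => USER,
    table_name  => 'BIG_TABLE',         -- illustrative table name
    by_row      => TRUE,
    chunk_size  => 1000);               -- 1000 rows per chunk
  DBMS_PARALLEL_EXECUTE.RUN_TASK(
    task_name      => 'process_external_calls',
    sql_stmt       => 'BEGIN process_rows(:start_id, :end_id); END;',  -- hypothetical procedure
    language_flag  => DBMS_SQL.NATIVE,
    parallel_level => 8);               -- 8 slaves working on chunks concurrently
END;
/

Each chunk is committed independently, and failed chunks can be retried with DBMS_PARALLEL_EXECUTE.RESUME_TASK.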
Basically, the job needed is for a large number of records in a database, and more records can be inserted all the time:
Select <1000> records with status "NEW" -> process the records -> update the records to status "DONE".
This sounds to me like "Map Reduce".
I think that the job described above can be done in parallel, even by different machines, but then my concern is:
When I select <1000> records with status "NEW" - how can I know that none of these records are already being processed by some other job ?
The same records should not be selected and processed more than once of course.
Performance is critical.
The naive solution is to do the mentioned basic job in a loop.
It seems related to big data processing / NoSQL / map-reduce, etc.
Thanks
Considering the performance requirement, we can achieve this. The main goal is to distribute records to clients in such a way that no two clients get the same record.
Irrespective of the database:
If you can add one more column that is used for locking a record, then on fetching those records you set the lock, to prevent them from being fetched a second time.
If you do not have that capability, then my best bet would be to create another table, or an in-memory key-value store, keyed on the record's primary key with a lock flag; on fetching records you then need to check that the record does not already exist in that other table.
If you have HBase, the first approach can be achieved easily and with good performance.
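For example, on Oracle the lock-column idea could be sketched like this, assuming a WORK_QUEUE table with ID, STATUS and LOCKED_BY columns and a unique :worker_id per job instance (all names are illustrative):

-- Claim up to 1000 unclaimed rows in a single statement; a concurrent
-- worker blocked on the same rows re-evaluates after this commits and
-- picks different ones, because STATUS is no longer 'NEW'.
UPDATE work_queue
   SET status    = 'IN PROGRESS',
       locked_by = :worker_id
 WHERE status = 'NEW'
   AND ROWNUM <= 1000;
COMMIT;

-- Process only the rows this worker claimed, then mark them done.
SELECT id
  FROM work_queue
 WHERE status = 'IN PROGRESS'
   AND locked_by = :worker_id;

UPDATE work_queue
   SET status = 'DONE'
 WHERE status = 'IN PROGRESS'
   AND locked_by = :worker_id;
COMMIT;

The point of committing the claim separately is that it becomes visible to the other workers as early as possible.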
Advice/suggestions needed for a bit of application design.
I have an application which uses two tables. One is a staging table, which many separate processes write to; once a 'group' of processes has finished, another job comes along, aggregates the results together into a final table, and then deletes that 'group' from the staging table.
The problem that I'm having is that when the staging table is being cleared out, lots of redo is generated and I'm seeing a lot of 'log file sync' waits in the database. This is a shared database with many other applications and this is causing some issues.
When applying the aggregate, the rows are reduced to about 1 row in the final table for every 20 rows in the staging table.
I'm thinking of getting around this by creating a table for each 'group' rather than having a single staging table. Once a group is done, its table can just be dropped, which should result in much less redo.
I only have SE, so partitioned tables aren't an option. Faster disks for the redo probably aren't an option in the short term either.
Is this a bad idea? Any better solutions to be offered?
Thanks.
Would it be possible to solve the problem by having your process do a logical delete (i.e. set a DELETE_FLAG column in the table to 'Y') and then having a nightly process that truncates the table (potentially writing any non-deleted rows to a separate table before the truncate and then copy them back after the table is truncated)?
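A sketch of that idea in SQL, using invented names (STAGING, GROUP_ID, DELETE_FLAG, STAGING_KEEP):

-- Instead of deleting a finished group's rows, flag them; updating a single
-- small column generates far less redo than physically deleting the rows.
UPDATE staging SET delete_flag = 'Y' WHERE group_id = :finished_group;
COMMIT;

-- Nightly: keep the live rows aside, truncate (minimal redo), and put them back.
CREATE TABLE staging_keep AS
  SELECT * FROM staging WHERE delete_flag IS NULL;
TRUNCATE TABLE staging;
INSERT INTO staging SELECT * FROM staging_keep;
COMMIT;
DROP TABLE staging_keep PURGE;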
Are you certain that the source of the log file sync waits is that your disks can't keep up with the I/O? That's certainly possible, of course, but there are other possible causes of excessive log file sync waits including excessive commits. There is an excellent article on tuning log file sync events on the Pythian blog.
The most common cause of excessive log file syncs is too frequent commits, which are often deliberately coded in a mistaken attempt to reduce system load due to locking. You should commit only when your business transaction is complete.
Loading each group into a separate table sounds like a fine plan to reduce redo. You can truncate the individual group table following each aggregation.
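A rough sketch of the table-per-group approach, with invented names (STAGING_GRP_42, FINAL_TABLE, AMOUNT) and group 42 as an example:

-- Each 'group' of processes writes into its own staging table.
CREATE TABLE staging_grp_42 AS
  SELECT * FROM staging WHERE 1 = 0;   -- same shape as the old staging table

-- ... the group's processes insert into staging_grp_42 ...

-- Aggregate into the final table, then drop (or truncate) instead of deleting.
INSERT INTO final_table (group_id, total)
  SELECT 42, SUM(amount) FROM staging_grp_42;
COMMIT;
DROP TABLE staging_grp_42 PURGE;       -- DDL, so no row-by-row delete and far less redo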
Another (but I think probably worse) option is to create a new staging table with the groups that haven't been aggregated then drop the original and rename the new table to replace the staging table.
I prefer Justin's suggestion ("logical delete"), but another option to consider might be a partitioned table, if you have the EE licence. The aggregation process could drop a partition instead of deleting the rows.
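If EE were available, the partitioned variant might look roughly like this (list partitioning by an invented GROUP_ID column):

CREATE TABLE staging (
  group_id NUMBER,
  payload  VARCHAR2(4000)
)
PARTITION BY LIST (group_id) (
  PARTITION p_grp_1 VALUES (1),
  PARTITION p_grp_2 VALUES (2)
);

-- After group 1 has been aggregated, remove its rows as a DDL operation:
ALTER TABLE staging DROP PARTITION p_grp_1 UPDATE INDEXES;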