I have a Grails application thread (a Quartz job) that does batch processing of a large number of records. For each input object, the processing creates a new object of a different type. The new object refers to the input object; the input object is immutable and is not modified in any way.
I'm attempting to use a ScrollableResults obtained from the Grails criteria API to iterate the set of input objects efficiently. The outer iteration is done outside of a transaction because I want to calculate progress on the batch by counting the number of processed records that I have created. Each output record is created in a separate transaction.
After I successfully process the first input object retrieved from the ScrollableResults, I get the following exception when trying to fetch the second input object:
Message: could not advance using next()
Line | Method
->> 107 | execute in com.spiekerpoint.reacs.ForecastJob
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| 102 | execute in grails.plugins.quartz.GrailsJobFactory$GrailsJob
| 202 | run . . in org.quartz.core.JobRunShell
^ 573 | run in org.quartz.simpl.SimpleThreadPool$WorkerThread
Caused by SQLException: Operation not allowed after ResultSet closed
->> 1086 | createSQLException in com.mysql.jdbc.SQLError
The input object, and by extension the outer cursor (the ScrollableResults), have joined the implicit transaction created by the Grails service call that processes the new record, so the cursor is implicitly closed when that transaction ends.
Here are simplified versions of the classes and job code that are in play:
class InventoryItem {
    String name
}

class ForecastInventoryItem {
    InventoryItem item
    Long forecast
}
/**** Quartz Job Code ****/

// Create a cursor-based query for satellite inventory items
Closure criteriaClosure = forecastService.queryInventoryItemsWithCursor(forecast)
ScrollableResults cursor = InventoryItem.createCriteria().scroll(criteriaClosure)

// Forecast the inventory items
while (cursor.next()) {
    // Get the next inventory item to forecast from the cursor
    InventoryItem inventoryItem = cursor.get()[0]
    if (!inventoryItem) {
        log.error("Failed to fetch inventory item from scrollable result.")
        return
    }

    // Forecast the item
    log.info("Forecasting inventory item ${inventoryItem}")
    forecastService.forecastInventoryItem(forecast, inventoryItem)
}
cursor.close()
I have tried the following things to "detach" the inventoryItem instance from the forecastService-initiated transaction:
I wrapped the call to forecastService.forecastInventoryItem in a withNewTransaction {} block.
I passed in the inventoryItem id to the service within the new transaction and let it acquire a new copy of the InventoryItem record.
The cursor still gets closed when the service transaction closes.
I could of course put everything into a single transaction, but then I lose the benefit of being able to poll for progress on the processing, because the new records won't be visible to any other thread (i.e. the controllers) until the entire transaction commits.
Any thoughts from the Grails gurus on how I might be able to separate the outer cursor from the transaction and keep it from implicitly closing on return from the service call?
I was able to work around the transaction boundary problem inside the Quartz job by avoiding loading or handling any domain objects in the job execution handler. I did this by eliminating the cursor and changing the criteria search to use a projection that returns a list of ids for the matching InventoryItem records. I then added a set of ForecastService methods that accept an InventoryItem id and fetch the InventoryItem record inside the method. I also added a service method that flushes the Hibernate session cache periodically (every 30 transactions was optimal for this particular case), keeping memory usage low and avoiding the thrashing that kills performance when doing batch processing in a Hibernate session.
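A minimal, Java-flavored sketch of that shape (the actual code is Groovy, and the queryInventoryItemIds and flushAndClearSession method names are hypothetical stand-ins for what is described above):

List<Long> itemIds = forecastService.queryInventoryItemIds(forecast);  // projection: ids only, no domain objects in the job
int processed = 0;
for (Long itemId : itemIds) {
    // each call opens its own transaction and loads the InventoryItem by id inside the service
    forecastService.forecastInventoryItem(forecast, itemId);
    processed++;
    if (processed % 30 == 0) {
        forecastService.flushAndClearSession();  // hypothetical helper: periodically flush and clear the Hibernate session
    }
}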
The result was a performance increase from ~3 items per second to ~80 items per second. Also, because each forecast runs in its own transaction, I can asynchronously track the progress of the forecast process by counting the number of ForecastInventoryItem objects relative to the number of target InventoryItems.
In summary, do not deal with any domain objects in the job execution method, because handling them there drags the service transaction boundaries into the job.
Related
I have a Spring Boot 2.x project with a big Table in my Cassandra Database. In my Liquibase Migration Class, I need to replace a value from one column in all rows.
For me it's a big performance hit when I try to solve this with
SELECT * FROM BOOKING
forEach Row
Update Row
This is because of the total number of rows, even when I select only one column.
Is it possible to make something like a "partwise/pagination" loop?
Pseudocode:
Take first 1000 rows
do Update
Take next 1000 rows
do Update
loop.
I'm also happy to hear about any other solution approaches you have.
Must know:
Make sure there is a way to group the updates by partition (a rough sketch follows this list). If you try a batchUpdate on 1000 rows that are not in the same partition, the coordinator of the request will suffer: you are moving the load from your client to the coordinator, and you want to parallelize the writes instead. A batchUpdate with Cassandra has nothing to do with the one in relational databases.
For fine-grained operations like this, you want to go back to using the drivers directly, with CassandraOperations and CqlSession, for maximum control.
There is a way to paginate with Spring Data Cassandra using Slice, but you do not have control over how the operations are implemented.
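As a rough illustration of the first point (this reuses the session and the updatePS prepared statement from the Drivers example further down; the grouping helper, key types and newValue are hypothetical), only statements that share a partition key go into the same unlogged batch:

Map<String, List<String>> idsByPartition = groupIdsByPartitionKey(); // hypothetical helper
for (Map.Entry<String, List<String>> partition : idsByPartition.entrySet()) {
    BatchStatementBuilder batch = BatchStatement.builder(DefaultBatchType.UNLOGGED);
    for (String id : partition.getValue()) {
        batch.addStatement(updatePS.bind(newValue, id)); // newValue is hypothetical
    }
    // every statement in this batch targets the same partition, so one replica set serves it
    session.execute(batch.build());
}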
Spring Data Cassandra core
// currpage and page: current and target page indices
Slice<MyEntity> slice = myEntityRepo.findAll(CassandraPageRequest.first(size));
while (slice.hasNext() && currpage < page) {
    slice = myEntityRepo.findAll(slice.nextPageable());
    currpage++;
}
slice.getContent();
Drivers:
// Prepare statements to speed up queries
PreparedStatement selectPS = session.prepare(QueryBuilder
        .selectFrom("myEntity").all()
        .build()
        .setPageSize(1000)                        // 1000 rows per page
        .setTimeout(Duration.ofSeconds(10)));     // 10s timeout

PreparedStatement updatePS = session.prepare(QueryBuilder
        .update("mytable")
        .setColumn("myColumn", QueryBuilder.bindMarker())
        .whereColumn("myPK").isEqualTo(QueryBuilder.bindMarker())
        .build()
        .setConsistencyLevel(ConsistencyLevel.ONE));  // fast writes

// Paginate
ResultSet page1 = session.execute(selectPS.bind());
Iterator<Row> page1Iter = page1.iterator();
while (0 < page1.getAvailableWithoutFetching()) {
    Row row = page1Iter.next();
    session.executeAsync(updatePS.bind(...));
}
ByteBuffer pagingStateAsBytes =
        page1.getExecutionInfo().getPagingState();
ResultSet page2 = session.execute(selectPS.bind().setPagingState(pagingStateAsBytes));
You could of course include this pagination in a loop and track progress.
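For instance, a rough sketch of such a loop (reusing selectPS and updatePS from above; newValue is hypothetical, and myPK is assumed to be a text column):

ByteBuffer pagingState = null;   // null = start from the beginning
long processed = 0;
while (true) {
    ResultSet page = session.execute(selectPS.bind().setPagingState(pagingState));
    Iterator<Row> rows = page.iterator();
    while (0 < page.getAvailableWithoutFetching()) {             // consume only the current page
        Row row = rows.next();
        // fire-and-forget for brevity; in practice you would throttle/await these
        session.executeAsync(updatePS.bind(newValue, row.getString("myPK")));
        processed++;
    }
    pagingState = page.getExecutionInfo().getPagingState();
    System.out.println("Processed " + processed + " rows so far"); // progress tracking
    if (pagingState == null) {                                     // no more pages
        break;
    }
}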
I am creating a Spring Batch process (Spring Boot 2) that reads a file and writes it to a database. It processes one record at a time: read from the file, process it, and write (or update) to the database.
If a record with the same ID exists in the DB, the process has to update the end date of the existing record in the DB and create a new record with a new start date. Below is the code:
public class Processor implements ItemProcessor<CelebVO, CelebVO> {

    @Autowired
    EndorseTableRepository endorseTableRepository;

    @Override
    @Transactional
    public CelebVO process(CelebVO celebVO) {
        CelebEndorsement celebEndorsement =
                endorseTableRepository.findAllByCelebIDAndBrandID(celebVO.getCelebID(), celebVO.getBrandID());
        if (celebEndorsement == null) {
            CelebEndorsement newEndorsement = new CelebEndorsement(celebVO);
            endorseTableRepository.save(newEndorsement);
        } else {
            celebEndorsement.setEndDate(celebVO.getEffDt().minusDays(1));
            endorseTableRepository.save(celebEndorsement);
            // create a new row with the new start date
            CelebEndorsement newEndorsement = new CelebEndorsement(celebVO);
            newEndorsement.setStartDate(celebVO.getEffDt());
            endorseTableRepository.save(newEndorsement);
        }
        return celebVO;
    }
}
Below is the input txt file (CelebVO):
CelebID BrandID EffDt
J Lo Pepsi 2021-01-05
J Lo Pepsi 2021-05-30
Now, let's suppose we are starting with an empty EndorseTable. When the process picks up the file and reads the records, it will see that there are no records for CelebID 'J Lo', so it will insert a row into the DB.
Next, the process reads the second row and processes it. It should see that there is already a record in the table for J Lo, so it should set an end date on that record and then create a new record.
After this file is processed we should see two records in the table.
But that is not what is happening. Though I do a repository.save() for the first record, it is still not committed to the table. So when the process reads the second row, it doesn't find any rows in the table, and it ends up writing only one record to the table.
I tried a repository.saveAndFlush(). That doesn't help.
My chunk size is 1
I tried removing @Transactional, but that breaks the code. So I kept it there.
The chunk-oriented processing model of Spring Batch commits a transaction per chunk, not per record. So in your case, if the insert and the update happen to be in the same chunk, the processor won't see the change of the previous record as the transaction is not committed yet at that point.
Adding #Transactional on your processor's method is incorrect, because the processor will already be executed within the scope of a transaction driven by Spring Batch. What you are trying to do would work if you set the commit interval to 1, but this would impact the performance of your step.
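For illustration, a rough sketch of a step wired with a commit interval of 1 might look like this (bean and step names are assumptions, and the @Transactional would be removed from the processor as described above):

@Bean
public Step endorsementStep(StepBuilderFactory stepBuilderFactory,
                            ItemReader<CelebVO> reader,
                            ItemProcessor<CelebVO, CelebVO> processor,
                            ItemWriter<CelebVO> writer) {
    return stepBuilderFactory.get("endorsementStep")
            .<CelebVO, CelebVO>chunk(1)   // commit interval = 1: each item gets its own transaction
            .reader(reader)
            .processor(processor)
            .writer(writer)
            .build();
}

As noted, this makes each record visible to the processing of the next one, but at the cost of step performance.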
I had to modify the entity class. I replaced
@ManyToOne(cascade = CascadeType.ALL)
with
@ManyToOne(cascade = {CascadeType.MERGE, CascadeType.DETACH})
and it worked.
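For context, a minimal sketch of where that change lives (the Celeb relation, field names and column name are assumptions, not from the original):

@Entity
public class CelebEndorsement {

    @Id
    @GeneratedValue
    private Long id;

    // was: @ManyToOne(cascade = CascadeType.ALL)
    @ManyToOne(cascade = {CascadeType.MERGE, CascadeType.DETACH})
    @JoinColumn(name = "celeb_id")
    private Celeb celeb;          // hypothetical related entity

    private LocalDate startDate;
    private LocalDate endDate;

    // getters, setters and the CelebEndorsement(CelebVO) constructor omitted
}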
I have a simple Kafka consumer that collects events and, based on the data in them, inserts or updates a record in the database - the table has a unique constraint on the ID column, which is also marked unique on the entity field.
Everything works fine when the table is pre-populated and inserts happen every now and then. However, when I truncate the table and send a couple thousand events with a limited number of IDs (I was using 50 unique IDs within 3k events), the events are processed simultaneously and the save() method randomly fails with a unique constraint violation exception. I debugged it and the outcome is pretty simple.
event1={id = 1 ... //somedata} gets picked up; the service method saveOrUpdateRecord() looks for the record by ID=1, finds none, and inserts a new record.
event2={id = 1 ... //somedata} gets picked up almost at the same time; saveOrUpdateRecord() looks for the record by ID=1, finds none (the previous one is mid-insert), tries to insert, and fails with a constraint violation exception - it should have found this record and merged it with the input from the event based on my conditions.
How can I get saveOrUpdateRecord() to run only after the previous call has fully executed, to prevent this behaviour? I really don't want to slow the Kafka consumer down with poll size etc.; I just want my service to execute one transaction at a time.
The service method:
public void saveOrUpdateRecord(Object input) {
    Object output = repository.findById(input.getId());
    if (output == null) {
        repository.save(input);
    } else {
        mergeRecord(input, output);
        repository.save(output);
    }
}
Will the @Transactional annotation on the method do the job?
Make your service thread safe.
Use this:
public synchronized void saveOrUpdateRecord(Object input) {
    Object output = repository.findById(input.getId());
    if (output == null) {
        repository.save(input);
    } else {
        mergeRecord(input, output);
        repository.save(output);
    }
}
I have a dataset, obtained from a database query, of about 5,000 elements. I would like to divide this data into chunks and then have the 'users' (threads) make HTTP requests.
The purpose of this is that we have a site that gives real-time information on transient data, and I want to simulate multiple concurrent requests against the service.
1 - I tried to create a test plan where the DB query was done and then processed via an HTTP request inside a ForEach Controller. This works fine when I have only 1 'user'; however, if I increase the user count to 2+, the DB query is run 2+ times and each 'user' runs through the entire 5,000+ data points.
2 - I tried moving the DB query into its own Thread Group and then using BeanShell to put the data into the environment (props.add(...)). This worked in that the data was there, but again each 'user' in the HTTP request Thread Group iterated over all the data.
Ideally, what I would like is to take the data and have the HTTP Request Thread Group divide it so that Thread 1 takes the first 2,500 and Thread 2 takes the second 2,500 (or, if there are 4 'users', thread 1 takes the first 1,250, thread 2 the next 1,250, and so on).
I just started looking at JMeter and I don't think it can do this "automatically" but I wanted to ask in case I'm missing something obvious.
Add a Counter element to the test plan with:
Starting value: 1
Increment: 1
Reference name: (for example) cid
and disabled "Track counter independently ...".
Then add a JSR223 or BeanShell Sampler and write some simple code:
Integer cid = Integer.valueOf(vars.get("cid"));
Integer dataShift = 2500;
Integer startReadDataFrom = (cid - 1) * dataShift;
vars.put("startReadDataFrom", String.valueOf(startReadDataFrom));
Then you can use variable ${startReadDataFrom} as a starting point to read data for every thread (0, 2500, 5000, 7500, ...).
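If the data itself was pushed into props by the DB Thread Group (as in option 2 of the question), a rough JSR223/BeanShell-style follow-up could pick this thread's slice; the inventoryRows property name and the slice handling are assumptions:

int start = Integer.parseInt(vars.get("startReadDataFrom"));
int sliceSize = 2500;                                                    // rows per thread
java.util.List allRows = (java.util.List) props.get("inventoryRows");   // stored earlier by the DB Thread Group
int end = Math.min(start + sliceSize, allRows.size());
vars.putObject("mySlice", allRows.subList(start, end));                  // later samplers iterate only this slice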
The fastest and easiest way is to store the data from the database into a CSV file; once that is done, you should be able to use the CSV Data Set Config and its Sharing Mode feature according to your requirements.
The storing of the data could be done as follows:
Define the "Result variable name" in your JDBC Request sampler (the code below assumes it is resultSet):
Add JSR223 PostProcessor as a child of the JDBC Request sampler
Put the following code into "Script" area:
resultSet = vars.getObject("resultSet")
result = new StringBuilder()
for (row in resultSet) {
    for (pair in row.entrySet()) {
        result.append(pair.getValue())
        result.append(",")
    }
    result.append(System.getProperty("line.separator"))
}
org.apache.commons.io.FileUtils.writeStringToFile(new File("data.csv"), result.toString(), "UTF-8")
Once execution is finished, you should see a data.csv file in the "bin" folder of your JMeter installation containing the data from the database.
Why isn't the exception triggered? Is LINQ's Any() not considering the new entries?
MyContext db = new MyContext();
foreach (string email in new[] { "asdf@gmail.com", "asdf@gmail.com" })
{
    Person person = new Person();
    person.Email = email;
    if (db.Persons.Any(p => p.Email.Equals(email)))
    {
        throw new Exception("Email already used!");
    }
    db.Persons.Add(person);
}
db.SaveChanges();
Shouldn't the exception be triggered on the second iteration?
The previous code is adapted for the question, but the real scenario is the following:
I receive an Excel file of persons and I iterate over it, adding every row as a person to db.Persons and checking that their emails aren't already used in the DB. The problem is when there are repeated emails in the worksheet itself (two rows with the same email).
Yes - queries (by design) are only computed against the data source. If you want to query in-memory items you can also query the Local store:
if (db.Persons.Any(p => p.Email.Equals(email)) ||
    db.Persons.Local.Any(p => p.Email.Equals(email)))
However - since YOU are in control of what's added to the store wouldn't it make sense to check for duplicates in your code instead of in EF? Or is this just a contrived example?
Also, throwing an exception for an already existing item seems like a poor design as well - exceptions can be expensive, and if the client does not know to catch them (and in this case compare the message of the exception) they can cause the entire program to terminate unexpectedly.
A call to db.Persons will always trigger a database query, but those new Persons are not yet persisted to the database.
I imagine if you look at the data in debug, you'll see that the new person isn't there on the second iteration. If you were to set MyContext db = new MyContext() again, it would be, but you wouldn't do that in a real situation.
What is the actual use case you need to solve? This example doesn't seem like it would happen in a real situation.
If you're comparing against the db, your code should work. If you need to prevent dups being entered, it should happen elsewhere - on the client or checking the C# collection before you start writing it to the db.