Spring Batch read/write in the same table

I have a Spring Batch application that reads from and writes to the same table. I use a paginated reader because the data volume is quite high. When I set the chunk size to more than 1, the page number is advanced incorrectly and some items are never read from the table.
Any idea?
@Bean
public Step fooStep1() {
    return stepBuilderFactory.get("step1")
            .<foo, foo>chunk(chunkSize)
            .reader(fooTableReader())
            .writer(fooTableWriter())
            .listener(fooStepListener())
            .listener(chunkListener())
            .build();
}
Reader
@Bean
@StepScope
public ItemReader<foo> fooBatchReader() {
    NonSortingRepositoryItemReader<foo> reader = new NonSortingRepositoryItemReader<>();
    reader.setRepository(service.getRepository());
    reader.setPageSize(chunkSize);
    reader.setMethodName("findAllByStatusCode");
    List<Object> arguments = new ArrayList<>();
    arguments.add(statusCode);
    reader.setArguments(arguments);
    return reader;
}

Don't use a paginated reader. The problem is that this reader executes a new query for every chunk, so if you add or change items in the same table while writing, the successive queries will not produce the same result. A look at the code of the paginated readers makes this quite obvious.
If you modify the same table you are reading from, you have to ensure that your result set does not change during the processing of the whole step; otherwise your results are unpredictable and very likely not what you wanted.
Try a JdbcCursorItemReader instead. It opens its cursor at the beginning of the step, so the result set is fixed at that point and does not change while the step is being processed.
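For illustration, a minimal sketch of such a cursor-based reader, placed in the same configuration class as your other beans. The table name, column names and a Foo item class (standing in for your foo) with the usual setters are assumptions, not taken from your code:

@Bean
public JdbcCursorItemReader<Foo> fooCursorReader(DataSource dataSource) {
    JdbcCursorItemReader<Foo> reader = new JdbcCursorItemReader<>();
    reader.setDataSource(dataSource);
    // The cursor is opened once when the step starts, so status updates made by
    // the writer no longer change which rows this step reads.
    reader.setSql("SELECT id, status_code, payload FROM foo WHERE status_code = ?");
    reader.setPreparedStatementSetter(ps -> ps.setString(1, statusCode));
    reader.setRowMapper((rs, rowNum) -> {
        Foo item = new Foo();
        item.setId(rs.getLong("id"));
        item.setStatusCode(rs.getString("status_code"));
        item.setPayload(rs.getString("payload"));
        return item;
    });
    return reader;
}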
Edited:
Based on the code you added to configure the reader, I assume a couple of things:
1. This is not a standard Spring Batch item reader.
2. You are using a method called "findAllByStatusCode"; I assume this is the status field that gets updated during writing.
3. Your reader class is named "NonSortingRepositoryItemReader", so I assume there is no guaranteed ordering of your result list.
If 3 is correct, then this is very likely the problem. If the order of the elements is not guaranteed, using a paging reader will definitely not work.
Every page executes its own select and then moves the pointer to the appropriate position in the result. For example, with a page size of 5, the first call returns elements 1-5 of its select and the second call returns elements 6-10 of its select. But since the order is not guaranteed, the element at position 1 in the first call could be at position 6 in the second call and therefore be processed twice, while the element at position 6 in the first call could be at position 2 in the second call and therefore never be processed.
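To make this concrete, here is an illustrative sketch of what a repository-backed paging reader effectively does from one chunk to the next. FooRepository, the Foo item type and the page size of 5 are stand-ins mirroring the setMethodName("findAllByStatusCode") configuration above, not your real code (imports for List and PageRequest omitted):

// Roughly what the paging reader does: one fresh query per page.
void readTwoPages(FooRepository repository, String statusCode) {
    // Chunk 1 -> page 0: a fresh SELECT returning whatever 5 matching rows
    // the database happens to order first.
    List<Foo> page0 = repository.findAllByStatusCode(statusCode, PageRequest.of(0, 5));

    // ... the writer updates the status of those rows before the next chunk ...

    // Chunk 2 -> page 1: another fresh SELECT over the changed data. Without a
    // stable ORDER BY, "rows 6-10" of this result are not rows 6-10 of the
    // previous one: some rows are skipped, others can be read twice.
    List<Foo> page1 = repository.findAllByStatusCode(statusCode, PageRequest.of(1, 5));
}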

Related

Listing and Deleting Data from DynamoDB in parallel

I am using Lambdas and an SQS queue to delete data from DynamoDB. When I was developing this, I found that the only way to delete data from DynamoDB is to gather the items you want to delete and delete them in batches.
At my current organization most of the infrastructure is serverless, so I decided to build this piece following a serverless, event-driven architecture as well.
In a nutshell, I post a message on the SQS queue to delete the items under a particular partition. Once this message invokes my Lambda, I perform a listing call to DynamoDB for 1000 items and do the following:
Grab the cursor from this listing call and post another message to fetch the next 1000 items from that cursor.
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
const dbClient = new DynamoDBClient(config);
const records = dbClient.query(...fetchFirst1000ItemsForPrimaryKey);
postMessageToFetchNextItems();
From the fetched 1000 items:
I create a batches of 20 items, and issue set of messages for another lambda to delete these items. A batch of 20 items is posted for deletion until all 1000 have been posted for deletion.
for (let i = 0; i < 1000; i += 20) {
    const itemsToDelete = records.slice(i, i + 20); // next batch of 20 (slice end index is exclusive)
    postItemsForDeletion(itemsToDelete);
}
}
Another lambda gets these items and just deletes them:
dbClient.send(new BatchWriteItemCommand([itemsForDeletion]))
The listing Lambda then receives the call to read items from the next cursor, and the above steps get repeated.
This all happens in parallel. Get items, post message to grab next 1000 items, post messages for deletion of items.
While it looks good on paper, this doesn't seem to delete all records from DynamoDB. There is no set pattern; there are always some items left behind in DynamoDB. I am not entirely sure what is happening, but my theory is that the parallel listing and deletion could be causing the issue.
I was unable to find any documentation to verify my theory, hence this question.
A batch write items call will return a list of unprocessed items. You should check for that and retry them.
Look at the docs for https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/clients/client-dynamodb/classes/batchwriteitemcommand.html and search for UnprocessedItems.
Fundamentally, a batch write items call is not a transactional write. It's possible for some item writes to succeed while others fail. It's on you to check for failures and retry them. I'm sorry I don't have a link for good sample code.
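The question's code uses the JavaScript SDK, but for reference here is the same retry loop sketched with the AWS SDK for Java v2, which exposes the identical UnprocessedItems field; the table name and the prepared delete requests are assumed inputs:

import java.util.List;
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.BatchWriteItemRequest;
import software.amazon.awssdk.services.dynamodb.model.BatchWriteItemResponse;
import software.amazon.awssdk.services.dynamodb.model.WriteRequest;

// Sketch: batch-delete with retries until no unprocessed items remain.
// A production version should cap the retries and back off exponentially.
class BatchDeleteWithRetry {

    static void deleteBatch(DynamoDbClient dynamoDb, String tableName, List<WriteRequest> deleteRequests) {
        Map<String, List<WriteRequest>> pending = Map.of(tableName, deleteRequests);

        while (!pending.isEmpty()) {
            BatchWriteItemResponse response = dynamoDb.batchWriteItem(
                    BatchWriteItemRequest.builder().requestItems(pending).build());

            // Whatever DynamoDB could not write comes back here and must be re-sent.
            pending = response.hasUnprocessedItems() ? response.unprocessedItems() : Map.of();
        }
    }
}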

Spring batch - conditional step flow for Chunk model

I have two steps; step 2 should be skipped if the step 1 processor doesn't return any items after filtering.
I see that ItemListenerSupport can be extended and its afterProcess method can be used:
@Override
public void afterProcess(NumberInfo item, Integer result) {
    super.afterProcess(item, result);
    if (item.isPositive()) {
        stepExecution.setExitStatus(new ExitStatus(NOTIFY));
    }
}
My processing is chunk based; I want to set the exit status only after all chunks are processed, and only if any items were left unfiltered. I am currently adding the items left unfiltered to the ExecutionContext and using them in the next step.
How would I prevent the next step from running if all items of all chunks are filtered out?
For programmatic decisions, you can use a JobExecutionDecider. This API gives you access to the StepExecution of the previous step, so you can base the decision to run the next step on any information from the last step execution and its execution context. In your case, it could be the filter count or anything meaningful to your decision that you store in the execution context upfront.
You can find more details about this API and some code examples in the Programmatic Flow Decisions section of the reference documentation.
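A minimal sketch of such a decider, assuming the decision is keyed off the step's filter count (the NOTIFY/SKIP statuses are just illustrative names):

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.job.flow.FlowExecutionStatus;
import org.springframework.batch.core.job.flow.JobExecutionDecider;

// Runs after step 1 and decides whether step 2 should execute.
public class NotifyDecider implements JobExecutionDecider {

    @Override
    public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
        // Skip the next step when every item that was read got filtered out.
        boolean everythingFiltered = stepExecution.getReadCount() > 0
                && stepExecution.getFilterCount() == stepExecution.getReadCount();
        return everythingFiltered ? new FlowExecutionStatus("SKIP") : new FlowExecutionStatus("NOTIFY");
    }
}

In the job definition you would then route on those statuses, along the lines of .start(step1).next(notifyDecider).on("NOTIFY").to(step2).from(notifyDecider).on("SKIP").end().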

Is it possible to lock some entries in MongoDB and run a query that does not take the locked records into account?

I have a MongoDB that contains a list of "tasks" and two instances of executors. These two executors have to read a task from the DB, save it in the state "IN_EXECUTION" and execute the task. Of course I do not want my two executors to execute the same task, and this is my problem.
I use a transactional query. This way, when an executor tries to change the state of a task it gets a write exception and has to start again and read a new task. The problem with this approach is that sometimes an executor gets a lot of errors before it manages to save the state change correctly and execute a new task, so it is as if I had only one executor.
Note:
- I do not want to lock my entire DB for reads/writes, because that would slow down the whole process.
- I think it is necessary to save the state of the task, because it could be a long-running task.
I asked whether it is possible to lock only certain records and run a query over the "not-locked" ones, but any advice that solves my problem would be really appreciated.
Thanks in advance.
EDIT1:
Sorry, I oversimplified the problem in the question above. Actually I extract n messages that I have to send. I have to send these messages in blocks of 100, so my executors basically split the extracted messages into blocks of 100 and pass them on to other executors.
Each executor extracts the messages and then updates them with the new state. I hope this is clearer now.
@Transactional(readOnly = false, propagation = Propagation.REQUIRED)
public List<PushMessageDB> assignPendingMessages(int limitQuery, boolean sortByClientPriority,
        LocalDateTime now, String senderId) {
    final List<PushMessageDB> messages = repositoryMessage.findByNotSendendAndSpecificError(limitQuery, sortByClientPriority, now);
    long count = repositoryMessage.updateStateAndSenderId(messages, senderId, MessageState.IN_EXECUTION);
    return messages;
}
DB update:
public long updateStateAndSenderId(List<String> ids, String senderId, MessageState messageState) {
    Query query = new Query(Criteria.where(INTERNAL_ID).in(ids));
    Update update = new Update().set(MESSAGE_STATE, messageState).set(SENDER_ID, senderId);
    return mongoTemplate.updateMulti(query, update, PushMessageDB.class).getModifiedCount();
}
You will have to do the locking one-by-one.
Trying to lock 100 records at once and at the same time have a second process also lock 100 records (without any coordination between the two) will almost certainly result in an overlapping set unless you have a huge selection of available records.
Depending on your application, having all work done by one thread (and the other being just a "hot standby") may also be acceptable as long as that single worker does not get overloaded.
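One common way to do that one-by-one claim with Spring Data MongoDB is an atomic findAndModify that flips the state and stamps the sender in a single operation. The sketch below reuses your PushMessageDB and MessageState types, but the field names and the PENDING state are assumptions:

import java.util.ArrayList;
import java.util.List;
import org.springframework.data.mongodb.core.FindAndModifyOptions;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.data.mongodb.core.query.Update;

// Claims up to `limit` pending messages one at a time. Each findAndModify is
// atomic on a single document, so two executors can never claim the same message.
public class MessageClaimer {

    private final MongoTemplate mongoTemplate;

    public MessageClaimer(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    public List<PushMessageDB> claimPendingMessages(int limit, String senderId) {
        List<PushMessageDB> claimed = new ArrayList<>();
        for (int i = 0; i < limit; i++) {
            Query query = new Query(Criteria.where("messageState").is(MessageState.PENDING));
            Update update = new Update()
                    .set("messageState", MessageState.IN_EXECUTION)
                    .set("senderId", senderId);
            // Atomically switches the state and returns the updated document,
            // or null when no pending message is left.
            PushMessageDB message = mongoTemplate.findAndModify(
                    query, update, FindAndModifyOptions.options().returnNew(true), PushMessageDB.class);
            if (message == null) {
                break;
            }
            claimed.add(message);
        }
        return claimed;
    }
}

Because each claim is an independent atomic update, the two executors simply share the pending backlog instead of colliding on the same transaction.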

IBM Integration Bus, best practices for calling multiple services

So I have this requirement that takes in one document and, from that, needs to create one or more documents in the output.
In the course of this, it needs to determine whether the document is already there, because there are different operations to apply in the create and update scenarios.
In straight code, this would be simple (conceptually):
InputData in = <something>;
if (getItemFromExternalSystem(in.key1) == null) {
    createItemSpecificToKey1InExternalSystem(in.key1);
}
if (getItemFromExternalSystem(in.key2) == null) {
    createItemSpecificToKey2InExternalSystem(in.key1, in.key2);
}
createItemFromInput(in.key1, in.key2, in.moreData);
In effect a kind of "ensure this data is present".
However, how would I go about achieving this in IIB? If I used a subflow for the get/create cycle, the output of the subflow, i.e. the result of its last operation, would be returned as the new "message" of the flow; but really, I don't care about the value coming out of the "ensure data present" subflow. Instead I need to keep working on my original message, while still waiting for the different subflows to finish before I can run my final "createItem".
You can use Aggregation nodes, for example with three flows:
the first would propagate your original message to the third,
the second would invoke the createItemSpecificToKey1InExternalSystem and createItemSpecificToKey2InExternalSystem operations,
the third would aggregate the results of the first and second and invoke createItemFromInput.
Have you considered using the Collector node? It will collect your records into N 'collections', and then you can iterate over the collections and output one document per collection.

How does Spark Streaming's countByValueAndWindow work?

I have a Spark Streaming application that is processing a stream of website click events. Each event has a property containing a GUID that identifies the user session that the event belongs to.
My application is counting up the number of events that occurred for each session, using windowing:
def countEvents(kafkaStream: DStream[(String, Event)]): DStream[(String, Session)] = {
  // Get a list of the session GUIDs from the events
  val sessionGuids = kafkaStream
    .map(_._2)
    .map(_.getSessionGuid)
  // Count up the GUIDs over our sliding window
  val sessionGuidCountsInWindow = sessionGuids.countByValueAndWindow(Seconds(60), Seconds(1))
  // Create new session objects with the event count
  sessionGuidCountsInWindow
    .map({
      case (guidS, eventCount) =>
        guidS -> new Session().setGuid(guidS).setEventCount(eventCount)
    })
}
My understanding was that the countByValueAndWindow function is only counting the values in the DStream on which the function is called. In other words, in the code above, the call to countByValueAndWindow should return the event counts only for the session GUIDs in the sessionGuids DStream on which we're calling that function.
But I'm observing something different: the call to countByValueAndWindow is returning counts for session GUIDs that are not in sessionGuids. It appears to be returning counts for session GUIDs that were processed in previous batches. Am I just misunderstanding how this function works? I haven't been able to find any useful documentation online.
A colleague of mine who is much more versed in the ways of Spark than I am has helped me with this. Apparently I was misunderstanding how the countByValueAndWindow function works. I thought it would only return counts for values in the DStream on which you call the function, but in fact it returns counts for all values across the entire window. To fix my issue, I simply perform a join between my input DStream and the DStream resulting from the countByValueAndWindow operation, so I only end up with results for values that are in my input DStream.
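For reference, that join looks roughly like this. The question's code is Scala, but the sketch below uses the equivalent Java streaming API; the class, method and variable names are illustrative:

import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

// Restricts the windowed counts to the session GUIDs present in the current
// batch by joining them against the input stream.
public class SessionCounts {

    public static JavaPairDStream<String, Long> countsForCurrentBatch(JavaDStream<String> sessionGuids) {
        // Counts for every GUID seen anywhere in the 60-second window.
        JavaPairDStream<String, Long> countsInWindow =
                sessionGuids.countByValueAndWindow(Durations.seconds(60), Durations.seconds(1));

        // Key the current batch's GUIDs (de-duplicated) so they can be joined.
        JavaPairDStream<String, Boolean> currentBatch =
                sessionGuids.mapToPair(guid -> new Tuple2<>(guid, Boolean.TRUE))
                        .reduceByKey((a, b) -> a);

        // The inner join keeps only GUIDs that appear in the current batch.
        return currentBatch.join(countsInWindow)
                .mapToPair(pair -> new Tuple2<>(pair._1(), pair._2()._2()));
    }
}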
