I have a requirement, which is like, I read items from a DB, if possible in a paging way where the items represent the later "batch size", I do some processing steps, like filtering etc. then I want to accumulate the items to send it to a rest service where I can send it to in batches, e.g. n of them at once instead one by one.
Parallelising it on the step level is what I am doing but I am not sure on how to get the batching to work, do I need to implement a reader that returns a list and a processor that receives a list? If so, I read that you will have not a proper account of number items processed.
I am trying to find a way to do it in the most appropriate spring batch way instead of hacking a fix, I also assume that I need to keep state in the reader and wondered if there is a better way not to.
You cannot have something like an aggregating processor. Every single item that is read is processed as single item.
However, you can implement a Reader that groups items and forwards them as a whole group. to get an idea, how this could be done have a look at my answer to this question Spring Batch Processor or Dean Clark's answer here Spring Batch-How to process multiple records at the same time in the processor?.
Both use a SpringBatch's SingleItemPeekableItemReader.
Related
I am in the process of scaling out an application horizontally, and have realised read model updates (external projection via event handler) will need to be handled on a competing consumer basis.
I initially assumed that I would need to ensure ordering, but this requirement is message dependent. In the case of shopping cart checkouts where i want to know totals, i can add totals regardless of the order - get the message, update the SQL database, and ACK the message.
I am now racking my brains to even think of a scenario/messages that would be anything but, however i know this is not the case. Some extra clarity and examples would be immensely useful.
My questions i need help with please are:
What type of messages would the ordering need to be important, and
how would this be resolved using the messages as-is?
How would we know which event to resubscribe from when the processes
join/leave I can see possible timing issues that could cause a
subscription to be requested on a message that had just been
processed by another process?
I see there is a Pinned consumer strategy for best efforts affinity of stream to subscriber, however this is not guaranteed. I could solve this making a specific stream single threaded processing only those messages in order - is it possible for a process to have multiple subscriptions to different streams?
To use your example of a shopping cart, ordering would be potentially important for the following events:
Add item
Update item count
Remove item
You might have sequences like A: 'Add item, remove item' or B: 'Add item, Update item count (to 2), Update item count (to 3)'. For A, if you process the remove before the add, obviously you're in trouble. For B, if you process two update item counts out of order, you'll end up with the wrong final count.
This is normally scaled out by using some kind of sharding scheme, where a subset of all aggregates are allocated to each shard. For Event Store, I believe this can be done by creating a user-defined projection using partitionBy to partition the stream into multiple streams (aka 'shards'). Then you need to allocate partitions/shards to processing nodes in a some way. Some technologies are built around this approach to horizontal scaling (Kafka and Kinesis spring to mind).
I'm trying to create a simple application which writes to Cassandra the page views of each web page on my site. I want to write every 5 minutes the accumulative page views from the start of a logical hour.
My code for this looks something like this:
KTable<Windowed<String>, Long> hourlyPageViewsCounts = keyedPageViews
.groupByKey()
.count(TimeWindows.of(TimeUnit.MINUTES.toMillis(60)), "HourlyPageViewsAgg")
Where I also set my commit interval to 5 minutes by setting the COMMIT_INTERVAL_MS_CONFIG property. To my understanding that should aggregate on full hour and output intermediate accumulation state every 5 minutes.
My questions now are two:
Given that I have my own Cassandra driver, how do I write the 5 min intermediate results of the aggregation to Cassandra? Tried to use foreach but that doesn't seem to work.
I need a write only after 5 min of aggregation, not on each update. Is it possible? Reading here suggests it might not without using low-level API, which I'm trying to avoid as it seems like a simple enough task to be accomplished with the higher level APIs.
Committing and producing/writing output is two different concepts in Kafka Streams API. In Kafka Streams API, output is produced continuously and commits are used to "mark progress" (ie, to commit consumer offsets including the flushing of all stores and buffered producer records).
You might want to check out this blog post for more details: https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
1) To write to Casandra, it is recommended to write the result of you application back into a topic (via #to("topic-name")) and use Kafka Connect to get the data into Casandra.
Compare: External system queries during Kafka Stream processing
2) Using low-level API is the only way to go (as you pointed out already) if you want to have strict 5-minutes intervals. Note, that next release (Kafka 1.0) will include wall-clock-time punctuations which should make it easier for you to achieve your goal.
I'm currently implementing EventSourcing for my Go Actor lib.
The problem that I have right now is that when an actor restarts and need to replay all it's state from the event journal, the query might return inconsistent data.
I know that I can solve this using MutationToken
But, if I do that, I would be forced to write all events in sequential order, that is, write the last event last.
That way the mutation token for the last event would be enough to get all the data consistently for the specific actor.
This is however very slow, writing about 10 000 events in order, takes about 5 sec on my setup.
If I instead write those 10 000 async, using go routines, I can write all of the data in less than one sec.
But, then the writes are in indeterministic order and I can know which mutation token I can trust.
e.g. Event 999 might be written before Event 843 due to go routine scheduling AFAIK.
What are my options here?
Technically speaking MutationToken and asynchronous operations are not mutually exclusive. It may be able to be done without a change to the client (I'm not sure) but the key here is to take all MutationToken responses and then issue the query with the highest number per vbucket with all of them.
The key here is that given a single MutationToken, you can add the others to it. I don't directly see a way to do this, but since internally it's just a map it should be relatively straightforward and I'm sure we (Couchbase) would take a contribution that does this. At the lowest level, it's just a map of vbucket sequences that is provided to query at the time the query is issued.
So I'm currently diving the CQRS architecture along with the EventStore "pattern".
It opens applications to a new dimension of scalability and flexibility as well as testing.
However I'm still stuck on how to properly handle data migration.
Here is a concrete use case:
Let's say I want to manage a blog with articles and comments.
On the write side, I'm using MySQL, and on the read side ElasticSearch, now every time a I process a Command, I persist the data on the write side, dispatch an Event to persist the data on the read side.
Now lets say I've some sort of ViewModel called ArticleSummary which contains an id, and a title.
I've a new feature request, to include the article tags to my ArticleSummary, I would add some dictionary to my model to include the tags.
Given the tags did already exist in my write layer, I would need to update or use a new "table" to properly use the new included data.
I'm aware of the EventLog Replay strategy which consists in replaying all the events to "update" all the ViewModel, but, seriously, is it viable when we do have a billion of rows?
Is there any proven strategies? Any feedbacks?
I'm aware of the EventLog Replay strategy which consists in replaying
all the events to "update" all the ViewModel, but, seriously, is it
viable when we do have a billion of rows?
I would say "yes" :)
You are going to write a handler for the new summary feature that would update your query side anyway. So you already have the code. Writing special once-off migration code may not buy you all that much. I would go with migration code when you have to do an initial update of, say, a new system that requires some data transformation once off, but in this case your infrastructure would exist.
You would need to send only the relevant events to the new handler so you also wouldn't replay everything.
In any event, if you have a billion rows of data your servers would probably be able to handle the load :)
Im currently using the NEventStore by JOliver.
When we started, we were replaying our entire store back through our denormalizers/event handlers when the application started up.
We were initially keeping all our data in memory but knew this approach wouldn't be viable in the long term.
The approach we use currently is that we can replay an individual denormalizer, which makes things a lot faster since you aren't unnecessarily replaying events through denomalizers that haven't changed.
The trick we found though was that we needed another representation of our commits so we could query all the events that we handled by event type - a query that cannot be performed against the normal store.
I am writing a Spring Batch application to do the following: There is an input table (PostgreSQL DB) to which someone continually adds rows - that is basically work items being added. For each of these rows, I need to fetch more data from another DB, do some processing, and then do an output transaction which can be multiple SQL queries touching multiple tables (this needs to be one transaction for consistency reasons).
Now, the part between the input and output should be a modular - it already has 3-4 logically separated things, and in future there would be more. This flow need not be linear - what processing is done next can be dependent on the result of previous. In short, this is basically like the flow you can setup using steps inside a job.
My main problem is this: Normally a single chunk processing step has both ItemReader and ItemWriter, i.e., input to output in a single step. So, should I include all the processing steps as part of a single ItemProcessor? How would I make a single ItemProcessor a stateful workflow in itself?
The other option is to make each step a Tasklet implementation, and write two tasklets myself to behave as ItemReader and ItemWriter.
Any suggestions?
Found an answer - yes you are effectively limited to a single step. But:
1) For linear workflows, you can "chain" itemprocessors - that is create a composite itemprocessor to which you can provide all the itemprocessors which do actual work through applicationContext.xml. Composite itemprocessor just runs them one by one. This is what I'm doing right now.
2) You can always create the internal subflow as a seperate spring batch workflow and call it through code in an itemprocessor similar to composite itemprocessor above. I might move to this in the future.