Kafka Streams to build materialized views

I'm trying to produce some kind of materialized view from a stream of database updates (provided by e.g. the DBMS's transaction log, with the help of e.g. maxwell-daemon). The view is materialized as a Kafka compacted topic.
The view is a simple join and could be expressed as a query like this:
SELECT u.email user_email, t.title todo_title, t.state todo_state
FROM User u
JOIN Todo t
ON t.user_id = u.id
I want the view to be updated every time User or Todo changes (i.e. a message published on the view's Kafka topic).
With Kafka Streams it seems to be possible to achieve that by doing this:
Make a KTable of User changes
Make a KTable of Todo changes
Join both (a sketch follows below)
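As a rough, hedged sketch of that topology (topic names, value formats, and the pipe-delimited "userId|title|state" encoding of a Todo are assumptions for illustration; the foreign-key KTable join shown requires Kafka Streams 2.4+, otherwise the Todo table would first have to be re-keyed by user_id):
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;

public class UserTodoView {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "user-todo-view");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Changelog topics keyed by primary key (assumed): user id -> email, todo id -> "userId|title|state"
        KTable<String, String> users = builder.table("user-changelog");
        KTable<String, String> todos = builder.table("todo-changelog");

        // Foreign-key join: each Todo row is joined with the User it references (t.user_id = u.id)
        KTable<String, String> view = todos.join(
                users,
                todoValue -> todoValue.split("\\|")[0],                 // extract user_id from the Todo value
                (todoValue, userEmail) -> userEmail + "," + todoValue); // user_email, todo_title, todo_state

        // Every change on either side emits an updated row keyed by todo id, suitable for a compacted topic
        view.toStream().to("user-todo-view");

        new KafkaStreams(builder.build(), props).start();
    }
}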
However, I'm not sure of a few things:
Is that even possible?
Will this maintain the original ordering of events? E.g. if User is changed, then Todo is changed, am I guaranteed to see these changes in this order in the result of the join?
How to handle transactions? E.g. multiple database changes might be part of the same transaction. How to make sure that both KTables are updated atomically, and that all join results show only fully-applied transactions?

Is that even possible?
Yes. The pattern you describe will compute what you want out-of-the-box.
Will this maintain the original ordering of events? E.g. if User is changed, then Todo is changed, am I guaranteed to see these changes in this order in the result of the join?
Streams will process data according to timestamps (i.e. records with smaller timestamps first). Thus, in general, this will work as expected. However, there is no strict guarantee, because in stream processing it's more important to make progress all the time (and not block). Thus, Streams only applies a best-effort approach with regard to processing records in timestamp order. For example, if one changelog does not provide any data, Streams will just keep going, processing only data from the other changelog (and not block). This might lead to "out of order" processing with regard to timestamps from different partitions/topics.
How to handle transactions? E.g. multiple database changes might be part of the same transaction. How to make sure that both KTables are updated atomically, and that all join results show only fully-applied transactions?
That's not possible at the moment. Each update will be processed individually, and you will see each intermediate (i.e. not committed) result. However, Kafka will introduce "transactional processing" in the future that will make it possible to handle transactions (see https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging and https://cwiki.apache.org/confluence/display/KAFKA/KIP-129%3A+Streams+Exactly-Once+Semantics).

Related

Spring JPA performance the smart way

I have a service that listens to multiple queues and saves the data to a database.
One queue gives me a person.
Now if I code it really simply, I just get one message from the queue at a time.
I do the following:
Start transaction
Select from person table to check if it exists.
Either update existing or create a new entity
repository.save(entity)
End transaction
The above is clean and robust, but I get a lot of messages and it's not fast enough.
To improve performance I have done this (sketched in code after the list):
Fetch 100 messages from queue
then
Start transaction
Select all persons where id in (...) in one query, using the ids from the incoming persons
Iterate over the messages and for each one check whether it was selected above. If yes, update it; if not, create a new entity
Save all changes with batch update/create
End transaction
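A rough sketch of that batched flow, assuming a hypothetical Spring Data JpaRepository called PersonRepository and illustrative Person/PersonMessage classes (the entity is keyed by the id carried in the message):
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class PersonBatchImporter {

    private final PersonRepository repository; // hypothetical JpaRepository<Person, Long>

    public PersonBatchImporter(PersonRepository repository) {
        this.repository = repository;
    }

    @Transactional
    public void importBatch(List<PersonMessage> messages) {
        // One select for all incoming ids instead of one select per message
        List<Long> ids = messages.stream().map(PersonMessage::getId).collect(Collectors.toList());
        Map<Long, Person> existing = repository.findAllById(ids).stream()
                .collect(Collectors.toMap(Person::getId, Function.identity()));

        // Update the entities that already exist, create the ones that don't
        List<Person> toSave = messages.stream()
                .map(m -> {
                    Person p = existing.getOrDefault(m.getId(), new Person(m.getId()));
                    p.setName(m.getName());
                    return p;
                })
                .collect(Collectors.toList());

        // Single batched write; the surrounding transaction commits everything at once
        repository.saveAll(toSave);
    }
}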
If it's a simple message, the above is really good; it performs. But if the message is complicated, or the logic I have to run when I receive it is, then the above is not so good, since there is a chance that some of the messages will result in a rollback, and the code becomes hard to read.
Any ideas on how to make it run fast in a smarter way?
Why do you need to roll back? Can't you just not execute whatever it is that then has to be rolled back?
IMO the smartest solution would be to code this with a single "upsert" statement. Not sure which database you use, but PostgreSQL for example has the ON CONFLICT clause for inserts that can be used to do updates if the row already exists. You could even configure Hibernate to use that on insert by using the @SQLInsert annotation.
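For example, a hedged sketch of that upsert on PostgreSQL via Hibernate's @SQLInsert, assuming a hypothetical person table with columns (id, name); note that the parameter order in the custom SQL has to match the insert statement Hibernate generates for the entity (verify it with hibernate.show_sql before relying on it):
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;
import org.hibernate.annotations.SQLInsert;

@Entity
@Table(name = "person")
@SQLInsert(sql = "INSERT INTO person (name, id) VALUES (?, ?) "
        + "ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name")
public class Person {

    @Id
    private Long id;      // id assigned from the incoming message, not generated

    private String name;

    protected Person() {  // required by JPA
    }

    public Person(Long id, String name) {
        this.id = id;
        this.name = name;
    }
}
With something like this in place, saving a Person whose id already exists turns into an update of the existing row, so the separate existence check in Java can go away.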

Version number in event sourcing aggregate?

I am building microservices. One of my microservices is using CQRS and event sourcing. Integration events are raised in the system, and I am saving my aggregates in the event store while also updating my read model.
My question is: why do we need a version on the aggregate when we are updating the event stream against that aggregate? I read that we need it for consistency, that events are to be replayed in sequence, and that we need to check the version before saving (https://blog.leifbattermann.de/2017/04/21/12-things-you-should-know-about-event-sourcing/). I still can't get my head around this, since events are raised and saved in order, so I really need a concrete example to understand what benefit we get from versions and why we even need them.
Many thanks,
Imran
Let me describe a case where aggregate versions are useful:
In our reSolve framework, the aggregate version is used for optimistic concurrency control.
I'll explain it by example. Let's say an InventoryItem aggregate accepts the commands AddItems and OrderItems. AddItems increases the number of items in stock; OrderItems decreases it.
Suppose you have an InventoryItem aggregate #123 with one event, ITEMS_ADDED, with a quantity of 5. Aggregate #123's state says there are 5 items in stock.
So your UI is showing users that there are 5 items in stock. User A decides to order 3 items, user B 4 items. Both issue OrderItems commands almost at the same time; let's say user A is first by a couple of milliseconds.
Now, if you have a single instance of aggregate #123 in memory, in a single thread, you don't have a problem: the first command from user A would succeed, the event would be applied, the state would say the quantity is 2, so the second command from user B would fail.
In a distributed or serverless system, where the commands from A and B would be handled in separate processes, both commands would succeed and bring the aggregate into an incorrect state if we don't use some concurrency control. There are several ways to do this: pessimistic locking, a command queue, an aggregate repository, or optimistic locking.
Optimistic locking seems to be the simplest and most practical solution:
We say that every aggregate has a version: the number of events in its stream. So our aggregate #123 has version 1.
When an aggregate emits an event, the event data carries an aggregate version. In our case the ITEMS_ORDERED events from users A and B will both have an aggregate version of 2. Obviously, aggregate event versions should be sequentially increasing. So all we need to do is put a database constraint that the tuple {aggregateId, aggregateVersion} must be unique on write to the event store.
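A minimal sketch of that constrained write, assuming a hypothetical JDBC event store with an events table and a unique index on (aggregate_id, version); table, column and method names are illustrative:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class JdbcEventStore {

    private final Connection connection;

    public JdbcEventStore(Connection connection) {
        this.connection = connection;
    }

    // Appends an event at the expected version; the unique index on (aggregate_id, version)
    // makes this insert fail if another writer has already appended that version.
    public void append(String aggregateId, long version, String payload) throws SQLException {
        String sql = "INSERT INTO events (aggregate_id, version, payload) VALUES (?, ?, ?)";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setString(1, aggregateId);
            ps.setLong(2, version);
            ps.setString(3, payload);
            ps.executeUpdate(); // a unique-constraint violation here is the concurrency exception
        }
    }
}
A caller that catches the constraint violation can then reload the aggregate and retry the command, which is exactly the flow described below.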
Let's see how our example would work in a distributed system with optimistic concurrency control:
User A issues an OrderItems command for aggregate #123
Aggregate #123 is restored from events {version 1, quantity 5}
User B issues an OrderItems command for aggregate #123
Another instance of aggregate #123 is restored from events {version 1, quantity 5}
The instance of the aggregate for user A performs the command; it succeeds, and the event ITEMS_ORDERED {aggregateId 123, version 2} is written to the event store.
The instance of the aggregate for user B performs the command; it succeeds locally, but the attempt to write the event ITEMS_ORDERED {aggregateId 123, version 2} to the event store fails with a concurrency exception.
On such an exception, the command handler for user B just repeats the whole procedure; this time aggregate #123 is restored into the state {version 2, quantity 2} and the command will be handled correctly (here, rejected, since only 2 items remain for an order of 4).
I hope this clears the case where aggregate versions are useful.
Yes, this is right. You need the version or a sequence number for consistency.
Two things you want:
Correct ordering
Usually events are idempotent in nature, because in a distributed system idempotent messages or events are easier to deal with. Idempotent messages are the ones that give the same result even when applied multiple times. Updating a register with a fixed value (say one) is idempotent, but incrementing a counter by one is not. In distributed systems, when A sends a message to B, B acknowledges A. But if B consumes the message and, due to some network error, the acknowledgement to A is lost, A doesn't know whether B received the message and so it sends the message again. Now B applies the message again, and if the message is not idempotent, the final state will go wrong. So you want idempotent messages. But if you fail to apply these idempotent messages in the same order as they are produced, your state will again be wrong. This ordering can be achieved using the version id or a sequence. If your event store is an RDBMS, you cannot order your events without some such sort key. In Kafka you also have the offset id, and the client keeps track of the offset up to which it has consumed.
Deduplication
Secondly, what if your messages are not idempotent? Or what if your messages are idempotent but the consumer invokes some external service in a non-deterministic way? In such cases, you need exactly-once semantics, because if you apply the same message twice, your state will be wrong. Here also you need the version id or sequence number. If, at the consumer end, you keep track of the version id you have already processed, you can dedupe based on the id. In Kafka, you might then want to store the offset id at the consumer end.
Further clarifications based on comments:
The author of the article in question assumed an RDBMS as an event store. The version id or the event sequence is expected to be generated by the producer. Therefore, in your example, the "delivered" event will have a higher sequence than the "in transit" event.
The problem happens when you want to process your events in parallel. What if one consumer gets the "delivered" event and the other consumer gets the "in transit" event? Clearly you have to ensure that all events of a particular order are processed by the same consumer. In Kafka, you solve this problem by choosing the order id as the partition key. Since one partition will be processed by one consumer only, you know you'll always get the "in transit" event before the "delivered" one. But multiple orders will be spread across different consumers within the same consumer group, and thus you still get parallel processing.
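A small sketch of that keying, with an illustrative topic name and string payloads; the only point is that records with the same key land on the same partition and are therefore consumed in order:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String orderId = "order-42";
            // Same key -> same partition -> one consumer sees "in transit" before "delivered"
            producer.send(new ProducerRecord<>("order-events", orderId, "IN_TRANSIT"));
            producer.send(new ProducerRecord<>("order-events", orderId, "DELIVERED"));
        }
    }
}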
Regarding the aggregate id, I think it is synonymous with a topic in Kafka. Since the author assumed an RDBMS store, he needs some identifier to segregate different categories of messages. In Kafka you do that by creating separate topics, and also consumer groups per aggregate.

Which Spring transaction isolation level to use to maintain a counter for products sold?

I have an e-commerce site written with Spring Boot + Angular. I need to maintain a counter in my product table for tracking how many have been sold. But the counter sometimes becomes inaccurate when many users are ordering the same item concurrently.
In my service code, I have the following transactional declaration:
@Transactional(propagation = Propagation.REQUIRES_NEW, isolation = Isolation.READ_COMMITTED)
in which, after persisting the order (using CrudRepository.save()), I do a select query to sum the quantities ordered so far, hoping the select query will count all orders that have been committed. But that doesn't seem to be the case; from time to time the counter is less than the actual number.
The same issue happens for my other use case: limiting the quantity of a product. I use the same transaction isolation setting. In the code, I do a select query to see how many have been sold and throw an out-of-stock error if we can't fulfill the order. But for hot items, we sometimes oversell the item because each thread doesn't see the orders just committed in other threads.
So is READ_COMMITTED the right isolation level for my use case? Or should I do pessimistic locking for this use case?
UPDATE 05/13/17
I chose Ruben's approach, as I know more about Java than databases, so I took the road that was easier for me. Here's what I did.
@Transactional(propagation = Propagation.REQUIRES_NEW, isolation = Isolation.SERIALIZABLE)
public void updateOrderCounters(Purchase purchase, ACTION action)
I use JpaRepository, so I don't work with the entityManager directly. Instead, I just put the code that updates the counters in a separate method and annotated it as above. It seems to work well so far. I have seen >60 concurrent connections placing orders with no overselling, and the response time seems OK as well.
Depending on how you retrieve the total sold items count, the available options might differ:
1. If you calculate the sold items count dynamically via a sum query on orders
I believe in this case the option you have is using SERIALIZABLE isolation level for the transaction, since this is the only one which supports range locks and prevents phantom reads.
However, I would not really recommend going with this isolation level, since it has a major performance impact on your system (or it should be used really carefully, in well-designed spots only).
Links : https://dev.mysql.com/doc/refman/5.7/en/innodb-transaction-isolation-levels.html#isolevel_serializable
2. If you maintain a counter on product or some other row associated with the product
In this case I would probably recommend using row-level locking, e.g. select for update, in a service method which checks the availability of the product and increments the sold items count. The high-level algorithm of the order placement could be similar to the steps below:
Retrieve the row storing the number of remaining/sold items using a select for update query (@Lock(LockModeType.PESSIMISTIC_WRITE) on a repository method).
Make sure that the retrieved row has up-to-date field values, since it could be served from the Hibernate session-level cache (Hibernate would just execute the select for update query on the id to acquire the lock). You can achieve this by calling entityManager.refresh(entity).
Check the count field of the row and if the value is fine with your business rules then increment/decrement it.
Save the entity, flush the hibernate session, and commit the transaction (explicitly or implicitly).
The meta code is below:
@Transactional
public Product performPlacement(@Nonnull final Long id) {
    Assert.notNull(id, "Product id should not be null");
    entityManager.flush();
    final Product product = entityManager.find(Product.class, id, LockModeType.PESSIMISTIC_WRITE);
    // Make sure to get the latest version from the database after acquiring the lock,
    // since if a load was performed in the same Hibernate session then Hibernate will only acquire the lock but use fields from the cache
    entityManager.refresh(product);
    // Execute check and booking operations
    // This method call could just check if availableCount > 0
    if (product.isAvailableForPurchase()) {
        // This method could potentially just decrement the available count, e.g. --availableCount
        product.registerPurchase();
    }
    // Persist the updated product
    entityManager.persist(product);
    entityManager.flush();
    return product;
}
This approach will make sure that no two threads/transactions ever perform the check and update on the same row storing the count of a product concurrently.
However, because of that it will also have some performance degradation effect on your system, hence it is essential to make sure that the atomic increment/decrement is used as late in the purchase flow as possible and as rarely as possible (e.g. right in the checkout handling routine when the customer hits pay). Another useful trick for minimizing the effect of the lock would be adding that 'count' column not to the product itself but to a different table associated with the product. This will prevent you from locking the product rows, since the locks will be acquired on a different row/table combination used purely during the checkout stage.
Links: https://dev.mysql.com/doc/refman/5.7/en/innodb-locking-reads.html
Summary
Please note that both of the techniques introduce extra synchronization points into your system, hence reducing throughput. So please make sure to carefully measure the impact they have on your system via a performance test or whatever other technique is used in your project for measuring throughput.
Quite often online shops choose to go towards overselling/overbooking some items rather than hurting performance.
Hope this helps.
With these transaction settings, you should see the stuff that is committed. But still, your transaction handling isn't watertight. The following might happen:
Let's say you have one item in stock left.
Now two transactions start, each ordering one item.
Both check the inventory and see: "Fine, enough stock for me."
Both commit.
Now you oversold.
Isolation level serializable should fix that. BUT
the isolation levels available in different databases vary widely, so I don't think it is actually guaranteed to give you the requested isolation level
this seriously limits scalability. The transactions doing this should be as short and as rare as possible.
Depending on the database you are using, it might be a better idea to implement this with a database constraint. In Oracle, for example, you could create a materialized view calculating the complete stock and put a constraint on the result to be non-negative.
Update
For the materialized view approach you do the following.
Create a materialized view that calculates the value you want to constrain, e.g. the sum of orders. Make sure the materialized view gets updated in the transactions that change the content of the underlying tables.
For oracle this is achieved by the ON COMMIT clause.
ON COMMIT Clause
Specify ON COMMIT to indicate that a fast refresh is to occur whenever the database commits a transaction that operates on a master table of the materialized view. This clause may increase the time taken to complete the commit, because the database performs the refresh operation as part of the commit process.
See https://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_6002.htm for more details.
Put a check constraint on that materialized view to encode the constraint that you want, e.g. that the value is never negative. Note that a materialized view is just another table, so you can create constraints just as you would normally do.
See for example https://www.techonthenet.com/oracle/check.php

How to implement an ETL Process

I would like to implement a synchronization between a source SQL-based database and a target triplestore.
However, for the sake of simplicity, let's just say two databases. I wonder what approaches to use to have every change in the source database replicated in the target database. More specifically, I would like that each time some row changes in the source database, this can be seen by a process that reads the changes and populates the target database accordingly, while applying some transformation in the middle.
I have seen suggestions around notification mechanisms that may be available in the database, or building tables such that changes can be tracked (meaning doing it manually) and having the process poll them at different intervals, or the use of logs (change data capture, etc.).
I'm seriously puzzled by all of this. I wonder if anyone could give some guidance and explanation about the different approaches with respect to my objective, meaning: the names of the methods and where to look.
My organization mostly uses Postgres and Oracle databases.
I have to take relational data and transform it into RDF so as to store it in a triplestore, and keep that triplestore constantly synchronized with the data in the SQL store.
Please,
Many thanks
PS:
A clarification of the difference between ETL and replication techniques such as change data capture, with respect to my overall objective, would be appreciated.
Again, I need to make sense of the subject and know what the methods are, so I can start digging further for myself. So far I have understood that CDC is the new way to go.
Assuming you can't use replication and you need some kind of ETL process to actually extract, transform and load all changes into the destination database, you could use insert, update and delete triggers to fill a (manually created) audit table. Give it columns GeneratedId, TableName, RowId, Action (insert, update, delete) and a boolean flag that indicates whether your ETL process has already processed the change. Use that table to get all the changed rows in your database and transport them to the destination database. Then delete the processed rows from the audit table so that it doesn't grow too big. How often you have to run the ETL process depends on the amount of changes occurring in the source database.
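To make the shape of that concrete, here is a minimal sketch of the polling side, assuming a hypothetical audit table audit_log(generated_id, table_name, row_id, action, processed) filled by the triggers; the transformAndLoad step stands in for whatever transformation and target write you need:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class AuditTableEtl {

    // Reads unprocessed audit rows, pushes them to the target, then removes them.
    public void runOnce(Connection source) throws SQLException {
        String select = "SELECT generated_id, table_name, row_id, action FROM audit_log "
                + "WHERE processed = false ORDER BY generated_id";
        String delete = "DELETE FROM audit_log WHERE generated_id = ?";

        try (PreparedStatement sel = source.prepareStatement(select);
             PreparedStatement del = source.prepareStatement(delete);
             ResultSet rs = sel.executeQuery()) {
            while (rs.next()) {
                long id = rs.getLong("generated_id");
                // Transform the changed row and write it to the destination database
                transformAndLoad(rs.getString("table_name"), rs.getString("row_id"), rs.getString("action"));
                // Remove the processed entry so the audit table does not grow too big
                del.setLong(1, id);
                del.executeUpdate();
            }
        }
    }

    private void transformAndLoad(String table, String rowId, String action) {
        // Hypothetical: apply the transformation and write to the target store (e.g. the triplestore)
    }
}
How often you schedule runOnce (cron, a timer thread, etc.) corresponds to the polling interval mentioned in the question.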

The best way to track data changes in Oracle

As the title says, what's the best way to track data changes in Oracle? I just want to know which rows are being updated/deleted/inserted.
At first I thought about triggers, but I would need to write triggers on each table and then record the affected rowids into my change table, which is not good. Then I searched Google and learned about new concepts: materialized view logs and change data capture.
A materialized view log is good for me in that I can compare it to the original table and get the changed records, even the changed fields. I think this is the same as creating/copying a new table from the original (but I don't know what the difference is).
The change data capture component is too complicated for me :), so I don't want to waste my time researching it.
Does anybody have experience with the best way to track data changes in Oracle?
You'll want to have a look at the AUDIT statement. It gathers all auditing records in the SYS.AUD$ table.
Example:
AUDIT INSERT, UPDATE, DELETE ON t BY ACCESS;
Regards,
Rob.
You might want to take a look at Golden Gate. This makes capturing changes a snap, at a price but with good performance and quick setup.
If performance is no issue, triggers and audit could be a valid solution.
If performance is an issue and Golden Gate is considered too expensive, you could also use Logminer or Change Data Capture. Given this choice, my preference would go for CDC.
As you see, there are quite a few options, near realtime and offline.
Coding a solution by hand also has a price, Golden Gate is worth investigating.
Oracle does this for you via the redo logs; it depends on what you're trying to do with this info. I'm assuming your need is replication (track changes on a source instance and propagate them to one or more target instances).
If that's the case, you may consider Oracle Streams (there are other options such as Advanced Replication, but you'll need to consider your needs):
From Oracle:
When you use Streams, replication of a DML or DDL change typically includes three steps:
A capture process or an application creates one or more logical change records (LCRs) and enqueues them into a queue. An LCR is a message with a specific format that describes a database change. A capture process reformats changes captured from the redo log into LCRs, and applications can construct LCRs. If the change was a data manipulation language (DML) operation, then each LCR encapsulates a row change resulting from the DML operation to a shared table at the source database. If the change was a data definition language (DDL) operation, then an LCR encapsulates the DDL change that was made to a shared database object at a source database.
A propagation propagates the staged LCR to another queue, which usually resides in a database that is separate from the database where the LCR was captured. An LCR can be propagated to a number of queues before it arrives at a destination database.
At a destination database, an apply process consumes the change by applying the LCR to the shared database object. An apply process can dequeue the LCR and apply it directly, or an apply process can dequeue the LCR and send it to an apply handler. In a Streams replication environment, an apply handler performs customized processing of the LCR and then applies the LCR to the shared database object.
