KTable vs KSqlDb - apache-kafka-streams

I'd like to understand the difference between KTable and ksqlDB. I need two data flows from my "states" topic:
Actual snapshot of a state as key-value store
Subscription to events of state data changes
For the first case I could create a compacted topic and use a KTable as a key-value store that receives the updates. For the second case I would use a consumer to subscribe to the state events.
Is it possible to use ksqlDB for these cases? What is the difference?

Yes, it is possible to use ksqlDB in both cases. ksqlDB offers robust database features and simplifies development with an interface as familiar as a relational database. For a comprehensive comparison you can read this article.
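To make the two flows concrete, here is a minimal Kafka Streams sketch, assuming a compacted "states" topic keyed by state id with String values (the topic, store and application names are placeholders): the KTable gives you the queryable snapshot, and its changelog stream gives you the change subscription. In ksqlDB the same two views correspond roughly to a CREATE TABLE and a CREATE STREAM over the same topic.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class StatesTopology {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // 1) Snapshot: a KTable materialized as a local key-value store,
        //    queryable via interactive queries ("states-store" is a placeholder name).
        KTable<String, String> snapshot = builder.table("states",
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("states-store"));

        // 2) Change subscription: every update to the table as a stream of change events.
        snapshot.toStream().foreach((id, state) -> System.out.println("state changed: " + id));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "states-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }
}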

Related

How to deal with concurrent events in an event-driven architecture

Suppose I have an eCommerce application designed in an event-driven architecture. I would publish events like ProductCreated and ProductPriceUpdated. Typically both events are published on separate channels.
Now a consumer of those events comes into play and would react to them, for example to generate a price chart for specific products.
In fact this consumer must first consume the ProductCreated event to create a Product entity with the necessary information in its own bounded context. Only after a product has been created can price points be added to the chart. Depending on the consumer's performance it can easily happen that those events arrive "out-of-order".
What are the possible strategies to fulfill this requirement?
The following came to my mind:
Publish both events onto the same channel with ordering guarantees. For example in Kafka both events would be published in the same partition. However this would mean that a topic/partition would grow with its events, I would have to deal with different schemas and the documentation would grow.
Use documents over events. Simply publishing every state change of the product entity as a single ProductUpdated event or similar. This way I would lose semantics from the message and need to figure out what exactly changed on consumer-side.
Defer event consumption. So if my consumer would consume a ProductPriceUpdated event and I don't have such a product created yet, I postpone the consumption by storing it in a database and come back at a later point or use retry-topics in Kafka terms.
Create a minimal entity. Once I receive a ProductPriceUpdated event I would probably have a correlation id or something to identify the entity, and simply create an entity with just this id; once a ProductCreated event arrives I fill in the missing information.
Just thought of giving you some inline comments, based on my understanding of your requirements (#1, #3 and #4).
Publish both events onto the same channel with ordering guarantees. For example in Kafka both events would be published in the same partition. However this would mean that a topic/partition would grow with its events, I would have to deal with different schemas and the documentation would grow.
[Chris] : Apache Kafka preserves the order of messages within a partition. But, the mapping of keys to partitions is consistent only as long as the number of partitions in a topic does not change. So as long as the number of partitions is constant, you can be sure the order is guaranteed. When partitioning keys is important, the easiest solution is to create topics with sufficient partitions and never add partitions.
Defer event consumption. So if my consumer would consume a ProductPriceUpdated event and I don't have such a product created yet, I postpone the consumption by storing it in a database and come back at a later point or use retry-topics in Kafka terms.
[Chris]: If latency is not of a concern, and if we are okay with an additional operation overhead of adding a new entity into your solution, such as a storage layer, this pattern looks fine.
Create a minimal entity. Once I receive a ProductPriceUpdated event I would probably have a correlation id or something to identify the entity, and simply create an entity with just this id; once a ProductCreated event arrives I fill in the missing information.
[Chris]: This is a fairly common integration pattern (Messaging Layer -> Backend REST API) that works over a unique identifier, in this case a correlation id.
This can easily be achieved if you have separate topics and a consumer per event, and the order of messages from the producer is guaranteed. Thus, option #1 becomes obsolete.
From my perspective, options #3 and #4 look one and the same, and #4 would be ideal.
On another note, if you are thinking of bringing Kafka Streams/Tables into your solution, just go for it: there is a strong relationship between streams and tables known as the stream-table duality.
The duality of streams and tables lets your application support more elastic, fault-tolerant stateful processing and run interactive queries. KSQL adds more flavour on top, because this use case is essentially data enrichment at the integration layer.
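To make option #4 concrete, here is a rough consumer-side sketch (the class, field and handler names are illustrative, not from your system): whichever event arrives first creates or enriches the same entity keyed by product id, so an early ProductPriceUpdated simply leaves a minimal entity behind for the later ProductCreated to fill in.

import java.util.HashMap;
import java.util.Map;

// Sketch of option #4 on the consumer side: create a minimal entity keyed by
// productId when a price event arrives first, and fill in the rest when the
// ProductCreated event shows up. All types here are illustrative placeholders.
public class ProductProjection {

    static class Product {
        String name;          // unknown until ProductCreated is consumed
        Double price;         // unknown until ProductPriceUpdated is consumed
    }

    private final Map<String, Product> products = new HashMap<>();

    // Handler for ProductCreated events
    public void onProductCreated(String productId, String name) {
        Product p = products.computeIfAbsent(productId, id -> new Product());
        p.name = name;        // enrich the (possibly minimal) entity
    }

    // Handler for ProductPriceUpdated events
    public void onPriceUpdated(String productId, double price) {
        // If the product is not known yet, create a minimal entity with just the id.
        Product p = products.computeIfAbsent(productId, id -> new Product());
        p.price = price;      // the price chart can be built once the name arrives
    }
}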

How to implement Event sourcing and a database in a microservice architecture?

I have been learning lately about the microservices architecture and its features.
In this source it appears that event sourcing replaces the database; however, it is later stated:
The event store is difficult to query since it requires typical queries to reconstruct the state of the business entities. That is likely to be complex and inefficient. As a result, the application must use Command Query Responsibility Segregation (CQRS) to implement queries.
In the CQRS Page the author seems to describe a singular database that listens to all events and reconstructs itself.
My questions are:
What is actually needed to implement event sourcing with a queryable database? In particular:
Where is the events database? Where is the queryable database? Do I need multiple event stores, one per service, or can I store events in a message broker like Kafka? Is the CQRS database actually one "whole" database that collects all the events? And how can all of this scale?
I'm sorry if I'm not clear with my question, I am very confused myself. I guess I'm looking for a full example architecture of how things will look in the grand picture.
Where is the queryable database?
I'm guessing this is the most useful starting point, because it will be most familiar. The queryable database is in the same place that your this-is-the-entire-database was when you weren't doing event sourcing.
That could be a database exclusively to support this microservice, or it could be a database that is shared by several microservices, with some part of the schema where this microservice has exclusive write authority. Another way of thinking about this: the microservices are using different logical databases, which might be physically deployed together.
Where is the events database?
Same general idea - you can have one events database per microservice; or you could have several different microservices sharing the same database. Again, you have partitioning of authority, and the same logical vs physical separation to consider.
What changes with the introduction of events and CQRS is that the query/reporting database no longer stores the authoritative copy of the information that is used by the microservice. The authoritative information lives in the event store, and the query/reporting database acts more like a cache.
Our command handlers will typically load information only from the authoritative store (aka the events); that's the data that we lock if we are processing commands concurrently.
We copy information that is stored in the events into the query/reporting database(s). Depending on our needs, that can be done synchronously by the command handlers, but it is more common to use background batch processing to do that work, meaning that the data in the reporting database will often be a little bit stale.
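As a hedged illustration of that copy step, a projector along these lines (the event and row types are made up for the example) applies events from the authoritative store to a read model; in practice the map would be a reporting table and the apply call would run in a background process, which is why the read side can lag a little.

import java.util.HashMap;
import java.util.Map;

// Sketch of the "copy into the reporting database" step: a projector applies
// events from the authoritative store to a read model. Here the read model is
// just an in-memory map standing in for a reporting table; the event types are
// illustrative placeholders.
public class OrderSummaryProjector {

    record OrderPlaced(String orderId, double total) {}
    record OrderShipped(String orderId) {}

    static class OrderSummary {
        double total;
        String status = "PLACED";
    }

    private final Map<String, OrderSummary> readModel = new HashMap<>();

    // Typically called from a background process that tails the event store,
    // so the read model may lag slightly behind the events (eventual consistency).
    public void apply(Object event) {
        if (event instanceof OrderPlaced placed) {
            OrderSummary row = new OrderSummary();
            row.total = placed.total();
            readModel.put(placed.orderId(), row);
        } else if (event instanceof OrderShipped shipped) {
            readModel.computeIfAbsent(shipped.orderId(), id -> new OrderSummary())
                     .status = "SHIPPED";
        }
    }
}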
can I store events in a message broker like Kafka?
Current consensus is that Kafka cannot reliably be used for event sourcing as understood by the CQRS community.
https://issues.apache.org/jira/browse/KAFKA-2260
https://cwiki.apache.org/confluence/display/KAFKA/KIP-27+-+Conditional+Publish
Roughly, the problem is this: when you have two processes with the authority to write events, how do you ensure that they don't introduce inconsistencies? With event stores we can use locks, or conditional writes (aka compare and swap), to ensure that nobody came along and snuck in a few extra events that might change the events we are writing.
With Kafka, there doesn't seem to be a mechanism that supports that kind of prevention, so you need to lean more on detecting conflicts after the fact and apologizing (compensating), or something along those lines.
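For contrast, here is a toy sketch of the conditional-write idea that dedicated event stores offer (the in-memory store and method names are invented for illustration): the append succeeds only if the stream still has the version the command handler read; otherwise the handler must re-read and retry.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a conditional write ("compare and swap" on the stream version).
// The store below is an in-memory stand-in; real event stores expose the same
// expectedVersion idea through their append APIs.
public class InMemoryEventStore {

    private final ConcurrentHashMap<String, List<Object>> streams = new ConcurrentHashMap<>();

    // Append only if nobody else has written since we loaded the stream.
    public synchronized void append(String streamId, long expectedVersion, List<Object> newEvents) {
        List<Object> stream = streams.computeIfAbsent(streamId, id -> new ArrayList<>());
        if (stream.size() != expectedVersion) {
            // Someone snuck in extra events; the command handler must re-read and retry.
            throw new IllegalStateException(
                "Concurrent write on " + streamId + ": expected version " + expectedVersion
                + " but found " + stream.size());
        }
        stream.addAll(newEvents);
    }
}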
Is the CQRS database actually one "whole" database that collects all the events?
Logically? No. But you certainly can combine them physically into the same appliance. For example, message-db is "just" a Postgres schema with some tables, functions, and so on. You certainly could combine that with the tables you use for queries and reports.
I'm looking for a full example architecture of how things will look in the grand picture.
The materials published by Greg Young in 2010 might be a decent starting point.
Event sourcing is not a replacement for the DB. It has some benefits and some challenges, so we should choose it wisely. If you are not comfortable with it, don't choose it; you can implement the microservice style without event sourcing.
Queryable DB - the simple solution is to implement the CQRS pattern and keep your query DB in sync with the event-sourced DB.
The event DB should live with the owning service: if you are keeping events about Order, it should be in the Order service. (Yes, other services can have a replica of the same.)
You may use Kafka as intermediate storage for events, but not as the final one.
CQRS is not about one DB. It is a pattern where we use two DB models, one for commands and another one for queries.
If you know Java, refer to the book "Microservice Patterns" by Chris Richardson; if you are from the C# or Microsoft technology stack, you may refer to https://github.com/dotnet-architecture/eShopOnAzure.

How to introduce CQRS to an existing project?

I want to introduce CQRS using asynchronous events and read models stored in elasticsearch.
I have two questions:
How can I fill Elasticsearch with data so that it is consistent with the entire system? I have a lot of data in other microservices (user, item etc.). After deploying the CQRS feature, Elasticsearch will listen for different events to track system state.
What are the strategies to keep Elasticsearch consistent with the entire system when something goes wrong (for example RabbitMQ didn't consume some events)?
Well, the way you are laying out the scenarios, the two things that directly come to mind are:
Eventual Consistency
Following Saga patterns to take care of reverting the data
If you have a write database other than Elasticsearch, then sending events through a message broker is probably the only way to keep the data consistent in Elasticsearch.
If any data has to be reverted, then saga patterns should probably be followed.
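One practical way to cover both of your points, sketched below under the assumption that Elasticsearch's plain document REST API is used (the host, index and handler names are placeholders): write every projection with a deterministic document id, so the same handler can be pointed at a full replay of past events to backfill the index, and redelivered or duplicated events simply overwrite the same document.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of an idempotent projection into Elasticsearch: PUT /index/_doc/{id}
// creates or replaces the document, so both the initial backfill (replaying old
// events) and redeliveries after a broker hiccup converge to the same state.
public class ElasticsearchProjector {

    private final HttpClient http = HttpClient.newHttpClient();

    public void onItemUpdated(String itemId, String itemJson) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/items/_doc/" + itemId))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(itemJson))
                .build();
        http.send(request, HttpResponse.BodyHandlers.ofString());
    }
}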

Joining separate topics with Kafka Streams?

In my current project we have created a data pipeline using Kafka, Kafka Connect and Elasticsearch. The data ends up on a topic "signal-topic" and is of the form
KeyValue<id:String, obj:Signal>
Now I'm trying to introduce Kafka Streams to be able to do some processing of the data on its way from Kafka to Elasticsearch.
My first goal is to be able to enrich the data with different kinds of side information. A typical scenario would be to attach another field to the data based on some information already existing in the data. For instance, the data contains a "rawevent" field and based on that I want to add an "event-description" and then output to a different topic.
What would be the "correct" way of implementing this?
I was thinking of maybe having the side data on a separate topic in Kafka
KeyValue<rawEvent:String, eventDesc:String>
and having Kafka Streams join the two topics, but I'm not sure how to accomplish that.
Would this be possible? All examples that I've come across seem to require that the keys of the data sources are the same, and since mine aren't I'm not sure it's possible.
If anyone have a snippet for how this could be done it would be great.
Thanks in advance.
You have two possibilities:
You can extract rawEvent from Signal and set it as the new key to do the join against a KTable<rawEvent:String, eventDesc:String>. Something like KStream#selectKey(...)#join(KTable...)
You can do a KStream-GlobalKTable join: this allows you to extract a non-key join attribute from the KStream (in your case rawEvent) that is used to do a GlobalKTable lookup to compute the join.
Note that both joins provide different semantics: a KStream-KTable join is synchronized on time, while a KStream-GlobalKTable join is not. Check out this blog post for more details: https://www.confluent.io/blog/crossing-streams-joins-apache-kafka/
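A sketch of the second option, assuming the Signal value exposes its rawEvent field (the types and topic names below are placeholders, and real code would supply matching Serdes via Consumed/Produced):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

// Sketch of the KStream-GlobalKTable join: the Signal stream keeps its id key,
// and rawEvent is extracted per record as the lookup key into the description table.
// Signal/EnrichedSignal are stand-ins for your own types.
public class SignalEnricher {

    record Signal(String rawEvent, String payload) {}
    record EnrichedSignal(Signal signal, String eventDesc) {}

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, Signal> signals = builder.stream("signal-topic");
        GlobalKTable<String, String> descriptions = builder.globalTable("event-descriptions");

        KStream<String, EnrichedSignal> enriched = signals.join(
                descriptions,
                (id, signal) -> signal.rawEvent(),                       // non-key join attribute
                (signal, eventDesc) -> new EnrichedSignal(signal, eventDesc));

        enriched.to("enriched-signal-topic");

        System.out.println(builder.build().describe());                 // print the topology
    }
}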

Given a data ingestion platform with various use cases, what would a good data store be for user configuration data?

The initial use case for our multi-tenant data ingestion platform was to pull in RSS data, file meta data and SQL query results. For this, ElasticSearch was chosen as the data store and Kafka as the microservices message broker.
New streaming, low-latency and time-series data are another requirement. Thus, ElasticSearch is not a contender for this in favor of Aerospike or InfluxDB.
The initial plan was to put user account and configuration data into an ElasticSearch index/topic, as I wanted to have everything in ES.
Based on our growing requirements I can see we may have a variety of different database types depending on the use case. Would continuing to store this information in ES still be a good idea?
Using Kafka as the micro-services bus.
Since you are asking in a Kafka tag, I'm assuming that no matter the use-case and its data store, Kafka will definitely be used.
So why not store user configuration in Kafka?
It sounds like a fairly small topic, so you can set the retention to 100 years or something similar. If you expect the user configuration to change often, you can make it a compacted topic. Now when your microservices start, they just need to read this topic and store the configuration in their memory. This will give you the flexibility to choose the right data store for your application data without worrying too much about the configuration.
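A minimal sketch of that startup step, assuming a compacted topic named "user-config" with String keys and values (topic and group names are placeholders): read it from the beginning and keep the latest value per key in memory.

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Sketch of the startup step: read the compacted "user-config" topic from the
// beginning into an in-memory map. A real service would keep polling afterwards
// to pick up later configuration changes.
public class ConfigLoader {

    public static Map<String, String> loadConfig(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "config-loader-" + System.nanoTime());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Map<String, String> config = new HashMap<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("user-config"));
            // Poll a few times so the small compacted topic is read close to its end;
            // a production loader would compare against the end offsets instead.
            for (int i = 0; i < 5; i++) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    config.put(record.key(), record.value());   // latest value per key wins
                }
            }
        }
        return config;
    }
}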
