Kafka Connect - connector per event type - JDBC

I'm using Kafka to transfer application events to a historical SQL database. The events are structured differently depending on the type, e.g. OrderEvent and ProductEvent, and the two are related by Order.productId = Product.id. I want to store these events in separate SQL tables. I came up with two approaches to transfer this data, but each has a technical problem.
Topic per event type - this approach is easy to configure, but the order of events is not guaranteed across multiple topics, so there may be a problem when an order is consumed before its product exists. This may be solved with foreign keys in the database, so the consumer of the order topic will fail until the product is available in the database.
One topic and multiple event types - using the Schema Registry it is possible to store multiple event types in one topic. Events are now properly ordered, but I'm stuck on the JDBC connector configuration. I haven't found any way to set the SQL table name depending on the event type. Is it possible to configure a connector per event type?
Is the first approach with foreign keys correct? Is it possible to configure connector per event type in the second approach? Maybe there is another solution?

Related

How to deal with concurrent events in an event-driven architecture

Suppose I have an eCommerce application designed in an event-driven architecture. I would publish events like ProductCreated and ProductPriceUpdated. Typically both events are published in separate channels.
Now a consumer of those events comes into play and reacts to them, for example to generate a price chart for specific products.
This consumer must first consume the ProductCreated event to create a Product entity with the necessary information in its own bounded context. Only once a product has been created can price points be added to the chart. Depending on the consumer's performance, it can easily happen that those events arrive "out-of-order".
What are the possible strategies to fulfill this requirement?
The following came to my mind:
1. Publish both events onto the same channel with ordering guarantees. For example, in Kafka both events would be published in the same partition. However, this would mean that a topic/partition would grow with its events, I would have to deal with different schemas, and the documentation would grow.
2. Use documents over events. Simply publish every state change of the product entity as a single ProductUpdated event or similar. This way I would lose semantics from the message and need to figure out what exactly changed on the consumer side.
3. Defer event consumption. If my consumer consumes a ProductPriceUpdated event and the product has not been created yet, I postpone the consumption by storing the event in a database and coming back to it later, or use retry topics in Kafka terms.
4. Create a minimal entity. Once I receive a ProductPriceUpdated event I would probably have a correlation id or something to identify the entity, so I simply create an entity with just this id and, once a ProductCreated event arrives, fill in the missing information.
Here are some inline comments on options #1, #3 and #4, based on my understanding of your requirements.
Publish both events onto the same channel with ordering guarantees. For example in Kafka both events would be published in the same partition. However this would mean that a topic/partition would grow with its events, I would have to deal with different schemas and the documentation would grow.
[Chris]: Apache Kafka preserves the order of messages within a partition. However, the mapping of keys to partitions is consistent only as long as the number of partitions in the topic does not change. So as long as the partition count is constant, you can be sure that ordering per key is guaranteed. When partitioning by key is important, the easiest solution is to create topics with sufficient partitions and never add partitions.
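To illustrate why the partition count matters, here is a minimal sketch of key-based partitioning. It assumes the usual "hash of the key modulo the number of partitions" scheme; `partition_for` is a hypothetical helper, and CRC32 is only a stand-in hash (Kafka's default partitioner uses a different hash function).

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash for illustration only; not the hash Kafka actually uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

keys = [f"product-{i}" for i in range(20)]

# Same key and same partition count always give the same partition,
# so all events for one product stay in order on one partition.
before = {k: partition_for(k, 6) for k in keys}

# After adding partitions, the modulo changes and a key may map elsewhere,
# which can split one product's events across two partitions.
after = {k: partition_for(k, 8) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Any key listed in `moved` would have its older events on one partition and its newer events on another, which is the situation the comment above warns about.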
Defer event consumption. So if my consumer would consume a ProductPriceUpdated event and I don't have such a product created yet, I postpone the consumption by storing it in a database and come back at a later point or use retry-topics in Kafka terms.
[Chris]: If latency is not a concern, and if you are okay with the additional operational overhead of adding a new component to your solution, such as a storage layer, this pattern looks fine.
Create a minimal entity. Once I receive a ProductPriceUpdated event I would probably have a correlation id or something to identify the entity, so I simply create an entity with just this id and, once a ProductCreated event arrives, fill in the missing information.
[Chris]: This is a common integration pattern (Messaging Layer -> Backend REST API) that we adopt. It works off a unique identifier, in this case a correlation id.
This can be easily achieved if you have a separate topic and consumer per event type and the order of messages from the producer is guaranteed. Thus, option #1 becomes obsolete.
From my perspective, options #3 and #4 look much the same, and #4 would be ideal.
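A minimal sketch of the "minimal entity" option, assuming events are plain dicts carrying a `productId` field as the correlation id. The `Product` class, the `handle` function and the field names are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Product:
    product_id: str
    name: Optional[str] = None  # unknown until ProductCreated arrives
    prices: List[float] = field(default_factory=list)

products: Dict[str, Product] = {}

def handle(event: dict) -> None:
    # Create a placeholder entity if this id has not been seen before.
    product = products.setdefault(event["productId"], Product(event["productId"]))
    if event["type"] == "ProductCreated":
        product.name = event["name"]           # fill in the missing information
    elif event["type"] == "ProductPriceUpdated":
        product.prices.append(event["price"])  # works even on a placeholder

# The price update arrives before the creation event.
handle({"type": "ProductPriceUpdated", "productId": "p1", "price": 9.99})
handle({"type": "ProductCreated", "productId": "p1", "name": "Widget"})
```

Because both handlers are written as upserts against the same id, the final state is the same whichever event arrives first.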
On another note, if you are thinking of bringing Kafka Streams and tables into your solution, just go for it: there is a strong relationship between streams and tables, known as duality.
The duality of streams and tables lets your application support more elastic, fault-tolerant stateful transactions and run interactive queries. KSQL adds more flavour to it, because this use case is just one of data enrichment at the integration layer.

Spring boot kafka: Microservice multi instances, concurrency and partitions

I have a question about how to publish and read messages in Kafka in a microservice architecture where multiple instances of the same microservice write and read.
My main problem is that the microservices that publish and read are configured with autoscaling, with a default number of instances of 1.
I have an entity, let's call it "Event", that is stored in the database, and each entity has its own ID in the database. When certain commands are executed on a specific entity (say with entityID = ajsha87), a message must be published that will be read by a consumer. If the messages for the same entity are written to different partitions and consumed at the same time (a concurrency issue), I will have a lot of problems.
My question is whether, based on the entityID for example, I can choose the partition to which all events of that entity will be published. For another entity with a different ID I don't care about the partition, but the messages for the same entity must always be published to the same partition, so that a consumer never reads a message (2) before a message (1) that was published earlier.
Is there any mechanism to do that, or each time I save the entity do I have to pick a random partition ID and store it in the database so that its messages are published there?
The same happens with consumers. Only one consumer should read a partition at a time, because otherwise consumer number 1 could read message (1) from partition (1) related to an entity (ID=78198), and then another consumer could read message (2) from partition (1) related to the same entity and process message 2 before message 1.
Is there any mechanism to subscribe each instance to only one partition, in line with the microservice autoscaling?
Another option would be to dynamically assign a partition to each new publisher instance, but I don't know how to configure that dynamically so that different partition IDs are set according to the microservice instance.
I am using Spring Boot, by the way.
Thanks for your answers and recommendations, and sorry if my English is not good enough.
If you use the hash partitioner as the partitioner in the producer config (this is the default partitioner in many libraries) and use the same key for the same entity (say entityID = ajsha87), Kafka sends all messages with the same key to the same partition.
If you are using a consumer group, one consumer instance takes responsibility for a partition, and all messages published to that partition are consumed by that instance only. The instance can change if there is a rebalance when scaling up, but messages in the same partition will still be read by a single consumer instance.
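The consumer-group behaviour described above can be sketched as follows. This is a simplified model, not Kafka's actual assignment algorithm: `assign` is a hypothetical round-robin stand-in for a rebalance, and the instance names are made up.

```python
from typing import Dict, List

def assign(partitions: List[int], consumers: List[str]) -> Dict[int, str]:
    # Simplified round-robin stand-in for a consumer group rebalance.
    # Every partition gets exactly one owner within the group.
    return {p: consumers[i % len(consumers)] for i, p in enumerate(sorted(partitions))}

partitions = list(range(4))

# With the default of one instance, that instance owns every partition.
one_instance = assign(partitions, ["instance-a"])

# After autoscaling to two instances, ownership is redistributed,
# but each partition still has a single owner at any given time.
scaled_out = assign(partitions, ["instance-a", "instance-b"])
```

In this model no partition ever has two owners, so two instances in the same group never process messages from the same partition concurrently.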

Joining separate topics with Kafka Streams?

In my current project we have created a data pipeline using Kafka, Kafka Connect and Elasticsearch. The data ends up on a topic "signal-topic" and is of the form
KeyValue<id:String, obj:Signal>
Now I'm trying to introduce Kafka Streams to be able to do some processing of the data on its way from Kafka to Elasticsearch.
My first goal is to be able to enhance the data with different kinds of side information. A typical scenario would be to attach another field to the data based on some information already existing in the data. For instance, the data contains a "rawevent" field, and based on that I want to add an "event-description" and then output to a different topic.
What would be the "correct" way of implementing this?
I was thinking of maybe having the side data on a separate topic in Kafka
KeyValue<rawEvent:String, eventDesc:String>
and having Streams join the two topics, but I'm not sure how to accomplish that.
Would this be possible? All examples that I've come across seem to require that the keys of the data sources be the same, and since mine aren't, I'm not sure it's possible.
If anyone has a snippet for how this could be done, it would be great.
Thanks in advance.
You have two possibilities:
You can extract rawEvent from Signal and set it as the new key to do the join against a KTable<rawEvent:String, eventDesc:String>. Something like KStream#selectKey(...)#join(KTable...)
You can do a KStream-GlobalKTable join: this allows you to extract a non-key join attribute from the KStream (in your case rawEvent) that is used to do a GlobalKTable lookup to compute the join.
Note that the two joins provide different semantics: a KStream-KTable join is synchronized on time, while a KStream-GlobalKTable join is not. Check out this blog post for more details: https://www.confluent.io/blog/crossing-streams-joins-apache-kafka/
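As a language-neutral illustration of the first possibility, the re-key-then-join logic can be modelled in plain Python. This is only a conceptual sketch and not the Kafka Streams API: the sample records, the `enrich` function and the dict standing in for the table are all assumptions.

```python
# Stream records keyed by id, as in KeyValue<id:String, obj:Signal>.
signals = [
    ("id-1", {"rawevent": "E42", "value": 7}),
    ("id-2", {"rawevent": "E99", "value": 3}),
    ("id-3", {"rawevent": "E42", "value": 1}),
]

# Table keyed by rawEvent, as in KeyValue<rawEvent:String, eventDesc:String>.
descriptions = {"E42": "Door opened"}

def enrich(stream, table):
    for _old_key, signal in stream:
        new_key = signal["rawevent"]    # the re-keying step (selectKey analogue)
        description = table.get(new_key)
        if description is None:
            continue                    # inner join: no match, no output
        yield new_key, {**signal, "event-description": description}

enriched = list(enrich(signals, descriptions))
```

The enriched records would then be written to the output topic. A left join would instead emit unmatched records with an empty description.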

Addressing CRUD "tables" in event sourcing

I'm starting down an ES journey and want to know if traditional support tables should be stored in the event log or should be handled differently. These tables would typically have a CRUD page. In other words, would it be common to have two approaches in the same application, one for support tables and one for transactional data?
A support table would be like "Account" in an accounting application or "Product Type" or the actual "Product" table in an ERP application (I'm not writing an ERP application - that's an example of the type of table I'm talking about).
If we store CRUD-type data in the event log, then we might have events:
ProductCreated
ProductUpdated
ProductDeleted (which would just mark it as deleted)
Then, do we attempt to find out what changed (in ProductUpdated event) and just store the change and replay to get the latest image of the Product?
Mostly, I'm after what approach to use for CRUD tables - traditional or store in the event log? Additional information would be great!
Suppose you start purely with an event log, including for events like ProductCreated, etc., and no other data store. What happens then is that every time your application starts up, it has to replay all the events in the log to build its current state.
Now, suppose you create a traditional SQL table to store the current state of your app (say a products table) and the ID of the last event that was processed to get to that state (say a last_event table). What happens then is every time your app starts up, it has to replay only the events with higher IDs than the stored ID and process those to build its new state.
On the flip side, your app now has to be careful to keep these two states synchronised. If you need concurrency, you'll have to make sure you only do atomic operations on your SQL tables, but that should be reasonably easy with transactions.
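A minimal sketch of this snapshot-plus-replay idea, using an in-memory SQLite database. The `products` and `last_event` table names come from the description above. The column layout, the tuple shape of the events and the `catch_up` function are assumptions made for illustration.

```python
import sqlite3

# Hypothetical event log: (event id, type, product id, name)
event_log = [
    (1, "ProductCreated", "p1", "Widget"),
    (2, "ProductUpdated", "p1", "Widget v2"),
    (3, "ProductDeleted", "p1", None),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id TEXT PRIMARY KEY, name TEXT, deleted INTEGER DEFAULT 0)")
conn.execute("CREATE TABLE last_event (id INTEGER NOT NULL)")
conn.execute("INSERT INTO last_event VALUES (0)")
conn.commit()

def catch_up(conn, log):
    """Apply only events newer than the stored id; return how many were applied."""
    (last_id,) = conn.execute("SELECT id FROM last_event").fetchone()
    applied = 0
    for event_id, kind, product_id, name in log:
        if event_id <= last_id:
            continue  # already reflected in the products table
        with conn:  # one transaction: state change and last_event move together
            if kind == "ProductCreated":
                conn.execute("INSERT INTO products (id, name) VALUES (?, ?)", (product_id, name))
            elif kind == "ProductUpdated":
                conn.execute("UPDATE products SET name = ? WHERE id = ?", (name, product_id))
            elif kind == "ProductDeleted":
                conn.execute("UPDATE products SET deleted = 1 WHERE id = ?", (product_id,))
            conn.execute("UPDATE last_event SET id = ?", (event_id,))
        applied += 1
    return applied

first_run = catch_up(conn, event_log[:2])  # initial start-up sees two events
second_run = catch_up(conn, event_log)     # restart replays only event 3
```

Updating `last_event` in the same transaction as the state change is what keeps the two in step: if the transaction fails, neither is written.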
Your support tables are just a read-model/projection of the event stream. In general you don't create those support models just in case you might need them. You create a read-model only if you use it somewhere in the UI.
Anyway, one important benefit behind Event sourcing is that you won't need to use join in your queries. That is, you create a table for each read-model that contains all the data it needs - full denormalisation. You keep that table super-optimised for the query.

DocumentDB unique concurrent insert?

I have a horizontally scaled, event-sourced application that runs using an Azure Service Bus Topic and a Service Bus Queue. Some events for building up my domain model's state are received through the topic by all my servers, while the ones on the queue (which are received a lot more often and do not mutate domain model state) are distributed among the servers in order to spread the load.
Now, every time one of my servers receives an event through the queue or topic, it stores it in a DocumentDB which it uses as event store.
Now here's the problem. How can I be sure that the same document is not inserted twice? Let's say 3 servers receive the same event. They all try to store it. How can I make it fail for 2 of the servers in the case they decide to do it all at the same time? Is there any form of unique constraint I can set in DocumentDB or some kind of transaction scope to prevent the document from being inserted twice?
The id property for each document has a uniqueness constraint. You can use this constraint to ensure that duplicate documents are not written to a collection.
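The effect of that constraint can be sketched with an in-memory stand-in. This does not use the DocumentDB SDK: `InMemoryCollection`, `DuplicateId`, `store` and the `eventId` field are all hypothetical, and the only assumption carried over is that the event's own identifier is used as the document id.

```python
import threading

class DuplicateId(Exception):
    """Raised when a document with the same id already exists."""

class InMemoryCollection:
    # Stand-in for a collection that enforces uniqueness on "id".
    def __init__(self):
        self._docs = {}
        self._lock = threading.Lock()

    def create(self, doc):
        with self._lock:  # check and write happen atomically
            if doc["id"] in self._docs:
                raise DuplicateId(doc["id"])
            self._docs[doc["id"]] = doc

collection = InMemoryCollection()
results = []

def store(server, event):
    try:
        # Use the event's own id as the document id so duplicates collide.
        collection.create({"id": event["eventId"], **event})
        results.append((server, "stored"))
    except DuplicateId:
        results.append((server, "duplicate"))

event = {"eventId": "evt-123", "payload": "order placed"}
threads = [threading.Thread(target=store, args=(f"server-{i}", event)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Exactly one of the three writers succeeds and the other two see a duplicate error, which they can treat as "already stored". With the real service, the error type raised for a duplicate id depends on the SDK in use.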

Resources