UNKNOWN_PRODUCER_ID and deleting store-changelog topic - apache-kafka-streams

Kafka Streams creates lots of implicit topics depending on the topology of our Kafka Streams application. Recently, after we made some incompatible changes in the Avro schemas, we asked our administrators to delete these topics and also the store directories for the Kafka Streams state stores.
Now we have started to see some UnknownProducerIdException in our logs. Can deleting these implicit topics cause such an exception?
We always assumed that if we delete these topics and stores, they will be created and maintained automatically when we restart the Kafka Streams application. Is this assumption correct?
I see the Apache JIRA issues KAFKA-6817 and KAFKA-7190, and KIP-360, for UNKNOWN_PRODUCER_ID, but those don't seem to be directly related to our case.
What would be the correct action for our case (changed Avro schema)? Are we allowed to delete those implicit topics and stores, or should we do something else?
Also, does the 'auto.create.topics.enable' flag have any effect on the creation of those implicit topics?
Thanks for any answers.

Now we have started to see some UnknownProducerIdException in our logs. Can deleting these implicit topics cause such an exception?
Yes.
We always assumed that if we delete these topics and stores, they will be created and maintained automatically when we restart the Kafka Streams application. Is this assumption correct?
Yes, Kafka Streams will recreate those topics.
Also, does the 'auto.create.topics.enable' flag have any effect on the creation of those implicit topics?
No. Kafka Streams does not rely on auto topic creation (in fact, it is generally recommended to disable auto topic creation) but issues explicit create-topic requests via the AdminClient.
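For illustration, here is a minimal sketch (the application id, topic name, and store name are made up) of a stateful topology; on startup, Kafka Streams itself creates the changelog topic backing the store, e.g. my-app-counts-store-changelog, via the AdminClient, regardless of the broker's auto-creation setting:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class InternalTopicsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // application.id becomes the prefix of all internal topic names
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               // the named store is backed by an internal changelog topic
               // (my-app-counts-store-changelog) created by Kafka Streams itself
               .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```

For intentionally wiping internal topics and local state after an incompatible change, the application reset tool (linked later in this document) is the documented way to do it.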

Related

How to migrate to Event-Sourcing?

We are migrating from a legacy monolith application to a microservice architecture. We use the CQRS and event sourcing patterns and a message broker (RabbitMQ) as the communication mechanism. Now we are facing a challenge: how can we convert the old database to the new architecture, and how can we use event sourcing for it? Assuming the old database did not have events, can we do the data conversion without creating events? What is the starting point of our old database data in the event sourcing pattern?
One important thing to remember is that many databases internally event source: every write goes to a log and that log is used to update tables, replicate etc., after which the log is truncated. It's equivalent to event sourcing with a lot of snapshots and very little retention of events and old snapshots.
In these databases (which include the likes of Postgres, MySQL, Oracle, SQL Server, Cassandra, CosmosDB, to name ones I know from experience do this), there's a technique called Change Data Capture which essentially taps into the log and exposes a stream of changes to the database which can be treated as events from the database (or by extension as commands: "one service's events are another service's commands"). Debezium can be used to write CDC records to Kafka; for RabbitMQ you may need to roll something yourself, in which case you'll want to get acquainted with how CDC is exposed in your database.
Even if the database doesn't support CDC, if the data isn't that large, you can often turn it into an ersatz event stream by periodically dumping its data (if the records are timestamped, this can even work if the data is particularly slow moving) and implementing a service to track what changed: this won't tell you about changes that netted out, but it's often better than nothing. This sort of dump is also likely to be required if you need a "genesis" event to ensure that your initial state is current to when you moved to event-sourcing or CDC.
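As a rough illustration of that diffing approach, here is a minimal sketch (the record type, event names, and publish hook are made up): it compares two periodic dumps keyed by id and emits created/updated/deleted events. As noted above, changes that netted out between dumps stay invisible.

```java
import java.util.Map;
import java.util.function.Consumer;

public class DumpDiffer {

    // Hypothetical legacy row: id plus a payload we care about
    record CustomerRow(String id, String payload) {}

    // Hypothetical event emitted towards the message broker
    record ChangeEvent(String type, String id, String payload) {}

    static void diff(Map<String, CustomerRow> previousDump,
                     Map<String, CustomerRow> currentDump,
                     Consumer<ChangeEvent> publish) {
        // Rows present now but not before -> "created"; changed payloads -> "updated"
        for (CustomerRow row : currentDump.values()) {
            CustomerRow old = previousDump.get(row.id());
            if (old == null) {
                publish.accept(new ChangeEvent("customer-created", row.id(), row.payload()));
            } else if (!old.payload().equals(row.payload())) {
                publish.accept(new ChangeEvent("customer-updated", row.id(), row.payload()));
            }
        }
        // Rows present before but gone now -> "deleted"
        for (String oldId : previousDump.keySet()) {
            if (!currentDump.containsKey(oldId)) {
                publish.accept(new ChangeEvent("customer-deleted", oldId, null));
            }
        }
    }
}
```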
This whole broad family of techniques has limitations compared to full event sourcing: reifying what changed is not as valuable as reifying what changed and why it changed. But it can be a useful middle ground in migrating to event-sourcing.
Referring to #alexey-zimarev's answer on this post: it's essential to have a starting event in your event-sourced database. You cannot reconstruct an event-sourced aggregate without replaying its events. Therefore, you need to map the legacy snapshot to an initial domain event for the relevant aggregate.
Either way, considering the definition of event sourcing by Martin Fowler:
The fundamental idea of Event Sourcing is that of ensuring every change to the state of an application is captured in an event object, and that these event objects are themselves stored in the sequence they were applied for the same lifetime as the application state itself.
Therefore, it's not an appropriate solution to migrate legacy snapshots into the new system without extracting and storing domain events. It would turn your event-sourced project into a semi-event-sourced project, which is not a recognized paradigm for design and development.
You have an event store, which is a database for events. You can create the event data that you need from the old database and insert it into the event store. After that, replay the events to create the read models.
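A minimal sketch of that idea (the event type, fields, and projection are made up): map each legacy row to an initial "migrated" event, append it to the event store, then build a read model by replaying the events.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LegacyMigration {

    // Hypothetical domain event that captures the legacy state as a starting point
    record CustomerMigratedFromLegacy(String customerId, String name, String email) {}

    public static void main(String[] args) {
        // Pretend these rows came from the legacy database
        List<String[]> legacyRows = List.of(
                new String[]{"42", "Alice", "alice@example.com"},
                new String[]{"43", "Bob", "bob@example.com"});

        // 1. Convert each legacy row into an initial event and append it to the event store
        List<CustomerMigratedFromLegacy> eventStore = new ArrayList<>();
        for (String[] row : legacyRows) {
            eventStore.add(new CustomerMigratedFromLegacy(row[0], row[1], row[2]));
        }

        // 2. Replay the events to build a read model (here: a simple id -> name projection)
        Map<String, String> readModel = new HashMap<>();
        for (CustomerMigratedFromLegacy event : eventStore) {
            readModel.put(event.customerId(), event.name());
        }

        System.out.println(readModel); // {42=Alice, 43=Bob}
    }
}
```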

AWS SNS — How generic should topics be and when should we reuse/create topics?

We are introducing SNS + SQS to handle event production and propagation in our microservices architecture, which has so far relied on HTTPS calls for communication. We are considering connecting multiple SQS queues to one SNS topic. The events in the queues will then be consumed by a Lambda or a service running on EC2.
My question is, how generic should the topics be? When should we create new topics?
Say, we have a user domain which needs to publish two events—created and deleted. Two options we are considering are:
OPTION A: Have two topics, "user-created" and "user-deleted". Each topic guarantees a single event type.
the consumers would not have to worry about discarding events they are not interested in, as they already know that messages coming from a "user-created" topic relate only to user creations
multiple different parts of the code publishing to the same topic
OPTION B: Have one topic, "users", that accepts multiple event types
the consumers would have the additional responsibility of filtering through the events or taking different actions depending on the type of the event (they can also configure their queue subscriptions to filter out certain event types)
can ensure a single publisher for each topic
Does anyone have a strong preference for either of the options and why would that be?
On a related note, where would you include the cloud configuration for each of the resources? (should the queue resource creation be deployed together with the consumers, or should they live independently from any of the publishers/consumers?)
I think you should go with Option B and keep all events concerning a given "domain" (e.g. "user") in a single topic:
keeps your infrastructure simple
you might introduce services interested in multiple event types (e.g. "create" and "delete"). It's kind of tricky to get the ordering right when consuming these from two topics; imagine a "user-deleted" event arriving before the "user-created" event
throughput might be an issue, but this really depends on your domain (creating and deleting users doesn't sound like a high-volume use case)
think about changes to the data structures in your topics; introducing changes in two or more topics simultaneously can get complicated pretty fast
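If some consumers only care about a subset of event types on the single "users" topic, an SNS subscription filter policy keeps Option B manageable. A minimal sketch with the AWS SDK for Java v2 (the subscription ARN and the "eventType" message attribute are made up for illustration):

```java
import software.amazon.awssdk.services.sns.SnsClient;
import software.amazon.awssdk.services.sns.model.SetSubscriptionAttributesRequest;

public class FilterPolicyExample {
    public static void main(String[] args) {
        try (SnsClient sns = SnsClient.create()) {
            // Only deliver messages whose "eventType" message attribute is "user-created"
            // to this particular SQS subscription of the "users" topic.
            sns.setSubscriptionAttributes(SetSubscriptionAttributesRequest.builder()
                    .subscriptionArn("arn:aws:sns:eu-west-1:123456789012:users:11111111-2222-3333-4444-555555555555")
                    .attributeName("FilterPolicy")
                    .attributeValue("{\"eventType\": [\"user-created\"]}")
                    .build());
        }
    }
}
```

Publishers would then set a matching eventType message attribute on every message they publish to the topic.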
Concerning your other question: keep your topic/infrastructure configuration separate from your services. It's an individual piece of infrastructure (like a database) and should be kept separate, especially if you introduce more consumers and producers to your system.
EDIT: This might be an example "setup":
Repository user-service contains the service/lambda code, cloudformation/terraform templates for the service and its topic subscriptions
Repository sns contains all cloudformation/terraform templates concerning SNS topics
Repository sqs contains all cloudformation/terraform templates concerning SQS queues
You can think about keeping the SNS & SQS infra code in a single repository (the last two), but I would strongly recommend keeping everything specific to a certain service/lambda in separate repositories.
Generally it helps to think about your topics as a "database", this line of thinking should point you in the right direction for all your questions.

Kafka Streams Add New Source to Running Application

Is it possible to add another source topic to the existing topology of a running Kafka Streams Java application? Based on the javadoc (https://kafka.apache.org/23/javadoc/org/apache/kafka/streams/KafkaStreams.html) I am guessing the answer is no.
My Use Case:
A REST API call triggers a new source topic that should be processed by an existing processor. Source topics are stored in a DB and used to generate the topology.
I believe the only option is to shutdown the app and restart it allowing for the new topic to be picked up.
Is there any option to add the source topic without shutting down the app?
You cannot modify the program while it's running. As you point out, to change anything, you need to stop the program and create a new Topology. Depending on your program and the change, you might actually need to reset the application before restarting it. Cf. https://docs.confluent.io/current/streams/developer-guide/app-reset-tool.html
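A minimal sketch of that stop-and-rebuild approach (the topic-loading helper and the processing lambda are placeholders for your own code; props is assumed to carry application.id, bootstrap.servers, and default serdes):

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;

public class RebuildTopologyExample {

    static KafkaStreams buildAndStart(Properties props, List<String> sourceTopics) {
        StreamsBuilder builder = new StreamsBuilder();
        // Subscribe to the current list of source topics loaded from the DB
        builder.<String, String>stream(sourceTopics)
               .foreach((key, value) -> System.out.println(key + " -> " + value)); // your processor here
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        return streams;
    }

    // Called when the REST API registers a new source topic:
    static KafkaStreams onNewTopicRegistered(KafkaStreams running, Properties props, List<String> updatedTopics) {
        running.close();                             // stop the old topology
        return buildAndStart(props, updatedTopics);  // rebuild and restart with the new topic included
    }
}
```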

ruby-kafka: is it possible to publish to two kafka instances at the same time

The current flow of the project I'm working on involves pushing to a local Kafka using the ruby-kafka gem.
Now the need has arisen to add a producer for a remote Kafka and also duplicate the messages there.
I'm looking for a better way than calling Kafka.new(...) twice...
Could you please help me? Do you happen to have any ideas?
Another approach to consider would be writing the data once from your application, and then asynchronously replicating the message from one Kafka cluster to another. There are multiple ways of doing this including Apache Kafka's MirrorMaker, Confluent's Replicator, Uber's uReplicator etc.
Disclaimer: I work for Confluent.

How to send incremental updates and snapshot sync using ActiveMQ topics

Here is my use case: I am developing a trading application and I want to send incremental stock updates (bidQty, etc.) to active consumers instead of the whole quote, and a snapshot update to a new consumer (to start with).
Now, is it possible to override any of ActiveMQ's classes (implementors of Topic) to achieve this behavior? Any clues on this would be helpful.
If the same is possible in any other open-source provider, please let me know.
This is NOT a case where you can simply change the implementation of the topic. You should actually avoid changing the implementation of core ActiveMQ features to solve specific business requirements. Fixing bugs and adding core messaging features is another thing.
There are multiple ways to solve your use case with regular ActiveMQ features.
Separate Sync and Update channel
I would probably divide the "sync/snapshot" channel from the "incremental update" channel.
One way is to implement the "snapshot-sync" as JMS request/reply where the consumer asks the provider for a sync, then continues to rely on incremental updates pushed via the topic.
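A minimal sketch of that request/reply handshake with plain JMS/ActiveMQ (queue name, broker URL, and payloads are made up; both sides are shown in one class only for brevity):

```java
import javax.jms.Connection;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TemporaryQueue;
import javax.jms.TextMessage;

import org.apache.activemq.ActiveMQConnectionFactory;

public class SnapshotSyncRequestReply {
    public static void main(String[] args) throws Exception {
        Connection connection = new ActiveMQConnectionFactory("tcp://localhost:61616").createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

        Queue syncRequests = session.createQueue("stock.sync.requests");
        TemporaryQueue replyTo = session.createTemporaryQueue();

        // Consumer side: ask the provider for a snapshot, pointing it at a temporary reply queue
        MessageProducer requester = session.createProducer(syncRequests);
        TextMessage request = session.createTextMessage("SNAPSHOT_PLEASE");
        request.setJMSReplyTo(replyTo);
        requester.send(request);

        // Provider side (normally a separate process): receive the request and reply with a snapshot
        MessageConsumer providerIn = session.createConsumer(syncRequests);
        Message req = providerIn.receive(5000);
        MessageProducer providerOut = session.createProducer(req.getJMSReplyTo());
        providerOut.send(session.createTextMessage("{\"full\": \"snapshot of current quotes\"}"));

        // Consumer side again: receive the snapshot, then rely on incremental updates from the topic
        MessageConsumer replies = session.createConsumer(replyTo);
        System.out.println(((TextMessage) replies.receive(5000)).getText());
        connection.close();
    }
}
```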
Advisory messages and Selectors
You can also implement it all using a single topic using a mix of AdvisoryMessages and JMS Selectors.
An idea (you can do this in many ways):
Introduce two message properties: MsgType and Receiver
Mark each incremental update with MsgType=inc
Mark each snapshot with the client id of the intended consumer, Receiver=<client id>.
Have the producer listen to advisory messages from ActiveMQ and fire a snapshot/sync message marked with Receiver=<client id> and MsgType=snapshot when a new client subscribes to the stock topic.
The client subscribes with a selector of something like
MsgType='inc' OR (MsgType='snapshot' AND Receiver=<me>)
This way you can trigger snapshot syncs with specific clients as well as incremental updates for all clients.
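A minimal sketch of the selector part with plain JMS/ActiveMQ (topic name, client id, and payloads are made up): the producer stamps the message properties, and each consumer subscribes with the selector shown above.

```java
import javax.jms.Connection;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.jms.Topic;

import org.apache.activemq.ActiveMQConnectionFactory;

public class StockTopicSelectors {
    public static void main(String[] args) throws Exception {
        Connection connection = new ActiveMQConnectionFactory("tcp://localhost:61616").createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Topic topic = session.createTopic("stock.quotes");

        // Consumer "client-42" sees all incremental updates plus snapshots addressed to it
        MessageConsumer consumer = session.createConsumer(topic,
                "MsgType='inc' OR (MsgType='snapshot' AND Receiver='client-42')");

        MessageProducer producer = session.createProducer(topic);

        // Producer side: an incremental update for everyone ...
        TextMessage inc = session.createTextMessage("{\"bidQty\": 100}");
        inc.setStringProperty("MsgType", "inc");
        producer.send(inc);

        // ... and a snapshot targeted at the newly subscribed client
        TextMessage snapshot = session.createTextMessage("{\"full\": \"quote\"}");
        snapshot.setStringProperty("MsgType", "snapshot");
        snapshot.setStringProperty("Receiver", "client-42");
        producer.send(snapshot);

        System.out.println(((TextMessage) consumer.receive(1000)).getText());
        connection.close();
    }
}
```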
If you start to think about the dynamics you already have, you can probably come up with another ten or so solutions.
Retroactive Consumers
You might have some use of a Retroactive Consumer - the example actually shows a scenario similar to yours.
