How to generate an ID for a message when using Kafka Streams? - apache-kafka-streams

I want to try implementing a normal chat system after having read many articles on Confluent Kafka, but I have run into some problems with the structural design.
When using MySQL as my database, I can give an ID to every meaningful record, like user_id in the user table and message_id in the message table. Once the model tables have IDs, it is very convenient for the client and server to communicate.
But in Kafka Streams, how can I give every meaningful model a unique ID in a KTable? Or is it even necessary for me to do this?

Maybe I can answer the question myself.
In MySQL, we can directly use a sequence ID because all data goes to one place and is automatically assigned a new ID there. But when the table grows too large, we need to split it into several smaller tables. In that case, we also have to regenerate a unique ID for each record, because the auto-generated IDs in those tables each start from 0.
Maybe it is the same in Kafka. When we have only one partition, we can use the ID Kafka generates, because all messages go to one place and will never be duplicated. But when we want more partitions, we have to be careful: the IDs generated in different partitions are not globally unique.
So what we should do is generate the ID ourselves. A UUID is a quick way to do this, but if we want a numeric ID, we can implement it with a small algorithm. In a distributed environment we might use a structure like this:
[nodeId + threadId + current_time + auto_incremented_number]
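A minimal sketch of such a generator in Java, along the lines of a Snowflake-style ID. The bit widths, the custom epoch, and folding node and thread into a single nodeId are my own assumptions:

```java
// Sketch of a Snowflake-style ID generator: [timestamp | nodeId | sequence].
// Bit widths and the custom epoch are arbitrary assumptions for illustration.
public class SequenceIdGenerator {
    private static final long EPOCH = 1546300800000L; // 2019-01-01, arbitrary custom epoch
    private static final long NODE_BITS = 10;         // up to 1024 generator instances
    private static final long SEQUENCE_BITS = 12;     // up to 4096 IDs per ms per instance
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

    private final long nodeId; // must uniquely identify this generator (e.g. node + thread)
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SequenceIdGenerator(long nodeId) {
        this.nodeId = nodeId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {                 // sequence exhausted for this millisecond
                while (now <= lastTimestamp) {   // busy-wait for the next millisecond
                    now = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << (NODE_BITS + SEQUENCE_BITS))
                | (nodeId << SEQUENCE_BITS)
                | sequence;
    }
}
```

The client (or the producer) would attach the generated ID to each message before writing it to Kafka, so the identifier stays stable no matter which partition the record lands in.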

Related

How to handle data migrations in distributed microservice databases

So I'm learning about microservices and common patterns, and I can't seem to find how to address this one issue.
Let's say that my customer needs a module managing customers and a module managing purchase orders.
I believe that when dealing with microservices it's pretty natural to split these two functionalities into separate services, each having its own data.
CustomerService
PurchaseOrderService
Also, he wants to have a table of purchase orders displaying data from both customers and purchase orders, i.e.: Customer name, Order number.
Now, I don't want to use the API Composition pattern, because the user must be able to sort by any column he wants, which (AFAIK) is impossible to do with that pattern without slaughtering performance.
Instead, I chose the CQRS pattern:
after every purchase order / customer update, a message is sent to the message broker
the message broker notifies the third service about that message
the third service updates its projection in its own database
So, our third service:
PurchaseOrderTableService
It stores all the required data in a single database, so now we can query it and sort by any column we like while still maintaining good performance.
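For illustration, the projection update described above might look roughly like this (a sketch using Spring Kafka and JdbcTemplate; the topic names, event classes, and table/column names are my own assumptions):

```java
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

// Sketch of the projection updater in PurchaseOrderTableService.
// PurchaseOrderEvent/CustomerEvent and the purchase_order_table schema are assumed.
@Component
public class PurchaseOrderProjectionUpdater {

    private final JdbcTemplate jdbcTemplate;

    public PurchaseOrderProjectionUpdater(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @KafkaListener(topics = "purchase-order-events")
    public void onPurchaseOrderEvent(PurchaseOrderEvent event) {
        // Upsert the row for this order in the denormalized read table.
        jdbcTemplate.update(
            "INSERT INTO purchase_order_table (order_id, order_number, customer_id) " +
            "VALUES (?, ?, ?) " +
            "ON CONFLICT (order_id) DO UPDATE SET order_number = EXCLUDED.order_number",
            event.getOrderId(), event.getOrderNumber(), event.getCustomerId());
    }

    @KafkaListener(topics = "customer-events")
    public void onCustomerEvent(CustomerEvent event) {
        // Propagate the customer name to every order row that references this customer.
        jdbcTemplate.update(
            "UPDATE purchase_order_table SET customer_name = ? WHERE customer_id = ?",
            event.getName(), event.getCustomerId());
    }
}
```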
And now, the tricky part:
In the future, the client can change his mind and say, "Hey, I need the purchase orders table to display an additional column: 'Customer country'."
How does one handle that data migration? So far, the PurchaseOrderTableService knows only about two columns: 'Customer name' and 'Order number'.
I imagine this is probably a pretty common problem, so what can I do to avoid reinventing the wheel?
I can of course make CustomerService generate a 'CustomerUpdatedMessage' for every existing customer, which would force PurchaseOrderTableService to update all its projections, but that seems like a workaround.
If it matters, the stack I had in mind is Java, Spring, Kafka, PostgreSQL.
Divide the problem in two:
Keeping live data in sync: your projection service from now on also needs to persist Customer Country, so all new orders will have the country as expected.
Backfill the older orders: this is a one-off operation, so how you implement it really depends on your organization, technologies, etc. For example, you or a DBA can use whatever database tools you have to extract the data from the source database and do a bulk update to the target database. In other cases, you might have to solve it programmatically, for example by creating a process in the projection microservice that queries the Customer microservice's API to get the data and update the local copy.
Also note that in most cases you will already have a process to backfill data, because the need for the projection microservice might arrive months or years after the orders and customers services were created. Other times, the search service is a third-party search engine like Elasticsearch instead of a database. In those cases, I would always keep at hand a process to fully reindex the data.
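A rough sketch of that programmatic backfill option, assuming the projection service owns a purchase_order_table in Postgres and can call a Customer API over HTTP (the endpoint, CustomerDto, and column names are my assumptions, not anything from the question):

```java
import java.util.List;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Component;
import org.springframework.web.client.RestTemplate;

// Sketch of a one-off backfill job in the projection microservice.
// The Customer API endpoint, CustomerDto, and table/column names are assumptions.
@Component
public class CustomerCountryBackfill {

    private final JdbcTemplate jdbcTemplate;
    private final RestTemplate restTemplate;

    public CustomerCountryBackfill(JdbcTemplate jdbcTemplate, RestTemplate restTemplate) {
        this.jdbcTemplate = jdbcTemplate;
        this.restTemplate = restTemplate;
    }

    public void run() {
        // Find every customer referenced by the projection that has no country yet.
        List<String> customerIds = jdbcTemplate.queryForList(
            "SELECT DISTINCT customer_id FROM purchase_order_table WHERE customer_country IS NULL",
            String.class);

        for (String customerId : customerIds) {
            // Ask the Customer microservice for the missing attribute...
            CustomerDto customer = restTemplate.getForObject(
                "http://customer-service/customers/{id}", CustomerDto.class, customerId);
            // ...and update the local copy.
            jdbcTemplate.update(
                "UPDATE purchase_order_table SET customer_country = ? WHERE customer_id = ?",
                customer.getCountry(), customerId);
        }
    }
}
```

Running something like this once, after the live-sync change is deployed, closes the gap; from then on the regular update messages keep the new column current.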

Kafka Connect - connector per event type

I'm using Kafka to transfer application events to a historical SQL database. The events are structured differently depending on their type, e.g. OrderEvent and ProductEvent, and they are related by Order.productId = Product.id. I want to store these events in separate SQL tables. I came up with two approaches to transfer this data, but each has a technical problem.
Topic per event type - this approach is easy to configure, but ordering of events is not guaranteed across multiple topics, so there may be a problem when the product does not exist yet at the moment the order is consumed. This might be solved with foreign keys in the database, so the consumer of the order topic fails until the product is available in the database.
One topic with multiple event types - using the Schema Registry it is possible to store multiple event types in one topic. Events are now properly ordered, but I am stuck on the JDBC connector configuration. I haven't found any way to set the SQL table name depending on the event type. Is it possible to configure a connector per event type?
Is the first approach with foreign keys correct? Is it possible to configure a connector per event type in the second approach? Or maybe there is another solution?
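For reference, in the topic-per-event-type approach each JDBC sink connector maps one topic to one table, so the table-routing problem disappears. A minimal sketch of one such connector configuration (connection details and topic/table names are placeholders, not taken from the question):

```properties
# Sketch: one JDBC sink connector per event type (topic-per-event-type approach).
# Connection details and topic/table names are placeholders.
name=order-events-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1

# This connector consumes exactly one event-type topic...
topics=order-events
# ...and writes it to exactly one table (defaults to the topic name if omitted).
table.name.format=orders

connection.url=jdbc:postgresql://localhost:5432/history
connection.user=history
connection.password=history
auto.create=true
```

A second connector with topics=product-events and table.name.format=products would cover the other event type. The cross-topic ordering concern from the first approach still applies, though; the connector itself does not resolve the foreign-key dependency between orders and products.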

What should be projection primary key on query side - CQRS, Event Sourcing, Microservices

I have one thing that confuses me.
I have 2 microservices.
One creates commands and the other consumes commands and produces events (events are stored in the Event Store).
In my example, aggregates have a Guid as the entity ID, and the Guid is created when the aggregate is created.
The thing that confuses me is: should that key (generated on the write side) be transferred via the event to the query side (the microservice that created the command)?
Or should the query side (projection) have a separate ID in the read DB?
Or maybe I should generate some shared key?
What is the best solution here?
I think it all depends on your setup.
If you are doing CQRS and you have a separate read service (within the same bounded context), then it is up to the read-side service to model the data as it wishes, either reusing the same keys or not.
If you are communicating between two different services (separate bounded contexts), then I recommend you create new primary keys in the receiving service and use the incoming key as a foreign key, just as you would with relationships between two tables in a SQL database.
I think this depends on your requirements. Is there a specific reason to have different keys?
Given that you are using Guids as your PK, it seems simplest to reuse the PKs assigned by the write side.
Some reasons you might want to keep the keys consistent:
During command processing, an ID was returned to the client, which they may have cached and should reasonably expect to be able to use when querying the read side.
If your write-side data is long-lived and there is a bug in your read-side output, it is going to be much easier to debug what went wrong if your keys are consistent on the write and read sides.
Entities on the write side will use the write-side Guid PK of another entity as their FK. When you emit an event for this new dependent entity, you would want the read side to be able to build the relationship back to the principal.
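A minimal sketch of the "reuse the write-side key" option on the read side, in Java (the event shape and the order_summary table are assumptions made for illustration):

```java
import java.util.UUID;
import org.springframework.jdbc.core.JdbcTemplate;

// Sketch of a read-side projection that reuses the write-side Guid as its primary key.
// OrderCreatedEvent and the order_summary table/columns are assumptions.
public class OrderSummaryProjection {

    private final JdbcTemplate jdbcTemplate;

    public OrderSummaryProjection(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public void on(OrderCreatedEvent event) {
        UUID orderId = event.getOrderId();       // Guid assigned by the write side, reused as PK
        UUID customerId = event.getCustomerId(); // write-side Guid of the principal, kept as FK
        jdbcTemplate.update(
            "INSERT INTO order_summary (order_id, customer_id, total) VALUES (?, ?, ?)",
            orderId, customerId, event.getTotal());
    }
}
```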
This is kind of an odd question.
Your primary key on a projection could literally be anything or you might not even have one.
There is no "correct answer" for this question ... It depends entirely on the projection.
What if my projection were, say, just a flattening-out of information associated with an aggregate? As an example, we have an "order" and we make a row per order showing summary information about that order. Using an "OrderId" here would seemingly make sense as my primary key.
What if my projection were building out counts of orders by product? Well, then using a "ProductItemId" would make a lot more sense.
What if, in either of these cases, the IDs themselves ("OrderId" and "ProductItemId") could change? Well, then using another key might make a lot of sense.
What if this is an append-only table? I might not even want to have a key.
Again, there is no single correct answer here; there are many situations that you may run into.

How to create streams in Apache Storm based on Oracle or MySQL table polling?

I need to poll a table and create a tuple for each row to stream in Apache Storm.
Where could I find an example?
Based on your requirements, this doesn't have much to do with Storm; it is a database-related question.
Since you haven't given any info about the database you use, the table structure and so on, I'll sketch some rough steps:
Suppose the table has a last-updated timestamp or an incrementing ID that can be used as a marker to pull data; take the ID, for example.
1) Execute the SQL select * from mytable where id > ${last retrieved id} order by id limit 100 every 100 ms. ${last retrieved id} is -1 initially.
2) Iterate over the result set and emit tuples.
3) Update ${last retrieved id} with the last record's ID.
(Note that if you use the last-updated timestamp instead, it works slightly differently, because different records can share the same timestamp.)
Hope this helps.
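A rough sketch of those steps as a Storm spout, using the Storm 2.x API (the JDBC URL, table, and column names are placeholders):

```java
import java.sql.*;
import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

// Sketch of a polling spout following the steps above.
// Table and column names (mytable, id, payload) and the JDBC URL are assumptions.
public class TablePollingSpout extends BaseRichSpout {

    private transient SpoutOutputCollector collector;
    private transient Connection connection;
    private long lastRetrievedId = -1L; // step 1: marker starts at -1

    @Override
    public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        try {
            connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/mydb", "user", "password");
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void nextTuple() {
        try (PreparedStatement ps = connection.prepareStatement(
                "select id, payload from mytable where id > ? order by id limit 100")) {
            ps.setLong(1, lastRetrievedId);
            try (ResultSet rs = ps.executeQuery()) {
                boolean emitted = false;
                while (rs.next()) {
                    long id = rs.getLong("id");
                    // step 2: emit a tuple per row; the row id doubles as the message id
                    collector.emit(new Values(id, rs.getString("payload")), id);
                    lastRetrievedId = id; // step 3: remember the last record's id
                    emitted = true;
                }
                if (!emitted) {
                    Utils.sleep(100); // nothing new: back off before polling again
                }
            }
        } catch (SQLException e) {
            collector.reportError(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("id", "payload"));
    }
}
```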
We have a Storm MySQL spout that caters to your requirements. It tails the binlogs to generate the tuples.
https://github.com/flipkart-incubator/storm-mysql
You can use the table filters to listen to binlog events of only the table you are interested in. So whenever an insert/delete/update is done on the table, it generates a tuple.
The spout also gives you "at least once"/"at most once" guarantees. Since it stores the binlog offsets in ZooKeeper, in the event of a crash it can recover from where it left off. There is no need for any polling.
Disclaimer: Author of the aforementioned spout

How to increment the ID value in a Cassandra table automatically?

I have a challenge when inserting values into a Cassandra table. I have a column named "ID", and I want this ID column's values to increase automatically, like a MySQL auto_increment column. I think the counter data type is not suitable in this scenario. Can anyone please help me design the schema? I also don't want to use UUIDs to replace the ID column.
In short, I don't believe it is possible. The nature of Cassandra is that it does not do a read before a write. There is one exception, lightweight transactions, but all they do is what's called "compare and swap"; there is no way auto-increment can be implemented on the server side.
Even with counters you won't be able to achieve the desired result if you increase the counter every time you add a record to the table, because you will not know whether the current value (even if it is totally consistent) is the result of an increment from your process or from a concurrent process.
The only way is to implement this mechanism on the application side.
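For completeness, a minimal sketch of such an application-side allocator built on the compare-and-swap (lightweight transaction) mechanism mentioned above, using the DataStax Java driver 4.x. The ids table (name text PRIMARY KEY, value bigint) is my own assumption, and the retry loop makes this slow under contention, so treat it as an illustration only:

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;

// Sketch of an application-side ID allocator using lightweight transactions.
// Assumes a table: CREATE TABLE ids (name text PRIMARY KEY, value bigint);
public class CassandraIdAllocator {

    private final CqlSession session;

    public CassandraIdAllocator(CqlSession session) {
        this.session = session;
    }

    public long nextId(String sequenceName) {
        while (true) {
            // Read the current value of the sequence.
            Row row = session.execute(
                "SELECT value FROM ids WHERE name = ?", sequenceName).one();
            long current = (row == null) ? 0L : row.getLong("value");
            long next = current + 1;

            // Compare-and-swap: only succeeds if nobody else bumped the value in between.
            ResultSet rs = (row == null)
                ? session.execute("INSERT INTO ids (name, value) VALUES (?, ?) IF NOT EXISTS",
                                  sequenceName, next)
                : session.execute("UPDATE ids SET value = ? WHERE name = ? IF value = ?",
                                  next, sequenceName, current);
            if (rs.wasApplied()) {
                return next; // we won the race; this ID is ours
            }
            // A concurrent process incremented first: loop and retry with the fresh value.
        }
    }
}
```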
