In my current project we have created a data pipeline using Kafka, Kafka Connect and Elasticsearch. The data ends up on a topic "signal-topic" and is of the form
KeyValue<id:String, obj:Signal>
Now I'm trying to introduce Kafka Streams to be able to do some processing of the data on its way from Kafka to Elasticsearch.
My first goal is to be able to enhance the data with different kinds of side information. A typical scenario would be to attach another field to the data based on some information already existing in the data. For instance, the data contains a "rawevent" field, and based on that I want to add an "event-description" and then output to a different topic.
What would be the "correct" way of implementing this?
I was thinking of maybe having the side data on a separate topic in Kafka
KeyValue<rawEvent:String, eventDesc:String>
and having Streams join the two topics, but I'm not sure how to accomplish that.
Would this be possible? All the examples that I've come across seem to require that the keys of the data sources be the same, and since mine aren't, I'm not sure it's possible.
If anyone has a snippet for how this could be done, that would be great.
Thanks in advance.
You have two possibilities:
You can extract rawEvent from Signal and set it as the new key to do the join against a KTable<rawEvent:String, eventDesc:String>. Something like KStream#selectKey(...)#join(KTable...)
You can do a KStream-GlobalKTable join: this allows you to extract a non-key join attribute from the KStream (in your case rawEvent) that is used to do a GlobalKTable lookup to compute the join.
Note that the two joins provide different semantics: a KStream-KTable join is synchronized on time, while a KStream-GlobalKTable join is not. Check out this blog post for more details: https://www.confluent.io/blog/crossing-streams-joins-apache-kafka/
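A rough, untested sketch of both options might look like the following. Signal, its serde, and the getRawEvent()/withEventDesc() accessors stand in for whatever your Signal class actually provides, and the "event-descriptions"/"enriched-signals" topic names are made up:

import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.Joined;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class SignalEnrichment {

    // Option 1: re-key the signal stream by rawEvent, then join against a KTable
    // of the side data. selectKey() marks the stream for repartitioning so the
    // join runs on co-partitioned keys; note the output is then keyed by rawEvent.
    public static Topology rekeyAndJoin(Serde<Signal> signalSerde) {
        StreamsBuilder builder = new StreamsBuilder();

        // Side-data topic: key = rawEvent, value = eventDesc
        KTable<String, String> eventDescs = builder.table(
                "event-descriptions", Consumed.with(Serdes.String(), Serdes.String()));

        builder.stream("signal-topic", Consumed.with(Serdes.String(), signalSerde))
               .selectKey((id, signal) -> signal.getRawEvent())
               .join(eventDescs,
                     (signal, desc) -> signal.withEventDesc(desc),
                     Joined.with(Serdes.String(), signalSerde, Serdes.String()))
               .to("enriched-signals", Produced.with(Serdes.String(), signalSerde));

        return builder.build();
    }

    // Option 2: KStream-GlobalKTable join. The second join argument derives the
    // lookup key (rawEvent) per record, so the original signal id stays the key.
    public static Topology globalTableJoin(Serde<Signal> signalSerde) {
        StreamsBuilder builder = new StreamsBuilder();

        GlobalKTable<String, String> eventDescs = builder.globalTable(
                "event-descriptions", Consumed.with(Serdes.String(), Serdes.String()));

        builder.stream("signal-topic", Consumed.with(Serdes.String(), signalSerde))
               .join(eventDescs,
                     (id, signal) -> signal.getRawEvent(),          // maps each record to the lookup key
                     (signal, desc) -> signal.withEventDesc(desc))  // merges the description into the signal
               .to("enriched-signals", Produced.with(Serdes.String(), signalSerde));

        return builder.build();
    }
}

With option 1 the selectKey() causes a repartition of the signal stream and the output records are keyed by rawEvent; with option 2 the side-data topic is replicated in full to every application instance, which is usually fine for small lookup tables.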
Related
I'd like to understand the difference between KTable and ksqlDB. I need two data flows from my "states" topic:
Actual snapshot of a state as key-value store
Subscription to events of state data changes
For the first case I could create a compacted topic and use a KTable as a key-value store that receives the updates. For the second case I would use a consumer to subscribe to the state change events.
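For concreteness, a minimal (untested) sketch of what I mean for the first case, assuming String keys and values on the "states" topic:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

import java.util.Properties;

public class StateSnapshotApp {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Case 1: materialize the compacted "states" topic as a queryable key-value store.
        builder.table("states",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.as("states-store"));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "state-snapshot");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Interactive query against the snapshot (latest value per key); in practice
        // this has to wait until the instance has reached the RUNNING state.
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType("states-store", QueryableStoreTypes.keyValueStore()));
        System.out.println(store.get("some-state-id"));

        // Case 2 (subscription to every state change) would just be a plain
        // KafkaConsumer (or a KStream) on the same "states" topic.
    }
}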
Is it possible to use ksqlDB for those cases? What is the difference?
Yes, it is possible to use ksqlDB in both cases. ksqlDB benefits from robust database features and simplifies software development with an interface as familiar as a relational database. For a comprehensive comparison you can read this article.
I'm using the Debezium embedded connector to listen to changes in a database. It gives me a ChangeEvent<SourceRecord,SourceRecord> object.
I want to further use the Confluent plugin KCBQ, which uses SinkRecord to put data into BigQuery. But I'm not able to figure out how to join these two pieces.
Eventually, how do I ensure updates, deletes and schema changes from MySQL are propagated to BigQuery from embedded Debezium?
You will possibly have to use a single message transform (SMT) if you need any custom transforms. However, for this scenario, since this seems to be a commonly needed transformation, the extract-new-state transform seems to accomplish it. It may be worth having a look and trying something similar (there's a sketch after the links below):
https://issues.redhat.com/browse/DBZ-226
https://issues.redhat.com/browse/DBZ-1896
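A minimal sketch of how this could look with the embedded engine: the SMT option names below are the documented ExtractNewRecordState settings (verify them against your Debezium version), and the SinkRecord conversion with its placeholder partition/offset is just one possible way to bridge over to a Connect sink task such as KCBQ, not the definitive wiring:

import io.debezium.engine.ChangeEvent;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.source.SourceRecord;

import java.util.Properties;

public class ChangeEventBridge {

    // Engine configuration fragment: ExtractNewRecordState flattens the Debezium
    // change envelope so each event carries the row's latest state.
    static Properties unwrapConfig() {
        Properties props = new Properties();
        props.setProperty("transforms", "unwrap");
        props.setProperty("transforms.unwrap.type", "io.debezium.transforms.ExtractNewRecordState");
        props.setProperty("transforms.unwrap.drop.tombstones", "false");        // keep tombstones for deletes
        props.setProperty("transforms.unwrap.delete.handling.mode", "rewrite"); // mark deletes instead of dropping them
        return props;
    }

    // Adapt the embedded engine's ChangeEvent<SourceRecord, SourceRecord> into the
    // SinkRecord shape a Connect sink task (such as KCBQ) consumes in put().
    // Partition 0 and the supplied offset are stand-ins; the sink mostly cares
    // about topic, schemas and values.
    static SinkRecord toSinkRecord(ChangeEvent<SourceRecord, SourceRecord> event, long offset) {
        SourceRecord source = event.value();
        return new SinkRecord(
                source.topic(), 0,
                source.keySchema(), source.key(),
                source.valueSchema(), source.value(),
                offset);
    }
}

Schema evolution on the BigQuery side is a separate concern that the sink's own schema-update settings (or manual handling) would have to cover.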
I'm using Kafka to transfer application events to a SQL historical database. The events are structured differently depending on the type, e.g. OrderEvent and ProductEvent, and they have the relation Order.productId = Product.id. I want to store these events in separate SQL tables. I came up with two approaches to transfer this data, but each has a technical problem.
Topic per event type - this approach is easy to configure, but the order of events is not guaranteed across multiple topics, so there may be a problem when the product doesn't exist yet at the time the order is consumed. This may be solved with foreign keys in the database, so the consumer of the order topic will fail until the product is available in the database.
One topic with multiple event types - using the schema registry it is possible to store multiple event types in one topic. Events are now properly ordered, but I'm stuck with the JDBC connector configuration. I haven't found any solution for how to set the SQL table name depending on the event type. Is it possible to configure a connector per event type?
Is the first approach with foreign keys correct? Is it possible to configure a connector per event type in the second approach? Maybe there is another solution?
I have two transactional tables originating from different databases on different servers. I would like to join them based on a common attribute and store the combined result in a different database.
I have been looking at various options in NiFi to execute this as a job which runs monthly.
So far I have been trying out various options, but they don't seem to work out. For example, I used ExecuteSQL1 & ExecuteSQL2 -> MergeContent -> PutSQL.
Could anyone provide pointers on the same?
NiFi is not really meant to do a streaming join like this. The best option would be to implement the join in the SQL query using a single ExecuteSQL processor.
As Bryan said, NiFi doesn't (currently) do this. Perhaps look at Presto; you can set up multiple connections "under the hood" and use its JDBC driver to do what Bryan described, a join across tables in different DBs.
I'm thinking about adding a JoinTables processor that would let you join two tables using two different DBCPConnectionPool controller services, but there are lots of things to consider, such as being able to do the join in memory. For joining dimensions to fact tables, we could try to load the smaller table into memory and then do more of a streaming join against the larger fact table, for example. Feel free to file a New Feature Jira if you like, and we can discuss it there.
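To illustrate the load-the-smaller-table-into-memory idea outside of NiFi, here's a rough plain-JDBC sketch (connection URLs, table and column names are all hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.Map;

public class TwoDatabaseJoin {

    public static void main(String[] args) throws Exception {
        try (Connection dimDb = DriverManager.getConnection("jdbc:postgresql://server1/db1", "user", "pass");
             Connection factDb = DriverManager.getConnection("jdbc:mysql://server2/db2", "user", "pass");
             Connection targetDb = DriverManager.getConnection("jdbc:postgresql://server3/warehouse", "user", "pass")) {

            // 1. Load the smaller (dimension) table into memory, keyed by the join attribute.
            Map<String, String> dimension = new HashMap<>();
            try (PreparedStatement ps = dimDb.prepareStatement("SELECT join_key, dim_value FROM dim_table");
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    dimension.put(rs.getString("join_key"), rs.getString("dim_value"));
                }
            }

            // 2. Stream the larger (fact) table and probe the in-memory map row by row.
            try (PreparedStatement read = factDb.prepareStatement("SELECT join_key, fact_value FROM fact_table");
                 ResultSet rs = read.executeQuery();
                 PreparedStatement write = targetDb.prepareStatement(
                         "INSERT INTO joined_result (join_key, fact_value, dim_value) VALUES (?, ?, ?)")) {
                while (rs.next()) {
                    String key = rs.getString("join_key");
                    String dimValue = dimension.get(key);
                    if (dimValue == null) {
                        continue; // inner join: skip fact rows without a matching dimension row
                    }
                    write.setString(1, key);
                    write.setString(2, rs.getString("fact_value"));
                    write.setString(3, dimValue);
                    write.addBatch();
                }
                write.executeBatch();
            }
        }
    }
}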
My team has been thrown into the deep end and asked to build a federated search of customers over a variety of large datasets which hold varying degrees of differing data about each individual (and no matching identifiers), and I was wondering how to go about implementing it.
I was thinking Apache NiFi would be a good fit to query our various databases, merge the results, deduplicate the entries via an external tool and then push this result into a database, which is then queried to populate an Elasticsearch instance for the application's use.
So roughly speaking something like this:-
For example's sake, the following data then exists in the result database from the first flow:-

Then I'd run https://github.com/dedupeio/dedupe over this database table, which will add cluster ids to aid the record linkage, e.g.:-

The second flow would then query the result database and feed the results into an Elasticsearch instance for use by the application's API, which would use the cluster id to link the duplicates.
A couple of questions:-
How would I trigger dedupe to run once the merged content has been pushed to the database?
The corollary question - how would the second flow know when to fetch results for pushing into Elasticsearch? Periodic polling?
I also haven't considered any CDC process here, as the databases will be getting constantly updated, which I'd need to handle, so I'm really interested to hear if anybody has solved a similar problem or used a different approach (happy to consider other technologies too).
Thanks!
For de-duplicating...
You will probably need to write a custom processor, or use ExecuteScript. Since it looks like a Python library, I'm guessing writing a script for ExecuteScript, unless there is a Java library.
For triggering the second flow...
Do you need that intermediate DB table for something else?
If you do need it, then you can send the success relationship of PutDatabaseRecord as the input to the follow-on ExecuteSQL.
If you don't need it, then you can just go MergeContent -> Dedupe -> ElasticSearch.