Kafka Connect API SourceRecord to SinkRecord transformation

I'm using the Debezium embedded connector to listen to changes in a database. It gives me a ChangeEvent<SourceRecord, SourceRecord> object.
I want to further use the Confluent plugin KCBQ, which uses SinkRecord, to put data into BigQuery. But I'm not able to figure out how to join these two pieces.
Eventually, how do I ensure updates, deletes and schema changes from MySQL are propagated to BigQuery from embedded Debezium?

You will possibly have to use a single message transform if you have to do any custom transforms. However, for this scenario, since this seems to be a commonly used transform, the extract new state transform seems to accomplish this. It may be worth having a look and trying something similar:
https://issues.redhat.com/browse/DBZ-226
https://issues.redhat.com/browse/DBZ-1896
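For illustration, a rough sketch of how the two pieces could be glued together, assuming the DebeziumEngine API with the Kafka Connect record format: apply the ExtractNewRecordState transform programmatically to flatten the change-event envelope, then re-wrap the result as a SinkRecord for the KCBQ sink. The class name and the exact transform options shown are assumptions, not the only way to do this.

import io.debezium.transforms.ExtractNewRecordState;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.source.SourceRecord;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; the wiring of the engine and of the BigQuery sink task is up to you.
public class ChangeEventToSinkRecord {

    private final ExtractNewRecordState<SourceRecord> unwrap = new ExtractNewRecordState<>();

    public ChangeEventToSinkRecord() {
        Map<String, String> smtConfig = new HashMap<>();
        smtConfig.put("drop.tombstones", "false");
        smtConfig.put("delete.handling.mode", "rewrite"); // keep deletes visible as rows flagged __deleted = true
        unwrap.configure(smtConfig);
    }

    // Call this from the DebeziumEngine notifying(...) callback with changeEvent.value().
    public SinkRecord convert(SourceRecord raw) {
        SourceRecord source = unwrap.apply(raw); // flattens the Debezium before/after envelope
        if (source == null) {
            return null; // record was dropped by the transform (e.g. a tombstone)
        }
        return new SinkRecord(
                source.topic(),
                source.kafkaPartition() == null ? 0 : source.kafkaPartition(),
                source.keySchema(), source.key(),
                source.valueSchema(), source.value(),
                0L); // no real Kafka offset exists outside a broker-backed pipeline
    }
}

The resulting SinkRecord can then be handed to the sink task you manage yourself, e.g. via SinkTask#put(Collection<SinkRecord>).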

Related

Getting transformation configuration from custom processor: Nifi

I'm trying to test some functionality for NiFi. The data I pulled from the database consists of specific columns, say "id". I need to use NiFi to transform the column name to "customer_id". I understand this is an easy job using something like JOLT. But my problem is that I need to pull these configurations or rules from somewhere else, let's say another database or some other place. I don't want to hard-code the column names in the JOLT transform, but instead get them from some other location. Is there any best practice or best way of doing this? Will I have to write a custom processor for this, and if so, what is the best place to start for writing custom processors?
There are many different ways to do transforms besides JOLT - it is worth researching the use of Records and Schemas in NiFi.
But on to your problem - you could use LookupRecord with LookupServices to pull the configurations; for example, you could pull them out of a database or from a REST endpoint. There are many LookupServices - read the LookupRecord docs page for a list of them.

Kafka Connect JDBC: Is it possible to load a table in bulk mode, but only if any record in the table has changed?

I have a situation where I need to load a dimension table into Kafka.
This is just because I want to expose all my application data through Kafka, as a common way across all company departments/products.
But my dimension is correct only as a snapshot; it is impossible to process it in incremental mode. With a Kafka Streams job I add a "batch_id" (the timestamp of the load operation). I know that this is a HACK, but it works fine for me because I only want to stream the fact tables, which are very, very big, and I also don't want to have two different ways to expose data.
So now I have the ability to process my dimensions as a stream with a logical window by "batch_id".
But now I need to load the dimension at a time interval (e.g. 30 secs). My dimensions' add/update/delete rate is very low. Some dimensions are not updated for quarters.
So my question is whether it is possible to use bulk mode with some condition.
For example, only if any record in the table has a changed "update_datetime" column? Is it possible to mix bulk + timestamp mode?
As @cricket_007 explained in his comment, there is no such functionality.
So there are two ways to resolve this issue:
write a custom puller, or write a custom plugin for Kafka Connect.
I took the first way, because I use k8s, which is very comfortable for maintaining a lot of different services, and a separate service is much easier to monitor.
But if you don't have a comfortable infrastructure for microservices (with resource negotiation, service discovery, automated CI/CD, etc.), I recommend writing a custom plugin for Kafka Connect.
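For illustration, here is a minimal sketch of the custom-puller approach; the table, columns, topic and connection details are made-up placeholders. It polls MAX(update_datetime) and, only when it moves, re-reads the whole dimension and publishes a fresh snapshot tagged with a batch_id:

import java.sql.*;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DimensionSnapshotPuller {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             Connection conn = DriverManager.getConnection("jdbc:mysql://db/warehouse", "user", "pass")) {

            Timestamp lastSeen = Timestamp.from(java.time.Instant.EPOCH);
            while (true) {
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery("SELECT MAX(update_datetime) FROM dim_customer")) {
                    rs.next();
                    Timestamp latest = rs.getTimestamp(1);
                    if (latest != null && latest.after(lastSeen)) {
                        // Something changed: publish a full snapshot under a new batch_id.
                        String batchId = String.valueOf(System.currentTimeMillis());
                        try (Statement snap = conn.createStatement();
                             ResultSet rows = snap.executeQuery("SELECT id, name FROM dim_customer")) {
                            while (rows.next()) {
                                String value = batchId + "," + rows.getString("id") + "," + rows.getString("name");
                                producer.send(new ProducerRecord<>("dim_customer_topic", rows.getString("id"), value));
                            }
                        }
                        producer.flush();
                        lastSeen = latest;
                    }
                }
                Thread.sleep(30_000); // the 30-second interval mentioned above
            }
        }
    }
}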

Apache Nifi - Federated Search

My team has been thrown into the deep end and has been asked to build a federated search of customers over a variety of large datasets which hold varying degrees of differing data about each individual (and no matching identifiers), and I was wondering how to go about implementing it.
I was thinking Apache NiFi would be a good fit to query our various databases, merge the results, deduplicate the entries via an external tool and then push the result into a database, which is then queried for use in an Elasticsearch instance for the application's use.
So, roughly speaking, something like this:
For example's sake, the following data then exists in the result database after the first flow:

Then running https://github.com/dedupeio/dedupe over this database table which will add cluster ids to aid the record linkage, e.g.:-

The second flow would then query the result database and feed the result into the Elasticsearch instance for use by the application's API, which would use the cluster id to link the duplicates.
A couple of questions:
How would I trigger dedupe to run once the merged content has been pushed to the database?
The corollary question - how would the second flow know when to fetch results for pushing into Elasticsearch? Periodic polling?
I also haven't considered any CDC process here, as the databases will be getting constantly updated, which I'd need to handle, so I'm really interested if anybody has solved a similar problem or used a different approach (happy to consider other technologies too).
Thanks!
For de-duplicating...
You will probably need to write a custom processor, or use ExecuteScript. Since it looks like a Python library, I'm guessing writing a script for ExecuteScript, unless there is a Java library.
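If you do go the custom-processor route, a bare-bones skeleton might look like the sketch below; the class name and the idea of shelling out to a Python dedupe script are purely illustrative assumptions:

import java.util.Set;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

public class RunDedupeProcessor extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success").description("Dedupe run completed").build();
    static final Relationship REL_FAILURE = new Relationship.Builder()
            .name("failure").description("Dedupe run failed").build();

    @Override
    public Set<Relationship> getRelationships() {
        return Set.of(REL_SUCCESS, REL_FAILURE);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        try {
            // Placeholder: invoke the external dedupe step (e.g. the Python script) against the result table.
            Process dedupe = new ProcessBuilder("python", "run_dedupe.py").inheritIO().start();
            session.transfer(flowFile, dedupe.waitFor() == 0 ? REL_SUCCESS : REL_FAILURE);
        } catch (Exception e) {
            getLogger().error("Dedupe run failed", e);
            session.transfer(flowFile, REL_FAILURE);
        }
    }
}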
For triggering the second flow...
Do you need that intermediate DB table for something else?
If you do need it, then you can send the success relationship of PutDatabaseRecord as the input to the follow-on ExecuteSQL.
If you don't need it, then you can just go MergeContent -> Dedupe -> ElasticSearch.

Joining separate topics with Kafka Streams?

In my current project we have created a data pipeline using Kafka, Kafka Connect and Elasticsearch. The data ends up on a topic "signal-topic" and is of the form
KeyValue<id:String, obj:Signal>
Now I'm trying to introduce Kafka Streams to be able to do some processing of the data on its way from Kafka to Elasticsearch.
My first goal is to be able to enrich the data with different kinds of side information. A typical scenario would be to attach another field to the data based on some information already existing in the data. For instance, the data contains a "rawevent" field, and based on that I want to add an "event-description" and then output to a different topic.
What would be the "correct" way of implementing this?
I was thinking of maybe having the side data on a separate topic in Kafka
KeyValue<rawEvent:String, eventDesc:String>
and having Streams join the two topics, but I'm not sure how to accomplish that.
Would this be possible? All examples that I've come across seem to require that the keys of the data sources be the same, and since mine aren't, I'm not sure it's possible.
If anyone have a snippet for how this could be done it would be great.
Thanks in advance.
You have two possibilities:
You can extract rawEvent from Signal and set it as the new key to do the join against a KTable<rawEvent:String, eventDesc:String>. Something like KStream#selectKey(...)#join(KTable...).
You can do a KStream-GlobalKTable join: this allows you to extract a non-key join attribute from the KStream (in your case rawEvent) that is used to do a GlobalKTable lookup to compute the join.
Note that the two joins provide different semantics: a KStream-KTable join is synchronized on time, while a KStream-GlobalKTable join is not. Check out this blog post for more details: https://www.confluent.io/blog/crossing-streams-joins-apache-kafka/
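For illustration, a minimal sketch of the second option (the KStream-GlobalKTable join), assuming the side data is keyed by rawEvent on a topic named event-desc-topic, that Signal exposes a getRawEvent() accessor, and that EnrichedSignal is a made-up wrapper type:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

public class SignalEnrichmentTopology {

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // signal-topic: KeyValue<id:String, obj:Signal> (assumes suitable default serdes are configured)
        KStream<String, Signal> signals = builder.stream("signal-topic");

        // side data: KeyValue<rawEvent:String, eventDesc:String>
        GlobalKTable<String, String> eventDescriptions = builder.globalTable("event-desc-topic");

        KStream<String, EnrichedSignal> enriched = signals.join(
                eventDescriptions,
                (id, signal) -> signal.getRawEvent(),                          // lookup key comes from the value, not the record key
                (signal, eventDesc) -> new EnrichedSignal(signal, eventDesc)); // attach the description

        enriched.to("enriched-signal-topic");
        return builder.build();
    }
}

With the first option you would instead selectKey to rawEvent and join against a regular KTable; that triggers a repartition of the signal stream, but gives you the time-synchronized semantics mentioned above.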

BIRT Scripted Data Source using existing JDBC DataSource

I know that my overall problem is generally approached using two of the more common solutions such as a join data set or a sub-table, sub-report. I have looked at those and I am not sure this will work effectively.
Background:
The JDBC data source has local data which includes a series of ids that reference a record in a master data repository interfaced via a web service. This is where the need for a scripted data source arises. The data can be filtered on attributes within the local JDBC data and/or the extended data from the web service. The complication is that my only interface is the id argument to the web service.
Ideal Solution:
Aside from creating a reporting table or other truly desirable scenarios, I am looking at creating a unified data source through a single scripted data source that will handle all the complexities. This leaves the report generation and parameter creation a bit cleaner, hopefully. The idea is to leverage the JDBC query as well as the web service queries in the scripted data source, do the filtering and joins there, and create that single unified view.
I tried using the following code as a reference for using the existing JDBC connection in the BIRT report definition to execute the query. However, I think my breakdown of what should go in open vs. fetch may be giving me errors, given that the reference came from beforeFactory for a completely different purpose... truth is, I see no errors; it just returns 0 records.
a link
I have also found a code snippet to dynamically load a JDBC connection, but that seems a bit obtuse and a ton of overhead for what I need to do. a link
In short: how, in all that is holy, do you simply run a query against a database within a scripted data source, if you wanted to? The merit of doing that is another issue, but technically, how?
Thanks in Advance!
