K-Connect Sink - When to kick off downstream process? - apache-kafka-connect

I have a use case where I source data from JDBC and sink it to S3. I want to be able to kick off a downstream process once the sink for a specific piece of data has landed successfully. How can I do this?

Related

Kafka Connect JDBC Source and Sink Connector Part of the Same Plugin

I've created a JDBC Source and Sink Connector to send changes on my Oracle XA database table to Kafka and also to capture the acknowledgement of the messages published to Kafka. Some info regarding the Kafka Connect plugin I've deployed:
They're deployed in distributed mode. This works on a single pod but fails when I increase the pod count to 2.
Both Source & Sink Connectors are created on the same Kafka Connect plugin and run on the same JVM.
The Source Connector's ("connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector") job is to poll the DB table (MY_APP_TABLE) on a timestamp basis with a custom query and send data to Kafka.
For now I've set the batch size to 500 and "tasks.max": "2" for both Source & Sink Connectors, which I'm going to change to handle 300 TPS on the DB table.
The Sink Connector's ("connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector") job is to capture the acknowledgement of messages published to Kafka. Once an event is published to Kafka, this Sink Connector updates an indicator on the DB table (MY_APP_TABLE) that marks the specific record as successfully published to the Kafka topic.
Here are some of the configurations in my connect-distributed.properties:
group.id=app_data_sync
offset.storage.topic=oracle-connector-offsets-storage
offset.storage.replication.factor=3
offset.storage.partitions=50
config.storage.topic=oracle-connector-config-storage
config.storage.replication.factor=3
status.storage.topic=oracle-connector-status-storage
status.storage.replication.factor=3
status.storage.partitions=50
listeners=https://0.0.0.0:8443
rest.advertised.listener=https
rest.advertised.host.name=0.0.0.0
rest.advertised.port=8443
To handle about 300 TPS on my DB table and meet the SLA of 10 minutes to have the DB changes relayed to Kafka (with exactly-once semantics), I'm trying to have at least 2 pods (1 core, 4 GB), both running the same Source & Sink Connectors. I need clarification & suggestions on the following:
Can both Source & Sink Connectors be created in Distributed mode as part of the same Kafka Connect plugin and use the same group.id and offset topic, config topic and status topics?
Is there a better alternative to capture the acknowledgement of the messages published by the Source Connector and have it recorded in the source DB?
Though I've not yet generated the high TPS, functionally this is working for me with a single pod. But I get the error below with two pods: ERROR IO error forwarding REST request: (org.apache.kafka.connect.runtime.rest.RestClient:143) org.apache.kafka.connect.runtime.rest.errors.ConnectRestException: IO Error trying to forward REST request: javax.net.ssl.SSLHandshakeException: General SSLEngine problem. Is that related to the configs rest.advertised.host.name=0.0.0.0 and listeners=https://0.0.0.0:8443 in my connect-distributed.properties?

Transferring data from elasticsearch to influx

I am trying to send data from Elasticsearch to InfluxDB. Is there any way to do this other than writing plugins and configuration files? I am new to both these databases and am trying to understand the overall picture.
Also, am I right in understanding that Kapacitor processes InfluxDB data and then sends it to Kafka for streaming? Or should I stream data using Kapacitor only?
I am trying to learn all these new technologies in as short a time frame as possible, and all the new terminology has got me confused. Thanks for your time and help.
Elasticsearch is a search engine, not a database. InfluxDB is a time series database. I don't understand why you would need to transfer data from search results to a time series database.
Kapacitor can process data in two different ways: in batch mode or in streaming mode. Assume some application is streaming sensor data (or some other time series data) to InfluxDB. You can set Kapacitor to process that data as soon as it is available in InfluxDB by running Kapacitor in streaming mode. Or, if you need to process data from InfluxDB every 2 hours, you can configure that job as a batch job. Once you process the data, Kapacitor can persist it back to InfluxDB. Or, in case you need to stream the data to Kafka, Kapacitor has a Kafka plugin. Please note that Kapacitor has more plugins than the ones I mentioned in my answer.

Delete data in source once data has been pushed to kafka server

I'm using Confluent Platform 3.3 to pull data from an Oracle database. Once the data has been pushed to the Kafka server, the retrieved data should be deleted from the database.
Is there any way to do this? Please suggest.
There is no default way of doing this with Kafka.
How are you reading your data from the database, using Kafka Connect, or with custom code that you wrote?
If it's the latter, I'd suggest implementing the delete in your code: collect IDs once Kafka has confirmed the send and batch-delete them regularly.
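For illustration, a minimal sketch of that idea, assuming the plain Kafka Java producer and a source table MY_TABLE with a numeric ID primary key (topic, table and column names are placeholders, not from the question):

// Sketch: remember primary keys only after Kafka has acknowledged the send,
// then batch-delete them from the source table. All names are placeholders.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class SendThenDelete {
    private final Queue<Long> confirmedIds = new ConcurrentLinkedQueue<>();

    void send(KafkaProducer<String, String> producer, long id, String payload) {
        producer.send(new ProducerRecord<>("my-topic", String.valueOf(id), payload),
                (metadata, exception) -> {
                    if (exception == null) {
                        confirmedIds.add(id);          // only rows Kafka actually acked
                    }
                });
    }

    // Call this periodically, e.g. from a scheduled task.
    void flushDeletes(Connection db) throws SQLException {
        try (PreparedStatement stmt = db.prepareStatement("DELETE FROM MY_TABLE WHERE ID = ?")) {
            Long id;
            while ((id = confirmedIds.poll()) != null) {
                stmt.setLong(1, id);
                stmt.addBatch();
            }
            stmt.executeBatch();
        }
    }
}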
Alternatively, you could write a small job that reads your Kafka topic with a different consumer group than your actual target system and deletes rows based on the records it pulls from the topic. If you run this job every few minutes or hours, you can keep up with the sent data as well.
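And a rough sketch of that clean-up job, again with placeholder names (my-topic, MY_TABLE, the consumer group and the JDBC URL) and assuming the record key carries the row's primary key:

// Sketch: a separate consumer group re-reads the topic and deletes the
// corresponding rows from the source table. All names are placeholders.
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class DeleteConsumedRows {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "source-cleanup");   // not the target system's group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection db = DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/SERVICE", "user", "pass");
             PreparedStatement stmt = db.prepareStatement("DELETE FROM MY_TABLE WHERE ID = ?")) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                for (ConsumerRecord<String, String> record : records) {
                    stmt.setLong(1, Long.parseLong(record.key()));   // assumes the key is the row id
                    stmt.addBatch();
                }
                if (!records.isEmpty()) {
                    stmt.executeBatch();      // delete only what was read from the topic
                    consumer.commitSync();    // commit offsets after the delete succeeds
                }
            }
        }
    }
}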

How to know when a file has been sunk on HDFS using Spring Cloud Dataflow

I'm downloading a file from a source and creating a stream to process it line by line and finally sink it into HDFS.
For that purpose I'm using Spring Cloud Dataflow + Kafka.
Question: is there any way to know when the complete file has been sunk into HDFS to trigger an event?
This type of use case typically falls under task/batch processing as opposed to a streaming pipeline. If you build a file-to-HDFS task (batch-job) application, you could then have a stream listening to the various task events in order to make further downstream decisions or do further data processing.
Please refer to "Subscribing to Task/Batch Events" from the reference guide for more details.
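From memory of that guide, the subscription boils down to creating a stream that consumes the task-events destination; in the Data Flow shell it looks roughly like this (the stream name is arbitrary):

dataflow:> stream create --name task-event-subscriber --definition ":task-events > log" --deploy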

Does Apache Kafka Store the messages internally in HDFS or Some other File system

We have a project requirement of testing the data at the Kafka layer. JSON files are moving into the Hadoop area and Kafka is reading the live data in Hadoop (raw JSON files). Now I have to test whether the data sent from the other system and the data read by Kafka are the same.
Can I validate the data at Kafka? Does Kafka store the messages internally on HDFS? If yes, is it stored in a file structure similar to what Hive saves internally, i.e. a single folder for a single table?
Kafka stores data in local files (i.e., the local file system of each running broker). For those files, Kafka uses its own storage format that is based on a partitioned append-only log abstraction.
The local storage directory can be configured via the parameter log.dir. This configuration happens individually for each broker, i.e., each broker can use a different location. The default value is /tmp/kafka-logs.
The Kafka community is also working on tiered storage, which will allow brokers to not only use local disks but also to offload "cold data" into a second tier: https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage
Furthermore, each topic has multiple partitions. How partitions are distributed is a Kafka internal implementation detail, so you should not rely on it. To get the current state of your cluster, you can request metadata about topics, partitions, etc. (see https://cwiki.apache.org/confluence/display/KAFKA/Finding+Topic+and+Partition+Leader for a code example). Also keep in mind that partitions are replicated and, if you write, you always need to write to the partition leader (if you create a KafkaProducer, it will automatically find the leader for each partition you write to).
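For illustration, the wiki page linked above shows an older low-level approach; with the Java AdminClient the same partition and leader metadata can be fetched along these lines (topic name and bootstrap address are placeholders):

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.Collections;
import java.util.Properties;

public class ShowPartitionLeaders {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Ask the cluster for metadata instead of assuming anything about on-disk layout.
            TopicDescription description = admin.describeTopics(Collections.singletonList("my-topic"))
                    .all().get().get("my-topic");
            for (TopicPartitionInfo partition : description.partitions()) {
                System.out.printf("partition %d -> leader %s, replicas %s%n",
                        partition.partition(), partition.leader(), partition.replicas());
            }
        }
    }
}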
For further information, browse https://cwiki.apache.org/confluence/display/KAFKA/Index
I think you can, but you have to do it manually. You can let Kafka sink whatever output to HDFS. Maybe my answer is a bit late and this 'confluent' reference appeared after that, but briefly, one can do the following:
Assuming you have all servers running (check the Confluent website).
Create your connector:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics='your topic'
hdfs.url=hdfs://localhost:9000
flush.size=3
Note: This approach assumes that you are using their platform (Confluent Platform), which I haven't used.
Fire the kafka-hdfs streamer.
Also you might find more useful details in this Stack Overflow discussion.
This happens with most beginners. Let's first understand that a component you see in Big Data processing may not be related to Hadoop at all.
YARN, MapReduce and HDFS are the 3 main core components of Hadoop. Hive, Pig, Oozie, Sqoop, HBase, etc. work on top of Hadoop.
Frameworks like Kafka or Spark are not dependent on Hadoop; they are independent entities. Spark supports Hadoop: YARN can be used for Spark's cluster mode, and HDFS for storage.
In the same way, Kafka, as an independent entity, can work with Spark. It stores its messages in the local file system.
log.dirs=/tmp/kafka-logs
You can check this at $KAFKA_HOME/config/server.properties
Hope this helps.
