I'm using confluent platform 3.3 to pull data from Oracle database. Once the data has been pushed to kafka server the retrieved data should be deleted in the database.
Are there any way to do it ? Please suggest.
There is no default way of doing this with Kafka.
How are you reading your data from the database, using Kafka Connect, or with custom code that you wrote?
If the latter is the case I'd suggest implementing the delete in your code, collect ids once Kafka has confirmed send and batch delete regularly.
Alternatively you could write a small job that reads your Kafka topic with a different consumer group than your actual target system and deletes based on the records it pulls from the topic. If you run this job every few minutes, hours,... you can keep up with the sent data as well.
Related
I have Debezium in a container, capturing all changes of PostgeSQL database records. In addition a have a Kafka container to store the topic messages. At last I have a JDBC container to write all changes to another database.
These three containers are working as expected, performing snapshots of the old data in the database on specific tables and streaming new changes while there are reflected into the destination database.
I have figure out that during this streaming the PostgreSQL WAL is increasing, to overcome this situation I enabled the following property on the source connector to clear all retrieved logs.
"heartbeat.interval.ms": 1000
Now the PostgreSQL WAL file is getting cleared in every heartbeat as the retrieved changed as flushed. But meanwhile even the changes are committed into the secondary database the kafka topics are remaining with the exact size.
Is there any way or property into sink connector that will force kafka to delete commited messages?
Consumers have no control over topic retention.
You may edit the topic config directly to reduce the retention time, but then your consumer must read the data within that time.
I have micro-services running and when a web user update data in DB using micro-service end-points, I want to send updated data to NiFi also. This data contains updated list of names, deleted names, edited names etc. How to do it? which processor I have to use from NiFi side?
I am new to NiFi. I am yet to try anything from my side. I am reading google documents which can guide me.
No source code is written. I want to start it. But I will share here once I write it.
Expected result is NiFi should get updated list of names and NiFi should refer updated list for generating required alerts/triggers etc.
You can actually do it in lots of ways. MQ, Kafka, HTTP(usinh ListenHTTP). Just deploy the relevant one to you and configure it, even listen to a directory(using ListFile & FetchFile).
You can connect NiFi to pretty much everything, so just choose how you want to connect your micro services to NiFi.
Currently we are making a memory cache mechanism implementation in search functionalities.
Now the data is getting very large and we are not able to handle it in memory. Also we are getting more input source from different systems (oracle, flat file, and git).
Can you please share me how can we achieve this process?
We thought ES will help on this. But how can we provide input if any changes happen in end source? (batch processing will NOT help)
Hadoop - Not that level of data we are NOT handling
Also share your thoughts.
we are getting more input source from different systems (oracle, flat file, and git)
I assume that's why you tagged Kafka? It'll work, but you bring up a valid point
But how can we provide input if any changes happen...?
For plain text, or Git events, you'll obviously need to alter some parser engine and restart the job to get extra data in the message schema.
For Oracle, the GoldenGate product will publish table column changes, and Kafka Connect can recognize those events and update the payload accordingly.
If all you care about is searching things, plenty of tools exist, but you mention Elasticsearch, so using Filebeat works for plaintext, and Logstash can work with various other types of input sources. If you have Kafka, then feed events to Kafka, let Logstash or Kafka Connect update ES
I have a requirement to read huge CSV file from Kafka topic to Cassandra. I configured Apache Nifi to achieve the same.
Flow:
User does not have a control on Nifi setup. He only specifies the URL where the CSV is located. The web application writes the URL into kafka topic. Nifi fetches the file and inserts into Cassandra.
How will I know that Nifi has inserted all the rows from the CSV file into Cassandra? I need to let the user know that inserting is done.
Any help would be appreciated.
I found the solution.
Using MergeContent processor, all FlowFiles with the same value for "fragment.identifier" will be grouped together. Once MergeContent has defragmented them, we can notify the user.
We have a project requirement of testing the data at Kafka Layer. So JSON files are moving into hadoop area and kafka is reading the live data in hadoop(Raw Json File). Now I have to test whether the data sent from the other system and read by kafka should be same.
Can i validate the data at kafka?. Does kafka store the messages internally on HDFS?. If yes then is it stored in a file structure similar to what hive saves internally just like a single folder for single table.
Kafka stores data in local files (ie, local file system for each running broker). For those files, Kafka uses its own storage format that is based on a partitioned append-only log abstraction.
The local storage directory, can be configured via parameter log.dir. This configuration happens individually for each broker, ie, each broker can use a different location. The default value is /tmp/kafka-logs.
The Kafka community is also working on tiered-storage, that will allow brokers to no only use local disks, but to offload "cold data" into a second tier: https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage
Furthermore, each topic has multiple partitions. How partitions are distributed, is a Kafka internal implementation detail. Thus you should now rely on it. To get the current state of your cluster, you can request meta data about topics and partitions etc. (see https://cwiki.apache.org/confluence/display/KAFKA/Finding+Topic+and+Partition+Leader for an code example). Also keep in mind, that partitions are replicated and if you write, you always need to write to the partition leader (if you create a KafkaProducer is will automatically find the leader for each partition you write to).
For further information, browse https://cwiki.apache.org/confluence/display/KAFKA/Index
I think you can, but you have to do that manually. You can let kafka sink whatever output to HDFS. Maybe my answer is a bit late and this 'confluent' reference appeared after that, but briefly one can do the followings:
Assuming you have all servers are running (check the confluent
website)
Create your connector:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics='your topic'
hdfs.url=hdfs://localhost:9000
flush.size=3
Note: The approach assumes that you are using their platform
(confluent platform) which I haven't use.
Fire the kafka-hdfs streamer.
Also you might find more useful details in this Stack Overflow discussion.
This happens with most of the beginner. Let's first understand that component you see in Big Data processing may not be at all related to Hadoop.
Yarn, MapReduce, HDFS are 3 main core component of Hadoop. Hive, Pig, OOOZIE, SQOOP, HBase etc work on top of Hadoop.
Frameworks like Kafka or Spark are not dependent on Hadoop, they are independent entities. Spark supports Hadoop, like Yarn, can be used for Spark's Cluster mode, HDFS for storage.
Same way Kafka as an independent entity, can work with Spark. It stores its messages in the local file system.
log.dirs=/tmp/kafka-logs
You can check this at $KAFKA_HOME/config/server.properties
Hope this helps.