I'm starting out with Kafka and I need to monitor the inserts into a specific Oracle table and send the new records through Kafka as they arrive. I have no control over the database, so, in principle, Debezium is ruled out. How can I do this without using triggers?
I've written a producer in Java (in Eclipse) that reads data from Oracle, but that would make constant requests to the database. I use Java to simulate an ETL with a consumer.
PS: I work with Windows but that's secondary.
If I understand your problem correctly, you are trying to route inserts from Kafka to Oracle Database. There are a few possibilities:
You implement a Kafka consumer, and as soon as your Kafka cluster gets a message, the consumer makes an insert. You could reuse your Java code here; just remove the polling part. Please visit here
If you have Kafka deployed in a cloud environment and are using it as a service (AWS MSK), you have the option of handling the events there. Again, you can use a Java program or write a Python script to make the inserts. Please visit here
I would like to understand your throughput requirements: do you really need Kafka as a distributed messaging system, or would a simple AWS SQS queue work just fine? If you can use SQS, things would be straightforward for you. You create a queue and write a listener in Python or Java.
boto3 is an excellent Python library for working with SQS.
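A minimal sketch of such a listener in Python, assuming `sqs` is a boto3 SQS client (for example from `boto3.client("sqs")`) and that `handle` is your own function that performs the insert. The function name `drain_queue` and the `max_batches` parameter are made up for illustration.

```python
def drain_queue(sqs, queue_url, handle, max_batches=1):
    """Poll an SQS queue and delete each message only after handle() succeeds.

    `sqs` is assumed to be a boto3 SQS client; `handle` is a hypothetical
    callback that does the actual work (e.g. the database insert).
    """
    processed = 0
    for _ in range(max_batches):
        response = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,  # SQS returns at most 10 per call
            WaitTimeSeconds=20,      # long polling instead of busy polling
        )
        for message in response.get("Messages", []):
            handle(message["Body"])
            # Deleting after handling means a crash in handle() leaves the
            # message on the queue, so it is redelivered later.
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=message["ReceiptHandle"],
            )
            processed += 1
    return processed
```

Because the delete happens after `handle`, a message can be delivered more than once, so the insert should tolerate duplicates.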
Is such a situation even possible?
There is an application "XYZ" (which does not use Kafka) that exposes a REST API. It is a Spring Boot application that an Angular application communicates with.
A new Spring Boot application is being created that wants to use Kafka and needs to fetch data from the "XYZ" application, and it wants to do this using Kafka.
The "XYZ" application has an example endpoint [GET] api/message/all which displays all messages.
Is there a way to "connect" Kafka directly to this endpoint and read data from it? In short, the idea is for Kafka to consume data directly from the endpoint: communication between two microservices, where one microservice does not use Kafka.
What suggestions do you have for solving this situation? I guess this option is not possible. Is it necessary to add a publisher to application XYZ that sends the data to a queue, so that only then is it available for consumption by the new application?
Getting them via the REST-Interface might not be a very good idea.
Simply put, in the messaging world, message delivery guarantees are a big topic, and the standard ways to solve that with Kafka are usually:
- Producing messages from your service using the Producer API to a Kafka topic.
- Using Kafka Connect to read from an outbox table.
Since you most likely have a database already attached to your API service, the problem of dual writes might arise if you choose to produce the messages directly to a topic. This means that a write to the database might fail while the write to Kafka succeeds, or vice versa, so you can end up with inconsistent state. Depending on your use case, this might or might not be a problem.
Nevertheless, to overcome that, the outbox pattern can come in handy.
With the outbox pattern, you write your messages to a table, a so-called outbox table, and then use Kafka Connect to poll this table. Kafka Connect is basically a cluster of workers that consume this database table and forward its entries to a Kafka topic. You might want to look at Confluent Cloud, which offers a fully managed Kafka Connect service, so you don't have to manage the cluster of workers yourself. Once the messages are in a Kafka topic, you can consume them with the standard Kafka Consumer API or Streams API.
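To make the idea concrete, here is a minimal sketch of the outbox pattern in Python, using an in-memory SQLite database as a stand-in for the service's real database. The table names, the function names, and the `publish` callback are all hypothetical; in a real setup Kafka Connect would play the role of `relay_once`.

```python
import json
import sqlite3


def create_schema(conn):
    # Hypothetical business table plus the outbox table next to it.
    conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "topic TEXT NOT NULL, payload TEXT NOT NULL)"
    )


def save_message(conn, body, topic="messages"):
    """Write the business row and its outbox event in ONE transaction."""
    with conn:  # commits on success, rolls back if anything raises
        cur = conn.execute("INSERT INTO messages (body) VALUES (?)", (body,))
        message_id = cur.lastrowid
        payload = json.dumps({"id": message_id, "body": body})
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)", (topic, payload)
        )
    return message_id


def relay_once(conn, publish, last_id=0):
    """Forward outbox rows newer than last_id; return the new high-water mark."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)  # stand-in for producing to a Kafka topic
        last_id = row_id
    return last_id
```

Because both inserts share a transaction, there is never a business row without its event or an event without its row, which is exactly what the dual-write approach cannot guarantee.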
What you're looking for is a Source-Connector.
A source connector for a specific database, e.g. MongoDB: https://www.confluent.io/hub/mongodb/kafka-connect-mongodb
For now, most source-connectors produce in an at-least-once fashion. This means that the topic you configure the connector to write to might contain a message twice. So make sure that if you need them to be consumed exactly once, you think about deduplicating these messages.
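One simple way to deduplicate on the consumer side is to track the IDs that have already been handled. The sketch below assumes every message carries a unique ID (for example the outbox row ID); the function name and the in-memory `seen` set are illustrative only, and a real consumer would keep that set in durable storage.

```python
def process_once(messages, handle, seen=None):
    """Call handle() at most once per message ID, skipping redeliveries.

    `messages` is an iterable of (message_id, payload) pairs. `seen` is a
    set of already-processed IDs (in-memory here; persist it in practice).
    """
    seen = set() if seen is None else seen
    handled = 0
    for message_id, payload in messages:
        if message_id in seen:
            continue  # duplicate from at-least-once delivery
        handle(payload)
        seen.add(message_id)
        handled += 1
    return handled
```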
I have a Java program that runs in Apache Flink on AWS, and I want real-time communication through a WebSocket. How can I integrate a serverless WebSocket into Apache Flink (Java)?
Thank you.
Flink is designed to help you process and move data continuously between storage or streaming solutions. It is not intended to, and would not work well with websockets directly for these reasons:
When you submit a job, the runtime serializes your logic and moves it to other TaskManager instances so that it can be parallelized. These can be on another machine entirely. Now, if you were intending to service a websocket with that code, it has just moved elsewhere!
TaskManagers can be stopped and restarted (scaling event, recovering from a checkpoint/savepoint, etc). That's where your websocket connection will be cut.
Also, the Flink planner can decide that your source functions need to be read twice if it helps the processing. This means that your websockets would need to maintain a history of the messages received and make sure they are sent once to each operator instance.
This being said you can have a webserver managing the websocket, piping messages back and forth to a Kafka topic, which then Flink can operate on.
Since you're talking about AWS, I suggest you learn about their Websocket API Gateway service. I believe these can be connected easily with Kinesis, which Flink can read from and write to easily.
I am working on a microservice architecture. One of my services is exposed to a source system, which uses it to post data. This microservice publishes the data to Redis using Redis pub/sub, and the data is then consumed by a couple of other microservices.
Now, if one of the other microservices is down and cannot process the data from Redis pub/sub, then I have to retry with the published data when that microservice comes back up. The source cannot push the data again, and manual intervention is not possible, so I thought of 3 approaches:
1. Additionally using Redis as a data store for storing and retrieving the data.
2. Using a database to store the data before publishing. I have many source and target microservices that use Redis pub/sub. If I use this approach, every time I have to insert the request into the DB first and then its response status. I would also have to use a shared database. This approach adds a couple more exception-handling cases and doesn't look very efficient to me.
3. Using Kafka in place of Redis pub/sub. As traffic is low, I used Redis pub/sub, and it is not feasible to change.
In both of the first two cases I have to use a scheduler, and there is a time limit within which I have to retry, otherwise the subsequent request will fail.
Is there any other way to handle the above cases?
For point 2:
- Store the data in DB.
- Create a daemon process which will process the data from the table.
- This Daemon process can be configured well as per our needs.
- Daemon process will poll the DB and publish the data, if any. Also, it will delete the data once published.
I have not seen it in a microservice architecture, but I have seen this approach work efficiently when communicating with third-party services.
At the very outset, as you mentioned, we do indeed seem to have only three possibilities
This is one of those situations where you want to get a handshake from the service after pushing and after processing. To accomplish that, using a middleware queuing system would be the right approach.
Although a bit more complex to accomplish, what you can do is use Kafka for streaming this. Configuring producer and consumer groups properly can help you do the job smoothly.
Using a DB for storage would be overkill, considering that this is a situation where "this data is to be processed and to be persisted".
BUT, alternatively, storing data to Redis and reading it in a cron-job/scheduled job would make your job much simpler. Once the job is run successfully, you may remove the data from cache and thus save Redis Memory.
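A rough sketch of that scheduled job in Python, assuming `r` is a redis-py client created with `decode_responses=True` and that `deliver` is your own function returning `True` once the downstream service has accepted the payload. The key name and function names are hypothetical.

```python
PENDING_KEY = "pending:messages"  # hypothetical Redis hash holding unprocessed data


def store_pending(r, message_id, payload):
    """Keep a copy of the payload in a Redis hash before publishing it."""
    r.hset(PENDING_KEY, message_id, payload)


def retry_pending(r, deliver):
    """Run from a cron/scheduled job: retry everything still pending.

    Entries are removed only after deliver() reports success, so data for
    a service that is still down stays in Redis for the next run.
    """
    delivered = 0
    for message_id, payload in r.hgetall(PENDING_KEY).items():
        if deliver(payload):
            r.hdel(PENDING_KEY, message_id)  # frees Redis memory
            delivered += 1
    return delivered
```

The subscriber (or the job itself) should remove the entry as soon as processing succeeds, so the hash only ever contains data that still needs a retry.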
If you can comment further on the architecture and the implementation, I can go ahead and update my answer accordingly. :)
The current flow of the project I'm working on involves pushing to a local Kafka using the ruby-kafka gem.
Now the need has arisen to add a producer for a remote Kafka and to duplicate the messages there as well.
And I'm looking for a better way, than calling Kafka.new(...) twice...
Could you please help me, and do you happen to have any ideas?
Another approach to consider would be writing the data once from your application, and then asynchronously replicating the message from one Kafka cluster to another. There are multiple ways of doing this including Apache Kafka's MirrorMaker, Confluent's Replicator, Uber's uReplicator etc.
Disclaimer: I work for Confluent.
Background:
I'm trying to import data from Kafka into Elasticsearch, and there are 2 kinds of clients. One is a web client, the other is an agent client.
The web client handles the CSV file when a user uploads one: it reads 10,000 rows at a time from the CSV file and sends each batch, together with the CSV's total line count, to the producer. The producer sends the message to Kafka, then the consumer pulls the message and imports the data into Elasticsearch. At the same time, the consumer uses the length of each data message and the CSV total count to update the task progress, and also updates the error logs if there are any. In the end, our web client knows the errors and the import progress.
The agent client watches for log file changes. Once a new log entry arrives, it sends a message to the producer, the same as the web client, but it does not care about progress, as the logs are always growing (like nginx logs).
Framework:
Here is the framework that I used:
The producer and consumer are our Python programs that use kafka-python.
Problems:
Sometimes the consumer crashes; it is restarted automatically and reimports the same data again.
Sometimes the client sends too many messages and the producer might miss some, as the HTTP request has a limitation, I guess.
Question:
Is there a better framework for doing these things, such as kafka-connect-elasticsearch or Spark Streaming?
Yes - use the Kafka Connect Elasticsearch connector. This will make your life a LOT easier. The Kafka Connect API is specifically designed to do all of this hard stuff for you (restarts, offset management, etc). As an end-user you just need to set up a configuration file. You can read an example of using Kafka Connect here.
Kafka Connect is part of Apache Kafka. The Elasticsearch connector is open source and available on its own on github. Alternatively, just download Confluent Platform which bundles the latest version of Kafka with connectors (including Elasticsearch, HDFS, etc) and a bunch of other useful tools.
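On the reimport-after-restart problem specifically: it stops mattering if each write is idempotent, for example by deriving the document ID from the record's topic, partition and offset, which, as I understand it, is what the Elasticsearch connector does when it is told to ignore record keys. A minimal Python sketch of the idea, with a plain dict standing in for the Elasticsearch index and hypothetical function names:

```python
def document_id(topic, partition, offset):
    """Deterministic ID: the same Kafka record always maps to the same document."""
    return f"{topic}+{partition}+{offset}"


def index_records(index, records):
    """Upsert records into `index` (a dict standing in for Elasticsearch).

    `records` is an iterable of (topic, partition, offset, document) tuples.
    Replaying a record overwrites the existing document instead of adding
    a duplicate.
    """
    for topic, partition, offset, document in records:
        index[document_id(topic, partition, offset)] = document
    return len(index)
```

With this scheme, a consumer that crashes and replays the last batch simply rewrites the same documents.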