How to write to multiple distinct Elasticsearch clusters using the Kafka Elasticsearch Sink Connector - elasticsearch

Is it possible to use a single Kafka instance with the Elasticsearch Sink Connector to write to separate Elasticsearch clusters with the same index? Documentation. The source data may be a backend database or an application. An example use-case is that one cluster may be used for real-time search and the other may be used for analytics.
If this is possible, how do I configure the sink connector? If not, I can think of a couple of options:
Use 2 Kafka instances, each pointing to a different Elasticsearch cluster. Either write to both, or write to one and copy from it to the other.
Use a single Kafka instance and write a stream processor which will write to both clusters.
Are there any others?

Yes you can do this. You can use a single Kafka cluster and single Kafka Connect worker.
One connector can write to one Elasticsearch instance, so if you have multiple destination Elasticsearch instances you need to configure multiple connectors.
The usual way to run Kafka Connect is in "distributed" mode (even on a single instance), and then you submit one—or more—connector configurations via the REST API.
You don't need a Java client to use Kafka Connect - it's configuration only. The configuration, per connector, says where to get the data from (which Kafka topic(s)) and where to write it (which Elasticsearch instance).
To learn more about Kafka Connect see this talk, this short video, and this specific tutorial on Kafka Connect and Elasticsearch
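As a minimal sketch of the "one connector per destination" idea above, the following builds two connector definitions that read the same topic and differ only in the Elasticsearch URL. The connector names, the topic `orders`, the hosts `es-search` and `es-analytics`, and the Connect endpoint `localhost:8083` are all assumptions for illustration; the property names are those of the Confluent Elasticsearch sink connector, and other settings may be needed depending on the connector version.

```python
import json
import urllib.request

CONNECT_URL = "http://localhost:8083/connectors"  # assumed Kafka Connect REST endpoint


def es_sink_config(name, topic, es_url):
    """Build one Elasticsearch sink connector definition for the Connect REST API."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
            "topics": topic,           # same source topic for every connector
            "connection.url": es_url,  # the only setting that differs per cluster
            "key.ignore": "true",
            "schema.ignore": "true",
            "tasks.max": "1",
        },
    }


def submit(connector, connect_url=CONNECT_URL):
    """POST a connector definition to a running Kafka Connect worker."""
    request = urllib.request.Request(
        connect_url,
        data=json.dumps(connector).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical names, topic, and hosts: one connector per destination cluster.
connectors = [
    es_sink_config("es-sink-search", "orders", "http://es-search:9200"),
    es_sink_config("es-sink-analytics", "orders", "http://es-analytics:9200"),
]
```

Calling `submit(connector)` for each entry against a running worker would register both connectors. Since the sink derives the index name from the topic name by default, both clusters should end up with an index of the same name.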

Related

Sending data from elasticsearch to kafka and finally to influxdb?

I would like to know how I can send data from Elasticsearch to Kafka and then to InfluxDB.
I've already tried using Confluent Platform with a source connector for Elasticsearch and a sink connector for InfluxDB, but I'm stuck on sending data from Elasticsearch to Kafka.
Moreover, once my computer is off I no longer have the backup of the connectors and I have to start from scratch.
So my questions are:
How do I send data from Elasticsearch to Kafka? Using Confluent Platform?
Do I really have to use Confluent Platform if I want to use Kafka Connect?
Kafka Connect is Apache 2.0 licensed and is included with the Apache Kafka download.
Confluent (among other companies) writes plugins for it, such as sinks for Elasticsearch or InfluxDB.
It appears the Elasticsearch source on Confluent Hub is not built by Confluent, for example.
Related - Use Confluent Hub without Confluent Platform installation
"once my computer is off I no longer have the backup of the connectors and I have to start from scratch"
Kafka Connect distributed mode stores its configuration in Kafka topics. Kafka's default configuration stores topic data under /tmp (log.dirs=/tmp/kafka-logs), which is typically cleared when you shut down or reboot your computer.
Similarly, if you are running any of these systems in Docker without mounted volumes, the container data is not persistent by default.
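To check whether a broker is affected, one option is to scan its properties file for data directories under /tmp. The helper below is a hypothetical sketch (the function name and the sample file content are made up for illustration); it only inspects the `log.dirs` and `log.dir` broker settings.

```python
def volatile_settings(properties_text, keys=("log.dirs", "log.dir")):
    """Return the broker data-directory settings that point somewhere under /tmp."""
    found = {}
    for line in properties_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not a key=value pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key not in keys:
            continue
        paths = [p.strip() for p in value.split(",")]
        risky = [p for p in paths if p == "/tmp" or p.startswith("/tmp/")]
        if risky:
            found[key] = risky
    return found


# Made-up excerpt of a broker config, similar to what quickstart setups ship with.
sample = """
# broker settings
broker.id=0
log.dirs=/tmp/kafka-logs
"""

print(volatile_settings(sample))
```

If the result is non-empty, pointing `log.dirs` at a directory that survives reboots should keep both the topic data and the connector configurations stored in it.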

Kafka connect with EventStoreDB

I'm working on a small academic project: event sourcing with EventStoreDB and Apache Kafka as a broker. The idea is to get events from EventStoreDB and push them to Kafka for further distribution. I saw that Apache Kafka has connectors for different DB systems, but I didn't find any connector for EventStoreDB.
How can I create (code myself or use an existing one) a Kafka connector for EventStoreDB, so that these two systems can transfer events in both directions, from Kafka to EventStoreDB and from EventStoreDB to Kafka?
There is no official Kafka Connect connector between Kafka and EventStoreDB, and I haven't heard of any unofficial one so far. Still, there is a tool called Replicator that can replicate data from EventStoreDB to Kafka (https://replicator.eventstore.org/docs/features/sinks/kafka/). It's open source, so you can either use it or study the implementation.
For EventStoreDB to Kafka, I recommend using the subscriptions mechanism: catch-up if you need an ordering guarantee, persistent if ordering is not critical: https://developers.eventstore.com/clients/grpc/subscriptions.html. The crucial part here is to define how to map EventStoreDB streams to Kafka topics and partitions. Typically you'd expect at least an ordering guarantee at the stream level, so all events from a single stream should land in the same partition.
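One way to picture the stream-to-partition mapping is to use the stream name as the message key and derive the partition from it. The sketch below assumes a topic with 6 partitions and made-up stream names; CRC32 is only a stand-in hash (Kafka's default partitioner hashes the key bytes with murmur2), but the property that matters is the same: equal keys always map to the same partition.

```python
import zlib

NUM_PARTITIONS = 6  # assumed partition count of the target topic


def partition_for(stream_id, num_partitions=NUM_PARTITIONS):
    """Map a stream name to a partition deterministically (CRC32 as a stand-in hash)."""
    return zlib.crc32(stream_id.encode("utf-8")) % num_partitions


# Hypothetical events as (stream name, event type), in the order they were read.
events = [
    ("order-42", "OrderPlaced"),
    ("order-7", "OrderPlaced"),
    ("order-42", "OrderPaid"),
    ("order-42", "OrderShipped"),
]

# Group events by target partition, preserving arrival order within each partition.
by_partition = {}
for stream_id, event_type in events:
    by_partition.setdefault(partition_for(stream_id), []).append((stream_id, event_type))
```

With a real producer, passing the stream name as the record key should give the same effect without computing the partition by hand.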
For Kafka to EventStoreDB integration, you could either write your own pass-through service or try to use the HTTP sink connector (e.g. https://docs.confluent.io/kafka-connect-http/current/overview.html). EventStoreDB exposes an HTTP API (https://developers.eventstore.com/clients/http-api/v5/introduction/). As a side note, this API (AtomPub-based) may be replaced with another HTTP API in the future, so the structure may change.
You can use Event Store Replicator, which has a Kafka sink.
Keep in mind that it doesn't do anything with regard to event schemas, so things like Kafka Streams and KSQL might not work properly.
The sink was created solely for pushing events to Kafka when Kafka is used as a message broker.

How to configure the Kafka Cluster to work with Elastic Search Cluster?

I have to build a log cluster and a monitoring cluster (for high availability) like this topology. I would like to know how to configure the log-shipper clusters. (I have 2 topologies in the image.)
If I use Kafka with Filebeat in the Kafka cluster, will Elasticsearch receive duplicate data because Kafka keeps replicas of the data?
If I use Logstash (in the Elasticsearch cluster) to get logs from the Kafka cluster, what should the config look like? I think Logstash will not know where to read the logs efficiently on the Kafka cluster.
(Image: cluster topology)
Thanks for reading. If you have any ideas, please discuss them with me ^^!
As I see it, both configurations are compatible with Kafka; you can use Filebeat, Logstash, or a mix of them in the consumer and producer stages.
IMHO it all depends on your needs. For example, sometimes we use filters to enrich the data before ingesting it into Kafka (producer stage), or before indexing it into Elasticsearch (consumer stage). In that case it is better to work with Logstash, because filters are easier to use there than in Filebeat.
But if you want to work with raw data, Filebeat may be better, because the agent is lighter.
About your questions:
Kafka replicates the data, but only for HA purposes; consumers in the same consumer group read each record only once.
To read the logs from Kafka with Logstash, you can use the Logstash input plugin for Kafka; it is easy and works fine:
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kafka.html
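The point about replicas and consumer groups can be illustrated with a toy model. This is not the Kafka client API, only a simplified sketch with made-up class names: every replica stores each record, but a consumer group fetches from a single replica and tracks one committed offset, so in normal operation a record is handed to the group once regardless of the replication factor.

```python
class Partition:
    """Toy model: one leader log plus follower copies kept for failover only."""

    def __init__(self, replication_factor):
        # replicas[0] plays the role of the leader.
        self.replicas = [[] for _ in range(replication_factor)]

    def append(self, record):
        for replica in self.replicas:  # every replica stores the record
            replica.append(record)

    def read(self, offset):
        return self.replicas[0][offset:]  # consumers fetch from one replica, not all


class ConsumerGroup:
    """Tracks one committed offset per group, shared by all members of the group."""

    def __init__(self):
        self.committed = 0

    def poll(self, partition):
        records = partition.read(self.committed)
        self.committed += len(records)
        return records


partition = Partition(replication_factor=3)
for line in ["log-1", "log-2", "log-3"]:
    partition.append(line)

logstash_group = ConsumerGroup()
first = logstash_group.poll(partition)   # the three records, once
second = logstash_group.poll(partition)  # nothing new to read
```

In a real deployment, redelivery can still happen after consumer failures or rebalances, so this model only covers the replication question, not end-to-end delivery guarantees.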

Kafka to Elasticsearch, HDFS with Logstash or Kafka Streams/Connect

I use Kafka for message queue/processing. My question is about performance/best practice. I will do my own performance tests but maybe someone has results/experience already.
The data is raw in a Kafka (0.10) topic and I want to transfer it structured to ES and HDFS.
Now I see 2 possibilities:
Logstash (Kafka input plugin, grok filter (parsing), ES/webhdfs output plugin)
Kafka Streams (parsing), Kafka Connect (ES sink, HDFS sink)
Without any tests I would say that the second option is better/cleaner and more reliable?
Logstash is the "best practice" for getting data into Elasticsearch. However, its WebHDFS output won't have the raw performance of the Java API that is part of the Kafka Connect plugin.
Grok could be done in a Kafka Streams process, so your parsing could be done in either location.
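Whichever tool does the parsing, the step itself amounts to turning a raw line into named fields. Below is a minimal sketch using a plain regular expression with named groups as a stand-in for a grok pattern; the log layout (timestamp, level, message) and the sample line are assumptions, not taken from the question.

```python
import json
import re

# Hypothetical log layout: "<ISO timestamp> <LEVEL> <message>"
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$"
)


def parse_line(raw):
    """Turn one raw log line into a structured dict, or None if it does not match."""
    match = LINE_PATTERN.match(raw)
    return match.groupdict() if match else None


raw = "2017-03-01T12:00:00Z ERROR disk full on /dev/sda1"
structured = parse_line(raw)

# The structured record could then be serialized and written to an output topic.
print(json.dumps(structured))
```

The same function could sit inside a stream processor that reads the raw topic and writes a structured topic, which the ES and HDFS sinks would then consume.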
If you are on an Elastic subscription, then they would like to sell Logstash. Confluent would like to sell Kafka Streams + Kafka Connect.
Avro seems to be the best medium for data transfer, and the Schema Registry is a popular way to do that. IIUC, Logstash doesn't work well with a Schema Registry or Avro, and prefers JSON.
In the Hadoop landscape, I would offer the intermediate options of Apache NiFi or StreamSets.
In the end, it really depends on your priorities, and how well you (and your team) can support these tools.

Kafka-Connect vs Filebeat & Logstash

I'm looking to consume from Kafka and save data into Hadoop and Elasticsearch.
I've seen 2 ways of doing this currently: using Filebeat to consume from Kafka and send it to ES and using Kafka-Connect framework. There is a Kafka-Connect-HDFS and Kafka-Connect-Elasticsearch module.
I'm not sure which one to use to send streaming data. Though I think that if I want at some point to take data from Kafka and place it into Cassandra I can use a Kafka-Connect module for that but no such feature exists for Filebeat.
Kafka Connect can handle streaming data and is a bit more flexible. If you are only writing to Elasticsearch, Filebeat is a clean integration for log sources. However, if you are going from Kafka to a number of different sinks, Kafka Connect is probably what you want. I'd recommend checking out the connector hub to see some examples of the open source connectors currently at your disposal: http://www.confluent.io/product/connectors/
