NiFi cluster reading duplicate data from Kafka

We have a cluster running with 6 nodes. When I add a Kafka consumer, each cluster node should pull unique data, that is, each node should fetch from a different partition: https://bryanbende.com/development/2016/09/15/apache-nifi-and-apache-kafka.
The same is mentioned in the NiFi docs. However, in our case each node is pulling the same data from Kafka, leading to duplication. Can someone please help? Are any specific configurations required to achieve this?
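
For context, Kafka divides a topic's partitions among the consumers that share one consumer group ID, while consumers in different groups each receive every partition. The following is a minimal Python sketch of that rule, not NiFi or Kafka client code: the function assign_partitions, the node names, and the group IDs are hypothetical, and the round-robin split is a simplification of Kafka's real assignors. It illustrates why six nodes sharing one group ID would read disjoint partitions, whereas nodes whose group IDs differ would all read the same data.

```python
from collections import defaultdict


def assign_partitions(partitions, consumers):
    """Simplified model of Kafka consumer-group assignment (hypothetical helper).

    `consumers` maps a consumer name to its group ID. Within each group the
    partitions are dealt out round-robin; every group sees all partitions.
    """
    groups = defaultdict(list)
    for name, group_id in sorted(consumers.items()):
        groups[group_id].append(name)

    assignment = {name: [] for name in consumers}
    for members in groups.values():
        for i, partition in enumerate(partitions):
            assignment[members[i % len(members)]].append(partition)
    return assignment


partitions = list(range(6))

# All six nodes share one group ID: each node owns a distinct partition.
shared = assign_partitions(
    partitions, {f"node-{n}": "nifi-group" for n in range(1, 7)}
)

# Each node has its own group ID: every node reads every partition.
separate = assign_partitions(
    partitions, {f"node-{n}": f"group-{n}" for n in range(1, 7)}
)
```

If the duplication described in the question comes from this effect, a first check would be whether the Group ID property of the Kafka consumer processor resolves to the same value on every node. That is an assumption; the question does not state the cause.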

Related

Can you run an elasticsearch data node after deleting the data folder?

I am running a three-node Elasticsearch (ELK) cluster. All nodes have the same roles, e.g. data, master, etc. The disk on node 3 where the data folder is assigned became corrupted, and that data is probably unrecoverable. The other nodes are running normally, and one of them assumed the master role instead.
Will the cluster work normally if I replace the disk and make the empty directory available to elastic again, or am I risking crashing the whole cluster?
EDIT: As this is not explicitly mentioned in the answer, yes, if you add your node with an empty data folder, the cluster will continue normally as if you added a new node to the cluster, but you have to deal with the missing data. In my case, I lost the data as I do not have replicas.

Let me try to explain that in a simple way.
Your data got corrupted on node-3, so if you add that node again, it will not have the older data, i.e. the shards stored on node-3 will remain unavailable to the cluster.
Did you have replica shards configured for the indexes?
What is the current status (yellow/red) of the cluster with node-3 removed?
If a primary shard isn't available, the master node promotes one of the active replicas to become the new primary. If there are currently no active replicas, the status of the cluster will remain red.
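
The promotion rule above can be sketched as a small Python model. This is an illustration only, not Elasticsearch code: the function status_after_node_loss and the shard layout are hypothetical, and the model looks at the moment right after the node is lost, before the cluster re-creates any replicas elsewhere.

```python
def status_after_node_loss(shards, lost_node):
    """Model cluster health right after one node disappears.

    `shards` maps a shard ID to its copies, each a dict with the keys
    "node" (str) and "primary" (bool). Returns (status, remaining_layout).
    """
    remaining = {}
    status = "green"
    for shard_id, copies in shards.items():
        survivors = [dict(c) for c in copies if c["node"] != lost_node]
        # Primary lost but a replica survives: promote that replica.
        if survivors and not any(c["primary"] for c in survivors):
            survivors[0]["primary"] = True
        remaining[shard_id] = survivors
        if not survivors:
            status = "red"  # no copy of this shard is left
        elif len(survivors) < len(copies) and status != "red":
            status = "yellow"  # primary is available, a copy is missing
    return status, remaining


# Hypothetical layout with one replica per shard.
with_replicas = {
    0: [{"node": "node-3", "primary": True}, {"node": "node-1", "primary": False}],
    1: [{"node": "node-2", "primary": True}, {"node": "node-3", "primary": False}],
}
# Hypothetical layout without replicas, as in the question's EDIT.
no_replicas = {
    0: [{"node": "node-3", "primary": True}],
    1: [{"node": "node-1", "primary": True}],
}

status_a, layout_a = status_after_node_loss(with_replicas, "node-3")
status_b, layout_b = status_after_node_loss(no_replicas, "node-3")
```

With replicas, the copy of shard 0 on node-1 is promoted and the status is yellow; without replicas, shard 0 has no surviving copy and the status is red.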

How to know total nodes in an elasticsearch cluster?

I have a 3-node Elasticsearch cluster. If more than one node goes down, I can easily check them manually. But if the number of nodes in the cluster grows, checking them manually will become difficult. So how can I get all the nodes (specifically the node names) of the cluster, even the ones that are down?
To get live/healthy nodes I hit this API endpoint:
curl -X GET "hostname/ip:port/_cat/nodes?v&pretty"
Is there an endpoint that returns the total nodes and the unhealthy/down nodes in an Elasticsearch cluster?
I was trying to list all the nodes using discovery.seed_hosts from the elasticsearch.yml config file, but I don't know how to do it or whether it is the right approach.

I don't think there is any API that reports offline nodes. Whether the entire cluster or a single node is down, Elasticsearch doesn't provide a way to check the health of a node that is not running. You need to depend on an external script or monitoring tool that pings all your nodes and prints their status.
You can write a custom script that calls the API below, which returns all the nodes currently available in the cluster. Once you have the response, extract the IP or hostname of each node; any node that does not appear in the response can be considered down.
GET _cat/nodes?format=json&filter_path=ip,name
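
A minimal Python sketch of such a script follows. The helper names fetch_live_nodes and find_down_nodes, the expected-node inventory, and the sample response are all hypothetical. The sketch selects columns with the _cat API's h parameter rather than filter_path, and it assumes an unauthenticated HTTP endpoint; a secured cluster would need credentials and TLS handling.

```python
import json
from urllib.request import urlopen


def fetch_live_nodes(base_url):
    """Return the JSON text of _cat/nodes from a reachable cluster node."""
    url = f"{base_url}/_cat/nodes?format=json&h=ip,name"
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


def find_down_nodes(expected_names, cat_nodes_json):
    """Return expected node names that are absent from the _cat/nodes output."""
    live = {row["name"] for row in json.loads(cat_nodes_json)}
    return sorted(set(expected_names) - live)


# Hypothetical inventory, maintained outside Elasticsearch.
expected = ["node-1", "node-2", "node-3"]

# Sample response in the shape returned by _cat/nodes?format=json&h=ip,name.
sample_response = (
    '[{"ip": "10.0.0.1", "name": "node-1"},'
    ' {"ip": "10.0.0.3", "name": "node-3"}]'
)

down = find_down_nodes(expected, sample_response)
```

Against a real cluster, the call would be find_down_nodes(expected, fetch_live_nodes(base_url)), where base_url points at any node that is still up. The list of expected nodes has to come from your own inventory, since the cluster only reports the nodes that are currently joined.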
Another option is to enable cluster monitoring, which gives you the status of the entire cluster, but again it shows information about running nodes only.
Please check this answer for how Kibana shows offline nodes in Cluster Monitoring.

Which node of an Elasticsearch cluster (master, data, ingest) gathers data from Logstash?

I have 3 Elasticsearch nodes in my cluster. How do they connect to each other, and how do I set the output of my Logstash to send data to the ES cluster (that is, which node is responsible for gathering the data)?

Logstash sends data to the cluster, and you can check where it sends it in the configuration under /etc/logstash/conf.d/*. The ingest node is responsible for pre-processing documents before they are indexed on the cluster. By default, all nodes are ingest nodes. You can have a dedicated ingest node, but with 3 nodes you don't need one.

Why Druid segments become unavailable after data ingestion

The Druid cluster shows certain segments of a data source as unavailable after data ingestion.
Ex: 72.4% available (2352 segments, 647 segments unavailable)
We have a clustered deployment with 3 nodes:
Master node (coordinator and overlord)
Data node (historical and middlemanager)
Query node (broker and router)
Is there any specific reason why this is happening?

The issue was resolved after a clean restart of the master and data nodes. However, just restarting the nodes without cleaning the data did not work.

Adding cluster to existing elastic search in elk

Currently I have an existing setup of:
1. Elasticsearch
2. Logstash
3. Kibana
I have existing data on them.
Now I have set up an ELK cluster with 3 master nodes, 5 data nodes, and 3 client nodes.
But I am not sure how I can get the existing data into it.
If I make the existing ES node a data node and attach it to the cluster, will its data get replicated to the other data nodes as well, so that I can then take that node offline?

Option 1
How about trying with fewer nodes first? It is not hard to test whether this is supported: set up one node, feed it some data, add one more node, configure them as a cluster, and see whether the data gets synchronized.
Option 2
Another option is to use an Elasticsearch migration tool like https://github.com/taskrabbit/elasticsearch-dump. Basically, you could set up a clean cluster and migrate all the data from your old node to it.
