Distribution of content among cluster nodes within edge NiFi processors - apache-nifi

I was exploring the NiFi documentation. I must agree that it is one of the better-documented open-source projects out there.
My understanding is that the processor runs on all nodes of the cluster.
However, I was wondering how content is distributed among the cluster nodes when we use content-pulling processors like FetchS3Object, FetchHDFS, etc. With a processor like FetchHDFS or FetchSFTP, will all nodes make a connection to the source? Does NiFi split the content and fetch it from multiple nodes, or does one node fetch the content and load-balance it in the downstream queues?

I think this document has an answer to your question:
https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html
For other file stores the idea is the same.
will all nodes make a connection to the source?
Yes. If you did not limit your processor to run only on the primary node, it runs on all nodes.

The answer by #dagget has traditionally been the approach to handle this situation, often referred to as the "list + fetch" pattern: a List processor runs on the Primary Node only, the listings are sent to an RPG (Remote Process Group) to redistribute them across the cluster, and an Input Port receives the listings and connects to a Fetch processor that runs on all nodes, fetching in parallel.
In 1.8.0 there are now load-balanced connections, which remove the need for the RPG. You would still run the List processor on the Primary Node only, but then connect it directly to the Fetch processor and configure the queue in between to load-balance.
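As a rough sketch of the 1.8.0+ pattern, using ListSFTP/FetchSFTP as an example pair (the setting names are as they appear in the processor's Scheduling tab and in the connection's configuration dialog):

ListSFTP      (Scheduling > Execution: Primary node)
   |          connection: Load Balance Strategy = Round robin
   v
FetchSFTP     (Scheduling > Execution: All nodes; each node fetches its share of the listings)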

Related

Can I use the same flow.xml.gz for two different NiFi clusters?

We have a 13-node NiFi cluster with around 50k processors. The size of the flow.xml.gz is around 300 MB. Bringing up the 13-node NiFi cluster usually takes 8-10 hours. Recently we split the cluster into two parts, a 5-node cluster and an 8-node cluster, with the same 300 MB flow.xml.gz in both. Since then we have not been able to get NiFi up in either cluster, and we are not seeing any useful logs related to the issue. Is it okay to have the same flow.xml.gz in both? What best practices might we be missing when splitting a NiFi cluster?
You ask a number of questions that all boil down to "How to improve performance of our NiFi cluster with a very large flow.xml.gz".
Without a lot more details on your cluster and the flows in it, I can't give a definite or guaranteed-to-work answer, but I can point out some of the steps.
Splitting the cluster is no good without splitting the flow.
Yes, you will reduce cluster communications overhead somewhat, but you probably have a number of input processors that are set to "Primary Node only". If you load the same flow.xml.gz on two clusters, both will have a primary node executing these, leading to contention issues.
More importantly, since every node still loads the entire flow.xml.gz (probably around 4 GB unzipped), you get no other performance benefit, and verifying the 50k processors in the flow at startup still takes ages.
How to split the cluster
Splitting the cluster in the way you did probably left references to nodes that are now in the other cluster, for example in the local state directory. For NiFi clustering, that may cause problems electing a new cluster coordinator and primary node, because a quorum can't be reached.
It would be cleaner to first disconnect, offload and delete those nodes from the cluster GUI so that these references are removed. Those nodes can then be configured as a fresh cluster with an empty flow. Even if you use the old flow again later, testing the new cluster with an empty flow first makes this a lot quicker.
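If you prefer scripting this over the GUI, recent versions of the NiFi Toolkit CLI expose node management commands; a hedged sketch (command names may differ slightly between versions, and each node-level command additionally takes the target NiFi URL and the node id reported by get-nodes):

./bin/cli.sh nifi get-nodes
./bin/cli.sh nifi disconnect-node
./bin/cli.sh nifi offload-node
./bin/cli.sh nifi delete-node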
Since you already split the cluster, I would try to start one node of the 8-member cluster and see if you can access the cluster menu to delete the split-off nodes (disconnecting and offloading probably don't work anymore). Then, for the other 7 members of the cluster, delete the flow.xml.gz and start them; they should copy over the flow from the running node. You should adjust the number of candidates expected in nifi.properties (nifi.cluster.flow.election.max.candidates) so that it is not larger than the number of nodes, to slightly speed up this process.
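A minimal sketch of the relevant nifi.properties entries for the 8-node cluster (the wait time is an example value; nifi.cluster.flow.election.max.wait.time is the companion property that bounds how long the flow election may take):

nifi.cluster.flow.election.max.candidates=8
nifi.cluster.flow.election.max.wait.time=2 mins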
If successful, you then have the 300 MB flow running on the 8-member cluster and an empty flow on the new 5-member cluster.
Connect the new cluster to your development pipeline (NiFi Registry, templates or otherwise). Then you can stop process groups on the 8-member cluster, import them on the new one and, after verifying that the flows are running on the new cluster, delete the process groups from the old one, slowly shrinking it.
If you have no pipeline, or it's too much work to recreate all the controller services and parameter contexts, you could copy the flow.xml.gz to one new node, start only that node and delete all the stuff you don't need. Only after that should you start the others (with their empty flow.xml.gz) again.
For more expert advice, you should also try the Apache NiFi users mailing list. If you supply enough relevant details in your question, someone there may know what is going wrong with your cluster.

Elasticsearch cluster setup

I'm currently running a single-node ES instance. As there are some limitations with a single-server setup in ES, and the queries sometimes become pretty slow, I want to upgrade to a full cluster.
The ES instance currently only stores data and does not do anything fancy (transformations, ingest pipelines, ...). All I currently need is a place to store my data and retrieve it from (search + aggregations). There are more reads than writes.
In a lot of forums and blog posts I read about the "Split-Brain" issue. To circumvent this, the minimum node count should be 3.
The idea is to keep the number of machines low, because this is a private project and I do not want to manage a lot of operating systems in my spare time.
The structure I thought about was:
- 1 Coordinator + Voting-only Node
- 2 Master-eligible + Data Nodes
- minimum_master_nodes: 2 to prevent split-brain
All ES queries would be sent to the coordinating node, which would then issue the requests to the data nodes and reduce the final results.
My question is: does this make sense? Or is it better to use 3 master-eligible + data nodes?
Online I found no guidance for ES newbies on how to structure a simple cluster.
You are headed in the right direction, and most of your thinking is right too, so don't consider yourself an ES newbie :).
Anyway, as you are going to have 3 nodes in your cluster, why not make all three of them master-eligible? And why add a dedicated coordinating node, when by default every ES node already works as a coordinating node and a small project like yours won't need a dedicated one? This way you get a simple configuration: just don't assign any explicit role to any node, since by default every ES node is a master-eligible, data and coordinating node. A sketch of such a configuration is shown below.
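A minimal elasticsearch.yml sketch under that assumption (cluster and node names are placeholders; the discovery.zen.* settings apply to the pre-7.x versions where minimum_master_nodes is still used):

cluster.name: my-private-cluster
node.name: node-1
discovery.zen.ping.unicast.hosts: ["node-1", "node-2", "node-3"]
discovery.zen.minimum_master_nodes: 2
# no node.master / node.data / node.ingest lines: the node keeps all default roles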
Also, you should invest some time in identifying slow queries and their causes to make the cluster more performant, rather than just adding more resources, especially in a personal project; please refer to my short tips on improving search performance.
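As an example of where to start, Elasticsearch can log slow searches per index via the search slow log; a hedged sketch of the index settings (the thresholds are arbitrary examples to be tuned to your own latency expectations):

index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 2s
index.search.slowlog.threshold.fetch.warn: 1s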

Elastic search coordinating-node

We are new to Elasticsearch and beginning to set up a coordinating node for our UI client to query the index. We didn't really understand the difference between a master node and a coordinating node. Does the coordinating node have to be scaled up separately based on the site traffic? Will other nodes share the load?
The master node is responsible for managing the cluster topology. It neither indexes data nor participates in search tasks.
The data nodes are the real workhorses of your ES cluster and are responsible for indexing data and running searches/aggregations.
Coordinating nodes (formerly called "client nodes") are a kind of load balancer within your ES cluster. They are optional, and if you don't have any dedicated coordinating nodes, your data nodes will act as coordinating nodes. They don't index data; their main job is to distribute search tasks to the relevant data nodes (which they know how to find thanks to the master node) and to gather all the results before aggregating them and returning them to the client application.
So depending on your cluster size, amount of data and SLA requirements, you might need to spawn one or more coordinating nodes in order to properly serve your clients. Without any real numbers, it is hard to advise anything at this point, but the above describes how each kind of node works.
If you're just beginning and don't have much data, you don't need any dedicated coordinating node; a simple data node is perfectly fine.
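If you later do want a dedicated coordinating node, a minimal elasticsearch.yml sketch is simply a node with every other role switched off (these boolean settings apply to 6.x and early 7.x; more recent versions express the same thing with an empty node.roles list):

node.master: false
node.data: false
node.ingest: false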

How do FlowFiles get distributed across the cluster nodes?

For example, if I have a GetFile processor that I have designated to be isolated, how do the flow files coming from that processor get distributed across the cluster nodes?
Is there any additional work / processors that need to be added?
In Apache NiFi today the question of load balancing across the cluster has two main answers. First, you must consider how data gets to the cluster in the first place. Second, once it is in the cluster, do you need to rebalance it?
For getting data into the cluster it is important that you select protocols which are themselves scalable in nature. Protocols which offer queuing semantics are good for this, whereas protocols which do not are problematic. Examples with queuing semantics are JMS queues, Kafka, and some HTTP APIs. Those are great because one or more clients can pull from them in a queue fashion and thus spread the load. Examples of protocols which do not offer such behavior would be GetFile, GetSFTP and so on. These are problematic because the clients would have to share state about which data they have already pulled. To address even these protocols we've moved to a model of ListSFTP and FetchSFTP, where ListSFTP runs on one node in the cluster (the primary node) and then uses the Site-to-Site feature of NiFi to load-balance the listings across the rest of the cluster; each node then gets its share of the work and runs FetchSFTP to actually pull the data. The same pattern is offered for HDFS now as well, and a rough sketch of the flow is shown below.
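A plain sketch of that Site-to-Site based pattern (the processors are the standard List*/Fetch* processors; the input port name is just a placeholder):

ListSFTP (Execution: Primary node)
  -> Remote Process Group (pointing back at this same cluster)
       -> Input Port "listings" (each node receives a share of the listings)
            -> FetchSFTP (Execution: All nodes, fetching in parallel)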
In describing that pattern I also mentioned Site-to-Site. This is how two NiFi clusters can share data, which is great for inter-site and intra-site distribution needs. It also works well for spreading load within the same cluster: you simply send the data back to the same cluster, and NiFi then takes care of load balancing, fail-over, and detection of new and removed nodes.
So there are great options already. That said, we can do more, and in the future we plan to offer a way for you to indicate on a connection that it should be auto-load-balanced, so that it does behind the scenes what I've described.
Thanks
Joe
Here is an updated answer that is even simpler with newer versions of NiFi. I am running Apache NiFi 1.8.0 here.
The approach I found is to run a processor on the primary node that emits flow files, which are then consumed via a load-balanced connection.
For example, use one of the List* processors and, under "Scheduling", set its "Execution" to run on the primary node.
This should feed into the next processor. Select the connection between them and set its "Load Balance Strategy".
You can read more about the feature in its design document.
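A hedged summary of where these settings live in the 1.8.0 UI ("Round robin" is just one of the available strategies):

List processor    Scheduling > Execution: Primary node
Connection        Settings > Load Balance Strategy: Round robin
Fetch processor   Scheduling > Execution: All nodes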

Elasticsearch architecture

Is there a way to sync multiple ES clusters with each other? The ES docs discourage having a cluster span multiple data centers, so to avoid that I'd have distinct ES clusters in each data center. I also need to have the same data indexed in each cluster.
One way to achieve that would be to send each document to each cluster. But issuing 'n' write requests seems unnecessary. Additionally, if some write requests fail, the clusters could potentially go out of sync.
Is there a way for a cluster to "subscribe" to changes in another cluster? Or send the writes to a master cluster (whichever one is the closest to the data source) and let it eventually replicate to the other ones?
Edit: I've read about tribe nodes. The docs say they work just for reads and have some limitations. Is that something that would let me do this?
You can set up a custom routing/allocation strategy based on a data-center id [1]. This will ensure that one replica of each shard goes to each data center. Example:
cluster.routing.allocation.awareness.force.dc.values: dc1,dc2
cluster.routing.allocation.awareness.attributes: dc
[1] https://www.elastic.co/guide/en/elasticsearch/reference/1.6/modules-cluster.html
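For the awareness setting to take effect, each node also needs to advertise which data center it belongs to; a hedged sketch of the per-node setting (in the 1.x versions covered by [1] a custom node attribute is written as node.dc, while newer versions use node.attr.dc):

node.dc: dc1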
