Distributing data read from GetMongo in a NiFi cluster - apache-nifi

I have a clustered NiFi setup and we are running the GetMongo processor on the Primary Node only, so that duplicate data is not fetched. This seems to be working fine. However, once the data has been fetched I want the subsequent processors in the chain to run across the cluster, i.e. the fetched data should be processed in parallel. Somehow this is not happening. So, assuming GetMongo has fetched 30,000 records and they are sitting in the queue, my questions are:
1) How do I check whether a processor is running on a single node or on all nodes? The processor is configured to run on all nodes, but while it is running I see it display 1 in its top-right corner.
2) If one processor has been set to run only on the Primary Node, do all the other processors in the flow also run only on the Primary Node?
Example:
In the screenshot above, my GetMongo runs on the Primary Node. How do I make sure that the ExecuteScript processor runs in parallel on all 3 NiFi nodes? As of now, if I check View Status History on the ExecuteScript processor, I see data flowing only through the primary node.

Yes, that's correct. When you mark the source processor to run on the Primary Node only, all the subsequent steps happen on that node alone, since the data resides only on that node (the primary node), even when NiFi is running in clustered mode. To make it work the way you want, you can follow either of the following two approaches:
Approach #1: Combination of RPG and Site-to-Site
Here your flow will look like this:
Create an Input Port on the Root Group (the very top level of the NiFi canvas)
Make GetMongo run only on Primary Node.
Connect the success relationship of the processor to a Remote Process Group (RPG). Configure this RPG with the URL of the cluster itself and point it at the Input Port you added in step #1.
Connect the Input Port to your processing logic.
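As a quick sanity check for this approach, you can ask the NiFi REST API whether the root-group Input Port is actually exposed for Site-to-Site. This is only a rough sketch, assuming an unsecured cluster reachable at http://nifi-host:8080 and the third-party requests library; the exact response fields can differ between NiFi versions.

```python
# Rough sketch: verify the root-group Input Port is visible for Site-to-Site.
# Assumes an unsecured NiFi at http://nifi-host:8080 (adjust URL/auth for your cluster).
import requests

NIFI_API = "http://nifi-host:8080/nifi-api"

details = requests.get(f"{NIFI_API}/site-to-site").json()
controller = details["controller"]

print("Raw socket S2S port:", controller.get("remoteSiteListeningPort"))
print("HTTP S2S port:", controller.get("remoteSiteHttpListeningPort"))

for port in controller.get("inputPorts", []):
    # The Input Port created in step #1 should show up here once it is running.
    print(f"Input port: {port['name']} (id={port['id']}, state={port.get('state')})")
```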
Useful Links:
https://pierrevillard.com/2017/02/23/listfetch-pattern-and-remote-process-group-in-apache-nifi/
This is cumbersome and makes your flow more complex, but it is how it has to be done prior to NiFi 1.8. From NiFi 1.8 onwards, you can use the following approach.
Approach #2 : Load-Balanced Connections (Apache NiFi 1.8+)
Apache NiFi 1.8 was released a week ago (at the time of writing). This release introduced a long-awaited and much-desired feature called Load-Balanced Connections.
With this approach, you can skip the RPG/Site-to-Site combination altogether and instead do the following:
Connect the output of your source processor (in this case GetMongo) to the subsequent processors.
Right-click the connection for the success relationship of the source processor.
Click Configure.
Go to the Settings tab.
Set the Load Balance Strategy to the desired one, preferably Round Robin in your case.
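The same setting can also be applied through the REST API, which is handy when many connections need updating. The sketch below is illustrative only: it assumes NiFi 1.8+, an unsecured node at http://nifi-host:8080, a known connection UUID, and the third-party requests library, and the field names are taken from the 1.8-era API, so double-check them against your version.

```python
# Rough sketch: set a connection's Load Balance Strategy to ROUND_ROBIN via the REST API.
# Assumes NiFi 1.8+, unsecured, and that CONNECTION_ID is the UUID of the success connection.
import requests

NIFI_API = "http://nifi-host:8080/nifi-api"
CONNECTION_ID = "your-connection-uuid"  # placeholder

conn = requests.get(f"{NIFI_API}/connections/{CONNECTION_ID}").json()

update = {
    "revision": conn["revision"],  # NiFi requires the current revision for any update
    "id": CONNECTION_ID,
    "component": {
        "id": CONNECTION_ID,
        "loadBalanceStrategy": "ROUND_ROBIN",  # or PARTITION_BY_ATTRIBUTE / SINGLE_NODE
    },
}

resp = requests.put(f"{NIFI_API}/connections/{CONNECTION_ID}", json=update)
resp.raise_for_status()
print("Load balance strategy is now:", resp.json()["component"]["loadBalanceStrategy"])
```

Depending on the NiFi version, you may need to stop the source and destination processors before the connection update is accepted.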
Useful Links:
https://blogs.apache.org/nifi/entry/load-balancing-across-the-cluster
https://pierrevillard.com/2018/10/29/nifi-1-8-revolutionizing-the-list-fetch-pattern-and-more/

Related

Apache NiFi - Can it scale at the processor level?

Newbie Alert to Apache NiFi!
I am curious to understand (and read relevant material on) the scalability aspects of an Apache NiFi pipeline in a clustered setup.
Imagine there is a 2-node cluster: Node 1 & Node 2.
A simple use case as an example:
Query a database table in batches of 100 (let's say there are 10 batches).
For each batch, call a REST API (InvokeHTTP).
If the pipeline is triggered on Node 1 in the cluster, does this mean all 10 batches run only on Node 1?
Is there any out-of-the-box work distribution available in NiFi at the processor level, along the lines of 5 batches of REST API calls being executed per node?
Is the built-in queue of NiFi distributed in nature?
Or is the recommended way to scale at the processor level to publish the output of the previous processors to a messaging middleware (like Kafka) and have the subsequent NiFi processor consume from it?
What's the recommended way to scale at every processor level in NiFi?
Every queue (connection) has a Load Balance Strategy parameter with the following options:
Do not load balance: Do not load balance FlowFiles between nodes in the cluster. This is the default.
Partition by attribute: Determines which node to send a given FlowFile to based on the value of a user-specified FlowFile Attribute.
Round robin: FlowFiles will be distributed to nodes in the cluster in a round-robin fashion.
Single node: All FlowFiles will be sent to a single node in the cluster.
Details in documentation:
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Load_Balancing
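To confirm that a chosen strategy is actually spreading FlowFiles across the cluster, you can inspect the per-node queue counts of a connection. The following is a rough sketch only, assuming an unsecured node at http://nifi-host:8080, a known connection UUID, and the third-party requests library; the status field names may vary slightly by version.

```python
# Rough sketch: show how many FlowFiles are queued on each node for one connection.
# Assumes an unsecured NiFi cluster and that CONNECTION_ID is the connection's UUID.
import requests

NIFI_API = "http://nifi-host:8080/nifi-api"
CONNECTION_ID = "your-connection-uuid"  # placeholder

status = requests.get(f"{NIFI_API}/flow/connections/{CONNECTION_ID}/status",
                      params={"nodewise": "true"}).json()

for node in status["connectionStatus"].get("nodeSnapshots", []):
    snapshot = node["statusSnapshot"]
    print(f"{node.get('address')}: {snapshot.get('flowFilesQueued')} FlowFiles queued")
```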

How do we know when a flow is completed when we have multiple FlowFiles running in parallel?

I have a requirement where a template uses SQL as both source and destination, and the data is more than 100 GB for each table. The template is instantiated multiple times, once per table to be migrated, and each table is additionally partitioned into multiple FlowFiles. How do we know when the process is completed? Since there are multiple FlowFiles, we cannot conclude the flow is done just because one of them reaches the end processor.
I have tried using SiteToSiteStatusReportingTask to check the queue count, but it reports counts per connection, and it is difficult to fetch the connection ID for each connection and then aggregate them, since we have a large number of templates. There is another problem with the reporting task: it reports data for all process groups on the NiFi canvas, which is a huge amount of data when all templates are running and may impact performance, even though I used an Avro schema to fetch only the queue count and connection ID.
Can you please suggest some ideas and help me to achieve this?
You have multiple solutions:
1 - You can use the Wait/Notify processor pair.
If you don't want multiple FlowFiles running in parallel:
2 - Set back pressure on the queue.
3 - Specify process-group-level FlowFile concurrency (recommended, but NiFi 1.12+ only).
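If you prefer polling from outside the flow instead of the Wait/Notify pattern or the reporting task, one pragmatic option is to watch the total queued FlowFile count of the process group a template was instantiated into and treat the migration as complete once it stays at zero. This is a rough sketch only, assuming an unsecured node at http://nifi-host:8080, a known process group UUID, and the third-party requests library.

```python
# Rough sketch: poll a process group's queued FlowFile count until the flow drains.
# Assumes an unsecured NiFi and that PG_ID is the process group a template was added to.
import time
import requests

NIFI_API = "http://nifi-host:8080/nifi-api"
PG_ID = "your-process-group-uuid"  # placeholder

def queued_flowfiles(pg_id):
    status = requests.get(f"{NIFI_API}/flow/process-groups/{pg_id}/status").json()
    return status["processGroupStatus"]["aggregateSnapshot"]["flowFilesQueued"]

# Treat the migration as done once the group has been empty for a few polls in a row,
# so a brief moment between processors does not count as "finished".
empty_polls = 0
while empty_polls < 3:
    empty_polls = empty_polls + 1 if queued_flowfiles(PG_ID) == 0 else 0
    time.sleep(30)

print("All queues in the process group are empty; the migration looks complete.")
```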

How to limit a NiFi processor to run on a single node in a cluster?

We are building a data workflow with NiFi and want the final (custom) processor (which runs the deduplication logic) to run on only one of the NiFi cluster nodes (instead of running on all of them). I see that NiFi 1.7.0 (which is not yet released) has a PrimaryNodeOnly annotation to enforce single-node execution behaviour. Is there a way or workaround to enforce such behaviour in NiFi 1.6.0?
NOTE: In addition to @PrimaryNodeOnly, it would be better if NiFi provided a way to run a processor on a single node only (i.e., some annotation like @SingleNodeOnly). That way the execution node need not necessarily be the primary node, which would reduce the load on the primary node. This is just an ask for the future and is not necessary to solve the problem mentioned above.
There is no specific workaround to enforce it in previous versions; it is on the dataflow designer to mark the intended processor(s) to run on the Primary Node only. You could write a script that queries the NiFi API for processors of certain types or names and then checks/sets the execution setting to Primary Node only, as sketched below.
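A rough sketch of such a script is shown below. It assumes an unsecured node at http://nifi-host:8080 and the third-party requests library, uses the flow search endpoint to find processors by name, and then flips their Execution setting to primary-node-only; the search term is a placeholder and the exact field names may differ slightly between versions.

```python
# Rough sketch: find processors by name and set them to run on the Primary Node only.
# Assumes an unsecured NiFi; adjust the base URL, auth, and search term for your cluster.
import requests

NIFI_API = "http://nifi-host:8080/nifi-api"
SEARCH_TERM = "DeduplicateRecords"  # placeholder: (part of) your custom processor's name

results = requests.get(f"{NIFI_API}/flow/search-results",
                       params={"q": SEARCH_TERM}).json()

for hit in results["searchResultsDTO"].get("processorResults", []):
    proc = requests.get(f"{NIFI_API}/processors/{hit['id']}").json()
    update = {
        "revision": proc["revision"],
        "component": {
            "id": proc["id"],
            "config": {"executionNode": "PRIMARY"},  # same as Execution = "Primary node" in the UI
        },
    }
    resp = requests.put(f"{NIFI_API}/processors/{proc['id']}", json=update)
    resp.raise_for_status()
    print(f"Set {hit['name']} ({proc['id']}) to primary-node-only execution")
```

Note that a processor usually has to be stopped before its configuration can be updated.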
In NiFi 1.6.0 it is already possible: in the processor's configuration dialog, on the Scheduling tab, set Execution to Primary node.

How is data written to HDFS?

I'm trying to understand how data writing is managed in HDFS by reading the hadoop-2.4.1 documentation.
According to the following schema:
whenever a client writes something to HDFS, it has no contact with the namenode and is in charge of chunking and replication. I assume that in this case, the client is a machine running an HDFS shell (or equivalent).
However, I don't understand how this is managed.
Indeed, according to the same documentation :
The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
Is the schema presented above correct? If so,
is the namenode only informed of new files when it receives a Blockreport (which can take time, I suppose)?
why does the client write to multiple nodes?
If this schema is not correct, how does file creation work with HDFS?
As you said, DataNodes are responsible for serving read/write requests and for block creation, deletion, and replication.
They also regularly send “Heartbeats” (a state-of-health report) and a “BlockReport” (the list of blocks on the DataNode) to the NameNode.
According to this article:
Data Nodes send heartbeats to the Name Node every 3 seconds via a TCP
handshake, ... Every tenth heartbeat is a Block Report,
where the Data Node tells the Name Node about all the blocks it has.
So block reports are sent every 30 seconds. I don't think this affects Hadoop jobs, because in general those are independent jobs.
For your question:
why does the client write to multiple nodes?
I'd say that actually the client writes to just one datanode and tells it to send the data on to the other datanodes (see the linked picture: CLIENT START WRITING DATA), but this is transparent. That's why your schema shows the client as the one writing to multiple nodes.
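To make the "transparent" part concrete: from application code, an HDFS write looks like writing a single stream; splitting into blocks, block placement by the NameNode, and the DataNode-to-DataNode replication pipeline all happen underneath. Below is a rough sketch using the third-party hdfs (WebHDFS) Python package, assuming WebHDFS is enabled and the NameNode web UI is at http://namenode:50070 (the Hadoop 2.x default port).

```python
# Rough sketch: a client-side HDFS write via WebHDFS using the third-party `hdfs` package.
# The application only writes one stream; HDFS splits it into blocks, the NameNode picks
# the target DataNodes, and those DataNodes replicate the blocks among themselves.
from hdfs import InsecureClient

# Hadoop 2.x default WebHDFS port is 50070; adjust host, port, and user for your cluster.
client = InsecureClient("http://namenode:50070", user="hadoop")

with client.write("/tmp/example.txt", overwrite=True) as writer:
    writer.write(b"hello hdfs\n")  # no chunking or replication is handled by the client here

print(client.status("/tmp/example.txt"))  # file metadata, as reported by the NameNode
```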

Hadoop Datanode, namenode, secondary-namenode, job-tracker and task-tracker

I am new to Hadoop, so I have some doubts. If the master node fails, what happens to the Hadoop cluster? Can we recover that node without any loss? Is it possible to keep a secondary master node that switches over automatically when the current one fails?
We have a backup of the namenode (the secondary namenode), so we can restore the namenode from the secondary namenode when it fails. Similarly, how can we restore the data on a datanode when that datanode fails? The secondary namenode is a backup of the namenode only, not of the datanodes, right? If a node fails before completion of a job, so there is a job pending in the job tracker, does that job continue or restart from the beginning on a free node?
How can we restore the entire cluster data if anything happens?
And my final question: can we use a C program in MapReduce (for example, bubble sort in MapReduce)?
Thanks in advance
Although it is too late to answer your question, it may help others.
First of all, let me introduce you to the Secondary NameNode:
It contains a backup of the namespace image and the edit log files for the past hour (configurable). Its job is to merge the latest NameNode namespace image and edit log files and upload the result back to the NameNode as a replacement for the old one. Having a Secondary NameNode in a cluster is not mandatory.
Now, coming to your concerns:
If the master node fails, what happens to the Hadoop cluster?
Supporting Frail's answer: yes, Hadoop has a single point of failure, so everything currently running, a MapReduce job or anything else using the failed master node, will stop. The whole cluster, including clients, will stop working.
Can we recover that node without any loss?
That is hypothetical. Recovering without any loss is unlikely, as all the metadata (block reports) sent by the DataNodes to the NameNode after the last backup taken by the Secondary NameNode will be lost. I say unlikely because if the NameNode fails just after a successful backup run by the Secondary NameNode, then it is in a safe state.
Is it possible to keep a secondary master-node to switch automatically to the master when the current one fails?
It is straightforward for an administrator (user) to do this manually. To switch over automatically, you have to write your own code outside the cluster: code that monitors the cluster, promotes the Secondary NameNode appropriately, and restarts the cluster with the new NameNode address.
We have a backup of the namenode (the secondary namenode), so we can restore the namenode from the secondary namenode when it fails. Similarly, how can we restore the data on a datanode when that datanode fails?
This is about the replication factor. We have 3 replicas of each file block (the default, configurable, and a best practice), all on different DataNodes. So in case of a failure, for the time being we still have 2 other DataNodes holding the data. Later, the NameNode will create one more replica of the data that the failed DataNode contained.
The secondary namenode is a backup of the namenode only, not of the datanodes, right?
Right. It just contains the metadata of the DataNodes, such as DataNode addresses and properties, including the block report of each DataNode.
If a node fails before completion of a job, so there is a job pending in the job tracker, does that job continue or restart from the beginning on a free node?
Hadoop will forcefully try to continue the job. Again, it depends on the replication factor, rack awareness, and other configuration made by the admin. If Hadoop's HDFS best practices are followed, the job will not fail: the JobTracker will get the address of a node holding a replica and continue there.
How can we restore the entire cluster data if anything happens?
By Restarting it.
And my final question: can we use a C program in MapReduce (for example, bubble sort in MapReduce)?
Yes, you can use any programming language that supports standard input/output (file read/write) operations, for example via Hadoop Streaming.
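For example, with Hadoop Streaming the mapper and reducer are just executables that read lines from standard input and write tab-separated key/value lines to standard output, so a compiled C program can be plugged in exactly the same way as this Python sketch (the script name and the way it is invoked are illustrative).

```python
#!/usr/bin/env python3
# Rough sketch of a Hadoop Streaming word-count job: run the script as "mapper" or "reducer".
# Streaming only requires reading stdin and writing "key<TAB>value" lines to stdout,
# so a C binary doing the same thing would work identically.
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Streaming sorts the mapper output by key, so equal words arrive consecutively.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["mapper"] else reducer()
```

A job submission would then look roughly like `hadoop jar hadoop-streaming.jar -input /in -output /out -mapper "wordcount.py mapper" -reducer "wordcount.py reducer" -file wordcount.py` (the jar and file names here are illustrative).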
I just gave it a try. I hope it helps you as well as others.
Suggestions/improvements are welcome.
Currently a Hadoop cluster has a single point of failure, which is the namenode.
And about the secondary namenode issue (from the Apache wiki):
The term "secondary name-node" is somewhat misleading. It is not a
name-node in the sense that data-nodes cannot connect to the secondary
name-node, and in no event it can replace the primary name-node in
case of its failure.
The only purpose of the secondary name-node is to perform periodic
checkpoints. The secondary name-node periodically downloads current
name-node image and edits log files, joins them into new image and
uploads the new image back to the (primary and the only) name-node.
See User Guide.
So if the name-node fails and you can restart it on the same physical
node then there is no need to shutdown data-nodes, just the name-node
need to be restarted. If you cannot use the old node anymore you will
need to copy the latest image somewhere else. The latest image can be
found either on the node that used to be the primary before failure if
available; or on the secondary name-node. The latter will be the
latest checkpoint without subsequent edits logs, that is the most
recent name space modifications may be missing there. You will also
need to restart the whole cluster in this case.
There are tricky ways to overcome this single point of failure. If you are using the Cloudera distribution, one of the ways is explained here. The MapR distribution has a different way to handle this SPOF.
Finally, you can use just about any programming language to write MapReduce jobs via Hadoop Streaming.
Although it is too late to answer your question, it may help others. First we will discuss the roles of the Hadoop 1.x daemons and then your issues.
1. What is the role of the Secondary NameNode?
It is not exactly a backup node. It reads the edit logs and periodically creates an updated fsimage file for the NameNode. It periodically gets metadata from the NameNode and keeps it, and that copy is used when the NameNode fails.
2. What is the role of the NameNode?
It is the manager of all daemons. It is the master JVM process which runs on the master node. It interacts with the DataNodes.
3. What is the role of the JobTracker?
It accepts jobs and distributes them to the TaskTrackers for processing on the DataNodes; this is called the map process.
4. What is the role of the TaskTrackers?
They execute the provided program to process the existing data on the DataNode; that process is called map.
Limitations of Hadoop 1.x
Single point of failure:
the NameNode, so we should use high-quality hardware for it. If the NameNode fails, everything becomes inaccessible.
Solutions
The solution to the single point of failure is Hadoop 2.x, which provides high availability.
High availability with Hadoop 2.x
Now, your questions:
How can we restore the entire cluster data if anything happens?
If the cluster fails, we can restart it.
If a node fails before completion of a job, so there is a job pending in the job tracker, does that job continue or restart from the beginning on a free node?
We have 3 replicas of the data (I mean blocks) by default to get high availability; how many replicas are kept depends on what the admin has set. So the JobTracker will continue with another copy of the data on another DataNode.
Can we use a C program in MapReduce (for example, bubble sort in MapReduce)?
Basically, MapReduce is an execution engine that solves/processes big data problems (storage plus processing) in a distributed manner. We do file handling and all other basic operations using MapReduce programming, so we can use any language in which we can handle files as per the requirements.
Hadoop 1.x architecture
Hadoop 1.x has 4 basic daemons.
I just gave it a try. Hope it helps you as well as others.
Suggestions/improvements are welcome.
