Data transfer approach between NiFi clusters - hortonworks-data-platform

I am looking for a reliable data transfer approach between different NiFi clusters.
I have two NiFi clusters - one fetching data from the source and another pushing data into Hive/HDFS.
I need to transfer data from the first NiFi cluster to the second. Is there any component available in NiFi to do this?
NiFi Cluster 1
GetFile --> Publish to Port
NiFi Cluster 2
Read from Port --> Publish to HDFS
Thanks

Apache NiFi provides the site-to-site feature for transferring data between two instances. You can read about it here:
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#site-to-site
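For the flow sketched in the question, the usual site-to-site pattern is a Remote Process Group on cluster 1 pointing at an Input Port on cluster 2 (which then feeds PutHDFS/PutHiveQL). Site-to-site also has to be enabled in nifi.properties on the receiving side. As a rough illustration, here is a minimal Python sketch that checks the relevant keys in a nifi.properties file; the property names come from the NiFi documentation, but the file path and the helper itself are assumptions for this example.

```python
# Quick sanity check: is site-to-site enabled in nifi.properties?
# Property names per the NiFi admin guide; the path below is an assumption.

NIFI_PROPERTIES = "/opt/nifi/conf/nifi.properties"  # adjust to your install

# Keys that control site-to-site on the receiving instance/cluster.
S2S_KEYS = [
    "nifi.remote.input.host",         # hostname advertised to remote clients
    "nifi.remote.input.socket.port",  # must be set for RAW socket transport
    "nifi.remote.input.secure",       # true/false - whether S2S uses TLS
    "nifi.remote.input.http.enabled", # true to allow HTTP(S) transport
]

def load_properties(path):
    """Parse a Java-style .properties file into a dict."""
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if "=" in line:
                key, _, value = line.partition("=")
                props[key.strip()] = value.strip()
    return props

if __name__ == "__main__":
    props = load_properties(NIFI_PROPERTIES)
    for key in S2S_KEYS:
        print(f"{key} = {props.get(key, '<not set>')}")
```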

Related

How to import data from HDFS (Hadoop) into ElasticSearch?

We have a big Hadoop cluster and recently installed Elasticsearch for evaluation.
Now we want to bring data from HDFS into Elasticsearch.
Elasticsearch is installed in a different cluster, and so far we could run a Beeline or HDFS script to extract data from Hadoop into some file and then bulk-load that local file into Elasticsearch.
Wondering if there is a direct connection from HDFS to Elasticsearch.
I started reading about it here:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/install.html
But since our team is not DevOps (we neither configure nor manage the Hadoop cluster) and can only access Hadoop via Kerberos/user/pass, is it possible to configure a direct connection (and how) without involving the whole DevOps team that manages the Hadoop cluster to install and set up all these libraries first?
How can we do it from the client side?
Thanks.
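Purely as an illustration of the client-side route described in the question (read out of Hadoop, then bulk-load into Elasticsearch), here is a rough Python sketch using the third-party hdfs (WebHDFS) and elasticsearch client packages. The host names, paths, index name and newline-delimited-JSON layout are assumptions, and a Kerberized cluster would use hdfs.ext.kerberos.KerberosClient instead of InsecureClient.

```python
# Client-side HDFS -> Elasticsearch bulk load (sketch, not production code).
# Assumes: WebHDFS is reachable, the data is newline-delimited JSON, and the
# third-party "hdfs" and "elasticsearch" Python packages are installed.
import json

from hdfs import InsecureClient            # hdfs.ext.kerberos.KerberosClient for Kerberos
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

HDFS_URL = "http://namenode.example.com:50070"   # WebHDFS endpoint (assumption)
HDFS_PATH = "/data/export/events.json"           # file to load (assumption)
ES_HOSTS = ["http://es-node.example.com:9200"]   # Elasticsearch nodes (assumption)
ES_INDEX = "events"                              # target index (assumption)

def actions(reader):
    """Turn each JSON line read from HDFS into a bulk-index action."""
    for line in reader:
        if line.strip():
            yield {"_index": ES_INDEX, "_source": json.loads(line)}

hdfs_client = InsecureClient(HDFS_URL, user="etl")
es = Elasticsearch(ES_HOSTS)

# Stream the file over WebHDFS line by line and bulk-index it.
with hdfs_client.read(HDFS_PATH, encoding="utf-8", delimiter="\n") as reader:
    ok, errors = bulk(es, actions(reader), chunk_size=1000)
    print(f"indexed {ok} documents, {len(errors)} errors")
```

The elasticsearch-hadoop connector linked in the question is the more integrated option, but a WebHDFS-based script like the above stays entirely on the client side.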

What is the difference between Apache Flume and Apache Storm?

What is the difference between Apache Flume and Apache Storm?
Is it possible to ingest log data into a Hadoop cluster using Storm?
Both are used for streaming data, so can Storm be used as an alternative to Flume?
Apache Flume is a service for collecting large amounts of streaming data, particularly logs. Flume pushes data to consumers using mechanisms it calls data sinks. Flume can push data to many popular sinks right out of the box, including HDFS, HBase, Cassandra, and some relational databases.
Apache Storm is built for streaming data. It is the bridge between batch processing and stream processing, which Hadoop is not natively designed to handle. Storm runs continuously, processing a stream of incoming data and dicing it into batches so Hadoop can more easily ingest it. Data sources are called spouts and each processing node is a bolt. Bolts perform computations and processing on the data, including pushing output to data stores and other services.
If you need something that works out of the box, choose Flume once you have decided whether pushing or pulling data makes more sense for your setup. If streaming data is, for now, just a small add-on to your already developed Hadoop environment, Storm is a good choice.
So yes, it is possible to ingest log data into a Hadoop cluster using Storm, and Storm can be used as an alternative to Flume.
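To make the spout/bolt terminology concrete, below is a minimal bolt sketch written with streamparse, a third-party Python binding for Storm. It runs as part of a Storm topology submitted through streamparse's tooling rather than as a standalone script, and the tuple layout (one raw log line per tuple) is an assumption for this example.

```python
# Minimal Storm bolt sketch using the third-party streamparse library.
# Runs inside a Storm topology (not standalone); the tuple layout is assumed.
from streamparse import Bolt


class LogLineBolt(Bolt):
    """Takes a raw log line, extracts a level field, and re-emits it."""

    outputs = ["level", "line"]

    def process(self, tup):
        line = tup.values[0]            # assume the spout emits [line]
        level = line.split(" ", 1)[0]   # naive parse: first token is the level
        # A downstream bolt could write to HDFS, HBase, etc.
        self.emit([level, line])
```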

HDFS 2.8+: are local data nodes preferred for reading by clients?

I was wondering if there is a read-side equivalent to the configuration option dfs.namenode.block-placement-policy.default.prefer-local-node, i.e. do HDFS clients try to read from data nodes that share the same host name/IP address?
Example: is it a good idea to deploy a SolrCloud cluster alongside the HDFS data nodes? Will the Solr daemons prefer the local data node daemons for reading?

HDF NiFi - Does NiFi write provenance/data on HDP nodes?

Hi
I have an HDF cluster with 3 NiFi instances which launches jobs (Hive/Spark) on an HDP cluster. Usually NiFi writes all information to the different repositories available on the local machine.
My question is - does NiFi write any data or provenance information, or do any spilling, onto HDP nodes (e.g. the data nodes in the HDP cluster) while accessing HDFS, Hive or Spark services?
Thanks
Apache NiFi does not use HDFS for any of its internal repositories/data. The only interaction between NiFi and Hadoop services would be through specific processors made to interact with these services, such as PutHDFS, PutHiveQL, etc.
Provenance data can be pushed out of NiFi using the SiteToSiteProvenanceReportingTask and then stored in whatever location is appropriate (HDFS, HBase, etc).

What is the best practice for NiFi production deployment?

I have a three-node NiFi cluster. We just installed the NiFi packages on Linux machines and set up the cluster with a separate ZooKeeper cluster. I am planning to monitor NiFi performance via Nagios, but we saw that Hortonworks Ambari also provides features for management and monitoring.
What is the best practice for NiFi deployment in production?
How should we scale up?
How can we monitor NiFi?
Should we monitor queue/process performance?
Should we use something like Ambari?
Regards.
Edit-1:
#James actually I am collecting user event logs from several sources within the company.
All events are first written to Kafka. NiFi consumes from Kafka and does simple transformations, like promoting a field from the payload to an attribute.
After the transformations, data is written to both Elasticsearch and HDFS. Before writing to HDFS we merge flowfiles, so the HDFS writes happen in batches.
I have around 50k events/s.
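For the monitoring questions above, one lightweight option alongside (or instead of) Ambari is to poll NiFi's REST API and wrap the result in a Nagios check, for example for total queued flowfiles. A rough sketch follows; the /nifi-api/flow/status endpoint and its field names should be verified against your NiFi version, and the host, port and thresholds are assumptions.

```python
# Sketch of a Nagios-style check that polls NiFi's REST API for queue depth.
# Verify the endpoint and field names against your NiFi version; host, port
# and thresholds below are assumptions for this example.
import sys
import requests

NIFI_URL = "http://nifi-node-1.example.com:8080"  # any node in the cluster (assumption)
WARN_QUEUED = 100_000    # flowfiles queued -> WARNING (assumption)
CRIT_QUEUED = 500_000    # flowfiles queued -> CRITICAL (assumption)

def check_queue():
    resp = requests.get(f"{NIFI_URL}/nifi-api/flow/status", timeout=10)
    resp.raise_for_status()
    status = resp.json()["controllerStatus"]
    queued = status["flowFilesQueued"]
    threads = status["activeThreadCount"]
    msg = f"queued={queued} activeThreads={threads}"
    if queued >= CRIT_QUEUED:
        print(f"CRITICAL - {msg}")
        return 2
    if queued >= WARN_QUEUED:
        print(f"WARNING - {msg}")
        return 1
    print(f"OK - {msg}")
    return 0

if __name__ == "__main__":
    sys.exit(check_queue())
```

The exit codes (0/1/2) follow the usual Nagios plugin convention, so the script can be dropped in as a command check.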
