NiFi error with PutSolrContentStream processor - hadoop

I am creating a flow of tweets with NiFi and analyzing them in Solr. Tweets are coming into NiFi, but nothing is arriving in Solr, and the PutSolrContentStream processor reports the following error:
could not connect to localhost:2181/solr, cluster not found/not ready

Are you running Solr in clustered (SolrCloud) mode?
I just set up a local standalone Solr core, and in the Solr Location property I used http://localhost:8983/solr/myDemoCore. Might you be forgetting to include the core's name?
If you haven't created a core:
cd path/to/solr/bin/
./solr create -c myDemoCore
./solr restart
Then use http://localhost:8983/solr/myDemoCore in the Solr Location property and try again.
Edit: I see that you're using Windows; just change your path notation accordingly.
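For reference, a rough Windows equivalent of the commands above (the install path is a placeholder, and 8983 is Solr's default port):
cd C:\path\to\solr\bin
solr.cmd create -c myDemoCore
solr.cmd restart -p 8983
Then point the Solr Location property at http://localhost:8983/solr/myDemoCore as before.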

Related

How can I use NiFi to read/write directly from ADLS without HDInsight

We would like to use NiFi to connect to ADLS (using PutHDFS and FetchHDFS) without having to install HDInsight. Subsequently we want to use Azure Databricks to run Spark jobs, and we hope this can be done using NiFi's ExecuteSparkInteractive processor. In all the examples I could find, HDP or HDInsight invariably seems to be required.
Can anyone share pointers on how this can be done without needing HDP or HDInsight?
Thanks in advance.
As far as I can tell, ADLS won't work well (or at all) with the *HDFS processors available in Apache NiFi. There was a feature request for this, NIFI-4360, and a subsequent pull request, #2158, but it was only briefly reviewed and hasn't seen much progress since. You can fork or copy that code base and review it yourself.
I did a test setup more than a year ago. The PutHDFS processor worked with some additional classpath resources. The following dependencies were required:
adls2-oauth2-token-provider-1.0.jar
azure-data-lake-store-sdk-2.0.4-SNAPSHOT.jar
hadoop-azure-datalake-2.0.0-SNAPSHOT.jar
jackson-core-2.2.3.jar
okhttp-2.4.0.jar
okio-1.4.0.jar
See also the following blog post for more details. You can copy the libs, core-site.xml, and hdfs-site.xml from an HDInsight setup to the machine where NiFi is running. You should also set dfs.adls.home.mountpoint properly, pointing to the root or to a data directory. Be aware that this is not officially supported, so perhaps you should also consider Azure Data Factory or StreamSets as options for data ingest.
PutHDFS does not expect a classic Hadoop cluster in the first place; it expects a core-site.xml only by convention. As the example below shows, a minimalist config file is enough to make PutHDFS work with ADLS.
Using the NiFi PutHDFS processor to ingest into ADLS is simple. The following steps lead to the solution:
Have ADLS Gen1 set up (ADLS has been renamed to ADLS Gen1).
Additionally, have OAuth authentication set up for your ADLS account. See here.
Create an empty core-site.xml for configuring the PutHDFS processor.
Update core-site.xml with the following properties (I am using client keys / ClientCredential auth in this example):
fs.defaultFS = adl://<yourADLname>.azuredatalakestore.net
fs.adl.oauth2.access.token.provider.type = ClientCredential
fs.adl.oauth2.refresh.url = <Your Azure refresh endpoint>
fs.adl.oauth2.client.id = <Your application id>
fs.adl.oauth2.credential = <Your key>
Update your NiFi PutHDFS processor to refer to the core-site.xml and the additional ADLS libraries (hadoop-azure-datalake-3.1.1.jar and azure-data-lake-store-sdk-2.3.1.jar) created in the previous steps, as shown in the sketch after these steps.
Update the upstream processors and test.
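Rendered as XML, a minimal core-site.xml with those four OAuth properties plus fs.defaultFS would look roughly like the sketch below; every value is a placeholder for your own ADLS account and Azure AD application.
<configuration>
  <!-- Sketch only: replace every value with your own ADLS / Azure AD details -->
  <property>
    <name>fs.defaultFS</name>
    <value>adl://yourADLname.azuredatalakestore.net</value>
  </property>
  <property>
    <name>fs.adl.oauth2.access.token.provider.type</name>
    <value>ClientCredential</value>
  </property>
  <property>
    <name>fs.adl.oauth2.refresh.url</name>
    <value>https://login.microsoftonline.com/your-tenant-id/oauth2/token</value>
  </property>
  <property>
    <name>fs.adl.oauth2.client.id</name>
    <value>your-application-id</value>
  </property>
  <property>
    <name>fs.adl.oauth2.credential</name>
    <value>your-key</value>
  </property>
</configuration>
In the PutHDFS processor itself, point Hadoop Configuration Resources at this core-site.xml, list the two ADLS jars as comma-separated paths in Additional Classpath Resources, and set Directory to the target ADLS folder.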

Right Keytab for NiFi Processor

I have a 3-node NiFi cluster (installed via Hortonworks DataFlow, HDF) in a Kerberized environment. As part of the installation, Ambari has created a nifi service keytab.
Can I use this nifi.service.keytab for configuring processors like PutHDFS that talk to Hadoop services?
The nifi.service.keytab is machine-specific and always expects principal names with machine information, e.g. nifi/HOSTNAME@REALM.
If I configure my processor with nifi/NODE1_HOSTNAME@REALM, I see a Kerberos authentication exception on the other two nodes.
How do I dynamically resolve the hostname so I can use the nifi service keytab?
The keytab principal name field supports Apache NiFi Expression Language, so you can use an expression like nifi/${hostname()}@REALM, and each node will resolve it independently to something like nifi/host1.nifi.com@REALM, nifi/host2.nifi.com@REALM, and so on.
If you do not want it to be the literal hostname, you can also set an environment variable on each node (export NIFI_KEYTAB_HOSTNAME="modified_host_format_1", etc.) and reference the environment variable in the EL expression the same way: nifi/${NIFI_KEYTAB_HOSTNAME}@REALM.
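A small, hypothetical sketch of that approach (the variable name and value are only examples; placing the export in each node's bin/nifi-env.sh ensures it is set before NiFi starts):
export NIFI_KEYTAB_HOSTNAME="host1.nifi.com"    # on node 1; each node exports its own value
Every node can then share the same processor configuration, with nifi/${NIFI_KEYTAB_HOSTNAME}@REALM as the Kerberos Principal and each node's own nifi.service.keytab path as the Kerberos Keytab.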

Does an embedded Flume agent need Hadoop to function on the cluster?

I am trying to write an embedded Flume agent into my web service to transfer my logs to another Hadoop cluster, where my Flume agent is running. To work with the embedded Flume agent, do we need Hadoop to be running on the server where my web service runs?
TL;DR: I think not.
Longer version: I haven't checked, but the developer guide (https://flume.apache.org/FlumeDeveloperGuide.html#embedded-agent) says:
Note: The embedded agent has a dependency on hadoop-core.jar.
And in the user guide (https://flume.apache.org/FlumeUserGuide.html#hdfs-sink), you can specify the HDFS path:
HDFS directory path (eg hdfs://namenode/flume/webdata/)
On the other hand, are you sure you want to work with the embedded agent instead of running Flume where you want to put the data and using, for example, an HTTP Source (https://flume.apache.org/FlumeUserGuide.html#http-source), or any other source you can send data to?
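If you go that route, a minimal flume.conf for a standalone agent on the Hadoop side might look roughly like this (the agent/component names, port, and HDFS path are placeholders):
# placeholder agent a1: HTTP source -> memory channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = http
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/webdata/
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
Started with something like bin/flume-ng agent --conf conf --conf-file flume.conf --name a1, the agent accepts events over HTTP (by default in Flume's JSON event format), so the web service only needs an HTTP client rather than any Hadoop libraries.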

Accessing Hadoop data using REST service

I am trying to update our HDP architecture so that data residing in Hive tables can be accessed through REST APIs. What are the best approaches for exposing data from HDP to other services?
This is my initial idea:
I am storing data in Hive tables and want to expose some of that information through a REST API, so I thought that using HCatalog/WebHCat would be the best solution. However, I found out that it only allows querying metadata.
What are the options that I have here?
Thank you
You can very well use WebHDFS, which is basically a REST service over HDFS.
Please see the documentation below:
https://hadoop.apache.org/docs/r1.0.4/webhdfs.html
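For example, a couple of WebHDFS calls look roughly like this (host and paths are placeholders; 50070 is the default NameNode HTTP port in Hadoop 2.x):
# list a warehouse directory
curl -i "http://namenode.example.com:50070/webhdfs/v1/apps/hive/warehouse/mytable?op=LISTSTATUS"
# read a file (follow the redirect to the DataNode)
curl -i -L "http://namenode.example.com:50070/webhdfs/v1/apps/hive/warehouse/mytable/000000_0?op=OPEN"
Keep in mind that WebHDFS returns the raw files under the Hive warehouse directory rather than query results.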
The REST API gateway for the Apache Hadoop ecosystem is called Knox.
I would check it before exploring any other options. In other words, do you have any reason to avoid using Knox?
What version of HDP are you running?
The Knox component has been available for quite a while and is manageable via Ambari.
Can you get an instance of HiveServer2 running in HTTP mode?
This would give you SQL access through JDBC/ODBC drivers without requiring Hadoop config and binaries (other than those required for the drivers) on the client machines.
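As a sketch, assuming the default HTTP port and path, the relevant hive-site.xml settings and the resulting JDBC URL would look something like this (the host is a placeholder):
# hive-site.xml settings to switch HiveServer2 to HTTP transport
hive.server2.transport.mode = http
hive.server2.thrift.http.port = 10001
hive.server2.thrift.http.path = cliservice
# example JDBC URL a client would then use
jdbc:hive2://hs2-host.example.com:10001/default;transportMode=http;httpPath=cliservice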

Resource manager configuration in QJM HA setup of Hadoop 2.6.0

We have configured Hadoop for high availability so that we can achieve automatic failover using the Quorum Journal Manager. It is working as expected.
But we are not sure how to configure the ResourceManager in version 2.6.0.
The ResourceManager is needed for running MapReduce programs. We need the configuration steps for a ResourceManager failover setup in Hadoop 2.6.0 between the name nodes.
I don't know about MapReduce specifically, but if you have multiple ResourceManagers (one active at a time) you need to set their logical names:
http://gethue.com/hadoop-tutorial-yarn-resource-manager-high-availability-ha-in-mr2/
However, I am not sure whether the logical names rm1, rm2 (...) have to be the same everywhere. Can anybody confirm this?
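For reference, a sketch of the ResourceManager HA properties in yarn-site.xml for Hadoop 2.6.0; the cluster id, hostnames, and ZooKeeper quorum below are placeholders, and rm1/rm2 are just the conventional logical ids:
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>yarn-ha-cluster</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>namenode1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>namenode2.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
Clients and NodeManagers read the same rm1/rm2 ids to find the active ResourceManager, which is what lets MapReduce jobs continue after a failover.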
It can be achieved in Oozie by using "logicaljt" as the job-tracker value in your workflow.
Source: https://issues.cloudera.org/browse/HUE-1631
