Right Keytab for Nifi Processor - apache-nifi

I have a 3-node NiFi cluster (installed via Hortonworks DataFlow, HDF) in a Kerberized environment. As part of the installation, Ambari created a NiFi service keytab.
Can I use this nifi.service.keytab to configure processors such as PutHDFS that talk to Hadoop services?
The nifi.service.keytab is machine specific and always expects principal names that include the host, e.g. nifi/HOSTNAME@REALM.
If I configure my processor with nifi/NODE1_Hostname@REALM, I see Kerberos authentication exceptions on the other two nodes.
How do I resolve the hostname dynamically so that each node can use its own NiFi service keytab?

The keytab principal name field supports the Apache NiFi Expression Language, so you can use an expression like nifi/${hostname()}@REALM, and each node will resolve that expression independently to something like nifi/host1.nifi.com@REALM or nifi/host2.nifi.com@REALM.
If you do not want to use the literal hostname, you can also set an environment variable on each node (export NIFI_KEYTAB_HOSTNAME="modified_host_format_1", etc.) and reference it in the EL expression the same way: nifi/${NIFI_KEYTAB_HOSTNAME}@REALM.
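To make the per-node resolution concrete, here is a minimal Python sketch that mimics the substitution outside NiFi. It is not NiFi code: build_principal is a hypothetical helper, EXAMPLE.COM is a placeholder realm, and the only thing taken from the answer above is the NIFI_KEYTAB_HOSTNAME variable name. It assumes the usual Kerberos principal form service/host@REALM.

```python
import os
import socket


def build_principal(service, realm, env_var="NIFI_KEYTAB_HOSTNAME"):
    """Return service/<host>@<realm>, preferring the env var override."""
    # Fall back to this machine's fully qualified name when no override is set.
    host = os.environ.get(env_var) or socket.getfqdn()
    return f"{service}/{host}@{realm}"


if __name__ == "__main__":
    # Each node running this prints a principal containing its own host name.
    print(build_principal("nifi", "EXAMPLE.COM"))
```

Run on each node, this yields a different principal per machine, which is the same effect the expression has when every NiFi node evaluates it locally.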

Related

How can I use NiFi to read/write directly from ADLS without HDInsight

We would like to use NiFi to connect to ADLS (using PutHDFS and FetchHDFS) without having to install HDInsight. Subsequently we want to use Azure Databricks to run Spark jobs, and we hope this can be done with NiFi's ExecuteSparkInteractive processor. In all the examples I could find, HDP or HDInsight invariably seems to be required.
Can anyone share pointers on how this can be done without HDP or HDInsight?
Thanks in advance.
As far as I can tell, ADLS won't work well (or at all) with the *HDFS processors available in Apache NiFi. A feature request was filed (NIFI-4360) and a PR was raised for it (#2158); it was briefly reviewed, but there has not been much progress since. You can fork that PR or copy its code and, hopefully, review it.
I did a test setup more than a year ago. The PutHDFS processor worked with some additional classpath resources. The following dependencies were required:
adls2-oauth2-token-provider-1.0.jar
azure-data-lake-store-sdk-2.0.4-SNAPSHOT.jar
hadoop-azure-datalake-2.0.0-SNAPSHOT.jar
jackson-core-2.2.3.jar
okhttp-2.4.0.jar
okio-1.4.0.jar
See also the following blog for more details. You can copy the libs, core-site.xml and hdfs-site.xml from an HDInsight setup to the machine where NiFi is running. You should also set dfs.adls.home.mountpoint properly, pointing to the root or a data directory. Be aware that this is not officially supported, so perhaps you should also consider Azure Data Factory or StreamSets as an option for data ingest.
PutHDFS does not expect a classic Hadoop cluster in the first place. It expects core-site.xml only by convention. As the example below shows, a minimal config file is enough to make PutHDFS work with ADLS.
Using the NiFi PutHDFS processor to ingest into ADLS is simple. The steps below lead to the solution:
Have ADLS Gen1 set up (ADLS has been renamed ADLS Gen1).
Additionally, have OAuth authentication set up for your ADLS account. See here
Create an empty core-site.xml for configuring the PutHDFS processor.
Update core-site.xml with the following properties (I am using client-keys mode for auth in this example):
fs.defaultFS = adl://<yourADLname>.azuredatalakestore.net
fs.adl.oauth2.access.token.provider.type = ClientCredential
fs.adl.oauth2.refresh.url = <Your Azure refresh endpoint>
fs.adl.oauth2.client.id = <Your application id>
fs.adl.oauth2.credential = <Your key>
Update your NiFi PutHDFS processor to refer to the core-site.xml created in the previous step and to the additional ADLS libraries (hadoop-azure-datalake-3.1.1.jar and azure-data-lake-store-sdk-2.3.1.jar).
Update the upstream processors and test.
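The property list in the steps above is written as name = value pairs, while core-site.xml needs the Hadoop property/name/value XML layout. As a rough sketch, the following Python script renders those same five properties into that layout. render_core_site and ADLS_PROPS are hypothetical names, and the angle-bracket values are the placeholders from the steps, not real settings.

```python
import xml.etree.ElementTree as ET


def render_core_site(props):
    """Render a dict of Hadoop properties as core-site.xml text."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Placeholder values: substitute your own account, endpoint, id and key.
ADLS_PROPS = {
    "fs.defaultFS": "adl://<yourADLname>.azuredatalakestore.net",
    "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
    "fs.adl.oauth2.refresh.url": "<Your Azure refresh endpoint>",
    "fs.adl.oauth2.client.id": "<Your application id>",
    "fs.adl.oauth2.credential": "<Your key>",
}

if __name__ == "__main__":
    print(render_core_site(ADLS_PROPS))
```

The placeholders contain angle brackets, so they show up XML-escaped in the output until you replace them with real values.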

Confluent HDFS Sink Connector: How to configure custom hadoop user and group?

We are currently running the Confluent HDFS Sink Connector in a Docker container to write data from Kafka (a separate Kafka cluster) to HDFS (a separate Hadoop cluster). By default the connector writes data to HDFS as the root user and wheel group.
How can I configure the connector to use a specific Hadoop user/group? Is there an environment variable I need to set in Docker?
Thanks.
The Java process in the Docker container runs as root.
You need to either make your own container with your own user account or run the Connect Workers as a different Unix account in some other way.
You could try setting the HADOOP_IDENT_USER or HADOOP_USER_NAME environment variables, but I think these are only picked up by the Hadoop scripts, not the Java API.
Keep in mind that user accounts in Hadoop don't really matter if you're not using a Kerberized cluster.

Spark Job Submission with AWS Hadoop cluster setup

I have a Hadoop cluster set up on AWS EC2, but my development setup (Spark) is on a local Windows system. I am able to connect to the AWS Hive Thrift server, but I get a connection refused error when I try to submit a job from my local Spark configuration. Please note that my Windows user name is different from the user name under which the Hadoop ecosystem runs on the AWS server. Can anyone explain how the underlying system works in this setup?
1) When I submit a job from my local Spark to the Hive Thrift server and it involves an MR job, will the AWS Hive setup submit that job to the NN under its own identity, or will it carry forward my Spark setup's identity?
2) In my configuration, do I need to run Spark locally under the same user name that I have on the Hadoop cluster in AWS?
3) Do I also need to configure SSL to authenticate my local system?
Please note that my local system is not part of the Hadoop cluster, and I cannot add it to the AWS Hadoop cluster.
Please let me know what the actual setup should be for an environment where my Hadoop cluster is in AWS and Spark runs on my local machine.
To simplify the problem, you are free to compile your code locally, produce an uber/shaded JAR, SCP it to any Spark client in AWS, then run spark-submit --master yarn --class <classname> <jar-file>.
However, if you just want to run Spark locally against EC2, then you can set a few properties programmatically. See this post:
Spark submit YARN mode HADOOP_CONF_DIR contents
Alternatively, as mentioned in that post, the best way would be to get your cluster's XML files from HADOOP_CONF_DIR and copy them into your application's classpath. This is typically src/main/resources for a Java/Scala application.
Not sure about Python, R, or the SSL configs.
And yes, you need to add a remote user account for your local Windows username on all the nodes. This is how user impersonation will be handled by Spark executors.
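As a sketch of the "set a few properties" route: Spark forwards any property named spark.hadoop.<key> into the Hadoop configuration it builds, so the entries in the cluster's XML files can be turned into such properties. The Python snippet below does that conversion for one file. hadoop_xml_to_spark_conf is a hypothetical helper, and the XML content and host names are made-up placeholders, not values from the question.

```python
import xml.etree.ElementTree as ET


def hadoop_xml_to_spark_conf(xml_text):
    """Map each <property> in a Hadoop *-site.xml to a spark.hadoop.* key."""
    conf = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name and value is not None:
            conf["spark.hadoop." + name.strip()] = value.strip()
    return conf


# Hypothetical cluster settings; the host names are placeholders.
SAMPLE_SITE_XML = """
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.internal:8020</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm.example.internal</value>
  </property>
</configuration>
"""

if __name__ == "__main__":
    # Print the properties in the form spark-submit accepts.
    for key, value in hadoop_xml_to_spark_conf(SAMPLE_SITE_XML).items():
        print(f"--conf {key}={value}")
```

The resulting pairs can be passed with --conf on spark-submit or set on the Spark configuration in code. Copying the XML files onto the classpath, as described above, avoids this step altogether.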

NiFi error with PutSolrContentStream processor

I am building a flow that ingests tweets with NiFi and analyzes them in Solr. The tweets arrive in NiFi, but nothing happens in Solr. The PutSolrContentStream processor reports this error: could not connect to localhost:2181/solr, cluster not found/not ready.
Are you running in clustered mode?
I just set up a local (Standard mode) Solr core and in the Solr Location property I used http://localhost:8983/solr/myDemoCore. Might you be forgetting to mention the core's name?
If you haven't created a core:
cd path/to/solr/bin/
./solr create -c myDemoCore
./solr restart
Then use http://localhost:8983/solr/myDemoCore in the Solr Location property and try again.
Edit: I see that you're using Windows, so just change the path notation accordingly.
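The mismatch in the question (a ZooKeeper-style address, localhost:2181/solr, where a standalone setup needs a core URL) can be caught with a quick sanity check. The sketch below assumes the processor's Solr Type is either Standard (HTTP URL ending in a core name) or Cloud (ZooKeeper address). check_solr_location is a hypothetical helper and not part of NiFi or Solr.

```python
from urllib.parse import urlparse


def check_solr_location(solr_type, location):
    """Return a problem description, or None if the value looks plausible."""
    parsed = urlparse(location)
    is_url = parsed.scheme in ("http", "https")
    if solr_type == "Standard":
        if not is_url:
            return "Standard expects an http(s) URL, not a ZooKeeper address"
        # Expect a path like /solr/<core>.
        parts = [p for p in parsed.path.split("/") if p]
        if len(parts) < 2:
            return "URL has no core name after /solr"
        return None
    if solr_type == "Cloud":
        if is_url:
            return "Cloud expects a ZooKeeper address, not a Solr URL"
        return None
    return f"unknown Solr Type: {solr_type}"


if __name__ == "__main__":
    print(check_solr_location("Standard", "localhost:2181/solr"))
    print(check_solr_location("Standard", "http://localhost:8983/solr/myDemoCore"))
```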

Resource manager configuration in QJM HA setup of Hadoop 2.6.0

We have configured Hadoop for high availability with the Quorum Journal Manager so that we get automatic failover. It is working as expected.
But we are not sure how to configure the ResourceManager in version 2.6.0.
The ResourceManager is needed for running MapReduce programs. We need the configuration steps for setting up ResourceManager failover in Hadoop 2.6.0 between the name nodes.
I don't know about MapReduce. If you have multiple ResourceManagers (one active at a time), you need to set their logical names:
http://gethue.com/hadoop-tutorial-yarn-resource-manager-high-availability-ha-in-mr2/
However, I am not sure whether the logical names rm1, rm2 (...) will be the same. Can anybody confirm this?
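For reference, here is a sketch of the yarn-site.xml properties that ResourceManager HA typically needs, written as a small Python helper so the link between the logical ids and the per-id keys is explicit. The property names follow the Hadoop YARN ResourceManager HA documentation as far as I know. The cluster id, host names and ZooKeeper quorum are placeholders, and rm_ha_properties is a hypothetical helper, not part of Hadoop.

```python
def rm_ha_properties(cluster_id, rm_hosts, zk_address):
    """Build yarn-site.xml properties for ResourceManager HA.

    rm_hosts maps a logical id (e.g. "rm1") to that ResourceManager's host.
    """
    props = {
        "yarn.resourcemanager.ha.enabled": "true",
        "yarn.resourcemanager.cluster-id": cluster_id,
        # The logical ids are labels chosen by the administrator.
        "yarn.resourcemanager.ha.rm-ids": ",".join(rm_hosts),
        "yarn.resourcemanager.zk-address": zk_address,
    }
    for rm_id, host in rm_hosts.items():
        props[f"yarn.resourcemanager.hostname.{rm_id}"] = host
    return props


if __name__ == "__main__":
    example = rm_ha_properties(
        "cluster1",  # placeholder cluster id
        {"rm1": "master1.example.com", "rm2": "master2.example.com"},
        "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181",
    )
    for name, value in example.items():
        print(f"{name} = {value}")
```

In this sketch the ids listed in yarn.resourcemanager.ha.rm-ids are the same strings used as suffixes on the per-ResourceManager keys, so whatever names are chosen only have to match across those properties.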
It can be achieved in Oozie by using "logicaljt" as the job-tracker value in your workflow.
Source: https://issues.cloudera.org/browse/HUE-1631

Resources