RDD to HDFS - authentication error - RetryInvocationHandler - hadoop

I have an RDD that I wish to write to HDFS.
data.saveAsTextFile("hdfs://path/vertices")
This returns:
WARN RetryInvocationHandler: Exception while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over null. Not retrying because try once and fail.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]
I have checked Kerberos and I am properly authenticated.
How do I solve this?

First, check /etc/security/keytabs and verify that your Spark keytab is there.
This is the recommended path for the Kerberos configuration, although it may live elsewhere. Most importantly, the keytab must be present at the same path on every worker machine.
Another thing you can check is the Spark configuration directory, usually:
SPARK_HOME/conf
This folder should contain spark-defaults.conf, and that file needs the following entries:
spark.history.kerberos.enabled true
spark.history.kerberos.keytab /etc/security/keytabs/spark.keytab
spark.history.kerberos.principal user@DOMAIN.LOCAL

The issue was actually related to how you reference a file in HDFS when using Kerberos.
Rather than hdfs://<HOST>:<HTTP_PORT>, it is webhdfs://<HOST>:<HTTP_PORT>.
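Concretely, the original call works once the URI scheme is changed; host and port below are placeholders:

```scala
// Write the RDD via the WebHDFS gateway instead of the raw hdfs:// scheme
data.saveAsTextFile("webhdfs://<HOST>:<HTTP_PORT>/path/vertices")
```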

Related

Using parquet-tools with Kerberos CDH

I am trying to discover a schema from a parquet file.
I tried the command:
parquet-tools schema hdfs://<MY_IP>:8020//<PATH_TO_PARQUET>/<PARQUET_FILE_NAME>.parquet
But I got the error:
SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]
Does anyone know how to use parquet-tools in a Kerberized environment? I have the keytab with the permissions, and I ran the kinit command beforehand.
The hadoop.security.authentication property can take the values SIMPLE or KERBEROS.
From the error you get, it's clear that it is set to KERBEROS.
Make sure you run the command after kinit.
If that does not work, check your core-site.xml and hadoop-policy.xml files for proper configuration.
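A typical sequence looks like the following; the keytab path and principal are assumptions for illustration:

```shell
# Obtain a Kerberos ticket from the keytab, then run parquet-tools with it cached
kinit -kt /etc/security/keytabs/user.keytab user@EXAMPLE.COM
klist   # verify a valid ticket is present
parquet-tools schema hdfs://<MY_IP>:8020/<PATH_TO_PARQUET>/<PARQUET_FILE_NAME>.parquet
```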

Write to HDFS/Hive using NiFi

I'm using NiFi 1.6.0.
I'm trying to write to HDFS and to Hive (Cloudera) with NiFi.
On "PutHDFS" I configured the "Hadoop Configuration Resources" with the hdfs-site.xml and core-site.xml files and set the directories, but when I try to start it I get the following error:
"Failed to properly initialize processor. If still scheduled to run,
NiFi will attempt to initialize and run the Processor again after the
'Administrative Yield Duration' has elapsed. Failure is due to
java.lang.reflect.InvocationTargetException:
java.lang.reflect.InvocationTargetException"
On "PutHiveStreaming" I configured the "Hive Metastore URI" with
thrift://..., the database and the table name, and in "Hadoop
Configuration Resources" I put the hive-site.xml location. When
I try to start it I get the following error:
"Hive streaming connect/write error, flow file will be penalized and routed to retry.
org.apache.nifi.util.hive.HiveWriter$ConnectFailure: Failed connecting to EndPoint {metaStoreUri='thrift://myserver:9083', database='mydbname', table='mytablename', partitionVals=[]}".
How can I solve the errors?
Thanks.
For #1, if you got your *-site.xml files from the cluster, it's possible that they are using internal IPs to refer to components like the DataNodes and you won't be able to reach them directly using that. Try setting dfs.client.use.datanode.hostname to true in your hdfs-site.xml on the client.
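In hdfs-site.xml on the client side, that setting looks like:

```xml
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
```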
For #2, I'm not sure PutHiveStreaming will work against Cloudera, IIRC they use Hive 1.1.x and PutHiveStreaming is based on 1.2.x, so there may be some Thrift incompatibilities. If that doesn't seem to be the issue, make sure the client can connect to the metastore port (looks like 9083).
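To rule out a network problem for #2, you can check reachability of the metastore port from the NiFi host; the hostname below is the placeholder from the error message:

```shell
# Check that the Hive metastore port is reachable from the NiFi node
nc -vz myserver 9083
```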

NiFi ListHDFS cannot find directory, FileNotFoundException

I have a pipeline in NiFi of the form ListHDFS -> MoveHDFS. Attempting to run the pipeline, we see the following error log:
13:29:21 HSTDEBUG01631000-d439-1c41-9715-e0601d3b971c
ListHDFS[id=01631000-d439-1c41-9715-e0601d3b971c] Returning CLUSTER State: StandardStateMap[version=43, values={emitted.timestamp=1525468790000, listing.timestamp=1525468790000}]
13:29:21 HSTDEBUG01631000-d439-1c41-9715-e0601d3b971c
ListHDFS[id=01631000-d439-1c41-9715-e0601d3b971c] Found new-style state stored, latesting timestamp emitted = 1525468790000, latest listed = 1525468790000
13:29:21 HSTDEBUG01631000-d439-1c41-9715-e0601d3b971c
ListHDFS[id=01631000-d439-1c41-9715-e0601d3b971c] Fetching listing for /hdfs/path/to/dir
13:29:21 HSTERROR01631000-d439-1c41-9715-e0601d3b971c
ListHDFS[id=01631000-d439-1c41-9715-e0601d3b971c] Failed to perform listing of HDFS due to File /hdfs/path/to/dir does not exist: java.io.FileNotFoundException: File /hdfs/path/to/dir does not exist
Changing the ListHDFS path to /tmp seems to run OK, making me think the problem is with my permissions on the directory I'm trying to list. However, changing the NiFi user to a user that can access that directory (e.g. via hadoop fs -ls /hdfs/path/to/dir) by setting the bootstrap.properties value run.as=myuser and restarting (see https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#bootstrap_properties) still produces the same problem for the directory. The literal directory string being used that is not working is:
"/etl/ucera_internal/datagov_example/raw-ingest-tracking/version-1/ingest"
Does anyone know what is happening here? Thanks.
** Note: The hadoop cluster I am accessing does not have kerberos enabled (it is a secured MapR hadoop cluster).
Update: It appears that the MapR Hadoop implementation is different enough that it requires special steps for NiFi to work properly with it (see https://community.mapr.com/thread/10484 and http://hariology.com/integrating-mapr-fs-and-apache-nifi/). I may not get a chance to work on this problem for some time to see if this works (as certain requirements have changed), so I am dumping the links here for others who may have this problem in the meantime.
Could you make sure you have entered the correct path and that the directory exists in HDFS?
It seems the ListHDFS processor is not able to find the directory you configured in the Directory property, and the logs are not showing any permission-denied issues.
If the logs did show permission denied, you could change the NiFi run-as user in bootstrap.conf (NiFi needs to be restarted for the change to apply), or change the permissions on the directory so that NiFi can access it.
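For reference, the run-as user mentioned above is set in conf/bootstrap.conf; the myuser value below is a placeholder:

```
# conf/bootstrap.conf
run.as=myuser
```

NiFi must be restarted after editing this file for the change to take effect.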

Intellij Accessing file from hadoop cluster

As part of my intellij environment set up I need to connect to a remote hadoop cluster and access the files in my local spark code.
Is there any way to connect to hadoop remote environment without creating hadoop local instance?
A connection code snippet would be the ideal answer.
If you have a keytab file to authenticate to the cluster, this is one way I've done it:
val conf: Configuration = new Configuration()
conf.set("hadoop.security.authentication", "Kerberos")
UserGroupInformation.setConfiguration(conf)
UserGroupInformation.loginUserFromKeytab("user-name", "path/to/keytab/on/local/machine")
FileSystem.get(conf)
I believe to do this, you might also need some configuration xml docs. Namely core-site.xml, hdfs-site.xml, and mapred-site.xml. These are somewhere usually under /etc/hadoop/conf/.
You would put those under a directory in your program and mark it as Resources directory in IntelliJ.
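If you prefer to load the *-site.xml files explicitly rather than via a Resources directory, one possible sketch is below; all paths and the principal are placeholders, and this assumes the Hadoop client libraries are on your classpath:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

val conf = new Configuration()
// Load the cluster's config files copied to the local machine (paths are placeholders)
conf.addResource(new Path("/path/to/conf/core-site.xml"))
conf.addResource(new Path("/path/to/conf/hdfs-site.xml"))

conf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(conf)
UserGroupInformation.loginUserFromKeytab("user@REALM", "/path/to/user.keytab")

// FileSystem is now authenticated against the remote cluster
val fs = FileSystem.get(conf)
```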

Kerberos defaulting to wrong principal when accessing hdfs from remote server

I've configured kerberos to access hdfs from a remote server and I am able to authenticate and generate a ticket but when I try to access hdfs I am getting an error:
09/02 15:50:02 WARN ipc.Client: Exception encountered while connecting to the server : java.lang.IllegalArgumentException: Server has invalid Kerberos principal: nn/hdp.stack.com@GLOBAL.STACK.COM
In our krb5.conf file, we defined the admin_server and kdc under a different realm:
DEV.STACK.COM = {
admin_server = hdp.stack.com
kdc = hdp.stack.com
}
Why is it defaulting to a different realm that is also defined in our krb5.conf (GLOBAL.STACK.COM)? I have ensured that all our Hadoop XML files use @DEV.STACK.COM.
Any ideas? Any help much appreciated!
In your KRB5 conf, you should define explicitly which machine belongs to which realm, starting with generic rules then adding exceptions.
E.g.
[domain_realm]
stack.com = GLOBAL.STACK.COM
stack.dev = DEV.STACK.COM
vm123456.dc01.stack.com = DEV.STACK.COM
srv99999.dc99.stack.com = DEV.STACK.COM
We have a Spark cluster that needs to connect to HDFS on Cloudera, and we were getting the same error.
8/08/07 15:04:45 WARN Client: Exception encountered while connecting to the server : java.lang.IllegalArgumentException: Server has invalid Kerberos principal: hdfs/host1@REALM.DEV
java.io.IOException: Failed on local exception: java.io.IOException: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: hdfs/host1@REALM.DEV; Host Details : local host is: "workstation1.local/10.1.0.62"; destination host is: "host1":8020;
Based on the post by ohshazbot above and another post on the Cloudera site https://community.cloudera.com/t5/Cloudera-Manager-Installation/Kerberos-issue/td-p/29843, we modified the core-site.xml file (in the Spark cluster, ....spark/conf/core-site.xml), adding the following property, and the connection is successful:
<property>
<name>dfs.namenode.kerberos.principal.pattern</name>
<value>*</value>
</property>
I recently bumped into an issue with this using HDP 2.2 on 2 separate clusters, and after hooking up a debugger to the client I found the issue, which may affect other flavors of Hadoop.
There is an HDFS config property, dfs.namenode.kerberos.principal.pattern, which dictates whether a principal is considered valid. If the principal doesn't match the pattern AND doesn't match the current cluster's principal, then you get the exception you saw. In HDP 2.3 and higher, as well as CDH 5.4 and higher, this appears to default to *, which means everything matches. But in HDP 2.2 it's not in the defaults, so this error occurs whenever you try to talk to the remote Kerberized HDFS from your existing Kerberized HDFS.
Simply adding this property with *, or any other pattern which matches both local and remote principal names, resolves the issue.
Run
ls -lart
and check for a keytab file, e.g. .example.keytab.
If the keytab file is there, run
klist -kt keytabfilename
e.g. klist -kt .example.keytab
You'll get a principal like example@EXAM-PLE.COM in the output.
Then run
kinit -kt .example.keytab example@EXAM-PLE.COM
i.e. kinit -kt <keytab file> <principal>.
