Programmatically configuring S3 options in Flink - hadoop

Apparently Flink 1.14.0 doesn't correctly translate S3 options when they are set programmatically. I'm creating a local environment like this to connect to a local MinIO instance:
val flinkConf = new Configuration()
flinkConf.setString("s3.endpoint", "http://127.0.0.1:9000")
flinkConf.setString("s3.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
val env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(flinkConf)
Then StreamingFileSink fails with a huge stack trace, the most relevant part being:
Caused by: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider InstanceProfileCredentialsProvider : com.amazonaws.SdkClientException: Failed to connect to service endpoint:
This means that Hadoop tried to enumerate all of the credential providers instead of using the one set in the configuration. What am I doing wrong?

I've spent ages trying to figure out this one too. I could not find a way to set it programmatically, but adding the following to src/main/resources/core-site.xml in my Flink Java project worked in the end:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3a.aws.credentials.provider</name>
    <value>com.amazonaws.auth.profile.ProfileCredentialsProvider</value>
  </property>
</configuration>
Then I could use the AWS_PROFILE env var to select stored credentials.
This was for Flink with flink-s3-fs-hadoop 1.13.2.
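For the local MinIO setup in the question, the same approach should work with the endpoint and credentials provider spelled out as s3a properties. A sketch of what that core-site.xml might look like (fs.s3a.endpoint, fs.s3a.path.style.access, and fs.s3a.aws.credentials.provider are standard S3A keys; the values simply mirror the question's assumptions):
<?xml version="1.0"?>
<configuration>
  <!-- Point the S3A client at the local MinIO endpoint from the question -->
  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://127.0.0.1:9000</value>
  </property>
  <!-- MinIO is usually addressed with path-style URLs rather than virtual-host style -->
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
  <!-- Credentials provider from the question; swap in org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
       plus fs.s3a.access.key / fs.s3a.secret.key if the bucket is not anonymous -->
  <property>
    <name>fs.s3a.aws.credentials.provider</name>
    <value>org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</value>
  </property>
</configuration>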

Related

Write to HDFS/Hive using NiFi

I'm using NiFi 1.6.0.
I'm trying to write to HDFS and to Hive (Cloudera) with NiFi.
On "PutHDFS" I configured the "Hadoop Configuration Resources" with the hdfs-site.xml and core-site.xml files and set the directories, and when I try to start it I get the following error:
"Failed to properly initialize processor. If still scheduled to run, NiFi will attempt to initialize and run the Processor again after the 'Administrative Yield Duration' has elapsed. Failure is due to java.lang.reflect.InvocationTargetException: java.lang.reflect.InvocationTargetException"
On "PutHiveStreaming" I configured the "Hive Metastore URI" with thrift://..., the database and the table name, and on "Hadoop Configuration Resources" I put the hive-site.xml location; when I try to start it I get the following error:
"Hive streaming connect/write error, flow file will be penalized and routed to retry. org.apache.nifi.util.hive.HiveWriter$ConnectFailure: Failed connecting to EndPoint {metaStoreUri='thrift://myserver:9083', database='mydbname', table='mytablename', partitionVals=[]}:"
How can I solve these errors?
Thanks.
For #1, if you got your *-site.xml files from the cluster, it's possible that they are using internal IPs to refer to components like the DataNodes and you won't be able to reach them directly using that. Try setting dfs.client.use.datanode.hostname to true in your hdfs-site.xml on the client.
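For example, the client-side hdfs-site.xml would get something like this (a sketch; dfs.client.use.datanode.hostname is a standard HDFS client property):
<!-- Ask the client to address DataNodes by hostname instead of their (internal) IPs -->
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>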
For #2, I'm not sure PutHiveStreaming will work against Cloudera, IIRC they use Hive 1.1.x and PutHiveStreaming is based on 1.2.x, so there may be some Thrift incompatibilities. If that doesn't seem to be the issue, make sure the client can connect to the metastore port (looks like 9083).
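To check that connectivity quickly, a few lines of plain socket code are enough (a sketch only; the host and port come from the error message, and a successful connect proves the port is reachable, not that the Thrift versions are compatible):
import java.net.{InetSocketAddress, Socket}

object MetastorePortCheck {
  def main(args: Array[String]): Unit = {
    // Host and port taken from the PutHiveStreaming error message; adjust for your cluster
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress("myserver", 9083), 5000) // 5 second timeout
      println("Metastore port is reachable")
    } finally {
      socket.close()
    }
  }
}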

Intellij Accessing file from hadoop cluster

As part of my IntelliJ environment setup, I need to connect to a remote Hadoop cluster and access its files from my local Spark code.
Is there any way to connect to the remote Hadoop environment without creating a local Hadoop instance?
A connection code snippet would be the ideal answer.
If you have a keytab file to authenticate to the cluster, this is one way I've done it:
val conf: Configuration = new Configuration()
conf.set("hadoop.security.authentication", "Kerberos")
UserGroupInformation.setConfiguration(conf)
UserGroupInformation.loginUserFromKeytab("user-name", "path/to/keytab/on/local/machine")
FileSystem.get(conf)
I believe you might also need some configuration XML files for this, namely core-site.xml, hdfs-site.xml, and mapred-site.xml. These are usually found under /etc/hadoop/conf/ on the cluster.
You would put those under a directory in your project and mark it as a Resources directory in IntelliJ.
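Putting it together, a minimal sketch of the whole flow (the principal, keytab path, and the /tmp listing path are placeholders; the cluster's *-site.xml files are assumed to be on the classpath via that Resources directory):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

object RemoteHdfsAccess {
  def main(args: Array[String]): Unit = {
    // Picks up core-site.xml / hdfs-site.xml from the classpath (the Resources directory)
    val conf = new Configuration()
    conf.set("hadoop.security.authentication", "kerberos")

    // Authenticate with the keytab before touching the FileSystem
    UserGroupInformation.setConfiguration(conf)
    UserGroupInformation.loginUserFromKeytab("user-name", "path/to/keytab/on/local/machine")

    // List a directory on the remote cluster to verify the connection
    val fs = FileSystem.get(conf)
    fs.listStatus(new Path("/tmp")).foreach(status => println(status.getPath))
  }
}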

Can't get Master Kerberos principal for use as renewer for Talend Batch Jobs

We are trying to use Talend batch (Spark) jobs to access Hive in a Kerberos cluster, but we are getting the "Can't get Master Kerberos principal for use as renewer" error below.
Using the standard (non-Spark) jobs in Talend, we are able to access Hive without any issue.
Below are the observations:
When we run Spark jobs, Talend is able to connect to the Hive metastore and validate the syntax, e.g. if I provide a wrong table name it returns "table not found".
When we run select count(*) on a table with no data it returns "NULL", but if some data is present in HDFS for the table it fails with the error "Can't get Master Kerberos principal for use as renewer".
I am not sure exactly what is causing the token problem. Could someone help us find the root cause?
One more thing to add: if I read/write to HDFS (instead of Hive) using Spark batch jobs it works, so the only problem is with Hive and Kerberos.
You should include the Hadoop configuration directory in the classpath (e.g. append :/path/hadoop-configuration), and include all the configuration files from that directory, not only core-site.xml and hdfs-site.xml. The same happened to me, and that solved the problem.
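Putting the directory on the classpath is roughly what lets the Hadoop client pick up all the *-site.xml files; the programmatic equivalent looks something like this sketch (the directory path is a placeholder):
import java.io.File
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object ClusterConf {
  // Load every *-site.xml (core, hdfs, yarn, mapred, hive) from the cluster config directory,
  // so client-side code sees properties such as yarn.resourcemanager.principal.
  def load(confDir: String = "/path/hadoop-configuration"): Configuration = {
    val conf = new Configuration()
    Option(new File(confDir).listFiles())
      .getOrElse(Array.empty[File])
      .filter(_.getName.endsWith("-site.xml"))
      .foreach(f => conf.addResource(new Path(f.getAbsolutePath)))
    conf
  }
}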
I hit the same problem when starting Spark on k8s:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: Can't get Master Kerberos principal for use as renewer
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:133)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:243)
at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:52)
at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:54)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)
I just added yarn-site.xml to HADOOP_CONF_DIR; the yarn-site.xml only contains yarn.resourcemanager.principal:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.principal</name>
    <value>yarn/_HOST@DM.COM</value>
  </property>
</configuration>
This worked for me.

Hadoop can list s3 contents but spark-shell throws ClassNotFoundException

My saga continues -
In short, I'm trying to create a test stack for Spark, the aim being to read a file from an S3 bucket and then write it to another. Windows environment.
I was repeatedly encountering errors when trying to access s3 or s3n, as a ClassNotFoundException was being thrown. The filesystem classes had been added to core-site.xml as fs.s3.impl and fs.s3n.impl.
I added hadoop/share/tools/lib to the classpath to no avail. I then added the aws-java-sdk and hadoop-aws jars to the share/hadoop/common folder, and I am now able to list the contents of a bucket using Hadoop on the command line.
hadoop fs -ls "s3n://bucket" shows me the contents, this is great news :)
In my mind the Hadoop configuration should be picked up by Spark, so solving one should solve the other; however, when I run spark-shell and try to save a file to S3, I get the usual ClassNotFoundException, shown below.
I'm still quite new to this and unsure if I've missed something obvious; hopefully someone can help me solve the riddle. Any help is greatly appreciated, thanks.
The exception:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
My core-site.xml (which I believe to be correct now, as Hadoop can access S3):
<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>
<property>
  <name>fs.s3n.impl</name>
  <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
  <description>The FileSystem for s3n: (Native S3) uris.</description>
</property>
And finally the hadoop-env.cmd showing the classpath (which is seemingly ignored):
set HADOOP_CONF_DIR=C:\Spark\hadoop\etc\hadoop
#rem ##added as s3 filesystem not found.http://stackoverflow.com/questions/28029134/how-can-i-access-s3-s3n-from-a-local-hadoop-2-6-installation
set HADOOP_USER_CLASSPATH_FIRST=true
set HADOOP_CLASSPATH=%HADOOP_CLASSPATH%:%HADOOP_HOME%\share\hadoop\tools\lib\*
#rem Extra Java CLASSPATH elements. Automatically insert capacity-scheduler.
if exist %HADOOP_HOME%\contrib\capacity-scheduler (
if not defined HADOOP_CLASSPATH (
set HADOOP_CLASSPATH=%HADOOP_HOME%\contrib\capacity-scheduler\*.jar
) else (
set HADOOP_CLASSPATH=%HADOOP_CLASSPATH%;%HADOOP_HOME%\contrib\capacity-scheduler\*.jar
)
)
EDIT: spark-defaults.conf
spark.driver.extraClassPath=C:\Spark\hadoop\share\hadoop\common\lib\hadoop-aws-2.7.1.jar:C:\Spark\hadoop\share\hadoop\common\lib\aws-java-sdk-1.7.4.jar
spark.executor.extraClassPath=C:\Spark\hadoop\share\hadoop\common\lib\hadoop-aws-2.7.1.jar:C:\Spark\hadoop\share\hadoop\common\lib\aws-java-sdk-1.7.4.jar
You need to pass some parameters to your spark-shell. Try the flag --packages org.apache.hadoop:hadoop-aws:2.7.2.
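For example, the invocation would be spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.2 (or 2.7.1, to match the Hadoop jars listed in the question). A hedged check from inside the shell, with placeholder credentials and object name, that the s3n filesystem class now resolves:
// Run inside spark-shell after starting it with: spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.2
// The two property names are the standard Hadoop s3n credential keys; the values and the
// object name are placeholders.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

// If the package is on the classpath, this resolves org.apache.hadoop.fs.s3native.NativeS3FileSystem
// instead of throwing the ClassNotFoundException from the question.
val lines = sc.textFile("s3n://bucket/some-file.txt")
println(lines.count())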

Cloudera Hadoop access with Kerberos gives TokenCache error : Can't get Master Kerberos principal for use as renewer

I am trying to access a Cloudera Hadoop setup (Hive + Impala) from a MacBook Pro running OS X 10.8.4.
We have Cloudera CDH-4.3.0 installed on Linux servers. I have extracted the CDH-4.2.0 tarball to my MacBook Pro.
I have set up the proper configuration and Kerberos credentials so that commands like 'hadoop fs -ls /' work and the Hive shell starts up.
However, when I run the 'show databases' command it gives the following error:
> hive
> show databases;
Failed with exception java.io.IOException:java.io.IOException: Can't get Master Kerberos principal for use as renewer
The error is related to TokenCache.
When I searched for the error, it seems the method 'obtainTokensForNamenodesInternal' throws it when it tries to get a delegation token for a specific FS and fails.
http://hadoop.apache.org/docs/current/api/src-html/org/apache/hadoop/mapreduce/security/TokenCache.html
On the client side I don't see any error in the Hive shell logs. I have also tried the CDH 4.3.0 tarballs with the same configuration and I get the same error.
Any help or pointers for resolving this error would be highly appreciated.
It seems that you have not configured Kerberos for YARN.
Add the following configuration to your yarn-site.xml:
<property>
  <name>yarn.nodemanager.principal</name>
  <value>yarn_principal/fqdn@_HOST</value>
</property>
<property>
  <name>yarn.resourcemanager.principal</name>
  <value>yarn_principal/fqdn@_HOST</value>
</property>
Create a new Gateway YARN role instance on the host from Cloudera Manager. It will automatically set up and update the yarn-site.xml.
