How to pass the Hadoop credential provider path to a MapReduce job? - hadoop

I am storing my S3 credentials with the Hadoop credential provider.
When I run distcp I can pass the path to credentials provider like this:
-Dfs.s3a.security.credential.provider.path=jceks://hdfs/foo/bar/s3.jceks
I'd like to do the same for a MapReduce job, so it can write directly to S3.
I've tried setting this property
configuration.set("fs.s3a.security.credential.provider.path", "path_to_provider");
in my MapReduce job, but that did not work.
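Here is a stripped-down sketch of what I am trying, wired through ToolRunner so that a -D option (as with distcp) would also land in the job Configuration; the class name and the job setup details are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Placeholder class name; mapper/reducer setup omitted.
    public class S3OutputJob extends Configured implements Tool {

        // run() receives a Configuration already populated with any -D options
        // parsed by ToolRunner, e.g. -Dfs.s3a.security.credential.provider.path=...
        @Override
        public int run(String[] args) throws Exception {
            Configuration conf = getConf();

            // The programmatic variant I tried; the jceks path is the one from above.
            conf.set("fs.s3a.security.credential.provider.path",
                     "jceks://hdfs/foo/bar/s3.jceks");

            Job job = Job.getInstance(conf, "write-to-s3");
            job.setJarByClass(S3OutputJob.class);
            // ... set mapper/reducer and an s3a:// output path here ...
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new S3OutputJob(), args));
        }
    }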
And yes, I am aware that I can pass them in clear text from the terminal or store them in clear text in core-site.xml, and the reason why I ask this question is exactly because I don't want to do that.
Thanks in advance.

Related

How does impersonation in hadoop work

I am trying to understand how impersonation works in a Hadoop environment.
I found a few resources like:
About doAs and proxy users - hadoop-kerberos-guide
and about tokens - delegation-tokens.
But I was not able to connect all the dots with respect to the full flow of operations.
My current understanding is:
1. The user does a kinit and executes an end-user-facing program like beeline, spark-submit, etc.
2. The program is app specific and gets service tickets for HDFS.
3. It then gets tokens for all the services it may need during the job execution and saves the tokens in an HDFS directory.
4. The program then connects to a job executor (using a service ticket for the job executor??), e.g. YARN, with the job info and the token path.
5. The job executor gets the token and initializes UGI, and all communication with HDFS is done using the token; Kerberos tickets are not used.
Is the above high level understanding correct? (I have more follow up queries.)
Can the token mechanism be skipped and only Kerberos used at each layer? If so, any resources would help.
My final aim is to write a Spark connector with impersonation support for a data storage system which does not use Hadoop (tokens) but supports Kerberos.
Thanks & regards
-Sri

kerberos ticket and delegation token usage in Hadoop Oozie shell action

I am new to Hadoop and trying to understand why my Oozie shell action is not taking the new ticket even after doing kinit. Here is my scenario.
I log in using my ID "A" and have a Kerberos ticket for my ID. I submit an Oozie workflow with a shell action using my ID.
Inside the Oozie shell action I do another kinit to obtain a ticket for ID "B".
Only this ID "B" has access to some HDFS file. kinit is working fine, since klist shows the ticket for ID "B". Yet when I read the HDFS file that only "B" has access to, I get a permission denied error saying "A" does not have permission to access this file.
But when I do the same thing from the Linux CLI, outside Oozie, after I kinit and get the ticket for "B", I am able to read the HDFS file as "B".
The same step is not working inside the Oozie shell action, and hadoop fs commands always seem to run as the user that submitted the Oozie workflow rather than the user whose Kerberos ticket is present.
Can someone please explain why this is happening? I am not able to understand this.
In the same shell action, though the hadoop fs command failed to switch to user "B", the hbase shell works as user "B". Just for testing, I created an HBase table that only "A" has access to and added an hbase shell get command on this table. If I do kinit -kt for user "B" and get its ticket, this fails too, saying "B" does not have access to this table. So I think HBase is taking the new ticket instead of the delegation token of the user who submitted the Oozie workflow. When I don't do kinit -kt inside the shell action, the hbase command succeeds.
If I do kinit, I cannot even run Hive queries: they fail saying "A" does not have execute access to some directories like /tmp/B/ that only "B" has access to. So I cannot tell how Hive is working, whether it is taking the delegation token created when the Oozie workflow is submitted or the new ticket created for the new user.
Can someone please help me understand the above scenario? Which Hadoop services take the new ticket for authentication, and which commands take the delegation token (like the hadoop fs commands)? Is this how it should work, or am I doing something wrong?
I just don't understand why the same hadoop fs command worked outside Oozie as a different user but does not work inside the Oozie shell action even after kinit.
When does this delegation token actually get created? Is it created only when the Oozie workflow is submitted, or also when I issue hadoop fs commands?
Thank you!
In theory -- Oozie automatically transfers the credentials of the submitter (i.e. A) to the YARN containers running the job. You don't have to care about kinit, because, in fact, it's too late for that.
You are not supposed to impersonate another user inside of an Oozie job; that would defeat the purpose of strict Kerberos authentication.
In practice it's more tricky -- (1) the core Hadoop services (HDFS, YARN) check the Kerberos ticket just once, then they create a "delegation token" that is shared among all nodes and all services.
(2) the oozie service user has special privileges: it can do a kind of Hadoop "sudo", so that it connects to YARN as oozie but YARN creates the "delegation token" for the job submitter (i.e. A), and that's it, you can't change that token.
(3) well, actually you can use an alternate token, but only with some custom Java code that explicitly creates a UserGroupInformation object for an alternate user. The Hadoop command-line interfaces don't do that (see the sketch after point 5).
(4) what about non-core Hadoop, i.e. HBase or Hive Metastore, or non-Hadoop stuff, i.e. ZooKeeper? They don't use "delegation tokens" at all. Either you explicitly manage the UserGroupInformation in Java code, or the default Kerberos ticket is used at connect time.
That's why your HBase shell worked, and if you had used Beeline (the JDBC thin client) instead of Hive (the legacy fat client) it would probably have worked too.
(5) Oozie tries to fill that gap with specific <credentials> options for Hive, Beeline ("Hive2" action), HBase, etc; I'm not sure how that works but it has to imply a non-default Kerberos ticket cache, local to your job containers.
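A hedged sketch of the custom Java code that point (3) alludes to: the job logs in as "B" directly from a keytab and does the HDFS read inside doAs. The realm, keytab path, and file path are placeholders, not from the original post.

    import java.security.PrivilegedExceptionAction;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class ReadHdfsAsAlternateUser {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            UserGroupInformation.setConfiguration(conf);

            // Log in as "B" directly from a keytab, instead of relying on the
            // delegation token Oozie obtained for the submitter "A".
            UserGroupInformation ugiB =
                UserGroupInformation.loginUserFromKeytabAndReturnUGI(
                    "B@EXAMPLE.COM", "/path/to/b.keytab");

            ugiB.doAs((PrivilegedExceptionAction<Void>) () -> {
                // FileSystem calls made here authenticate as "B" over Kerberos,
                // bypassing the job's delegation token.
                FileSystem fs = FileSystem.get(conf);
                fs.open(new Path("/data/only-b-can-read")).close();
                return null;
            });
        }
    }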
We've found it possible to become another Kerberos principal once an Oozie workflow is launched. We have to run a shell action that then runs Java with a custom -Djava.security.auth.login.config=custom_jaas.conf, which then provides a JVM kinit'ed as someone else. This is along the lines of Samson's (3), although this kinit can even be for a completely different realm.

Hadoop distcp to S3a with hidden key pair

How can I hide ACCESS_ID_KEY and SECRET_ACCESS_KEY for access to Amazon S3?
I know about adding them to core-site.xml, but maybe there are different solutions, because with this approach every user on the cluster will run distcp with the same keys. Maybe there is some solution, like storing them in some property file for each cluster user?
Thanks.
Please see my HCC post on using the Hadoop Credential API for this use case.
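For illustration, a hedged sketch of how a Hadoop Credential API lookup can work from Java once each user has stored their keys in their own jceks file; the provider path and the use of conf.getPassword() are assumptions about the setup, not taken from the linked post.

    import org.apache.hadoop.conf.Configuration;

    public class CredentialProviderLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Per-user provider path; "alice" and the file name are placeholders.
            conf.set("hadoop.security.credential.provider.path",
                     "jceks://hdfs/user/alice/s3.jceks");

            // getPassword() consults the configured credential providers before
            // falling back to any clear-text value in the configuration files.
            char[] accessKey = conf.getPassword("fs.s3a.access.key");
            char[] secretKey = conf.getPassword("fs.s3a.secret.key");

            System.out.println("access key resolved: " + (accessKey != null));
            System.out.println("secret key resolved: " + (secretKey != null));
        }
    }

Since the S3A connector resolves its keys through the same Configuration lookup, each user can point the provider path at their own jceks file instead of sharing clear-text keys in core-site.xml.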

distcp s3 instance profile temporary credentials

I'm using distcp on my Hadoop cluster in AWS. Now we are switching over to using IAM roles for the cluster nodes. A solution I was going to try was to add my own implementation of org.apache.hadoop.fs.s3native.NativeS3FileSystem that would be smarter, like the AWS InstanceProfileCredentialsProvider, and use the IMDS. However, is there an existing solution to make distcp work with the temporary security credentials? Looking at NativeS3FileSystem and the related classes, it looks like I will need to copy most of the code just to make the credentials lookup use the IMDS.
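One hedged sketch, assuming the cluster has the s3a connector (Hadoop 2.8+ with hadoop-aws and the AWS SDK on the classpath) rather than only s3n: point s3a at the AWS SDK's instance-profile provider so the IMDS lookup happens without patching NativeS3FileSystem. The bucket name is a placeholder.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3aInstanceProfileSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Ask s3a to pull temporary credentials from the instance metadata
            // service via the AWS SDK, instead of configured access/secret keys.
            conf.set("fs.s3a.aws.credentials.provider",
                     "com.amazonaws.auth.InstanceProfileCredentialsProvider");

            // "my-bucket" is a placeholder.
            FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
            fs.listStatus(new Path("s3a://my-bucket/"));
        }
    }

The same property can also be passed to distcp with -D, so on the s3a path no custom FileSystem subclass should be needed.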

Configure Hadoop to use S3 requester-pays-enabled

I'm using Hadoop (via Spark), and need to access S3N content which is requester-pays. Normally, this is done by enabling httpclient.requester-pays-buckets-enabled = true in jets3t.properties. Yet I've set this and Spark / Hadoop are ignoring it. Perhaps I'm putting the jets3t.properties in the wrong place (/usr/share/spark/conf/). How can I get Hadoop / Spark / JetS3t to access requester-pays buckets?
UPDATE: This is needed if you are outside Amazon EC2. Within EC2, Amazon doesn't require requester-pays. So, a crude workaround is to run out of EC2.
The Spark system is made up of several JVMs (application, master, workers, executors), so setting properties can be tricky. You could use System.getProperty() before the file operation to check whether the JVM where the code runs has loaded the right config. You could even use System.setProperty() to set it directly at that point instead of figuring out the config files.
Environment variables and config files didn't work, but some manual code did: sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "PUTTHEKEYHERE")
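To make the check-and-set idea concrete, a small sketch; the property name comes from the question, and whether JetS3t honours it as a JVM system property is the suggestion above, not something verified here.

    public class RequesterPaysCheck {
        public static void main(String[] args) {
            String key = "httpclient.requester-pays-buckets-enabled";

            // Check whether this particular JVM has picked up the setting at all.
            System.out.println(key + " = " + System.getProperty(key));

            // Force it on in this JVM before any S3N file operation runs.
            if (!"true".equals(System.getProperty(key))) {
                System.setProperty(key, "true");
            }
        }
    }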
