distcp s3 instance profile temporary credentials - hadoop

I'm using distcp on my Hadoop cluster in AWS. We are now switching over to using IAM roles for the cluster nodes. A solution I was going to try was to add my own implementation of org.apache.hadoop.fs.s3native.NativeS3FileSystem that would be smarter, like the AWS InstanceProfileCredentialsProvider, and use the instance metadata service (IMDS). However, is there an existing solution to make distcp work with temporary security credentials? Looking at NativeS3FileSystem and the related classes, it looks like I will need to copy most of the code just to make the credentials lookup use IMDS.
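As a hedged sketch of a route that avoids copying NativeS3FileSystem: switch the copy to the s3a connector rather than s3n, assuming a Hadoop build whose s3a connector exposes the fs.s3a.aws.credentials.provider property (the namenode address, bucket, and paths below are placeholders):

    hadoop distcp \
      -Dfs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider \
      hdfs://namenode:8020/user/foo/data \
      s3a://example-bucket/backup/data

With that provider, the AWS SDK fetches the temporary credentials from IMDS and refreshes them before they expire, so no keys need to be hard-coded in core-site.xml or passed on the command line.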

Related

What's the right way to provide Hadoop/Spark IAM role based access for S3?

We have a Hadoop cluster running on EC2, and the EC2 instances are attached to a role which has access to an S3 bucket, for example "stackoverflow-example".
Several users are placing Spark jobs on the cluster. We used keys in the past but do not want to continue with that and want to migrate to the role, so any job placed on the Hadoop cluster will use the role associated with the EC2 instances. I did a lot of searching and found 10+ tickets; some of them are still open, some are fixed, and some have no comments at all.
I want to know whether it's still possible to use the IAM role for jobs (Spark, Hive, HDFS, Oozie, etc.) placed on the Hadoop cluster. Most of the tutorials discuss passing keys (fs.s3a.access.key, fs.s3a.secret.key), which is neither good enough nor secure. We also faced issues with the credential provider with Ambari.
Some references:
https://issues.apache.org/jira/browse/HADOOP-13277
https://issues.apache.org/jira/browse/HADOOP-9384
https://issues.apache.org/jira/browse/SPARK-16363
The first one you link to, HADOOP-13277, asks "can we have IAM?", and the JIRA was closed with "you have this in s3a". The second, HADOOP-9384, was "add IAM to S3n", closed as "switch to s3a". And SPARK-16363? An incomplete bug report.
If you use S3a, and do not set any secrets, then the s3a client will fall back to looking at the special EC2 instance metadata HTTP server, and try to get the secrets from there.
That's it: it should just work.
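For example, a bare distcp against s3a with no fs.s3a.access.key or fs.s3a.secret.key set anywhere (the namenode address and paths are placeholders; "stackoverflow-example" is the bucket from the question):

    hadoop distcp hdfs://namenode:8020/user/foo/data s3a://stackoverflow-example/backup/data

The credentials come from the role attached to the instance, via the metadata server.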

Hadoop distcp to S3a with hidden key pair

How can I hide ACCESS_ID_KEY and SECRET_ACCESS_KEY for access to Amazon S3?
I know about adding them to core-site.xml, but maybe there are different solutions, because with this approach every user on the cluster will run distcp with the same keys. Maybe there is some solution, like storing them in a property file for each cluster user?
Thanks.
Please see my HCC post on using the Hadoop Credential API for this use case.
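A rough sketch of that approach with the s3a connector, assuming a recent enough Hadoop; the keystore path and alias names below are only examples:

    # prompts for each value and stores it encrypted in the user's own keystore
    hadoop credential create fs.s3a.access.key -provider jceks://hdfs/user/alice/s3.jceks
    hadoop credential create fs.s3a.secret.key -provider jceks://hdfs/user/alice/s3.jceks

    # point distcp at that keystore instead of putting keys in core-site.xml
    hadoop distcp \
      -Dhadoop.security.credential.provider.path=jceks://hdfs/user/alice/s3.jceks \
      hdfs://namenode:8020/user/alice/data \
      s3a://example-bucket/backup/data

Each user keeps their own keystore, so different users can run distcp with different keys without sharing anything in core-site.xml.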

How can I change the default bucket of a google-cloud-based hadoop-enabled cluster after I created it?

After I create a google-cloud-based hadoop-enabled cluster, I want to change the default bucket to a different one. How can I do that? I can't find the answer in the Google Cloud documentation. Thanks!
Did you create a cluster by hand, using bdutil, using Cloud Dataproc or through some other means?
bdutil
If you used bdutil, see the choose a default file system section in the setup documentation.
Cloud Dataproc
If you used Cloud Dataproc, you can access any bucket to which your project has permission by using its gs:// URI. If you want to connect your cluster to a new bucket for logs, you will unfortunately have to create a new cluster.
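For example, assuming the project has access to a bucket called my-other-bucket (a placeholder name), jobs and shell commands on the cluster can read and write it directly:

    hadoop fs -ls gs://my-other-bucket/
    hadoop distcp hdfs://namenode:8020/user/foo/data gs://my-other-bucket/backup/data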
Other method
If you used a different method, like the "click to deploy" launcher, I recommend you give Dataproc or bdutil a try.

Hadoop distcp command using a different S3 destination

I am using a Eucalyptus private cloud on which I have set up a CDH5 HDFS. I would like to back up my HDFS to the Eucalyptus S3. The classic way is to use distcp as suggested here: http://wiki.apache.org/hadoop/AmazonS3, i.e. hadoop distcp hdfs://namenode:9000/user/foo/data/fil1 s3://$AWS_ACCESS_KEY:$AWS_SECRET_KEY#bucket/key, but that doesn't work.
It seems that Hadoop is pre-configured with an S3 location on Amazon, and I cannot find where this configuration is in order to change it to the IP address of my S3 service running on Eucalyptus. I would expect to be able to just change the S3 URI in the same way you can change your NameNode URI when using an hdfs:// prefix, but it seems this is not possible... Any insights?
I have already found workarounds for transferring my data. In particular, the s3cmd tools here: https://github.com/eucalyptus/eucalyptus/wiki/HowTo-use-s3cmd-with-Eucalyptus and the s3curl scripts here: aws.amazon.com/developertools/Amazon-S3/2880343845151917 work just fine, but I would prefer to transfer my data using MapReduce with the distcp command.
It looks like Hadoop is using the JetS3t library for S3 access. You might be able to use the configuration described in this blog to access Eucalyptus, but note that for Eucalyptus version 4 onwards the path is "/services/objectstorage" rather than "/services/Walrus".
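As a sketch only (the host, port, and other values below are placeholders, and the property names are the ones the JetS3t configuration guide documents, so verify them against the JetS3t version your Hadoop ships), a jets3t.properties pointing at a Eucalyptus object storage endpoint might look like:

    s3service.s3-endpoint=192.168.0.10
    s3service.s3-endpoint-http-port=8773
    s3service.https-only=false
    s3service.disable-dns-buckets=true
    # Eucalyptus 4+ exposes /services/objectstorage; older versions use /services/Walrus
    s3service.s3-endpoint-virtual-path=/services/objectstorage

The file needs to be on the classpath of the JVMs that run the distcp tasks for JetS3t to pick it up.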

Configure Hadoop to use S3 requester-pays-enabled

I'm using Hadoop (via Spark) and need to access S3N content which is requester-pays. Normally this is done by enabling httpclient.requester-pays-buckets-enabled = true in jets3t.properties. Yet I've set this, and Spark/Hadoop are ignoring it. Perhaps I'm putting jets3t.properties in the wrong place (/usr/share/spark/conf/). How can I get Hadoop/Spark/JetS3t to access requester-pays buckets?
UPDATE: This is only needed if you are outside Amazon EC2. Within EC2, Amazon doesn't require requester-pays. So a crude workaround is to run from within EC2.
The Spark system is made up of several JVMs (application, master, workers, executors), so setting properties can be tricky. You could use System.getProperty() before the file operation to check if the JVM where the code runs has loaded the right config. You could even use System.setProperty() to directly set it at that point instead of figuring out the config files.
Environment variables and config files didn't work, but some manual code did: sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "PUTTHEKEYHERE")
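For reference, the setting the question mentions is a single line in jets3t.properties; the catch, given the multiple JVMs described above, is that the file has to be on the classpath of whichever JVM actually performs the S3 I/O (executors as well as the driver), which may be why dropping it into /usr/share/spark/conf/ alone had no effect:

    httpclient.requester-pays-buckets-enabled=true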
