Hadoop distcp to S3a with hidden key pair - hadoop

How can I hide ACCESS_ID_KEY and SECRET_ACCESS_KEY for access to Amazon S3?
I know about adding them to core-site.xml, but maybe there are other solutions, because with that approach every user on the cluster will run distcp with the same keys. Is there some way to store them in a separate property file for each cluster user?
Thanks.

Please see my HCC post on using the Hadoop Credential API for this use case.
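For reference, a minimal sketch of that approach; the JCEKS path, user name, bucket and host names below are placeholders, not values from the original question:

# store the S3 keys in a JCEKS credential store on HDFS (each command prompts for the value)
hadoop credential create fs.s3a.access.key -provider jceks://hdfs/user/alice/s3.jceks
hadoop credential create fs.s3a.secret.key -provider jceks://hdfs/user/alice/s3.jceks

# point distcp at the per-user credential store instead of keys in core-site.xml
hadoop distcp \
  -Dhadoop.security.credential.provider.path=jceks://hdfs/user/alice/s3.jceks \
  hdfs://namenode:8020/user/alice/data s3a://my-bucket/backup/

Because the store lives under the user's own HDFS home directory, each user can keep their own keys and normal HDFS permissions control who can read them.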

Related

How to pass Hadoop credentials provider path to Map Reduce job?

I am storing my S3 credentials with Hadoop Security credentials provider.
When I run distcp I can pass the path to credentials provider like this:
-Dfs.s3a.security.credential.provider.path=jceks://hdfs/foo/bar/s3.jceks
I'd like to do the same for a Map Reduce job, so it can write directly to S3.
I've tried setting this property
configuration.set("fs.s3a.security.credential.provider.path", "path_to_provider");
in my Map Reduce job, but that did not work.
And yes, I am aware that I could pass them in clear text on the command line or store them in clear text in core-site.xml; the reason I am asking is precisely that I don't want to do that.
Thanks in advance.
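One sketch worth trying, assuming the job's driver class goes through ToolRunner/GenericOptionsParser so that -D options are applied, is to pass the provider path as a generic option at submission time (the jar, class and paths below are made up):

hadoop jar my-job.jar com.example.MyJob \
  -Dfs.s3a.security.credential.provider.path=jceks://hdfs/foo/bar/s3.jceks \
  /input s3a://my-bucket/output

If the driver does not use ToolRunner, the property generally has to be set on the Configuration before the job (and therefore the S3A FileSystem for that bucket) is created; FileSystem instances are cached, so setting it afterwards may have no effect.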

Does distcp in hadoop ENCRYPT data while transporting from one cluster to another

I would like to know whether distcp has an option to encrypt data while transporting it from one cluster to another. I know it supports encryption for S3, but that is specific to Amazon's S3. What if we are moving a plain-text file from one cluster to another: will it be encrypted, or is it sent as plain text? Can we enable such encryption if it is supported?
From HDFS documentation:
Once a KMS has been set up and the NameNode and HDFS clients have been
correctly configured, an admin can use the hadoop key and hdfs crypto
command-line tools to create encryption keys and set up new encryption
zones. Existing data can be encrypted by copying it into the new
encryption zones using tools like distcp.
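A rough illustration of what the quoted documentation describes; the key name and paths are made up, and a KMS is assumed to be configured already:

# create a key in the KMS and an empty directory to become the encryption zone
hadoop key create mykey
hdfs dfs -mkdir /secure
hdfs crypto -createZone -keyName mykey -path /secure

# copy existing plain-text data into the zone; it is encrypted transparently on write
# (-update -skipcrccheck avoids checksum-mismatch failures across the zone boundary)
hadoop distcp -update -skipcrccheck /data/plain /secure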
Hope it helps.

Hadoop distcp command using a different S3 destination

I am using a Eucalyptus private cloud on which I have set up a CDH5 HDFS cluster. I would like to back up my HDFS to the Eucalyptus S3. The classic way to use distcp as suggested here: http://wiki.apache.org/hadoop/AmazonS3 , i.e. hadoop distcp hdfs://namenode:9000/user/foo/data/fil1 s3://$AWS_ACCESS_KEY:$AWS_SECRET_KEY#bucket/key, doesn't work.
It seems that hadoop is pre-configured with an S3 location on Amazon, and I cannot find where this configuration is in order to change it to the IP address of my S3 service running on Eucalyptus. I would expect to be able to just change the URI of S3 in the same way you can change your NameNode URI when using an hdfs:// prefix, but it seems this is not possible. Any insights?
I have already found workarounds for transferring my data. In particular, the s3cmd tools here: https://github.com/eucalyptus/eucalyptus/wiki/HowTo-use-s3cmd-with-Eucalyptus and the s3curl scripts here: aws.amazon.com/developertools/Amazon-S3/2880343845151917 work just fine, but I would prefer to transfer my data using MapReduce with the distcp command.
It looks like hadoop is using the jets3t library for S3 access. You might be able to use the configuration described in this blog to access Eucalyptus, but note that from version 4 onwards the path is "/services/objectstorage" rather than "/services/Walrus".
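For what it's worth, a jets3t.properties sketch along those lines might look like the following; the host, port and path are placeholders, and the property names (in particular the virtual-path one) should be checked against the jets3t configuration guide for the version bundled with your Hadoop distribution:

# jets3t.properties placed on the Hadoop classpath (e.g. the conf directory)
s3service.s3-endpoint=s3.my-eucalyptus.example.com
s3service.s3-endpoint-http-port=8773
s3service.https-only=false
s3service.disable-dns-buckets=true
# service path under the endpoint: /services/objectstorage on Eucalyptus 4+, /services/Walrus before
s3service.s3-endpoint-virtual-path=/services/objectstorage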

distcp s3 instance profile temporary credentials

I'm using distcp on my Hadoop cluster in AWS. Now we are switching over to using IAM roles for the cluster nodes. A solution I was going to try was to plug in my own implementation of org.apache.hadoop.fs.s3native.NativeS3FileSystem that would be smarter, like the AWS InstanceProfileCredentialsProvider, and use the IMDS. However, is there an existing solution for making distcp work with temporary security credentials? Looking at NativeS3FileSystem and the related classes, it looks like I would need to copy most of the code just to make the credentials lookup use the IMDS.
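One direction to investigate, assuming a Hadoop version that ships the S3A connector and an IAM role attached to the nodes, is to let s3a fetch credentials from the instance metadata service instead of extending NativeS3FileSystem; the namenode and bucket below are placeholders:

hadoop distcp \
  -Dfs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider \
  hdfs://namenode:8020/data s3a://my-bucket/backup/

Whether the fs.s3a.aws.credentials.provider property is available depends on the Hadoop version; recent S3A versions also fall back to instance-profile credentials by default, so simply switching from s3n:// to s3a:// URIs may be enough.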

define a HDFS file on Elastic MapReduce on Amazon Web Services

I'm starting an implementation of the KMeans algorithm on the Hadoop MapReduce framework. I'm using the Elastic MapReduce service offered by Amazon Web Services for this. I want to create an HDFS file to hold the initial cluster coordinates and to store the final results of the reducers. I'm totally confused here. Is there any way to create or "upload" this file into HDFS so that it can be seen by all the mappers?
Any clarification in this regard?
Thanks.
In the end I figured out how to do it.
So, in order to upload the file into HDFS on the cluster, you have to connect to your cluster via PuTTY (using the security key).
Then run this command:
hadoop distcp s3://bucket_name/data/fileNameinS3Bucket HDFSfileName
where
fileNameinS3Bucket is the name of the file in the S3 bucket
HDFSfileName is the name you want the file to have once uploaded.
To check that the file has been uploaded:
hadoop fs -ls
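If the file starts out on the master node's local disk rather than in S3, a plain HDFS copy also works; the paths here are just examples:

hadoop fs -put /home/hadoop/initial_centroids.txt /user/hadoop/initial_centroids.txt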
