Hadoop distcp copy from S3: Signature does not match error - hadoop

I am trying to copy a file from S3 to my Hadoop HDFS on Amazon EC2.
The command that I am using is:
bin/hadoop distcp s3://<awsAccessKeyId>:<awsSecretAccessKey>@<bucket_name>/f1 hdfs://user/root/
f1 is the name of the file.
I have also changed it to s3n to see if that works, but it does not.
I replaced the forward slash in my secret access key with %2F.
The error that I get is: SignatureDoesNotMatch
org.jets3t.service.S3ServiceException: S3 GET failed for '/%2Ff1'
<Message>The request signature we calculated does not match the signature you provided. Check your key and signing method.</Message>
<StringToSignBytes>...</StringToSignBytes>
<RequestId>...</RequestId>
<HostId>..</HostId>
<SignatureProvided>NsefW5en6P728cc9llkFIk6yGc4=</SignatureProvided>
<StringToSign>GETMon, 05 Aug 2013 15:28:21 GMT/<bucket_name>/%2Ff1</StringToSign>
<AWSAccessKeyId><MY_ACCESS_ID></AWSAccessKeyId></Error>
I have only one AWS Access Key ID and Secret Key. I checked my AWS account and they match what I am using here. I use the same Access Key and Secret Key to log on to my EC2 cluster. I have also tried putting them in core-site.xml, but that has not helped either.
Thanks,
Rajiv

Regenerating my AWS Key and Secret such that there is no forward slash in my secret worked for me.
Ref: https://issues.apache.org/jira/browse/HADOOP-3733

An alternative to regenerating the key that worked for me was to use the -Dfs.s3n.awsAccessKeyId= and -Dfs.s3n.awsSecretAccessKey= flags when running distcp.
Example:
hadoop distcp -Dfs.s3n.awsAccessKeyId= -Dfs.s3n.awsSecretAccessKey= s3n://path/to/log/dir hdfs://hdfs-node:8020/logs/
Note the use of s3n, which has a 5 GB file-size limitation: Difference between Amazon S3 and S3n in Hadoop
Edit: Do not URL-encode the secret access key; slashes "/" and pluses "+" should be passed as they are!
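For comparison, here is a minimal sketch of the flag form with made-up placeholder credentials and bucket name; the point is that the secret, including any "/" or "+", is passed exactly as issued:
# placeholder credentials and bucket; the secret is not URL-encoded
hadoop distcp -Dfs.s3n.awsAccessKeyId=AKIAEXAMPLEID -Dfs.s3n.awsSecretAccessKey=abc/def+EXAMPLESECRET s3n://example-bucket/path/to/log/dir hdfs://hdfs-node:8020/logs/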

Related

Copy files from HDFS to Amazon S3 using distcp and s3a scheme

Using Apache Hadoop version 2.7.2 and trying to copy files from HDFS to Amazon S3 using the command below.
hadoop distcp hdfs://<<namenode_host>>:9000/user/ubuntu/input/flightdata s3a://<<bucketid>>
Getting the exception below when using the above command.
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: s3a://<<bucketid>>.distcp.tmp.attempt_1462460298670_0004_m_000001_0
Thanks much for the help.
It should be possible to go from HDFS to S3; I have done it before using syntax like the following, running it from an HDFS cluster:
distcp -Dfs.s3a.access.key=... -Dfs.s3a.secret.key=... /user/vagrant/bigdata s3a://mytestbucket/bigdata
If you run your command like this, does it work:
hadoop distcp hdfs://namenode_host:9000/user/ubuntu/input/flightdata s3a://bucketid/flightdata
From the exception, it looks like it is expecting a 'folder' to put the data in, as opposed to the root of the bucket.
You need to provide AWS credentials in order to successfully transfer files to/from HDFS and S3.
You can pass the access key ID and secret key parameters as shown by @stephen above, but for production use you should use the Credential Provider API, which lets you manage your credentials without passing them around in individual commands.
Ref: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CredentialProviderAPI.html
Secondly, you do not need to specify the "hdfs" protocol; an absolute HDFS path is sufficient.
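A rough illustration of the Credential Provider approach (the jceks keystore path below is a placeholder): store the S3A keys once in a Hadoop credential keystore, then point distcp at the keystore so no secrets appear on the command line, and give the source as an absolute HDFS path.
# create the keystore entries once (each command prompts for the value)
hadoop credential create fs.s3a.access.key -provider jceks://hdfs/user/ubuntu/s3.jceks
hadoop credential create fs.s3a.secret.key -provider jceks://hdfs/user/ubuntu/s3.jceks
# run distcp referencing the keystore instead of inline credentials
hadoop distcp -Dhadoop.security.credential.provider.path=jceks://hdfs/user/ubuntu/s3.jceks /user/ubuntu/input/flightdata s3a://bucketid/flightdata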

Could not find uri with key dfs.encryption.key.provider.uri to create a keyProvider in HDFS encryption for CDH 5.4

CDH Version: CDH 5.4.5
Issue: When HDFS encryption is enabled using the KMS available in Hadoop CDH 5.4, I get an error while putting a file into an encryption zone.
Steps:
The steps for enabling encryption in Hadoop are as follows:
1. Creating a key [SUCCESS]
[tester@master ~]$ hadoop key create 'TDEHDP'
-provider kms://https@10.1.118.1/key_generator/kms -size 128
tde group has been successfully created with options
Options{cipher='AES/CTR/NoPadding', bitLength=128, description='null', attributes=null}.
KMSClientProvider[https://10.1.118.1/key_generator/kms/v1/] has been updated.
2. Creating a directory [SUCCESS]
[tester@master ~]$ hdfs dfs -mkdir /user/tester/vs_key_testdir
3. Adding Encryption Zone [SUCCESS]
[tester@master ~]$ hdfs crypto -createZone -keyName 'TDEHDP'
-path /user/tester/vs_key_testdir
Added encryption zone /user/tester/vs_key_testdir
4. Copying File to encryption Zone [ERROR]
[tdetester@master ~]$ hdfs dfs -copyFromLocal test.txt /user/tester/vs_key_testdir
15/09/04 06:06:33 ERROR hdfs.KeyProviderCache: Could not find uri with
key [dfs.encryption.key.provider.uri] to create a keyProvider !!
copyFromLocal: No KeyProvider is configured, cannot access an
encrypted file 15/09/04 06:06:33 ERROR hdfs.DFSClient: Failed to close
inode 20823
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
No lease on /user/tester/vs_key_testdir/test.txt.COPYING (inode
20823): File does not exist. Holder
DFSClient_NONMAPREDUCE_1061684229_1 does not have any open files.
Any idea/suggestion will be helpful.
This issue was crossposted here: https://community.cloudera.com/t5/Storage-Random-Access-HDFS/Could-not-find-uri-with-key-dfs-encryption-key-provider-uri-to/td-p/31637
Main conclusion: It is a non-issue
Here is the answer that was provided by the support staff:
CDH's base release versions are just that: base. The fix for the
harmless log print due to HDFS-7931 is present in all CDH5 releases
since CDH 5.4.1.
If you see that error in the context of having configured a KMS, then it's
a worthy one to consider. If you do not use KMS or EZs, then the error
may be ignored. Alternatively upgrade to the latest CDH5 (5.4.x or
5.5.x) releases to receive a bug fix that makes the error only appear when in the context of a KMS being configured over an encrypted path.
Per your log snippet, I don't see a problem (the canary does not
appear to be failing?). If you're trying to report a failure, please
send us more characteristics of the failure, as HDFS-7931 is a minor
issue with an unnecessary log print.
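If a KMS is in play and the error matters in your setup, a rough way to sanity-check the client side from the shell is sketched below. It reuses the KMS address and paths from the question; the dfs.encryption.key.provider.uri property would normally live in hdfs-site.xml, but it can be passed per command for a quick test:
# confirm the client can reach the KMS and see the key
hadoop key list -provider kms://https@10.1.118.1/key_generator/kms
# point the DFS client at the key provider just for this copy
hdfs dfs -Ddfs.encryption.key.provider.uri=kms://https@10.1.118.1/key_generator/kms -copyFromLocal test.txt /user/tester/vs_key_testdir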

Transferring scripts from s3 to emr master

I've managed to get data files distributed onto EMR clusters, but I can't get the simple Python scripts copied over to the master instance to run the Hadoop job.
Using aws cli (aws s3 cp s3://the_bucket/the_script.py .) returns
A client error (Forbidden) occurred when calling the HeadObject operation: Forbidden.
I tried starting EMR clusters from the console with the defaults checked in the IAM roles section,
I've set up the two IAM roles EMR_DefaultRole and EMR_EC2_DefaultRole, making sure they had all S3 access permissions available,
and I've made sure to run aws configure for both ec2-user and hadoop (confirming the right credentials were in ~/.aws/config).
I still get the error above. If the hadoop user can distcp the data from the same S3 bucket that holds my Python scripts, shouldn't the hadoop user be able to copy those scripts using aws s3? Isn't the same user (hadoop) accessing the same bucket? Thanks for any pointers.
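One way to narrow this down (a diagnostic sketch only, not a definitive fix) is to check, on the master, which identity the CLI is actually resolving and what the failing request looks like:
# show which IAM principal the CLI is using (instance profile role vs. ~/.aws credentials)
aws sts get-caller-identity
# retry the copy with debug output to see the credentials source and the exact S3 request
aws s3 cp s3://the_bucket/the_script.py . --debug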

Which core-site.xml do I add my AWS access keys to?

I want to run Spark code on EC2 against data stored in my S3 bucket. According to both the Spark EC2 documentation and the Amazon S3 documentation, I have to add my AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to the core-site.xml file. However, when I shell into my master EC2 node, I see several core-site.xml files.
$ find . -name core-site.xml
./mapreduce/conf/core-site.xml
./persistent-hdfs/share/hadoop/templates/conf/core-site.xml
./persistent-hdfs/src/packages/templates/conf/core-site.xml
./persistent-hdfs/src/contrib/test/core-site.xml
./persistent-hdfs/src/test/core-site.xml
./persistent-hdfs/src/c++/libhdfs/tests/conf/core-site.xml
./persistent-hdfs/conf/core-site.xml
./ephemeral-hdfs/share/hadoop/templates/conf/core-site.xml
./ephemeral-hdfs/src/packages/templates/conf/core-site.xml
./ephemeral-hdfs/src/contrib/test/core-site.xml
./ephemeral-hdfs/src/test/core-site.xml
./ephemeral-hdfs/src/c++/libhdfs/tests/conf/core-site.xml
./ephemeral-hdfs/conf/core-site.xml
./spark-ec2/templates/root/mapreduce/conf/core-site.xml
./spark-ec2/templates/root/persistent-hdfs/conf/core-site.xml
./spark-ec2/templates/root/ephemeral-hdfs/conf/core-site.xml
./spark-ec2/templates/root/spark/conf/core-site.xml
./spark/conf/core-site.xml
After some experimentation, I determined that I can access an s3n url like s3n://mcneill-scratch/GR.txt from Spark only if I add my credentials to both mapreduce/conf/core-site.xml and spark/conf/core-site.xml.
This seems wrong to me. It's not DRY, and I can't find anything in the documentation that says you have to add your credentials to multiple files.
Is modifying multiple files the correct way to set up S3 credentials via core-site.xml? Is there documentation somewhere that explains this?
./spark/conf/core-site.xml should be the right place
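If editing several files feels un-DRY, an alternative sketch (assuming a Spark build that forwards spark.hadoop.* properties into the Hadoop configuration, and with placeholder key values) is to supply the keys at launch time instead of in core-site.xml:
# spark.hadoop.* entries are copied into the Hadoop configuration for this application
spark-shell --conf spark.hadoop.fs.s3n.awsAccessKeyId=YOUR_KEY \
            --conf spark.hadoop.fs.s3n.awsSecretAccessKey=YOUR_SECRET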

Unable to copy NCDC Data from Amazon AWS to Hadoop Cluster

I am trying to copy the NCDC data from Amazon S3 to my local Hadoop cluster by using the following command.
hadoop distcp -Dfs.s3n.awsAccessKeyId='ABC' -Dfs.s3n.awsSecretAccessKey='XYZ' s3n://hadoopbook/ncdc/all input/ncdc/all
And I am getting the error given below:
java.lang.IllegalArgumentException: AWS Secret Access Key must be specified as the password of a s3n URL, or by setting the fs.s3n.awsSecretAccessKey property
I have gone through the following question, but it was of no big help.
Problem with Copying Local Data
Any hint about how to solve the problem would be appreciated; a detailed answer would be even better for understanding. Thanks
Have you tried this:
Excerpt from AmazonS3 Wiki
Here is an example copying a nutch segment named 0070206153839-1998 at
/user/nutch in hdfs to an S3 bucket named 'nutch' (Let the S3
AWS_ACCESS_KEY_ID be 123 and the S3 AWS_ACCESS_KEY_SECRET be 456):
% ${HADOOP_HOME}/bin/hadoop distcp
hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998
s3://123:456@nutch/
In your case, it should be something like this:
hadoop distcp s3n://ABC:XYZ@hadoopbook/ncdc/all hdfs://IPaddress:port/input/ncdc/all
You need to set up the AWS access key ID and secret access key in core-site.xml:
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>xxxxxxx</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>xxxxxxxxx</value>
</property>
and restart your cluster.
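With the keys in core-site.xml, the credentials no longer need to appear in the URI or on the command line, so the original command reduces to something like:
hadoop distcp s3n://hadoopbook/ncdc/all input/ncdc/all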
