Which core-site.xml do I add my AWS access keys to? - amazon-ec2

I want to run Spark code on EC2 against data stored in my S3 bucket. According to both the Spark EC2 documentation and the Amazon S3 documentation, I have to add my AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to the core-site.xml file. However, when I shell into my master EC2 node, I see several core-site.xml files.
$ find . -name core-site.xml
./mapreduce/conf/core-site.xml
./persistent-hdfs/share/hadoop/templates/conf/core-site.xml
./persistent-hdfs/src/packages/templates/conf/core-site.xml
./persistent-hdfs/src/contrib/test/core-site.xml
./persistent-hdfs/src/test/core-site.xml
./persistent-hdfs/src/c++/libhdfs/tests/conf/core-site.xml
./persistent-hdfs/conf/core-site.xml
./ephemeral-hdfs/share/hadoop/templates/conf/core-site.xml
./ephemeral-hdfs/src/packages/templates/conf/core-site.xml
./ephemeral-hdfs/src/contrib/test/core-site.xml
./ephemeral-hdfs/src/test/core-site.xml
./ephemeral-hdfs/src/c++/libhdfs/tests/conf/core-site.xml
./ephemeral-hdfs/conf/core-site.xml
./spark-ec2/templates/root/mapreduce/conf/core-site.xml
./spark-ec2/templates/root/persistent-hdfs/conf/core-site.xml
./spark-ec2/templates/root/ephemeral-hdfs/conf/core-site.xml
./spark-ec2/templates/root/spark/conf/core-site.xml
./spark/conf/core-site.xml
After some experimentation, I determined that I can access an s3n url like s3n://mcneill-scratch/GR.txt from Spark only if I add my credentials to both mapreduce/conf/core-site.xml and spark/conf/core-site.xml.
This seems wrong to me. It's not DRY, and I can't find anything in the documentation that says you have to add your credentials to multiple files.
Is modifying multiple files the correct way to set up S3 credentials via core-site.xml? Is there documentation somewhere that explains this?

./spark/conf/core-site.xml should be the right place.
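If editing several copies of core-site.xml feels wrong, one common workaround (a sketch, not the documented procedure) is to leave the files alone and set the same fs.s3n properties on the SparkContext's Hadoop configuration, for example from spark-shell on the master. The bucket path below is the one from the question; the key values are read from environment variables you would export yourself:
// paste into spark-shell on the master; sc is the SparkContext it provides
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
// same object as in the question, read directly once the keys are in place
val lines = sc.textFile("s3n://mcneill-scratch/GR.txt")
println(lines.count())
This keeps the credentials in one place for a single job without touching the cluster-wide Hadoop config files.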

Related

Copy files from HDFS to Amazon S3 using distcp and s3a scheme

I am using Apache Hadoop version 2.7.2 and trying to copy files from HDFS to Amazon S3 using the command below.
hadoop distcp hdfs://<<namenode_host>>:9000/user/ubuntu/input/flightdata s3a://<<bucketid>>
I get the exception below when using the above command.
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: s3a://<<bucketid>>.distcp.tmp.attempt_1462460298670_0004_m_000001_0
Thanks much for the help.
It should be possible to go from HDFS to S3 - I have done it before using syntax like the following, running it from an HDFS cluster:
distcp -Dfs.s3a.access.key=... -Dfs.s3a.secret.key=... /user/vagrant/bigdata s3a://mytestbucket/bigdata
If you run your command like this, does it work?
hadoop distcp hdfs://namenode_host:9000/user/ubuntu/input/flightdata s3a://bucketid/flightdata
From the exception, it looks like it is expecting a 'folder' to put the data in, as opposed to the root of the bucket.
You need to provide AWS credentials in order to successfully transfer files between HDFS and S3.
You can pass the access key ID and secret as shown by #stephen above, but for production use you should use the Credential Provider API, which lets you manage your credentials without passing them around in individual commands.
Ref: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CredentialProviderAPI.html
Secondly, you do not need to specify the "hdfs" protocol; an absolute HDFS path is sufficient.
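For what it's worth, the same fs.s3a.* credential settings can also be exercised from code rather than through distcp. The sketch below is offered only as an illustration, not as a replacement for distcp: a plain single-threaded FileSystem copy, assuming the hadoop-aws jar and its AWS SDK dependency are on the classpath, with the namenode host and bucket names standing in for the placeholders in the question:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
// equivalent to the -Dfs.s3a.* flags above; prefer the Credential Provider API in production
conf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
conf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

val src = new Path("hdfs://namenode_host:9000/user/ubuntu/input/flightdata")
val dst = new Path("s3a://bucketid/flightdata")

// single copy call; distcp is still the right tool for large datasets
FileUtil.copy(FileSystem.get(src.toUri, conf), src, FileSystem.get(dst.toUri, conf), dst, false, conf)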

Spark distribute local file from master to nodes

I used to run Spark locally, and distributing files to nodes never caused me problems, but now I am moving things to the Amazon cluster service and things start to break down. Basically, I am processing some IPs using the MaxMind GeoLiteCity.dat, which I placed on the local file system of the master (file:///home/hadoop/GeoLiteCity.dat).
Following a question from earlier, I used sc.addFile:
sc.addFile("file:///home/hadoop/GeoLiteCity.dat")
and call it using something like:
val ipLookups = IpLookups(geoFile = Some(SparkFiles.get("GeoLiteCity.dat")), memCache = false, lruCache = 20000)
This works when running locally on my computer, but it seems to fail on the cluster. I do not know the reason for the failure, but I would appreciate it if someone could tell me how to display the logs for the process; the logs generated by the Amazon service do not contain any information about which step is failing.
Do I have to somehow load the GeoLiteCity.dat onto the HDFS? Are there other ways to distribute a local file from the master across to the nodes without HDFS?
EDIT: Just to specify the way I run this: I wrote a JSON file that performs multiple steps. The first step is to run a bash script that transfers the GeoLiteCity.dat from Amazon S3 to the master:
#!/bin/bash
cd /home/hadoop
aws s3 cp s3://test/GeoLiteCity.dat GeoLiteCity.dat
After checking that the file is in the directory, the JSON then executes the Spark JAR, but it fails. The logs produced by the Amazon web UI do not show where the code breaks.
Instead of copying the file onto the master, load the file into S3 and read it from there.
Refer to http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html for reading files from S3.
You need to provide an AWS Access Key ID and Secret Key. Either set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or set them programmatically, like:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", YOUR_SECRET_KEY)
Then you can just read the file as a text file, like:
sc.textFile("s3n://test/GeoLiteCity.dat")
Additional references:
How to read input from S3 in a Spark Streaming EC2 cluster application
https://stackoverflow.com/a/30852341/4057655
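If the goal is for each worker to have a local copy of the file (which is what the IpLookups call in the question expects) rather than an RDD of its lines, one possible variation on the answer above, offered only as a sketch, is to set the same fs.s3n credentials and let sc.addFile fetch the file from S3 itself, since addFile also accepts Hadoop-supported filesystem URIs; SparkFiles.get then resolves the local copy on each node. The bucket name comes from the question's bash step and the sample IPs are made up:
import org.apache.spark.SparkFiles

// same fs.s3n credentials as in the snippet above
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

// addFile pulls the file straight from the s3://test bucket used in the question's bash step
sc.addFile("s3n://test/GeoLiteCity.dat")

// SparkFiles.get resolves the local copy on whichever node runs the task;
// the IpLookups(...) construction from the question would be built inside this closure
val located = sc.parallelize(Seq("8.8.8.8", "1.2.3.4")).mapPartitions { iter =>
  val geoPath = SparkFiles.get("GeoLiteCity.dat")
  iter.map(ip => (ip, geoPath))
}
located.collect().foreach(println)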

Transferring scripts from s3 to emr master

I've managed to get data files distributed on emr clusters, but can't get the simple python scripts copied over to the master instance to run the hadoop job.
Using aws cli (aws s3 cp s3://the_bucket/the_script.py .) returns
A client error (Forbidden) occurred when calling the HeadObject operation: Forbidden.
I tried starting EMR clusters from the console, checking the default option in the IAM roles section.
I've set up two IAM roles, EMR_DefaultRole and EMR_EC2_DefaultRole, making sure they had all available S3 access permissions,
and I've made sure to run aws configure for both ec2-user and hadoop (confirming the right creds were in ~/.aws/config).
I still get the error above. If the hadoop user can distcp the data from the same S3 bucket that holds my Python scripts, shouldn't the hadoop user be able to copy those scripts using aws s3? Isn't the same user (hadoop) accessing the same bucket? Thanks for any pointers.

Hadoop distcp copy from S3: Signature does not match error

I am trying to copy a file from S3 to my hadoop HDFS on Amazon EC2.
The command that I am using is:
bin/hadoop distcp s3://<awsAccessKeyId>:<awsSecretAccessKey>@<bucket_name>/f1 hdfs://user/root/
f1 is the name of the file.
I have also changed it to s3n to see if it works, but it does not.
I replaced the forward slash in my secret access key with %2F.
The error that I get is SignatureDoesNotMatch:
org.jets3t.service.S3ServiceException: S3 GET failed for '/%2Ff1'
<Message>The request signature we calculated does not match the signature you provided. Check your key and signing method.</Message>
<StringToSignBytes>...</StringToSignBytes>
<RequestId>...</RequestId>
<HostId>..</HostId>
<SignatureProvided>NsefW5en6P728cc9llkFIk6yGc4=</SignatureProvided>
<StringToSign>GETMon, 05 Aug 2013 15:28:21 GMT/<bucket_name>/%2Ff1</StringToSign>
<AWSAccessKeyId><MY_ACCESS_ID></AWSAccessKeyId></Error>
I have only one AWS Access Key Id and secret Key. I checked my AWS account and they are the same. I use the same AWS Access Key and secret Key to log on to my EC2 cluster. I have also tried using core-site.xml but that has not helped either.
Thanks,
Rajiv
Regenerating my AWS key and secret so that there is no forward slash in the secret worked for me.
Ref: https://issues.apache.org/jira/browse/HADOOP-3733
An alternative to regenerating the key that worked for me was to use the -Dfs.s3n.awsAccessKeyId= and -Dfs.s3n.awsSecretAccessKey= flags when running distcp.
Example:
hadoop distcp -Dfs.s3n.awsAccessKeyId= -Dfs.s3n.awsSecretAccessKey= s3n://path/to/log/dir hdfs://hdfs-node:8020/logs/
Note the use of s3n, which has a 5GB file limitation: Difference between Amazon S3 and S3n in Hadoop
Edit: Do not URL-encode the secret access key; slashes ("/") and pluses ("+") should be passed as they are!

Unable to copy NCDC Data from Amazon AWS to Hadoop Cluster

I am trying to copy the NCDC Data from Amazon S3 to my local hadoop cluster by using following command.
hadoop distcp -Dfs.s3n.awsAccessKeyId='ABC' -Dfs.s3n.awsSecretAccessKey='XYZ' s3n://hadoopbook/ncdc/all input/ncdc/all
And getting error which is given below :
java.lang.IllegalArgumentException: AWS Secret Access Key must be specified as the password of a s3n URL, or by setting the fs.s3n.awsSecretAccessKey property
I have gone through the following question, but it was of no big help:
Problem with Copying Local Data
Any hint on how to solve the problem? A detailed answer would be much appreciated for better understanding. Thanks.
Have you tried this:
Excerpt from AmazonS3 Wiki
Here is an example copying a nutch segment named 0070206153839-1998 at /user/nutch in HDFS to an S3 bucket named 'nutch' (let the S3 AWS_ACCESS_KEY_ID be 123 and the S3 AWS_ACCESS_KEY_SECRET be 456):
% ${HADOOP_HOME}/bin/hadoop distcp hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998 s3://123:456@nutch/
In your case, it should be something like this:
hadoop distcp s3n://ABC:XYZ@hadoopbook/ncdc/all hdfs://IPaddress:port/input/ncdc/all
You need to set up the AWS access key ID and secret key in core-site.xml:
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>xxxxxxx</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>xxxxxxxxx</value>
</property>
and restart your cluster
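A quick, non-authoritative way to confirm the new properties are actually being picked up before rerunning distcp is to list the bucket through the Hadoop FileSystem API, which reads the same core-site.xml; the bucket and prefix below are taken from the question:
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// new Configuration() loads core-site.xml from the Hadoop conf dir on the classpath,
// so this listing only succeeds once the fs.s3n.* properties above are in place
val conf = new Configuration()
val fs = FileSystem.get(new URI("s3n://hadoopbook/"), conf)
fs.listStatus(new Path("s3n://hadoopbook/ncdc/")).foreach(status => println(status.getPath))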