Transferring scripts from S3 to EMR master - hadoop

I've managed to get data files distributed on EMR clusters, but I can't get the simple Python scripts copied over to the master instance to run the Hadoop job.
Using aws cli (aws s3 cp s3://the_bucket/the_script.py .) returns
A client error (Forbidden) occurred when calling the HeadObject operation: Forbidden.
I tried starting EMR clusters from the console, selecting the defaults in the IAM roles section,
I've set up the two IAM roles EMR_DefaultRole and EMR_EC2_DefaultRole, making sure they had full S3 access permissions,
and I've made sure to run aws configure for both ec2-user and hadoop (confirming the right credentials were in ~/.aws/config).
I still get the error above. If the hadoop user can distcp the data from the same S3 bucket that holds my Python scripts, shouldn't the hadoop user also be able to copy those scripts with aws s3 cp? Isn't it the same user (hadoop) accessing the same bucket? Thanks for any pointers.
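One way to narrow this down (a diagnostic sketch, not a definitive fix; the bucket and script names follow the question) is to check which identity the CLI is actually using on the master and whether that identity can list the bucket at all, since a Forbidden response to HeadObject is often a missing s3:ListBucket/s3:GetObject grant or a credentials mismatch rather than a problem with the object itself:
aws sts get-caller-identity                        # which credentials is the CLI really using?
aws s3 ls s3://the_bucket/                         # does ListBucket succeed for that identity?
aws s3 cp s3://the_bucket/the_script.py . --debug  # full request/response trace for the failing copy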

Related

Insufficient number of DataNodes reporting when creating dataproc cluster

I am getting an "Insufficient number of DataNodes reporting" error when creating a Dataproc cluster with gs:// as the default FS. Below is the command I am using to create the cluster.
gcloud dataproc clusters create cluster-538f --image-version 1.2 \
--bucket dataproc_bucket_test --subnet default --zone asia-south1-b \
--master-machine-type n1-standard-1 --master-boot-disk-size 500 \
--num-workers 2 --worker-machine-type n1-standard-1 --worker-boot-disk-size 500 \
--scopes 'https://www.googleapis.com/auth/cloud-platform' --project delcure-firebase \
--properties 'core:fs.default.name=gs://dataproc_bucket_test/'
I checked and confirmed that Dataproc is able to create the default folders in the bucket I am using.
As Igor suggests, Dataproc does not support GCS as a default FS. I also suggest unsetting this property. Note that the fs.default.name property can be passed to individual jobs and will work just fine.
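For example, a sketch of passing it to a single job instead of the whole cluster (the cluster and bucket names follow the question; the examples JAR path is a typical Dataproc location and may need adjusting):
gcloud dataproc jobs submit hadoop --cluster cluster-538f \
--jar file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
--properties fs.default.name=gs://dataproc_bucket_test/ \
-- wordcount gs://dataproc_bucket_test/input gs://dataproc_bucket_test/output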
The error arises when Hadoop tries to access the file system (HdfsClientModule). So I think it is probable that Google Cloud Storage doesn't have a specific feature required by Hadoop, and the creation fails after some folders have been created.
As somebody else mentioned previously, it is better to give up on using GCS as the default FS and let HDFS do its work in Dataproc. Nonetheless, you can still take advantage of Cloud Storage for data persistence, reliability, and performance, because remember that data in HDFS is removed when a cluster is shut down.
1.- From a Dataproc node you can access data through the hadoop command to move data in and out, for example:
hadoop fs -ls gs://CONFIGBUCKET/dir/file
hadoop distcp hdfs://OtherNameNode/dir/ gs://CONFIGBUCKET/dir/file
2.- For accessing data from Spark or any Hadoop application just use the gs:// prefix to access your bucket.
Furthermore, if the Cloud Storage connector is installed on an on-premises cluster, it can help move HDFS data into Cloud Storage, which can then be accessed from a Dataproc cluster.
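For instance, a sketch assuming the connector is already installed and configured on the on-premises cluster (the NameNode and bucket names are placeholders, as in the commands above):
hadoop distcp hdfs://OnPremNameNode/dir gs://CONFIGBUCKET/dir   # push on-prem HDFS data into Cloud Storage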

Copy files from HDFS to Amazon S3 using distcp and the s3a scheme

I am using Apache Hadoop 2.7.2 and trying to copy files from HDFS to Amazon S3 with the command below.
hadoop distcp hdfs://<<namenode_host>>:9000/user/ubuntu/input/flightdata s3a://<<bucketid>>
The command above produces the exception below.
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: s3a://<<bucketid>>.distcp.tmp.attempt_1462460298670_0004_m_000001_0
Thanks much for the help.
It should be possible to go from HDFS to S3 - I have done it before using syntax like the following, running it from an HDFS cluster:
hadoop distcp -Dfs.s3a.access.key=... -Dfs.s3a.secret.key=... /user/vagrant/bigdata s3a://mytestbucket/bigdata
If you run your command like this, does it work?
hadoop distcp hdfs://namenode_host:9000/user/ubuntu/input/flightdata s3a://bucketid/flightdata
From the exception, it looks like it is expecting a 'folder' to put the data in, as opposed to the root of the bucket.
You need to provide AWS credentials in order to successfully transfer files between HDFS and S3.
You can pass the access key and secret key on the command line as @stephen shows above, but for production use you should use the Credential Provider API, which lets you manage credentials without passing them around in individual commands.
Ref: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CredentialProviderAPI.html
Secondly, you do not need to specify the "hdfs" scheme; an absolute HDFS path is sufficient.
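A hedged sketch of that approach (the keystore path is just an example): store the s3a keys in a JCEKS provider once, then point distcp at the provider instead of putting keys on the command line, using an absolute HDFS source path as noted above.
hadoop credential create fs.s3a.access.key -provider jceks://hdfs/user/ubuntu/s3.jceks   # prompts for the value
hadoop credential create fs.s3a.secret.key -provider jceks://hdfs/user/ubuntu/s3.jceks   # prompts for the value
hadoop distcp -Dhadoop.security.credential.provider.path=jceks://hdfs/user/ubuntu/s3.jceks \
    /user/ubuntu/input/flightdata s3a://bucketid/flightdata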

Spark distribute local file from master to nodes

I used to run Spark locally, and distributing files to nodes has never caused me problems, but now I am moving things to Amazon's cluster service and things start to break down. Basically, I am processing some IPs using the MaxMind GeoLiteCity.dat, which I placed on the local file system of the master (file:///home/hadoop/GeoLiteCity.dat).
Following an earlier question, I used sc.addFile:
sc.addFile("file:///home/hadoop/GeoLiteCity.dat")
and call it using something like:
val ipLookups = IpLookups(geoFile = Some(SparkFiles.get("GeoLiteCity.dat")), memCache = false, lruCache = 20000)
This works when running locally on my computer, but it seems to fail on the cluster (I do not know the reason for the failure, but I would appreciate it if someone could tell me how to display the logs for the process; the logs generated by the Amazon service do not contain any information on which step is failing).
Do I have to somehow load the GeoLiteCity.dat onto the HDFS? Are there other ways to distribute a local file from the master across to the nodes without HDFS?
EDIT: Just to clarify how I run the job: I wrote a JSON file that defines multiple steps. The first step runs a bash script that transfers GeoLiteCity.dat from Amazon S3 to the master:
#!/bin/bash
cd /home/hadoop
aws s3 cp s3://test/GeoLiteCity.dat GeoLiteCity.dat
After checking that the file is in the directory, the JSON then executes the Spark JAR, but it fails. The logs produced by the Amazon web UI do not show where the code breaks.
Instead of copying the file onto the master, load the file into S3 and read it from there.
Refer to http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html for reading files from S3.
You need to provide an AWS Access Key ID and Secret Key. Either set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY or set them programmatically, like:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", YOUR_SECRET_KEY)
Then you can just read the file as a text file, like:
sc.textFile("s3n://test/GeoLiteCity.dat")
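If you prefer the environment-variable route mentioned above, a minimal sketch is to export the keys on the driver before submitting the job (the key values are placeholders, and the job class and JAR names are hypothetical):
export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY
spark-submit --class com.example.GeoLookup geo-lookup.jar   # hypothetical job class and JAR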
Additional references:
How to read input from S3 in a Spark Streaming EC2 cluster application
https://stackoverflow.com/a/30852341/4057655

Which core-site.xml do I add my AWS access keys to?

I want to run Spark code on EC2 against data stored in my S3 bucket. According to both the Spark EC2 documentation and the Amazon S3 documentation, I have to add my AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to the core-site.xml file. However, when I shell into my master EC2 node, I see several core-site.xml files.
$ find . -name core-site.xml
./mapreduce/conf/core-site.xml
./persistent-hdfs/share/hadoop/templates/conf/core-site.xml
./persistent-hdfs/src/packages/templates/conf/core-site.xml
./persistent-hdfs/src/contrib/test/core-site.xml
./persistent-hdfs/src/test/core-site.xml
./persistent-hdfs/src/c++/libhdfs/tests/conf/core-site.xml
./persistent-hdfs/conf/core-site.xml
./ephemeral-hdfs/share/hadoop/templates/conf/core-site.xml
./ephemeral-hdfs/src/packages/templates/conf/core-site.xml
./ephemeral-hdfs/src/contrib/test/core-site.xml
./ephemeral-hdfs/src/test/core-site.xml
./ephemeral-hdfs/src/c++/libhdfs/tests/conf/core-site.xml
./ephemeral-hdfs/conf/core-site.xml
./spark-ec2/templates/root/mapreduce/conf/core-site.xml
./spark-ec2/templates/root/persistent-hdfs/conf/core-site.xml
./spark-ec2/templates/root/ephemeral-hdfs/conf/core-site.xml
./spark-ec2/templates/root/spark/conf/core-site.xml
./spark/conf/core-site.xml
After some experimentation, I determined that I can access an s3n url like s3n://mcneill-scratch/GR.txt from Spark only if I add my credentials to both mapreduce/conf/core-site.xml and spark/conf/core-site.xml.
This seems wrong to me. It's not DRY, and I can't find anything in the documentation that says you have to add your credentials to multiple files.
Is modifying multiple files the correct way to set S3 credentials via core-site.xml? Is there documentation somewhere that explains this?
./spark/conf/core-site.xml should be the right place.
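If you would rather not keep credentials in multiple files at all, one alternative sketch (assuming your Spark version forwards spark.hadoop.* properties into the Hadoop configuration; the application file and key values are placeholders) is to pass the keys at submit time instead of editing core-site.xml:
spark-submit \
--conf spark.hadoop.fs.s3n.awsAccessKeyId=YOUR_ACCESS_KEY_ID \
--conf spark.hadoop.fs.s3n.awsSecretAccessKey=YOUR_SECRET_ACCESS_KEY \
your_app.py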

hadoop cluster clarification

I am a newbie in Hadoop, and I am trying to run a Hadoop JAR on Amazon EC2. I started my Amazon EC2 instance through the console, uploaded my files to the DFS, and was then able to successfully run the job JAR and generate output on the instance.
But I am still confused about one part: I am not sure whether the job ran on a single machine in Amazon EC2 or on a cluster. How do I find the number of worker nodes involved in my JAR run?
In some reference links I see that we have to use the launch-cluster command, for example "bin/hadoop-ec2 launch-cluster test-cluster 2". What is the difference between starting the instance from the console and using a command like launch-cluster?
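One quick way to check what you actually have (a sketch for the older Hadoop 1.x-style clusters these scripts set up; run it on the master) is to ask the NameNode and JobTracker what has registered:
hadoop dfsadmin -report            # lists every live DataNode in the cluster
hadoop job -list-active-trackers   # lists the TaskTrackers available to run map/reduce tasks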
