Running Custom JAR on Amazon EMR giving error (Filesystem Error) using Amazon S3 Bucket input and output - hadoop

I am trying to run a Custom JAR on an Amazon EMR cluster, passing S3 buckets as the Custom JAR's input and output parameters (-input s3n://s3_bucket_name/ldas/in -output s3n://s3_bucket_name/ldas/out).
When the cluster runs this Custom JAR, the following exception occurs:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: s3n://s3_bucket_name/ldas/out, expected: hdfs://10.214.245.187:9000
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:644)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:181)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:92)
at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:585)
at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:581)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:581)
at cc.mrlda.ParseCorpus.run(ParseCorpus.java:101)
at cc.mrlda.ParseCorpus.run(ParseCorpus.java:77)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at cc.mrlda.ParseCorpus.main(ParseCorpus.java:727)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
How can I correct this error? How can I use an s3n bucket as the filesystem in Amazon EMR?
Also, I think changing the default filesystem to the S3 bucket would help, but I am not sure how to do that.

I'd suggest checking that your jar is using the same method of processing the parameters as shown here: http://java.dzone.com/articles/running-elastic-mapreduce-job
Specifically,
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
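For context, the "Wrong FS" exception above is thrown when code resolves an s3n:// path against the cluster's default (HDFS) filesystem, typically via FileSystem.get(conf). If you control the driver, the following is a minimal sketch (the class name and job wiring are hypothetical, not the cc.mrlda code) of a ToolRunner-based driver that takes its paths from args and resolves the output path against its own filesystem before deleting it:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver sketch: paths come straight from the command line,
// so s3n:// URIs are resolved by their own FileSystem implementation.
public class S3AwareDriver extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    Path input = new Path(args[0]);   // e.g. s3n://s3_bucket_name/ldas/in
    Path output = new Path(args[1]);  // e.g. s3n://s3_bucket_name/ldas/out

    // path.getFileSystem(conf) returns the filesystem matching the path's scheme;
    // FileSystem.get(conf) would return the default HDFS and raise "Wrong FS".
    FileSystem outFs = output.getFileSystem(conf);
    if (outFs.exists(output)) {
      outFs.delete(output, true);
    }

    Job job = Job.getInstance(conf, "s3-aware job");
    job.setJarByClass(S3AwareDriver.class);
    // mapper/reducer setup omitted in this sketch
    FileInputFormat.addInputPath(job, input);
    FileOutputFormat.setOutputPath(job, output);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new S3AwareDriver(), args));
  }
}
If the exception comes from third-party code you cannot change (as with cc.mrlda.ParseCorpus here), the copy-to-HDFS approach described next is the practical workaround.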
Alternatively, I've had success adding custom script-runner steps to copy files from S3 to HDFS or vice versa. Particularly if you have a few streaming steps in a row, it's helpful to keep things on HDFS. You should be able to make a simple bash script with something like
hadoop fs -cp s3://s3_bucket_name/ldas/in hdfs:///ldas/in
and
hadoop fs -cp hdfs:///ldas/out s3://s3_bucket_name/ldas/out
Then set up your streaming steps in between to operate on hdfs:///ldas/in and hdfs:///ldas/out.
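If you'd rather do that copy from inside a small custom JAR instead of a bash script, here is a rough sketch using the Hadoop FileUtil API (the class name is hypothetical; the paths reuse the examples above):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: copy a directory from S3 to HDFS; swap src and dst
// to copy results back out after the job finishes.
public class S3HdfsCopy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path src = new Path("s3://s3_bucket_name/ldas/in");
    Path dst = new Path("hdfs:///ldas/in");
    FileSystem srcFs = src.getFileSystem(conf);
    FileSystem dstFs = dst.getFileSystem(conf);
    // false = do not delete the source after copying
    FileUtil.copy(srcFs, src, dstFs, dst, false, conf);
  }
}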

Related

Using S3 Links When Running Pig 0.14.0 in Local Mode?

I'm running Pig 0.14 in local mode. I'm running simple scripts over data in S3. I'd like to refer to these files directly in these scripts, e.g.:
x = LOAD 's3://bucket/path/to/file1.json' AS (...);
-- Magic happens
STORE x INTO 's3://bucket/path/to/file2.json';
However, when I use the following command line:
$PIG_HOME/bin/pig -x local -P $HOME/credentials.properties -f $HOME/script.pig
I get the following error:
Failed Jobs:
JobId Alias Feature Message Outputs
N/A mainplinks MAP_ONLY Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: s3://bucket/path/to/file.json
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:279)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:962)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:979)
at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.pig.backend.hadoop20.PigJobControl.mainLoopAction(PigJobControl.java:157)
at org.apache.pig.backend.hadoop20.PigJobControl.run(PigJobControl.java:134)
at java.lang.Thread.run(Thread.java:748)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: s3://com.w2ogroup.analytics.soma.prod/airy/fb25b5c6/data/mainplinks.json
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:265)
... 20 more
file:/tmp/temp-948194340/tmp-48450066,
I can confirm that LOAD is failing; I suspect that STORE will fail too. REGISTER with S3 links also fails. I can confirm that the links referenced by LOAD and REGISTER exist, and that the links referred to by STORE don't, as Pig expects.
I've solved some issues already. For example, I dropped jets3t-0.7.1 into $PIG_HOME/lib, which fixed the runtime errors caused by the mere presence of S3 links. Additionally, I've provided the relevant AWS keys, and I can confirm that these keys work because I use them with the AWS CLI to do the same work.
If I use the AWS CLI to copy the files to local disk and rewrite the links to use the local filesystem, everything works fine. Thus, I'm convinced that the issue is S3-related.
How can I convince Pig to handle these S3 links properly?
AFAIK, the way Pig reads from S3 is through HDFS. Furthermore, for Pig to be able to access HDFS, Pig must not run in local mode. To set up non-local Pig easily, I'd suggest spinning up an EMR cluster (which is what I tried this on).
So first you need to set up HDFS properly to access data from S3.
In your hdfs-site.xml configuration, make sure to set values for the fs.s3a keys:
<property>
  <name>fs.s3a.access.key</name>
  <value>{YOUR_ACCESS_KEY}</value>
  <description>AWS access key ID. Omit for Role-based authentication.</description>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>{YOUR_SECRET_KEY}</value>
  <description>AWS secret key. Omit for Role-based authentication.</description>
</property>
There should not be any need to restart the HDFS service, but there is no harm in doing so. To restart a service, run initctl list to find the service name, then sudo stop <service name> followed by sudo start <service name>.
Verify that you can access S3 from HDFS by running (note the s3a protocol):
$ hdfs dfs -ls s3a://bucket/path/to/file
If you get no error, then you are now able to use S3 paths in Pig. Run Pig in either MapReduce or Tez mode:
pig -x tez -f script.pig or pig -x mapreduce -f script.pig.
https://community.hortonworks.com/articles/25578/how-to-access-data-files-stored-in-aws-s3-buckets.html
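As a quick programmatic sanity check, roughly equivalent to the hdfs dfs -ls command above, you can set the same fs.s3a keys on a Configuration and list the bucket from Java (bucket, path, and credentials are placeholders):
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: list an S3 path over s3a, mirroring the hdfs-site.xml
// keys above. Requires the hadoop-aws (S3A) module on the classpath.
public class S3AListCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
    conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");
    FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"), conf);
    for (FileStatus status : fs.listStatus(new Path("s3a://bucket/path/to/"))) {
      System.out.println(status.getPath());
    }
  }
}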

Adding a step or bootstrap action in EMR 3.10 to copy a file from local to s3

I am using Amazon EMR 3.10 and want to copy a file from the local filesystem to Amazon S3. I am using "script-runner.jar", and in the arguments I am passing the command sudo aws s3 cp /home/hadoop/conf/hdfs-site.xml s3://testbucket/myfolder/ --recursive. But the step is failing and throwing the following exception:
Exception in thread "main" java.lang.RuntimeException: Local file does not exist.
at com.amazon.elasticmapreduce.scriptrunner.ScriptRunner.fetchFile(ScriptRunner.java:30)
at com.amazon.elasticmapreduce.scriptrunner.ScriptRunner.main(ScriptRunner.java:56)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
What do I need to do in the step so that it copies the file from the local filesystem to Amazon S3? I also want to ask a few questions:
1) If I need to use "command-runner.jar", how can I use command-runner in EMR 3.10?
2) How can I do the copying task using a bootstrap action?
Thank you
If you are trying to perform this copy in a bootstrap action, note that the hadoop user does not exist until after the bootstrapping phase has completed. That would explain the error.
Doing the copy operation as an EMR step should work, as Hadoop is installed by that point.
See the lifecycle of an EMR cluster for more details: here
It seems like the program is unable to find the local file
/home/hadoop/conf/hdfs-site.xml
Does the file exist?
You could also try using a nice tool called s3cmd.
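If a Custom JAR step fits your workflow better than a shell command, here is a rough sketch of the same copy using the Hadoop FileSystem API (the class name is hypothetical; the paths are taken from the question):
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: copy one local file to S3. Note that a single file
// also needs no --recursive flag if you stay with the aws s3 cp command.
public class LocalToS3Copy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem s3 = FileSystem.get(URI.create("s3://testbucket/"), conf);
    s3.copyFromLocalFile(new Path("/home/hadoop/conf/hdfs-site.xml"),
        new Path("s3://testbucket/myfolder/hdfs-site.xml"));
  }
}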

Executing Mahout against Hadoop cluster

I have a jar file which contains the mahout jars as well as other code I wrote.
It works fine in my local machine.
I would like to run it in a cluster that has Hadoop already installed.
When I do
$HADOOP_HOME/bin/hadoop jar myjar.jar args
I get the error
Exception in thread "main" java.io.IOException: Mkdirs failed to create /some/hdfs/path (exists=false, cwd=file:local/folder/where/myjar/is)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
...
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
I checked that I can access and create the directory in HDFS.
I have also run Hadoop code (no Mahout) without a problem.
I am running this on a Linux machine.
Check that the Mahout user and the Hadoop user are the same, and also check Mahout and Hadoop version compatibility.
Regards
Jyoti ranjan panda

Trouble with Hadoop RecommenderJob

I have added my input files 'input.txt' and 'users.txt' to HDFS successfully. I have tested Hadoop and Mahout jobs separately with success. However, when I go to run a RecommenderJob with the following command line:
bin/hadoop jar /Applications/mahout-distribution-0.9/mahout-core-0.9-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=/user/valtera45/input/input.txt -Dmapred.output.dir=/user/valtera45/output
--usersFile /user/valtera45/input2/users.txt --similarityClassname SIMILARITY_COOCCURRENCE
This is the output I get:
Exception in thread "main" java.io.IOException: Cannot open filename /user/valtera45/temp/preparePreferenceMatrix/numUsers.bin
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1444)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1435)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:347)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:178)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:351)
at org.apache.mahout.common.HadoopUtil.readInt(HadoopUtil.java:339)
at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.run(RecommenderJob.java:172)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(RecommenderJob.java:322)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Whenever I run a standalone Mahout job, a temp folder gets created within the Mahout directory. The RecommenderJob can't seem to get past this step. Any ideas? Thanks in advance. I know the input files I am using are well formatted because they have worked successfully for others.
hadoop jar mahout-core-0.8-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=large_data.csv -Dmapred.output.dir=output/output1.csv -s SIMILARITY_LOGLIKELIHOOD --booleanData --numRecommendations 5
I am using this command and my program runs successfully on an EC2 instance with Mahout and Hadoop, but I am not able to get relevant results. If anyone knows anything about this, please get back to me.

Hadoop runtime error

I have a school project to work with Hadoop that will be hosted on Amazon EMR.
To start, I'm trying to understand a simple word count program, and it runs fine in the Eclipse IDE.
But if I try to run it from the command line, I get the error below.
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
at counter.WordCount.main(WordCount.java:56)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
Do you have any suggestions for this error, and any resources for understanding Hadoop and EMR?
Thanks,
myat
Don't run your Job from the IDE or with the java command. Instead, use the hadoop script in the bin/ directory of the Hadoop installation.
For example, if your Job's starting point is the mrjob.MyJob class and you have a jar (job.jar) containing your Job class, you should run it like this:
path/to/bin/hadoop jar job.jar mrjob.MyJob inputFolder outputFolder
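For reference, here is a minimal sketch of what such an entry-point class could look like (mrjob.MyJob is the hypothetical name from the example above; mapper and reducer setup is omitted):
package mrjob;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Running this through the hadoop script puts the Hadoop jars, including
// org.apache.hadoop.conf.Configuration, on the classpath, which is exactly
// what the plain java command was missing.
public class MyJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(MyJob.class);
    // set mapper, reducer, and output key/value classes here
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}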
