Adding a step or bootstrap action in EMR 3.10 to copy a file from local to s3 - hadoop

I am using Amazon EMR 3.10, and I want to copy a file from the local filesystem to Amazon S3. I am using "script-runner.jar" and passing the command sudo aws s3 cp /home/hadoop/conf/hdfs-site.xml s3://testbucket/myfolder/ --recursive in the step arguments. But the step fails and throws the following exception:
Exception in thread "main" java.lang.RuntimeException: Local file does not exist.
at com.amazon.elasticmapreduce.scriptrunner.ScriptRunner.fetchFile(ScriptRunner.java:30)
at com.amazon.elasticmapreduce.scriptrunner.ScriptRunner.main(ScriptRunner.java:56)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
What do I need to do in the step so that it copies the file from local to Amazon S3? I also have a few questions:
1) If I need to use "command-runner.jar", how can I use it in EMR 3.10?
2) How can I do the copying task using a bootstrap action?
Thank you

If you are trying to perform this copy in a bootstrap action, note that
the hadoop user does not exist until after the bootstrapping phase has completed.
That would explain the error.
Doing the copy operation as an EMR step should work, as Hadoop is installed by that point.
See the EMR cluster lifecycle documentation for more details.
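As a side note on mechanics: script-runner.jar treats its first argument as a script to fetch and run, which is why passing the raw command "sudo aws s3 cp ..." fails with "Local file does not exist" (it tries to fetch a file named sudo). A hedged sketch of one way to set this up as a step, assuming placeholder bucket names, script location, region, and cluster id: put the copy into a small shell script, upload it to S3, and hand that script to script-runner. On the 3.x AMIs script-runner.jar is the usual tool; as far as I'm aware, command-runner.jar only ships with the 4.x and later release labels.
#!/bin/bash
# copy-conf-to-s3.sh -- minimal sketch; the destination bucket/key are placeholders
aws s3 cp /home/hadoop/conf/hdfs-site.xml s3://testbucket/myfolder/hdfs-site.xml
Upload the script (aws s3 cp copy-conf-to-s3.sh s3://testbucket/scripts/) and add the step, for example from the AWS CLI:
aws emr add-steps --cluster-id j-XXXXXXXX \
  --steps Type=CUSTOM_JAR,Name=CopyConfToS3,ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://testbucket/scripts/copy-conf-to-s3.sh"]
The script-runner jar bucket is region-specific, so adjust the prefix to your cluster's region.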

It seems like the program is unable to find the local file /home/hadoop/conf/hdfs-site.xml. Does the file exist?
You could also try a nice tool called s3cmd.
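For example (a hedged sketch; the bucket and prefix are placeholders, and s3cmd must first be installed and configured with your keys via s3cmd --configure):
s3cmd put /home/hadoop/conf/hdfs-site.xml s3://testbucket/myfolder/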

Related

Using S3 Links When Running Pig 0.14.0 in Local Mode?

I'm running Pig 0.14 in local mode. I'm running simple scripts over data in S3. I'd like to refer to these files directly in these scripts, e.g.:
x = LOAD 's3://bucket/path/to/file1.json' AS (...);
-- Magic happens
STORE x INTO 's3://bucket/path/to/file2.json';
However, when I use the following command line:
$PIG_HOME/bin/pig -x local -P $HOME/credentials.properties -f $HOME/script.pig
I get the following error:
Failed Jobs:
JobId Alias Feature Message Outputs
N/A mainplinks MAP_ONLY Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: s3://bucket/path/to/file.json
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:279)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:962)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:979)
at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.pig.backend.hadoop20.PigJobControl.mainLoopAction(PigJobControl.java:157)
at org.apache.pig.backend.hadoop20.PigJobControl.run(PigJobControl.java:134)
at java.lang.Thread.run(Thread.java:748)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: s3://com.w2ogroup.analytics.soma.prod/airy/fb25b5c6/data/mainplinks.json
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:265)
... 20 more
file:/tmp/temp-948194340/tmp-48450066,
I can confirm that LOAD is failing; I suspect that STORE will fail too. REGISTER S3 links also fail. I can confirm that the links referenced by LOAD and REGISTER exist, and the links referred to by STORE don't, as Pig expects.
I've solved some issues already. For example, I dropped jets3t-0.7.1 into $PIG_HOME/lib, which fixed runtime errors triggered merely by the presence of S3 links. Additionally, I've provided the relevant AWS keys, and I can confirm that these keys work because I use them with the AWS CLI to do the same work.
If I use awscli to copy the files to local disk and rewrite the links to use the local file system, everything works fine. Thus, I'm convinced that the issue is S3-related.
How can I convince Pig to handle these S3 links properly?
AFAIK, the way Pig reads from S3 is through HDFS. Furthermore, for Pig to be able to access HDFS, it must not run locally. For setting up non-local Pig easily, I'd suggest spinning up an EMR cluster (which is what I tried this on).
So first you need to setup your HDFS properly to access data from S3.
In your hdfs-site.xml configuration, make sure to set values for the fs.s3a keys:
<property>
  <name>fs.s3a.access.key</name>
  <value>{YOUR_ACCESS_KEY}</value>
  <description>AWS access key ID. Omit for Role-based authentication.</description>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>{YOUR_SECRET_KEY}</value>
  <description>AWS secret key. Omit for Role-based authentication.</description>
</property>
There should not be any need to restart the HDFS service, but there is no harm in doing so. To restart a service, run initctl list, then sudo stop <service name according to initctl output> followed by sudo start for the same service.
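A hedged sketch of that restart procedure (the service name below is only an example; use whatever initctl actually reports on your node):
initctl list | grep -i hdfs
sudo stop hadoop-hdfs-namenode
sudo start hadoop-hdfs-namenode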
Verify that you can access S3 from HDFS by running (note the s3a protocol):
$ hdfs dfs -ls s3a://bucket/path/to/file
If you get no error, then you are able to use S3 paths in Pig. Run Pig in either MapReduce or Tez mode:
pig -x tez -f script.pig or pig -x mapreduce -f script.pig
https://community.hortonworks.com/articles/25578/how-to-access-data-files-stored-in-aws-s3-buckets.html
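Putting the steps above together, a minimal sketch of the whole sequence (the bucket and paths are placeholders, and it assumes the script's LOAD/STORE locations use the same s3a:// scheme the connector is configured for):
# 1. Confirm HDFS can reach the bucket through the s3a connector
hdfs dfs -ls s3a://bucket/path/to/file1.json
# 2. Run the script with a non-local execution engine so the cluster's fs.s3a settings are picked up
pig -x mapreduce -P $HOME/credentials.properties -f $HOME/script.pig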

Trouble with Hadoop RecommenderJob

I have added my input files 'input.txt' and 'users.txt' to HDFS successfully. I have tested Hadoop and Mahout jobs separately with success. However, when I go to run a RecommenderJob with the following command line:
bin/hadoop jar /Applications/mahout-distribution-0.9/mahout-core-0.9-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=/user/valtera45/input/input.txt -Dmapred.output.dir=/user/valtera45/output
--usersFile /user/valtera45/input2/users.txt --similarityClassname SIMILARITY_COOCCURRENCE
This is the output I get:
Exception in thread "main" java.io.IOException: Cannot open filename /user/valtera45/temp/preparePreferenceMatrix/numUsers.bin
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1444)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.(DFSClient.java:1435)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:347)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:178)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:351)
at org.apache.mahout.common.HadoopUtil.readInt(HadoopUtil.java:339)
at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.run(RecommenderJob.java:172)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(RecommenderJob.java:322)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Whenever I run a standalone Mahout job, a temp folder gets created within the Mahout directory. The RecommenderJob can't seem to get past this step. Any ideas? Thanks in advance. I know the input files I am using are well formatted because they have worked successfully for others.
hadoop jar mahout-core-0.8-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=large_data.csv -Dmapred.output.dir=output/output1.csv -s SIMILARITY_LOGLIKELIHOOD --booleanData --numRecommendations 5
I am using this, and my program runs successfully on an EC2 instance with Mahout and Hadoop, but I am not able to get relevant results. If anyone knows anything about it, please reply.
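Coming back to the original numUsers.bin error above, one thing worth trying (a hedged sketch, not a confirmed fix): clear any stale temp output and pass an explicit --tempDir so RecommenderJob writes its intermediate files to a known HDFS location. The paths are the ones from the question.
# remove leftovers from earlier runs (the -rmr form also works on older Hadoop releases)
bin/hadoop fs -rmr /user/valtera45/temp
bin/hadoop jar /Applications/mahout-distribution-0.9/mahout-core-0.9-job.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  -Dmapred.input.dir=/user/valtera45/input/input.txt \
  -Dmapred.output.dir=/user/valtera45/output \
  --usersFile /user/valtera45/input2/users.txt \
  --similarityClassname SIMILARITY_COOCCURRENCE \
  --tempDir /user/valtera45/temp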

Running Custom JAR on Amazon EMR giving error ( Filesystem Error ) using Amazon S3 Bucket input and output

I am trying to run a Custom JAR on Amazon EMR cluster using the input and output parameters of the Custom JAR as S3 buckets (-input s3n://s3_bucket_name/ldas/in -output s3n://s3_bucket_name/ldas/out)
When the cluster runs this Custom JAR, the following exception occurs.
Exception in thread "main" java.lang.IllegalArgumentException: **Wrong FS: s3n://s3_bucket_name/ldas/out, expected: hdfs://10.214.245.187:9000**
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:644)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:181)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:92)
at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:585)
at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:581)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:581)
at cc.mrlda.ParseCorpus.run(ParseCorpus.java:101)
at cc.mrlda.ParseCorpus.run(ParseCorpus.java:77)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at cc.mrlda.ParseCorpus.main(ParseCorpus.java:727)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
How can I correct this error? How can I use an s3n bucket as the filesystem in Amazon EMR?
Also, I think changing the default filesystem to the S3 bucket would be good, but I am not sure how to do it.
I'd suggest checking that your jar is using the same method of processing the parameters as shown here: http://java.dzone.com/articles/running-elastic-mapreduce-job
Specifically,
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
Alternatively, I've had success adding custom script-runner steps to copy files from S3 to HDFS or vice versa. Particularly if you have a few streaming steps in a row, it's helpful to keep things on HDFS. You should be able to make a simple bash script with something like
hadoop fs -cp s3://s3_bucket_name/ldas/in hdfs:///ldas/in
and
hadoop fs -cp hdfs:///ldas/out s3://s3_bucket_name/ldas/out
Then set up your streaming steps in between to operate on hdfs:///ldas/in and hdfs:///ldas/out.
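Concretely, the two wrapper scripts could look like this (a sketch using the paths from the question); each one can be added as its own script-runner step before and after the streaming steps:
#!/bin/bash
# copy-in.sh -- stage the S3 input onto HDFS before the streaming steps run
hadoop fs -cp s3://s3_bucket_name/ldas/in hdfs:///ldas/in
#!/bin/bash
# copy-out.sh -- push the final HDFS output back to S3 after the last step
hadoop fs -cp hdfs:///ldas/out s3://s3_bucket_name/ldas/out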

Using hadoop distcp to copy data to s3 block filesystem: The specified copy source is larger than the maximum allowable size for a copy source

I'm trying to use hadoop's distcp to copy data from HDFS to S3 (not S3N). My understanding is that using the s3:// protocol, Hadoop will store the individual blocks on S3, and each S3 'file' will effectively be an HDFS block.
Hadoop version is 2.2.0 running on Amazon EMR.
However, trying to do a simple distcp, I get the following error:
Caused by: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: 71C64ECE79FCC244, AWS Error Code: InvalidRequest, AWS Error Message: The specified copy source is larger than the maximum allowable size for a copy source: 5368709120, S3 Extended Request ID: uAnvxtrNolvs0qm6htIrKjpD0VFxzjqgIeN9RtGFmXflUHDcSqwnZGZgWt5PwoTy
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:619)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:317)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:170)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:2943)
at com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1235)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.copy(Jets3tNativeFileSystemStore.java:277)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at org.apache.hadoop.fs.s3native.$Proxy11.copy(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.rename(NativeS3FileSystem.java:1217)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.promoteTmpToTarget(RetriableFileCopyCommand.java:161)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:110)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:83)
at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
Some of my source files are >5GB. Looking at the error, it seems that distcp is trying to blindly copy files from HDFS into S3, as if it were using the S3 Native filesystem. Because of the files that are >5GB, this is failing, as S3 doesn't support put requests >5GB.
Why is this happening? I would have thought that distcp would try to put the individual blocks onto S3, and these should only be 64MB (my HDFS blocksize).
To write files larger than 5 GB, one must use multipart uploads. This seems to have been fixed in Hadoop version 2.4.0 (see: https://issues.apache.org/jira/browse/HADOOP-9454).
That said, this is one of the reasons why it makes sense to use AWS-native Hadoop offerings like EMR and Qubole. They are already set up to deal with such idiosyncrasies. (Full disclosure: I am one of the founders of Qubole.) In addition to vanilla multipart uploads, we also support streaming multipart uploads, where the file is continuously uploaded to S3 in small chunks even as it is being generated. (In a vanilla multipart upload, we first wait for the file to be fully generated and only then upload it in chunks to S3.)
Here is the example from wiki : http://wiki.apache.org/hadoop/AmazonS3
% ${HADOOP_HOME}/bin/hadoop distcp hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998 s3://123:456#nutch/
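On EMR specifically, another option worth mentioning is S3DistCp, which performs multipart uploads for large files. A hedged sketch (the jar location below is where it ships on the 3.x-era AMIs and may differ on other releases; the bucket and paths are placeholders):
hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --src hdfs:///user/hadoop/data \
  --dest s3://mybucket/backup/data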

Hadoop local job directory get deleted before job is completed on task nodes

We are having a strange issue in our Hadoop cluster. We have noticed that some of our jobs fail with a file-not-found exception [see below]. Basically, the files in the "attempt_*" directory, and the directory itself, are getting deleted while the task is still being run on the host. Looking through some of the Hadoop documentation, I see that the job directory gets wiped out when it receives a KillJobAction; however, I am not sure why it gets wiped out while the job is still running.
My question is what could be deleting it while the job is running? Any thoughts or pointers on how to debug this would be helpful.
Thanks!
java.io.FileNotFoundException: <dir>/hadoop/mapred/local_data/taskTracker/<user>/jobcache/job_201211030344_15383/attempt_201211030344_15383_m_000169_0/output/spill29.out (Permission denied)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:120)
at org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.<init>(RawLocalFileSystem.java:71)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.<init>(RawLocalFileSystem.java:107)
at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:177)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:400)
at org.apache.hadoop.mapred.Merger$Segment.init(Merger.java:205)
at org.apache.hadoop.mapred.Merger$Segment.access$100(Merger.java:165)
at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:418)
at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381)
at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1692)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:253)
I had a similar error and found that this permission error was caused by the Hadoop program not being able to create or access the files.
Are the files being created in HDFS or on a local file system?
If they are on a local file system, try setting permissions on that folder; if they are in HDFS, try setting permissions on that directory.
If you are running it on Ubuntu, then try running
chmod -R a=rwx <dir>/hadoop/mapred/local_data/taskTracker/<user>/jobcache/job_201211030344_15383/attempt_201211030344_15383_m_000169_0/output/
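If the files turn out to live in HDFS rather than on local disk, the analogous command would be (a sketch; the path is a placeholder):
hadoop fs -chmod -R 777 /path/to/job/output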
