Using S3 Links When Running Pig 0.14.0 in Local Mode? - hadoop

I'm running Pig 0.14 in local mode, using simple scripts over data stored in S3. I'd like to refer to these files directly in the scripts, e.g.:
x = LOAD 's3://bucket/path/to/file1.json' AS (...);
-- Magic happens
STORE x INTO 's3://bucket/path/to/file2.json';
However, when I use the following command line:
$PIG_HOME/bin/pig -x local -P $HOME/credentials.properties -f $HOME/script.pig
I get the following error:
Failed Jobs:
JobId Alias Feature Message Outputs
N/A mainplinks MAP_ONLY Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: s3://bucket/path/to/file.json
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:279)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:962)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:979)
at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.pig.backend.hadoop20.PigJobControl.mainLoopAction(PigJobControl.java:157)
at org.apache.pig.backend.hadoop20.PigJobControl.run(PigJobControl.java:134)
at java.lang.Thread.run(Thread.java:748)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: s3://com.w2ogroup.analytics.soma.prod/airy/fb25b5c6/data/mainplinks.json
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:265)
... 20 more
file:/tmp/temp-948194340/tmp-48450066,
I can confirm that LOAD is failing, and I suspect that STORE will fail too; REGISTERing S3 links also fails. I can confirm that the links referenced by LOAD and REGISTER exist, and that the links referred to by STORE don't, as Pig expects.
I've solved some issues already. For example, I dropped jets3t-0.7.1 into $PIG_HOME/lib, which fixed the runtime errors triggered by the mere presence of S3 links. Additionally, I've provided the relevant AWS keys, and I can confirm that these keys work because I use them with the AWS CLI to do the same work.
If I use the AWS CLI to copy the files to local disk and rewrite the links to use the local file system, everything works fine. Thus, I'm convinced that the issue is S3-related.
How can I convince Pig to handle these S3 links properly?

AFAIK, the way Pig reads from S3 is through HDFS, and for Pig to be able to access HDFS it must not run in local mode. For setting up non-local Pig easily, I'd suggest spinning up an EMR cluster (which is where I tried this).
So first you need to set up HDFS properly to access data from S3.
In your hdfs-site.xml configuration, make sure to set values for the fs.s3a keys:
<property>
  <name>fs.s3a.access.key</name>
  <value>{YOUR_ACCESS_KEY}</value>
  <description>AWS access key ID. Omit for Role-based authentication.</description>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>{YOUR_SECRET_KEY}</value>
  <description>AWS secret key. Omit for Role-based authentication.</description>
</property>
There should be no need to restart the HDFS service, but there is no harm in doing so. To restart a service, run initctl list, then sudo stop <service name according to initctl output> followed by the matching sudo start.
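For example (a sketch; the exact service names vary by distribution, so the ones below are placeholders):
$ initctl list | grep -i hdfs
$ sudo stop <hdfs service name from the output>
$ sudo start <hdfs service name from the output>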
Verify that you can access S3 from HDFS by running (note the s3a protocol):
$ hdfs dfs -ls s3a://bucket/path/to/file
If you get no error, you are now able to use S3 paths in Pig. Run Pig in either MapReduce or Tez mode:
pig -x tez -f script.pig or pig -x mapreduce -f script.pig.
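For example, a minimal sketch of re-running the script from the question in MapReduce mode (this assumes the LOAD/STORE URIs inside script.pig have been switched to the s3a scheme configured above):
$PIG_HOME/bin/pig -x mapreduce -f $HOME/script.pig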
https://community.hortonworks.com/articles/25578/how-to-access-data-files-stored-in-aws-s3-buckets.html

Related

Adding a step or bootstrap action in EMR 3.10 to copy a file from local to s3

I am using Amazon EMR 3.10, and I want to copy a file from the local filesystem to Amazon S3. I am using "script-runner.jar", and in the step arguments I pass the command sudo aws s3 cp /home/hadoop/conf/hdfs-site.xml s3://testbucket/myfolder/ --recursive. But the step fails and throws the following exception:
Exception in thread "main" java.lang.RuntimeException: Local file does not exist.
at com.amazon.elasticmapreduce.scriptrunner.ScriptRunner.fetchFile(ScriptRunner.java:30)
at com.amazon.elasticmapreduce.scriptrunner.ScriptRunner.main(ScriptRunner.java:56)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
What do I need to do in the step so that it copies the file from local disk to Amazon S3? I also want to raise a few questions:
1) If I need to use "command-runner.jar", how can I use it in EMR 3.10?
2) How can I do the copying task using a bootstrap action?
Thank you.
If you are trying to perform this copy in a bootstrap action, note that
the hadoop user does not exist until after the bootstrapping phase has completed.
That would explain the error.
Doing the copy operation as an EMR step should work, as hadoop is installed by that point.
See the lifecycle of an EMR cluster for more details.
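As a sketch of that approach: script-runner.jar expects the location of a script to run rather than a raw shell command, so put the copy command in a small script, upload it to S3, and add it as a step. The cluster id, region, and script location below are placeholders.
# contents of s3://testbucket/scripts/copy-to-s3.sh (aws s3 cp only needs --recursive for directories)
aws s3 cp /home/hadoop/conf/hdfs-site.xml s3://testbucket/myfolder/
# add the step to the running cluster with the AWS CLI
aws emr add-steps --cluster-id j-XXXXXXXXXXXX \
  --steps Type=CUSTOM_JAR,Name=CopyToS3,ActionOnFailure=CONTINUE,Jar=s3://<region>.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://testbucket/scripts/copy-to-s3.sh]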
It seems like the program is unable to find the local file
/home/hadoop/conf/hdfs-site.xml
Does the file exist?
You could also try using a nice tool called s3cmd.

YARN log aggregation on AWS EMR - UnsupportedFileSystemException

I am struggling to enable YARN log aggregation for my Amazon EMR cluster. I am following this documentation for the configuration:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-debugging.html#emr-plan-debugging-logs-archive
Under the section titled: "To aggregate logs in Amazon S3 using the AWS CLI".
I've verified that the hadoop-config bootstrap action puts the following in yarn-site.xml
<property><name>yarn.log-aggregation-enable</name><value>true</value></property>
<property><name>yarn.log-aggregation.retain-seconds</name><value>-1</value></property>
<property><name>yarn.log-aggregation.retain-check-interval-seconds</name><value>3000</value></property>
<property><name>yarn.nodemanager.remote-app-log-dir</name><value>s3://mybucket/logs</value></property>
I can run a sample job (pi from hadoop-examples.jar) and see that it completed successfully on the ResourceManager's GUI.
It even creates a folder under s3://mybucket/logs named with the application id. But the folder is empty, and if I run yarn logs -applicationId <applicationId>, I get a stacktrace:
14/10/20 23:02:15 INFO client.RMProxy: Connecting to ResourceManager at /10.XXX.XXX.XXX:9022
Exception in thread "main" org.apache.hadoop.fs.UnsupportedFileSystemException: No AbstractFileSystem for scheme: s3
at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:154)
at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:242)
at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:333)
at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:330)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:330)
at org.apache.hadoop.fs.FileContext.getFSofPath(FileContext.java:322)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:85)
at org.apache.hadoop.fs.FileContext.listStatus(FileContext.java:1388)
at org.apache.hadoop.yarn.logaggregation.LogCLIHelpers.dumpAllContainersLogs(LogCLIHelpers.java:112)
at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:137)
at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:199)
This doesn't make any sense to me; I can run hdfs dfs -ls s3://mybucket/ and it lists the contents just fine. The machines are getting credentials from AWS IAM roles, and I've tried adding fs.s3n.awsAccessKeyId and such to core-site.xml with no change in behavior.
Any advice is much appreciated.
Hadoop provides two filesystem interfaces: FileSystem and AbstractFileSystem. Most of the time, we work with FileSystem and use configuration options like fs.s3.impl to provide custom adapters.
yarn logs, however, uses the AbstractFileSystem interface.
If you can find an implementation of that for S3, you can specify it using fs.AbstractFileSystem.s3.impl.
See core-default.xml for examples of fs.AbstractFileSystem.hdfs.impl etc.
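One way to see which AbstractFileSystem bindings your build already ships with (a sketch, assuming a standard Hadoop layout where core-default.xml is packaged inside the hadoop-common jar):
$ unzip -p $HADOOP_HOME/share/hadoop/common/hadoop-common-*.jar core-default.xml | grep -A 1 fs.AbstractFileSystem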

Trouble with Hadoop RecommenderJob

I have added my input files 'input.txt' and 'users.txt' to HDFS successfully. I have tested Hadoop and Mahout jobs separately with success. However, when I go to run a RecommenderJob with the following command line:
bin/hadoop jar /Applications/mahout-distribution-0.9/mahout-core-0.9-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=/user/valtera45/input/input.txt -Dmapred.output.dir=/user/valtera45/output
--usersFile /user/valtera45/input2/users.txt --similarityClassname SIMILARITY_COOCCURRENCE
This is the output I get:
Exception in thread "main" java.io.IOException: Cannot open filename /user/valtera45/temp/preparePreferenceMatrix/numUsers.bin
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1444)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1435)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:347)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:178)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:351)
at org.apache.mahout.common.HadoopUtil.readInt(HadoopUtil.java:339)
at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.run(RecommenderJob.java:172)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(RecommenderJob.java:322)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Whenever I run a standalone Mahout job, a temp folder gets created within the Mahout directory. The RecommenderJob can't seem to get past this step. Any ideas? Thanks in advance. I know the input files I am using are well formatted because they have worked successfully for others.
hadoop jar mahout-core-0.8-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=large_data.csv -Dmapred.output.dir=output/output1.csv -s SIMILARITY_LOGLIKELIHOOD --booleanData --numRecommendations 5
I am using this and my program runs successfully on an EC2 instance with Mahout and Hadoop, but I am not able to get relevant results. If anyone knows anything about this, please respond.

Running Custom JAR on Amazon EMR giving error ( Filesystem Error ) using Amazon S3 Bucket input and output

I am trying to run a Custom JAR on Amazon EMR cluster using the input and output parameters of the Custom JAR as S3 buckets (-input s3n://s3_bucket_name/ldas/in -output s3n://s3_bucket_name/ldas/out)
When the cluster runs this Custom JAR, the following exception occurs.
Exception in thread "main" java.lang.IllegalArgumentException: **Wrong FS: s3n://s3_bucket_name/ldas/out, expected: hdfs://10.214.245.187:9000**
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:644)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:181)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:92)
at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:585)
at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:581)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:581)
at cc.mrlda.ParseCorpus.run(ParseCorpus.java:101)
at cc.mrlda.ParseCorpus.run(ParseCorpus.java:77)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at cc.mrlda.ParseCorpus.main(ParseCorpus.java:727)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
How can I correct this error? How can I use an s3n bucket as the filesystem in Amazon EMR?
Also, I think changing the default filesystem to the S3 bucket would be good, but I am not sure how to do it.
I'd suggest checking that your jar is using the same method of processing the parameters as shown here: http://java.dzone.com/articles/running-elastic-mapreduce-job
Specifically,
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
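With that pattern, the S3 URIs are expected to arrive as args[0] and args[1], so the step invocation would look roughly like this (the jar and main-class names here are placeholders):
hadoop jar my-custom-job.jar com.example.MyJob s3n://s3_bucket_name/ldas/in s3n://s3_bucket_name/ldas/out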
Alternatively, I've had success adding custom script-runner steps to copy files from S3 to HDFS or vice versa. Particularly if you have a few streaming steps in a row, it's helpful to keep things on HDFS. You should be able to make a simple bash script with something like
hadoop fs -cp s3://s3_bucket_name/ldas/in hdfs:///ldas/in
and
hadoop fs -cp hdfs:///ldas/out s3://s3_bucket_name/ldas/out
Then set your streaming step in between to operate on hdfs:///ldas/in and hdfs:///ldas/out.
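Putting it together, a wrapper script for the copy steps might look like this (a sketch; the bucket and paths are the ones from the question, and s3:// vs s3n:// should match whichever scheme your cluster is configured for):
#!/bin/bash
# copy the input from S3 to HDFS before the job runs
hadoop fs -cp s3://s3_bucket_name/ldas/in hdfs:///ldas/in
# ... run the job against hdfs:///ldas/in and hdfs:///ldas/out ...
# then copy the results back to S3 afterwards
hadoop fs -cp hdfs:///ldas/out s3://s3_bucket_name/ldas/out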

Pig MR job failing when run by same user who started the cluster

I am seeing this exception intermittently for some mappers and reducers in my Pig MapReduce job. Most of the time it is retried on some other node and the task succeeds, but sometimes all 4 attempts fail and the MapReduce job fails.
The interesting thing, however, is that the jobcache folder indeed has permissions 700. I don't understand why it is not able to create the folder inside it.
Error initializing attempt_201212101828_0396_m_000028_0:
java.io.IOException: Failed to set permissions of path: /apollo/env/TrafficAnalyticsHadoop/var/hadoop/mapred/local_data/taskTracker/trafanly/jobcache/job_201212101828_0396 to 0700
at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:682)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:671)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
at org.apache.hadoop.mapred.JobLocalizer.createJobDirs(JobLocalizer.java:221)
at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:184)
at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1226)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1201)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1116)
at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2404)
at java.lang.Thread.run(Thread.java:662)
I am using Hadoop 1.0.1, if that helps. One more thing I found while searching online was https://issues.apache.org/jira/browse/MAPREDUCE-890. In my case, the user who started the mapred cluster is indeed the one running the job, and that is when it fails. For any other user, the job runs just fine.
Any help would be appreciated.
Change the permissions of the directories you have used as property values in your .xml configuration files to 755.
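For example, for the local directory from the stack trace (a sketch; adjust the path to whatever mapred.local.dir points at on your TaskTracker nodes):
chmod -R 755 /apollo/env/TrafficAnalyticsHadoop/var/hadoop/mapred/local_data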
