Trouble with Hadoop RecommenderJob - hadoop

I have added my input files 'input.txt' and 'users.txt' to HDFS successfully. I have tested Hadoop and Mahout jobs separately with success. However, when I go to run a RecommenderJob with the following command line:
bin/hadoop jar /Applications/mahout-distribution-0.9/mahout-core-0.9-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=/user/valtera45/input/input.txt -Dmapred.output.dir=/user/valtera45/output
--usersFile /user/valtera45/input2/users.txt --similarityClassname SIMILARITY_COOCCURRENCE
This is the output I get:
Exception in thread "main" java.io.IOException: Cannot open filename /user/valtera45/temp/preparePreferenceMatrix/numUsers.bin
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1444)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1435)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:347)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:178)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:351)
at org.apache.mahout.common.HadoopUtil.readInt(HadoopUtil.java:339)
at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.run(RecommenderJob.java:172)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(RecommenderJob.java:322)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Whenever I run a standalone Mahout job, a temp folder gets created within the Mahout directory. The RecommenderJob can't seem to get past this step. Any ideas? Thanks in advance. I know the input files I am using are well formatted because they have worked successfully for others.
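(For reference, two things worth trying here, sketched against the paths from the command and stack trace above: clear out any intermediate data left over from a previous failed run, and pass an explicit HDFS location via --tempDir, a standard option of Mahout's AbstractJob-based jobs, so files such as numUsers.bin end up where the job expects them.)
# remove leftovers from a previous failed run (path taken from the stack trace)
bin/hadoop fs -rmr /user/valtera45/temp
# rerun with an explicit temp location on HDFS
bin/hadoop jar /Applications/mahout-distribution-0.9/mahout-core-0.9-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=/user/valtera45/input/input.txt -Dmapred.output.dir=/user/valtera45/output --usersFile /user/valtera45/input2/users.txt --similarityClassname SIMILARITY_COOCCURRENCE --tempDir /user/valtera45/temp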

hadoop jar mahout-core-0.8-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=large_data.csv -Dmapred.output.dir=output/output1.csv -s SIMILARITY_LOGLIKELIHOOD --booleanData --numRecommendations 5
I am using this and my program runs successfully on an EC2 instance with Mahout and Hadoop, but I am not able to get relevant results. If anyone knows anything about this, please reply.

Related

Using S3 Links When Running Pig 0.14.0 in Local Mode?

I'm running Pig 0.14 in local mode. I'm running simple scripts over data in S3. I'd like to refer to these files directly in these scripts, e.g.:
x = LOAD 's3://bucket/path/to/file1.json' AS (...);
-- Magic happens
STORE x INTO 's3://bucket/path/to/file2.json';
However, when I use the following command line:
$PIG_HOME/bin/pig -x local -P $HOME/credentials.properties -f $HOME/script.pig
I get the following error:
Failed Jobs:
JobId Alias Feature Message Outputs
N/A mainplinks MAP_ONLY Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: s3://bucket/path/to/file.json
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:279)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:962)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:979)
at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.pig.backend.hadoop20.PigJobControl.mainLoopAction(PigJobControl.java:157)
at org.apache.pig.backend.hadoop20.PigJobControl.run(PigJobControl.java:134)
at java.lang.Thread.run(Thread.java:748)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: s3://com.w2ogroup.analytics.soma.prod/airy/fb25b5c6/data/mainplinks.json
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:265)
... 20 more
file:/tmp/temp-948194340/tmp-48450066,
I can confirm that LOAD is failing; I suspect that STORE will fail too. REGISTER S3 links also fail. I can confirm that the links referenced by LOAD and REGISTER exist, and the links referred to by STORE don't, as Pig expects.
I've solved some issues already. For example, I dropped jets3t-0.7.1 into $PIG_HOME/lib, which fixed runtime errors due to the presence of S3 links at all. Additionally, I've provided the relevant AWS keys, and I can confirm that these keys work because I use them with the AWS CLI to do the same work.
If I use awscli to copy the files to local disk and rewrite the links to use the local file system, everything works fine. Thus, I'm convinced that the issue is S3-related.
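(For reference, the local-copy workaround mentioned above looks roughly like this with the AWS CLI; the bucket and file names are the placeholders used in the script above, and STORE output is a directory, hence --recursive on the way back up.)
# pull the input down from S3, then point LOAD at the local copy
aws s3 cp s3://bucket/path/to/file1.json /tmp/file1.json
# after the script finishes, copy the STORE output directory back to S3
aws s3 cp /tmp/file2.json s3://bucket/path/to/file2.json --recursive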
How can I convince Pig to handle these S3 links properly?
AFAIK, the way Pig reads from S3 is through HDFS. Furthermore, in order for Pig to be able to access HDFS, Pig must not run in local mode. To set up non-local Pig easily, I'd suggest spinning up an EMR cluster (which is what I tried this on).
So first you need to set up your HDFS properly to access data from S3.
In your hdfs-site.xml configuration, make sure to set values for the fs.s3a keys:
<property>
  <name>fs.s3a.access.key</name>
  <value>{YOUR_ACCESS_KEY}</value>
  <description>AWS access key ID. Omit for Role-based authentication.</description>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>{YOUR_SECRET_KEY}</value>
  <description>AWS secret key. Omit for Role-based authentication.</description>
</property>
There should not be any need to restart the HDFS service, but there is no harm in doing so. To restart a service, run initctl list, then sudo stop <service name according to initctl output> followed by the matching sudo start.
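For example (the service name below is purely illustrative; match it to whatever initctl list reports on your cluster):
$ initctl list | grep -i hdfs
$ sudo stop hadoop-hdfs-namenode
$ sudo start hadoop-hdfs-namenode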
Verify that you can access S3 from HDFS by running (note the s3a protocol):
$ hdfs dfs -ls s3a://bucket/path/to/file
If you get no error, you are now able to use S3 paths in Pig. Run Pig in either MapReduce or Tez mode:
pig -x tez -f script.pig or pig -x mapreduce -f script.pig.
https://community.hortonworks.com/articles/25578/how-to-access-data-files-stored-in-aws-s3-buckets.html

Executing Mahout against Hadoop cluster

I have a jar file which contains the mahout jars as well as other code I wrote.
It works fine in my local machine.
I would like to run it in a cluster that has Hadoop already installed.
When I do
$HADOOP_HOME/bin/hadoop jar myjar.jar args
I get the error
Exception in thread "main" java.io.IOException: Mkdirs failed to create /some/hdfs/path (exists=false, cwd=file:local/folder/where/myjar/is)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
...
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
I checked that I can access and create the directory in HDFS.
I have also run Hadoop code (without Mahout) without a problem.
I am running this on a Linux machine.
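(For reference, a check like the one described above can be run from the shell; the path is the placeholder from the stack trace.)
$HADOOP_HOME/bin/hadoop fs -mkdir /some/hdfs/path
$HADOOP_HOME/bin/hadoop fs -ls /some/hdfs
$HADOOP_HOME/bin/hadoop fs -rmr /some/hdfs/path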
Check whether the Mahout user and the Hadoop user are the same, and also check Mahout and Hadoop version compatibility.
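A quick way to do both checks from the shell (a sketch; paths depend on your installation):
# user the job is submitted as, and whether that user's HDFS home directory exists and is writable
whoami
$HADOOP_HOME/bin/hadoop fs -ls /user/$(whoami)
# Hadoop version on the cluster, to compare against the Hadoop version your Mahout build was compiled for
$HADOOP_HOME/bin/hadoop version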

Hadoop 1.0.4 - file permission issue in running map reduce jobs

I am new to Hadoop and need to set up a sandbox environment on Windows to showcase to a client. I have followed the steps below:
Install Cygwin on all machines
Set up SSH
Install Hadoop 1.0.4
Configure Hadoop
Apply the patch for the HADOOP-7682 bug
After a lot of trial and error I was able to run all the components (namenode, datanode, tasktracker and jobtracker) successfully. But now I am facing a problem running map-reduce jobs: I get a permission error on the tmp directory. When I run the word count example using the following command:
bin/hadoop jar hadoop*examples*.jar wordcount wcountjob wcountjob/gutenberg-output
13/03/28 23:43:29 INFO mapred.JobClient: Task Id :
attempt_201303282342_0001_m_000003_2, Status : FAILED Error
initializing attempt_201303282342_0001_m_000003_2:
java.io.IOException: Failed to set permissions of path:
c:\cygwin\usr\local\tmp\taskTracker\uswu50754 to 0700
at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
at org.apache.hadoop.mapred.JobLocalizer.createLocalDirs(JobLocalizer.java:144)
at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:182)
at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1228)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1203)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1118)
at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2430)
at java.lang.Thread.run(Thread.java:662)
I have tried setting the permissions manually, but that doesn't work either. What I understand is that this is due to the Java libraries being used, which try to reset the permissions and fail. The permission patch that solved the tasktracker problem doesn't seem to solve this one.
Has anybody found a solution for this?
Can anybody point me to a download location for Hadoop 0.20.2, which seems to be immune to this problem?

Hadoop MapReduce program not running

I'm new to Hadoop MapReduce. When I try to run my MapReduce code using the following command:
vishal#XXXX bin/hadoop jar /user/vishal/WordCount com.WordCount.java /user/vishal/file01 /user/vishal/output.
It displays the following output:
Exception in thread "main" java.io.IOException: Error opening job jar: /user/vishal/WordCount.jar
at org.apache.hadoop.util.RunJar.main(RunJar.java:130)
Caused by: java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(ZipFile.java:131)
at java.util.jar.JarFile.<init>(JarFile.java:150)
at java.util.jar.JarFile.<init>(JarFile.java:87)
at org.apache.hadoop.util.RunJar.main(RunJar.java:128)
How can I fix this error?
Your command is asking Hadoop to run a JAR but is specifying a directory instead.
You have also added '.java' to the class name, which is not required. (This is assuming you have written the package name, com.WordCount, correctly).
First build the jar at /user/vishal/WordCount.jar (ensure this is a local directory, not HDFS), then run the command without the '.java' at the end of the class name. Also, you put a dot at the end of the command in your question; I hope that isn't there in the real command.
bin/hadoop jar /user/vishal/WordCount.jar com.WordCount /user/vishal/file01 /user/vishal/output
See the Hadoop tutorial's 'Usage' section for more.
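A minimal sketch of producing that jar on the local filesystem (the package and class names are taken from the question; the Hadoop core jar used for compilation depends on your installation):
mkdir -p classes
# compile against the Hadoop core jar shipped with your installation (the name varies by version)
javac -classpath $HADOOP_HOME/hadoop-core-*.jar -d classes com/WordCount.java
# package the compiled classes into the jar the command above expects
jar cvf /user/vishal/WordCount.jar -C classes .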

Mahout won't start up. Anything to do with version compatibility between Hadoop and Mahout?

I am new to Hadoop, not to mention Mahout. I hope someone can help me get through this; I have been trying for 2 days.
I already have a Hadoop cluster running.
I am using hadoop-2.0.0-alpha.
I installed Mahout (mahout-distribution-0.7) and maven-2.2.1 (the latest maven-3.0.4 doesn't work).
Now I would just like to run Mahout to get an idea of what it is.
I learned that typing "mahout" should print out a list of the options (algorithms) available in Mahout, but when I typed mahout, it just gave me a Java exception.
$ [hadoop#localhost bin]$ mahout
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /home/hadoop/hadoop/bin/hadoop and HADOOP_CONF_DIR=/home/hadoop/hadoop/conf
MAHOUT-JOB: /home/hadoop/mahout/examples/target/mahout-examples-0.7-job.jar
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.util.ProgramDriver.driver([Ljava/lang/String;)V
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:123)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
From what I googled online, most of the answers required me to use a lower version of Hadoop, i.e. hadoop-0.20. Does my problem have something to do with my Hadoop version?
Thank you.
======== NEWLY EDITED ========
I changed my Hadoop version to hadoop-1.0.3 and now typing "mahout" works (my Mahout is version 0.7).
But it fails again with a similar error when I try to run an example:
$ hadoop /home/hadoop/mahout/core/target/mahout-core-0.7-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.output.dir=output -Dmapred.input.dir=input/prefs.txt --usersFile input/users.txt --similarityClassname SIMILARITY_PEARSON_CORRELATION
Caused by: java.lang.ClassNotFoundException: .home.hadoop.mahout.core.target.mahout-core-0.7-job.jar
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
Could not find the main class: /home/hadoop/mahout/core/target/mahout-core-0.7-job.jar. Program will exit.
Hmm..
Yes, it looks like you need to use a different version of Hadoop (or build the latest Mahout from source) if you want this to work. You got a NoSuchMethodError, so the first thing to do is check whether ProgramDriver is in the distribution of Hadoop you're using.
Looking at the API docs for the various versions, you can see that it's in v0.20.x but has been removed from the newer versions.
http://hadoop.apache.org/common/docs/r0.20.205.0/api/index.html
http://hadoop.apache.org/common/docs/r2.0.0-alpha/api/index.html (your version)
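One quick way to check whether ProgramDriver is present with the expected signature is to inspect the Hadoop jar directly (a sketch; the jar name and location differ between the 1.x and 2.x layouts):
# confirm the class is in the jar (hadoop-core-*.jar on 1.x; hadoop-common-*.jar under share/hadoop/common on 2.x)
jar tf $HADOOP_HOME/hadoop-core-*.jar | grep util/ProgramDriver
# print its method signatures to see whether void driver(String[]) exists
javap -classpath $HADOOP_HOME/hadoop-core-*.jar org.apache.hadoop.util.ProgramDriver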
Looking at the JIRA for Mahout, you can see that a bug was submitted for a similar problem on July 11th and has been fixed in version 0.8.
[MAHOUT-1044] https://issues.apache.org/jira/browse/MAHOUT-1044
Update:
Shouldn't your command have jar after the hadoop command? Something like:
$ hadoop jar /home/hadoop/mahout/core/target/mahout-core-0.7-job.jar
etc
#BinaryNerd is right. There is a bug in Mahout as detailed in:
[MAHOUT-1044] https://issues.apache.org/jira/browse/MAHOUT-1044
The Mahout 0.7 command line gives the NoClassDef ProgramDriver error described in the first part of your question. This will be fixed in 0.8, or you can edit your bin/mahout as detailed in the bug report to fix the brace placement.
I ran into the same issue using Mahout under Cloudera CDH 4.1.2. Changing the brace in my bin/mahout fixed it.
