Hadoop streaming with Python on Windows - hadoop

I'm using Hortonworks HDP for Windows and have it successfully configured with a master and 2 slaves.
I'm using the following command:
bin\hadoop jar contrib\streaming\hadoop-streaming-1.1.0-SNAPSHOT.jar -files file:///d:/dev/python/mapper.py,file:///d:/dev/python/reducer.py -mapper "python mapper.py" -reducer "python reduce.py" -input /flume/0424/userlog.MDAC-HD1.MDAC.local..20130424.1366789040945 -output /flume/o%1 -cmdenv PYTHONPATH=c:\python27
The mapper runs through fine, but the log reports that the reduce.py file wasn't found. From the exception, it looks like the Hadoop TaskRunner is creating the symlink for the reducer against the mapper.py file.
When I check the job configuration file, I notice that mapred.cache.files is set to:
hdfs://MDAC-HD1:8020/mapred/staging/administrator/.staging/job_201304251054_0021/files/mapper.py#mapper.py
It looks like although the reduce.py file is being added to the jar file, it's not being included in the configuration correctly and can't be found when the reducer tries to run.
I think my command is correct; I've tried using -file parameters instead, but then neither file is found.
Can anyone see or know of an obvious reason?
Please note, this is on Windows.
EDIT: I've just run it locally and it worked, so it looks like my problem may be with the copying of the files around the cluster.
Still welcome input!
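For reference, Hadoop streaming just pipes each input split to the mapper on stdin and the sorted map output to the reducer on stdin, one record per line. A minimal, hypothetical mapper.py/reducer.py pair (a word-count sketch, not the actual scripts from this job) would look like this:

# mapper.py -- emit (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

# reducer.py -- sum the counts per word; streaming delivers the keys in sorted order
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))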

Well, that's embarrassing... my first question and I answer it myself.
I found the problem by renaming the Hadoop conf file to force default settings, which meant using the local job tracker.
The job ran properly, and that gave me the room to work out what the problem was; it looks like communication around the cluster isn't as complete as it needs to be.

Looking at your command, it passes "file:///d:/dev/python/reducer.py" to the -files option, but you specify reduce.py for -reducer. Does this cause the problem? Sorry, I am not sure.
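If the script on disk really is named reducer.py (an assumption; the question only confirms the mapper entry in mapred.cache.files), the fix implied above would be to make the -reducer value match the name shipped via -files, leaving everything else in the command unchanged:
bin\hadoop jar contrib\streaming\hadoop-streaming-1.1.0-SNAPSHOT.jar -files file:///d:/dev/python/mapper.py,file:///d:/dev/python/reducer.py -mapper "python mapper.py" -reducer "python reducer.py" -input /flume/0424/userlog.MDAC-HD1.MDAC.local..20130424.1366789040945 -output /flume/o%1 -cmdenv PYTHONPATH=c:\python27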

Related

Hadoop Streaming Exception (No FileSystem for Scheme "C")

I'm new to Hadoop, and I'm trying to use the streaming option to develop some jobs using Python on Windows 10 locally.
After double-checking the paths I provided, and even my program, I encounter an exception that isn't discussed on any page. The exception is:
No FileSystem for scheme
I would be grateful for any help.
The error comes from either:
Your core-site.xml fs.defaultFS value. That needs to be something like hdfs://127.0.0.1:9000, not a path on your Windows filesystem (a minimal sketch follows these two points). Perhaps you confused it with the hdfs-site.xml values for the namenode/datanode data directories.
Your code. You need to use file://c:/path, not C:/, for Hadoop-compatible file paths, especially in values passed as -mapper or -reducer.
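As a sketch of the first point, a minimal core-site.xml for a local pseudo-distributed setup (hdfs://127.0.0.1:9000 is just the example address from above; adjust the host and port to your setup) would contain:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://127.0.0.1:9000</value>
  </property>
</configuration>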
Also, no one really writes mapreduce code anymore. You can run similar code in PySpark, and you don't need Hadoop to run it.

Why is the MR2 map task running under the 'yarn' user and not under the user I ran the Hadoop job as?

I'm trying to run a MapReduce job on MR2, Hadoop version 2.6.0-cdh5.8.0. The job uses a relative path to a directory containing a lot of files to be compressed based on some criteria (not really relevant to this question). I'm running my job as follows:
sudo -u my_user hadoop jar my_jar.jar com.example.Main
There is a folder with the files on HDFS under the path /user/my_user/. But when I run my job I get the following exception:
java.io.FileNotFoundException: File /user/yarn/<path_from_job> does not exist.
I'm migrating this job from MR1, where it works correctly. My guess is that this is happening because of YARN, since each container is started under the yarn user. In my job configuration I've tried to set mapreduce.job.user.name="my_user", but this didn't help.
I've found ${user.home} used in my job configuration, but I'm not aware of where it is set or whether it is possible to change it.
The only solution I've found so far is to provide an absolute path to the folder. Is there any other way around this? I feel like this is not the correct approach.
Thank you

Running Mahout Job on Hadoop: Got ClassNotFoundException

I'm trying to run a Mahout KMeans example on the Cloudera quickstart VM for Hadoop. I read here (a Cloudera blog post) and here (a Stack Overflow post) that I can use the -libjars option to attach the Mahout .jars.
I put the jar files KMeansHadoop.jar, mahout-core-0.9.jar and mahout-math-0.9.jar in the same folder and run:
hadoop jar KMeansHadoop.jar SimpleKMeansClustering -libjars mahout-core-0.9.jar mahout-math-0.9.jar
But I still get the error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/mahout/math/Vector
What am I doing wrong? Thank you!
Firstly, I believe that the -libjars values need to be comma-separated. But that only makes your third-party jars available to the cluster. You may also need to use HADOOP_CLASSPATH to make those jars available on the client side (e.g. on the edge node from which you're kicking off your job).
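A sketch of both points, reusing the jar names from the question (this assumes you run from the folder containing the jars, and that the SimpleKMeansClustering driver parses generic options via ToolRunner):
export HADOOP_CLASSPATH=mahout-core-0.9.jar:mahout-math-0.9.jar
hadoop jar KMeansHadoop.jar SimpleKMeansClustering -libjars mahout-core-0.9.jar,mahout-math-0.9.jar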
Check out this post. It helped me a lot when I was working my way through this exact issue with getting Driven to work with Cascading.

Issue with pseudo mode configuration of Hadoop

I am trying to do a pseudo-distributed mode configuration of Hadoop version 2.0.4. The start-dfs.sh script works fine. However, start-mapred.sh fails to start the jobtracker and tasktracker. Below is the error I am getting. Looking at the error, it seems it is not able to pick up the right jar file. Please let me know if you have any idea about this issue. Thanks.
FATAL org.apache.hadoop.mapred.JobTracker: java.lang.NoSuchMethodError: org/apache/hadoop/mapred/JobACLsManager.<init>(Lorg/apache/hadoop/mapred/JobConf;)V
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:2182)
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1895)
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1889)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:311)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:302)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:297)
at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:4820)
It seems I was using incorrect jars, so first I replaced those. Then I just created a new directory with the Hadoop conf files and formatted the namenode. Finally it worked. :)
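For anyone hitting the same thing, the recovery described amounts to roughly the following (a sketch only: the conf directory path is a placeholder, formatting the namenode erases HDFS metadata, and the MR1-style scripts from the question are assumed):
export HADOOP_CONF_DIR=/path/to/new/conf
hadoop namenode -format
start-dfs.sh
start-mapred.sh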

Using a different hadoop-mapreduce-client-core.jar to run a Hadoop cluster

I'm working on a Hadoop cluster with CDH 4.2.0 installed and ran into this error. It's been fixed in later versions of Hadoop, but I don't have access to update the cluster. Is there a way to tell Hadoop to use this jar when running my job through command line arguments, like:
hadoop jar MyJob.jar -D hadoop.mapreduce.client=hadoop-mapreduce-client-core-2.0.0-cdh4.2.0.jar
where the new mapreduce-client-core.jar file is the patched jar from the ticket. Or must Hadoop be completely recompiled with this new jar? I'm new to Hadoop, so I don't know all the command line options that are possible.
I'm not sure how that would work, since when you're executing the hadoop command you're actually executing code in the client jar.
Can you not use MR1? The ticket says the issue only occurs when you're using MR2, so unless you really need YARN you're probably better off using the MR1 library to run your map/reduce.
