Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException:Input path does not exist: hdfs:host/user/yogesh/WordCount - hadoop

I created the input text file test.txt and put it on HDFS as /user/yogesh/Input/test.txt.
I created the output path on HDFS as /user/yogesh/Output.
I built the jar file locally as /home/yogesh/WordCount.jar and submitted the MR job from the local machine like this: hadoop jar /home/yogesh/WordCount.jar WordCount /user/yogesh/Input/test.txt /user/yogesh/Output/output1
I got the following error:
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException:Input path does not exist: hdfs:host/user/yogesh/WordCount.
hdfs:host/user/yogesh/ is my HDFS directory. I don't understand why this MR job is looking for the code in HDFS, or how to solve this error.

Try prefixing the class name WordCount with its package name, or skip the class name and pass just the jar, input, and output, like this:
hadoop jar /home/yogesh/WordCount.jar /user/yogesh/Input /user/yogesh/Output/output1
Also, make sure that /user/yogesh/Output/output1 does not exist before you run this command. Note that the input path here is a directory rather than a single file: Hadoop will take as input all the files in the specified directory.
For an example, see how the WordCount example is run on this site.
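A likely reason the job looked for /user/yogesh/WordCount as its input: when a jar's manifest already names a Main-Class, hadoop jar hands everything after the jar path to that class as program arguments, so WordCount becomes the first argument (the input path). The sketch below only illustrates that check and is not part of the original answer. The helper names jar_main_class and build_hadoop_jar_cmd are made up, and it assumes unzip is installed.

```shell
# Print the Main-Class entry of a jar's manifest (prints nothing if absent).
jar_main_class() {
  unzip -p "$1" META-INF/MANIFEST.MF 2>/dev/null |
    tr -d '\r' |
    sed -n 's/^Main-Class:[[:space:]]*//p'
}

# Print the "hadoop jar" command line to use: the class name is included
# only when the manifest does not already name a main class.
build_hadoop_jar_cmd() {
  jar=$1; class=$2; input=$3; output=$4
  if [ -n "$(jar_main_class "$jar")" ]; then
    echo "hadoop jar $jar $input $output"
  else
    echo "hadoop jar $jar $class $input $output"
  fi
}
```

For example, build_hadoop_jar_cmd /home/yogesh/WordCount.jar WordCount /user/yogesh/Input /user/yogesh/Output/output1 prints the command without the class name if the manifest already has a Main-Class entry.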

Related

Snappy compressed file on HDFS appears without extension and is not readable

I configured a MapReduce job to save its output as a sequence file compressed with Snappy. The MR job executes successfully; however, in HDFS the output file looks like the following:
I expected the file to have a .snappy extension, i.e. part-r-00000.snappy. Now I think this may be why the file is not readable when I try to read it from the local file system using this pattern: hadoop fs -libjars /path/to/jar/myjar.jar -text /path/in/HDFS/to/my/file
I'm getting –libjars: Unknown command when executing the command:
hadoop fs –libjars /root/hd/metrics.jar -text /user/maria_dev/hd/output/part-r-00000
And when I use the command hadoop fs -text /user/maria_dev/hd/output/part-r-00000, I get the error:
18/02/15 22:01:57 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
-text: Fatal internal error
java.lang.RuntimeException: java.io.IOException: WritableName can't load class: com.hd.metrics.IpMetricsWritable
Caused by: java.lang.ClassNotFoundException: Class com.hd.ipmetrics.IpMetricsWritable not found
Could it be that the absence of the .snappy extension causes the problem? What other command should I try to read the compressed file?
The jar is on my local file system in /root/hd/. Where should I place it to avoid the ClassNotFoundException? Or how should I modify the command?
Instead of hadoop fs –libjars (which has the wrong kind of hyphen and should be -libjars; copy that exactly and you won't see Unknown command), you should use the HADOOP_CLASSPATH environment variable:
export HADOOP_CLASSPATH=/root/hd/metrics.jar:${HADOOP_CLASSPATH}
hadoop fs -text /user/maria_dev/hd/output/part-r-*
The error clearly says ClassNotFoundException: Class com.hd.ipmetrics.IpMetricsWritable not found.
It means that a required library is missing from the classpath.
To clarify your doubts:
MapReduce names its output files part-* by default, and a file extension carries no meaning here. An extension is just metadata, mainly used by the Windows operating system to pick a suitable program for the file. It has no such meaning on Linux/Unix, and the system's behavior will not change even if you rename the file to .snappy (you can try this).
The command looks fine for inspecting the Snappy file, but some required jar file seems to be missing, which causes the ClassNotFoundException.
EDIT 1:
By default, hadoop picks up jar files from the path emitted by the command below:
$ hadoop classpath
By default it lists all the Hadoop core jars.
You can add your jar by running the command below at the prompt:
export HADOOP_CLASSPATH=/path/to/my/custom.jar
After running this, check the classpath again with the hadoop classpath command; you should see your jar listed along with the Hadoop core jars.
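If you add jars this way often, a small helper keeps whatever is already in HADOOP_CLASSPATH and avoids a stray trailing ':' when the variable was empty (in a Java classpath, an empty entry is read as the current directory). This is only a sketch; the function name prepend_hadoop_classpath is made up for illustration.

```shell
# Prepend a jar to HADOOP_CLASSPATH, keeping any existing entries and
# avoiding a trailing ':' when the variable was empty or unset.
prepend_hadoop_classpath() {
  jar=$1
  if [ -n "${HADOOP_CLASSPATH:-}" ]; then
    HADOOP_CLASSPATH="$jar:$HADOOP_CLASSPATH"
  else
    HADOOP_CLASSPATH="$jar"
  fi
  export HADOOP_CLASSPATH
}
```

With the jar from the question, you would run prepend_hadoop_classpath /root/hd/metrics.jar and then hadoop fs -text on the output file as before.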

can't run a MapReduce Job on Hadoop

I'm trying to run a MapReduce job taken from the internet. The job takes a 'points.dat' file as input and runs k-means clustering on it. It should produce a 'centroids.dat' file and a file with the points matched to their own centroids. A couple of months ago this was working, but now I'm trying to re-run it on a new installation.
I ran
bin/hdfs dfs -copyFromLocal ..//..//../home/<myusername>/Downloads/points.dat
Everything is fine and the file appears in the web service tool under the /user// path on HDFS. jps looks OK.
The jar expects these arguments:
<input> <output> <n clusters>
so I ran
bin/hadoop jar ../../../home/<myusername>/Downloads/kmeans.jar /user/<myusername>/ /out 3
It creates a "centroids.dat" file in /user/ and an out/ directory. As far as I can understand, it tries to re-read "centroids.dat" in order to execute, so it ends with failures like
"failed creating symlink /tmp/hadoop-<myusername>/mapred/local/1466809349241/centroids.dat <- /usr/local/hadoop/centroids.dat"
So Java raises a FileNotFoundException.
I tried to keep the question as short as possible. If more information is needed, I'm happy to provide it.
I think you forgot to specify the main class in your command:
bin/hadoop jar kmeans.jar MainClass input output

InvalidJobConfException: Output directory not set

I am using the Cloudera VM for MapReduce practice.
I just created the jar from the default WordCount classes provided by Cloudera.
I am getting this error when I run the MapReduce program. What am I missing?
InvalidJobConfException: Output directory not set.
Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set.
To process data with a MapReduce program you need:
Mapper class
Reducer class
Driver class (main class that runs the MapReduce program)
Input data (path of the input data to analyze)
Output directory (path where the program's output will be stored; this directory must not already exist in HDFS)
From the error, it seems you have not set the output directory path. If the output directory is not set in your code, then you have to pass it at runtime, provided your code accepts it as an argument. Here is a very good step-by-step guide to running a first WordCount program in MapReduce.
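If your driver reads the input and output paths from its arguments, a small wrapper can catch a missing output path before the job is even submitted. This is just a sketch under that assumption, and it also assumes the jar's manifest names the main class (otherwise the class name goes right after the jar). The function name run_with_io_paths is made up.

```shell
# Run a MapReduce jar only when a jar, an input path and an output path
# are all given; otherwise print a usage message and return 2.
run_with_io_paths() {
  if [ "$#" -lt 3 ]; then
    echo "usage: run_with_io_paths <jar> <input> <output>" >&2
    return 2
  fi
  hadoop jar "$1" "$2" "$3"
}
```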

Output Folders for Amazon EMR

I want to run a custom jar whose main class is a chain of MapReduce jobs, with the output of the first job going in as the input of the second job, and so on.
What do I set in FileOutputFormat.setOutputPath("what path should be here?");
If I specify -outputdir in the arguments, I get a "FileAlready exists" error. If I don't specify it, I do not know where the output will land. I want to be able to see the output of every job in the chain.
Thanks in advance. Please help!
You are likely getting the "FileAlready exists" error because that output directory exists prior to the job you are running. Make sure to delete the directories that you specify as output for your Hadoop jobs; otherwise you will not be able to run those jobs.
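One way to do that cleanup before each run is sketched below with hadoop fs -test and hadoop fs -rm -r. The function name clean_output_dir is made up, and since the removal is recursive, double-check the path you pass in.

```shell
# Remove an HDFS output directory if it already exists, so the next job
# run does not fail because the output path is already there.
clean_output_dir() {
  out=$1
  if hadoop fs -test -e "$out"; then
    hadoop fs -rm -r "$out"
  fi
}
```

For a chain of jobs you would call this once per intermediate output path before starting the chain.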
Good practice is to take the output path from the command line: it makes your code more flexible, and you only need to build your jar once as long as the changes are limited to paths.
For EMR, once you have launched your cluster and built your jar, for example:
dfs_ip_folder=HDFS_IP_DIR
dfs_op_folder=HDFS_OP_DIR
hadoop jar hadoop-examples-*.jar wordcount ${dfs_ip_folder} ${dfs_op_folder}
Note: you have to create dfs_ip_folder and store the input data inside it.
dfs_op_folder will be created automatically on HDFS, not on the local file system.
To access the HDFS output folder, you can either copy it to the local file system or cat its files, e.g.:
hadoop fs -cat ${dfs_op_folder}/<file_name>
hadoop fs -copyToLocal ${dfs_op_folder} ${your_local_input_dir_path}

Hadoop MapReduce program not running

I'm new to Hadoop MapReduce. When I'm trying to run my MapReduce code using the following command:
vishal#XXXX bin/hadoop jar /user/vishal/WordCount com.WordCount.java /user/vishal/file01 /user/vishal/output.
It displays the following output:
Exception in thread "main" java.io.IOException: Error opening job jar: /user/vishal/WordCount.jar
at org.apache.hadoop.util.RunJar.main(RunJar.java:130)
Caused by: java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(ZipFile.java:131)
at java.util.jar.JarFile.<init>(JarFile.java:150)
at java.util.jar.JarFile.<init>(JarFile.java:87)
at org.apache.hadoop.util.RunJar.main(RunJar.java:128)
How can I fix this error?
Your command asks Hadoop to run a JAR but specifies a directory instead.
You have also added '.java' to the class name, which is not required. (This assumes you have written the package name, com.WordCount, correctly.)
First build the jar as /user/vishal/WordCount.jar (make sure this is a local path, not HDFS), then run the command without the '.java' at the end of the class name. Also, you put a dot at the end of the command in your question; I hope that isn't there in the real command.
bin/hadoop jar /user/vishal/WordCount.jar com.WordCount /user/vishal/file01 /user/vishal/output
See the Hadoop tutorial's 'Usage' section for more.
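To rule out the path problem before calling hadoop jar, you can check that the jar path is a regular file on the local file system. A minimal sketch follows; the function name check_local_jar is made up, and it only checks the path, not whether the file is a valid zip archive.

```shell
# Return 0 if the given path is a regular local file, 1 otherwise,
# with a short explanation on stderr.
check_local_jar() {
  jar=$1
  if [ -d "$jar" ]; then
    echo "$jar is a directory, not a jar file" >&2
    return 1
  fi
  if [ ! -f "$jar" ]; then
    echo "$jar does not exist on the local file system" >&2
    return 1
  fi
  return 0
}
```

For example: check_local_jar /user/vishal/WordCount.jar && bin/hadoop jar /user/vishal/WordCount.jar com.WordCount /user/vishal/file01 /user/vishal/output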

Resources