Snappy compressed file on HDFS appears without extension and is not readable - bash

I configured a MapReduce job to save its output as a SequenceFile compressed with Snappy. The MR job executes successfully, but in HDFS the output file shows up simply as part-r-00000, with no extension.
I expected the file to have a .snappy extension, i.e. to be named part-r-00000.snappy, and I now suspect this may be why the file is not readable when I try to read it with a command of this form: hadoop fs -libjars /path/to/jar/myjar.jar -text /path/in/HDFS/to/my/file
So I'm getting –libjars: Unknown command when executing the command:
hadoop fs –libjars /root/hd/metrics.jar -text /user/maria_dev/hd/output/part-r-00000
And when I'm using this command hadoop fs -text /user/maria_dev/hd/output/part-r-00000, I'm getting the error:
18/02/15 22:01:57 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
-text: Fatal internal error
java.lang.RuntimeException: java.io.IOException: WritableName can't load class: com.hd.metrics.IpMetricsWritable
Caused by: java.lang.ClassNotFoundException: Class com.hd.ipmetrics.IpMetricsWritable not found
Could it be that the absence of the .snappy extension causes the problem? What other command should I try to read the compressed file?
The jar is on my local file system at /root/hd/. Where should I place it to avoid the ClassNotFoundException, or how should I modify the command?

Instead of hadoop fs –libjars (which actually has the wrong hyphen character and should be -libjars; copy that exactly and you won't see Unknown command), you should be using the HADOOP_CLASSPATH environment variable:
export HADOOP_CLASSPATH=/root/hd/metrics.jar:${HADOOP_CLASSPATH}
hadoop fs -text /user/maria_dev/hd/output/part-r-*

The error clearly says ClassNotFoundException: Class com.hd.ipmetrics.IpMetricsWritable not found.
It means that a required library is missing from the classpath.
To clarify your doubts:
MapReduce names its output files part-* by default, and the extension has no meaning there. Remember that a file extension is just metadata, usually needed by the Windows operating system to determine a suitable program for the file. It has no meaning in Linux/Unix, and the system's behaviour is not going to change even if you rename the file to .snappy (you may actually try this).
The command looks absolutely fine for inspecting the Snappy file, but it seems that some required jar file is not there, which is causing the ClassNotFoundException.
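If you want to convince yourself, here is a quick sketch using the paths from your question (hadoop fs -mv simply renames the file in place):
hadoop fs -mv /user/maria_dev/hd/output/part-r-00000 /user/maria_dev/hd/output/part-r-00000.snappy
hadoop fs -text /user/maria_dev/hd/output/part-r-00000.snappy
The -text output (including the ClassNotFoundException) will be exactly the same, because the codec is read from the SequenceFile header, not from the file name.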
EDIT 1:
By default Hadoop picks up jar files from the path emitted by the command below:
$ hadoop classpath
By default it lists all the Hadoop core jars.
You can add your own jar by executing the command below at the prompt:
export HADOOP_CLASSPATH=/path/to/my/custom.jar
After executing this, check the classpath again with the hadoop classpath command; you should see your jar listed along with the Hadoop core jars.
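For your specific case, a minimal sketch (jar path and output path taken from your question) to confirm the jar is picked up before re-running -text:
export HADOOP_CLASSPATH=/root/hd/metrics.jar:${HADOOP_CLASSPATH}
hadoop classpath | tr ':' '\n' | grep metrics.jar   # the custom jar should now be listed
hadoop fs -text /user/maria_dev/hd/output/part-r-00000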

Related

sqoop2 not finding log4j2 from hadoop

I am trying to install sqoop2 (1.99.7) on my Ubuntu server, following the instructions provided on the Apache website here. I have a working Hadoop installation, and I have downloaded and extracted the Sqoop archive to /usr/local/sqoop:
tar -xvf sqoop-1.99.7-bin-hadoop200.tar.gz
mv sqoop-1.99.7-bin-hadoop200 /usr/local/sqoop
I believe I have all the environment variables defined, in particular HADOOP_HOME, which, as I understand it, tells Sqoop where to look for the Hadoop jar files.
However, when I try to verify installation with sqoop2-tool verify I get the following output.
Setting conf dir: /usr/local/sqoop/bin/../conf
Sqoop home directory: /usr/local/sqoop
Sqoop tool executor:
Version: 1.99.7
Revision: 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb
Compiled on Tue Jul 19 16:08:27 PDT 2016 by abefine
ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.
Running tool: class org.apache.sqoop.tools.tool.VerifyTool
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
at org.apache.sqoop.security.authentication.SimpleAuthenticationHandler.secureLogin(SimpleAuthenticationHandler.java:36)
at org.apache.sqoop.security.AuthenticationManager.initialize(AuthenticationManager.java:98)
at org.apache.sqoop.core.SqoopServer.initialize(SqoopServer.java:57)
at org.apache.sqoop.tools.tool.VerifyTool.runTool(VerifyTool.java:36)
at org.apache.sqoop.tools.ToolRunner.main(ToolRunner.java:72)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 5 more
Somehow, it is failing to find the log4j2 configuration file. I'm not sure why this is the case.
This question is similar to the one here, but the solution provided does not help. Even if I modify the sqoop.properties file and point directly to the Hadoop configuration directory /usr/local/hadoop/etc/hadoop (which is where my core-site.xml, hdfs-site.xml, etc. are located), I continue to get the error above.
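For reference, the change I made was along these lines (assuming the stock Sqoop2 property key in conf/sqoop.properties):
# in /usr/local/sqoop/conf/sqoop.properties
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/usr/local/hadoop/etc/hadoop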
EDIT
Output of grep -r "org.apache.hadoop.conf.Configuration" /usr/local/hadoop | grep jar
Binary file /usr/local/hadoop/share/hadoop/common/sources/hadoop-common-2.8.0-sources.jar matches
Binary file /usr/local/hadoop/share/hadoop/common/hadoop-common-2.8.0.jar matches
Binary file /usr/local/hadoop/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/hadoop-common-2.8.0.jar matches
Binary file /usr/local/hadoop/share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/hadoop-common-2.8.0.jar matches
sqoop.properties is a Java properties file; environment variables should be defined in sqoop-env.sh or set with the export command.
Can you try executing the environment variable export commands below before executing the sqoop command? If it works, you can add these commands to the sqoop-env.sh environment file.
export HADOOP_PREFIX=/usr/local/hadoop
export HADOOP_HDFS_HOME=/usr/local/hadoop
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_YARN_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
Make sure /usr/local/hadoop is correct.
Edit -
If you look at the last line of the sqoop command, it's a bash script that uses the hadoop command internally to invoke the Sqoop class, so all Hadoop-related libs will be loaded into the Sqoop environment if the HADOOP_COMMON_HOME environment variable is correct.
Are you able to execute hadoop commands on this server? Can you share the output of ${HADOOP_COMMON_HOME}/bin/hadoop fs -ls /? If this works, the error could be a compatibility issue - the Sqoop version may not be compatible with your Hadoop.
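For example, a quick check (using the paths assumed above) before re-running the verify tool:
${HADOOP_COMMON_HOME}/bin/hadoop fs -ls /
/usr/local/sqoop/bin/sqoop2-tool verify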

Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException:Input path does not exist: hdfs:host/user/yogesh/WordCount

I have created the input text file test.txt and put it into HDFS as /user/yogesh/Input/test.txt
Created the output path on HDFS as /user/yogesh/Output
Created the jar file locally at /home/yogesh/WordCount.jar and submitted the MR job from the local machine, like this: hadoop jar /home/yogesh/WordCount.jar WordCount /user/yogesh/Input/test.txt /user/yogesh/Output/output1
I got the following error:
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException:Input path does not exist: hdfs:host/user/yogesh/WordCount.
hdfs:host/user/yogesh/ is my HDFS directory. I am not able to understand why this MR job is looking for the code in HDFS, or how to solve this error.
Try giving the package name of the WordCount class as its prefix, or just skip the class and pass only the jar, input, and output, like this:
hadoop jar /home/yogesh/WordCount.jar /user/yogesh/Input /user/yogesh/Output/output1
Also, make sure that /user/yogesh/Output/output1 does not exist before running this command, and note that you should give an input directory rather than an input file: Hadoop will take all the files in the specified directory as input.
For an example, see how the WordCount example is run on this site.
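For instance, if the class is declared in a package (the package name below is purely hypothetical), the fully qualified name must be used:
hadoop jar /home/yogesh/WordCount.jar com.example.WordCount /user/yogesh/Input /user/yogesh/Output/output1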

Unable to run Hadoop on windows 7

I am new to Hadoop and trying to run it on Windows 7.
Whenever I try to run the hadoop bash script, I get the following error:
'-Xmx32m' is not recognized as an internal or external command,
operable program or batch file.
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
fs run a generic filesystem user client
version print the version
jar <jar> run a jar file
checknative [-a|-h] check native hadoop and compression libraries availability
distcp <srcurl> <desturl> copy file or directories recursively
archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
classpath prints the class path needed to get the
Hadoop jar and the required libraries
credential interact with credential providers
key manage keys via the KeyProvider
daemonlog get/set the log level for each daemon
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
Also, when I run the hdfs command, I get the following error:
-Xms1000m is not recognized as an internal or external command.
When I try to pass -Xmx and -Xms arguments, I get the following message:
Error occurred during initialization of VM
Could not reserve enough space for object heap
Can anyone help me out with this?
The error message
is not recognized as an internal or external command
indicates that you attempted to run from the command line a program that Windows doesn't recognize. This likely has nothing to do with -Xms and -Xmx; the problem is that Windows cannot find java.
Make sure you can run java -version no matter which folder you are currently in. If you can't, you need to add java to the PATH environment variable.
This could also be an issue of installing Java or Hadoop in a folder that has spaces in its path, e.g. C:\Program Files has a space in it and that can be a problem. If that's the cause, install Java and Hadoop in a different folder whose path contains no spaces.
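As a rough sketch, assuming the Cygwin/Git Bash style shell implied by the hadoop bash script (the JDK path below is purely illustrative):
java -version || echo "java is not on the PATH"
export JAVA_HOME="/cygdrive/c/Java/jdk1.8.0_121"   # hypothetical install dir with no spaces in the path
export PATH="${JAVA_HOME}/bin:${PATH}"
java -version                                      # should now print the JDK version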

Output Folders for Amazon EMR

I want to run a custom jar whose main class runs a chain of MapReduce jobs, with the output of the first job going in as the input of the second job, and so on.
What do I set in FileOutputFormat.setOutputPath("what path should be here?");
If I specify -outputdir in the argument, I get the FileAlreadyExists error. If I don't specify one, then I do not know where the output will land. I want to be able to see the output from every job of the chained MapReduce jobs.
Thanks in advance. Please help!
You are likely getting the "FileAlreadyExists" error because that output directory exists prior to the job you are running. Make sure to delete the directories that you specify as output for your Hadoop jobs; otherwise you will not be able to run those jobs.
Good practice is to take the output path from the command line, as it increases the flexibility of your code and you will need to compile your jar only once, provided the changes are related only to your paths.
For example, on EMR, once you launch your cluster and compile your jar:
dfs_ip_folder=HDFS_IP_DIR
dfs_op_folder=HDFS_OP_DIR
hadoop jar hadoop-examples-*.jar wordcount ${dfs_ip_folder} ${dfs_op_folder}
Note: you have to create dfs_ip_folder and store the input data inside it.
dfs_op_folder will be created automatically on HDFS, not on the local file system.
To access the HDFS output folder you can either copy it to the local file system or cat it, e.g.:
hadoop fs -cat ${dfs_op_folder}/<file_name>
hadoop fs -copyToLocal ${dfs_op_folder} ${your_local_input_dir_path}
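The same idea extends to a chain of jobs: give each job its own output directory on the command line and clear them before the run (all names below are illustrative):
ip_dir=/user/hadoop/input
step1_out=/user/hadoop/out/step1
step2_out=/user/hadoop/out/step2
hadoop fs -rm -r -f ${step1_out} ${step2_out}      # output paths must not already exist
hadoop jar my-chain.jar FirstJob ${ip_dir} ${step1_out}
hadoop jar my-chain.jar SecondJob ${step1_out} ${step2_out}
hadoop fs -cat ${step2_out}/part-*                 # step1_out keeps the intermediate output for inspection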

Hadoop Map-Reduce program not running

I'm new to Hadoop MapReduce. When I try to run my MapReduce code using the following command:
vishal#XXXX bin/hadoop jar /user/vishal/WordCount com.WordCount.java /user/vishal/file01 /user/vishal/output.
It displays the following output:
Exception in thread "main" java.io.IOException: Error opening job jar: /user/vishal/WordCount.jar
at org.apache.hadoop.util.RunJar.main(RunJar.java:130)
Caused by: java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(ZipFile.java:131)
at java.util.jar.JarFile.<init>(JarFile.java:150)
at java.util.jar.JarFile.<init>(JarFile.java:87)
at org.apache.hadoop.util.RunJar.main(RunJar.java:128)
How can I fix this error?
Your command is asking Hadoop to run a JAR but is specifying a directory instead.
You have also added '.java' to the class name, which is not required. (This is assuming you have written the package name, com.WordCount, correctly).
First build the jar at /user/vishal/WordCount.jar (ensure this is a local path, not HDFS), then run the command without the '.java' at the end of the class name. Also, you put a dot at the end of the command in your question; I hope that isn't there in the real command.
bin/hadoop jar /user/vishal/WordCount.jar com.WordCount /user/vishal/file01 /user/vishal/output
See the Hadoop tutorial's 'Usage' section for more.
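A minimal build-and-run sketch, assuming the class really is com.WordCount and the source file sits in the current directory (the compile steps are illustrative, not taken from your question):
mkdir -p wordcount_classes
javac -classpath "$(hadoop classpath)" -d wordcount_classes WordCount.java
jar cf /user/vishal/WordCount.jar -C wordcount_classes .
bin/hadoop jar /user/vishal/WordCount.jar com.WordCount /user/vishal/file01 /user/vishal/output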
