Not a Valid Jar When Running Hadoop Job - hadoop

I want to run the WordCount example.
In Eclipse it runs correctly and the output file is present in the output folder.
I made a jar file of WordCount and want to run it with the command
hadoop jar WordCount.jar /Projects/input /Projects/output
but it gives me this error:
Not a valid JAR: /Projects/WordCount.jar
This is the result of hdfs dfs -ls /Projects:
Found 3 items
-rw-r--r-- 1 hduser supergroup 3418 2014-11-02 15:38 /Projects/WordCount.jar
drwxr-xr-x - hduser supergroup 0 2014-11-02 14:13 /Projects/input
drwxr-xr-x - hduser supergroup 0 2014-11-02 14:16 /Projects/output
It gives me the same error with this as well:
hadoop jar /Projects/WordCount.jar wordPackage.WordCount /Projects/input /Projects/output
Not a valid JAR: /Projects/WordCount.jar
How can I solve this error?
I have run the jar -tvf command and it gives this output:
jar -tvf /home/hduser/Desktop/Files/WordCount.jar
60 Sun Nov 02 16:10:10 PKT 2014 META-INF/MANIFEST.MF
1895 Sun Nov 02 14:02:38 PKT 2014 wordPackage/WordCount.class
1295 Sun Nov 02 14:02:38 PKT 2014 wordPackage/WordCount.java
2388 Sun Nov 02 14:02:06 PKT 2014 wordPackage/WordReducer.class
707 Sun Nov 02 14:02:06 PKT 2014 wordPackage/WordReducer.java
2203 Sun Nov 02 14:02:08 PKT 2014 wordPackage/WordMapper.class
713 Sun Nov 02 14:02:06 PKT 2014 wordPackage/WordMapper.java
16424 Sun Nov 02 13:50:00 PKT 2014 .classpath
420 Sun Nov 02 13:50:00 PKT 2014 .project

You cannot keep the jar in HDFS when executing it with the hadoop command; the jar should be available on a local path.
If the jar is not runnable (no main class set in its manifest), try the following, specifying package.mainclass explicitly:
hadoop jar /home/hduser/Desktop/Files/WordCount.jar wordPackage.WordCount /Projects/input /Projects/output
If the jar is runnable, the following can be used:
hadoop jar /home/hduser/Desktop/Files/WordCount.jar /Projects/input /Projects/output
If the issue still persists, rebuild the jar (WordCount.jar) in Eclipse.
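If your only copy of the jar is the one in HDFS, here is a small sketch of pulling it down to the local filesystem first and then running it from there (the local destination path simply mirrors the one used above):
hdfs dfs -get /Projects/WordCount.jar /home/hduser/Desktop/Files/WordCount.jar
hadoop jar /home/hduser/Desktop/Files/WordCount.jar wordPackage.WordCount /Projects/input /Projects/output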

If the issue still persists even after rebuilding the jar, make sure you have given it execute permission (+x, e.g. chmod 755) before you run the command. In my case this was the reason for the issue.
Command:
chmod +x jarname.jar

I faced the same issue. In my case, I had written the Java code against the Hadoop 1.x libraries and tried to execute it on 2.x. Initially I got the same error in the terminal. Then I tried navigating to the directory containing my jar file and running it from there, and it worked.
Maybe you can try the following (after navigating to the jar file's path):
hadoop jar WordCount.jar wordPackage.WordCount /Projects/input /Projects/output
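If rebuilding in Eclipse keeps producing a jar without a usable entry point, a rough alternative sketch is to package the compiled classes yourself and set the main class in the manifest (this assumes the compiled .class files sit under a bin/ directory, which is a guess about your project layout):
jar cfe WordCount.jar wordPackage.WordCount -C bin wordPackage
hadoop jar WordCount.jar /Projects/input /Projects/output
Packaging only the wordPackage directory also keeps Eclipse metadata such as .classpath and .project out of the jar.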

Related

Not able to configure Databricks with external Hive metastore

I am following this document to connect to my external Hive metastore: https://docs.databricks.com/data/metastores/external-hive-metastore.html#spark-configuration-options
My metastore version is 3.1.0 and I followed the document.
I am getting this error when trying to connect to the external Hive metastore:
org/apache/hadoop/hive/conf/HiveConf when creating Hive client using classpath:
Please make sure that jars for your version of hive and hadoop are included in the paths passed to spark.sql.hive.metastore.jars
spark.sql.hive.metastore.jars=/databricks/hive_metastore_jars/*
When I do an ls on /databricks/hive_metastore_jars/, I can see all copied files
Do I need to copy any Hive-specific files and upload them to this folder?
I did exactly what was mentioned on the site.
These are the contents of my hive_metastore_jars:
total 56K
drwxr-xr-x 3 root root 4.0K Mar 24 05:06 1585025573715-0
drwxr-xr-x 2 root root 4.0K Mar 24 05:06 d596a6ec-e105-4a6e-af95-df3feffc263d_resources
drwxr-xr-x 3 root root 4.0K Mar 24 05:06 repl
drwxr-xr-x 2 root root 4.0K Mar 24 05:06 spark-2959157d-2018-441a-a7d3-d7cecb8a645f
drwxr-xr-x 4 root root 4.0K Mar 24 05:06 root
drwxr-xr-x 2 root root 4.0K Mar 24 05:06 spark-30a72ee5-304c-432b-9c13-0439511fb0cd
drwxr-xr-x 2 root root 4.0K Mar 24 05:06 spark-a19d167b-d571-4e58-a961-d7f6ced3d52f
-rwxr-xr-x 1 root root 5.5K Mar 24 05:06 _CleanRShell.r3763856699176668909resource.r
-rwxr-xr-x 1 root root 9.7K Mar 24 05:06 _dbutils.r9057087446822479911resource.r
-rwxr-xr-x 1 root root 301 Mar 24 05:06 _rServeScript.r1949348184439973964resource.r
-rwxr-xr-x 1 root root 1.5K Mar 24 05:06 _startR.sh5660449951005543051resource.r
Am I missing anything?
Strangely, if I look into the cluster boot logs, here is what I get:
20/03/24 07:29:05 INFO Persistence: Property spark.hadoop.javax.jdo.option.ConnectionDriverName unknown - will be ignored
20/03/24 07:29:05 INFO Persistence: Property spark.hadoop.javax.jdo.option.ConnectionURL unknown - will be ignored
20/03/24 07:29:05 INFO Persistence: Property spark.hadoop.javax.jdo.option.ConnectionUserName unknown - will be ignored
20/03/24 07:29:05 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
20/03/24 07:29:05 INFO Persistence: Property spark.hadoop.javax.jdo.option.ConnectionPassword unknown - will be ignored
20/03/24 07:29:05 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
20/03/24 07:29:05 INFO Persistence: Property datanucleus.schema.autoCreateAll unknown - will be ignored
20/03/24 07:29:09 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
20/03/24 07:29:09 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
I have already set the above configurations, and they show up in the logs as well:
20/03/24 07:28:59 INFO SparkContext: Spark configuration:
spark.hadoop.javax.jdo.option.ConnectionDriverName=org.mariadb.jdbc.Driver
spark.hadoop.javax.jdo.option.ConnectionPassword=*********(redacted)
spark.hadoop.javax.jdo.option.ConnectionURL=*********(redacted)
spark.hadoop.javax.jdo.option.ConnectionUserName=*********(redacted)
Also, version information is available in my Hive metastore; I can connect to MySQL and see it. It shows:
SCHEMA_VERSION : 3.1.0
VER_ID = 1
From the output, it looks like the jars are not copied to the "/databricks/hive_metastore_jars/" location. As mentioned in the documentation link you shared:
Set spark.sql.hive.metastore.jars to maven.
Restart the cluster with the above configuration and then check the Spark driver logs for the message:
17/11/18 22:41:19 INFO IsolatedClientLoader: Downloaded metastore jars to <path>
From this location, copy the jars to DBFS from the same cluster, and then use an init script to copy the jars from DBFS to /databricks/hive_metastore_jars/.
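A minimal sketch of such an init script, assuming the jars were staged at a hypothetical DBFS path dbfs:/hive_metastore_jars (visible as /dbfs/hive_metastore_jars on the cluster):
#!/bin/bash
# copy previously staged Hive metastore jars from DBFS into the expected local path
mkdir -p /databricks/hive_metastore_jars
cp -r /dbfs/hive_metastore_jars/* /databricks/hive_metastore_jars/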
As I am using Azure MySQL, there is one more step I need to perform:
https://learn.microsoft.com/en-us/azure/databricks/data/metastores/external-hive-metastore

How to solve this: cannot find and load main class org.apache.spark.deploy.history.HistoryServer

I'm new to Spark. When I install Spark on Cloudera, I get this error:
Error: Could not find or load main class org.apache.spark.deploy.history.HistoryServer
The last lines of the log file are:
/user/spark/applicationHistory
Thu Sep 10 23:43:50 CST 2015
Thu Sep 10 23:43:50 CST 2015: Detected CDH_VERSION of [5]
Thu Sep 10 23:43:50 CST 2015: Starting Spark History Server
Error: Could not find or load main class org.apache.spark.deploy.history.HistoryServer
My cloudera manager version is 5.2.5 and CDH version is 5.0.0.
I have installed YARN, HDFS and ZooKeeper, and they work well so far.
And there is a spark jar (spark-assembly_2.10-0.9.0-cdh5.0.0-hadoop2.3.0-cdh5.0.0.jar) at:
/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/assembly/lib/
Thanks!
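One quick check (just a sketch, reusing the parcel path from the question) is whether the HistoryServer class is present in that assembly jar at all:
jar -tf /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/assembly/lib/spark-assembly_2.10-0.9.0-cdh5.0.0-hadoop2.3.0-cdh5.0.0.jar | grep HistoryServer
If grep returns nothing, the Spark build shipped with that parcel simply does not contain the class the History Server role is trying to start.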

Namenode HA (UnknownHostException: nameservice1)

We enabled NameNode High Availability through Cloudera Manager, using
Cloudera Manager >> HDFS >> Action > Enable High Availability >> Selected Stand By Namenode & Journal Nodes
and chose nameservice1 as the nameservice name.
Once the whole process completed, we deployed the client configuration.
We tested from a client machine by listing HDFS directories (hadoop fs -ls /), then manually failed over to the standby NameNode and listed the HDFS directories again (hadoop fs -ls /). This test worked perfectly.
But when I ran a Hadoop sleep job using the following command, it failed:
$ hadoop jar /opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop-0.20-mapreduce/hadoop-examples.jar sleep -m 1 -r 0
java.lang.IllegalArgumentException: java.net.UnknownHostException: nameservice1
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:414)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:164)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:129)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:448)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:410)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:128)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2308)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:87)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2342)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2324)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:351)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:194)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:103)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:980)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:974)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:974)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:948)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1410)
at org.apache.hadoop.examples.SleepJob.run(SleepJob.java:174)
at org.apache.hadoop.examples.SleepJob.run(SleepJob.java:237)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.examples.SleepJob.main(SleepJob.java:165)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.net.UnknownHostException: nameservice1
... 37 more
I don't know why it is not able to resolve nameservice1 even after deploying the client configuration.
When I googled this issue I found only one suggested solution:
add the entries below to the configuration to fix the issue:
dfs.nameservices=nameservice1
dfs.ha.namenodes.nameservice1=namenode1,namenode2
dfs.namenode.rpc-address.nameservice1.namenode1=ip-10-118-137-215.ec2.internal:8020
dfs.namenode.rpc-address.nameservice1.namenode2=ip-10-12-122-210.ec2.internal:8020
dfs.client.failover.proxy.provider.nameservice1=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
My impression was that Cloudera Manager takes care of this. I checked the client for this configuration and it was there (/var/run/cloudera-scm-agent/process/1998-deploy-client-config/hadoop-conf/hdfs-site.xml).
Also some more details of the config files:
[11:22:37 root#datasci01.dev:~]# ls -l /etc/hadoop/conf.cloudera.*
/etc/hadoop/conf.cloudera.hdfs:
total 16
-rw-r--r-- 1 root root 943 Jul 31 09:33 core-site.xml
-rw-r--r-- 1 root root 2546 Jul 31 09:33 hadoop-env.sh
-rw-r--r-- 1 root root 1577 Jul 31 09:33 hdfs-site.xml
-rw-r--r-- 1 root root 314 Jul 31 09:33 log4j.properties
/etc/hadoop/conf.cloudera.hdfs1:
total 20
-rwxr-xr-x 1 root root 233 Sep 5 2013 container-executor.cfg
-rw-r--r-- 1 root root 1890 May 21 15:48 core-site.xml
-rw-r--r-- 1 root root 2546 May 21 15:48 hadoop-env.sh
-rw-r--r-- 1 root root 1577 May 21 15:48 hdfs-site.xml
-rw-r--r-- 1 root root 314 May 21 15:48 log4j.properties
/etc/hadoop/conf.cloudera.mapreduce:
total 20
-rw-r--r-- 1 root root 1032 Jul 31 09:33 core-site.xml
-rw-r--r-- 1 root root 2775 Jul 31 09:33 hadoop-env.sh
-rw-r--r-- 1 root root 1450 Jul 31 09:33 hdfs-site.xml
-rw-r--r-- 1 root root 314 Jul 31 09:33 log4j.properties
-rw-r--r-- 1 root root 2446 Jul 31 09:33 mapred-site.xml
/etc/hadoop/conf.cloudera.mapreduce1:
total 24
-rwxr-xr-x 1 root root 233 Sep 5 2013 container-executor.cfg
-rw-r--r-- 1 root root 1979 May 16 12:20 core-site.xml
-rw-r--r-- 1 root root 2775 May 16 12:20 hadoop-env.sh
-rw-r--r-- 1 root root 1450 May 16 12:20 hdfs-site.xml
-rw-r--r-- 1 root root 314 May 16 12:20 log4j.properties
-rw-r--r-- 1 root root 2446 May 16 12:20 mapred-site.xml
[11:23:12 root#datasci01.dev:~]#
I suspect it is an issue with the old configuration in /etc/hadoop/conf.cloudera.hdfs1 and /etc/hadoop/conf.cloudera.mapreduce1, but I am not sure.
It looks like /etc/hadoop/conf/* never got updated:
# ls -l /etc/hadoop/conf/
total 24
-rwxr-xr-x 1 root root 233 Sep 5 2013 container-executor.cfg
-rw-r--r-- 1 root root 1979 May 16 12:20 core-site.xml
-rw-r--r-- 1 root root 2775 May 16 12:20 hadoop-env.sh
-rw-r--r-- 1 root root 1450 May 16 12:20 hdfs-site.xml
-rw-r--r-- 1 root root 314 May 16 12:20 log4j.properties
-rw-r--r-- 1 root root 2446 May 16 12:20 mapred-site.xml
Does anyone have any idea about this issue?
It looks like you are using the wrong client configuration in the /etc/hadoop/conf directory. Sometimes Cloudera Manager's (CM) deploy client configuration option may not work.
As you have enabled NN HA, you should have valid core-site.xml and hdfs-site.xml files in your Hadoop client configuration directory. To get the valid site files, go to the HDFS service in CM and choose Download Client Configuration from the Actions button. You will get the configuration files as a zip archive; extract it and replace /etc/hadoop/conf/core-site.xml and /etc/hadoop/conf/hdfs-site.xml with the extracted core-site.xml and hdfs-site.xml files.
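A rough sketch of that replacement on the client, assuming the downloaded archive is named hdfs-clientconfig.zip and unpacks into a hadoop-conf directory (the actual names may differ):
unzip hdfs-clientconfig.zip -d /tmp/hdfs-clientconfig
cp /tmp/hdfs-clientconfig/hadoop-conf/core-site.xml /etc/hadoop/conf/core-site.xml
cp /tmp/hdfs-clientconfig/hadoop-conf/hdfs-site.xml /etc/hadoop/conf/hdfs-site.xml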
Got it resolved. The wrong config was linked: "/etc/hadoop/conf/" --> "/etc/alternatives/hadoop-conf/" --> "/etc/hadoop/conf.cloudera.mapreduce1"
It has to be "/etc/hadoop/conf/" --> "/etc/alternatives/hadoop-conf/" --> "/etc/hadoop/conf.cloudera.mapreduce" instead.
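A minimal sketch of repointing that link by hand (on a CM-managed host, update-alternatives would normally be the cleaner way; treat this as an assumption about your alternatives setup):
ln -sfn /etc/hadoop/conf.cloudera.mapreduce /etc/alternatives/hadoop-conf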
The statement below in my code resolved the problem by specifying the host and port:
val dfs = sqlContext.read.json("hdfs://localhost:9000//user/arvindd/input/employee.json")
I resolved this issue by putting the complete URI in the line that creates the RDD:
myfirstrdd = sc.textFile("hdfs://192.168.35.132:8020/BUPA.txt")
and then I was able to do other RDD transformations. Make sure you have read/write/execute permission on the file, or you can do chmod 777.
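If the HA client configuration is in place, a quick sanity check (just a sketch) is to address the nameservice directly from the shell instead of hard-coding a NameNode host and port:
hadoop fs -ls hdfs://nameservice1/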

Subclipse not recognizing my JavaHL

I keep getting the following error:
Failed to load JavaHL Library.
These are the errors that were encountered:
no libsvnjavahl-1 in java.library.path
no svnjavahl-1 in java.library.path
no svnjavahl in java.library.path
java.library.path = "/usr/lib/x86_64-linux-gnu/jni"
Although the library path is correct:
user#localhost /usr/lib/x86_64-linux-gnu/jni $ ls -l
total 336
lrwxrwxrwx 1 root root 24 Apr 6 02:06 libatk-wrapper.so -> libatk-wrapper.so.0.0.18
lrwxrwxrwx 1 root root 24 Apr 6 02:06 libatk-wrapper.so.0 -> libatk-wrapper.so.0.0.18
-rw-r--r-- 1 root root 85168 Sep 20 2012 libatk-wrapper.so.0.0.18
lrwxrwxrwx 1 root root 23 Sep 28 2012 libsvnjavahl-1.so -> libsvnjavahl-1.so.0.0.0
lrwxrwxrwx 1 root root 23 Sep 28 2012 libsvnjavahl-1.so.0 -> libsvnjavahl-1.so.0.0.0
-rw-r--r-- 1 root root 256104 Sep 28 2012 libsvnjavahl-1.so.0.0.0
The above was installed with apt-get install libsvn-java on Ubuntu 12.10 (basically this package).
The installed version of svn is 1.7.5.
The installed version of subclipse is 1.8.19.
I understand that the required svn version for subclipse 1.8.x to work is 1.7.x.
How can I make subclipse recognize my installed JavaHL library?
Okay I have found it...
The problem was in my eclipse.ini file, which looked like this:
-startup
plugins/org.eclipse.equinox.launcher_1.3.0.v20120522-1813.jar
--launcher.library
plugins/org.eclipse.equinox.launcher.gtk.linux.x86_64_1.1.200.v20120913-144807
-showsplash
org.eclipse.platform
--launcher.XXMaxPermSize
256m
--launcher.defaultAction
openFile
-Xms40m
-Xmx512m
-vmargs
-Djava.library.path="/usr/lib/x86_64-linux-gnu/jni"
I had to remove the extra quotes, changing -Djava.library.path="/usr/lib/x86_64-linux-gnu/jni" to -Djava.library.path=/usr/lib/x86_64-linux-gnu/jni.
That fixed it.

Error using DBD::Oracle with instantclient 11.2

I want to use the Perl module DBD::Oracle on Mac OS X 10.8.
I installed DBI through CPAN.
Downloaded the Oracle instant client 11.2 (basic, sqlplus and jdk).
Extracted it to /usr/local/oracle.
$ ls /usr/local/oracle/instantclient_11_2/
BASIC_README libnnz11.dylib ojdbc6.jar
SQLPLUS_README libocci.dylib.11.1 sdk
adrci libociei.dylib sqlplus
genezi libocijdbc11.dylib uidrvci
glogin.sql libsqlplus.dylib xstreams.jar
libclntsh.dylib libsqlplusic.dylib
libclntsh.dylib.11.1 ojdbc5.jar
Then I installed DBD::Oracle.
Now when I try to use DBD::Oracle, it gives an error:
install_driver(Oracle) failed: Can't load '/Library/Perl/5.12/darwin-thread-multi-
2level/auto/DBD/Oracle/Oracle.bundle' for module DBD::Oracle:
dlopen(/Library/Perl/5.12/darwin-thread-multi-2level/auto/DBD/Oracle/Oracle.bundle, 1):
Library not loaded: /ade/b/2649109290/oracle/rdbms/lib/libclntsh.dylib.11.1
Referenced from: /Library/Perl/5.12/darwin-thread-multi-
2level/auto/DBD/Oracle/Oracle.bundle
Reason: image not found at /System/Library/Perl/5.12/darwin-thread-multi-
2level/DynaLoader.pm line 204.
I have DYLD_LIBRARY_PATH=/usr/local/oracle/instaclient_11_2
I have no clue what I am doing wrong.
SOLVED:
I got the same error when trying to run sqlplus. I added my Oracle client directory to my global PATH variable and it is working now.
I saw they did the same in this tutorial: http://www.janhellevik.no/?p=521
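A sketch of what that looks like in the shell, assuming the Instant Client was extracted to /usr/local/oracle/instantclient_11_2 as described above:
export PATH=/usr/local/oracle/instantclient_11_2:$PATH
export DYLD_LIBRARY_PATH=/usr/local/oracle/instantclient_11_2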
Check this one out: perl DBD::Oracle Module installation
The above information is about a Linux environment, but you might get some clues on how to go about configuring DBD::Oracle on Mac OS X.
On Linux, you need the Basic, SQLPlus and Devel packages of the Oracle Instant Client. Here is the directory listing on my Linux box:
# pwd
/usr/lib/oracle/11.2/client64/lib
# ls -l
total 185232
-rw-r--r-- 1 root root 368 Sep 17 2011 glogin.sql
lrwxrwxrwx 1 root root 17 Jul 9 2012 libclntsh.so -> libclntsh.so.11.1
-rw-r--r-- 1 root root 52761218 Sep 17 2011 libclntsh.so.11.1
-rw-r--r-- 1 root root 7955322 Sep 17 2011 libnnz11.so
lrwxrwxrwx 1 root root 15 Jul 9 2012 libocci.so -> libocci.so.11.1
-rw-r--r-- 1 root root 1971762 Sep 17 2011 libocci.so.11.1
-rw-r--r-- 1 root root 118408281 Sep 17 2011 libociei.so
-rw-r--r-- 1 root root 164836 Sep 17 2011 libocijdbc11.so
-rw-r--r-- 1 root root 1503303 Sep 17 2011 libsqlplusic.so
-rw-r--r-- 1 root root 1477446 Sep 17 2011 libsqlplus.so
-rw-r--r-- 1 root root 2095661 Sep 17 2011 ojdbc5.jar
-rw-r--r-- 1 root root 2714016 Sep 17 2011 ojdbc6.jar
-rw-r--r-- 1 root root 300666 Sep 17 2011 ottclasses.zip
-rw-r--r-- 1 root root 66779 Sep 17 2011 xstreams.jar
And here is the list of Oracle Instant Client RPMs that I installed:
# rpm -qa | grep -i oracle
oracle-instantclient11.2-basic-11.2.0.3.0-1
oracle-instantclient11.2-devel-11.2.0.3.0-1
oracle-instantclient11.2-sqlplus-11.2.0.3.0-1
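For building and loading DBD::Oracle against those RPMs, the shell environment would typically be set up along these lines (a sketch; the exact ORACLE_HOME value is an assumption based on the layout above):
export ORACLE_HOME=/usr/lib/oracle/11.2/client64
export LD_LIBRARY_PATH=/usr/lib/oracle/11.2/client64/lib:$LD_LIBRARY_PATH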
Hope this helps.
