Not able to configure databricks with external hive metastore - azure-databricks

I am following this document https://docs.databricks.com/data/metastores/external-hive-metastore.html#spark-configuration-options
to connect to my external hive metastore. My metastore version is 3.1.0 and followed the document.
docs.databricks.comdocs.databricks.com
External Apache Hive metastore — Databricks Documentation
Learn how to connect to external Apache Hive metastores in Databricks.
10:51
I have getting this error when trying to connect to external hive metastore
org/apache/hadoop/hive/conf/HiveConf when creating Hive client using classpath:
Please make sure that jars for your version of hive and hadoop are included in the paths passed to spark.sql.hive.metastore.jars
spark.sql.hive.metastore.jars=/databricks/hive_metastore_jars/*
When I do an ls on /databricks/hive_metastore_jars/, I can see all copied files
10:52
Do I need to copy any hive specific files and upload it in this folder?
I did exactly what was mentioned in the site
These are the contents of my hive_metastore_jars
total 56K
drwxr-xr-x 3 root root 4.0K Mar 24 05:06 1585025573715-0
drwxr-xr-x 2 root root 4.0K Mar 24 05:06 d596a6ec-e105-4a6e-af95-df3feffc263d_resources
drwxr-xr-x 3 root root 4.0K Mar 24 05:06 repl
drwxr-xr-x 2 root root 4.0K Mar 24 05:06 spark-2959157d-2018-441a-a7d3-d7cecb8a645f
drwxr-xr-x 4 root root 4.0K Mar 24 05:06 root
drwxr-xr-x 2 root root 4.0K Mar 24 05:06 spark-30a72ee5-304c-432b-9c13-0439511fb0cd
drwxr-xr-x 2 root root 4.0K Mar 24 05:06 spark-a19d167b-d571-4e58-a961-d7f6ced3d52f
-rwxr-xr-x 1 root root 5.5K Mar 24 05:06 _CleanRShell.r3763856699176668909resource.r
-rwxr-xr-x 1 root root 9.7K Mar 24 05:06 _dbutils.r9057087446822479911resource.r
-rwxr-xr-x 1 root root 301 Mar 24 05:06 _rServeScript.r1949348184439973964resource.r
-rwxr-xr-x 1 root root 1.5K Mar 24 05:06 _startR.sh5660449951005543051resource.r
Am I missing anything?
Strangely If I look into the cluster boot logs here is what I get
20/03/24 07:29:05 INFO Persistence: Property spark.hadoop.javax.jdo.option.ConnectionDriverName unknown - will be ignored
20/03/24 07:29:05 INFO Persistence: Property spark.hadoop.javax.jdo.option.ConnectionURL unknown - will be ignored
20/03/24 07:29:05 INFO Persistence: Property spark.hadoop.javax.jdo.option.ConnectionUserName unknown - will be ignored
20/03/24 07:29:05 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
20/03/24 07:29:05 INFO Persistence: Property spark.hadoop.javax.jdo.option.ConnectionPassword unknown - will be ignored
20/03/24 07:29:05 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
20/03/24 07:29:05 INFO Persistence: Property datanucleus.schema.autoCreateAll unknown - will be ignored
20/03/24 07:29:09 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
20/03/24 07:29:09 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
I have already set the above configurations and it shows in the logs as well
20/03/24 07:28:59 INFO SparkContext: Spark configuration:
spark.hadoop.javax.jdo.option.ConnectionDriverName=org.mariadb.jdbc.Driver
spark.hadoop.javax.jdo.option.ConnectionPassword=*********(redacted)
spark.hadoop.javax.jdo.option.ConnectionURL=*********(redacted)
spark.hadoop.javax.jdo.option.ConnectionUserName=*********(redacted)
Also version information is available in my hive metastore, I can connect to mysql and see it it shows
SCHEMA_VERSION : 3.1.0
VER_ID = 1

From the output, it looks like the jars are not copied to the "/databricks/hive_metastore_jars/" location. As mentioned in the documentation link you shared:
Set spark.sql.hive.metastore.jars set to maven
Restart the cluster with the above configuration and then check in the Spark driver logs for the message :
17/11/18 22:41:19 INFO IsolatedClientLoader: Downloaded metastore jars to <path>
From this location copy the jars to DBFS from the same cluster and then use an init script to copy the jars from DBFS to "/databricks/hive_metastore_jars/"

As I am using azure mysql there is one more step I need to perform
https://learn.microsoft.com/en-us/azure/databricks/data/metastores/external-hive-metastore

Related

Liquibase can not find file containing H2 bin

When trying to generate a liquibase diff I get the following:
Starting Liquibase at 00:18:57 (version 4.1.1 #10 built at 2020-10-12 19:24+0000)
Unexpected error running Liquibase: '/usr/local/apps/h2/h2-1.3.176.jar' does not exist
For more information, please use the --logLevel flag
[2022-10-02 00:18:57] SEVERE [liquibase.integration] Unexpected error running Liquibase: '/usr/local/apps/h2/h2-1.3.176.jar' does not exist
liquibase.exception.CommandLineParsingException: '/usr/local/apps/h2/h2-1.3.176.jar' does not exist
at liquibase.integration.commandline.Main.configureClassLoader(Main.java:1278)
at liquibase.integration.commandline.Main$1.run(Main.java:360)
at liquibase.integration.commandline.Main$1.run(Main.java:193)
at liquibase.Scope.child(Scope.java:169)
at liquibase.Scope.child(Scope.java:145)
at liquibase.integration.commandline.Main.run(Main.java:193)
at liquibase.integration.commandline.Main.main(Main.java:156)
The problem is, the java binary does exist at in that directory.
(base) user#userpad:/usr/local/apps/h2$ ls -la
total 1632
drwxr-xr-x 2 root root 4096 okt 2 00:12 .
drwxr-xr-x 3 root root 4096 okt 2 00:12 ..
-rw-rw-r-- 1 user user 1659879 okt 2 00:07 h2-1.3.176.jar
(base) user#userpad:/usr/local/apps/h2$ pwd
/usr/local/apps/h2
(base) user#userpad:/usr/local/apps/h2$
Any help would be appreciated.

Where to use the HDFS data in Web UI - MapR

I am able to connect to Mapr Control System - MCS - port 8080 (Web - UI) but where can i see the data file i copied from my local files system.??
On the MCS Navigation Pane , I can see the below tabs but i dont see where to navigate to data folder.
**Cluster
Dashboard
Nodes
Node Heatmap
Jobs
MapR-FS
MapR Tables
Volumes
Mirror Volumes
USer Disk USage
Snapshot
NFS HA
NFS Setup
Alarms
...
...
..
system Settings
-..
..
..
HBASE
...
Resource Manager
Job History Server
..**
**Hadoop Command:**
]$ hadoop fs -ls -R /user
drwxr-xr-x - mapr mapr 1 2019-06-25 12:47 /user/hive
drwxrwxr-x - mapr mapr 0 2019-06-25 12:47 /user/hive/warehouse
drwxrwxrwx - mapr mapr 6 2019-06-26 11:34 /user/mapr
drwxr-xr-x - mapr mapr 0 2019-06-25 12:14 /user/mapr/.sparkStaging
drwxr-xr-x - mapr mapr 0 2019-06-24 15:40 /user/mapr/hadoop
drwxrwxrwx - mapr mapr 1 2019-06-26 11:48 /user/mapr/rajesh
-rwxr-xr-x 3 webload webload 219 2019-06-26 11:48 /user/mapr/rajesh/sample.txt
-rwxr-xr-x 3 mapr mapr 219 2019-06-25 13:46 /user/mapr/sample.txt
drwxr-xr-x - mapr mapr 0 2019-06-25 12:14 /user/mapr/spark-warehouse
drwxr-xr-x - mapr mapr 1 2019-06-24 13:37 /user/mapr/tmp
drwxrwxrwx - mapr mapr 0 2019-06-24 13:37 /user/mapr/tmp/hive

How to run pig scripts from HDFS?

I am trying to run pig script from the hdfs but it shows error as the file does not exist.
My hdfs Directory
[cloudera#quickstart ~]$ hdfs dfs -ls /
Found 11 items
drwxrwxrwx - hdfs supergroup 0 2016-08-10 14:35 /benchmarks
drwxr-xr-x - hbase supergroup 0 2017-08-19 23:51 /hbase
drwxr-xr-x - cloudera supergroup 0 2017-07-13 04:53 /home
drwxr-xr-x - cloudera supergroup 0 2017-08-27 07:26 /input
drwxr-xr-x - cloudera supergroup 0 2017-07-30 14:30 /output
drwxr-xr-x - solr solr 0 2016-08-10 14:37 /solr
-rw-r--r-- 1 cloudera supergroup 273 2017-08-27 11:59 /success.pig
-rw-r--r-- 1 cloudera supergroup 273 2017-08-27 12:04 /success.script
drwxrwxrwt - hdfs supergroup 0 2017-08-27 12:07 /tmp
drwxr-xr-x - hdfs supergroup 0 2016-09-28 09:00 /user
drwxr-xr-x - hdfs supergroup 0 2016-08-10 14:37 /var
Command executed
[cloudera#quickstart ~]$ pig -x mapreduce /success.pig
Error Message
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
2017-08-27 12:34:39,160 [main] INFO org.apache.pig.Main - Apache Pig version 0.12.0-cdh5.8.0 (rexported) compiled Jun 16 2016, 12:40:41
2017-08-27 12:34:39,162 [main] INFO org.apache.pig.Main - Logging error messages to: /home/cloudera/pig_1503862479069.log
2017-08-27 12:34:47,079 [main] ERROR org.apache.pig.Main - ERROR 2997: Encountered IOException. File /success.pig does not exist
Details at logfile: /home/cloudera/pig_1503862479069.log
What am I missing ?
You may use -f <script location> option and option value to run script located at HDFS path. But script location need to be absolute path as given in following syntax and example.
Syntax:
pig -f <fs.defaultFS>/<script path in hdfs>
Example:
pig -f hdfs://Foton/user/root/script.pig

hadoop-2.6.0 install snappy not ok

my hadoop version is "hadoop-2.6.0-cdh5.4.5"
my linux version is "Description: CentOS release 6.5 (Final)"
my hadoop home is "/home/hadoop/hadoop-2.6.0-cdh5.4.5"
I have install snappy in /usr/lib64:
libsnappy.so -> libsnappy.so.1.1.4
libsnappy.so.1 -> libsnappy.so.1.1.4
libsnappy.so.1.1.4
I also finish hadoop-snappy by mvn:
1) copy hadoop-snappy-0.0.1-SNAPSHOT.jar /home/hadoop/hadoop-2.6.0-cdh5.4.5/lib/.
2)cp -r $HADOOP-SNAPPY_CODE_HOME/target/hadoop-snappy-0.0.1-SNAPSHOT/lib/native/Linux-amd64-64 /home/hadoop/hadoop-2.6.0-cdh5.4.5/lib/native/
I ln the lib in /home/hadoop/hadoop-2.6.0-cdh5.4.5/lib/native:
lrwxrwxrwx. 1 root root 23 Jan 17 17:11 libsnappy.so -> /usr/lib64/libsnappy.so
lrwxrwxrwx. 1 root root 25 Jan 17 17:16 libsnappy.so.1 -> /usr/lib64/libsnappy.so.1
I also finish the /home/hadoop/hadoop-2.6.0-cdh5.4.5/etc/hadoop/hadoop.env.sh:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native/Linux-amd64-64/
export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native/Linux-amd64-64/
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_HOME}/lib/native
but when I check the snappy lib:
[hadoop#stormmaster hadoop]$ hadoop checknative
16/01/17 17:52:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Native library checking:
hadoop: false
zlib: false
**snappy: false**
lz4: false
bzip2: false
openssl: false
16/01/17 17:52:20 INFO util.ExitUtil: Exiting with status 1
I try to download a hadoop-native-64-2.6.0,and put in /home/hadoop/hadoop-2.6.0-cdh5.4.5/lib/native:
-rw-r--r--. 1 hadoop hadoop 1119486 Jan 15 15:51 libhadoop.a
-rw-r--r--. 1 hadoop hadoop 1486964 Jan 15 15:51 libhadooppipes.a
lrwxrwxrwx. 1 hadoop hadoop 24 Jan 11 15:11 libhadoopsnappy.so -> libhadoopsnappy.so.0.0.1
lrwxrwxrwx. 1 hadoop hadoop 24 Jan 11 15:11 libhadoopsnappy.so.0 -> libhadoopsnappy.so.0.0.1
-rwxr-xr-x. 1 hadoop hadoop 54437 Jan 15 15:51 libhadoopsnappy.so.0.0.1
-rwxr-xr-x. 1 hadoop hadoop 671189 Jan 15 15:51 libhadoop.so
-rwxr-xr-x. 1 hadoop hadoop 671189 Jan 15 15:51 libhadoop.so.1.0.0
-rw-r--r--. 1 hadoop hadoop 581944 Jan 15 15:51 libhadooputils.a
-rw-r--r--. 1 hadoop hadoop 359458 Jan 15 15:51 libhdfs.a
-rwxr-xr-x. 1 hadoop hadoop 228435 Jan 15 15:51 libhdfs.so
-rwxr-xr-x. 1 hadoop hadoop 228435 Jan 15 15:51 libhdfs.so.0.0.0
also hadoop checknative:
16/01/17 17:55:00 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version
16/01/17 17:55:00 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
**hadoop: true /home/hadoop/hadoop-2.6.0-cdh5.4.5/lib/native/libhadoop.so
zlib: true /lib64/libz.so.1
snappy: false
lz4: true revision:99
bzip2: false
openssl: false Cannot load libcrypto.so (libcrypto.so: cannot open shared object file: No such file or directory)!**
how can I solve the problem ,and use snappy in hadoop,thx!

Namenode HA (UnknownHostException: nameservice1)

We enable Namenode High Availability through Cloudera Manager, using
Cloudera Manager >> HDFS >> Action > Enable High Availability >> Selected Stand By Namenode & Journal Nodes
Then nameservice1
Once the whole process completed then Deployed Client Configuration.
Tested from Client Machine by listing HDFS directories (hadoop fs -ls /) then manually failover to standby namenode & again listing HDFS directories (hadoop fs -ls /). This test worked perfectly.
But When I ran hadoop sleep job using following command it failed
$ hadoop jar /opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop-0.20-mapreduce/hadoop-examples.jar sleep -m 1 -r 0
java.lang.IllegalArgumentException: java.net.UnknownHostException: nameservice1
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:414)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:164)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:129)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:448)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:410)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:128)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2308)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:87)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2342)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2324)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:351)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:194)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:103)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:980)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:974)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:974)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:948)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1410)
at org.apache.hadoop.examples.SleepJob.run(SleepJob.java:174)
at org.apache.hadoop.examples.SleepJob.run(SleepJob.java:237)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.examples.SleepJob.main(SleepJob.java:165)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.net.UnknownHostException: nameservice1
... 37 more
I dont know why its not able to resolved nameservice1 even after deploying client configuration.
When I google this issue I found only one solution to this issue
Add the below entry in configuration entry for fix the issue
dfs.nameservices=nameservice1
dfs.ha.namenodes.nameservice1=namenode1,namenode2
dfs.namenode.rpc-address.nameservice1.namenode1=ip-10-118-137-215.ec2.internal:8020
dfs.namenode.rpc-address.nameservice1.namenode2=ip-10-12-122-210.ec2.internal:8020
dfs.client.failover.proxy.provider.nameservice1=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
My impression was Cloudera Manager take cares of it. I checked client for this configuration & configuration was there (/var/run/cloudera-scm-agent/process/1998-deploy-client-config/hadoop-conf/hdfs-site.xml).
Also some more details of config files :
[11:22:37 root#datasci01.dev:~]# ls -l /etc/hadoop/conf.cloudera.*
/etc/hadoop/conf.cloudera.hdfs:
total 16
-rw-r--r-- 1 root root 943 Jul 31 09:33 core-site.xml
-rw-r--r-- 1 root root 2546 Jul 31 09:33 hadoop-env.sh
-rw-r--r-- 1 root root 1577 Jul 31 09:33 hdfs-site.xml
-rw-r--r-- 1 root root 314 Jul 31 09:33 log4j.properties
/etc/hadoop/conf.cloudera.hdfs1:
total 20
-rwxr-xr-x 1 root root 233 Sep 5 2013 container-executor.cfg
-rw-r--r-- 1 root root 1890 May 21 15:48 core-site.xml
-rw-r--r-- 1 root root 2546 May 21 15:48 hadoop-env.sh
-rw-r--r-- 1 root root 1577 May 21 15:48 hdfs-site.xml
-rw-r--r-- 1 root root 314 May 21 15:48 log4j.properties
/etc/hadoop/conf.cloudera.mapreduce:
total 20
-rw-r--r-- 1 root root 1032 Jul 31 09:33 core-site.xml
-rw-r--r-- 1 root root 2775 Jul 31 09:33 hadoop-env.sh
-rw-r--r-- 1 root root 1450 Jul 31 09:33 hdfs-site.xml
-rw-r--r-- 1 root root 314 Jul 31 09:33 log4j.properties
-rw-r--r-- 1 root root 2446 Jul 31 09:33 mapred-site.xml
/etc/hadoop/conf.cloudera.mapreduce1:
total 24
-rwxr-xr-x 1 root root 233 Sep 5 2013 container-executor.cfg
-rw-r--r-- 1 root root 1979 May 16 12:20 core-site.xml
-rw-r--r-- 1 root root 2775 May 16 12:20 hadoop-env.sh
-rw-r--r-- 1 root root 1450 May 16 12:20 hdfs-site.xml
-rw-r--r-- 1 root root 314 May 16 12:20 log4j.properties
-rw-r--r-- 1 root root 2446 May 16 12:20 mapred-site.xml
[11:23:12 root#datasci01.dev:~]#
I doubt its issue with old configuration in /etc/hadoop/conf.cloudera.hdfs1 & /etc/hadoop/conf.cloudera.mapreduce1 , but not sure.
looks like /etc/hadoop/conf/* never got updated
# ls -l /etc/hadoop/conf/
total 24
-rwxr-xr-x 1 root root 233 Sep 5 2013 container-executor.cfg
-rw-r--r-- 1 root root 1979 May 16 12:20 core-site.xml
-rw-r--r-- 1 root root 2775 May 16 12:20 hadoop-env.sh
-rw-r--r-- 1 root root 1450 May 16 12:20 hdfs-site.xml
-rw-r--r-- 1 root root 314 May 16 12:20 log4j.properties
-rw-r--r-- 1 root root 2446 May 16 12:20 mapred-site.xml
Anyone has any idea about this issue?
Looks like you are using wrong client configuration in /etc/hadoop/conf directory. Sometimes Cloudera Manager (CM) deploy client configurations option may not work.
As you have enabled NN HA, you should have valid core-site.xml and hdfs-site.xml files in your hadoop client configuration directory. For getting the valid site files, Go to HDFS service from CM Choose Download client configuration option from the Actions Button. you will get configuration files in zip format, extract the zip files and replace /etc/hadoop/conf/core-site.xml and /etc/hadoop/conf/hdfs-site.xml files with the extracted core-site.xml,hdfs-site.xml files.
Got it resolved. wrong config was linked to "/etc/hadoop/conf/" --> "/etc/alternatives/hadoop-conf/" --> "/etc/hadoop/conf.cloudera.mapreduce1"
It has to be "/etc/hadoop/conf/" --> "/etc/alternatives/hadoop-conf/" --> "/etc/hadoop/conf.cloudera.mapreduce"
below statement in my code resolved problem by specifying the host and port
val dfs = sqlContext.read.json("hdfs://localhost:9000//user/arvindd/input/employee.json")
I resolved this issue my putting the complete line to create RDD
myfirstrdd = sc.textFile("hdfs://192.168.35.132:8020/BUPA.txt")
and then I was able to do other RDD transformation .. Make sure you have the w/r/x to the file or you can do chmod 777

Resources