GeoSpark using Maven UDF running Databricks on Azure?

GeoSpark using Maven UDF running Databricks on Azure? - user-defined-functions

I'm trying out Databricks#Azure with spark cluster:
Spark 3.0.0, Scala 2.12
On the cluster(!) I installed:
geospark:1.3.1
geospark-sql_2.3:1.3.1
Inspired by https://databricks.com/notebooks/geospark-notebook.html
I love SQL and would like to run GeoSpark queries.
I run this (from Notebook):
%scala
import com.vividsolutions.jts.geom.{Coordinate, Geometry, GeometryFactory}
import org.datasyslab.geospark.formatMapper.shapefileParser.ShapefileReader
import org.datasyslab.geospark.spatialRDD.SpatialRDD
import org.datasyslab.geosparksql.utils.{Adapter, GeoSparkSQLRegistrator}
GeoSparkSQLRegistrator.registerAll(sqlContext)
When I run this check:
%scala
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.appName("Spark SQL UDF scalar example")
.getOrCreate()
spark.catalog.listFunctions().filter("name like 'ST%P%' ").show(false)
/** spark.catalog.listTables().show()
spark.sql("SELECT ST_Point(0,0) FROM ( VALUES (42) ) AS t(a); ").show() */
The output is:
|name |database|description|className |isTemporary|
|ST_NPoints |null |null |org.apache.spark.sql.geosparksql.expressions.ST_NPoints$ |true |
|ST_Point |null |null |org.apache.spark.sql.geosparksql.expressions.ST_Point$ |true |
...
But this
%sql
SELECT t.a, ST_Point(0,0) as p
FROM (VALUES (42)) AS t(a);
Fails:
Error in SQL statement: NoClassDefFoundError: org/apache/spark/sql/catalyst/expressions/codegen/CodegenFallback$class
What is it I'm doing wrong?
P.S.
I also tried:
CREATE FUNCTION ST_Point AS 'org.apache.spark.sql.geosparksql.expressions.ST_Point$';
with and without ending dollar sign. the create function statement returns OK; however running the select including ST_point then returns:
Error in SQL statement: AnalysisException: No handler for UDF/UDAF/UDTF 'org.apache.spark.sql.geosparksql.expressions.ST_Point$'; line 1 pos 12

geospark 1.3.1 seems to be built for Spark 2.x, see [1], if you need to use Spark 3.x then try and upgrade to geospark 1.3.2 otherwise try to downgrade to spark 2.x.
[1] http://sedona.apache.org/download/GeoSpark-All-Modules-Maven-Central-Coordinates/

Related

How to load Impala table directly to Spark using JDBC?

I am trying to write a spark job with Python that would open a jdbc connection with Impala and load a VIEW directly from Impala into a Dataframe. This question is pretty close but in scala: Calling JDBC to impala/hive from within a spark job and creating a table
How do I do this? There are plenty of examples for other datasources such as MySQL, PostgreSQL, etc. but I haven't seen one for Impala + Python + Kerberos. An example would be of great help. Thank you!
Tried this with information from the web but it didn't work.
SPARK Notebook
#!/bin/bash
export PYSPARK_PYTHON=/home/anave/anaconda2/bin/python
export HADOOP_CONF_DIR=/etc/hive/conf
export PYSPARK_DRIVER_PYTHON=/home/anave/anaconda2/bin/ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=* --no-browser'
# use Java8
export JAVA_HOME=/usr/java/latest
export PATH=$JAVA_HOME/bin:$PATH
# JDBC Drivers for Impala
export CLASSPATH=/home/anave/impala_jdbc_2.5.30.1049/Cloudera_ImpalaJDBC41_2.5.30/*.jar:$CLASSPATH
export JDBC_PATH=/home/anave/impala_jdbc_2.5.30.1049/Cloudera_ImpalaJDBC41_2.5.30
# --jars $SRCDIR/spark-csv-assembly-1.4.0-SNAPSHOT.jar \
# --conf spark.sql.parquet.binaryAsString=true \
# --conf spark.sql.hive.convertMetastoreParquet=false
pyspark --master yarn-client \
--driver-memory 4G \
--executor-memory 2G \
# --num-executors 10 \
--jars /home/anave/spark-csv_2.11-1.4.0.jar $JDBC_PATH/*.jar
--driver-class-path $JDBC_PATH/*.jar
Python Code
properties = {
"driver": "com.cloudera.impala.jdbc41.Driver",
"AuthMech": "1",
# "KrbRealm": "EXAMPLE.COM",
# "KrbHostFQDN": "impala.example.com",
"KrbServiceName": "impala"
}
# imp_env is the hostname of the db, works with other impala queries ran inside python
url = "jdbc:impala:imp_env;auth=noSasl"
db_df = sqlContext.read.jdbc(url=url, table='summary', properties=properties)
I received this error msg (Full Error Log):
Py4JJavaError: An error occurred while calling o42.jdbc.
: java.lang.ClassNotFoundException: com.cloudera.impala.jdbc41.Driver

You can use
--jars $(echo /dir/of/jars/*.jar | tr ' ' ',')
instead of
--jars /home/anave/spark-csv_2.11-1.4.0.jar $JDBC_PATH/*.jar
or for another approach please see my answer

1st approach is to use spark-submit on below impala_jdbc_connection.py script like spark-submit --driver-class-path /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/jars/ImpalaJDBC41.jar --jars /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/jars/ImpalaJDBC41.jar --class com.cloudera.impala.jdbc41.Driver impala_jdbc_connection.py
impala_jdbc_connection.py
properties = {
"drivers": "com.cloudera.impala.jdbc41.Driver"
}
#initalize the spark session
spark = (
SparkSession.builder
.config("spark.jars.packages", "jar-packages-list")
.config("spark.sql.warehouse.dir","hdfs://dwh-hdp-node01.dev.ergo.liferunoffinsuranceplatform.com:8020/user/hive/warehouse")
.enableHiveSupport()
.getOrCreate()
)
db_df = spark.read.jdbc(url= 'jdbc:impala://host_ip_address:21050/database_name', table ='table_name', properties = properties)
db_df.show()
2nd approach is not a direct import from impala to spark but rather a conversion of results to spark dataframe
pip install impyla Source: https://github.com/cloudera/impyla
Connect to impala and fetch results from impala database and convert result to spark dataframe
from impala.dbapi import connect
conn = connect(host = 'IP_ADDRESS_OF_HOST', port=21050)
cursor = conn.cursor()
cursor.execute('select * from database.table')
res= cursor.fetchall() # convert res to spark dataframe
for data in res:
print(data)

Did this in Azure Databricks notebook after setting up the jar in the cluster libraries. Generally followed previous post except that d is upper case for Driver config. Worked great.
properties = {
"Driver": "com.cloudera.impala.jdbc41.Driver"
}
db_df = spark.read.jdbc(url= 'jdbc:impala://hostname.domain.net:21050/dbname;AuthMech=3;UID=xxxx;PWD=xxxx', table ='product', properties = properties)
db_df.show()

This works for me:
spark-shell --driver-class-path ImpalaJDBC41.jar --jars ImpalaJDBC41.jar
val jdbcURL = s"jdbc:impala://192.168.56.101:21050;AuthMech=0"
val connectionProperties = new java.util.Properties()
val hbaseDF = sqlContext.read.jdbc(jdbcURL, "impala_table", connectionProperties)

How to write spark dataframe to impala database

I use the following code to write the spark dataframe to impala through JDBC connection.
df.write.mode("append").jdbc(url="jdbc:impala://10.61.1.101:21050/test;auth=noSasl",table="t_author_classic_copy", pro)
But I get the following error: java.sql.SQLException: No suitable driver found
then I change the mode:
df.write.mode("overwrite").jdbc(url="jdbc:impala://10.61.1.101:21050/test;auth=noSasl",table="t_author_classic_copy", pro)
but it still get an error:
CAUSED BY: Exception: Syntax error
), Query: CREATE TABLE t_author_classic_copy1 (id TEXT NOT NULL, domain_id TEXT NOT NULL, pub_num INTEGER , cited_num INTEGER , rank DOUBLE PRECISION ).

This works for me:
spark-shell --driver-class-path ImpalaJDBC41.jar --jars ImpalaJDBC41.jar
val jdbcURL = s"jdbc:impala://192.168.56.101:21050;AuthMech=0"
val connectionProperties = new java.util.Properties()
import org.apache.spark.sql.SaveMode
sqlContext.sql("select * from my_users").write.mode(SaveMode.Append).jdbc(jdbcURL, "users", connectionProperties)

SPARK SQL (1.5.1) connect to Oracle and write to Avro

I am using spark-sql to connect to oracle databse and getting data as dataframes. I would like to write this retrieved data into avro file. While writing to avro I am seeing multiple issues, could you help us.
Here is the code -
val df = sqlContext.read.format("jdbc")
.options(Map( "driver"->"oracle.jdbc.driver.OracleDriver",
"url" -> "jdbc:oracle:thin:user/password#host/service"
, "numPartitions" -> "1", "dbtable"-> "
(Select * from schema.table WHERE STAGE_NUM <=39 and
guid='I284ba1f9cdba11dea82ab9f4ee295c21')"))
.load()
df.write.format("com.databricks.spark.avro").save("Outputfile")
Dependencies that are there in my project -
<dependency><br> <groupId>org.apache.spark</groupId><br> <artifactId>spark-sql_2.10</artifactId><br> <version>1.5.1</version><br></dependency><br><dependency><br> <groupId>com.databricks</groupId><br> <artifactId>spark-avro_2.10</artifactId><br> <version>2.0.1</version><br></dependency><br><dependency><br> <groupId>org.apache.avro</groupId><br> <artifactId>avro</artifactId><br> <version>1.7.7</version><br></dependency><br><dependency><br> <groupId>org.apache.avro</groupId><br> <artifactId>avro-mapred</artifactId><br> <version>1.7.7</version><br></dependency>
Here is the exception information -
java.lang.RuntimeException: com.databricks.spark.avro.DefaultSource does not allow create table as select
If I use - df.write.avro("headnotes"), I get the following exception.
java.lang.IllegalAccessError: tried to access class org.apache.avro.SchemaBuilder$FieldDefault from class com.databricks.spark.avro.SchemaConverters$$anonfun$convertStructToAvro$1

error while storing in Hbase using Pig

hadoop dfs input data cat:
[ituser1#genome-dev3 ~]$ hadoop fs -cat FOR_COPY/COMPETITOR_BROKERING/part-r-00000 | head -1
returns:
836646827,1000.0,2016-02-20,34,CAPITAL BOOK,POS/CAPITAL BOOK/NEW DELHI/200216/14:18,BOOKS AND STATIONERY,5497519004453567/41043516,MARRIED,M,SALARIED,D,5942,1
My Pig code:
DATA = LOAD 'FOR_COPY/COMPETITOR_BROKERING' USING PigStorage(',') AS (CUST_ID:chararray,TXN_AMT:chararray,TXN_DATE:chararray,AGE_CASA:chararray,MERCH_NAME:chararray,TXN_PARTICULARS:chararray,MCC_CATEGORY:chararray,TXN_REMARKS:chararray,MARITAL_STATUS_CASA:chararray,GENDER_CASA:chararray,OCCUPATION_CAT_V2_NEW:chararray,DR_CR:chararray,MCC_CODE:chararray,OCCURANCE:int);
DATA_FIL = FOREACH DATA GENERATE
(chararray)CUST_ID AS CUST_ID,
(chararray)TXN_AMT AS TXN_AMT,
(chararray)TXN_DATE AS TXN_DATE,
(chararray)AGE_CASA AS AGE_CASA,
(chararray)MERCH_NAME AS MERCH_NAME,
(chararray)TXN_PARTICULARS AS TXN_PARTICULARS,
(chararray)MCC_CATEGORY AS MCC_CATEGORY,
(chararray)TXN_REMARKS AS TXN_REMARKS,
(chararray)MARITAL_STATUS_CASA AS MARITAL_STATUS_CASA,
(chararray)GENDER_CASA AS GENDER_CASA,
(chararray)OCCUPATION_CAT_V2_NEW AS OCCUPATION_CAT_V2_NEW,
(chararray)DR_CR AS DR_CR,
(chararray)MCC_CODE AS MCC_CODE;
STORE DATA_FIL INTO 'hbase://TXN_EVENTS' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ('DETAILS:CUST_ID DETAILS:TXN_AMT DETAILS:TXN_DATE DETAILS:AGE_CASA DETAILS:MERCH_NAME DETAILS:TXN_PARTICULARS DETAILS:MCC_CATEGORY DETAILS:TXN_REMARKS DETAILS:MARITAL_STATUS_CASA DETAILS:GENDER_CASA DETAILS:OCCUPATION_CAT_V2_NEW DETAILS:DR_CR DETAILS:MCC_CODE');
but Giving error:
ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job job_1457792710587_0100 failed, hadoop does not return any error message
But my Load is working perfectly:
HDATA = LOAD 'hbase://TXN_EVENTS'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'DETAILS:CUST_ID DETAILS:TXN_AMT DETAILS:TXN_DATE DETAILS:AGE_CASA DETAILS:MERCH_NAME DETAILS:TXN_PARTICULARS DETAILS:MCC_CATEGORY DETAILS:TXN_REMARKS DETAILS:MARITAL_STATUS_CASA DETAILS:GENDER_CASA DETAILS:OCCUPATION_CAT_V2_NEW DETAILS:DR_CR DETAILS:MCC_CODE','-loadKey true' )
AS (ROWKEY:chararray,CUST_ID:chararray,TXN_AMT:chararray,TXN_DATE:chararray,AGE_CASA:chararray,MERCH_NAME:chararray,TXN_PARTICULARS:chararray,MCC_CATEGORY:chararray,TXN_REMARKS:chararray,MARITAL_STATUS_CASA:chararray,GENDER_CASA:chararray,OCCUPATION_CAT_V2_NEW:chararray,DR_CR:chararray,MCC_CODE:chararray);
DUMP HDATA; (this gives perfect result):
2016-03-01,1,20.0,2016-03-22,27,test_merch,test/particulars,test_category,test/remarks,married,M,service,D,1234
A help is appreciated
I am using Horton stack in distributed mode:
HDP2.3
Apache Pig version 0.15.0
HBase 1.1.1
Also all jars are in place as I have installed them through Ambari.

solved the data upload :
as i was missing to Rank the relation , hence hbase rowkey becomes the rank.\
DATA_FIL_1 = RANK DATA_FIL_2;
NOTE: this will generate arbitrary rowkey.
But if you want to define your row key then use like:
you have to give another relation , only STORE function won't work.
this will take first tuple as rowkey(which you have defined)
storage_data = STORE DATA_FIL INTO 'hbase://genome:event_sink' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('event_data:CUST_ID event_data:EVENT transaction_data:TXN_AMT transaction_data:TXN_DATE transaction_data:AGE_CASA transaction_data:MERCH_NAME transaction_data:TXN_PARTICULARS transaction_data:MCC_CATEGORY transaction_data:TXN_REMARKS transaction_data:MARITAL_STATUS_CASA transaction_data:GENDER_CASA transaction_data:OCCUPATION_CAT_V2_NEW transaction_data:DR_CR transaction_data:MCC_CODE');

Apache Drill 1.2 and Oracle JDBC

Using Apache Drill v1.2 and Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit in embedded mode.
I'm curious if anyone has had any success connecting Apache Drill to an Oracle DB. I've updated the drill-override.conf with the following configurations (per documents):
drill.exec: {
cluster-id: "drillbits1",
zk.connect: "localhost:2181",
drill.exec.sys.store.provider.local.path = "/mypath"
}
and placed the ojdbc6.jar in \apache-drill-1.2.0\jars\3rdparty. I can successfully create the storage plug-in:
{
"type": "jdbc",
"driver": "oracle.jdbc.driver.OracleDriver",
"url": "jdbc:oracle:thin:#<IP>:<PORT>:<SID>",
"username": "USERNAME",
"password": "PASSWORD",
"enabled": true
}
but when I issue a query such as:
select * from <storage_name>.<schema_name>.`dual`;
I get the following error:
Query Failed: An Error Occurred
org.apache.drill.common.exceptions.UserRemoteException: VALIDATION ERROR: From line 1, column 15 to line 1, column 20: Table '<storage_name>.<schema_name>.dual' not found [Error Id: 57a4153c-6378-4026-b90c-9bb727e131ae on <computer_name>:<PORT>].
I've tried to query other schema/tables and get a similar result. I've also tried connecting to Teradata and get the same error. Does any one have suggestions/run into similar issues?

It's working with Drill 1.3 (released on 23-Dec-2015)
Plugin: name - oracle
{
"type": "jdbc",
"driver": "oracle.jdbc.driver.OracleDriver",
"url": "jdbc:oracle:thin:user/password#192.xxx.xxx.xxx:1521:orcl ",
"enabled": true
}
Query:
select * from <plugin-name>.<user-name>.<table-name>;
Example:
select * from oracle.USER.SAMPLE;
Check drill's documentation for more details.
Note: Make sure you added ojdbc7.12.1.0.2.jar(recommended in docs) in apache-drill-1.3.0/jars/3rdparty

It kind of works in Apache drill 1.3.
The strange thing is that I can only query the tables for which there are synonyms created...
In the command line try:
use <storage_name>;
show tables;
This will give you a list of objects that you can query - dual is not on that list ;-).

I'm using apache-drill-1.9.0 and it seems that the schema name is interpreted case sensitive and must be be therefore be in upper case.
For a table user1.my_tab (which is per default created in Oracle in upper case)
this works in Drill (plugin name is oracle)
SELECT * FROM oracle.USER1.my_tab;
But this triggers an error
SELECT * FROM oracle.user1.my_tab;
SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: Table 'oracle.user1.my_tab' not found
An alternative approach is to set the plugin name and the schema name with use (owner must be upper case as well)
0: jdbc:drill:zk=local> use oracle.USER1;
+-------+-------------------------------------------+
| ok | summary |
+-------+-------------------------------------------+
| true | Default schema changed to [oracle.USER1] |
+-------+-------------------------------------------+
1 row selected (0,169 seconds)
0: jdbc:drill:zk=local> select * from my_tab;
+------+
| X |
+------+
| 1.0 |
| 1.0 |
+------+
2 rows selected (0,151 seconds)

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

GeoSpark using Maven UDF running Databricks on Azure? - user-defined-functions

geospark 1.3.1 seems to be built for Spark 2.x, see [1], if you need to use Spark 3.x then try and upgrade to geospark 1.3.2 otherwise try to downgrade to spark 2.x. [1] http://sedona.apache.org/download/GeoSpark-All-Modules-Maven-Central-Coordinates/

Related

How to load Impala table directly to Spark using JDBC?

How to write spark dataframe to impala database

SPARK SQL (1.5.1) connect to Oracle and write to Avro

error while storing in Hbase using Pig

Apache Drill 1.2 and Oracle JDBC

Categories

Resources