How to connect to HBase with the JDBC driver of Apache Drill programmatically

I tried to use the JDBC driver of Apache Drill programmatically.
Here's the code:
import java.sql.DriverManager
object SearchHbaseWithHbase {
def main(args: Array[String]): Unit = {
Class.forName("org.apache.drill.jdbc.Driver")
val zkIp = "192.168.3.2:2181"
val connection = DriverManager.getConnection(s"jdbc:drill:zk=${zkIp};schema:hbase")
connection.setSchema("hbase")
println(connection.getSchema)
val st = connection.createStatement()
val rs = st.executeQuery("SELECT * FROM Label")
while (rs.next()){
println(rs.getString(1))
}
}
}
I have set the database schema to hbase, like this:
connection.setSchema("hbase")
But it fails with this error:
Exception in thread "main" java.sql.SQLException: VALIDATION ERROR:
From line 1, column 15 to line 1, column 19: Table 'Label' not found
SQL Query null
The Label table definitely exists in my HBase.
I can find my data when I use sqlline like this:
sqlline -u jdbc:drill:zk....
use hbase;
select * from Label;

I have solved this problem. I had confused Drill's schema with the JDBC driver's schema.
The correct code should look like this:
import java.sql.DriverManager
object SearchHbaseWithHbase {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.drill.jdbc.Driver")
    val zkIp = "192.168.3.2:2181"
    // Pass the Drill schema as a connection property instead of calling connection.setSchema
    val p = new java.util.Properties
    p.setProperty("schema", "hbase")
    val url = s"jdbc:drill:zk=${zkIp}"
    val connection = DriverManager.getConnection(url, p)
    val st = connection.createStatement()
    val rs = st.executeQuery("SELECT * FROM Label")
    while (rs.next()) {
      println(rs.getString(1))
    }
  }
}

Related

How do I fetch the state with a custom query? Corda application using Spring Boot webserver: error while fetching the result

I have created an IOU in a Corda application. The IOU has an ID, an XML payload in the body, and a partyName. Now I want to fetch the state with a custom query based on the ID. Note: I am not using linearID.
Below is my API call, which gives me a syntax error. Can someone please tell me what I am doing wrong?
@GetMapping(value = ["getIous"],produces = [ MediaType.APPLICATION_JSON_VALUE])
private fun getTransactionOne(@RequestParam(value = "payloadId") payloadId: String): ResponseEntity<List<IOUState>> {
val generalCriteria = QueryCriteria.VaultQueryCriteria(Vault.StateStatus.ALL)
val results = builder { IOUState::iouId.equal(payloadId)
val customCriteria = QueryCriteria.VaultCustomQueryCriteria(results)}
val criteria = customCriteria.and(customCriteria)
val res = proxy.vaultQueryBy<IOUState>(criteria)
return ResponseEntity.ok(res)
}
I think the issue is that VaultCustomQueryCriteria is applicable only to StatePersistable objects, so you should use PersistentIOU instead of IOUState. The brackets are also used incorrectly. Here is how your code should look:
@GetMapping(value = ["getIous"], produces = [MediaType.APPLICATION_JSON_VALUE])
private fun getTransactionOne(@RequestParam(value = "payloadId") payloadId: String): ResponseEntity<List<IOUState>> {
    val generalCriteria = QueryCriteria.VaultQueryCriteria(Vault.StateStatus.ALL)
    val results = builder {
        // Query on the schema entity (PersistentIOU), not on the state class
        val idx = IOUSchemaV1.PersistentIOU::iouId.equal(payloadId)
        val customCriteria = QueryCriteria.VaultCustomQueryCriteria(idx)
        val criteria = generalCriteria.and(customCriteria)
        proxy.vaultQueryBy<IOUState>(criteria)
    }
    // vaultQueryBy returns a Vault.Page; unwrap the states to match the declared return type
    return ResponseEntity.ok(results.states.map { it.state.data })
}

How can Spark Streaming use a connection pool for Impala (JDBC to Kudu)?

I use Impala (JDBC) twice in foreachRDD: once to get the Kafka offsets and once to save data.
But Impala and Kudu keep shutting down. Now I want to set up a connection pool, but I know little Scala.
Here is my pseudo-code:
// node-1
val newOffsets = getNewOffset() // JDBC read kafka offset in kudu
val messages = KafkaUtils.createDirectStream(*,newOffsets,)
messages.foreachRDD(rdd => {
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  // node-2
  Class.forName(jdbcDriver)
  val con = DriverManager.getConnection("impala url")
  val stmt = con.createStatement()
  stmt.executeUpdate(sql)
  // node-3
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { r => {
    val rt_upsert = s"UPSert into ${execTable} values('${r.topic}',${r.partition},${r.untilOffset})"
    stmt.executeUpdate(rt_upsert)
    stmt.close()
    con.close()
  }}
})
How can I do this with c3p0 or another pool? I'd appreciate your help.
Below is the code for reading data from kafka and inserting the data to kudu.
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.kudu.client.KuduClient
import org.apache.kudu.client.KuduSession
import org.apache.kudu.client.KuduTable
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.streaming.Milliseconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils
import scala.collection.immutable.List
import scala.collection.mutable
import scala.util.control.NonFatal
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types._
import org.apache.kudu.Schema
import org.apache.kudu.Type._
import org.apache.kudu.spark.kudu.KuduContext
import scala.collection.mutable.ArrayBuffer
object KafkaKuduConnect extends Serializable {
def main(args: Array[String]): Unit = {
try {
val TopicName="TestTopic"
val kafkaConsumerProps = Map[String, String]("bootstrap.servers" -> "localhost:9092")
val KuduMaster=""
val KuduTable=""
val sparkConf = new SparkConf().setAppName("KafkaKuduConnect")
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val ssc = new StreamingContext(sc, Milliseconds(1000))
val kuduContext = new KuduContext(KuduMaster, sc)
val kuduclient: KuduClient = new KuduClient.KuduClientBuilder(KuduMaster).build()
//Opening table
val kudutable: KuduTable = kuduclient.openTable(KuduTable)
// getting table schema
val tableschema: Schema = kudutable.getSchema
// creating the schema for the data frame using the table schema
val FinalTableSchema =generateStructure(tableschema)
//To create the schema for creating the data frame from the rdd
def generateStructure(tableSchema:Schema):StructType=
{
var structFieldList:List[StructField]=List()
for(index <-0 until tableSchema.getColumnCount)
{
val col=tableSchema.getColumnByIndex(index)
val coltype=col.getType.toString
println(coltype)
col.getType match {
case INT32 =>
structFieldList=structFieldList:+StructField(col.getName,IntegerType)
case STRING =>
structFieldList=structFieldList:+StructField(col.getName,StringType)
case _ =>
println("No Class Type Found")
}
}
return StructType(structFieldList)
}
// To create the Row object with values type casted according to the schema
def getRow(schema:StructType,Data:List[String]):Row={
var RowData=ArrayBuffer[Any]()
schema.zipWithIndex.foreach(
each=>{
var Index=each._2
each._1.dataType match {
case IntegerType=>
// Empty or missing values become null; everything else is parsed as Int
if(Data(Index)==null || Data(Index)=="")
RowData+=null
else
RowData+=Data(Index).toInt
case StringType=>
RowData+=Data(Index)
case _=>
RowData+=Data(Index)
}
}
)
return Row.fromSeq(RowData.toList)
}
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaConsumerProps, Set(TopicName))
messages.foreachRDD(
//we are looping through eachrdd
eachrdd=>{
//we are creating the Rdd[Row] to create dataframe with our schema
val StructuredRdd= eachrdd.map(eachmessage=>{
val record=eachmessage._2
getRow(FinalTableSchema, record.split(",").toList)
})
//DataFrame with required structure according to the table.
val DF = sqlContext.createDataFrame(StructuredRdd, FinalTableSchema)
kuduContext.upsertRows(DF,KuduTable)
}
)
}
catch {
case NonFatal(e) => {
print("Error in main : " + e)
}
}
}
}

How to bind Oracle params in Scala?

I have written code that executes any query in Scala. It works perfectly unless my query has to use the same parameter twice. The database version is 12 and the Oracle jar is ojdbc6. I wrote this code to execute a query:
def executeQuery(locale: String, query: String, input: Map[String, String], output: List[String]): Vector[Map[String, Any]] = {
var connection: Connection = null;
val properties = ConnectionLoader.getConnectionProperties(locale);
try {
connection = getDBConnection(properties);
val statement = connection prepareCall (query)
if (null != input)
for ((k, v) <- input) {
statement.setObject(k, v)
}
for (k <- output) {
statement.registerOutParameter(k, OracleTypes.INTEGER)
}
val resultSet = statement.executeQuery();
realize(resultSet);
} catch {
case e => throw e;
} finally {
if (null != connection)
connection.close();
}
}
My query is:
SELECT COUNT (1)
FROM ORDERS
WHERE ORDER_ID = :P_ORDER_ID AND STATUS_ID = 4
This query works fine, but I'm getting an error when executing:
SELECT COUNT (1)
FROM ORDERS
WHERE ORDER_ID = :P_ORDER_ID AND STATUS_ID = 4 and :P_ORDER_ID=9
Regardless of this illogical query, I'm getting this error:
Execution exception[[SQLException: Missing IN or OUT parameter at index:: 2]]
I have googled everything but found nothing. Please advise.
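One possible explanation, offered as an assumption rather than a confirmed diagnosis: the driver may count each occurrence of :P_ORDER_ID as its own bind position, so the second occurrence (index 2) is never set. A common workaround is to rewrite the named placeholders into positional ? markers and bind every occurrence by index. The sketch below shows only that rewriting step. NamedParams and expand are hypothetical names used for illustration, and the regex does not handle colons inside string literals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedParams {
    // Matches :NAME placeholders (a letter or underscore, then letters, digits, underscores).
    private static final Pattern NAMED = Pattern.compile(":([A-Za-z_][A-Za-z0-9_]*)");

    /** Replaces every :NAME with ? and appends the names to orderOut in order of appearance. */
    static String expand(String sql, List<String> orderOut) {
        Matcher m = NAMED.matcher(sql);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            orderOut.add(m.group(1));
            m.appendReplacement(out, "?");
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> order = new ArrayList<>();
        String sql = expand(
            "SELECT COUNT(1) FROM ORDERS WHERE ORDER_ID = :P_ORDER_ID AND STATUS_ID = 4 AND :P_ORDER_ID = 9",
            order);
        System.out.println(sql);   // positional form of the query
        System.out.println(order); // one entry per placeholder occurrence
    }
}
```

With the returned list, each occurrence can then be bound by position on a PreparedStatement, for example by calling setObject(i + 1, value) with the value looked up from the input map for the i-th name.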

SPARK SQL - update MySql table using DataFrames and JDBC

I'm trying to insert and update some data in MySQL using Spark SQL DataFrames and a JDBC connection.
I've managed to insert new data using SaveMode.Append. Is there a way to update the data already existing in a MySQL table from Spark SQL?
My code to insert is:
myDataFrame.write.mode(SaveMode.Append).jdbc(JDBCurl,mySqlTable,connectionProperties)
If I change to SaveMode.Overwrite, it deletes the full table and creates a new one. I'm looking for something like the "ON DUPLICATE KEY UPDATE" available in MySQL.
It is not possible. As of now (Spark 1.6.0 / 2.2.0 SNAPSHOT), Spark DataFrameWriter supports only four writing modes:
SaveMode.Overwrite: overwrite the existing data.
SaveMode.Append: append the data.
SaveMode.Ignore: ignore the operation (i.e. no-op).
SaveMode.ErrorIfExists: default option, throw an exception at runtime.
You can insert manually, for example using mapPartitions (since you want an UPSERT, the operation should be idempotent and as such easy to implement), write to a temporary table and execute the upsert manually, or use triggers.
In general, achieving upsert behavior for batch operations while keeping decent performance is far from trivial. You have to remember that in the general case there will be multiple concurrent transactions in place (one per partition), so you have to ensure that there will be no write conflicts (typically by using application-specific partitioning) or provide appropriate recovery procedures. In practice it may be better to perform batch writes to a temporary table and resolve the upsert part directly in the database.
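For the manual route, the statement executed per partition (or against the temporary table) could be a parameterized MySQL upsert. The following is a minimal sketch that only builds the SQL text. UpsertSql and build are hypothetical names, and the table and column names are assumed to be trusted identifiers, since they are concatenated into the statement.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class UpsertSql {
    /** Builds an INSERT ... ON DUPLICATE KEY UPDATE statement with one ? per column. */
    static String build(String table, List<String> columns) {
        String cols = String.join(", ", columns);
        String marks = columns.stream().map(c -> "?").collect(Collectors.joining(", "));
        // VALUES(col) refers to the value that the INSERT part tried to write for that column.
        String updates = columns.stream()
                .map(c -> c + " = VALUES(" + c + ")")
                .collect(Collectors.joining(", "));
        return "INSERT INTO " + table + " (" + cols + ") VALUES (" + marks + ")"
                + " ON DUPLICATE KEY UPDATE " + updates;
    }

    public static void main(String[] args) {
        System.out.println(build("mytable", Arrays.asList("_id", "name")));
    }
}
```

The resulting string can be passed to Connection.prepareStatement and filled with setObject and addBatch per row, which avoids concatenating row values into the SQL.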
A pity that there is no SaveMode.Upsert mode in Spark for such quite common cases like upserting.
zero323 is right in general, but I think it should be possible (with compromises in performance) to offer such a replace feature.
I also wanted to provide some Java code for this case.
Of course it is not as performant as the built-in one from Spark, but it should be a good basis for your requirements. Just modify it to fit your needs:
// one connection per partition; repartition returns a new dataset, so chain the call
myDF.repartition(20).foreachPartition((Iterator<Row> t) -> {
    Connection conn = DriverManager.getConnection(
            Constants.DB_JDBC_CONN,
            Constants.DB_JDBC_USER,
            Constants.DB_JDBC_PASS);
    conn.setAutoCommit(true);
    Statement statement = conn.createStatement();
    final int batchSize = 100000;
    int i = 0;
    while (t.hasNext()) {
        Row row = t.next();
        try {
            // better than REPLACE INTO, less cycles
            statement.addBatch("INSERT INTO mytable VALUES ("
                    + "'" + row.getAs("_id") + "', "
                    + "'" + row.getStruct(1).get(0) + "'"
                    + ") ON DUPLICATE KEY UPDATE _id='" + row.getAs("_id") + "';");
            // flush once a full batch has accumulated
            if (++i % batchSize == 0) {
                statement.executeBatch();
            }
        } catch (SQLIntegrityConstraintViolationException e) {
            // should not occur, nevertheless
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
    // flush the remaining rows
    int[] ret = statement.executeBatch();
    System.out.println("Ret val: " + Arrays.toString(ret));
    System.out.println("Update count: " + statement.getUpdateCount());
    statement.close();
    conn.close();
});
Another option is to copy JdbcUtils.scala from org.apache.spark.sql.execution.datasources.jdbc and change its INSERT INTO statement to REPLACE INTO:
import java.sql.{Connection, Driver, DriverManager, PreparedStatement, ResultSet, SQLException}
import scala.collection.JavaConverters._
import scala.util.control.NonFatal
import com.typesafe.scalalogging.Logger
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.jdbc.{DriverRegistry, DriverWrapper, JDBCOptions}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Row}
/**
* Util functions for JDBC tables.
*/
object UpdateJdbcUtils {
val logger = Logger(this.getClass)
/**
* Returns a factory for creating connections to the given JDBC URL.
*
* @param options - JDBC options that contains url, table and other information.
*/
def createConnectionFactory(options: JDBCOptions): () => Connection = {
val driverClass: String = options.driverClass
() => {
DriverRegistry.register(driverClass)
val driver: Driver = DriverManager.getDrivers.asScala.collectFirst {
case d: DriverWrapper if d.wrapped.getClass.getCanonicalName == driverClass => d
case d if d.getClass.getCanonicalName == driverClass => d
}.getOrElse {
throw new IllegalStateException(
s"Did not find registered driver with class $driverClass")
}
driver.connect(options.url, options.asConnectionProperties)
}
}
/**
* Returns a PreparedStatement that inserts a row into table via conn.
*/
def insertStatement(conn: Connection, table: String, rddSchema: StructType, dialect: JdbcDialect)
: PreparedStatement = {
val columns = rddSchema.fields.map(x => dialect.quoteIdentifier(x.name)).mkString(",")
val placeholders = rddSchema.fields.map(_ => "?").mkString(",")
val sql = s"REPLACE INTO $table ($columns) VALUES ($placeholders)"
conn.prepareStatement(sql)
}
/**
* Retrieve standard jdbc types.
*
* @param dt The datatype (e.g. [[org.apache.spark.sql.types.StringType]])
* @return The default JdbcType for this DataType
*/
def getCommonJDBCType(dt: DataType): Option[JdbcType] = {
dt match {
case IntegerType => Option(JdbcType("INTEGER", java.sql.Types.INTEGER))
case LongType => Option(JdbcType("BIGINT", java.sql.Types.BIGINT))
case DoubleType => Option(JdbcType("DOUBLE PRECISION", java.sql.Types.DOUBLE))
case FloatType => Option(JdbcType("REAL", java.sql.Types.FLOAT))
case ShortType => Option(JdbcType("INTEGER", java.sql.Types.SMALLINT))
case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT))
case BooleanType => Option(JdbcType("BIT(1)", java.sql.Types.BIT))
case StringType => Option(JdbcType("TEXT", java.sql.Types.CLOB))
case BinaryType => Option(JdbcType("BLOB", java.sql.Types.BLOB))
case TimestampType => Option(JdbcType("TIMESTAMP", java.sql.Types.TIMESTAMP))
case DateType => Option(JdbcType("DATE", java.sql.Types.DATE))
case t: DecimalType => Option(
JdbcType(s"DECIMAL(${t.precision},${t.scale})", java.sql.Types.DECIMAL))
case _ => None
}
}
private def getJdbcType(dt: DataType, dialect: JdbcDialect): JdbcType = {
dialect.getJDBCType(dt).orElse(getCommonJDBCType(dt)).getOrElse(
throw new IllegalArgumentException(s"Can't get JDBC type for ${dt.simpleString}"))
}
// A `JDBCValueGetter` is responsible for getting a value from `ResultSet` into a field
// for `MutableRow`. The last argument `Int` means the index for the value to be set in
// the row and also used for the value in `ResultSet`.
private type JDBCValueGetter = (ResultSet, InternalRow, Int) => Unit
// A `JDBCValueSetter` is responsible for setting a value from `Row` into a field for
// `PreparedStatement`. The last argument `Int` means the index for the value to be set
// in the SQL statement and also used for the value in `Row`.
private type JDBCValueSetter = (PreparedStatement, Row, Int) => Unit
/**
* Saves a partition of a DataFrame to the JDBC database. This is done in
* a single database transaction (unless isolation level is "NONE")
* in order to avoid repeatedly inserting data as much as possible.
*
* It is still theoretically possible for rows in a DataFrame to be
* inserted into the database more than once if a stage somehow fails after
* the commit occurs but before the stage can return successfully.
*
* This is not a closure inside saveTable() because apparently cosmetic
* implementation changes elsewhere might easily render such a closure
* non-Serializable. Instead, we explicitly close over all variables that
* are used.
*/
def savePartition(
getConnection: () => Connection,
table: String,
iterator: Iterator[Row],
rddSchema: StructType,
nullTypes: Array[Int],
batchSize: Int,
dialect: JdbcDialect,
isolationLevel: Int): Iterator[Byte] = {
val conn = getConnection()
var committed = false
var finalIsolationLevel = Connection.TRANSACTION_NONE
if (isolationLevel != Connection.TRANSACTION_NONE) {
try {
val metadata = conn.getMetaData
if (metadata.supportsTransactions()) {
// Update to at least use the default isolation, if any transaction level
// has been chosen and transactions are supported
val defaultIsolation = metadata.getDefaultTransactionIsolation
finalIsolationLevel = defaultIsolation
if (metadata.supportsTransactionIsolationLevel(isolationLevel)) {
// Finally update to actually requested level if possible
finalIsolationLevel = isolationLevel
} else {
logger.warn(s"Requested isolation level $isolationLevel is not supported; " +
s"falling back to default isolation level $defaultIsolation")
}
} else {
logger.warn(s"Requested isolation level $isolationLevel, but transactions are unsupported")
}
} catch {
case NonFatal(e) => logger.warn("Exception while detecting transaction support", e)
}
}
val supportsTransactions = finalIsolationLevel != Connection.TRANSACTION_NONE
try {
if (supportsTransactions) {
conn.setAutoCommit(false) // Everything in the same db transaction.
conn.setTransactionIsolation(finalIsolationLevel)
}
val stmt = insertStatement(conn, table, rddSchema, dialect)
val setters: Array[JDBCValueSetter] = rddSchema.fields.map(_.dataType)
.map(makeSetter(conn, dialect, _))
val numFields = rddSchema.fields.length
try {
var rowCount = 0
while (iterator.hasNext) {
val row = iterator.next()
var i = 0
while (i < numFields) {
if (row.isNullAt(i)) {
stmt.setNull(i + 1, nullTypes(i))
} else {
setters(i).apply(stmt, row, i)
}
i = i + 1
}
stmt.addBatch()
rowCount += 1
if (rowCount % batchSize == 0) {
stmt.executeBatch()
rowCount = 0
}
}
if (rowCount > 0) {
stmt.executeBatch()
}
} finally {
stmt.close()
}
if (supportsTransactions) {
conn.commit()
}
committed = true
Iterator.empty
} catch {
case e: SQLException =>
val cause = e.getNextException
if (cause != null && e.getCause != cause) {
if (e.getCause == null) {
e.initCause(cause)
} else {
e.addSuppressed(cause)
}
}
throw e
} finally {
if (!committed) {
// The stage must fail. We got here through an exception path, so
// let the exception through unless rollback() or close() want to
// tell the user about another problem.
if (supportsTransactions) {
conn.rollback()
}
conn.close()
} else {
// The stage must succeed. We cannot propagate any exception close() might throw.
try {
conn.close()
} catch {
case e: Exception => logger.warn("Transaction succeeded, but closing failed", e)
}
}
}
}
/**
* Saves the RDD to the database in a single transaction.
*/
def saveTable(
df: DataFrame,
url: String,
table: String,
options: JDBCOptions) {
val dialect = JdbcDialects.get(url)
val nullTypes: Array[Int] = df.schema.fields.map { field =>
getJdbcType(field.dataType, dialect).jdbcNullType
}
val rddSchema = df.schema
val getConnection: () => Connection = createConnectionFactory(options)
val batchSize = options.batchSize
val isolationLevel = options.isolationLevel
df.foreachPartition(iterator => savePartition(
getConnection, table, iterator, rddSchema, nullTypes, batchSize, dialect, isolationLevel)
)
}
private def makeSetter(
conn: Connection,
dialect: JdbcDialect,
dataType: DataType): JDBCValueSetter = dataType match {
case IntegerType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setInt(pos + 1, row.getInt(pos))
case LongType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setLong(pos + 1, row.getLong(pos))
case DoubleType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setDouble(pos + 1, row.getDouble(pos))
case FloatType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setFloat(pos + 1, row.getFloat(pos))
case ShortType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setInt(pos + 1, row.getShort(pos))
case ByteType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setInt(pos + 1, row.getByte(pos))
case BooleanType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setBoolean(pos + 1, row.getBoolean(pos))
case StringType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setString(pos + 1, row.getString(pos))
case BinaryType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setBytes(pos + 1, row.getAs[Array[Byte]](pos))
case TimestampType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setTimestamp(pos + 1, row.getAs[java.sql.Timestamp](pos))
case DateType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setDate(pos + 1, row.getAs[java.sql.Date](pos))
case t: DecimalType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setBigDecimal(pos + 1, row.getDecimal(pos))
case ArrayType(et, _) =>
// remove type length parameters from end of type name
val typeName = getJdbcType(et, dialect).databaseTypeDefinition
.toLowerCase.split("\\(")(0)
(stmt: PreparedStatement, row: Row, pos: Int) =>
val array = conn.createArrayOf(
typeName,
row.getSeq[AnyRef](pos).toArray)
stmt.setArray(pos + 1, array)
case _ =>
(_: PreparedStatement, _: Row, pos: Int) =>
throw new IllegalArgumentException(
s"Can't translate non-null value for field $pos")
}
}
usage:
val url = s"jdbc:mysql://$host/$database?useUnicode=true&characterEncoding=UTF-8"
val parameters: Map[String, String] = Map(
"url" -> url,
"dbtable" -> table,
"driver" -> "com.mysql.jdbc.Driver",
"numPartitions" -> numPartitions.toString,
"user" -> user,
"password" -> password
)
val options = new JDBCOptions(parameters)
for (d <- data) {
UpdateJdbcUtils.saveTable(d, url, table, options)
}
PS: Watch out for deadlocks and do not update data frequently; use this only for re-runs in an emergency. I think that's why Spark does not support this officially.
If your table is small, you can read the SQL data, do the upsert in a Spark DataFrame, and then overwrite the existing SQL table.
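The read, merge, and overwrite idea comes down to "incoming rows win on a key clash, everything else is kept". As a rough illustration of those semantics only, here is a sketch in which plain Java maps stand in for the existing table and the new data; SmallTableUpsert and merge are hypothetical names, and this is not Spark code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SmallTableUpsert {
    /** Returns the existing rows with the incoming rows applied on top, keyed by primary key. */
    static Map<Integer, String> merge(Map<Integer, String> existing, Map<Integer, String> incoming) {
        Map<Integer, String> merged = new LinkedHashMap<>(existing);
        // Same key: the incoming value replaces the old one. New key: the row is added.
        merged.putAll(incoming);
        return merged;
    }

    public static void main(String[] args) {
        Map<Integer, String> existing = new LinkedHashMap<>();
        existing.put(1, "old-a");
        existing.put(2, "old-b");
        Map<Integer, String> incoming = new LinkedHashMap<>();
        incoming.put(2, "new-b");
        incoming.put(3, "new-c");
        System.out.println(merge(existing, incoming));
    }
}
```

The merged result is what would then be written back with an overwrite, which is why this approach only suits tables small enough to read in full.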
zero323's answer is right; I just wanted to add that you could use the JayDeBeApi package to work around this:
https://pypi.python.org/pypi/JayDeBeApi/
to update data in your MySQL table. It might be low-hanging fruit since you already have the MySQL JDBC driver installed.
The JayDeBeApi module allows you to connect from Python code to
databases using Java JDBC. It provides a Python DB-API v2.0 to that
database.
We use Anaconda distribution of Python, and JayDeBeApi python package comes standard.
See examples in that link above.
In PySpark I was not able to do that, so I decided to use ODBC.
import pyodbc
url = "jdbc:sqlserver://xxx:1433;databaseName=xxx;user=xxx;password=xxx"
# Write the new data to a staging table first
df.write.jdbc(url=url, table="__TableInsert", mode='overwrite')
cnxn = pyodbc.connect('Driver={ODBC Driver 17 for SQL Server};Server=xxx;Database=xxx;Uid=xxx;Pwd=xxx;', autocommit=False)
try:
    crsr = cnxn.cursor()
    # DO UPSERTS OR WHATEVER YOU WANT
    crsr.execute("DELETE FROM Table")
    crsr.execute("INSERT INTO Table (Field) SELECT Field FROM __TableInsert")
    cnxn.commit()
except Exception:
    # Undo the partial work and surface the error instead of swallowing it
    cnxn.rollback()
    raise
finally:
    cnxn.close()

Apache Spark with Hive

How can I read/write data from/to Hive?
Is it necessary to compile Spark with the Hive profile to interact with Hive?
Which Maven dependencies are required to interact with Hive?
I could not find good documentation to follow step by step to get working with Hive.
Currently, here is my code:
val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("test"))
val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val sqlCon = new SQLContext(sc)
val schemaString = "Date:string,Open:double,High:double,Low:double,Close:double,Volume:double,Adj_Close:double"
val schema =
StructType(
schemaString.split(",").map(fieldName => StructField(fieldName.split(":")(0),
getFieldTypeInSchema(fieldName.split(":")(1)), true)))
val rdd = sc.textFile("hdfs://45.55.159.119:9000/yahoo_stocks.csv")
//val rdd = sc.parallelize(arr)
val rowRDDx = noHeader.map(p => {
var list: collection.mutable.Seq[Any] = collection.mutable.Seq.empty[Any]
var index = 0
val regex = rowSplittingRegexBuilder(Seq(","))
var tokens = p.split(regex)
tokens.foreach(value => {
var valType = schema.fields(index).dataType
var returnVal: Any = null
valType match {
case IntegerType => returnVal = value.toString.toInt
case DoubleType => returnVal = value.toString.toDouble
case LongType => returnVal = value.toString.toLong
case FloatType => returnVal = value.toString.toFloat
case ByteType => returnVal = value.toString.toByte
case StringType => returnVal = value.toString
case TimestampType => returnVal = value.toString
}
list = list :+ returnVal
index += 1
})
Row.fromSeq(list)
})
val df = sqlCon.applySchema(rowRDDx, schema)
HiveContext.sql("create table yahoo_orc_table (date STRING, open_price FLOAT, high_price FLOAT, low_price FLOAT, close_price FLOAT, volume INT, adj_price FLOAT) stored as orc")
df.saveAsTable("hive", "org.apache.spark.sql.hive.orc", SaveMode.Append)
I am getting the following exception:
15/10/12 14:57:36 INFO storage.BlockManagerMaster: Registered BlockManager
15/10/12 14:57:38 INFO scheduler.EventLoggingListener: Logging events to hdfs://host:9000/spark/logs/local-1444676256555
Exception in thread "main" java.lang.VerifyError: Bad return type
Exception Details:
Location:
org/apache/spark/sql/catalyst/expressions/Pmod.inputType()Lorg/apache/spark/sql/types/AbstractDataType; #3: areturn
Reason:
Type 'org/apache/spark/sql/types/NumericType$' (current frame, stack[0]) is not assignable to 'org/apache/spark/sql/types/AbstractDataType' (from method signature)
Current Frame:
bci: #3
flags: { }
locals: { 'org/apache/spark/sql/catalyst/expressions/Pmod' }
stack: { 'org/apache/spark/sql/types/NumericType$' }
Bytecode:
0000000: b200 63b0
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2595)
at java.lang.Class.getConstructor0(Class.java:2895)
at java.lang.Class.getDeclaredConstructor(Class.java:2066)
at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$4.apply(FunctionRegistry.scala:267)
at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$4.apply(FunctionRegistry.scala:267)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.expression(FunctionRegistry.scala:267)
at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.<init>(FunctionRegistry.scala:148)
at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.<clinit>(FunctionRegistry.scala)
at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:414)
at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:413)
at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:39)
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:203)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:72)
thanks
As dr_stein mentioned, this error is usually due to incompatible compile-time and runtime JDK versions, such as running JDK 1.6 with jars compiled for 1.7.
I would also check whether your Hive libraries reflect the correct versions and whether your Hive server is running on the same JDK as you are.
You can also try running with the -noverify option, which disables bytecode verification.
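To check which JDK a jar was compiled for, one option is to read the major version stored in the first eight bytes of any .class file inside it and compare it with the runtime. The sketch below does that for a header supplied as a byte array. ClassVersion and majorVersion are hypothetical names, and the sample header in main is made up for illustration.

```java
import java.nio.ByteBuffer;

final class ClassVersion {
    /** Returns the class file major version: 50 = Java 6, 51 = Java 7, 52 = Java 8. */
    static int majorVersion(byte[] header) {
        // ByteBuffer is big-endian by default, which matches the class file format.
        ByteBuffer buf = ByteBuffer.wrap(header);
        if (buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        buf.getShort();                 // minor version, not needed here
        return buf.getShort() & 0xFFFF; // major version as an unsigned value
    }

    public static void main(String[] args) {
        // Sample header: magic number, minor version 0, major version 51.
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 51};
        System.out.println("class major version: " + majorVersion(header));
        System.out.println("runtime: " + System.getProperty("java.version"));
    }
}
```

If the major version of a class is higher than what the running JVM supports, the jar was built for a newer JDK than the one executing it.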
