Getting java.lang.RuntimeException:driver class not found when used jdbc_hook more than once in airflow operator - jdbc

Use case is to run list of sql in hive and update impala metadata. As shown below two methods for hive and impala uses jdbc_hook. In which ever order I call these methods only first one runs and second one throws ERROR - java.lang.RuntimeException: Class <driver name of hive/impala> not found. Each method runs fine when used separately.
Please find the execute method of airflow custom operator :::
Note :: I can't use hive_operator to run hive statements. And I don't see any methods in HiveServer2_Hook. Am new to airflow any help is much appreciated
from airflow.models.baseoperator import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.hooks.jdbc_hook import JdbcHook
import sqlparse
class CustomHiveOperator(BaseOperator):
"""
Executes hql code and invalidates,compute stats impala for that table.
Requires JdbcHook,sqlparse.
:param hive_jdbc_conn: reference to a predefined hive database
:type hive_jdbc_conn: str
:param impala_jdbc_conn: reference to a predefined impala database
:type impala_jdbc_conn: str
:param table_name: hive table name, used for post process in impala
:type table_name: str
:param script_path: hql scirpt path to run in hive
:type script_path: str
:param autocommit: if True, each command is automatically committed.
(default value: False)
:type autocommit: bool
:param parameters: (optional) the parameters to render the SQL query with.
:type parameters: mapping or iterable
"""
#apply_defaults
def __init__(
self,
hive_jdbc_conn: str,
impala_jdbc_conn:str,
table_name:str,
script_path:str,
autocommit=False,
parameters=None,
*args, **kwargs) -> None:
super().__init__(*args, **kwargs)
self.hive_jdbc_conn= hive_jdbc_conn
self.impala_jdbc_conn= impala_jdbc_conn
self.table_name=table_name
self.script_path=script_path
self.autocommit=autocommit
self.parameters=parameters
def execute(self,context):
self.hive_run()
self.impala_post_process()
def format_string(self,x):
return x.replace(";","")
def hive_run(self):
with open(self.script_path) as f:
data = f.read()
hql_temp = sqlparse.split((data))
hql = [self.format_string(x) for x in hql_temp]
self.log.info('Executing: %s', hql)
self.hive_hook = JdbcHook(jdbc_conn_id=self.hive_jdbc_conn)
self.hive_hook.run(hql, self.autocommit, parameters=self.parameters)
def impala_post_process(self):
invalidate = 'INVALIDATE METADATA '+self.table_name
compute_stats = 'COMPUTE STATS '+self.table_name
hql = [invalidate,compute_stats]
self.log.info('Executing: %s', hql)
self.impala_hook = JdbcHook(jdbc_conn_id=self.impala_jdbc_conn)
self.impala_hook.run(hql, self.autocommit, parameters=self.parameters)

This is actually an issue with how Airflow uses jaydebeapi and the underlying JPype modules to facilitate the JDBC connection.
A Java virtual machine is started when JPype is first used (the first JdbcHook.get_conn call) and the only libraries that the virtual machine is made aware of is the specific one you're using for whichever JDBC connection is being made. When you create another connection the virtual machine is already started and isn't aware of the libraries necessary for a different connection type.
The only way that I have found around this is to use an extension of JdbcHook which overrides the get_conn method to gather the paths of all JDBC drivers that are defined as a Connection object in Airflow. See here for the Airflow implementation.

Related

TypeError: Object of type RowProxy is not JSON serializable - Flask

I am using SQLAlchemy to query the database from my Flask web-application using engine.After I do the SELECT Query and also do use fetchall object after ResultProxy is returned which ultimately returns RowProxy object and then I store in session.
Here is my code:
import os
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker
from flask import Flask, session
engine = create_engine(os.environ.get('DATABASE_URL'))
db = scoped_session(sessionmaker(bind=engine))
app = Flask(__name__)
app.secret_key = os.environ.get('SECRET_KEY')
#app.route('/')
def index():
session['list'] = db.execute("SELECT title,author,year FROM books WHERE year = 2011 LIMIT 4").fetchall()
print(session['list'])
return "<h1>hello world</h1>"
if __name__ == "__main__":
app.run(debug = True)
Here is the output:
[('Steve Jobs', 'Walter Isaacson', 2011), ('Legend', 'Marie Lu', 2011), ('Hit List', 'Laurell K. Hamilton', 2011), ('Born at Midnight', 'C.C. Hunter', 2011)]
Traceback (most recent call last):
File "C:\Users\avise\AppData\Local\Programs\Python\Python38\Lib\site-packages\flask\app.py", line 2463, in __call__
return self.wsgi_app(environ, start_response)
File "C:\Users\avise\AppData\Local\Programs\Python\Python38\Lib\site-packages\flask\app.py", line 2449, in wsgi_app
response = self.handle_exception(e)
File "C:\Users\avise\AppData\Local\Programs\Python\Python38\Lib\site-packages\flask\app.py", line 1866, in handle_exception
reraise(exc_type, exc_value, tb)
File "C:\Users\avise\AppData\Local\Programs\Python\Python38\Lib\json\encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type RowProxy is not JSON serializable
The session item stores the data as i can see in output.But "hello world" is not rendered.
And if i replace the session variable by ordinary variable say x then it seems to be working.
But i think i need to use sessions so that my application will be used simultaneously by to users to display different things. So, how could i use sessions in this case or is there any other way?
Any help will be appreciated as I am new to Flask and web-development.
From what I understand about the Flask Session object is that it acts as a python dictionary; however values must be JSON serializable. In this case, just like the error suggests, the RowProxy object that is being returned by fetch all is not json serializable.
A solution to this problem would be to instead pass through a result of your query as a dictionary (which is JSON serializable).
It looks like the result of your query is returning a list of tuples so we can do the following:
res = db.execute("SELECT title,author,year FROM books WHERE year = 2011 LIMIT 4").fetchall()
user_books = {}
index = 0
for entry in res:
user_books[index] = {'title':res[index][0],
'author':res[index][1],
'year':res[index][2],
}
index += 1
session['list'] = user_books
A word of caution; however, is that since we are using the title of the book as a key, if there are two books with the same title, information may be overwritten, so consider using a unique id as the key.
Also note that the dictionary construction above would only work for the query you already have - if you added another column to the select statement you would have to edit the code to include the extra column information.

Trying to access an object from a listener python web framework

Pretty new to asynch so here is my question and thank you in advance.
Hi All very simple question I might be thinking too much into.
I am trying to access this cassandra client outside of these defined listeners below that get registered to a sanic main app.
I need the session in order to use an update query which will execute Asynchronously. I can definetly connect and event query from the 'setup_cassandra_session_listener' method below. But having tough time figuring how to call this Cassandra session outside and isolate so i can access else where.
from aiocassandra import aiosession
from cassandra.cluster import Cluster
from sanic import Sanic
from config import CLUSTER_HOST, TABLE_NAME, CASSANDRA_KEY_SPACE, CASSANDRA_PORT, DATA_CENTER, DEBUG_LEVEL, LOGGER_FORMAT
log = logging.getLogger('sanic')
log.setLevel('INFO')
cassandra_cluster = None
def setup_cassandra_session_listener(app, loop):
global cassandra_cluster
cassandra_cluster = Cluster([CLUSTER_HOST], CASSANDRA_PORT, DATA_CENTER)
session = cassandra_cluster.connect(CASSANDRA_KEY_SPACE)
metadata = cassandra_cluster.metadata
app.session = cassandra_cluster.connect(CASSANDRA_KEY_SPACE)
log.info('Connected to cluster: ' + metadata.cluster_name)
aiosession(session)
app.cassandra = session
def teardown_cassandra_session_listener(app, loop):
global cassandra_cluster
cassandra_cluster.shutdown()
def register_cassandra(app: Sanic):
app.listener('before_server_start')(setup_cassandra_session_listener)
app.listener('after_server_stop')(teardown_cassandra_session_listener)
Here is a working example that should do what you need. It does not actually run Cassandra (since I have no experience doing that). But, in principle this should work with any database connection you need to manage across the lifespan of your running server.
from sanic import Sanic
from sanic.response import text
app = Sanic()
class DummyCluser:
def connect(self):
print("Connecting")
return "session"
def shutdown(self):
print("Shutting down")
def setup_cassandra_session_listener(app, loop):
# No global variables needed
app.cluster = DummyCluser()
app.session = app.cluster.connect()
def teardown_cassandra_session_listener(app, loop):
app.cluster.shutdown()
def register_cassandra(app: Sanic):
# Changed these listeners to be more friendly if running with and ASGI server
app.listener('after_server_start')(setup_cassandra_session_listener)
app.listener('before_server_stop')(teardown_cassandra_session_listener)
#app.get("/")
async def get(request):
return text(app.session)
if __name__ == "__main__":
register_cassandra(app)
app.run(debug=True)
The idea is that you attach to your app instance (as you did) and then are able to simply access that inside your routes with request.app.

Trying to Reconnet issue in Odoo 11.0

In the case of importing or exporting large amount of data (like 5000 records) from odoo, It show connection lost and trying to reconnect messages. So is there any way to deal with it while working with large amount of records ?
I have same issue in odoo 12 when I tried to import translation. I did some hard troubleshooting, I disabled nginx that I configured with self signed SSL.
In my case, import records from MSSQL.
Use transient model and pyodbc
import pyodbc
class Import(models.TransientModel):
#api.multi
def insert_records(self):
try:
cnxn = pyodbc.connect(
'DRIVER={SQL Server}; SERVER=server_address; DATABASE=db_name; UID=uid_name; PWD=pass_word')
cursor = cnxn.cursor()
cursor.execute("SELECT * FROM MSSQL_table")
rows = cursor.fetchall() # or cursor.fetchmany(5000)
pg_table = self.env["pgSql_table"].search([])
for row in rows:
pg_table.create({
"pg_colume_name1": row.SQL_colume_name1, ...
})
except Exception as e:
pass
return True
<button string="import" type="object" name="insert_records" confirm="confirm?"/>
Click button to run insert method,and use pyCharm to set break-points while running.
Fetchmany(number) allow you to test few records

How do I prevent logging an encrypted airflow variable?

I've set up an airflow which has an encrypted variable declared. I'm using BigQueryOperator. I use the encrypted variable in the SQL that is fed to the class. But the airflow logs the SQL after decrypting the variable. How can I prevent that from happening?
Unfortunately, there is no built-in way to achieve this.
A possible work-around is removing self.log.info('Executing: %s', self.sql) line in BigQueryOperator or creating a new Operator inheriting BigQueryOperator like below:
class CustomBQOperator(BigQueryOperator):
#apply_defaults
def __init__(self, *args, **kwargs):
super(CustomBQOperator).__init__(*args, **kwargs)
def execute(self, context):
if self.bq_cursor is None:
hook = BigQueryHook(
bigquery_conn_id=self.bigquery_conn_id,
use_legacy_sql=self.use_legacy_sql,
delegate_to=self.delegate_to)
conn = hook.get_conn()
self.bq_cursor = conn.cursor()
self.bq_cursor.run_query(
self.sql,
destination_dataset_table=self.destination_dataset_table,
write_disposition=self.write_disposition,
allow_large_results=self.allow_large_results,
flatten_results=self.flatten_results,
udf_config=self.udf_config,
maximum_billing_tier=self.maximum_billing_tier,
maximum_bytes_billed=self.maximum_bytes_billed,
create_disposition=self.create_disposition,
query_params=self.query_params,
labels=self.labels,
schema_update_options=self.schema_update_options,
priority=self.priority,
time_partitioning=self.time_partitioning,
api_resource_configs=self.api_resource_configs,
cluster_fields=self.cluster_fields,
)
And then using this CustomBQOperator instead of BigQueryOperator

Streaming to HBase with pyspark

There is a fair amount of info online about bulk loading to HBase with Spark streaming using Scala (these two were particularly useful) and some info for Java, but there seems to be a lack of info for doing it with PySpark. So my questions are:
How can data be bulk loaded into HBase using PySpark?
Most examples in any language only show a single column per row being upserted. How can I upsert multiple columns per row?
The code I currently have is as follows:
if __name__ == "__main__":
context = SparkContext(appName="PythonHBaseBulkLoader")
streamingContext = StreamingContext(context, 5)
stream = streamingContext.textFileStream("file:///test/input");
stream.foreachRDD(bulk_load)
streamingContext.start()
streamingContext.awaitTermination()
What I need help with is the bulk load function
def bulk_load(rdd):
#???
I've made some progress previously, with many and various errors (as documented here and here)
So after much trial and error, I present here the best I have come up with. It works well, and successfully bulk loads data (using Puts or HFiles) I am perfectly willing to believe that it is not the best method, so any comments/other answers are welcome. This assume you're using a CSV for your data.
Bulk loading with Puts
By far the easiest way to bulk load, this simply creates a Put request for each cell in the CSV and queues them up to HBase.
def bulk_load(rdd):
#Your configuration will likely be different. Insert your own quorum and parent node and table name
conf = {"hbase.zookeeper.qourum": "localhost:2181",\
"zookeeper.znode.parent": "/hbase-unsecure",\
"hbase.mapred.outputtable": "Test",\
"mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",\
"mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",\
"mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
load_rdd = rdd.flatMap(lambda line: line.split("\n"))\#Split the input into individual lines
.flatMap(csv_to_key_value)#Convert the CSV line to key value pairs
load_rdd.saveAsNewAPIHadoopDataset(conf=conf,keyConverter=keyConv,valueConverter=valueConv)
The function csv_to_key_value is where the magic happens:
def csv_to_key_value(row):
cols = row.split(",")#Split on commas.
#Each cell is a tuple of (key, [key, column-family, column-descriptor, value])
#Works well for n>=1 columns
result = ((cols[0], [cols[0], "f1", "c1", cols[1]]),
(cols[0], [cols[0], "f2", "c2", cols[2]]),
(cols[0], [cols[0], "f3", "c3", cols[3]]))
return result
The value converter we defined earlier will convert these tuples into HBase Puts
Bulk loading with HFiles
Bulk loading with HFiles is more efficient: rather than a Put request for each cell, an HFile is written directly and the RegionServer is simply told to point to the new HFile. This will use Py4J, so before the Python code we have to write a small Java program:
import py4j.GatewayServer;
import org.apache.hadoop.hbase.*;
public class GatewayApplication {
public static void main(String[] args)
{
GatewayApplication app = new GatewayApplication();
GatewayServer server = new GatewayServer(app);
server.start();
}
}
Compile this, and run it. Leave it running as long as your streaming is happening. Now update bulk_load as follows:
def bulk_load(rdd):
#The output class changes, everything else stays
conf = {"hbase.zookeeper.qourum": "localhost:2181",\
"zookeeper.znode.parent": "/hbase-unsecure",\
"hbase.mapred.outputtable": "Test",\
"mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",\
"mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",\
"mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}#"org.apache.hadoop.hbase.client.Put"}
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
.flatMap(csv_to_key_value)\
.sortByKey(True)
#Don't process empty RDDs
if not load_rdd.isEmpty():
#saveAsNewAPIHadoopDataset changes to saveAsNewAPIHadoopFile
load_rdd.saveAsNewAPIHadoopFile("file:///tmp/hfiles" + startTime,
"org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
conf=conf,
keyConverter=keyConv,
valueConverter=valueConv)
#The file has now been written, but HBase doesn't know about it
#Get a link to Py4J
gateway = JavaGateway()
#Convert conf to a fully fledged Configuration type
config = dict_to_conf(conf)
#Set up our HTable
htable = gateway.jvm.org.apache.hadoop.hbase.client.HTable(config, "Test")
#Set up our path
path = gateway.jvm.org.apache.hadoop.fs.Path("/tmp/hfiles" + startTime)
#Get a bulk loader
loader = gateway.jvm.org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles(config)
#Load the HFile
loader.doBulkLoad(path, htable)
else:
print("Nothing to process")
Finally, the fairly straightforward dict_to_conf:
def dict_to_conf(conf):
gateway = JavaGateway()
config = gateway.jvm.org.apache.hadoop.conf.Configuration()
keys = conf.keys()
vals = conf.values()
for i in range(len(keys)):
config.set(keys[i], vals[i])
return config
As you can see, bulk loading with HFiles is more complex than using Puts, but depending on your data load it is probably worth it since once you get it working it's not that difficult.
One last note on something that caught me off guard: HFiles expect the data they receive to be written in lexical order. This is not always guaranteed to be true, especially since "10" < "9". If you have designed your key to be unique, then this can be fixed easily:
load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
.flatMap(csv_to_key_value)\
.sortByKey(True)#Sort in ascending order

Resources