How to fetch SQL query results in Airflow using the JDBC operator - jdbc

I have configured a JDBC connection in Airflow connections. The task part of my DAG looks like below and contains a select statement. Triggering the DAG succeeds, but the query results are not printed in the log. How can I fetch the results of the query using the JDBC operator?
dag = DAG(dag_id='test_azure_sqldw_v1', default_args=default_args,
          schedule_interval=None, dagrun_timeout=timedelta(seconds=120))

sql = "select count(*) from tablename"

azure_sqldw = JdbcOperator(task_id='azure_sqldw', sql=sql, jdbc_conn_id="cdf_sqldw",
                           autocommit=True, dag=dag)

The operator does not print to the log; it just runs the query.
If you want to fetch the results and do something with them, you need to use the hook.
from pprint import pprint

from airflow.operators.python import PythonOperator
from airflow.providers.jdbc.hooks.jdbc import JdbcHook


def func(jdbc_conn_id, sql, **kwargs):
    """Print df from JDBC."""
    pprint(kwargs)
    hook = JdbcHook(jdbc_conn_id=jdbc_conn_id)
    df = hook.get_pandas_df(sql=sql, autocommit=True)
    print(df.to_string())


run_this = PythonOperator(
    task_id='task',
    python_callable=func,
    op_kwargs={'jdbc_conn_id': 'cdf_sqldw', 'sql': 'select count(*) from tablename'},
    dag=dag,
)
You can also create a custom operator that performs the action you need.
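For example, a minimal sketch of such an operator might look like this (the class name JdbcFetchOperator is illustrative, not part of the JDBC provider; it simply wraps the same JdbcHook and logs what it fetches):

from airflow.models.baseoperator import BaseOperator
from airflow.providers.jdbc.hooks.jdbc import JdbcHook


class JdbcFetchOperator(BaseOperator):
    """Hypothetical operator: runs a query over JDBC and logs the rows."""

    def __init__(self, *, sql, jdbc_conn_id, **kwargs):
        super().__init__(**kwargs)
        self.sql = sql
        self.jdbc_conn_id = jdbc_conn_id

    def execute(self, context):
        hook = JdbcHook(jdbc_conn_id=self.jdbc_conn_id)
        records = hook.get_records(self.sql)
        self.log.info("Query results: %s", records)
        return records  # the return value is also pushed to XCom by default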

Related

How to run a SQL query that uses "for json" and "for xml" in SQL Server?

I want to run a complicated SQL query that uses "for json" and "for xml" in Microsoft SQL Server. I used ExecuteSQL, but it gave me this error:
ExecuteSQL[id=87f3d800-016c-1000-28be-8d99127d267e] Unable to execute SQL {my sql query} for StandardFlowFileRecord[uuid=ef1bc7c3-2e48-4911-abea-52e9b5a432b2,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1565651606554-2, container=default, section=2], offset=176, length=8],offset=0,name=ef1bc7c3-2e48-4911-abea-52e9b5a432b2,size=8] due to org.apache.avro.SchemaParseException: Illegal character in: XML_F52E2B61-18A1-11d1-B105-00805F49916B; routing to failure: org.apache.nifi.processor.exception.ProcessException: org.apache.avro.SchemaParseException: Illegal character in: XML_F52E2B61-18A1-11d1-B105-00805F49916B
How can I get the result as JSON or XML in Apache NiFi?
Format Query Results as JSON with FOR JSON (SQL Server)
Note: I have no experience with such queries and how they are returned to the client.
I assume the server returns one row and one column containing a string with the JSON (or XML).
In this case the script for ExecuteGroovyScript could be like this:
def ff = session.get()
if (!ff) return

def query = '''
    SELECT TOP 2 ArtistName,
        (SELECT AlbumName FROM Albums
         WHERE Artists.ArtistId = Albums.ArtistId FOR JSON PATH) AS Albums
    FROM Artists ORDER BY ArtistName FOR JSON PATH
'''

ff.write("UTF-8") { writer ->
    SQL.mydb.eachRow(query) { row ->
        writer.append(row[1])  //get first column from the row
        writer.append('\n')    //expecting 1 row but just in case add separation for the next row
    }
}

//transfer to success
REL_SUCCESS << ff
To make SQL.mydb available to the script, create a property with this name and connect it to the corresponding DBCP controller service: https://i.stack.imgur.com/C83f5.png

Logstash extracting values from sp_executesql

We're tracking and shipping our SQL Server procedure timeouts into Elasticsearch, so we can visualize them in Kibana in order to spot issues.
Some of our SQL queries are parameterized and use sp_executesql.
Would it be possible to extract its parameters and their values from the query?
For instance:
EXEC sp_executesql N'EXEC dbo.MySearchProcedure @UserId=@p0,@SearchPhrase=@p1'
    , N'@p0 int,@p1 nvarchar(max)'
    , @p0 = 11111
    , @p1 = N'denmark';
And get this result out of it:
{
  "Procedure": "dbo.MySearchProcedure",
  "Statement": "exec sp_executesql N'exec Search.GetAnalysisResultsListTextSearch @SubscriberId=@p0,@SearchTerms=@p1,@SortType=@p2',N'@p0 int,@p1 nvarchar(max) ,@p2 int',@p0=47594,@p1=N'denmark',@p2=0",
  "Parameters": {
    "UserId": 11111,
    "SearchPhrase": "denmark"
  }
}
Sounds like a job for the ruby{} filter. First, locate all the key=placeholder pairs in the query (@userid=@p0, probably using Ruby's scan feature), then locate the assignments (@p0=1234, using scan again), then create a new field combining the two (userid=1234). In the ruby filter:
event['userid'] = '1234'
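To make the scan-and-combine idea concrete, here is a rough sketch of the same parsing logic in plain Python (illustrative only; in a Logstash pipeline the equivalent regex scans would live inside the ruby filter and write the result into the event):

import re

statement = ("EXEC sp_executesql N'EXEC dbo.MySearchProcedure @UserId=@p0,@SearchPhrase=@p1'"
             ", N'@p0 int,@p1 nvarchar(max)', @p0 = 11111, @p1 = N'denmark';")

# Placeholder -> parameter name, e.g. {'@p0': 'UserId'}
names = {ph: name for name, ph in re.findall(r"@(\w+)=(@p\d+)", statement)}

# Placeholder -> assigned value, e.g. {'@p0': '11111'}
values = dict(re.findall(r"(@p\d+)\s*=\s*(N?'[^']*'|\d+)", statement))

# Combine the two maps into the final parameter/value pairs
params = {names[ph]: value for ph, value in values.items() if ph in names}
print(params)  # {'UserId': '11111', 'SearchPhrase': "N'denmark'"}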

Spark DataFrame not executing group-by statements within a JDBC data source

I've registered a MySQL data source as follows:
val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://address=(protocol=tcp)(host=myhost)(port=3306)(user=)(password=)/dbname"
val jdbcDF = sqlContext.load("jdbc", Map(
"url" -> url,
"driver" -> driver,
"dbtable" -> "videos"))
jdbcDF.registerTempTable("videos")
and then executed the following Spark SQL query:
select
uploader, count(*) as items
from
videos_table
where
publisher_id = 154
group by
uploader
order by
items desc
This call actually executes the following query on the MySQL server:
SELECT uploader,publisher_id FROM videos WHERE publisher_id = 154
and then loads the data to the Spark cluster and performs the group-by as a Spark operation.
This behavior is problematic due to the excess network traffic created by not performing the group-by on the MySQL server. Is there a way to force the DataFrame to run the literal query on the MySQL server?
Well, it depends. Over JDBC, Spark can push down only predicates, so it is not possible to dynamically execute an arbitrary query on the database side. Still, it is possible to use any valid query as the table argument, so you can do something like this:
val tableQuery =
"""(SELECT uploader, count(*) as items FROM videos GROUP BY uploader) tmp"""
val jdbcDF = sqlContext.load("jdbc", Map(
"url" -> url,
"driver" -> driver,
"dbtable" -> tableQuery
))
If that's not enough, you can try to create a custom data source.

BigQuery - Check if table already exists

I have a dataset in BigQuery. This dataset contains multiple tables.
I am doing the following steps programmatically using the BigQuery API:
Step 1: Querying the tables in the dataset - since my response is too large, I enable the allowLargeResults parameter and divert the response to a destination table.
Step 2: Exporting the data from the destination table to a GCS bucket.
Requirements:
Suppose my process fails at Step 2; I would like to re-run this step.
But before I re-run, I would like to check/verify that the specific destination table named 'xyz' already exists in the dataset.
If it exists, I would like to re-run step 2.
If it does not exist, I would like to do foo.
How can I do this?
Thanks in advance.
Alex F's solution works on v0.27, but will not work on later versions. To migrate to v0.28+, the solution below will work.
from google.cloud import bigquery

project_nm = 'gc_project_nm'
dataset_nm = 'ds_nm'
table_nm = 'tbl_nm'

client = bigquery.Client(project_nm)
dataset = client.dataset(dataset_nm)
table_ref = dataset.table(table_nm)


def if_tbl_exists(client, table_ref):
    from google.cloud.exceptions import NotFound
    try:
        client.get_table(table_ref)
        return True
    except NotFound:
        return False


if_tbl_exists(client, table_ref)
Here is a Python snippet that will tell whether a table exists (deleting it in the process, so be careful!):
def doesTableExist(project_id, dataset_id, table_id):
    bq.tables().delete(
        projectId=project_id,
        datasetId=dataset_id,
        tableId=table_id).execute()
    return False
Alternatively, if you'd prefer not to delete the table in the process, you could try:
def doesTableExist(project_id, dataset_id, table_id):
    try:
        bq.tables().get(
            projectId=project_id,
            datasetId=dataset_id,
            tableId=table_id).execute()
        return True
    except HttpError as err:
        if err.resp.status != 404:
            raise
        return False
If you want to know where bq came from, you can call build_bq_client from here: http://code.google.com/p/bigquery-e2e/source/browse/samples/ch12/auth.py
In general, if you're using this to test whether you should run a job that will modify the table, it can be a good idea to just do the job anyway, and use WRITE_TRUNCATE as a write disposition.
Another approach can be to create a predictable job id, and retry the job with that id. If the job already exists, the job already ran (you might want to double check to make sure the job didn't fail, however).
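As a rough sketch of those two ideas with the current google-cloud-bigquery client (the project, dataset, table, and job names below are made up for illustration):

from google.cloud import bigquery

client = bigquery.Client()

# WRITE_TRUNCATE overwrites the destination table if it already exists,
# so the job can be re-run without checking for the table first.
job_config = bigquery.QueryJobConfig(
    destination="my_project.my_dataset.destination_tbl",  # illustrative name
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# A predictable job_id makes the retry idempotent: if a job with this id
# already exists, the insert raises a Conflict and you can inspect that job
# (and whether it failed) instead of running it again.
job = client.query(
    "SELECT * FROM `my_dataset.source_tbl`",  # illustrative query
    job_config=job_config,
    job_id="export-step-2-attempt-1",
)
job.result()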
Enjoy:
def doesTableExist(bigquery, project_id, dataset_id, table_id):
    try:
        bigquery.tables().get(
            projectId=project_id,
            datasetId=dataset_id,
            tableId=table_id).execute()
        return True
    except Exception as err:
        if err.resp.status != 404:
            raise
        return False
The only change from the earlier snippet is in the exception handling.
You can now use exists() to check whether a dataset exists, and the same works for a table.
BigQuery exists documentation
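If your client version does not expose exists(), the same get-and-catch-NotFound pattern shown earlier also works for datasets; here is a minimal sketch, assuming a recent google-cloud-bigquery client and an illustrative dataset id:

from google.cloud import bigquery
from google.cloud.exceptions import NotFound


def dataset_exists(client, dataset_id):
    # dataset_id in the form "my_project.my_dataset" (illustrative)
    try:
        client.get_dataset(dataset_id)
        return True
    except NotFound:
        return False


dataset_exists(bigquery.Client(), "my_project.my_dataset")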
Recently BigQuery introduced so-called scripting statements, which can be quite a game changer for some flows.
Check them out here:
https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting
Now, for example, to check whether a table exists you can use something like this:
sql = """
BEGIN
IF EXISTS(SELECT 1 from `YOUR_PROJECT.YOUR_DATASET.YOUR_TABLE) THEN
SELECT 'table_found';
END IF;
EXCEPTION WHEN ERROR THEN
# you can print your own message like above or return error message
# however google says not to rely on error message structure as it may change
select ##error.message;
END;
"""
With my_bigquery being an instance of the class google.cloud.bigquery.Client (already authenticated and associated with a project):
my_bigquery.dataset(dataset_name).table(table_name).exists() # returns boolean
It does an API call to test for the existence of the table via a GET request.
Source: https://googlecloudplatform.github.io/google-cloud-python/0.24.0/bigquery-table.html#google.cloud.bigquery.table.Table.exists
It works for me using version 0.27 of the Google BigQuery Python module.
Inline SQL Alternative
tarheel's answer is probably the most correct at this point in time,
but I was considering the comment from Ivan above that "404 could also mean the resource is not there for a bunch of reasons", so here is a solution that should always successfully run a metadata query and return a result.
It's not the fastest, because it always has to run a query, and BigQuery has overhead for small queries.
A trick I've seen previously is to query information_schema for a (table) object, and union that with a fake query that ensures a record is always returned even if the object doesn't exist. There's also a LIMIT 1 and an ordering to ensure the single record returned represents the table, if it does exist. See the SQL in the code below.
In spite of doc claims that BigQuery standard SQL is ISO compliant, it doesn't support information_schema, but it does have __TABLES_SUMMARY__.
A dataset is required because you can't query __TABLES_SUMMARY__ without specifying a dataset.
The dataset is not a parameter in the SQL because you can't parameterize object names without SQL injection issues (apart from with the magical _TABLE_SUFFIX, see https://cloud.google.com/bigquery/docs/querying-wildcard-tables ).
#!/usr/bin/env python
"""
Inline SQL way to check a table exists in BigQuery
e.g.
print(table_exists(dataset_name='<dataset_goes_here>', table_name='<real_table_name>'))
True
print(table_exists(dataset_name='<dataset_goes_here>', table_name='imaginary_table_name'))
False
"""
from __future__ import print_function
from google.cloud import bigquery


def table_exists(dataset_name, table_name):
    client = bigquery.Client()
    query = """
        SELECT table_exists FROM
        (
          SELECT true as table_exists, 1 as ordering
          FROM __TABLES_SUMMARY__ WHERE table_id = @table_name
          UNION ALL
          SELECT false as table_exists, 2 as ordering
        ) ORDER by ordering LIMIT 1"""
    query_params = [bigquery.ScalarQueryParameter('table_name', 'STRING', table_name)]
    job_config = bigquery.QueryJobConfig()
    job_config.query_parameters = query_params

    if dataset_name is not None:
        dataset_ref = client.dataset(dataset_name)
        job_config.default_dataset = dataset_ref

    query_job = client.query(
        query,
        job_config=job_config
    )
    results = query_job.result()
    for row in results:
        # There is only one row because of LIMIT 1 in the SQL
        return row.table_exists

ActiveRecord Subquery Inner Join

I am trying to convert a "raw" PostGIS SQL query into a Rails ActiveRecord query. My goal is to convert two sequential ActiveRecord queries (each taking ~1ms) into a single ActiveRecord query taking ~1ms. Using the SQL below with ActiveRecord::Base.connection.execute I was able to validate the reduction in time.
Thus, my direct request is to help me to convert this query into an ActiveRecord query (and the best way to execute it).
SELECT COUNT(*)
FROM "users"
INNER JOIN (
SELECT "centroid"
FROM "zip_caches"
WHERE "zip_caches"."postalcode" = '<postalcode>'
) AS "sub" ON ST_Intersects("users"."vendor_coverage", "sub"."centroid")
WHERE "users"."active" = 1;
Note that the value <postalcode> is the only variable data in this query. Obviously, there are two models here: User and ZipCache. User has no direct relation to ZipCache.
The current two step ActiveRecord query looks like this.
zip = ZipCache.select(:centroid).where(postalcode: '<postalcode>').limit(1).first
User.where{st_intersects(vendor_coverage, zip.centroid)}.count
Disclaimer: I've never used PostGIS.
First, in your final request, it seems like you've missed the WHERE "users"."active" = 1; part.
Here is what I'd do:
First, add an active scope on User (for reusability):
scope :active, -> { User.where(active: 1) }
Then, for the actual query, you can build the subquery without executing it and use it in a joins on the User model, such as:
subquery = ZipCache.select(:centroid).where(postalcode: '<postalcode>')
User.active
.joins("INNER JOIN (#{subquery.to_sql}) sub ON ST_Intersects(users.vendor_coverage, sub.centroid)")
.count
This allows for minimal raw SQL while keeping only one query.
In any case, check the actual SQL request in your console/log by setting the logger level to debug.
The amazing tool scuttle.io is perfect for converting these sorts of queries:
User.select(Arel.star.count).where(User.arel_table[:active].eq(1)).joins(
  User.arel_table.join(ZipCache.arel_table).on(
    Arel::Nodes::NamedFunction.new(
      'ST_Intersects', [
        User.arel_table[:vendor_coverage], Sub.arel_table[:centroid]
      ]
    )
  ).join_sources
)
