BigQuery - Check if table already exists - google-api

I have a dataset in BigQuery. This dataset contains multiple tables.
I am doing the following steps programmatically using the BigQuery API:
Step 1: Querying the tables in the dataset. Since my response is too large, I am enabling the allowLargeResults parameter and diverting the response to a destination table.
Step 2: Exporting the data from the destination table to a GCS bucket.
Requirements:
Suppose my process fails at Step 2; I would like to re-run that step.
But before I re-run, I would like to check/verify that the specific destination table named 'xyz' already exists in the dataset.
If it exists, I would like to re-run step 2.
If it does not exist, I would like to do foo.
How can I do this?
Thanks in advance.
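
For reference, a minimal sketch of "check, then re-run Step 2 (the export to GCS)" with the Python client library; the project, dataset, table, and bucket names below are placeholders, not values from the question:

from google.cloud import bigquery
from google.cloud.exceptions import NotFound

client = bigquery.Client(project='my_project')
table_ref = client.dataset('my_dataset').table('xyz')

try:
    client.get_table(table_ref)  # raises NotFound if the destination table is missing
    extract_job = client.extract_table(table_ref, 'gs://my_bucket/xyz-*.json')
    extract_job.result()         # wait for the export to finish
except NotFound:
    pass                         # table does not exist: do foo instead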

Alex F's solution works on v0.27, but will not work on later versions. In order to migrate to v0.28+, the below solution will work.
from google.cloud import bigquery

project_nm = 'gc_project_nm'
dataset_nm = 'ds_nm'
table_nm = 'tbl_nm'

client = bigquery.Client(project_nm)
dataset = client.dataset(dataset_nm)
table_ref = dataset.table(table_nm)

def if_tbl_exists(client, table_ref):
    from google.cloud.exceptions import NotFound
    try:
        client.get_table(table_ref)
        return True
    except NotFound:
        return False

if_tbl_exists(client, table_ref)

Here is a python snippet that will tell whether a table exists (deleting it in the process--careful!):
def doesTableExist(project_id, dataset_id, table_id):
    bq.tables().delete(
        projectId=project_id,
        datasetId=dataset_id,
        tableId=table_id).execute()
    return False
Alternatively, if you'd prefer not to delete the table in the process, you could try:
def doesTableExist(project_id, dataset_id, table_id):
    try:
        bq.tables().get(
            projectId=project_id,
            datasetId=dataset_id,
            tableId=table_id).execute()
        return True
    except HttpError as err:
        if err.resp.status != 404:
            raise
        return False
If you want to know where bq came from, you can call build_bq_client from here: http://code.google.com/p/bigquery-e2e/source/browse/samples/ch12/auth.py
In general, if you're using this to test whether you should run a job that will modify the table, it can be a good idea to just do the job anyway, and use WRITE_TRUNCATE as a write disposition.
Another approach is to create a predictable job ID and retry the job with that ID. If the job already exists, the job already ran (you might want to double check to make sure the job didn't fail, however).
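A rough sketch of both ideas with the current google-cloud-bigquery client (the query, table name, and job ID below are placeholders):

from google.cloud import bigquery
from google.api_core.exceptions import Conflict

client = bigquery.Client()
job_config = bigquery.QueryJobConfig()
job_config.destination = client.dataset('ds_nm').table('xyz')
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE  # overwrite the table if it exists

try:
    job = client.query('SELECT ...', job_config=job_config, job_id='step1-2020-01-01')
    job.result()
except Conflict:
    # A job with this ID was already submitted; fetch it and check whether it succeeded.
    job = client.get_job('step1-2020-01-01')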

Enjoy:
def doesTableExist(bigquery, project_id, dataset_id, table_id):
    try:
        bigquery.tables().get(
            projectId=project_id,
            datasetId=dataset_id,
            tableId=table_id).execute()
        return True
    except Exception as err:
        if err.resp.status != 404:
            raise
        return False
Note the edit in the exception handling: this version catches Exception rather than HttpError.

You can now use exists() to check whether a dataset exists, and the same works for tables.
BigQuery exists() documentation
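A sketch, assuming the 0.2x-era google-cloud-bigquery client where Dataset and Table objects expose exists() (see the answer further down that cites the 0.24/0.27 docs):

from google.cloud import bigquery

client = bigquery.Client(project='gc_project_nm')
dataset = client.dataset('ds_nm')
print(dataset.exists())                  # True if the dataset exists
print(dataset.table('tbl_nm').exists())  # True if the table exists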

Recently BigQuery introduced so-called scripting statements, which can be quite a game changer for some flows.
Check them out here:
https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting
Now, for example, to check if a table exists you can use something like this:
sql = """
BEGIN
IF EXISTS(SELECT 1 from `YOUR_PROJECT.YOUR_DATASET.YOUR_TABLE) THEN
SELECT 'table_found';
END IF;
EXCEPTION WHEN ERROR THEN
# you can print your own message like above or return error message
# however google says not to rely on error message structure as it may change
select ##error.message;
END;
"""

With my_bigquery being an instance of class google.cloud.bigquery.Client (already authenticated and associated with a project):
my_bigquery.dataset(dataset_name).table(table_name).exists() # returns boolean
It does an API call to test for the existence of the table via a GET request
Source: https://googlecloudplatform.github.io/google-cloud-python/0.24.0/bigquery-table.html#google.cloud.bigquery.table.Table.exists
It works for me using 0.27 of the Google BigQuery Python module

Inline SQL Alternative
tarheel's answer is probably the most correct at this point in time,
but I was considering the comment from Ivan above that "404 could also mean the resource is not there for a bunch of reasons", so here is a solution that should always successfully run a metadata query and return a result.
It's not the fastest, because it always has to run a query, and BigQuery has overhead for small queries.
A trick I've seen previously is to query information_schema for a (table) object, and union that to a fake query that ensures a record is always returned even if the object doesn't exist. There's also a LIMIT 1 and an ordering to ensure the single record returned represents the table, if it does exist. See the SQL in the code below.
In spite of doc claims that BigQuery standard SQL is ISO compliant, it doesn't support information_schema, but it does have __TABLES_SUMMARY__.
dataset is required because you can't query __TABLES_SUMMARY__ without specifying the dataset.
dataset is not a parameter in the SQL because you can't parameterize object names without SQL injection issues (apart from with the magical _TABLE_SUFFIX, see https://cloud.google.com/bigquery/docs/querying-wildcard-tables ).
#!/usr/bin/env python
"""
Inline SQL way to check a table exists in BigQuery
e.g.
print(table_exists(dataset_name='<dataset_goes_here>', table_name='<real_table_name>'))
True
print(table_exists(dataset_name='<dataset_goes_here>', table_name='imaginary_table_name'))
False
"""
from __future__ import print_function
from google.cloud import bigquery

def table_exists(dataset_name, table_name):
    client = bigquery.Client()
    query = """
        SELECT table_exists FROM
        (
            SELECT true as table_exists, 1 as ordering
            FROM __TABLES_SUMMARY__ WHERE table_id = @table_name
            UNION ALL
            SELECT false as table_exists, 2 as ordering
        ) ORDER by ordering LIMIT 1"""
    query_params = [bigquery.ScalarQueryParameter('table_name', 'STRING', table_name)]
    job_config = bigquery.QueryJobConfig()
    job_config.query_parameters = query_params
    if dataset_name is not None:
        dataset_ref = client.dataset(dataset_name)
        job_config.default_dataset = dataset_ref
    query_job = client.query(
        query,
        job_config=job_config
    )
    results = query_job.result()
    for row in results:
        # There is only one row because of LIMIT 1 in the SQL
        return row.table_exists

Related

How to add Data sanity logic in airflow

I have implemented Airflow; can we add data sanity logic? Suppose I have Task1, which does the following:
1. Read the data from the data source (RAW DATA).
2. Join with a dimensional table to get related details such as product name.
3. Store the output file in some location after step 2.
There is a Task2 that stores the output file into a database, but before Task2 executes I need some data validation: the count of the RAW DATA should equal the count of the stored output file (i.e. after joining),
like count(raw_data) = count(raw_data_join_with_dimensional). If that is true then trigger Task2, else send an alert and fail the job.
For that use case a possible workflow could be:
check_op = SQLCheckOperator(
    task_id='check_task',
    sql='YOUR VALIDATION SQL',
    conn_id='YOUR CONN',
)
t2_op = YourNextOperator()
failure_op = EmailOperator(task_id='alert_task', subject='check has failed', to='YOUR EMAIL', trigger_rule='one_failed')
check_op >> [t2_op, failure_op]
It works as follows:
SQLCheckOperator runs the query against the DB. If the query returns a falsy value, the check has failed and the operator ends up in a failed state. If the query returns a value, the check is considered successful and the operator ends up in a success state.
The EmailOperator will be triggered if the SQLCheckOperator's status is failed; otherwise YourNextOperator will be triggered.
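For the count comparison described in the question, a sketch of what the validation SQL might look like (the table names are placeholders; SQLCheckOperator treats a falsy first-row value as a failure):

validation_sql = """
SELECT CASE
    WHEN (SELECT COUNT(*) FROM raw_data) =
         (SELECT COUNT(*) FROM raw_data_join_with_dimensional)
    THEN 1 ELSE 0
END
"""
check_op = SQLCheckOperator(
    task_id='check_task',
    sql=validation_sql,
    conn_id='YOUR CONN',
)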
Edit: Go see @Elad's answer, there's a much more specific operator for this task.
The airflow.sensors.sql_sensor.SqlSensor can be used to build a task that can check data quality:
from airflow.sensors import sql_sensor
...
check_data_task = sql_sensor.SqlSensor(
    task_id="check_data",
    conn_id="YourConnectionIdentifier",
    sql="SELECT CASE WHEN data_is_valid THEN 1 ELSE 0 END ...",
    timeout=0
)
The key is that your sql argument must return "at least one cell that contains a non-zero / empty string value", according to the documentation. The timeout=0 means that it will check once and fail if your data check query doesn't "pass".

Select Count very slow using EF with Oracle

I'm using EF 5 with Oracle database.
I'm doing a select count in a table with a specific parameter. When I'm using EF, the query returns the value 31, as expected, but the result takes about 10 seconds to be returned.
using (var serv = new Aperam.SIP.PXP.Negocio.Modelos.SIP_PA())
{
    var teste = (from ens in serv.PA_ENSAIOS_UM
                 where ens.COD_IDENT_UNMET == "FBLDY3840"
                 select ens).Count();
}
If I execute the simple query below, the result is the same (31), but it is returned in about 500 milliseconds.
SELECT
    count(*)
FROM
    PA_ENSAIOS_UM
WHERE
    COD_IDENT_UNMET = 'FBLDY3840'
Is there a way to improve the performance when I'm using EF?
Note: there are 13,000,000 rows in this table.
Here are some things you can try:
Capture the query that is being generated and see if it is the same as the one you are using. Details can be found here, but essentially, you will instantiate your DbContext (let's call it "_context") and then set the Database.Log property to be the logging method. It's fine if this method doesn't actually do anything--you can just set a breakpoint in there and see what's going on.
So, as an example: define a logging function (I have a static class called "Logging" which uses nLog to write to files)
public static void LogQuery(string queryData)
{
    if (string.IsNullOrWhiteSpace(queryData))
        return;
    var message = string.Format("{0}{1}",
        queryData.Trim().Contains(Environment.NewLine) ?
            Environment.NewLine : "", queryData);
    _sqlLogger.Info(message);
    _genLogger.Trace($"EntityFW query (len {message.Length} chars)");
}
Then, when you create your context, point Database.Log to LogQuery:
_context.Database.Log = Logging.LogQuery;
When you do your tests, remember that often the first run is the slowest because the server has to actually do the work, but on the subsequent runs, it often uses cached data. Try running your tests 2-3 times back to back and see if they don't start to run in the same time.
I don't know if it generates the same query or not, but try this other form (which should be functionally equivalent, but may provide better time)
var teste = serv.PA_ENSAIOS_UM.Count(ens=>ens.COD_IDENT_UNMET == "FBLDY3840");
I'm wondering if the version you have pulls data from the DB and THEN counts it. If so, this other syntax may leave all the work to be done at the server, where it belongs. Not sure, though, esp. since I haven't ever used EF with Oracle and I don't know if it behaves the same as SQL or not.

Check if data already exists before inserting into BigQuery table (using Python)

I am setting up a daily cron job that appends a row to BigQuery table (using Python), however, duplicate data is being inserted. I have searched online and I know that there is a way to manually remove duplicate data, but I wanted to see if I could avoid this duplication in the first place.
Is there a way to check a BigQuery table to see if a data record already exists first in order to avoid inserting duplicate data? Thanks.
CODE SNIPPET:
import webapp2
import logging
import uuid
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

PROJECT_ID = 'foo'
DATASET_ID = 'bar'
TABLE_ID = 'foo_bar_table'

class UpdateTableHandler(webapp2.RequestHandler):
    def get(self):
        credentials = GoogleCredentials.get_application_default()
        bigquery_service = discovery.build('bigquery', 'v2', credentials=credentials)
        try:
            the_fruits = Stuff.query(Stuff.fruitTotal >= 5).filter(Stuff.fruitColor == 'orange').fetch()
            for fruit in the_fruits:
                # some code here
                basket = dict()
                basket['id'] = fruit.fruitId
                basket['Total'] = fruit.fruitTotal
                basket['PrimaryVitamin'] = fruit.fruitVitamin
                basket['SafeRaw'] = fruit.fruitEdibleRaw
                basket['Color'] = fruit.fruitColor
                basket['Country'] = fruit.fruitCountry
                body = {
                    'rows': [
                        {
                            'json': basket,
                            'insertId': str(uuid.uuid4())
                        }
                    ]
                }
                response = bigquery_service.tabledata().insertAll(projectId=PROJECT_ID,
                                                                  datasetId=DATASET_ID,
                                                                  tableId=TABLE_ID,
                                                                  body=body).execute(num_retries=5)
                logging.info(response)
        except Exception as e:
            logging.error(e)

app = webapp2.WSGIApplication([
    ('/update_table', UpdateTableHandler),
], debug=True)
The only way to test whether the data already exists is to run a query.
If you have lots of data in the table, that query could be expensive, so in most cases we suggest you go ahead and insert the duplicate, and then merge duplicates later on.
As Zig Mandel suggests in a comment, you can query over a date partition if you know the date when you expect to see the record, but that may still be expensive compared to inserting and removing duplicates.
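As a sketch of the "merge duplicates later" approach (standard SQL; the project, dataset, and table names mirror the placeholders in the question, and the id column is assumed to identify a record), a query like this keeps one row per id:

dedup_sql = """
SELECT * EXCEPT(row_num)
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS row_num
    FROM `foo.bar.foo_bar_table`
)
WHERE row_num = 1
"""

Its result can then be written back over the table (for example with CREATE OR REPLACE TABLE ... AS, or as a query destination with WRITE_TRUNCATE) to drop the duplicates.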

Linq Contains issue: cannot formulate the equivalent of 'WHERE IN' query

In the table ReservationWorkerPeriods there are records of all workers that are planned to work on a given period on any possible machine.
The additional table WorkerOnMachineOnConstructionSite contains columns workerId, MachineId and ConstructionSiteId.
From the table ReservationWorkerPeriods I would like to retrieve just workers who work on selected machine.
In order to retrieve just relevant records from WorkerOnMachineOnConstructionSite table I have written the following code:
var relevantWorkerOnMachineOnConstructionSite = (from cswm in currentConstructionSiteSchedule.ContrustionSiteWorkerOnMachine
                                                 where cswm.MachineId == machineId
                                                 select cswm).ToList();
workerOnMachineOnConstructionSite = relevantWorkerOnMachineOnConstructionSite as List<ContrustionSiteWorkerOnMachine>;
These records are also used in the application, so I don't want to bypass the above code even if it is possible to directly retrieve just the workerPeriods for workers who work on the selected machine. Anyway, I haven't figured out how to retrieve the relevant workerPeriods once we know which userIDs are relevant.
I have tried the following code:
var userIDs = from w in workerOnMachineOnConstructionSite select new {w.WorkerId};
List<ReservationWorkerPeriods> workerPeriods = currentConstructionSiteSchedule.ReservationWorkerPeriods.ToList();
allocatedWorkers = workerPeriods.Where(wp => userIDs.Contains(wp.WorkerId));
but it seems to be incorrect and I don't know how to fix it. Does anyone know what the problem is and how to retrieve just the records whose userIDs are in the list?
Currently, you are constructing an anonymous object on the fly, with one property. You'll want to grab the id directly with (note the missing curly braces):
var userIDs = from w in workerOnMachineOnConstructionSite select w.WorkerId;
Also, in such cases, don't call ToList on it - the variable userIDs just contains the query, not the result. If you use that variable in a further query, the provider can translate it to a single SQL query.

What is the best way pre filter user access for sqlalchemy queries?

I have been looking at the sqlalchemy recipes on their wiki, but don't know which one is best to implement what I am trying to do.
Every row in my tables has a user_id associated with it. Right now, for every query, I first filter by the id of the user that's currently logged in, then by the criteria I am interested in. My concern is that the developers might forget to add this filter to a query (a huge security risk). Therefore, I would like to set a global filter based on the current user's admin rights to filter what the logged in user could see.
Appreciate your help. Thanks.
Below is a simplified, redefined query constructor to filter all model queries (including relations). You can pass it as the query_cls parameter to sessionmaker. The user ID parameter doesn't need to be global, since the session is constructed when it's already available.
from sqlalchemy.orm import Query

class HackedQuery(Query):

    def get(self, ident):
        # Use default implementation when there is no condition
        if not self._criterion:
            return Query.get(self, ident)
        # Copied from Query implementation with some changes.
        if hasattr(ident, '__composite_values__'):
            ident = ident.__composite_values__()
        mapper = self._only_mapper_zero(
            "get() can only be used against a single mapped class.")
        key = mapper.identity_key_from_primary_key(ident)
        if ident is None:
            if key is not None:
                ident = key[1]
        else:
            from sqlalchemy import util
            ident = util.to_list(ident)
        if ident is not None:
            columns = list(mapper.primary_key)
            if len(columns) != len(ident):
                raise TypeError("Number of values doesn't match number "
                                "of columns in primary key")
            params = {}
            for column, value in zip(columns, ident):
                params[column.key] = value
            return self.filter_by(**params).first()

def QueryPublic(entities, session=None):
    # It's not directly related to the problem, but is useful too.
    query = HackedQuery(entities, session).with_polymorphic('*')
    # Version for several entities needs thorough testing, so we
    # don't use it yet.
    assert len(entities) == 1, entities
    cls = _class_to_mapper(entities[0]).class_
    public_condition = getattr(cls, 'public_condition', None)
    if public_condition is not None:
        query = query.filter(public_condition)
    return query
It works for single model queries only, and there is a lot of work to make it suitable for other cases. I'd like to see an elaborated version since it's MUST HAVE functionality for most web applications. It uses a fixed condition stored in each model class, so you have to modify it to your needs.
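A sketch of how the pieces above might be wired together; the model, the condition, and the user id are hypothetical, and this targets the older SQLAlchemy API the recipe was written against:

from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Note(Base):
    # Hypothetical model; every row is owned by a user.
    __tablename__ = 'notes'
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, nullable=False)

# Fixed filter picked up by QueryPublic via getattr(cls, 'public_condition', None).
Note.public_condition = (Note.user_id == 42)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine, query_cls=QueryPublic)
session = Session()
notes = session.query(Note).all()  # rows are filtered by public_condition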
Here is a very naive implementation that assumes there is an attribute/property self.current_user storing the logged-in user.
class YourBaseRequestHandler(object):

    @property
    def current_user(self):
        """The current user logged in."""
        pass

    def query(self, session, entities):
        """Use this method instead of :method:`Session.query()
        <sqlalchemy.orm.session.Session.query>`.
        """
        return session.query(entities).filter_by(user_id=self.current_user.id)
I wrote an SQLAlchemy extension that I think does what you are describing: https://github.com/mwhite/multialchemy
It does this by proxying changes to the Query._from_obj and QueryContext._froms properties, which is where the tables to select from ultimately get set.
