How to add data sanity logic in Airflow - validation

I have implemented Airflow. Can we add data sanity logic? Suppose I have Task1, which does the following:
1. Read the data from the data source -- RAW DATA.
2. Join with a dimensional table to get some relation details, e.g. product name.
3. Store the output file at some location after step 2.
There is a Task2 that stores that output file into a database, but before Task2 executes I need some data validation: the count of RAW DATA should equal the count of the stored output file after joining,
i.e. count(raw_data) = count(raw_data_join_with_dimensional). If that is true, trigger Task2; otherwise send an alert and fail the job.

For that use case a possible workflow could be:
from airflow.operators.sql import SQLCheckOperator  # import paths for Airflow 1.10-style installs; they differ in Airflow 2
from airflow.operators.email_operator import EmailOperator

check_op = SQLCheckOperator(
    task_id='check_task',
    sql='YOUR VALIDATION SQL',
    conn_id='YOUR CONN',
)
t2_op = YourNextOperator()
failure_op = EmailOperator(task_id='alert_task', subject='check has failed', to='YOUR EMAIL',
                           html_content='YOUR MESSAGE', trigger_rule='one_failed')
check_op >> [t2_op, failure_op]
It works as follows:
SQLCheckOperator runs the query against the DB. If the query returns a false/falsy value, the check has failed and the operator ends up in the failed state. If the query returns a truthy value, the check is considered successful and the operator ends up in the success state.
EmailOperator will be triggered (via trigger_rule='one_failed') if the SQLCheckOperator fails; otherwise YourNextOperator will be triggered.

Edit: Go see @Elad's answer, there's a much more specific operator for this task.
The airflow.sensors.sql_sensor.SqlSensor can be used to build a task that can check data quality:
from airflow.sensors import sql_sensor
...
check_data_task = sql_sensor.SqlSensor(
    task_id="check_data",
    conn_id="YourConnectionIdentifier",
    sql="SELECT CASE WHEN data_is_valid THEN 1 ELSE 0 END ...",
    timeout=0
)
The key is that your sql argument must return "at least one cell that contains a non-zero / empty string value", according to the documentation. timeout=0 means that it will check once and fail if your data check query doesn't "pass".
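Putting the two answers together, a minimal sketch of how the whole flow could be wired up is below. Everything here is illustrative: the callables, the connection id your_warehouse_conn, the table names and the validation SQL are assumptions on my side, and the import paths are Airflow 1.10-style (they differ slightly in Airflow 2):
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.sensors.sql_sensor import SqlSensor


def extract_and_join(**context):
    """Task1: read RAW DATA, join with the dimensional table, write the output file."""
    pass


def load_to_database(**context):
    """Task2: store the validated output file into the database."""
    pass


with DAG(dag_id="data_sanity_example",
         start_date=datetime(2021, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:

    task1 = PythonOperator(task_id="task1_extract_and_join",
                           python_callable=extract_and_join,
                           provide_context=True)

    # Succeeds only when the raw count equals the joined count;
    # timeout=0 turns the sensor into a single check instead of a poll loop.
    check_counts = SqlSensor(
        task_id="check_counts",
        conn_id="your_warehouse_conn",
        sql="""
            SELECT CASE
                     WHEN (SELECT COUNT(*) FROM raw_data) =
                          (SELECT COUNT(*) FROM raw_data_join_with_dimensional)
                     THEN 1 ELSE 0
                   END
        """,
        timeout=0,
    )

    task2 = PythonOperator(task_id="task2_load_to_db",
                           python_callable=load_to_database,
                           provide_context=True)

    task1 >> check_counts >> task2
Alerting on a failed check can then be handled either with the EmailOperator pattern from the first answer or with an on_failure_callback on the sensor.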

Related

Web.Content calling API service and merging pages with List.Transform started to fail

I created a Power BI report which connects to a data source via an API service. The returned JSON contains thousands of entities. The API service is called via the Web.Contents function. The API service always returns the total record count, so we are able to calculate the number of pages that have to be called to obtain the whole dataset. This report displays data from our servicedesk app, which is deployed on many servers and for many customers, and uses query parameters to connect to any of these servers.
Details of the Power Query are below.
Why am I writing here? This report was working without any issue for more than 1.5 years, but on August 17th one of the servers started causing errors in the step Pages, where some random lines (pages) show errors - see the attached picture labeled "Errors in step Pages". This is the reason the next step, Entities (List.Union), stops the refresh and generates errors with the message:
Expression.Error: We cannot apply field access to the type List. Details: Value=[List] Key=requests
What is notable:
The API service is returning records in the same order, but the faulty lists are random when calling with the same parameters.
Sometimes the refresh completes without any error.
The same Power Query called on another server works correctly; the problem is only with one specific server.
This problem started without notice on the most important server after 1.5 years without any problem.
Here is the full text of the Power Query for this main source, which is used later in other queries to extract all the necessary data. The JSON is really complicated and I extract from it a list of requests, a list of solvers, a list of solver groups, etc.; this base query and its output are the input for many referenced queries.
(picture: Errors in step Pages)
let
    BaseAPIUrl = apiurl & "apiservice?", /*apiurl is a parameter - name of the server, e.g. https://xxxx.xxxxxx.sk/ */
    EntitiesPerPage = RecordsPerPage, /*RecordsPerPage is a parameter and defines the nr. of records per page - we used 200-400 records per page as the optimum, but it also works with 4000 records per page*/
    ApiToken = FnApiToken(), /*this function returns the apitoken value, which is the return value of another api service apiurl&"api/auth/login" that uses username and password in the body of the call to get the apitoken*/
    GetJson = (QParm) => /*definition of a general function to get data from the data source*/
        let
            Options =
                [
                    Query = QParm,
                    Headers =
                        [
                            Accept = "application/json",
                            ApiKeyName = "apitoken",
                            Authorization = ApiToken
                        ]
                ],
            RawData = Web.Contents(BaseAPIUrl, Options),
            Json = Json.Document(RawData)
        in
            Json,
    GetEntityCount = () => /*function called once to get the nr. of records using GetJson; the count is returned as part of each call*/
        let
            QParm = [pp = "1", pg = "1"],
            Json = GetJson(QParm),
            Count = Json[totalRecord]
        in
            Count,
    GetPage = (Index) => /*function called repeatedly to get each page of json using GetJson*/
        let
            PageNr = Text.From(Index + 1),
            PerPage = Text.From(EntitiesPerPage),
            QParm = [pg = PageNr, pp = PerPage],
            Json = GetJson(QParm),
            Value = Json[data][requests]
        in
            Value,
    EntityCount = List.Max({ EntitiesPerPage, GetEntityCount() }), /*store the nr. of records in a variable*/
    PageCount = Number.RoundUp(EntityCount / EntitiesPerPage), /*nr. of pages*/
    PageIndices = { 0 .. PageCount - 1 },
    Pages = List.Transform(PageIndices, each GetPage(_) /*Function.InvokeAfter(()=>GetPage(_),#duration(0,0,0,1))*/), /*here we call GetJson for each page to get the whole dataset - the commented-out variant tested a delay between GetPage calls but it was not necessary*/
    Entities = List.Union(Pages),
    Table = Table.FromList(Entities, Splitter.SplitByNothing(), null, null, ExtraValues.Error)
in
    Table
I also tried another way of appending pages to the list using List.Generate. This also brings random errors into the list, but
in contrast with the original List.Transform approach it allows transforming the result to a table; however, the other referenced queries still fail and contain errors on the last row.
When I explore the content of a faulty page/list by extracting it via Add as New Query, all the records are always there without any failure.....
    Source = List.Generate( /*another way to generate the list of all pages*/
        () => [Page = 0, ReqPageData = GetPage(0)],
        each [Page] < PageCount,
        each [ReqPageData = GetPage([Page]),
              Page = [Page] + 1],
        each [ReqPageData]
    ),
    #"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error), /*here I am able to generate a table from the list, in contrast to when List.Transform is used*/
    #"Expanded Column1" = Table.ExpandListColumn(#"Converted to Table", "Column1"), /*here I can expand the list to a column*/
    #"Removed Errors" = Table.RemoveRowsWithErrors(#"Expanded Column1", {"Column1"}) /*here I try to exclude errors, but I don't know what happened and which records (if any) are excluded*/
(picture: Extracting errored page)
And finally, I am totally clueless and not able to find the cause of this behavior on this specific server. I tested calling the errored pages via POSTMAN, and I discussed this issue with the author of the API service; he also tried to call this API service with all parameters, and the server returns every page OK. Only Power Query is not able to List.Transform...
I will be grateful for and appreciate any tips or advice, or if somebody has solved the same issue in the past....
Kuby
No, each errored line of the list in the List.Transform step could be extracted as a new query, and there all records from that one page are OK. hmmmm
Finally, the problem described in this issue was caused by "corrupted" content of the returned JSON. The provider of the core system informed me that they found a bug, and after fixing it on the servicedesk side everything is OK again. I tried to find the problem in Power Query, and the problem was in the servicedesk. :(

Laravel count raw query always returns zero

I am working with multiple databases in a single project. When I run a simple raw query to get the count of a table in another database, it always returns zero instead of the actual count, even though I have more than 1 million records in the table.
I ran the raw query in the following formats, but the result is zero:
$dbconn = \DB::connection("archive_db");
$dbconn->table('activities_archived')->count();

$sql = "SELECT COUNT(*) as total FROM activities_archived";
$result = \DB::connection("archive_db")->select(\DB::raw($sql));
Even though I have set the database connection's strict option to false, I am still facing the same issue.
Now I am totally stuck as to why this issue is occurring.
// Point the model at the other database connection, then count.
$someModel->setConnection('mysql2');
$something = $someModel->count();
return $something;

LOAD DATA FROM S3 falsely returns error in codeigniter query method

I'm hosting a CodeIgniter app on AWS, and one method in that app runs a
"LOAD DATA FROM S3 'S3 path' ... "
query. This is essentially the same as a LOAD DATA INFILE query, but customized in AWS Aurora to read files from S3 instead of your local volume. The query executes as expected, but CI's
$this->db->error()
returns an array indicating that an error occurred. The content of that array is
[0, '']
At first I thought it was a timeout, but after reducing the size of the file to import and making sure the records were imported, I started to suspect that CI's DB driver isn't designed to handle the result of that query.
The query doesn't really return any data, and I guess that confuses CI.
Is there any good way to bypass this behavior in CI without altering the framework's source code?
Thanks in advance!
I realized that the DB driver class (MySQLi) returns the following to the $this->db instance in the model (see the errno and error attributes):
[_mysqli:protected] => mysqli Object
    (
        [affected_rows] => 1
        [client_info] => mysqlnd 5.0.12-dev - 20150407 - $Id: b382534eeb34d9ed79345235b8bae2234b287afcs21ad4e $
        [client_version] => 50012
        [connect_errno] => 0
        [connect_error] =>
        [errno] => 0
        [error] =>
        [error_list] => Array
            (
            )
So I will never get any more information about the error than what mysql_errno and mysql_error return to the driver. Instead of error-checking the query like this:
if ($this->db->error())
{
    return false;
}
I can simply do this:
if (is_array($this->db->error()) && 0 !== $this->db->error()[0])
{
    return false;
}
Errors are returned in array format (or false) from $this->db->error(), and thus it is very simple to check whether the error code is zero. If it is zero, we can assume that the query executed as planned and that the import has completed.
The reason we get an array instead of false from $this->db->error() is that no result set was returned from the query, which CI expects.
So... a simple solution after all.

BigQuery - Check if table already exists

I have a dataset in BigQuery. This dataset contains multiple tables.
I am doing the following steps programmatically using the BigQuery API:
1. Querying the tables in the dataset - since my response is too large, I am enabling the allowLargeResults parameter and diverting my response to a destination table.
2. Exporting the data from the destination table to a GCS bucket.
Requirements:
Suppose my process fails at Step 2, I would like to re-run this step.
But before I re-run, I would like to check/verify that the specific destination table named 'xyz' already exists in the dataset.
If it exists, I would like to re-run step 2.
If it does not exist, I would like to do foo.
How can I do this?
Thanks in advance.
Alex F's solution works on v0.27, but will not work on later versions. In order to migrate to v0.28+, the below solution will work.
from google.cloud import bigquery

project_nm = 'gc_project_nm'
dataset_nm = 'ds_nm'
table_nm = 'tbl_nm'

client = bigquery.Client(project_nm)
dataset = client.dataset(dataset_nm)
table_ref = dataset.table(table_nm)

def if_tbl_exists(client, table_ref):
    from google.cloud.exceptions import NotFound
    try:
        client.get_table(table_ref)
        return True
    except NotFound:
        return False

if_tbl_exists(client, table_ref)
Here is a python snippet that will tell whether a table exists (deleting it in the process--careful!):
def doesTableExist(project_id, dataset_id, table_id):
    bq.tables().delete(
        projectId=project_id,
        datasetId=dataset_id,
        tableId=table_id).execute()
    return False
Alternately, if you'd prefer not deleting the table in the process, you could try:
def doesTableExist(project_id, dataset_id, table_id):
    try:
        bq.tables().get(
            projectId=project_id,
            datasetId=dataset_id,
            tableId=table_id).execute()
        return True
    except HttpError as err:
        if err.resp.status != 404:
            raise
        return False
If you want to know where bq came from, you can call build_bq_client from here: http://code.google.com/p/bigquery-e2e/source/browse/samples/ch12/auth.py
In general, if you're using this to test whether you should run a job that will modify the table, it can be a good idea to just do the job anyway, and use WRITE_TRUNCATE as a write disposition.
Another approach can be to create a predictable job id, and retry the job with that id. If the job already exists, the job already ran (you might want to double check to make sure the job didn't fail, however).
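For what it's worth, a rough sketch of that predictable-job-id idea with the google-cloud-bigquery client (not from the answers above) could look like this; the job id scheme and the query are made-up placeholders, and the point is only that reusing a job_id raises Conflict when the job already exists:
from google.api_core.exceptions import Conflict
from google.cloud import bigquery

client = bigquery.Client()

# Deterministic id, e.g. derived from the run date (illustrative scheme).
job_id = "prepare_destination_table_20200101"

try:
    # Only starts a new job if no job with this id exists yet.
    job = client.query("SELECT ...", job_id=job_id)
    job.result()
except Conflict:
    # A job with this id was already submitted, so it already ran;
    # fetch it and double check that it did not fail.
    job = client.get_job(job_id)
    if job.error_result:
        raise RuntimeError(job.error_result)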
Enjoy:
def doesTableExist(bigquery, project_id, dataset_id, table_id):
    try:
        bigquery.tables().get(
            projectId=project_id,
            datasetId=dataset_id,
            tableId=table_id).execute()
        return True
    except Exception as err:
        if err.resp.status != 404:
            raise
        return False
There is an edit in the exception handling compared to the snippet above (it catches Exception rather than HttpError).
You can now use exists() to check whether a dataset exists, and the same for a table:
BigQuery exists documentation
Recently BigQuery introduced so-called scripting statements, which can be quite a game changer for some flows.
Check them out here:
https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting
Now, for example, to check whether a table exists you can use something like this:
sql = """
BEGIN
  IF EXISTS(SELECT 1 FROM `YOUR_PROJECT.YOUR_DATASET.YOUR_TABLE`) THEN
    SELECT 'table_found';
  END IF;
EXCEPTION WHEN ERROR THEN
  # you can print your own message like above or return the error message
  # however google says not to rely on the error message structure as it may change
  SELECT @@error.message;
END;
"""
With my_bigquery being an instance of class google.cloud.bigquery.Client (already authenticated and associated with a project):
my_bigquery.dataset(dataset_name).table(table_name).exists()  # returns boolean
It does an API call to test for the existence of the table via a GET request.
Source: https://googlecloudplatform.github.io/google-cloud-python/0.24.0/bigquery-table.html#google.cloud.bigquery.table.Table.exists
It works for me using 0.27 of the Google BigQuery Python module.
Inline SQL Alternative
tarheel's answer is probably the most correct at this point in time,
but I was considering the comment from Ivan above that "404 could also mean the resource is not there for a bunch of reasons", so here is a solution that should always successfully run a metadata query and return a result.
It's not the fastest, because it always has to run the query, and BigQuery has overhead for small queries.
A trick I've seen previously is to query information_schema for a (table) object, and union that with a fake query that ensures a record is always returned even if the object doesn't exist. There's also a LIMIT 1 and an ordering to ensure the single record returned represents the table, if it does exist. See the SQL in the code below.
In spite of doc claims that BigQuery standard SQL is ISO compliant, it doesn't support information_schema, but it does have __TABLES_SUMMARY__.
A dataset is required because you can't query __TABLES_SUMMARY__ without specifying one.
The dataset is not a parameter in the SQL because you can't parameterize object names without SQL injection issues (apart from with the magical _TABLE_SUFFIX; see https://cloud.google.com/bigquery/docs/querying-wildcard-tables ).
#!/usr/bin/env python
"""
Inline SQL way to check a table exists in Bigquery
e.g.
print(table_exists(dataset_name='<dataset_goes_here>', table_name='<real_table_name>'))
True
print(table_exists(dataset_name='<dataset_goes_here>', table_name='imaginary_table_name'))
False
"""
from __future__ import print_function
from google.cloud import bigquery


def table_exists(dataset_name, table_name):
    client = bigquery.Client()
    query = """
        SELECT table_exists FROM
        (
            SELECT true as table_exists, 1 as ordering
            FROM __TABLES_SUMMARY__ WHERE table_id = @table_name
            UNION ALL
            SELECT false as table_exists, 2 as ordering
        ) ORDER BY ordering LIMIT 1"""
    query_params = [bigquery.ScalarQueryParameter('table_name', 'STRING', table_name)]
    job_config = bigquery.QueryJobConfig()
    job_config.query_parameters = query_params
    if dataset_name is not None:
        dataset_ref = client.dataset(dataset_name)
        job_config.default_dataset = dataset_ref
    query_job = client.query(
        query,
        job_config=job_config
    )
    results = query_job.result()
    for row in results:
        # There is only one row because of LIMIT 1 in the SQL
        return row.table_exists

Spring jdbcTemplate executing query

I have a strange problem.
My query looks like below:
String tokenQuery = "select id from table "
        + "where current_timestamp between "
        + "creation_time and (creation_time + interval '10' minute) "
        + "and token = '" + Token + "'";
But when I run jdbcTemplate.queryForLong(tokenQuery), no matter what, it always throws EmptyDataAccessException.
I am executing this against Oracle.
Can we not append dynamic values to a string and then pass it as a query and execute it?
What could be the issue?
I assume that what you get is in fact an EmptyResultDataAccessException. The javadoc of this exception says:
Data access exception thrown when a result was expected to have at least one row (or element) but zero rows (or elements) were actually returned.
That simply means that the query is executed fine, and is supposed to return one row, but doesn't return any. So no row satisfies the criteria of your query.
If that is expected, then catch the exception, or use a method that returns a list rather than returning a single value. That way, you can test if the returned list is empty.
That said, you should use a parameterized query instead of concatenating the token like you're doing. This would prevent SQL injection attacks. It would also work even if the token contains a quote, for example.
