Airflow retain the same database connection? - etl

I'm using Airflow for some ETL things and in some stages, I would like to use temporary tables (mostly to keep the code and data objects self-contained and to avoid to use a lot of metadata tables).
Using the Postgres connection in Airflow and the "PostgresOperator" the behaviour that I found was: For each execution of a PostgresOperator we have a new connection (or session, you name it) in the database. In other words: We lose all temporary objects of the previous component of the DAG.
To emulate a simple example, I use this code (do not run, just see the objects):
import os
from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator
default_args = {
'owner': 'airflow'
,'depends_on_past': False
,'start_date': datetime(2018, 6, 13)
,'retries': 3
,'retry_delay': timedelta(minutes=5)
}
dag = DAG(
'refresh_views'
, default_args=default_args)
# Create database workflow
drop_exist_temporary_view = "DROP TABLE IF EXISTS temporary_table_to_be_used;"
create_temporary_view = """
CREATE TEMPORARY TABLE temporary_table_to_be_used AS
SELECT relname AS views
,CASE WHEN relispopulated = 'true' THEN 1 ELSE 0 END AS relispopulated
,CAST(reltuples AS INT) AS reltuples
FROM pg_class
WHERE relname = 'some_view'
ORDER BY reltuples ASC;"""
use_temporary_view = """
DO $$
DECLARE
is_correct integer := (SELECT relispopulated FROM temporary_table_to_be_used WHERE views LIKE '%<<some_name>>%');
BEGIN
start_time := clock_timestamp();
IF is_materialized = 0 THEN
EXECUTE 'REFRESH MATERIALIZED VIEW ' || view_to_refresh || ' WITH DATA;';
ELSE
EXECUTE 'REFRESH MATERIALIZED VIEW CONCURRENTLY ' || view_to_refresh || ' WITH DATA;';
END IF;
END;
$$ LANGUAGE plpgsql;
"""
# Objects to be executed
drop_exist_temporary_view = PostgresOperator(
task_id='drop_exist_temporary_view',
sql=drop_exist_temporary_view,
postgres_conn_id='dwh_staging',
dag=dag)
create_temporary_view = PostgresOperator(
task_id='create_temporary_view',
sql=create_temporary_view,
postgres_conn_id='dwh_staging',
dag=dag)
use_temporary_view = PostgresOperator(
task_id='use_temporary_view',
sql=use_temporary_view,
postgres_conn_id='dwh_staging',
dag=dag)
# Data workflow
drop_exist_temporary_view >> create_temporary_view >> use_temporary_view
At the end of execution, I receive the following message:
[2018-06-14 15:26:44,807] {base_task_runner.py:95} INFO - Subtask: psycopg2.ProgrammingError: relation "temporary_table_to_be_used" does not exist
Someone knows if Airflow has some way to retain the same connection to the database? I think it can save a lot of work in creating/maintaining several objects in the database.

You can retain the connection to the database by building a custom Operator which leverages the PostgresHook to retain a connection to the db while you perform some set of sql operations.
You may find some examples in contrib on incubator-airflow or in Airflow-Plugins.
Another option is to persist this temporary data to XCOMs. This will give you the ability to keep the metadata used with the task in which it was created. This may help troubleshooting down the road.

Related

DB2 iSeries doesn't lock on select for update

I'm migrating a legacy application using DB2 iSeries on AS400 that has a specific behavior that I have to reproduce using .NET and DB2.Data.DB2.iSeries client for .NET.
What I'm describing works for me with DB2 non AS400 but in AS400 DB2 it worlks for the legacy application i'm replacing - but not with my application.
The behavior in the original application:
Begin Transaction
ExecuteReader () => Select col1 from table1 where col1 = 1 for update.
The row is now locked. anyone else who tries to run Select for update should fail.
Close the Reader opened in line 2.
The row is now unlocked. - anyone else who tried to run select for update should succeed.
Close transaction and live happily ever after.
In my .NET code I have two problems:
Step 2 - only checks if the row is already locked - but doesn't actually lock it. so another user can and does run select for update - WRONG BEHAVIOUR
Once that works - I need the lock to get unlocked when the reader is closed (step 4)
Here's my code:
var cb = new IBM.Data.DB2.iSeries.iDB2ConnectionStringBuilder();
cb.DataSource = "10.0.0.1";
cb.UserID = "User";
cb.Password = "Password";
using (var con = new IBM.Data.DB2.iSeries.iDB2Connection(cb.ToString()))
{
con.Open();
var t = con.BeginTransaction(System.Data.IsolationLevel.ReadUncommitted);
using (var c = con.CreateCommand())
{
c.Transaction = t;
c.CommandText = "select col1 from table1 where col1=1 FOR UPDATE";
using (var r = c.ExecuteReader())
{
while (r.Read()) {
MessageBox.Show(con.JobName + "The Row Should Be Locked");
}
}
MessageBox.Show(con.JobName + "The Row Should Be unlocked");
}
}
When you run this code twice - you'll see both processes reach the "This row should be locked" which is the problem I'm describing.
The desired result would be that the first process will reach the "This row should be locked" and that the second process will fail with resource busy error.
Then when the first process reaches the second message box - "the row should be unlocked" the second process( after running again ) will reach the "This row should be locked" message.
Any help would be greatly appreciated
The documentation says:
When the UPDATE clause is used, FETCH operations referencing the cursor acquire an exclusive row lock.
This implies a cursor is being used, and the lock occurs when the fetch statement is executed. I don't see a cursor, or a fetch in your code.
Now, whether .NET handles this as a cursor, I don't know, but the DB2 UDB documentation does not have this notation.
Isolation Level allows this behavior. Reading rows that are locked.
ReadUncommitted
A dirty read is possible, meaning that no shared locks are issued and no exclusive locks are honored.
After much investigations we created a work around in the form of a stored procedure that performs the lock for us.
The stored procedure looks like this:
CREATE PROCEDURE lib.Select_For_Update (IN SQL CHARACTER (5000) )
MODIFIES SQL DATA CONCURRENT ACCESS RESOLUTION WAIT FOR OUTCOME
DYNAMIC RESULT SETS 1 OLD SAVEPOINT LEVEL COMMIT ON RETURN
NO DISALLOW DEBUG MODE SET OPTION COMMIT = *CHG BEGIN
DECLARE X CURSOR WITH RETURN TO CLIENT FOR SS ;
PREPARE SS FROM SQL ;
OPEN X ;
END
Then we call it using:
var cb = new IBM.Data.DB2.iSeries.iDB2ConnectionStringBuilder();
cb.DataSource = "10.0.0.1";
cb.UserID = "User";
cb.Password = "Password";
using (var con = new IBM.Data.DB2.iSeries.iDB2Connection(cb.ToString()))
{
con.Open();
var t = con.BeginTransaction(System.Data.IsolationLevel.ReadUncommitted);
using (var c = con.CreateCommand())
{
c.Transaction = t;
c.CommandType = CommandType.StoredProcedure;
c.AddParameter("sql","select col1 from table1 where col1=1 FOR UPDATE");
c.CommandText = "lib.Select_For_Update"
using (var r = c.ExecuteReader())
{
while (r.Read()) {
MessageBox.Show(con.JobName + "The Row Should Be Locked");
}
}
MessageBox.Show(con.JobName + "The Row Should Be unlocked");
}
}
We don't like it - but it works.

SSIS For Loop - Wait for file to carry on

I am creating an SSIS package that imports 10 different CSV files, however the availability of each file varies - so i want to get the file check to loop until all files are there and then start to import process.
I check this through a T-SQL script
IF OBJECT_ID('tblFileCheck') IS NOT NULL
DROP TABLE tbFileCheck;
CREATE TABLE tblFileCheck (
id int IDENTITY(1,1)
,subdirectory nvarchar(512)
,depth int
,isfile bit);
INSERT tblFileCheck
EXEC xp_dirtree '\\reports\Reports\CSV', 10, 1
BEGIN
IF EXISTS (
SELECT COUNT(id)
FROM tblFileCheck
HAVING count(id) > 9
)
BEGIN
PRINT 'Success - Import Latest File'
END
ELSE
BEGIN
RAISERROR ('Looping back to start - insufficient files to run', 16, 1 );
END
END
However I cant get the loop to work, I have created a variable varWaitForData (int32, val = 0), created a SQL task editor with result set = single row and set the result set option to the parameter.
Set the InitExpression #varWaitForData = 0 AND EvalExpression #varWaitForData == 0.
But i keep getting the error [Execute SQL Task] Error: An error occurred while assigning a value to variable "varWaitForData": "Exception from HRESULT: 0xC0015005".
I would use a different approach:
set up a variable called fileCount as int
script task: Get file count
System.IO.DirectoryInfo dir = new System.IO.DirectoryInfo("\\reports\Reports\CSV");
Dts.Variables["User::fileCount"].Value = dir.GetFiles().Length;
Now set up a foreach file enumerator
and put a constraint on the path that fileCount > 9

Identify Redshift COPY failing records in Ruby

When performing a COPY command, a few informations are printed, like :
INFO: Load into table '<table>' completed, 22666 record(s) loaded successfully.
INFO: Load into table '<table>' completed, 1 record(s) could not be loaded. Check 'stl_load_errors' system table for details.
And I need to identify failing records.
Thus I need 2 things :
Determine when there are failing rows: now, it's only printed on screen and I don't know how to get the message in code.
Determine the failing rows.
One way to do that would be to access to the query identifier that is visible in the table stl_load_errors, but I have no clue how to access it by code.
(I currently use the pg gem to connect to redshift)
stl_load_errors is a table in Redshift that (as you may have guessed already) includes all the errors that happen when loading into Redshift. So you can query it by doing something like:
SELECT * FROM stl_load_errors
Now, to answer your questions use the following snippet:
database = PG.connect(redshift)
begin
query = "COPY %s (%s) FROM 's3://%s/%s' CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s' CSV GZIP" %
[ table, columns, s3_bucket, s3_key, access_key_id, secret_access_key ]
database.exec(query)
puts 'File succesfully imported'
rescue PG::InternalError
res = database.exec("SELECT line_number, colname, err_reason FROM pg_catalog.stl_load_errors WHERE filename = 's3://#{s3_bucket}/#{s3_key}'")
res.each do |row|
puts "Importing failed:\n> Line %s\n> Column: %s\n> Reason: %s" % row.values_at('line_number', 'colname', 'err_reason')
end
end
That should output all the information you need, recall variables like redshift, table, columns, s3_bucket, s3_key, access_key_id, and secret_access_key depend on your configuration.
UPDATE:
To answer your comment below, more specifically, you could use a query like this:
"SELECT lines_scanned FROM pg_catalog.stl_load_commits WHERE filename = 's3://#{s3_bucket}/#{s3_key}' AND errors = -1"

Script (or my logic in code) is causing deadlocks

Below you will see the stored procedure that causes locks in Oracle 11. I am unfortunately in a corporate environment that doesn't allow me access to trace logs, or basically anything. So I've read tons of recommendations for how to do that, and cry that I can't do it. I'm not sure you can even determine from below if this would be the culprit. Anyway, thanks for everyone's time.
MERGE INTO my_owner.my_table tgtT
USING (SELECT diTT, sisCd FROM my_owner.NGR_DMNSN_TYP
where DMNSN_NM_TXT = 'myname'
and sisCd = 'RefDB') srcT
ON ( TGTT.diTT = srcT.diTT and TGTT.ID_TXT = 50000 )
WHEN MATCHED
THEN UPDATE SET TGTT.DSCRPTN_TXT = 'dfdf', TGTT.NM_TXT = 'bldfdfah'
WHEN NOT MATCHED
THEN
INSERT (TGTT.DSCRPTN_TXT, TGTT.NM_TXT, TGTT.diTT, TGTT.ID_NBR, TGTT.ID_TXT)
VALUES ('ee', 'ee', srcT.diTT, 50000, '50000' );

PG::ERROR: another command is already in progress

I have an importer which takes a list of emails and saves them into a postgres database. Here is a snippet of code within a tableless importer class:
query_temporary_table = "CREATE TEMPORARY TABLE subscriber_imports (email CHARACTER VARYING(255)) ON COMMIT DROP;"
query_copy = "COPY subscriber_imports(email) FROM STDIN WITH CSV;"
query_delete = "DELETE FROM subscriber_imports WHERE email IN (SELECT email FROM subscribers WHERE suppressed_at IS NOT NULL OR list_id = #{list.id}) RETURNING email;"
query_insert = "INSERT INTO subscribers(email, list_id, created_at, updated_at) SELECT email, #{list.id}, NOW(), NOW() FROM subscriber_imports RETURNING id;"
conn = ActiveRecord::Base.connection_pool.checkout
conn.transaction do
raw = conn.raw_connection
raw.exec(query_temporary_table)
raw.exec(query_copy)
CSV.read(csv.path, headers: true).each do |row|
raw.put_copy_data row['email']+"\n" unless row.nil?
end
raw.put_copy_end
while res = raw.get_result do; end # very important to do this after a copy
result_delete = raw.exec(query_delete)
result_insert = raw.exec(query_insert)
ActiveRecord::Base.connection_pool.checkin(conn)
{
deleted: result_delete.count,
inserted: result_insert.count,
updated: 0
}
end
The issue I am having is that when I try to upload I get an exception:
PG::ERROR: another command is already in progress: ROLLBACK
This is all done in one action, the only other queries I am making are user validation and I have a DB mutex preventing overlapping imports. This query worked fine up until my latest push which included updating my pg gem to 0.14.1 from 0.13.2 (along with other "unrelated" code).
The error initially started on our staging server, but I was then able to reproduce it locally and am out of ideas.
If I need to be more clear with my question, let me know.
Thanks
Found my own answer, and this might be useful if anyone finds the same issue when importing loads of data using "COPY"
An exception is being thrown within the CSV.read() block, and I do catch it, but I was not ending the process correctly.
begin
CSV.read(csv.path, headers: true).each do |row|
raw.put_copy_data row['email']+"\n" unless row.nil?
end
ensure
raw.put_copy_end
while res = raw.get_result do; end # very important to do this after a copy
end
This block ensures that the COPY command is completed. I also added this at the end to release the connection back into the pool, without disrupting the flow in the case of a successful import:
rescue
ActiveRecord::Base.connection_pool.checkin(conn)

Resources