Apache Airflow with conditional statements - bash

I am new to Airflow. I want to perform an operation like the one below using Airflow operators.
Briefly, I want to read some data from a database table and, depending on the values of a column in that table, run different tasks.
This is the table I use to get the data.
+-----------+--------+
| task_name | status |
+-----------+--------+
| a         | 1      |
| b         | 2      |
| c         | 4      |
| d         | 3      |
| e         | 4      |
+-----------+--------+
From the above table I want to select the rows where status=4 and, based on their task_name, run the relevant JAR file (for running JAR files I am planning to use the BashOperator). I want to execute this task using Airflow. Note that I am using PostgreSQL.
This is the code which I have implemented so far.
from airflow.models import DAG
from airflow.operators.postgres_operator import PostgresOperator
from datetime import datetime, timedelta
from airflow import settings

# set the default attributes
default_args = {
    'owner': 'Airflow',
    'start_date': datetime(2020, 10, 4)
}

status_four_dag = DAG(
    dag_id='status_check',
    default_args=default_args,
    schedule_interval=timedelta(seconds=5)
)

test = PostgresOperator(
    task_id='check_status',
    sql='''select * from table1 where status=4;''',
    postgres_conn_id='test',
    database='status',
    dag=status_four_dag,
)
I am stuck at the point where I need to check the task_name and call the relevant BashOperators.
Your support is appreciated. Thank you.

XComs are used for communicating messages between tasks. Send the JAR filename and the other arguments needed to form the command to XCom and consume them in the subsequent task.
For example,
check_status >> handle_status
check_status - checks the status from the DB and writes the JAR filename and arguments to XCom
handle_status - pulls the JAR filename and arguments from XCom, forms the command and executes it
Sample code:
from random import randint

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator


def check_status(**kwargs):
    # Push the JAR filename for the downstream task to pick up
    if randint(1, 100) % 2 == 0:
        kwargs["ti"].xcom_push("jar_filename", "even.jar")
    else:
        kwargs["ti"].xcom_push("jar_filename", "odd.jar")


# default_args as defined in the question
with DAG(dag_id='new_example', default_args=default_args) as dag:

    t0 = PythonOperator(
        task_id="check_status",
        provide_context=True,
        python_callable=check_status
    )

    t1 = BashOperator(
        task_id="handle_status",
        bash_command="""
        jar_filename={{ ti.xcom_pull(task_ids='check_status', key='jar_filename') }}
        echo "java -jar ${jar_filename}"
        """
    )

    t0 >> t1
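Applied to the question's table, a minimal sketch of the same pattern might look like the one below. It reuses the connection id, database, and table names from the question's code and would replace the PostgresOperator task there, since the lookup now happens inside the Python callable. The convention that each task_name maps to a JAR named <task_name>.jar, and packing all matching filenames into one space-separated XCom value, are purely assumptions for illustration:

from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator


def find_status_four(**kwargs):
    # Connection id, database and table names taken from the question's code
    hook = PostgresHook(postgres_conn_id='test', schema='status')
    rows = hook.get_records("select task_name from table1 where status = 4;")
    # Assumed convention: each task_name maps to a JAR named <task_name>.jar
    jar_filenames = " ".join("{}.jar".format(row[0]) for row in rows)
    kwargs["ti"].xcom_push("jar_filenames", jar_filenames)


check = PythonOperator(
    task_id="check_status",
    provide_context=True,
    python_callable=find_status_four,
    dag=status_four_dag,
)

handle = BashOperator(
    task_id="handle_status",
    bash_command="""
    for jar in {{ ti.xcom_pull(task_ids='check_status', key='jar_filenames') }}; do
        echo "java -jar ${jar}"
    done
    """,
    dag=status_four_dag,
)

check >> handle

Replace the echo with the actual java -jar invocation once the rendered command looks right; if different JARs need different arguments, push a richer structure (for example a JSON string) instead of a plain filename list.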

Related

I cannot get my honeycomb panel in Sumo Logic to display the correct colors

I have the following search:
_sourceCategory="/api/SQLWatch" AND "'checktype':'AGDetails'"
| parse "'synchronization_pair':'*'" as synchronization_pair
| parse "'Database_Name':'*'" as Database_Name
| parse "'synchronization_health_desc':'*'" as synchronization_health_desc
| parse "'AG_Name':'*'" as AG_Name
| parse "'DAG':'*'" as DAG
| IF (synchronization_health_desc = "HEALTHY", 1, 0) as isHealthy
| first(_messagetime) group by synchronization_pair, Database_Name, synchronization_health_desc, AG_Name, DAG, isHealthy
which produces results like this:
synchronization_pair   Database_Name   synchronization_health_desc   AG_Name   DAG   isHealthy
ServerA-ServerB        DB1             HEALTHY                       AG1       No    1
ServerC-ServerD        DB2             UNHEALTHY                     AG2       Yes   0
When I add a honeycomb panel to my dashboard with this search, all the honeycombs are blue.
I added the following settings to my Visual Settings and they are still blue:
1 to 1 GREEN
0 to 0 RED
Please help.
Thanks.
Charles.
I figured it out!
The column inside first() is the value used for the visual settings.
I changed the last line to the following and it works now!
| first(isHealthy) group by synchronization_pair, Database_Name, synchronization_health_desc, AG_Name, DAG //, isHealthy

How to update terminal values in real time

I have the following ASCII table:
+----------+----------+
|       Report        |
+----------+----------+
| Store    | Total    |
+----------+----------+
| A        | 2723     |
| B        | 7277     |
+----------+----------+
I need to update the total while there are updates running on my database.
How can I do that?
I already have the method that gets the updated total.
But how can I keep the total displayed and refreshed on the terminal screen?
You can achieve this using the following gems:
https://github.com/ruby/curses
https://github.com/tj/terminal-table
Example:
require 'terminal-table'
require "curses"

Curses.init_screen
Curses.crmode
Curses.noecho
Curses.stdscr.keypad = true

begin
  x = 0
  y = 0
  loop do
    # Rebuild the table with fresh values on every iteration
    table = Terminal::Table.new do |t|
      t << ['Random 1', Random.rand(1...10)]
      t.add_row ['Random 1', Random.rand(10...100)]
    end
    # Redraw the table at the top-left corner and refresh the screen
    Curses.setpos(x, y)
    output = table.render.to_s
    Curses.addstr(output)
    Curses.refresh
    sleep 1
  end
ensure
  Curses.close_screen
end

Insert collect_set values into Elasticsearch with PIG

I have a Hive table which contains 3 columns - "id" (string), "booklist" (array of string), and "date" (string) - with the following data:
----------------------------------------------------
id | booklist             | date
----------------------------------------------------
1  | ["Book1" , "Book2"]  | 2017-11-27T01:00:00.000Z
2  | ["Book3" , "Book4"]  | 2017-11-27T01:00:00.000Z
When trying to insert into Elasticsearch with this PIG script
-------------------------Script begins------------------------------------------------
SET hive.metastore.uris 'thrift://node:9000';
REGISTER hdfs://node:9001/library/elasticsearch-hadoop-5.0.0.jar;

DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();
DEFINE EsStore org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes = elasticsearch.service.consul',
    'es.port = 9200',
    'es.write.operation = upsert',
    'es.mapping.id = id',
    'es.mapping.pig.tuple.use.field.names=true'
);

hivetable = LOAD 'default.reading' USING HCatLoader();

hivetable_flat = FOREACH hivetable
    GENERATE
        id AS id,
        booklist AS bookList,
        date AS date;

STORE hivetable_flat INTO 'readings/reading' USING EsStore();
-------------------------Script Ends------------------------------------------------
When running the above, I got an error saying:
ERROR 2999:Unexpected internal error. Found unrecoverable error [ip:port] returned Bad Request(400) - failed to parse [bookList]; Bailing out..
Can anyone shed any light on how to get an array of string into ES and make the above work?
Thank you!

Reshape data in pig - change row values to column names

Is there a way to reshape the data in pig?
The data looks like this -
id | p1          | count
1  | "Accessory" | 3
1  | "clothing"  | 2
2  | "Books"     | 1
I want to reshape the data so that the output would look like this:
id | Accessory | clothing | Books
1  | 3         | 2        | 0
2  | 0         | 0        | 1
Can anyone please suggest a way around this?
If it's a fixed set of product lines, the code below might help; otherwise you can go for a custom UDF to achieve the objective.
Input : a.csv
1|Accessory|3
1|Clothing|2
2|Books|1
Pig Snippet :
test = LOAD 'a.csv' USING PigStorage('|') AS (product_id:long, product_name:chararray, rec_cnt:long);

req_stats = FOREACH (GROUP test BY product_id) {
    accessory = FILTER test BY product_name == 'Accessory';
    clothing  = FILTER test BY product_name == 'Clothing';
    books     = FILTER test BY product_name == 'Books';
    GENERATE group AS product_id,
        (IsEmpty(accessory) ? '0' : BagToString(accessory.rec_cnt)) AS a_cnt,
        (IsEmpty(clothing)  ? '0' : BagToString(clothing.rec_cnt))  AS c_cnt,
        (IsEmpty(books)     ? '0' : BagToString(books.rec_cnt))     AS b_cnt;
};

DUMP req_stats;
Output of DUMP req_stats:
(1,3,2,0)
(2,0,0,1)

Cloudera Impala JDBC query doesn't see array<string> Hive column

I have a table in Hive that has the following structure:
> describe volatility2;
Query: describe volatility2
+------------------+---------------+---------+
| name             | type          | comment |
+------------------+---------------+---------+
| version          | int           |         |
| unmappedmkfindex | int           |         |
| mfvol            | array<string> |         |
+------------------+---------------+---------+
It was created by Spark HiveContext code by using a DataFrame API like this:
val volDF = hc.createDataFrame(volRDD)
volDF.saveAsTable(volName)
which carried over the RDD structure that was defined in the schema:
def schemaVolatility: StructType = StructType(
  StructField("Version", IntegerType, false) ::
  StructField("UnMappedMKFIndex", IntegerType, false) ::
  StructField("MFVol", DataTypes.createArrayType(StringType), true) :: Nil)
However, when I try to select from this table using the latest Impala JDBC driver, the last column is not visible to it. My query is very simple - it just prints the data to the console - exactly like the example code provided with the driver download:
String sqlStatement = "select * from default.volatility2";
Class.forName(jdbcDriverName);
con = DriverManager.getConnection(connectionUrl);
Statement stmt = con.createStatement();
ResultSet rs = stmt.executeQuery(sqlStatement);

System.out.println("\n== Begin Query Results ======================");
ResultSetMetaData metadata = rs.getMetaData();
for (int i = 1; i <= metadata.getColumnCount(); i++) {
    System.out.println(rs.getMetaData().getColumnName(i) + ":" + rs.getMetaData().getColumnTypeName(i));
}
System.out.println("== End Query Results =======================\n\n");
The console output is this:
== Begin Query Results ======================
version:version
unmappedmkfindex:unmappedmkfindex
== End Query Results =======================
Is it a driver bug or am I missing something?
I found the answer to my own question. Posting it here so it may help others and save time searching. Apparently Impala recently introduced so-called "complex types" support in its SQL, which includes array among others. The link to the documentation is this:
http://www.cloudera.com/documentation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types_using
According to this, what I had to do was change the query to look like this:
select version, unmappedmkfindex, mfvol.ITEM from volatility2, volatility2.mfvol
and I got the expected results back.
