How to run Apache Airflow DAG as Unix user - hadoop

I installed Apache Airflow on my cluster using the root account. I know it is bad practice, but it is only a test environment. I created a simple DAG:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
dag = DAG('create_directory', description='simple create directory workflow', start_date=datetime(2017, 6, 1))
t1 = BashOperator(task_id='create_dir', bash_command='mkdir /tmp/airflow_dir_test', dag=dag)
t2 = BashOperator(task_id='create_file', bash_command='echo airflow_works > /tmp/airflow_dir_test/airflow.txt', dag=dag)
t2.set_upstream(t1)
The problem is that when I run this job, it is executed by the root user. I tried to add the owner parameter, but it doesn't work. Airflow says:
Broken DAG: [/opt/airflow/dags/create_directory.py] name 'user1' is not defined
My question is: how can I run an Apache Airflow DAG as a user other than root?

You can use the run_as_user parameter to impersonate a unix user for any task:
t1 = BashOperator(task_id='create_dir', bash_command='mkdir /tmp/airflow_dir_test', dag=dag, run_as_user='user1')
You can use default_args if you want to apply it to every task in the DAG:
dag = DAG('create_directory', description='simple create directory workflow', start_date=datetime(2017, 6, 1), default_args={'run_as_user': 'user1'})
t1 = BashOperator(task_id='create_dir', bash_command='mkdir /tmp/airflow_dir_test', dag=dag)
t2 = BashOperator(task_id='create_file', bash_command='echo airflow_works > /tmp/airflow_dir_test/airflow.txt', dag=dag)
Note that the owner parameter is for something else, multi-tenancy.
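For completeness, a minimal end-to-end sketch (assuming an Airflow 1.x-style setup matching the imports above) that applies run_as_user to both tasks via default_args; note that impersonation relies on the user running Airflow being able to sudo to the target user without a password:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# run_as_user in default_args applies to every task in this DAG
default_args = {'run_as_user': 'user1'}

dag = DAG('create_directory',
          description='simple create directory workflow',
          start_date=datetime(2017, 6, 1),
          default_args=default_args)

t1 = BashOperator(task_id='create_dir',
                  bash_command='mkdir /tmp/airflow_dir_test',
                  dag=dag)
t2 = BashOperator(task_id='create_file',
                  bash_command='echo airflow_works > /tmp/airflow_dir_test/airflow.txt',
                  dag=dag)
t2.set_upstream(t1)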

Related

Why does huggingface hang on list input for pipeline sentiment-analysis?

With Python 3.10 and the latest version of Hugging Face Transformers, for simple code like this:
from transformers import pipeline
input_list = ['How do I test my connection? (Windows)', 'how do I change my payment method?', 'How do I contact customer support?']
classifier = pipeline('sentiment-analysis')
results = classifier(input_list)
the program hangs and returns error messages:
File ".......env/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
But if I replace the list input with a string, it works:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
result = classifier('How do I test my connection? (Windows)')
The list input triggers multiprocessing, so the work needs to run from a main function guarded by if __name__ == '__main__'. The following update works:
from transformers import pipeline
def main():
    input_list = ['How do I test my connection? (Windows)',
                  'how do I change my payment method?',
                  'How do I contact customer support?']
    classifier = pipeline('sentiment-analysis')
    results = classifier(input_list)

if __name__ == '__main__':
    main()
That reduces the question to: where should freeze_support() be put in a Python script?
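A minimal sketch of where freeze_support() would go, assuming you actually need it (per the Python docs it only has an effect when the script has been frozen into a Windows executable; otherwise the __main__ guard alone is enough): it is called first under the if __name__ == '__main__': guard, before any other work.
from multiprocessing import freeze_support
from transformers import pipeline

def main():
    input_list = ['How do I test my connection? (Windows)',
                  'how do I change my payment method?',
                  'How do I contact customer support?']
    classifier = pipeline('sentiment-analysis')
    print(classifier(input_list))

if __name__ == '__main__':
    freeze_support()  # no-op unless the script has been frozen into an executable
    main()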

Apache Airflow: run all parallel tasks in single DAG run

I have a DAG that has 30 (or more) dynamically created parallel tasks.
I have the concurrency option set on that DAG so that only a single DAG run is active when catching up on history.
When I run it on my server, only 16 tasks actually run in parallel, while the remaining 14 just wait in the queue.
Which setting should I alter so that I have only 1 DAG Run running, but with all 30+ tasks running in parallel?
According to this FAQ, it seems like it's one of dag_concurrency or max_active_runs_per_dag, but the former seems to be overridden by the concurrency setting already, while the latter seemed to have no effect (or I effectively messed up my setup).
Here's the sample code:
import datetime as dt
import logging
from airflow.operators.dummy_operator import DummyOperator
import config
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'wait_for_downstream': True,
    'concurrency': 1,
    'retries': 0,
}

def print_operators(ds, **kwargs):
    logging.info(f"Task {kwargs.get('task_instance_key_str', 'unknown_task_instance')}")

dag = DAG(
    dag_id='test_parallelism_dag',
    start_date=dt.datetime(2019, 1, 1),
    default_args=default_args,
    schedule_interval='@daily',
    catchup=True,
    template_searchpath=[config.DAGS_PATH],
    params={'schema': config.SCHEMA_DB},
    max_active_runs=1,
)

print_operators = [PythonOperator(
    task_id=f'test_parallelism_dag.print_operator_{i}',
    python_callable=print_operators,
    provide_context=True,
    dag=dag
) for i in range(60)]

dummy_operator_start = DummyOperator(
    task_id='test_parallelism_dag.dummy_operator_start',
    dag=dag,
)

dummy_operator_end = DummyOperator(
    task_id='test_parallelism_dag.dummy_operator_end',
    dag=dag,
)

dummy_operator_start >> print_operators >> dummy_operator_end
EDIT 1:
My current airflow.cfg contains:
executor = SequentialExecutor
parallelism = 32
dag_concurrency = 24
max_active_runs_per_dag = 26
My environment variables are as follows (I set them all to different values to easily spot which one helps):
AIRFLOW__CORE__EXECUTOR=LocalExecutor
AIRFLOW__CORE__DAG_CONCURRENCY=18
AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG=20
AIRFLOW__CORE__WORKER_CONCURRENCY=22
And with that I get the following Gantt diagram (screenshot not reproduced here), which gives me a hint that setting the DAG_CONCURRENCY environment variable works.
The actual parameter to change was dag_concurrency in airflow.cfg, or overriding it with the AIRFLOW__CORE__DAG_CONCURRENCY environment variable.
As per the docs I referred to in my question:
concurrency: The Airflow scheduler will run no more than $concurrency
task instances for your DAG at any given time. Concurrency is defined
in your Airflow DAG. If you do not set the concurrency on your DAG,
the scheduler will use the default value from the dag_concurrency
entry in your airflow.cfg.
Which means the following simplified code:
default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'wait_for_downstream': True,
    'concurrency': 1,
}

dag = DAG(
    dag_id='test_parallelism_dag',
    default_args=default_args,
    max_active_runs=1,
)
should be rewritten to:
default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'wait_for_downstream': True,
}

dag = DAG(
    dag_id='test_parallelism_dag',
    default_args=default_args,
    max_active_runs=1,
    concurrency=30,
)
My code made the wrong assumption that default_args are at some point substituted for actual kwargs to the DAG constructor. I don't know what led me to that conclusion back then, but I guess setting concurrency to 1 there was a draft leftover that never actually affected anything; the actual DAG concurrency was taken from the config default, which is 16.
Update the concurrency config as well in your airflow.cfg file. If it is 16, increase it to 32.
If you are using Celery Executor, change worker_concurrency to 32.
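For reference, a sketch of the relevant [core] settings in an Airflow 1.10-style airflow.cfg (key names and defaults may differ between versions); each entry can also be set through the corresponding AIRFLOW__CORE__* environment variable, which takes precedence over the file:
[core]
parallelism = 32               # upper bound on task instances running at once across the installation
dag_concurrency = 32           # default per-DAG task concurrency; DAG(concurrency=...) overrides it
max_active_runs_per_dag = 1    # keep a single active DAG run while catching up

# equivalent environment variables
AIRFLOW__CORE__PARALLELISM=32
AIRFLOW__CORE__DAG_CONCURRENCY=32
AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG=1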

Airflow retain the same database connection?

I'm using Airflow for some ETL things, and in some stages I would like to use temporary tables (mostly to keep the code and data objects self-contained and to avoid using a lot of metadata tables).
Using the Postgres connection in Airflow and the PostgresOperator, the behaviour I found was: each execution of a PostgresOperator gets a new connection (or session, if you like) in the database. In other words: we lose all temporary objects created by the previous component of the DAG.
To simulate a simple example, I use this code (do not run it, just look at the objects):
import os
from datetime import datetime, timedelta  # missing in the original snippet but required below

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2018, 6, 13),
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'refresh_views',
    default_args=default_args)
# Create database workflow
drop_exist_temporary_view = "DROP TABLE IF EXISTS temporary_table_to_be_used;"
create_temporary_view = """
CREATE TEMPORARY TABLE temporary_table_to_be_used AS
SELECT relname AS views
,CASE WHEN relispopulated = 'true' THEN 1 ELSE 0 END AS relispopulated
,CAST(reltuples AS INT) AS reltuples
FROM pg_class
WHERE relname = 'some_view'
ORDER BY reltuples ASC;"""
use_temporary_view = """
DO $$
DECLARE
    is_correct integer := (SELECT relispopulated FROM temporary_table_to_be_used WHERE views LIKE '%<<some_name>>%');
BEGIN
    start_time := clock_timestamp();
    IF is_materialized = 0 THEN
        EXECUTE 'REFRESH MATERIALIZED VIEW ' || view_to_refresh || ' WITH DATA;';
    ELSE
        EXECUTE 'REFRESH MATERIALIZED VIEW CONCURRENTLY ' || view_to_refresh || ' WITH DATA;';
    END IF;
END;
$$ LANGUAGE plpgsql;
"""
# Objects to be executed
drop_exist_temporary_view = PostgresOperator(
    task_id='drop_exist_temporary_view',
    sql=drop_exist_temporary_view,
    postgres_conn_id='dwh_staging',
    dag=dag)

create_temporary_view = PostgresOperator(
    task_id='create_temporary_view',
    sql=create_temporary_view,
    postgres_conn_id='dwh_staging',
    dag=dag)

use_temporary_view = PostgresOperator(
    task_id='use_temporary_view',
    sql=use_temporary_view,
    postgres_conn_id='dwh_staging',
    dag=dag)
# Data workflow
drop_exist_temporary_view >> create_temporary_view >> use_temporary_view
At the end of execution, I receive the following message:
[2018-06-14 15:26:44,807] {base_task_runner.py:95} INFO - Subtask: psycopg2.ProgrammingError: relation "temporary_table_to_be_used" does not exist
Does anyone know if Airflow has some way to retain the same connection to the database? I think it could save a lot of work in creating and maintaining several objects in the database.
You can retain the connection to the database by building a custom operator that leverages the PostgresHook to hold a single connection to the db while you perform a set of SQL operations (a minimal sketch follows below).
You may find some examples in contrib on incubator-airflow or in Airflow-Plugins.
Another option is to persist this temporary data to XCOMs. This will give you the ability to keep the metadata used with the task in which it was created. This may help troubleshooting down the road.
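A minimal sketch of the custom-operator approach, assuming Airflow 1.x import paths; the operator name and constructor arguments are illustrative, not an existing Airflow class. The point is that every statement runs on the same psycopg2 connection returned by PostgresHook.get_conn(), so temporary tables created by one statement are still visible to the next.
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class MultiStatementPostgresOperator(BaseOperator):
    """Runs a list of SQL statements on a single connection/session."""

    @apply_defaults
    def __init__(self, sql_statements, postgres_conn_id='postgres_default', *args, **kwargs):
        super(MultiStatementPostgresOperator, self).__init__(*args, **kwargs)
        self.sql_statements = sql_statements
        self.postgres_conn_id = postgres_conn_id

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.postgres_conn_id)
        conn = hook.get_conn()  # one connection for every statement
        try:
            with conn.cursor() as cursor:
                for statement in self.sql_statements:
                    cursor.execute(statement)
            conn.commit()
        finally:
            conn.close()
The three SQL strings from the question could then be passed as sql_statements=[drop_exist_temporary_view, create_temporary_view, use_temporary_view] to a single task instead of three separate PostgresOperator tasks.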

Stanford CoreNLP use case using Pyspark script runs fine on local node but on yarn cluster mode it runs very slow

I tried debugging all the possible solutions but am unable to run and scale this on the cluster, as I need to process 100 million records. This script runs very well on a local node as expected, but it fails to run on the Cloudera Amazon cluster. Here is the sample data that works on the local node. In my view the problem is that the 2 files used in the UDF are not getting distributed to the executors/containers/nodes, so the job just keeps running and processing is very slow. I am unable to fix this code so that it executes on the cluster.
##Link to the 2 files which i use in the script###
##https://nlp.stanford.edu/software/stanford-ner-2015-12-09.zip
####Link to the data set########
##https://docs.google.com/spreadsheets/d/17b9NUonmFjp_W0dOe7nzuHr7yMM0ITTDPCBmZ6xM0iQ/edit?usp=drivesdk&lipi=urn%3Ali%3Apage%3Ad_flagship3_messaging%3BQHHZFKYfTPyRb%2FmUg6ahsQ%3D%3D
#spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 --master yarn-cluster --files /home/ec2-user/StanfordParser/stanford-ner-2016-10-31/stanford-ner.jar,/home/ec2-user/StanfordParser/stanford-ner-2016-10-31/classifiers/english.all.3class.distsim.crf.ser.gz stanford_ner.py
import os

from pyspark import SparkConf, SparkContext, SparkFiles
from pyspark.sql import HiveContext, Row, SQLContext
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from nltk.tag import StanfordNERTagger  # missing in the original snippet; required for st = StanfordNERTagger(...)

def stanford(str):
    os.environ['JAVA_HOME'] = '/usr/java/jdk1.8.0_131/'
    stanford_classifier = SparkFiles.get("english.all.3class.distsim.crf.ser.gz")
    stanford_ner_path = SparkFiles.get("stanford-ner.jar")
    st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8')
    output = st.tag(str.split())
    organizations = []
    organization = ""
    for t in output:
        # The word
        word = t[0]
        # The current tag
        tag = t[1]
        # print(word, tag)
        # If the current tag is ORGANIZATION, append the current word to the running name
        if tag == "ORGANIZATION":
            organization += " " + word
            organizations.append(organization)
    final = "-".join(organizations)
    return final

stanford_classification = udf(stanford, StringType())
###################Pyspark Section###############
#Set context
sc = SparkContext.getOrCreate()
sc.setLogLevel("DEBUG")
sqlContext = SQLContext(sc)
#Get data
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load(r"/Downloads/authors_data.csv")
#Create new dataframe with new column organization
df = df.withColumn("organizations", stanford_classification(df['affiliation_string']))
#Save result
df.select('pmid','affiliation_string','organizations').write.format('com.databricks.spark.csv').save(r"/Downloads/organizations.csv")
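One thing that may be worth checking (a hedged sketch assuming the suspicion above is right, i.e. the two files are not reaching the executors; not a confirmed fix): the files can also be shipped programmatically with SparkContext.addFile, after which SparkFiles.get() resolves them locally on every executor. The paths below are the ones from the spark-submit line at the top of the script.
from pyspark import SparkContext, SparkFiles

sc = SparkContext.getOrCreate()
# Ship both files to every executor; SparkFiles.get() inside the UDF then finds them locally.
sc.addFile("/home/ec2-user/StanfordParser/stanford-ner-2016-10-31/stanford-ner.jar")
sc.addFile("/home/ec2-user/StanfordParser/stanford-ner-2016-10-31/classifiers/english.all.3class.distsim.crf.ser.gz")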

error spark-shell, falling back to uploading libraries under SPARK_HOME

I'm trying to connect spark-shell to Amazon Hadoop (EMR), but it keeps giving the following error and I do not know how to fix it or what configuration is missing.
spark.yarn.jars, spark.yarn.archive
spark-shell --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/08/12 07:47:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
16/08/12 07:47:28 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Thx!!!
Error 1
I'm trying to run a SQL query, something totally simple like:
val sqlDF = spark.sql("SELECT col1 FROM tabl1 limit 10")
sqlDF.show()
WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Error 2
Then I try to run a Scala script, something simple taken from:
https://blogs.aws.amazon.com/bigdata/post/Tx2D93GZRHU3TES/Using-Spark-SQL-for-ETL
import org.apache.hadoop.io.Text;
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import com.amazonaws.services.dynamodbv2.model.AttributeValue
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.io.LongWritable
import java.util.HashMap
var ddbConf = new JobConf(sc.hadoopConfiguration)
ddbConf.set("dynamodb.output.tableName", "tableDynamoDB")
ddbConf.set("dynamodb.throughput.write.percent", "0.5")
ddbConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
ddbConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
var genreRatingsCount = sqlContext.sql("SELECT col1 FROM table1 LIMIT 1")
var ddbInsertFormattedRDD = genreRatingsCount.map(a => {
  var ddbMap = new HashMap[String, AttributeValue]()

  var col1 = new AttributeValue()
  col1.setS(a.get(0).toString)
  ddbMap.put("col1", col1)

  var item = new DynamoDBItemWritable()
  item.setItem(ddbMap)

  (new Text(""), item)
})
ddbInsertFormattedRDD.saveAsHadoopDataset(ddbConf)
scala.reflect.internal.Symbols$CyclicReference: illegal cyclic reference involving object InterfaceAudience
at scala.reflect.internal.Symbols$Symbol$$anonfun$info$3.apply(Symbols.scala:1502)
at scala.reflect.internal.Symbols$Symbol$$anonfun$info$3.apply(Symbols.scala:1500)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
It looks like the Spark UI has not started; I tried starting spark-shell and also checked whether the Spark UI at localhost:4040 is running correctly.
