Apache Airflow: run all parallel tasks in single DAG run

I have a DAG with 30 (or more) dynamically created parallel tasks.
I have the concurrency option set on that DAG so that only a single DAG Run is active while catching up on history.
When I run it on my server, only 16 tasks actually run in parallel, while the remaining 14 just sit queued.
Which setting should I change so that I still have only 1 DAG Run active, but with all 30+ of its tasks running in parallel?
According to this FAQ, it should be either dag_concurrency or max_active_runs_per_dag, but the former seems to be overridden by the concurrency setting I already have, while the latter seemed to have no effect (or I effectively messed up my setup).
Here's the sample code:
import datetime as dt
import logging

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

import config

default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'wait_for_downstream': True,
    'concurrency': 1,
    'retries': 0,
}


def print_operators(ds, **kwargs):
    logging.info(f"Task {kwargs.get('task_instance_key_str', 'unknown_task_instance')}")


dag = DAG(
    dag_id='test_parallelism_dag',
    start_date=dt.datetime(2019, 1, 1),
    default_args=default_args,
    schedule_interval='@daily',
    catchup=True,
    template_searchpath=[config.DAGS_PATH],
    params={'schema': config.SCHEMA_DB},
    max_active_runs=1,
)

print_operators = [PythonOperator(
    task_id=f'test_parallelism_dag.print_operator_{i}',
    python_callable=print_operators,
    provide_context=True,
    dag=dag,
) for i in range(60)]

dummy_operator_start = DummyOperator(
    task_id='test_parallelism_dag.dummy_operator_start',
    dag=dag,
)

dummy_operator_end = DummyOperator(
    task_id='test_parallelism_dag.dummy_operator_end',
    dag=dag,
)

dummy_operator_start >> print_operators >> dummy_operator_end
EDIT 1:
My current airflow.cfg contains:
executor = SequentialExecutor
parallelism = 32
dag_concurrency = 24
max_active_runs_per_dag = 26
My env variables are as follows (I set each one to a different value to easily spot which one helps):
AIRFLOW__CORE__EXECUTOR=LocalExecutor
AIRFLOW__CORE__DAG_CONCURRENCY=18
AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG=20
AIRFLOW__CORE__WORKER_CONCURRENCY=22
With that setup, the resulting Gantt diagram (screenshot not included here) suggests that setting the DAG_CONCURRENCY env variable is what works.

The actual parameter to change was dag_concurrency in airflow.cfg (or override it with the AIRFLOW__CORE__DAG_CONCURRENCY env variable).
As per the docs I referred to in my question:
concurrency: The Airflow scheduler will run no more than $concurrency
task instances for your DAG at any given time. Concurrency is defined
in your Airflow DAG. If you do not set the concurrency on your DAG,
the scheduler will use the default value from the dag_concurrency
entry in your airflow.cfg.
Which means the following simplified code:
default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'wait_for_downstream': True,
    'concurrency': 1,
}

dag = DAG(
    dag_id='test_parallelism_dag',
    default_args=default_args,
    max_active_runs=1,
)
should be rewritten to:
default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'wait_for_downstream': True,
}

dag = DAG(
    dag_id='test_parallelism_dag',
    default_args=default_args,
    max_active_runs=1,
    concurrency=30,
)
My code actually made the wrong assumption that default_args are at some point substituted for actual kwargs to the DAG constructor. I don't know what led me to that conclusion back then, but I guess setting concurrency to 1 there was a draft leftover that never actually affected anything; the actual DAG concurrency was taken from the config default, which is 16.
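To illustrate the point, a minimal sketch (assuming Airflow 1.10.x, where DAG exposes a concurrency attribute): anything placed in default_args is only forwarded to operators, never to the DAG constructor itself, so the DAG silently falls back to the config default.

import datetime as dt

from airflow import DAG

dag = DAG(
    dag_id='concurrency_check',
    start_date=dt.datetime(2019, 1, 1),
    default_args={'concurrency': 1},  # forwarded to operators only, not to the DAG
    max_active_runs=1,
)

# Falls back to [core] dag_concurrency from airflow.cfg (16 by default),
# because concurrency was never passed to the DAG constructor itself.
print(dag.concurrency)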

Update the concurrency config as well in your airflow.cfg file. If it is 16, increase it to 32.
If you are using Celery Executor, change worker_concurrency to 32.
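For example, overriding both through environment variables (assuming Airflow 1.10, where worker_concurrency lives in the [celery] section of airflow.cfg):
AIRFLOW__CORE__DAG_CONCURRENCY=32
AIRFLOW__CELERY__WORKER_CONCURRENCY=32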

Related

Why does huggingface hang on list input for pipeline sentiment-analysis?

With Python 3.10 and the latest version of Hugging Face Transformers,
for simple code like this:
from transformers import pipeline
input_list = ['How do I test my connection? (Windows)', 'how do I change my payment method?', 'How do I contact customer support?']
classifier = pipeline('sentiment-analysis')
results = classifier(input_list)
the program hangs and prints the following error:
File ".......env/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
but if I replace the list input with a string, it works:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
result = classifier('How do I test my connection? (Windows)')
You need to define a main function, because the list input triggers multiprocessing under the hood. The following update works:
from transformers import pipeline

def main():
    input_list = ['How do I test my connection? (Windows)',
                  'how do I change my payment method?',
                  'How do I contact customer support?']
    classifier = pipeline('sentiment-analysis')
    results = classifier(input_list)

if __name__ == '__main__':
    main()
This reduces the question to where to put freeze_support() in a Python script.
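For completeness, a minimal sketch of the usual placement of freeze_support() (it only matters for frozen Windows executables and is a no-op otherwise; the __main__ guard itself is what fixes the spawn error):

from multiprocessing import freeze_support
from transformers import pipeline

def main():
    classifier = pipeline('sentiment-analysis')
    print(classifier(['How do I test my connection? (Windows)',
                      'how do I change my payment method?']))

if __name__ == '__main__':
    freeze_support()  # harmless no-op unless the script is frozen into an executable
    main()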

Why aren't the Superset Alert Mails working, even after setting all the required configurations?

So, I am running Superset installed on an EC2 instance. In my config.py file, I have made only these changes:
FEATURE_FLAGS = {
    "ALERT_REPORTS": True
}
EMAIL_NOTIFICATIONS = True
SMTP_HOST = "email-smtp.us-east-1.amazonaws.com"
SMTP_STARTTLS = True
SMTP_SSL = False
SMTP_USER = "***my user***"
SMTP_PORT = 25
SMTP_PASSWORD = "***my pass***"
SMTP_MAIL_FROM = "***an email ID***"
ENABLE_SCHEDULED_EMAIL_REPORTS = True
ENABLE_ALERTS = True
After setting these, I remembered to do superset init before launching the service.
Yet, after the scheduled time, the UI shows no value in the Last Run column and the logs give the following message:
DEBUG:cron_descriptor.GetText:Failed to find locale en_US
INFO:werkzeug:127.0.0.1 - - [02/Apr/2021 10:56:51] "GET /api/v1/report/?q=(filters:!((col:type,opr:eq,value:Alert)),order_column:name,order_direction:desc,page:0,page_size:25) HTTP/1.1" 200 -
Here's a screenshot of the UI (not included here): there is nothing in the Last Run column, even after the scheduled time (I also tried scheduling it at a 1 minute interval, with the same results).
Alerts/reports are executed in the workers: celery beat schedules the jobs and celery workers execute them. You need to configure and run celery beat and the workers; check out this documentation for how to set this up with docker-compose: https://superset.apache.org/docs/installation/alerts-reports
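As a rough sketch of the worker side of superset_config.py (assuming Redis as the broker; the task names follow the linked docs, but the URLs and schedules are placeholders you would adjust to your setup):

from celery.schedules import crontab

class CeleryConfig:
    broker_url = "redis://localhost:6379/0"
    result_backend = "redis://localhost:6379/0"
    imports = ("superset.sql_lab", "superset.tasks")
    beat_schedule = {
        # evaluate due alerts/reports every minute
        "reports.scheduler": {
            "task": "reports.scheduler",
            "schedule": crontab(minute="*", hour="*"),
        },
        # prune old report execution logs once a day
        "reports.prune_log": {
            "task": "reports.prune_log",
            "schedule": crontab(minute=0, hour=0),
        },
    }

CELERY_CONFIG = CeleryConfig

With that in place, a celery worker and a celery beat process both need to be running against the Superset app in addition to the web server, otherwise the scheduled alerts are never picked up.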

How to run Apache Airflow DAG as Unix user

I installed Apache Airflow on my cluster using the root account. I know it is bad practice, but it is only a test environment. I created a simple DAG:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
dag = DAG('create_directory', description='simple create directory workflow', start_date=datetime(2017, 6, 1))
t1 = BashOperator(task_id='create_dir', bash_command='mkdir /tmp/airflow_dir_test', dag=dag)
t2 = BashOperator(task_id='create_file', bash_command='echo airflow_works > /tmp/airflow_dir_test/airflow.txt')
t2.set_upstream(t1)
The problem is that when I run this job, the root user executes it. I tried adding the owner parameter, but it doesn't work. Airflow says:
Broken DAG: [/opt/airflow/dags/create_directory.py] name 'user1' is not defined
My question is: how can I run an Apache Airflow DAG as a user other than root?
You can use the run_as_user parameter to impersonate a unix user for any task:
t1 = BashOperator(task_id='create_dir', bash_command='mkdir /tmp/airflow_dir_test', dag=dag, run_as_user='user1')
You can use default_args if you want to apply it to every task in the DAG:
dag = DAG('create_directory', description='simple create directory workflow', start_date=datetime(2017, 6, 1), default_args={'run_as_user': 'user1'})
t1 = BashOperator(task_id='create_dir', bash_command='mkdir /tmp/airflow_dir_test', dag=dag)
t2 = BashOperator(task_id='create_file', bash_command='echo airflow_works > /tmp/airflow_dir_test/airflow.txt')
Note that the owner parameter is for something else: multi-tenancy.

Optimal way of creating a cache in the PySpark environment

I am using Spark Streaming to create a system that enriches incoming data from a Cloudant database. Example -
Incoming Message: {"id" : 123}
Outgoing Message: {"id" : 123, "data": "xxxxxxxxxxxxxxxxxxx"}
My code for the driver class is as follows:
from Sample.Job import EnrichmentJob
from Sample.Job import FunctionJob
import pyspark
from pyspark.streaming.kafka import KafkaUtils
from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession
from kafka import KafkaConsumer, KafkaProducer
import json

class SampleFramework():

    def __init__(self):
        pass

    @staticmethod
    def messageHandler(m):
        return json.loads(m.message)

    @staticmethod
    def processData(rdd):
        if (rdd.isEmpty()):
            print("RDD is Empty")
            return

        # Expand
        expanded_rdd = rdd.mapPartitions(EnrichmentJob.enrich)

        # Score
        scored_rdd = expanded_rdd.map(FunctionJob.function)

        # Publish RDD

    def run(self, ssc):
        self.ssc = ssc
        directKafkaStream = KafkaUtils.createDirectStream(
            self.ssc, QUEUENAME,
            {"metadata.broker.list": META,
             "bootstrap.servers": SERVER},
            messageHandler=SampleFramework.messageHandler)
        directKafkaStream.foreachRDD(SampleFramework.processData)
        ssc.start()
        ssc.awaitTermination()
Code for the Enrichment Job is as follows:
class EnrichmentJob:
    cache = {}

    @staticmethod
    def enrich(data):
        # Assume that the Cloudant connector uses the available config
        cloudantConnector = CloudantConnector(config, config["cloudant"]["host"]["req_db_name"])
        final_data = []
        for row in data:
            id = row["id"]
            if(id not in EnrichmentJob.cache.keys()):
                data = cloudantConnector.getOne({"id": id})
                row["data"] = data
                EnrichmentJob.cache[id] = data
            else:
                data = EnrichmentJob.cache[id]
                row["data"] = data
            final_data.append(row)
        cloudantConnector.close()
        return final_data
My question is: is there some way to maintain [1] "a global cache in main memory that is accessible to all workers" or [2] "local caches on each of the workers such that they remain persisted in the foreachRDD setting"?
I have already explored the following -
Broadcast Variables - Here we go the [1] way. As I understand it, they are meant to be read-only and immutable. I have checked out this reference, but it cites an example of unpersisting and re-persisting the broadcast variable. Is that good practice?
Static Variables - Here we go the [2] way. The class that is being referred to ("Enricher" in this case) maintains a cache in the form of a static variable dictionary. But it turns out that the foreachRDD function spawns a completely new process for each incoming RDD, which discards the previously initialized static variable. This is the approach coded above.
I have two possible solutions right now -
Maintain an offline cache on the file system.
Do the entire computation of this enrichment task on my driver node. This would cause the entire data to end up on driver and be maintained there. The cache object will be sent to the enrichment job as an argument to the mapping function.
Obviously, the first one looks better than the second, but I want to confirm that these two are the only options before committing to either. Any pointers would be appreciated!
Is there some way to maintain [1] "a global cache in main memory that is accessible to all workers"?
No. There is no "main memory" that can be accessed by all workers. Each worker runs in a separate process and communicates with the external world via sockets, not to mention the separation between different physical nodes in non-local mode.
There are some techniques that can be applied to achieve a worker-scoped cache with memory-mapped data (SQLite being the simplest option), but it takes some additional effort to implement correctly (avoiding conflicts and such).
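A very rough sketch of that idea, used from mapPartitions (not the answer's code; expensive_lookup is a hypothetical stand-in for the Cloudant call, and a real implementation would still need to handle concurrent writers and cache invalidation):

import json
import sqlite3

def enrich_partition(rows, db_path="/tmp/enrich_cache.db"):
    # One SQLite file per worker host; each Python worker process opens its own connection.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS cache (id TEXT PRIMARY KEY, data TEXT)")
    for row in rows:
        key = str(row["id"])
        hit = conn.execute("SELECT data FROM cache WHERE id = ?", (key,)).fetchone()
        if hit is None:
            data = expensive_lookup(key)  # hypothetical: e.g. the Cloudant fetch
            conn.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)", (key, json.dumps(data)))
            conn.commit()
        else:
            data = json.loads(hit[0])
        row["data"] = data
        yield row
    conn.close()

# usage: enriched_rdd = rdd.mapPartitions(enrich_partition)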
or [2]"local caches on each of the workers such that they remain persisted in the foreachRDD setting"?
You can use standard caching techniques with scope limited to the individual worker processes. Depending on the configuration (static vs. dynamic resource allocation, spark.python.worker.reuse) it may or may not be preserved between multiple tasks and batches.
Consider the following simplified example:
counter_param.py:
from pyspark import AccumulatorParam
from collections import Counter

class CounterParam(AccumulatorParam):
    def zero(self, v: Counter) -> Counter:
        return Counter()

    def addInPlace(self, acc1: Counter, acc2: Counter) -> Counter:
        acc1.update(acc2)
        return acc1
my_utils.py:
from pyspark import Accumulator
from typing import Hashable
from collections import Counter

# Dummy cache. In production I would use functools.lru_cache
# but it is a bit more painful to show with accumulator
cached = {}

def f_cached(x: Hashable, counter: Accumulator) -> Hashable:
    if cached.get(x) is None:
        cached[x] = True
        counter.add(Counter([x]))
    return x

def f_uncached(x: Hashable, counter: Accumulator) -> Hashable:
    counter.add(Counter([x]))
    return x
main.py:
from pyspark.streaming import StreamingContext
from pyspark import SparkContext

from counter_param import CounterParam
import my_utils

from collections import Counter

def main():
    sc = SparkContext("local[1]")
    ssc = StreamingContext(sc, 5)

    cnt_cached = sc.accumulator(Counter(), CounterParam())
    cnt_uncached = sc.accumulator(Counter(), CounterParam())

    stream = ssc.queueStream([
        # Use single partition to show cache in work
        sc.parallelize(data, 1) for data in
        [[1, 2, 3], [1, 2, 5], [1, 3, 5]]
    ])

    stream.foreachRDD(lambda rdd: rdd.foreach(
        lambda x: my_utils.f_cached(x, cnt_cached)))
    stream.foreachRDD(lambda rdd: rdd.foreach(
        lambda x: my_utils.f_uncached(x, cnt_uncached)))

    ssc.start()
    ssc.awaitTerminationOrTimeout(15)
    ssc.stop(stopGraceFully=True)

    print("Counter cached {0}".format(cnt_cached.value))
    print("Counter uncached {0}".format(cnt_uncached.value))

if __name__ == "__main__":
    main()
Example run:
bin/spark-submit main.py
Counter cached Counter({1: 1, 2: 1, 3: 1, 5: 1})
Counter uncached Counter({1: 3, 2: 2, 3: 2, 5: 2})
As you can see, we get the expected results:
For "cached" objects, the accumulator is updated only once per unique key per worker process (partition).
For non-cached objects, the accumulator is updated each time a key occurs.

JIRA OnDemand: email notification when logged work is approaching the estimated time

I would like to know if there is an option to send a notification when logged work is approaching the estimated time.
Example:
Start:
Estimated: Original Estimate - 5 minutes (5m)
Remaining: Remaining Estimate - 5 minutes (5m)
Logged: Time Spent - Not Specified
When I log time:
Estimated: Original Estimate - 5 minutes (5m)
Remaining: Remaining Estimate - 1 minute (1m)
Logged: Time Spent - 4 minutes (4m)
I would like JIRA to send the notification 1 minute before the estimate is used up, or whatever threshold I set.
I'm sorry for the bad English.
Thank you
I believe you want to create some kind of triggered event when logged work approaches the original estimate, but I do not know how you could do that in JIRA. Nevertheless, I know something that might still help you solve your problem.
Try using the following Groovy script:
import com.atlassian.jira.ComponentManager
import com.atlassian.jira.component.ComponentAccessor
import com.atlassian.jira.config.properties.APKeys
import com.atlassian.jira.config.properties.ApplicationProperties
import com.atlassian.jira.issue.Issue
import com.atlassian.jira.issue.search.SearchResults
import com.atlassian.jira.issue.worklog.Worklog
import com.atlassian.jira.jql.parser.DefaultJqlQueryParser
import com.atlassian.jira.web.bean.PagerFilter
import com.atlassian.mail.Email
import com.atlassian.mail.MailException
import com.atlassian.mail.MailFactory
import com.atlassian.mail.queue.SingleMailQueueItem
import com.atlassian.query.Query
import groovy.text.GStringTemplateEngine
import org.apache.log4j.Logger
import com.atlassian.core.util.DateUtils
def componentManager = ComponentManager.getInstance()
def worklogManager = componentManager.getWorklogManager()
def userUtil = componentManager.getUserUtil()
def user = userUtil.getUser('admin')
def searchProvider = componentManager.getSearchProvider()
def queryParser = new DefaultJqlQueryParser()
Logger log = Logger.getLogger('worklogNotification')
Query jql = queryParser.parseQuery('project = ABC and updated > startOfDay(-1d)')
SearchResults results = searchProvider.search(jql, user, PagerFilter.getUnlimitedFilter())
List issues = results.getIssues()
String emailFormat = 'HTML'
def mailServerManager = componentManager.getMailServerManager()
def mailServer = mailServerManager.getDefaultSMTPMailServer()
String defaultSubject = 'Logged work on JIRA issue %ISSUE% exceeds original estimate'
String body = ''
Map binding = [:]
String loggedWorkDiff = ''
String template = '''
Dear ${issue.assignee.displayName}, <br /><br />
Logged work on issue ${issue.key} (${issue.summary}) exceeds original estimate ($loggedWorkDiff more than expected).<br />
*** This is an automatically generated email, you do not need to reply ***<br />
'''
GStringTemplateEngine engine = new GStringTemplateEngine()
ApplicationProperties applicationProperties = componentManager.getApplicationProperties()
binding.put("baseUrl", applicationProperties.getString(APKeys.JIRA_BASEURL))
if (mailServer && !MailFactory.isSendingDisabled()) {
    for (Issue issue in issues) {
        if (issue.originalEstimate) {
            loggedWork = 0
            worklogs = worklogManager.getByIssue(issue)
            worklogs.each { Worklog worklog -> loggedWork += worklog.getTimeSpent() }
            if (loggedWork > issue.originalEstimate) {
                loggedWorkDiff = DateUtils.getDurationString(Math.round(loggedWork - issue.originalEstimate))
                email = new Email(issue.getAssigneeUser().getEmailAddress())
                email.setFrom(mailServer.getDefaultFrom())
                email.setSubject(defaultSubject.replace('%ISSUE%', issue.getKey()))
                email.setMimeType(emailFormat == "HTML" ? "text/html" : "text/plain")
                binding.put("issue", issue)
                binding.put('loggedWorkDiff', loggedWorkDiff)
                body = engine.createTemplate(template).make(binding).toString()
                email.setBody(body)
                try {
                    log.debug("Sending mail to ${email.getTo()}")
                    log.debug("with body ${email.getBody()}")
                    log.debug("from template ${template}")
                    SingleMailQueueItem item = new SingleMailQueueItem(email);
                    ComponentAccessor.getMailQueue().addItem(item);
                }
                catch (MailException e) {
                    log.warn("Error sending email", e)
                }
            }
        }
    }
}
This script takes issues from project ABC that have been updated during the previous day (JQL: "project = ABC and updated > startOfDay(-1d)"), calculates the difference between logged work and the original estimate, and sends a notification to the issue assignee if logged work exceeds the original estimate.
You can add this script to the list of JIRA services (JIRA -> Administration -> System -> Advanced -> Services):
Name: [any name]
Class: com.onresolve.jira.groovy.GroovyService
Delay: 1440
Input file: [path to your script on server]
Note that a delay of 1440 (minutes) equals one day, so the script will be executed once per day, sending notifications to issue assignees about exceeded original estimates.
Also note that the Groovy Scripting plugin must be installed in order to run the script.
