DAG failures are not getting reported - etl

We have set up a callback that is triggered on task failures and sends out a Slack message. The "email_on_failure" option is also enabled to send out emails on failure. However, in many instances we found that the DAG was marked as failed in the UI but the error was not reported by either Slack or email. Here is how we have set up our DAG:
default_args = {
    "owner": "Test",
    "depends_on_past": True,
    "wait_for_downstream": True,
    "start_date": datetime(2019, 5, 8),
    "email": ["a@a.com"],
    "email_on_failure": True,
    "email_on_retry": False,
    "retries": 1,
    "on_failure_callback": task_fail_slack_alert,
    "retry_delay": timedelta(minutes=5),
}

mydag = DAG(
    dag_id="test_etl",
    default_args=default_args,
    schedule_interval="05 7 * * *",
    max_active_runs=1,
    dagrun_timeout=timedelta(hours=6),
    params={},
)
I suspect this happens when the tasks were either completed or in flight but the DAG run timed out. I was wondering if anyone else has seen this issue.
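One thing worth checking, since the on_failure_callback and email_on_failure in default_args only fire for failed task instances: if the run is killed by dagrun_timeout, the DAG run may be marked failed without any individual task failing, and nothing gets reported. Below is a minimal sketch of also attaching the alert at the DAG level, assuming an Airflow version whose DAG constructor accepts on_failure_callback and reusing the task_fail_slack_alert function and default_args from the question:

# Sketch (assumption): attach the same Slack alert at the DAG level so that a
# run failure (e.g. a dagrun_timeout) is reported even if no task instance fails.
# task_fail_slack_alert and default_args are the objects defined in the question.
from datetime import timedelta

from airflow import DAG

mydag = DAG(
    dag_id="test_etl",
    default_args=default_args,
    schedule_interval="05 7 * * *",
    max_active_runs=1,
    dagrun_timeout=timedelta(hours=6),
    # DAG-level callback; on recent Airflow versions it receives a context dict
    # like the task-level callback does.
    on_failure_callback=task_fail_slack_alert,
)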

Related

Strapi deployment to Heroku

I am totally new to Strapi and Heroku. I am trying to deploy my app, which works well locally, to Heroku, but I am getting the following error:
2020-06-15T09:56:29.114780+00:00 app[web.1]: [2020-06-15T09:56:29.114Z] error Impossible to register the 'menus.menus' model.
2020-06-15T09:56:29.115672+00:00 app[web.1]: [2020-06-15T09:56:29.115Z] error TimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
At first I thought it was a problem connecting to the database, but in my local environment it works perfectly and connects with no issues.
I even upgraded my database to a paid plan in case the connection was timing out.
I also followed some answers I found online about modifying my config/environment/production/database.json as follows:
{
  "defaultConnection": "default",
  "connections": {
    "default": {
      "connector": "bookshelf",
      "settings": {
        "client": "postgres",
        "host": "***.compute-1.amazonaws.com",
        "port": "5432",
        "database": "***",
        "username": "***",
        "password": "***",
        "ssl": { "rejectUnauthorized": false }
      },
      "options": {
        "debug": false,
        "acquireConnectionTimeout": 100000,
        "pool": {
          "min": 0,
          "max": 10,
          "createTimeoutMillis": 30000,
          "acquireTimeoutMillis": 600000,
          "idleTimeoutMillis": 20000,
          "reapIntervalMillis": 20000,
          "createRetryIntervalMillis": 200
        }
      }
    }
  }
}
Any other idea of what it can be?
When I run develop locally I get a warning (though even so the app runs anyway afterwards):
[2020-06-15T10:36:41.261Z] warn The bootstrap function is taking unusually long to execute (3500 miliseconds).
[2020-06-15T10:36:41.261Z] warn Make sure you call it?
[2020-06-15T10:36:42.476Z] warn The bootstrap function is taking unusually long to execute (3500 miliseconds).
[2020-06-15T10:36:42.476Z] warn Make sure you call it?
One simple first step is to launch a Strapi quickstart application on Heroku. You can find the link here: https://github.com/strapi/strapi
Relaunching with this method will provide you with a working, secure instance to begin development on.
Also note that Heroku deploys Strapi in production mode, so you are not able to use the content-types editor; it is therefore recommended that you develop and test your app locally and use the Heroku CLI to update your deployment.

Where do I find the Dialogflow request payload?

I have a Flask-based service hosted on Heroku, whose endpoint is given as the fulfillment webhook in Dialogflow. Now I cannot figure out how to capture the request payload that Dialogflow sends every time I request something.
I tried capturing and logging it in Heroku itself, but that does not seem to be working.
The service code is as follows:
import logging
from flask import Flask, request

app = Flask(__name__)

@app.route('/date/currentdate/<date>', methods=['POST'])
def postJsonHandler(date):
    print(request.is_json)
    content = request.get_json()
    logging.warning(content)
    return 'JSON posted'
The JSON which I am getting is:
WARNING:root:{'responseId': 'c5115583-e9c5-497a-8a50-1ea07ab02dba-baaf0c1f', 'queryResult': {'queryText': 'send me the asap for 4568999', 'parameters': {'Dashboard': 'ASAP', 'number': 4568999.0}, 'allRequiredParamsPresent': True, 'fulfillmentMessages': [{'text': {'text': ['Hi I can definitely help you out with that.']}, 'platform': 'SKYPE'}, {'text': {'text': ['']}}], 'intent': {'name': 'replaced this', 'displayName': 'ASAP Dashboard'}, 'intentDetectionConfidence': 0.7012109, 'languageCode': 'en'}, 'originalDetectIntentRequest': {'payload': {}}, 'session': 'replaced this'}
There is a Diagnostic Info section in the agent console which has all the info related to the request and response. It is sometimes not visible if there is a pop-up at the top of the page; in my case it was the "v2 getting outdated" banner, and on closing it the diagnostic info became visible.
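If the goal is also to see the payload in heroku logs, note that Heroku only captures what the dyno writes to stdout/stderr. Here is a minimal sketch of the handler from the question with the standard logging module routed to stdout; the route and handler name come from the question, while the logging setup is an assumption about what may have been missing:

import json
import logging
import sys

from flask import Flask, request

app = Flask(__name__)

# Heroku collects whatever the process writes to stdout/stderr,
# so point the root logger there explicitly.
logging.basicConfig(stream=sys.stdout, level=logging.INFO)

@app.route('/date/currentdate/<date>', methods=['POST'])
def postJsonHandler(date):
    # Dialogflow sends the payload as JSON in the POST body.
    payload = request.get_json(silent=True) or {}
    logging.info("Dialogflow payload: %s", json.dumps(payload))
    return 'JSON posted'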

aws-amplify subscription to appsync stops after 2 mins

When subscribing to AppSync with Amplify using API.graphql(graphqlOperation(subscription)), the subscription works for a while, but then I get a socket closed error.
CONSOLE LOG [native code]: {
"[INFO] 24:12.893 MqttOverWSProvider": {
"errorCode": 8,
"errorMessage": "AMQJS0008I Socket closed.",
"uri": "wss://<uri>
The same subscription works in the AppSync console without any issue, and I get events for as long as the subscription is running, but it stops after around 2 minutes on the device.
The code involved:
let cs = await this.subscriptionService.appSubscriptions();
this.subscriptions = cs.subscribe({
  next: data => {
    console.dir(data);
  },
  error: error => console.dir(JSON.stringify(error)),
  close: () => console.log("closed")
});
This part of the code throws no error, and the error shown above is only visible when setting LOG_LEVEL = DEBUG.
NOTE: The subscription does work for a while before this error shows up, and in that time I can see all the events that I trigger from the console.

Chronos can't run a private Docker container

I'm playing with a DC/OS installation on localhost. While everything else works fine, I can't seem to run a Docker image located in a private repository. I'm using Python to communicate with Chronos:
@celery.task(name='add-job', soft_time_limit=5)
def add_job(job_id):
    job_document = mongo.jobs.find_one({
        '_id': job_id
    })
    if job_document:
        worker_document = mongo.workers.find_one({
            '_id': job_document['workerId']
        })
        if worker_document:
            job = {
                'async': True,
                'name': job_document['_id'],
                'owner': 'owner@gmail.com',
                'command': "python /code/run.py",
                "disabled": False,
                "shell": True,
                "cpus": worker_document['cpus'],
                "disk": worker_document['disk'],
                "mem": worker_document['memory'],
                'schedule': 'R1//PT300S',  # start now
                "epsilon": "PT60M",
                "container": {
                    "type": "DOCKER",
                    "forcePullImage": True,
                    "image": "quay.io/username/container",
                    "network": "HOST",
                    "volumes": [{
                        "containerPath": "/images/",
                        "hostPath": "/images/",
                        "mode": "RW"
                    }]
                },
                "uris": [
                    "file:///images/docker.tar.gz"
                ]
            }
            return chronos_client.add(job)
        else:
            return 'worker not found'
    else:
        return 'job not found'
The job runs fine with a public image (alpine:latest), but with the private one it fails without any error inside the DC/OS installation.
The job gets executed but fails immediately. The error log of the job inside Chronos looks like this:
I1212 12:39:11.141639 25058 fetcher.cpp:498] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/61d6d037-c9f5-482b-a441-11d85554461b-S1\/root","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"executable":false,"extract":false,"value":"file:\/\/\/images\/docker.tar.gz"}}],"sandbox_directory":"\/var\/lib\/mesos\/slave\/slaves\/61d6d037-c9f5-482b-a441-11d85554461b-S1\/docker\/links\/7029bbea-4c3d-439a-8720-411f6fe40eb9","user":"root"}
I1212 12:39:11.143575 25058 fetcher.cpp:409] Fetching URI 'file:///images/docker.tar.gz'
I1212 12:39:11.143587 25058 fetcher.cpp:250] Fetching directly into the sandbox directory
I1212 12:39:11.143602 25058 fetcher.cpp:187] Fetching URI 'file:///images/docker.tar.gz'
I1212 12:39:11.143612 25058 fetcher.cpp:167] Copying resource with command:cp '/images/docker.tar.gz' '/var/lib/mesos/slave/slaves/61d6d037-c9f5-482b-a441-11d85554461b-S1/docker/links/7029bbea-4c3d-439a-8720-411f6fe40eb9/docker.tar.gz'
I1212 12:39:11.146726 25058 fetcher.cpp:547] Fetched 'file:///images/docker.tar.gz' to '/var/lib/mesos/slave/slaves/61d6d037-c9f5-482b-a441-11d85554461b-S1/docker/links/7029bbea-4c3d-439a-8720-411f6fe40eb9/docker.tar.gz'
Stdout is empty. Executed directly inside Marathon as an application with the same settings, the authentication works and my image is downloaded and executed. Is this something that Chronos does not support? It should... I mean, it has options for Docker...
Update: digging deeper into the agent logs I found this:
Failed to run 'docker -H unix:///var/run/docker.sock pull quay.io/username/container': exited with status 1; stderr='Error: Status 403 trying to pull repository username/container: "{\"error\": \"Permission Denied\"}"
I tried the archive with its config.json file on the agent itself, and it can download when triggered from the command line. I just can't understand why Chronos is not using it properly. I can't find any other reference on how to supply my credentials other than this.
As it turns out, the uris parameter is deprecated in favor of fetch. I started from scratch with a Marathon config applied to Chronos and watched the logs carefully, and saw this: {'message': 'Tried to add both uri (deprecated) and fetch parameters on aBPepwhG5z33e4teG', 'status': 'Bad Request'}. Then I changed my uris parameter into:
"fetch": [{
"uri": "/images/docker.tar.gz",
"extract": true,
"executable": false,
"cache": false
}]
...and it worked.
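For reference, here is a sketch of how the job dictionary from the Python snippet above would look with the deprecated uris entry swapped for the fetch list. All values are taken from the question and the answer; the exact fetch semantics may differ between Chronos versions.

# Sketch: the job dict from add_job() above, with the deprecated "uris" key
# replaced by the "fetch" list from the answer. Values are taken from the question.
job = {
    'async': True,
    'name': job_document['_id'],
    'owner': 'owner@gmail.com',
    'command': "python /code/run.py",
    "disabled": False,
    "shell": True,
    "cpus": worker_document['cpus'],
    "disk": worker_document['disk'],
    "mem": worker_document['memory'],
    'schedule': 'R1//PT300S',
    "epsilon": "PT60M",
    "container": {
        "type": "DOCKER",
        "forcePullImage": True,
        "image": "quay.io/username/container",
        "network": "HOST",
        "volumes": [{"containerPath": "/images/", "hostPath": "/images/", "mode": "RW"}],
    },
    # "uris" removed; the credentials archive is fetched via "fetch" instead.
    "fetch": [{
        "uri": "/images/docker.tar.gz",
        "extract": True,
        "executable": False,
        "cache": False,
    }],
}
chronos_client.add(job)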
Your post looked a little like this one, which turned out to be a problem with volumes.

Connection lost whatever the long poll's settings with crossbar.io

I'm using Crossbar.io to test WebSockets and long polling.
But each time I try using long polling as the default transport, whatever settings I set, I get a "connection lost" every 2 seconds in my console.
It works perfectly with WebSocket, though.
Here are the settings I want to test:
On the server side:
{
  "lp": {
    "type": "longpoll",
    "options": {
      "request_timeout": 0,
      "session_tiemout": 0,
      "queue_limit_bytes": 0,
      "queue_limit_messages": 0
    }
  }
}
On the client side:
var connection = new autobahn.Connection({
  transports: [{
    url: [my url],
    type: "longpoll",
    max_retries: 1,
    initial_retry_delay: 1,
    retry_delay_growth: 3,
    retry_delay_jitter: 3
  }], ...
I'm using Python on the server side and Chrome 43 as the default browser (also tested on Firefox).
Is something wrong in my settings?
Sorry, I cannot replicate this. I'm using the longpoll example (https://github.com/crossbario/crossbarexamples/tree/master/longpoll) and have modified the config and the connection data to mirror what you list here. (I assume that the "tiemout" is just a typo here, since Crossbar.io doesn't start with this.)
This works fine in Chrome 43.
My best guess is that the problem is with something you didn't list.
My suggestion: Start from the example, and see whether this works for you.
