curl command with BashOperator in Cloud Composer - bash

I am following the tutorial mentioned in this link - download_rocket_launches.py. Since I am running this in Cloud Composer, I want to use the native path, i.e. /home/airflow/gcs/dags, but it fails with a path-not-found error.
What path can I give for this command to work? Here is the task I am trying to execute -
download_launches = BashOperator(
    task_id="download_launches",
    bash_command="curl -o /tmp/launches.json -L 'https://ll.thespacedevs.com/2.0.0/launch/upcoming'",  # noqa: E501
    dag=dag,
)

This worked on my end:
import json
import pathlib
import airflow.utils.dates
import requests
import requests.exceptions as requests_exceptions
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
dag = DAG(
    dag_id="download_rocket_launches",
    description="Download rocket pictures of recently launched rockets.",
    start_date=airflow.utils.dates.days_ago(14),
    schedule_interval="@daily",
)

download_launches = BashOperator(
    task_id="download_launches",
    bash_command="curl -o /home/airflow/gcs/data/launches.json -L 'https://ll.thespacedevs.com/2.0.0/launch/upcoming' ",  # put a space between the single quote and the double quote
    dag=dag,
)

download_launches
The key was to put a space between the single quote ' and the double quote " at the end of the bash command.
Also, it is recommended to use the data folder for your output file, as stated in the GCP documentation:
gs://bucket-name/data -> /home/airflow/gcs/data: stores the data that tasks produce and use. This folder is mounted on all worker nodes.
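For reference, once the file lands in the data folder, a downstream task can read it back from the same mounted path on any worker. Here is a minimal sketch (the task id, the _print_launch_count helper, and the assumption that the API response contains a results list are mine, not from the tutorial):

import json
from airflow.operators.python import PythonOperator

def _print_launch_count():
    # /home/airflow/gcs/data is backed by gs://bucket-name/data on every worker node
    with open("/home/airflow/gcs/data/launches.json") as f:
        launches = json.load(f)
    # Assumes the Launch Library response is a JSON object with a "results" list
    print(f"Found {len(launches['results'])} upcoming launches.")

print_launch_count = PythonOperator(
    task_id="print_launch_count",
    python_callable=_print_launch_count,
    dag=dag,
)

download_launches >> print_launch_count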

Related

Let go tool pprof collect new data periodically

I'm using go pprof like this:
go tool pprof -no_browser -http=0.0.0.0:8081 http://localhost:6060/debug/pprof/profile?seconds=60
How can I ask pprof to fetch the profiling data periodically?
Here's a Python script that uses wget to grab the data every hour, putting the output into a file whose name includes the timestamp.
Each file can be inspected by running
go tool pprof pprof_data_YYYY-MM-DD_HH
Here's the script:
import subprocess
import time
from datetime import datetime

while True:
    now = datetime.now()
    sleepTime = 3601 - (60 * now.minute + now.second + 1e-6 * now.microsecond)
    time.sleep(sleepTime)
    now = datetime.now()
    tag = f"{now.year}-{now.month:02d}-{now.day:02d}_{now.hour:02d}"
    subprocess.run(["wget", "-O", f"pprof_data_{tag}", "-nv", "-o", "/dev/null", "http://localhost:6060/debug/pprof/profile?seconds=60"])
The 3601 makes wget run about one second after the top of the hour, to avoid the race condition where time.sleep() returns just before the top of the hour.
You could obviously write a similar script in bash or your favorite language.
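For example, here is a dependency-free variant of the same idea using only the Python standard library instead of shelling out to wget (a sketch; the URL and file naming follow the script above):

import time
import urllib.request
from datetime import datetime

URL = "http://localhost:6060/debug/pprof/profile?seconds=60"

while True:
    now = datetime.now()
    # Wake up about one second after the top of the hour, as in the wget version.
    time.sleep(3601 - (60 * now.minute + now.second + 1e-6 * now.microsecond))
    now = datetime.now()
    tag = f"{now.year}-{now.month:02d}-{now.day:02d}_{now.hour:02d}"
    urllib.request.urlretrieve(URL, f"pprof_data_{tag}")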

Extract and Add specific ARN value into a command repeatedly in Bash Script

I have a bash script that runs an AWS CLI command to grab the list of Secrets Manager secret ARNs. Part of the script:
export SECRETS_ARN_LIST=$(cat terraform_secrets.json | cut -b 21-120 | cut -f1 -d"," | sed 's/.$//')
The value of $SECRETS_ARN_LIST is something like:
arn:aws:secretsmanager:us-east-1:1244:secret:/test/app/xxxx1
arn:aws:secretsmanager:us-east-1:1244:secret:/test/app/xxxx2
arn:aws:secretsmanager:us-east-1:1244:secret:/test/app/xxxx3
.
.
.
Now, I want to run the terraform import command, taking the value of each secret ARN one by one and putting it into:
terraform import 'module.env-vars.aws_secretsmanager_secret.secrets_manager_env_vars["secret_name"]' arn:aws:secretsmanager:us-east-1:xxxx1
so that it then runs:
terraform import 'module.env-vars.aws_secretsmanager_secret.secrets_manager_env_vars["secret_name"]' arn:aws:secretsmanager:us-east-1:xxxx2
and so on for the ARN ending in xxxx3.
How do I tell bash to take those ARN values one by one and substitute each of them in place of the ARN?
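One way to sketch the loop (an illustration, not a tested answer): since the rest of this page leans on Python, here is a minimal Python version that reads the newline-separated ARNs and shells out to terraform import once per ARN. The input file name and the way the secret name is derived from the ARN (everything after ':secret:') are assumptions; adjust them to match how your module keys its secrets. A plain bash while read loop over $SECRETS_ARN_LIST works the same way.

import subprocess

# Assumed input: one ARN per line, i.e. the contents of $SECRETS_ARN_LIST saved to a file.
with open("secrets_arn_list.txt") as fp:
    arns = [line.strip() for line in fp if line.strip()]

for arn in arns:
    # Assumption: the module keys each secret by the path after ':secret:'.
    secret_name = arn.split(":secret:", 1)[1]
    address = f'module.env-vars.aws_secretsmanager_secret.secrets_manager_env_vars["{secret_name}"]'
    subprocess.run(["terraform", "import", address, arn], check=True)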

Running a Python script under a service account using Windows Task Scheduler

NOTE 1: All files run and fetch correct results when started from cmd under my own profile, but not with Windows Task Scheduler.
NOTE 2: I finally got a lead that glob.glob and os.listdir are not working under Windows Task Scheduler in my Python script (which connects to a remote server), although they work locally from cmd and PyCharm.
print("before for loop::", os.path.join(file_path, '*'))
print(glob.glob(os.path.join(file_path, '*')))
for filename in glob.glob(os.path.join(file_path, '*')):
    print("after for loop")
While running the above .py script I got: before for loop:: c:\users\path\dir\*
Executing print(glob.glob(os.path.join(file_path, '*'))) gives "[]" and I am not able to find out why.
I followed this Stack Overflow link for setting up Windows Task Scheduler for Python, referring to MagTun's comment: Scheduling a .py file on Task Scheduler in Windows 10.
Currently, I have scheduler.py, which calls the four other .py files.
When I run scheduler.py from Windows Task Scheduler, it runs scheduler.py, then after 1 minute it runs the other four .py files and exits within seconds, giving no output in Elasticsearch.
I used this for cmd script:
@echo off
cmd /k "cd /d D:\folder\env\Scripts\ & activate & cd /d D:\folder\files & python scheduler.py" >> open.log
timeout /t 15
With the above cmd script, nothing is saved to open.log when it runs from Windows Task Scheduler.
The script that schedules the other .py files as subprocesses looks like this:
from apscheduler.schedulers.blocking import BlockingScheduler
from subprocess import call
import os

def a():
    call(['python', r'C:\Users\a.py'])

def b():
    call(['python', r'C:\Users\b.py'])

def c():
    call(['python', r'C:\Users\c.py'])

def d():
    call(['python', r'C:\Users\d.py'])

if __name__ == '__main__':
    scheduler = BlockingScheduler()
    scheduler.add_job(a, 'interval', minutes=1)
    scheduler.add_job(b, 'interval', minutes=2)
    scheduler.add_job(c, 'interval', minutes=1)
    scheduler.add_job(d, 'interval', minutes=2)
    print('Press Ctrl+{0} to exit'.format('Break' if os.name == 'nt' else 'C'))
    try:
        scheduler.start()
        print("$$$$$$$$$$$$$$$$$$")
    except (KeyboardInterrupt, SystemExit):
        print("****#####")
        pass
I was having the same bizarre issue: it works like a charm when run as a user, but as a Windows scheduled task the glob query returns no results.
Edit: I was using a network share via its mapped drive letter. It only works when using the full UNC path (including the server name).
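To illustrate the difference (the drive letter, server, and share names below are made up):

import glob
import os

# Mapped drive letter: the mapping often does not exist in the scheduled task's
# session, so glob can silently return [] (hypothetical path).
file_path = r"Z:\reports"
print(glob.glob(os.path.join(file_path, "*")))

# Full UNC path including the server name: resolvable regardless of per-user
# drive mappings (hypothetical server/share).
file_path = r"\\fileserver01\reports"
print(glob.glob(os.path.join(file_path, "*")))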

boto3 - base64 encoded lifecycle configuration produces instance failure

I am trying to set up lifecycle configurations for SageMaker notebooks over the AWS API via boto3. The docs say that a base64-encoded string of the configuration has to be provided.
I am using the following code:
import base64
import boto3

with open('lifecycleconfig.sh', 'rb') as fp:
    file_content = fp.read()
config_string = base64.b64encode(file_content).decode('utf-8')

boto3.client('sagemaker').create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName='mylifecycleconfig1',
    OnCreate=[
        {
            'Content': config_string
        },
    ],
)
With some lifecycleconfig.sh:
#!/bin/bash
set -e
This creates a lifecycle configuration which shows up in the web interface and whose content looks identical to that of a config created by hand.
However Notebooks using the lifecycle config created via boto3 will not start and the log file will show error:
/home/ec2-user/SageMaker/create_script.sh: line 2: $'\r': command not found
/home/ec2-user/SageMaker/create_script.sh: line 3: set: -
set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...]
Moreover, if I copy-paste the content of the corrupted config and create a new config by hand, the new one also fails to start.
How do I have to encode a bash script for a working aws lifecycle configuration?
It turned out to be a Windows-specific problem concerning the difference between open(..., 'rb').read() and open(..., 'r').read().encode('utf-8').
On my Linux machine these two give the same result. On Windows, however, open(..., 'rb') yields \r\n line endings, which the Amazon web interface apparently copes with, but not the Linux machine where the script gets deployed.
This is an OS-independent solution:
import base64

with open('lifecycleconfig.sh', 'r') as fp:
    file_content = fp.read()
config_string = base64.b64encode(file_content.encode('utf-8')).decode('utf-8')
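Equivalently, if you prefer to keep reading in binary mode, normalizing the Windows line endings explicitly before encoding also works (a sketch, assuming the same lifecycleconfig.sh file):

import base64

with open('lifecycleconfig.sh', 'rb') as fp:
    file_content = fp.read().replace(b'\r\n', b'\n')  # strip Windows CRLF line endings
config_string = base64.b64encode(file_content).decode('utf-8')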

SpaCy model won't load in AWS Lambda

Has anyone gotten SpaCy 2.0 to work in AWS Lambda? I have everything zipped and packaged correctly, since I can get a generic string to return from my lambda function if I test it. But when I do the simple function below to test, it stalls for about 10 seconds and then returns empty, and I don't get any error messages. I did set my Lambda timeout at 60 seconds so that isn't the problem.
import spacy
nlp = spacy.load('en_core_web_sm')  # model package included

def lambda_handler(event, context):
    doc = nlp(u'They are')
    msg = doc[0].lemma_
    return msg
When I load the model package without using it, it also returns empty, but if I comment it out it sends me the string as expected, so it has to be something about loading the model.
import spacy
nlp = spacy.load('en_core_web_sm')  # model package included

def lambda_handler(event, context):
    msg = 'message returned'
    return msg
To optimize model loading you have to store the model on S3, download it with your own script into the /tmp folder of the Lambda, and then load it into spaCy from there.
It takes about 5 seconds to download it from S3 and run. A good optimization here is to keep the model on a warm container and check whether it was already downloaded; on a warm container the code takes about 0.8 seconds to run.
Here is the link to the code and package with example:
https://github.com/ryfeus/lambda-packs/blob/master/Spacy/source2.7/index.py
import spacy
import boto3
import os

def download_dir(client, resource, dist, local='/tmp', bucket='s3bucket'):
    paginator = client.get_paginator('list_objects')
    for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
        if result.get('CommonPrefixes') is not None:
            for subdir in result.get('CommonPrefixes'):
                download_dir(client, resource, subdir.get('Prefix'), local, bucket)
        if result.get('Contents') is not None:
            for file in result.get('Contents'):
                if not os.path.exists(os.path.dirname(local + os.sep + file.get('Key'))):
                    os.makedirs(os.path.dirname(local + os.sep + file.get('Key')))
                resource.meta.client.download_file(bucket, file.get('Key'), local + os.sep + file.get('Key'))

def handler(event, context):
    client = boto3.client('s3')
    resource = boto3.resource('s3')
    if not os.path.isdir("/tmp/en_core_web_sm"):
        download_dir(client, resource, 'en_core_web_sm', '/tmp', 'ryfeus-spacy')
    spacy.util.set_data_path('/tmp')
    nlp = spacy.load('/tmp/en_core_web_sm/en_core_web_sm-2.0.0')
    doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
    for token in doc:
        print(token.text, token.pos_, token.dep_)
    return 'finished'
P.S. To package spacy within AWS Lambda you have to strip shared libraries.
Knew it was probably going to be something simple. The answer is that there wasn't enough memory allocated to run the Lambda function - I found that I had to increase it to 2816 MB, near the maximum, to get the example above to work. It is notable that before last month it wasn't possible to go this high:
https://aws.amazon.com/about-aws/whats-new/2017/11/aws-lambda-doubles-maximum-memory-capacity-for-lambda-functions/
I turned it up to the max of 3008 MB to handle more text and everything seems to work just fine now.
What worked for me was cd'ing into <YOUR_ENV>/lib/python<VERSION>/site-packages/ and removing the language models I didn't need. For example, I only needed the English language model, so once in my own site-packages directory I just needed to run ls -d */ | grep -v en | xargs rm -rf, and then zip up the contents to get it under Lambda's limits.
