How to load results of bash command into gcs using Airflow? - bash

I am trying to save the JSON output of a bash command to a GCS bucket. When I run the command in my local terminal, everything works and the data is loaded into GCS. The same command does not work through Airflow: Airflow marks the task as successful, but the file in GCS is empty. I suspect Airflow is running out of memory, but I am not sure. If so, can someone explain how and where the results are stored in Airflow? The BashOperator documentation says that Airflow creates a temporary directory that is cleaned up after execution. Does that mean the results of the bash command are cleaned up as well? Is there any way to save the results in GCS?
This is my dag:
get_data = BashOperator(
    task_id='get_data',
    # Double quotes outside, so the single-quoted curl header does not end the Python string.
    bash_command="curl -X GET -H 'XXX: xxx' some_url | gsutil cp -L manifest.txt - gs://bucket/folder1/filename.json; rm manifest.txt",
    dag=dag
)
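One possible explanation for the "successful" task, offered as a hedged sketch rather than a confirmed diagnosis: a shell command list reports the exit status of its last command, which here is rm manifest.txt, so a failing curl or gsutil stage would go unnoticed by whatever runs the command. The bash sketch below uses two hypothetical stand-in functions, fetch_data (in place of curl) and upload_data (in place of gsutil), to show how set -o pipefail makes such a failure visible.

```shell
#!/bin/bash
# Hypothetical stand-ins so the sketch runs without network access.
fetch_data()  { return 1; }           # plays the role of a failing curl: no output, non-zero status
upload_data() { cat > /dev/null; }    # plays the role of gsutil reading from stdin

# Same shape as the original command: pipeline, then cleanup.
run_without_pipefail() (
    set +o pipefail
    fetch_data | upload_data; true    # "true" stands in for: rm manifest.txt
)

# With pipefail, the pipeline status reflects the failing stage, and we pass it on.
run_with_pipefail() (
    set -o pipefail
    fetch_data | upload_data
    status=$?
    true                              # cleanup still runs
    exit "$status"
)
```

In the DAG, the equivalent would be to start bash_command with set -o pipefail; and to add --fail to curl so that HTTP errors produce a non-zero status. Both are suggestions to try, not a verified fix.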

Related

PSQL /copy :variable substitution not working | Postgresql 11

I'm trying to read a CSV file and write its contents into a table. The CSV file is on my local machine (the client). I used the \copy command and it works, but the file path is hardcoded in the SQL script. I want to parameterise the CSV file path.
Based on my analysis, \copy does not support :variable substitution, but I'm not sure.
I believe this can be achieved with shell variables, but when I tried, it did not work as expected.
Following are my sample scripts
command:
psql -U postgres -h localhost testdb -a -f '/tmp/psql.sql' -v path='"/tmp/userData.csv"'
psql script:
\copy test_user_table('username','dob') from :path DELIMITER ',' CSV HEADER;
I run these commands from the shell and get a "no such file" error, but the same script works with a hardcoded path.
Can anyone advise me on this?
Reference :
Variable substitution in psql \copy
https://www.postgresql.org/docs/devel/app-psql.html
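I am new to Bash, so this problem is quite hard for me.
I can do it in a single shell script. Maybe later I can split it into two scripts.
The following is a simple one-script version.
#!/bin/bash
# Wrap the path in single quotes so it reaches SQL as one string literal, spaces included.
p=\'"/mnt/c/Users/JIAN HE/Desktop/test.csv"\'
# Note: plain "copy" runs on the server and reads the path there; "\copy" reads the client's file.
c="copy emp from ${p}"
a=${c}
echo "$a"
psql -U postgres -d postgres -c "${a}"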
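The psql documentation states that the arguments of \copy are not subject to variable interpolation, which is why :path reaches \copy unexpanded. A common workaround, sketched below in bash, is to let the shell expand the path and pass the finished \copy line to psql with -c. The helper name build_copy_command is hypothetical, and the column names are written unquoted here as an assumption.

```shell
#!/bin/bash
# Hypothetical helper: print a client-side \copy command for the given CSV path.
build_copy_command() {
    local csv_path=$1
    local escaped
    # Double any single quote in the path so it stays one quoted string.
    escaped=$(printf '%s' "$csv_path" | sed "s/'/''/g")
    printf '%s\n' "\\copy test_user_table(username, dob) from '${escaped}' DELIMITER ',' CSV HEADER"
}

# Usage (connection options taken from the question):
#   psql -U postgres -h localhost testdb -c "$(build_copy_command /tmp/userData.csv)"
```

Because psql -c accepts a single backslash command, the whole \copy line can be handed over in one argument, and the SQL script no longer needs the path at all.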

execute aws command in script with sudo

I am running a bash script with sudo and have tried the command below, but I get the error shown below when the script calls aws cp. I think the problem is that the script looks for the config in /root, which does not exist. However, doesn't -E preserve the original location? Is there an option for aws cp that passes the location of the config? Thank you :).
sudo -E bash /path/to/.sh
- inside of this script is `aws cp`
Error
The config profile (name) could not be found
I have also tried to `export` the profile name and to `source` the path to the `config`.
You can run the command as the original user, like this:
sudo -u $SUDO_USER aws cp ...
You could also run the script with source instead of bash. Sourcing runs the script in the same shell as your open terminal window, which keeps the same environment (such as the user). That said, @Philippe's answer is the better, more correct one.
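On the question of passing the config location: the AWS CLI reads the AWS_CONFIG_FILE and AWS_SHARED_CREDENTIALS_FILE environment variables, so a script run under sudo can point them at the invoking user's files. A minimal bash sketch, assuming the files live in the default .aws directory of the user named in SUDO_USER and that getent is available (Linux); the helper name invoking_home is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the home directory of the user who invoked sudo.
# Falls back to $HOME when the script is not running under sudo.
invoking_home() {
    if [ -n "${SUDO_USER:-}" ]; then
        # Field 6 of the passwd entry is the home directory.
        getent passwd "$SUDO_USER" | cut -d: -f6
    else
        printf '%s\n' "$HOME"
    fi
}

user_home=$(invoking_home)

# Point the AWS CLI at that user's files instead of /root/.aws.
export AWS_CONFIG_FILE="$user_home/.aws/config"
export AWS_SHARED_CREDENTIALS_FILE="$user_home/.aws/credentials"

# The script's own aws commands would follow here and pick up the variables above.
```

This keeps the rest of the script running as root, which may matter if other steps need elevated privileges.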

unrecognized arguments when executing script via crontab

I have my crontab set up as follows (this is inside a docker container).
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
SHELL=/bin/bash
5 * * * * bash /usr/local/bin/process-logs > /proc/1/fd/1 2>/proc/1/fd/
The /usr/local/bin/process-logs is designed to expose some MongoDB logs using mtools to a simple web server.
The problematic part of the script is fairly simple. raw_name is archive_name without the file extension.
aws s3 cp "s3://${s3_bucket}/${file_name}" "${archive_name}"
gunzip "${archive_name}"
mlogvis --no-browser "${raw_name}"
If I manually run the command as specified in the crontab config above
bash /usr/local/bin/process-logs > /proc/1/fd/1 2>/proc/1/fd/2
It all works as expected (this is the expected output from mlogvis)
...
copying /usr/local/lib/python3.5/dist-packages/mtools/data/index.html to /some/path/mongod.log-20190313-1552456862.html
...
When the script gets triggered via crontab it throws the following error
usage: mlogvis [-h] [--version] [--no-progressbar] [--no-browser] [--out OUT]
[--line-max LINE_MAX]
mlogvis: error: unrecognized arguments: mongod.log-20190313-1552460462
The mlogvis command that caused this error (with actual values rather than parameters):
mlogvis --no-browser "mongod.log-20190313-1552460462"
Again if I run this command myself it all works as expected.
mlogvis: http://blog.rueckstiess.com/mtools/mlogvis.html
I don't believe this to be an issue with the file not having correct permissions or not existing as mlogvis produces a different error in these conditions. I've also tested with removing '-' from the file name thinking it might be trying to parse these as arguments but it made no difference.
I know cron does not have the same execution environment as the user I tested the script as. I've set the PATH to be the same as the user's, and when the container starts up I execute env >> /etc/environment so that all the environment variables are properly set.
Does anyone know of a way to debug this or has anyone encountered similar? All other components of the script are functioning except mlogvis which is core to the purpose of this job.
Summary of what I've tried as a fix:
Set environment and PATH for cron execution to be the same as the user I tested the script as
Replace - in file name(s) to see if it was parsing the parts as arguments
hardcode a filename with full permissions to see if it was permissions related
Manually run the script -> this works
Manually run the mlogvis command in isolation -> this works
Try loading /home/user/.bash_profile before executing the script, then run it again. I suspect that PATH or another environment variable is missing or not set.
source /home/user/.bash_profile
Please post your complete script, because while executing via crontab,
you have to be sure your raw_name variable was properly created. As
it seems to depend on archive_name, posting some more context can
help us to help you.
In any case, if you are using bash, you can try something like:
aws s3 cp "s3://${s3_bucket}/${file_name}" "${archive_name}"
gunzip "${archive_name}"
# Here you have to be sure that archive_name is correct.
# Strip only the last extension: %%.* would cut at the first dot and leave just "mongod".
raw_name_2=${archive_name%.*}
mlogvis --no-browser "${raw_name_2}"
It is not going to solve your issue, but probably will take you closer to the right path.
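The difference between the two suffix-removal forms matters for the file names in this question, because they contain more than one dot. A small bash sketch, assuming the archive carries a .gz suffix (suggested by the gunzip call, but not stated in the question):

```shell
#!/bin/bash
# Assumed archive name: the raw name from the error message plus a .gz suffix.
archive_name="mongod.log-20190313-1552460462.gz"

shortest=${archive_name%.*}     # removes the shortest trailing match of ".*"
longest=${archive_name%%.*}     # removes the longest trailing match of ".*"
explicit=${archive_name%.gz}    # removes exactly the ".gz" suffix

echo "shortest: $shortest"      # shortest: mongod.log-20190313-1552460462
echo "longest:  $longest"       # longest:  mongod
echo "explicit: $explicit"      # explicit: mongod.log-20190313-1552460462
```

Echoing raw_name from inside the cron run and comparing it with these values is a cheap way to check whether the variable holds what mlogvis is expected to receive.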

Referencing a file in an aws emr script run by script-runner.jar

I am creating an Amazon EMR cluster where one of the steps is a bash script run by script-runner.jar:
aws emr create-cluster ... --steps '[ ... {
"Args":["s3://bucket/scripts/script.sh"],
"Type":"CUSTOM_JAR",
"ActionOnFailure":"TERMINATE_CLUSTER",
"Jar":"s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar"
}, ... ]'...
as described in https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-script.html
script.sh needs other files in its commands: think awk ... -f file, sed ... -f file, psql ... -f file, etc.
On my laptop with both script.sh and files in my working directory, everything works just fine. However, after I upload everything to s3://bucket/scripts, the cluster creation fails with:
file: No such file or directory
Command exiting with ret '1'
I have found the workaround posted below, but I don't like it for the reasons specified. If you have a better solution, please post it, so that I can accept it.
I am using the following workaround in script.sh:
# Download the SQL file to a tmp directory.
tmpdir=$(mktemp -d "${TMPDIR:-/tmp/}$(basename "$0").XXXXXXXXXXXX")
aws s3 cp s3://bucket/scripts/file "${tmpdir}"
# Run my command
xxx -f "${tmpdir}/file"
# Clean up
rm -r "${tmpdir}"
This approach works but:
Running script.sh locally means that I have to upload file to s3 first, which makes development harder.
There are actually a few files involved...
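One way to soften the first drawback, sketched here as an assumption rather than a tested EMR recipe: look for the file in a local directory first and download it from S3 only when it is missing. The helper name resolve_support_file and its arguments are hypothetical; aws s3 cp is the same call used in the workaround.

```shell
#!/bin/bash
# Hypothetical helper: print a usable local path for a support file.
#   $1 = file name, $2 = directory to check first, $3 = S3 prefix to fall back to.
resolve_support_file() {
    local name=$1 local_dir=$2 s3_prefix=$3
    local tmpdir

    # Local development: the file sits in the given directory, so use it directly.
    if [ -f "$local_dir/$name" ]; then
        printf '%s\n' "$local_dir/$name"
        return 0
    fi

    # Otherwise download it. Progress output goes to stderr so that
    # stdout carries only the resulting path.
    tmpdir=$(mktemp -d) || return 1
    aws s3 cp "$s3_prefix/$name" "$tmpdir/$name" >&2 || return 1
    printf '%s\n' "$tmpdir/$name"
}

# Usage inside script.sh (xxx is the placeholder command from the question):
#   sql_file=$(resolve_support_file file "$(dirname "$0")" s3://bucket/scripts) && xxx -f "$sql_file"
```

Calling the helper once per file also covers the case where several files are involved. Cleaning up the temporary directory is left to the caller in this sketch.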

run .sh script via Jenkins to execute aws command error

My problem is that I am trying to run a shell script via Jenkins that copies files built by MSBuild to AWS S3.
When I add a new "Execute shell" build step and set it to run the script with the command sh publishS3.sh, nothing happens and the files do not appear in the S3 bucket.
My Jenkins runs on a local Windows Server.
When I run the script myself by typing sh publishS3.sh in the Jenkins local directory, everything works and the files are copied successfully to the S3 bucket, but when I run it from Jenkins nothing happens. My publishS3.sh script is:
#!/bin/bash
aws s3 cp Com.VistaDraft.Common.dll s3://download.vistadraft.com/MVP
I tried to check the output by appending > output.txt to the command, but Jenkins generates an empty file. If I do the same locally, I get a message that the files were copied successfully to S3. I set the Jenkins shell executable path to C:\Program Files\Git\git-bash.exe, and I use git-bash.exe locally too. Does anyone know where the problem is? Please advise.
You could try to add -ex in the first line of the script to allow you to see what it's doing and ease the debugging:
#!/bin/bash -ex
# rest of script
Make sure the aws tool is in the PATH of the environment where Jenkins runs your script. It might help if you specify full path to the command.
You could put which aws in your script to see what's going on.
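Two things are worth checking here, offered as a hedged sketch. First, > output.txt captures only standard output, so an error message written to standard error would leave the file empty. Second, command -v shows whether aws is visible on the PATH that Jenkins uses. The helper name check_tool is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: report where a command resolves from,
# or explain on stderr that it is missing from PATH.
check_tool() {
    local tool=$1
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool found at: $(command -v "$tool")"
    else
        echo "$tool not found in PATH=$PATH" >&2
        return 1
    fi
}

# Usage inside publishS3.sh: capture stdout and stderr together.
#   check_tool aws > output.txt 2>&1
#   aws s3 cp Com.VistaDraft.Common.dll s3://download.vistadraft.com/MVP >> output.txt 2>&1
```

If output.txt then shows "aws not found", the fix is on the Jenkins side (PATH or a full path to the command) rather than in the script.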
