I have several multi-purpose shell scripts stored in .sh files. My intention is to build a few Airflow DAGs on Cloud Composer that will leverage these scripts. The DAGs would be made mostly of BashOperators that call the scripts with specific arguments.
Here's a simple example, greeter.sh:
#!/bin/bash
echo "Hello, $1!"
I can run it locally like this:
bash greeter.sh world
> Hello, world!
Let's write a simple DAG:
# import and define default_args
dag = DAG('bash_test',
          description='Running a local bash script',
          default_args=default_args,
          schedule_interval='0,30 5-23 * * *',
          catchup=False,
          max_active_runs=1)

bash_task = BashOperator(
    task_id='run_command',
    bash_command=f"bash greeter.sh world",
    dag=dag
)
But where to put the script greeter.sh? I tried putting it both in the dags/ folder and the data/ folder, at the first level or nested within a dependencies/ directory. I also tried writing the path as ./greeter.sh. No luck: the file is never found.
I also tried using sh in place of bash and got a different error: sh: 0: Can't open greeter.sh. But that error also appears when the file simply isn't there, so it's the same underlying issue. Running chmod +rx first makes no difference either.
How can I make my file available to Airflow?
The comments on this question revealed the answer.
The path to the dags_folder is stored in the DAGS_FOLDER environment variable.
To get the right path for a script stored in dags_folder/:

import os

DAGS_FOLDER = os.environ["DAGS_FOLDER"]
file = f"{DAGS_FOLDER}/greeter.sh"
# then point the task at it: bash_command=f"bash {file} world"
I am at beginner level, and I am not sure if this is feasible or not. I have a bash script in a Bitbucket repo which does some kind of setup. To run it, I have to download it locally and run the .sh file. Is there any way I can run the script straight from the Bitbucket repo without downloading it?
You'll always need to download the file (i.e. retrieve it from the server), but you can build a pipeline that retrieves and executes it in one step. The simplest would be:
curl ${url} | bash
You'll need to locate the URL that serves the raw file (rather than the HTML web page). For Bitbucket this will look something like the following; you can substitute a branch or tag name for ${commit_id}.
https://bitbucket.org/${user}/${repo}/raw/${commit_id}/${file}
Beware, however, that this often raises eyebrows from a security point of view, especially if the file is retrieved over HTTP (rather than HTTPS), as you're essentially running unknown code on your computer. Using sudo in this pipeline is even more concerning.
The user needs to be prepared to trust whatever is stored in the repository, so make sure that you only allow trusted users to push (or merge), and make sure that you review changes to the file in question carefully.
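If you'd rather look before you leap, a middle-ground pattern (a sketch, reusing the same URL shape as above) is to download to a temporary file, inspect it, and only then execute it:

# fetch to a file first so it can be reviewed before running
curl -fsSL "https://bitbucket.org/${user}/${repo}/raw/${commit_id}/${file}" -o /tmp/setup.sh
less /tmp/setup.sh    # read what you are about to execute
bash /tmp/setup.sh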
You should also be aware that when running a script like this (equally for bash ${file} or bash < ${file}), the shebang will not be respected; it will just be seen as a comment and ignored.
If, for example, your script begins as below (-e to exit on error, and -u to treat undefined variables as an error), then these flags will not be set.
#!/bin/bash -eu
# ... body of script ...
When "executing" the file directly (i.e: chmod +x ./my_script.sh, ./my_script.sh), the kernel process the shebang and invokes /bin/bash -eu... but when executing the script via one of the above methods, the bash invocation is in the pipeline.
Instead, it is preferable to set these flags in the body of your script, so that the method of execution doesn't matter:
#!/bin/bash
set -eu
# ... body of script ...
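You can see the difference for yourself with a tiny test script (my_script.sh is a hypothetical name here):

#!/bin/bash -eu
echo "$UNDEFINED_VAR"    # -u should make this a fatal error

Executed directly as ./my_script.sh, the shebang is honoured and the script dies with an "unbound variable" error; run as bash my_script.sh or via the curl pipeline, the flags are lost and it happily prints an empty line.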
I have a shell script that should run a Pentaho transformation job, but it fails with the following error:
/data/data-integration/spoon.sh: 1: /data/data-integration/spoon.sh: ldconfig: not found
Here's the shell script which sits in:
/home/tureprw01/
and the script:
#!/bin/sh
NOW=$(date +"%Y%m%d_%H%M%S")
/data/data-integration/./pan.sh -file=/data/reporting_scripts/op/PL_Op.ExtlDC.kjb >> /home/tureprw01/logs/PL_Op.ExtDC/$NOW.log
I'm completely green in terms of Java but need to make it work somehow
Using command-line execution for Pan / Kitchen is simple; the documentation for command-line execution of Pan / Kitchen should help you build the Batch/SH command and make it work.
Though I see you are creating a variable on the command line; personally I don't know whether the Batch/SH variable is passed down correctly to the PDI parameters. You'd have to test that yourself, or define the variable within the PDI structure rather than as a named parameter.
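For example, one way to test whether the shell value actually reaches PDI is to pass it explicitly as a named parameter (a sketch; RUN_TS is a hypothetical parameter name the job itself would need to declare):

#!/bin/sh
NOW=$(date +"%Y%m%d_%H%M%S")
# hand the shell value to the job as an explicit PDI named parameter
/data/data-integration/pan.sh -file=/data/reporting_scripts/op/PL_Op.ExtlDC.kjb \
    "-param:RUN_TS=$NOW" >> /home/tureprw01/logs/PL_Op.ExtDC/$NOW.log 2>&1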
Use this:
#!/bin/sh
NOW=$(date +"%Y%m%d_%H%M%S")
cd /data/reporting_scripts/op/
/data/data-integration/spoon.sh -main org.pentaho.di.pan.Pan -initialDir /data/data-integration -file=/data/reporting_scripts/op/PL_Op.ExtlDC.kjb
#!/bin/bash
# for jobs; to run a transformation instead, change
# "org.pentaho.di.kitchen.Kitchen" to "org.pentaho.di.pan.Pan" and point it at the .ktr file
export PENTAHO_JAVA_HOME=/root/app/jdk1.8.0_91
export JAVA_HOME=/root/app/jdk1.8.0_91
cd /{kjb path}/;
/{spoon path}/spoon.sh -main org.pentaho.di.kitchen.Kitchen -initialDir /{kjb path}/ -file=/{kjb path}/{kjb file}.kjb -repo=/{kjb path}/{resource file}.xml -logfile=/{log file}.log -dir=/{kjb path}
I'm trying to run several sets of commands in parallel on a few remote hosts.
I've created a script that constructs these commands, and then writes the output in a local file, something along the lines of:
ssh <me>@<ip1> "command" 2> ./path/to/file/newFile1.txt &
ssh <me>@<ip2> "command" 2> ./path/to/file/newFile2.txt &
ssh <me>@<ip2> "command" 2> ./path/to/file/newFile3.txt
...(the same repeats itself, with new commands and new file names)...
My issue is that, when my script runs these commands, I am getting the following errors:
bash: ./path/to/file/newFile1.txt: No such file or directory
bash: ./path/to/file/newFile2.txt: No such file or directory
bash: ./path/to/file/newFile3.txt: No such file or directory
...
These files do not exist yet; they are the ones the redirections should create. The directory paths, however, are valid.
The strange thing is that if I copy and paste the whole long command, it works without any issue. I'd rather have it automated, though ;).
Any ideas?
Edit - more information:
My filesystem is the following:
- home
  - User
    - Desktop
      - Servers
        - Outputs
        - ...
I am running the bash script from home/User/Desktop/Servers.
The script creates the commands that need to be run on the remote servers. First of all, it creates the directories where the output files will be stored.
outputFolder="./Outputs"
...
mkdir -p ${outputFolder}/f${fileNumb}
...
The script then goes on to create the commands that will be called on the remote hosts; their respective outputs will be placed in the created directories.
The directories are there. Running the commands gives me the errors, yet printing the commands and copying them into a terminal at the same location works for some reason. I have also tried giving the full path to the directory; same issue.
Hope I've been a bit clearer.
If this is the exact error message you get:
bash:  ./path/to/file/newFile1.txt: No such file or directory
Then you'll note that there's an extra space between the colon and the dot, so it's actually trying to open a file called " ./path/to/file/newFile1.txt" (without the quotes).
However, to accomplish that, you'd need to use quotes around the filename in the redirection, as in
something ... 2> " ./path/to/file/newFile1.txt"
Or the first character would have to be something other than a regular space. A non-breaking space, perhaps: possibly something that some editor might produce if you hit alt-space or such.
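If you suspect an invisible character like that, you can make it visible (assuming GNU cat):

cat -A yourscript.sh    # non-printing characters become visible escapes

A UTF-8 non-breaking space, for instance, shows up as M-BM- in that output.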
I don't believe you've shown enough to correctly answer the question.
This doesn't look like a problem with ssh, but with the way you are calling the (ssh) commands.
You say that you are writing the commands into a file... presumably you are then running that file as a script. Could you show the code you use to do that? I believe that's where your problem is.
I suspect you have made a false assumption about how the working directory changes when you run a script: it doesn't. You are using relative paths, so it's important to know what they are relative to. That is also the most likely reason it works when you copy and paste it: you are executing from a different working directory.
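If the working directory is indeed the issue, one common fix (a sketch, since the calling code isn't shown) is to resolve all paths against the script's own location instead of the caller's current directory:

#!/bin/bash
# resolve the output directory relative to this script, not to wherever it was invoked from
script_dir="$(cd "$(dirname "$0")" && pwd)"
ssh <me>@<ip1> "command" 2> "${script_dir}/Outputs/f1/newFile1.txt"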
I am new to bash scripting and was building my script based on another one I had seen. I was "running" the command by simply calling the variable where the command was stored:
$cmd
Solved by using:
eval $cmd
instead. My bad, I should have given the full script from the start.
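For reference, a pattern that avoids eval in simple cases is to build the command as a bash array; note this only covers plain arguments, so redirections, pipes and & still have to be written literally (or you fall back to eval):

# build the argument vector as an array instead of a flat string
cmd=(ssh "<me>@<ip1>" "command")
"${cmd[@]}" 2> ./path/to/file/newFile1.txt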
I am using an EC2 instance, crontab, and slack-cleaner to delete all Slack messages older than 48 hours. To do this, I created delete_slack.sh (I've removed my Slack API token):
for CHANNEL in random general
do
slack-cleaner --token <MY TOKEN> --message --channel $CHANNEL --user "*" --before $(date -d '48 hour ago' "+%Y%m%d") --perform
done
Then I created a crontab line to run it every minute (once it works I'll change the timing to once a day) and had cron spit out the results to a log file:
* * * * * /home/ubuntu/delete_slack/delete_slack.sh >> /var/log/delete_slack.log 2>&1
To test, I ran sh /home/ubuntu/delete_slack/delete_slack.sh >> /var/log/delete_slack.log 2>&1 in the shell and it works fine. However, when I let the crontab run I get an error in the log file:
/home/ubuntu/delete_slack/delete_slack.sh: 3: /home/ubuntu/delete_slack/delete_slack.sh: slack-cleaner: not found
Any ideas? I've been banging my head against this all afternoon.
Sounds like the PATH you get via cron and the PATH you get through your login shell are different.
Either set the PATH in your script or use the absolute path to slack-cleaner.
The PATH tells the shell which directories to search for executables (including scripts). You can echo $PATH in both environments to compare your login path with the one cron provides and confirm that this is the issue.
If using the absolute path works, that is simplest; but if slack-cleaner invokes other executables itself, setting the PATH may be better.
If you want to go the "modify PATH" route, append the correct directory to the existing PATH rather than overwriting it completely, i.e. export PATH=$PATH:/path/to/slack-cleaner-dir. You can always use which slack-cleaner to find out the correct path. NOTE: you want the directory without "slack-cleaner" appended to the end.
ALWAYS use full paths in cron jobs and you'll save a lot of time.
If you don't like export PATH=..., then just call /path/to/slack-cleaner-dir/slack-cleaner directly instead.
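Putting that together, a minimal sketch of the fixed script, assuming which slack-cleaner reports /usr/local/bin/slack-cleaner (the location is a guess; check on your instance):

#!/bin/sh
# cron's PATH is minimal, so append the directory that holds slack-cleaner
PATH=$PATH:/usr/local/bin
export PATH

for CHANNEL in random general
do
    slack-cleaner --token <MY TOKEN> --message --channel $CHANNEL --user "*" --before $(date -d '48 hour ago' "+%Y%m%d") --perform
done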
Just load your profile before running the command, to be in exactly the same situation as when you launch it from your shell:
* * * * * . ~/.profile;/home/ubuntu/delete_slack/delete_slack.sh >> /var/log/delete_slack.log 2>&1
Since you mention you're a bit new to this, here are a few more words about the profile:
The profile is a file loaded automatically when you log in to a shell with your user.
The file is hidden in your home directory; to see it, you can run:
ls -la ~
If you're in bash, the file will be named .bash_profile; if you're in plain sh or ksh, it will be named .profile.
Hope it helped!
This keeps happening to me all the time:
1) I write a script(ruby, shell, etc).
2) run it, it works.
3) put it in crontab so it runs in a few minutes so I know it runs from there.
4) It doesn't: no error, no trace, back to step 2 or 3 a thousand times.
When my ruby script fails in crontab, I can't really tell why it fails, because when I pipe the output like this:
ruby script.rb >& /path/to/output
I sort of get the output of the script, but I don't get any of its errors, and I don't get the errors coming from bash (like if ruby is not found or the file isn't there).
I have no idea which environment variables are set and whether or not that's the problem. It turns out that to run a ruby script from crontab you have to export a ton of environment variables.
Is there a way for me to just have crontab run a script as if I ran it myself from my terminal?
When debugging, I have to reset the timer and go back to waiting. Very time-consuming.
How can I test things in crontab better, or avoid these problems altogether?
"Is there a way for me to just have crontab run a script as if I ran it myself from my terminal?"
Yes:
bash -li -c /path/to/script
From the man page:
[vindaloo:pgl]:~/p/test $ man bash | grep -A2 -m1 -- -i
-i If the -i option is present, the shell is interactive.
-l Make bash act as if it had been invoked as a login shell (see
INVOCATION below).
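In a crontab entry that would look something like this (paths illustrative):

* * * * * bash -li -c '/path/to/script >> /path/to/output 2>&1'

Be aware that -i without a terminal can make some bash versions print job-control warnings, so test it; often -l alone is enough to pull in your login environment.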
G'day,
One of the basic problems with cron is that it gives you a minimal environment. In fact, you only get four environment variables set, and they are:
SHELL - set to /bin/sh
LOGNAME - set to your userid as found in /etc/passwd
HOME - set to your home dir. as found in /etc/passwd
PATH - set to "/usr/bin:/bin"
That's it.
However, what you can do is take a snapshot of the environment you want and save it to a file.
Then make your cronjob source a trivial shell script that sources this env file and then executes your Ruby script; a sketch follows below.
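A minimal sketch of that approach (file names are illustrative): while logged in normally, save your environment once with

export -p > ~/cron_env.sh

and then point the cronjob at a wrapper like

#!/bin/bash
# restore the saved login environment, then hand off to the real script
. "$HOME/cron_env.sh"
exec ruby /path/to/script.rb

export -p writes the variables out as sourceable export statements, which is what keeps the wrapper this short.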
BTW, having a wrapper source a common env file is an excellent way to enforce a consistent environment for multiple cronjobs. It also honours the DRY principle, because it gives you just one point to update things as required, instead of having to search through a bunch of scripts for a specific string if, say, a logging location is changed or a different utility is now being used, e.g. gnutar instead of vanilla tar.
Actually, this technique is used very successfully with The Build Monkey which is used to implement Continuous Integration for a major software project that is common to several major world airlines. 3,500kSLOC being checked out and built several times a day and over 8,000 regression tests run once a day.
HTH
'Avahappy,
Run a 'set' command from inside of the ruby script, fire it from crontab, and you'll see exactly what's set and what's not.
To find out the environment in which cron runs jobs, add this cron job:
{ echo "\nenv\n" && env|sort ; echo "\nset\n" && set; } | /usr/bin/mailx -s 'my env' you#example.com
Or send the output to a file instead of email.
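For example (path illustrative):

* * * * * { echo env; env | sort; echo; echo set; set; } > /tmp/cron_env.txt 2>&1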
You could write a wrapper script, called for example rbcron, which looks something like:
#!/bin/bash
RUBY=ruby
export VAR1=foo
export VAR2=bar
export VAR3=baz
$RUBY "$*" 2>&1
This redirects ruby's standard error to standard output. You then run rbcron in your cron job, and standard output contains ruby's output and errors, as well as any bash errors coming from rbcron itself. In your cron entry, redirect with > /path/to/output 2>&1 to send all output and error messages to /path/to/output.
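A matching cron entry would be something like (paths hypothetical):

*/10 * * * * /home/you/bin/rbcron /home/you/scripts/myscript.rb > /path/to/output 2>&1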
If you really want to run it as yourself, you may want to invoke ruby from a shell script that sources your .profile/.bashrc etc. That way it'll pull in your environment.
However, the downside is that it's not isolated from your environment, and if you change that, you may find your cron jobs suddenly stop working.