I am trying to run my Spark job from Airflow. When I execute this command in a terminal, it works fine without any issue:
spark-submit --class dataload.dataload_daily /home/ubuntu/airflow/dags/scripts/data_to_s3-assembly-0.1.jar
However, when I do the same thing in Airflow, I keep getting the error
/tmp/airflowtmpKQMdzp/spark-submit-scalaWVer4Z: line 1: spark-submit: command not found
t1 = BashOperator(
    task_id='spark-submit-scala',
    bash_command='spark-submit --class dataload.dataload_daily '
                 '/home/ubuntu/airflow/dags/scripts/data_to_s3-assembly-0.1.jar',
    dag=dag,
    retries=0,
    start_date=datetime(2018, 4, 14))
I have the Spark path set in my bash_profile:
export SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.7
export PATH="$SPARK_HOME/bin/:$PATH"
and I have sourced this file as well. I'm not sure how to debug this; can anyone help me?
You could start with bash_command = 'echo $PATH' to see if your path is being updated correctly.
This is because you edited the bash_profile, but as far as I know Airflow runs as a different user. Since that user has no changes in their bash_profile, the path to Spark may be missing.
As mentioned here (How do I set an environment variable for airflow to use?) you could try setting the path in .bashrc.
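If updating that user's .bashrc isn't an option, another way to sidestep the PATH issue entirely is to call spark-submit by its absolute path in the bash_command. A minimal sketch, assuming the SPARK_HOME from the question:
# assumes SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.7 as in the question
/opt/spark-2.2.0-bin-hadoop2.7/bin/spark-submit \
  --class dataload.dataload_daily \
  /home/ubuntu/airflow/dags/scripts/data_to_s3-assembly-0.1.jar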
Related
I am trying to execute some gcloud commands in a bash script from crontab. The script executes successfully from the command shell, but not from the cron job.
I have tried with:
Setting the full path to gcloud, like:
/etc/bash_completion.d/gcloud
/home/Arturo/.config/gcloud
/usr/bin/gcloud
/usr/lib/google-cloud-sdk/bin/gcloud
Setting at the beginning of the script:
/bin/bash -l
Setting in the crontab:
51 21 30 5 6 CLOUDSDK_PYTHON=/usr/bin/python2.7;
/home/myuser/folder1/myscript.sh param1 param2 param3 -f >>
/home/myuser/folder1/mylog.txt
Setting inside the script:
export CLOUDSDK_PYTHON=/usr/bin/python2.7
Setting inside the script:
sudo ln -s /home/myuser/google-cloud-sdk/bin/gcloud /usr/bin/gcloud
Ubuntu version: 18.04.3 LTS
Command to execute: gcloud config set project myproject
But nothing is working; maybe I am doing something wrong. I hope you can help me.
You need to set your user in your crontab for it to run the gcloud command. As explained in this other post, you need to modify your crontab so that it can find your Cloud SDK for the execution to work properly; it doesn't seem that you have made this configuration.
Another option that I would recommend trying is Cloud Scheduler, which lets you run your gcloud commands as cron jobs in a more integrated and straightforward way. You can find more information about this option here: Creating and configuring cron jobs
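For reference, a hedged sketch of what such a crontab could look like; the PATH entries and the SDK location are assumptions based on the paths in your question:
# environment for cron (locations are assumptions taken from the question)
CLOUDSDK_PYTHON=/usr/bin/python2.7
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/myuser/google-cloud-sdk/bin
# redirect stderr as well so gcloud errors show up in the log
51 21 30 5 6 /home/myuser/folder1/myscript.sh param1 param2 param3 -f >> /home/myuser/folder1/mylog.txt 2>&1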
Let me know if the information helped you!
I found my error. The problem was only with the command "gcloud dns record-sets transaction start"; the other commands were executing successfully but logging nothing, which is why I thought they were not executing. That command creates a temp file (e.g. transaction.yaml), and that file could not be created in the default path for gcloud (snap/bin), but the log simply didn't report anything. I had to specify the path and name for that file with the flag --transaction-file=mytransaction.yaml. Thanks for your support and ideas.
I have run into the same issue before. I fixed it by forcing the profile to load in my script.sh, loading the gcloud environment variables with it. Example below:
#!/bin/bash
source /etc/profile
gcloud config set project myprojectecho
echo "Project set to myprojectecho."
I hope this can help others in the future with similar issues, as this also helped me when trying to set GKE nodes from 0-4 on a schedule.
Adding the below line to the shell script fixed my issue
#Execute user profile
source /root/.bash_profile
I have code that reads the port number from an environment variable or from config. The code looks like this:
const port = process.env.PORT || serverConfig.port;
await app.listen(port);
To run the app without defining the environment variable, I run the following yarn command:
yarn start:dev
This command works successfully in Linux shell and Windows command line.
Now, I want to pass the environment variable. I tried the following:
PORT=2344 yarn start:dev
This command works successfully in a Linux shell but fails in the Windows command line. I tried the following ways but couldn't get it to work.
Tried: PORT=2344 yarn start:dev
I got error: 'PORT' is not recognized as an internal or external command,
operable program or batch file.
Tried: yarn PORT=2344 start:dev
I got error: yarn run v1.17.3
error Command "PORT=2344" not found.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
Any ideas? I know I can define environment variables from System Properties in Windows, but is there any way to do it from the command line?
I'd suggest you use the npm module called cross-env. It allows setting particular env variables on the command line regardless of platform. With that said, you may try:
$ cross-env PORT=2344 yarn start:dev
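A minimal sketch of the setup, assuming a yarn-managed project; the npx invocation is just one way to reach the locally installed binary:
# add cross-env as a dev dependency
yarn add --dev cross-env
# run the existing start:dev script with PORT set, regardless of platform
npx cross-env PORT=2344 yarn start:dev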
You can chain commands on the Windows command prompt with & (or &&). To set an environment variable you need to use the set command.
The result should look like this: set PORT=1234 && yarn start:dev.
Found a solution for this problem in the Windows command prompt.
Create a .env file in the project root folder (outside the src folder).
Define PORT in it. In my case, the contents of the .env file are:
PORT=2344
Run yarn start:dev
The application will use the port number that you specified in the .env file.
Put the .env file at the project root. Then the following command will export the contents of the .env file and run the yarn start command:
$ source .env && yarn start
or this command
$ export $(cat .env) && yarn start
If you update any variable in .env, close the terminal, open a new terminal window, and run the above command again. Alternatively, you can run the unset command to remove an existing variable:
unset VAR_NAME
You can use the popular package dotenv:
create a file .env in root directory
put all your env vars
e.g.:
ENV=DEVELOPMENT
run your code like this:
$ node -r dotenv/config your_script.js
The explanation is here: https://github.com/motdotla/dotenv#preload
To define environment variables in the Windows command prompt we can use the set command; you can then split your call into two lines.
set PORT=2344
yarn start:dev
The set command persists within the current command prompt, so you only need to run it once.
The equivalent command in bash is 'export'.
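For completeness, a minimal sketch of the bash equivalent, assuming the same start:dev script:
# export makes PORT visible to the process that yarn starts
export PORT=2344
yarn start:dev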
FYI (not a direct answer). I was attempting this in VS Code - passing .env variables through yarn to a JavaScript app. Google had very few examples so I'm sharing this for posterity as it's somewhat related.
The following simply substitutes text normally placed directly into the package.json or script file. Use this to quickly obfuscate or externalize your delivery configurations.
In Environment Variable File (.env)
PORT=2344
In Yarn File (package.json)
source .env; yarn ./start.sh --port $PORT
In Yarn Script (start.sh)
#!/bin/bash
# run the app once, then keep restarting it while it exits with a non-zero status
node dist/src/index.js "$@"; # replace with your app call; "$@" forwards --port and its value
while [ $? != 0 ]; do
    node dist/src/index.js "$@";
done
The app then accepts port as a variable. Great for multi-tenant deployments.
I have my crontab set up as follows (this is inside a docker container).
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
SHELL=/bin/bash
5 * * * * bash /usr/local/bin/process-logs > /proc/1/fd/1 2>/proc/1/fd/2
The /usr/local/bin/process-logs is designed to expose some MongoDB logs using mtools to a simple web server.
The problematic part of the script is fairly simple. raw_name is archive_name without the file extension.
aws s3 cp "s3://${s3_bucket}/${file_name}" "${archive_name}"
gunzip "${archive_name}"
mlogvis --no-browser "${raw_name}"
If I manually run the command as specified in the crontab config above
bash /usr/local/bin/process-logs > /proc/1/fd/1 2>/proc/1/fd/2
It all works as expected (this is the expected output from mlogvis)
...
copying /usr/local/lib/python3.5/dist-packages/mtools/data/index.html to /some/path/mongod.log-20190313-1552456862.html
...
When the script gets triggered via crontab, it throws the following error:
usage: mlogvis [-h] [--version] [--no-progressbar] [--no-browser] [--out OUT]
[--line-max LINE_MAX]
mlogvis: error: unrecognized arguments: mongod.log-20190313-1552460462
The mlogvis command that caused the above error (actual values, not parameters):
mlogvis --no-browser "mongod.log-20190313-1552460462"
Again if I run this command myself it all works as expected.
mlogvis: http://blog.rueckstiess.com/mtools/mlogvis.html
I don't believe this to be an issue with the file not having correct permissions or not existing, as mlogvis produces a different error in those conditions. I've also tested removing '-' from the file name, thinking it might be trying to parse the parts as arguments, but it made no difference.
I know cron execution doesn't have the same execution environment as the user I tested the script as. I've set the PATH to be the same as that user's, and when the container starts up I execute env >> /etc/environment so all the environment vars are properly set.
Does anyone know of a way to debug this, or has anyone encountered something similar? All other components of the script are functioning except mlogvis, which is core to the purpose of this job.
Summary of what I've tried as a fix:
Set environment and PATH for cron execution to be the same as the user I tested the script as
Replace - in file name(s) to see if it was parsing the parts as arguments
Hardcode a filename with full permissions to see if it was permissions related
Manually run the script -> this works
Manually run the mlogvis command in isolation -> this works
Try loading /home/user/.bash_profile before executing the script and try again. I suspect that you have a missing PATH or some other environment variable that is not set.
source /home/user/.bash_profile
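For example, a hedged sketch of the crontab entry with the profile sourced first (same paths as in your config):
# source the user profile before the script so PATH and friends are populated
5 * * * * . /home/user/.bash_profile && bash /usr/local/bin/process-logs > /proc/1/fd/1 2>/proc/1/fd/2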
Please post your complete script, because while executing via crontab, you have to be sure your raw_name variable was properly created. As it seems to depend on archive_name, posting some more context can help us to help you.
In any case, if you are using bash, you can try something like:
aws s3 cp "s3://${s3_bucket}/${file_name}" "${archive_name}"
gunzip "${archive_name}"
# here you have to be sure that archive_name is correct
raw_name_2=${archive_name%.*}  # '%' strips only the final extension (e.g. .gz); '%%' would strip everything after the first dot
mlogvis --no-browser "${raw_name_2}"
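As a quick sanity check of that expansion (values taken from the question):
archive_name="mongod.log-20190313-1552460462.gz"
echo "${archive_name%.*}"    # -> mongod.log-20190313-1552460462
echo "${archive_name%%.*}"   # -> mongod (usually not what you want here)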
It is not going to solve your issue, but probably will take you closer to the right path.
I have a build server. I'm using the Azure Build Agent script; it's a shell script that will run continuously while the server is up. The problem is that I cannot seem to get it to run on startup. I've tried /etc/init.d and /etc/rc.local, and the agent is not being run. There is nothing concerning the build agent in the boot logs.
For /etc/init.d I created the script agent.sh which contains:
#!/bin/bash
sh ~/agent/run.sh
Gave it the proper permissions with chmod 755 agent.sh and moved it to /etc/init.d.
and for /etc/rc.local, I just appended the following
sh ~/agent/run.sh &
before exit 0.
What am I doing wrong?
EDIT: added examples.
EDIT 2: Just noticed that the init.d README says that shell scripts need to start with #!/bin/sh and not #!/bin/bash. Also used an absolute path, but no change.
FINAL EDIT: As @ewrammer suggested, I used cron and it worked. crontab -e and then @reboot /home/user/agent/run.sh.
It is hard to see what is wrong if you are not posting what you have done, but why not add it as a cron job with @reboot as the pattern? Then cron will run the script every time the computer starts.
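A minimal sketch of that setup; the log redirection is optional and the paths follow the ones in the question:
# edit the crontab of the user that owns the agent
crontab -e
# then add an entry like this:
@reboot /home/user/agent/run.sh >> /home/user/agent/agent.log 2>&1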
Just in case, using a supervisor could be a good idea. In Ubuntu 14 you don't have systemd, but you can choose from other options: https://en.wikipedia.org/wiki/Process_supervision.
If using immortal, after installing it, you just need to create a run.yml file in /etc/immortal with something like:
cmd: /path/to/command
log:
file: /var/log/command.log
This will start your script/command on every start, besides ensuring your script/app is always up and running.
When I launch pyspark, Spark loads properly; however, I end up in a standard Python shell environment.
Using Python version 2.7.13 (default, Dec 20 2016 23:05:08)
SparkSession available as 'spark'.
>>>
I want to launch into the ipython interpreter.
IPython 5.1.0 -- An enhanced Interactive Python.
In [1]:
How do I do that? I tried modifying my .bash_profile in this way and using the alias:
# Spark variables
export SPARK_HOME="/Users/micahshanks/spark-2.1.0-bin-hadoop2.7"
export PYTHONPATH="/Users/micahshanks/spark-2.1.0-bin-hadoop2.7/python/:"
# Spark 2
export PYSPARK_DRIVER_PYTHON=ipython
export PATH=$SPARK_HOME/bin:$PATH
# export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
alias sudo='sudo '
alias pyspark="/Users/micahshanks/spark-2.1.0-bin-hadoop2.7/bin/pyspark \
--conf spark.sql.warehouse.dir='file:///tmp/spark-warehouse' \
--packages com.databricks:spark-csv_2.11:1.5.0 \
--packages com.amazonaws:aws-java-sdk-pom:1.10.34 \
--packages org.apache.hadoop:hadoop-aws:2.7.3 \
--packages org.mongodb.spark:mongo-spark-connector_2.10:2.0.0"
I also tried navigating to spark home where pyspark is located and directly launching from there, but again I arrive in the python interpreter.
I found this post: How to load IPython shell with PySpark. The accepted answer looked promising, but I am activating a Python 2 environment (source activate py2) before launching Spark, and changing my bash profile in this way attempts to start Spark with Python 3, which I'm not set up to do (it throws errors).
I'm using spark 2.1
Spark 2.1.1
For some reason, typing sudo ./bin/pyspark changes the file permissions of metastore_db/db.lck, which causes running ipython and pyspark not to work. From the decompressed root directory, try:
sudo chown -v $(id -un) metastore_db/db.lck
export PYSPARK_DRIVER_PYTHON=ipython
./bin/pyspark
Another solution is to just re-download and decompress from spark.apache.org. Navigate to the root of the decompressed directory and then:
export PYSPARK_DRIVER_PYTHON=ipython
./bin/pyspark
And it should work.
Since asking this question I found that a helpful solution is to write bash scripts that load Spark in a specific way. Doing this gives you an easy way to start Spark in different environments (for example, ipython and a jupyter notebook).
To do this, open a blank script (using whatever text editor you prefer), for example one called ipython_spark.sh.
For this example I will provide the script I use to open spark with the ipython interpreter:
#!/bin/bash
export PYSPARK_DRIVER_PYTHON=ipython
${SPARK_HOME}/bin/pyspark \
--master local[4] \
--executor-memory 1G \
--driver-memory 1G \
--conf spark.sql.warehouse.dir="file:///tmp/spark-warehouse" \
--packages com.databricks:spark-csv_2.11:1.5.0 \
--packages com.amazonaws:aws-java-sdk-pom:1.10.34 \
--packages org.apache.hadoop:hadoop-aws:2.7.3
Note that I have SPARK_HOME defined in my bash_profile, but you could just insert the whole path to wherever pyspark is located on your computer.
I like to put all scripts like this in one place, so I put this file in a folder called "scripts".
Now for this example you need to go to your bash_profile and enter the following lines:
export PATH=$PATH:/Users/<username>/scripts
alias ispark="bash /Users/<username>/scripts/ipython_spark.sh"
These paths will be specific to where you put ipython_spark.sh
and then you might need to update permissions:
$ chmod 711 ipython_spark.sh
and source your bash_profile:
$ source ~/.bash_profile
I'm on a Mac, but this should all work for Linux as well, although you will most likely be updating .bashrc instead of bash_profile.
What I like about this method is that you can write multiple scripts with different configurations and open Spark accordingly. Depending on whether you are setting up a cluster, need to load different packages, or change the number of cores Spark has at its disposal, etc., you can either update this script or make new ones. Note that PYSPARK_DRIVER_PYTHON= is the correct syntax for Spark > 1.2.
I am using Spark 2.2
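For example, a companion script for a Jupyter notebook could follow the same pattern; the file name jupyter_spark.sh and the PYSPARK_DRIVER_PYTHON_OPTS value are assumptions, not part of the original setup:
#!/bin/bash
# hypothetical jupyter_spark.sh: same idea, but the driver opens a notebook instead of ipython
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

${SPARK_HOME}/bin/pyspark \
--master local[4] \
--executor-memory 1G \
--driver-memory 1G \
--conf spark.sql.warehouse.dir="file:///tmp/spark-warehouse" \
--packages org.apache.hadoop:hadoop-aws:2.7.3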