How do I launch pyspark and arrive in an ipython shell - bash

When I launch pyspark, Spark loads properly; however, I end up in a standard Python shell environment.
Using Python version 2.7.13 (default, Dec 20 2016 23:05:08)
SparkSession available as 'spark'.
>>>
I want to launch into the ipython interpreter instead:
IPython 5.1.0 -- An enhanced Interactive Python.
In [1]:
How do I do that? I tried modifying my .bash_profile in this way and using the alias:
# Spark variables
export SPARK_HOME="/Users/micahshanks/spark-2.1.0-bin-hadoop2.7"
export PYTHONPATH="/Users/micahshanks/spark-2.1.0-bin-hadoop2.7/python/:"
# Spark 2
export PYSPARK_DRIVER_PYTHON=ipython
export PATH=$SPARK_HOME/bin:$PATH
# export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
alias sudo='sudo '
alias pyspark="/Users/micahshanks/spark-2.1.0-bin-hadoop2.7/bin/pyspark \
--conf spark.sql.warehouse.dir='file:///tmp/spark-warehouse' \
--packages com.databricks:spark-csv_2.11:1.5.0 \
--packages com.amazonaws:aws-java-sdk-pom:1.10.34 \
--packages org.apache.hadoop:hadoop-aws:2.7.3 \
--packages org.mongodb.spark:mongo-spark-connector_2.10:2.0.0"
I also tried navigating to spark home where pyspark is located and directly launching from there, but again I arrive in the python interpreter.
I found this post: How to load IPython shell with PySpark. The accepted answer looked promising, but I am activating a Python 2 environment (source activate py2) before launching Spark, and changing my bash profile in the suggested way tries to start Spark with Python 3, which I'm not set up to do (it throws errors).
I'm using Spark 2.1.

Spark 2.1.1
For some reason, running sudo ./bin/pyspark changes the ownership of metastore_db/db.lck, which prevents pyspark from launching under ipython. From the root of the decompressed Spark directory, try:
sudo chown -v $(id -un) metastore_db/db.lck
export PYSPARK_DRIVER_PYTHON=ipython
./bin/pyspark
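If you want to confirm that the lock file is the culprit before changing anything, a quick check (just a sketch, assuming you are in the Spark root directory) is:
# the Derby lock file being owned by root usually means pyspark was previously started with sudo
ls -l metastore_db/db.lck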
Another solution is simply to re-download Spark from spark.apache.org and decompress it again. Navigate to the root of the freshly decompressed directory and then:
export PYSPARK_DRIVER_PYTHON=ipython
./bin/pyspark
And it should work.

Since asking this question, I found that a helpful solution is to write bash scripts that load Spark in a specific way. Doing this gives you an easy way to start Spark in different environments (for example, ipython or a jupyter notebook).
To do this, open a blank script (using whatever text editor you prefer), for example one called ipython_spark.sh.
For this example, here is the script I use to open Spark with the ipython interpreter:
#!/bin/bash
export PYSPARK_DRIVER_PYTHON=ipython
${SPARK_HOME}/bin/pyspark \
--master local[4] \
--executor-memory 1G \
--driver-memory 1G \
--conf spark.sql.warehouse.dir="file:///tmp/spark-warehouse" \
--packages com.databricks:spark-csv_2.11:1.5.0,com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.3
Note that I have SPARK_HOME defined in my bash_profile, but you could just insert the whole path to wherever pyspark is located on your computer. Also note that --packages takes a comma-separated list; repeating the flag would only keep the last value.
I like to put all scripts like this in one place, so I put this file in a folder called "scripts".
Now, for this example, go to your bash_profile and enter the following lines:
export PATH=$PATH:/Users/<username>/scripts
alias ispark="bash /Users/<username>/scripts/ipython_spark.sh"
These paths will be specific to where you put ipython_spark.sh.
You may then need to update its permissions:
$ chmod 711 ipython_spark.sh
and source your bash_profile:
$ source ~/.bash_profile
I'm on a Mac, but this should all work for Linux as well, although you will most likely be updating .bashrc instead of bash_profile.
What I like about this method is that you can write multiple scripts with different configurations and open Spark accordingly. Depending on whether you are setting up a cluster, need to load different packages, or want to change the number of cores Spark has at its disposal, etc., you can either update this script or make new ones. Note that PYSPARK_DRIVER_PYTHON= is the correct syntax for Spark > 1.2.
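For instance, a hypothetical companion script (say, jupyter_spark.sh; not part of the original setup, just a sketch) could launch the same configuration in a Jupyter notebook instead of ipython:
#!/bin/bash
# jupyter_spark.sh -- hypothetical variant; assumes jupyter is installed in the active python environment
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
${SPARK_HOME}/bin/pyspark \
--master local[4] \
--executor-memory 1G \
--driver-memory 1G \
--conf spark.sql.warehouse.dir="file:///tmp/spark-warehouse"
You could then alias it in your bash_profile the same way, e.g. alias jspark="bash /Users/<username>/scripts/jupyter_spark.sh".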
I am using Spark 2.2

Related

How to run Robot Framework's `robot` command from a shell/bash script on Windows?

As a user, I want to execute Robot Framework's robot command with some command-line options. I put everything in a script to avoid retyping the long command each time - see the example below. On Linux and Mac OS I can execute this script from any terminal emulator, e.g.
# Linux
. run_local_tests.sh
# Mac OS
./run_local_tests.sh
On Windows, either the application associated with the .sh file type (the VSCode editor) opens instead of the robot command being executed, or an error like robot: command not found is returned:
# Windows
.\run_local_tests.sh
# OR
run_local_tests.sh
# OR
bash run_local_tests.sh
shell script - filename: run_local_tests.sh
#!/bin/bash
# Set desired loglevel: NONE (less details), INFO, DEBUG, TRACE (most details)
export LOG_LEVEL=TRACE
# RUN CONTRIBUTION SERVICE TESTS
robot -i CONTRIBUTION -e circleci \
--outputdir results \
--log NONE \
--report NONE \
--output XML/CONTRIBUTION.xml \
--noncritical not-ready \
--flattenkeywords for \
--flattenkeywords foritem \
--flattenkeywords name:_resources.* \
--loglevel $LOG_LEVEL \
--name CONTRI \
robot/CONTRIBUTION_TESTS/
Renaming the script from .sh to .bat doesn't help :(
Entering bash, then activating the venv and calling the script doesn't work either.
What other options are there (without installing additional tools like Cygwin etc.)?
I'm actually trying to answer the same question in the opposite direction (how to trigger/run them on my machine as .sh). Looks like we may help each other out. 8)
I believe this is what you're looking for:
Your file would be run_local_tests.bat
Contents:
@echo off
cd C:\path\to\robot\project
call robot -d relative/path/to/test/output/dir relative/path/to/tests
Of course you can use any other valid robot cli syntax in the call also. You may have to make it executable too. I'm not sure.

Correct way to source .bashrc for non-interactive shell

I have been trying to resolve problems so that I can run openmpi on multiple nodes.
Initially I had a problem with the $PATH and $LD_LIBRARY_PATH variables not being picked up from the .bashrc file by the openmpi session, so I manually added --prefix /path/to/openmpi to resolve that issue.
It turns out that the anaconda path variables are not being loaded either. So ultimately I need the ~/.bashrc file in my home directory to be sourced. How can I do that? Can anyone help me out please?
UPDATE 01:
I wrote a simple shell script to check the version of python
python --version
and tried to run it with openmpi on local as well as remote machine as follows:
mpirun --prefix /home/usama/.openmpi --hostfile hosts -np 4 bash script
And it returns
Python 2.7.12
Python 3.6.8 :: Anaconda, Inc.
Python 3.6.8 :: Anaconda, Inc.
Python 2.7.12
This confirms my suspicion that whatever openmpi does to run remote processes does not invoke or set the proper environment variables from the ~/.bashrc file. Any help from someone who has worked with multi-node openmpi?
UPDATE 02:
A simple ssh environment grep tells me that my environment variables are not updated, which might be the cause of the problem. (I have even tried to set them up in the ~/.ssh/environment file.)
$ ssh remote-node env | grep -i path
It seems to be loading only the /etc/environment file, with only basic paths set up. How do I rectify this?
Maybe you should run it like this, I guess. Here are two approaches that may help.
First, source ~/.bashrc inside the shell that mpirun launches on each node:
mpirun --prefix /home/usama/.openmpi --hostfile hosts -np 4 bash -c '. ~/.bashrc && bash script'
Second:
## 1. Add this line at the top of the script
. ~/.bashrc
## 2. Run the command as you do now
mpirun --prefix /home/usama/.openmpi --hostfile hosts -np 4 bash script
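For concreteness, here is a minimal sketch of the second approach, assuming the version-check script from UPDATE 01 is a file named script that exists on every node:
#!/bin/bash
# script -- sourcing ~/.bashrc here lets the non-interactive MPI shell pick up PATH,
# LD_LIBRARY_PATH and the anaconda setup before anything else runs
. ~/.bashrc
python --version
One caveat: many distributions ship a ~/.bashrc that returns early for non-interactive shells (a guard such as case $- in *i*) ;; *) return;; esac near the top), so the exports you need may have to sit above that guard for this to take effect.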

spark-submit command not found in airflow

I am trying to run my Spark job in Airflow. When I execute this command spark-submit --class dataload.dataload_daily /home/ubuntu/airflow/dags/scripts/data_to_s3-assembly-0.1.jar in a terminal, it works fine without any issue.
However, when I do the same in Airflow, I keep getting the error
/tmp/airflowtmpKQMdzp/spark-submit-scalaWVer4Z: line 1: spark-submit:
command not found
t1 = BashOperator(task_id = 'spark-submit-scala',
bash_command = 'spark-submit --class dataload.dataload_daily \
/home/ubuntu/airflow/dags/scripts/data_to_s3-assembly-0.1.jar',
dag=dag,
retries=0,
start_date=datetime(2018, 4, 14))
I have my spark path mentioned in bash_profile,
export SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.7
export PATH="$SPARK_HOME/bin/:$PATH"
I sourced this file as well. I'm not sure how to debug this; can anyone help me with it?
You could start with bash_command = 'echo $PATH' to see if your path is being updated correctly.
This is likely because you mention editing the bash_profile, but as far as I know Airflow runs as another user. Since that other user has no changes in their bash_profile, the path to Spark might be missing.
As mentioned here (How do I set an environment variable for airflow to use?) you could try setting the path in .bashrc.
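Another hedged workaround, reusing the paths already shown in the question, is to avoid depending on the Airflow user's profile at all and make the bash_command self-contained, for example:
# hypothetical bash_command value -- exports the Spark paths inline instead of relying on bash_profile
export SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.7
export PATH="$SPARK_HOME/bin:$PATH"
spark-submit --class dataload.dataload_daily /home/ubuntu/airflow/dags/scripts/data_to_s3-assembly-0.1.jar
Equivalently, calling $SPARK_HOME/bin/spark-submit by its absolute path avoids touching PATH at all.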

Running Bash scripts in iPython

I wrote a bash script, run.sh, which has a python command with multiple options -
python train.py --lr 0.01 \
--momentum 0.5 \
--num_hidden 3 \
--sizes 100,100,100 \
--activation sigmoid \
--loss sq \
--opt adam \
--batch_size 20 \
--anneal true
I tried running this command in iPython -
!./run.sh
However, in iPython I'm not able to access the variables of the python script train.py. Is there some way to run the bash script in iPython so that I can access the variables? I don't want to copy paste the above command from the bash script each and every time.
I'm currently using iPython 5.1.0 on macOS Sierra.
The python process that runs your script train.py and the python process you're using at the ipython command line are two separate processes. It makes sense that one doesn't know about the variables of the other. There is probably some fancy way to connect the two but I suspect from the way you described the problem that it's not worth the work.
Here's an easier way to get access: you could replace python train.py in your script with python -i train.py. This way you will go into interactive mode in the process that runs your script after it is done, and anything defined at the main level will be accessible. You could insert a call to pdb.set_trace() in your script to stop at an arbitrary point.
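A minimal sketch of what run.sh would look like under that suggestion (the only change from the original script is the -i flag):
#!/bin/bash
# run.sh -- python -i drops into an interactive prompt after train.py finishes,
# so anything defined at the top level of train.py stays accessible
python -i train.py --lr 0.01 \
--momentum 0.5 \
--num_hidden 3 \
--sizes 100,100,100 \
--activation sigmoid \
--loss sq \
--opt adam \
--batch_size 20 \
--anneal true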
