I am trying to submit a SLURM job on a CentOS 7 computing cluster. The job runs a Python file (cifar100-vgg16.py) that needs tensorflow-gpu 2.8.1, which I've installed in a conda environment (tf_gpu). The bash script I'm submitting to SLURM (our job scheduler) is copied below. The SLURM output file shows that the environment being used is the base Python/3.6.4-foss-2018a installation (with tensorflow 1.10.1), not tf_gpu. Please advise on how to solve this.
Bash script:
#!/bin/bash --login
########## SBATCH Lines for Resource Request ##########
#SBATCH --time=00:10:00 # limit of wall clock time - how long the job will run (same as -t)
#SBATCH --nodes=1 # the number of nodes requested
#SBATCH --ntasks=1 # the number of tasks to run
#SBATCH --cpus-per-task=1 # the number of CPUs (or cores) per task (same as -c)
#SBATCH --mem-per-cpu=2G # memory required per allocated CPU (or core)
#SBATCH --job-name test2 # you can give your job a name for easier identification (same as -J)
########## Command Lines to Run ##########
conda activate tf_gpu
python cifar100-vgg16.py
SLURM output file:
> /opt/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/opt/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/opt/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/opt/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/opt/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/opt/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Tensorflow version 1.10.1
Keras version 2.1.6-tf
Scikit learn version 0.20.0
Traceback (most recent call last):
File "cifar100-vgg16.py", line 39, in <module>
print("Number of GPUs Available:", len(tensorflow.config.experimental.list_physical_devices('GPU')))
AttributeError: module 'tensorflow' has no attribute 'config'
There is a mistake in your job script. Replace conda activate tf_gpu with source activate tf_gpu.
Also, you probably need to load the relevant module before you can use conda. It would be something like module load anaconda; check module avail for the list of available modules.
But it looks like your HPC doesn't need a module load, since it finds conda without one.
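For reference, a sketch of the corrected command section of the job script (the module name is an assumption; check module avail on your cluster):
#!/bin/bash --login
# ... same #SBATCH resource lines as above ...
########## Command Lines to Run ##########
module load anaconda        # only if your cluster needs it; the exact module name is a guess
source activate tf_gpu
python cifar100-vgg16.py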
EDIT: FlytingTeller said that source activate was replaced by conda activate in 2017. I know this.
I don't know if this works on HPCs. To prove my point, here is the output from Swansea's SUNBIRD when I use conda activate.
(base) hell#Dell-Precision-T7910:~$ ssh sunbird
Last login: Wed Aug 10 15:30:29 2022 from en003013.swan.ac.uk
====================== Supercomputing Wales - Sunbird ========================
This system is for authorised users, if you do not have authorised access
please disconnect immediately, and contact Technical Support.
-----------------------------------------------------------------------------
For user guides, documentation and technical support:
Web: http://portal.supercomputing.wales
-------------------------- Message of the Day -------------------------------
SUNBIRD has returned to service unchanged. Further information on
the maintenance outage and future work will be distributed soon.
===============================================================================
[s.1915438#sl2 ~]$ module load anaconda/3
[s.1915438#sl2 ~]$ conda activate base
CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
If your shell is Bash or a Bourne variant, enable conda for the current user with
$ echo ". /apps/languages/anaconda3/etc/profile.d/conda.sh" >> ~/.bashrc
or, for all users, enable conda with
$ sudo ln -s /apps/languages/anaconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh
The options above will permanently enable the 'conda' command, but they do NOT
put conda's base (root) environment on PATH. To do so, run
$ conda activate
in your terminal, or to put the base environment on PATH permanently, run
$ echo "conda activate" >> ~/.bashrc
Previous to conda 4.4, the recommended way to activate conda was to modify PATH in
your ~/.bashrc file. You should manually remove the line that looks like
export PATH="/apps/languages/anaconda3/bin:$PATH"
^^^ The above line should NO LONGER be in your ~/.bashrc file! ^^^
[s.1915438#sl2 ~]$ source activate base
(base) [s.1915438#sl2 ~]$
Here is the output from Cardiff's HAWK when I use conda activate.
(base) hell#Dell-Precision-T7910:~$ ssh cardiff
Last login: Tue Aug 2 09:32:44 2022 from en003013.swan.ac.uk
======================== Supercomputing Wales - Hawk ==========================
This system is for authorised users, if you do not have authorised access
please disconnect immediately, and contact Technical Support.
-----------------------------------------------------------------------------
For user guides, documentation and technical support:
Web: http://portal.supercomputing.wales
-------------------------- Message of the Day -------------------------------
- WGP Gluster mounts are now RO on main login nodes.
- WGP RW access is via Ser Cymru system or dedicated access VM.
===============================================================================
[s.1915438#cl1 ~]$ module load anaconda/
anaconda/2019.03 anaconda/2020.02 anaconda/3
anaconda/2019.07 anaconda/2021.11
[s.1915438#cl1 ~]$ module load anaconda/2021.11
INFO: To setup environment run:
eval "$(/apps/languages/anaconda/2021.11/bin/conda shell.bash hook)"
or just:
source activate
[s.1915438#cl1 ~]$ conda activate
CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run
$ conda init <SHELL_NAME>
Currently supported shells are:
- bash
- fish
- tcsh
- xonsh
- zsh
- powershell
See 'conda init --help' for more information and options.
IMPORTANT: You may need to close and restart your shell after running 'conda init'.
[s.1915438#cl1 ~]$ conda activate base
CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run
$ conda init <SHELL_NAME>
Currently supported shells are:
- bash
- fish
- tcsh
- xonsh
- zsh
- powershell
See 'conda init --help' for more information and options.
IMPORTANT: You may need to close and restart your shell after running 'conda init'.
[s.1915438#cl1 ~]$ source activate base
(2021.11)[s.1915438#cl1 ~]$
The conda versions here are certainly from after 2020, not 2017. Since the question was about an HPC cluster, that is why I said to replace conda activate with source activate to activate the conda environment.
Anyone with a possible explanation?
EDIT 2: I think I have an explanation.
[s.1915438#sl2 ~]$ cat ~/.bashrc
# .bashrc
# Dynamically generated by: genconfig (Do not edit!)
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
# Load saved modules
module load null
# Personal settings file
if [ -f $HOME/.myenv ]
then
source $HOME/.myenv
fi
The ~/.bashrc does not contain the path to conda.sh. I think this is true for many HPCs.
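Given that, one way to make conda activate work inside a batch job is to source conda's shell hook explicitly before activating. A minimal sketch, assuming a site-specific conda.sh path like the one shown in the SUNBIRD message above (adjust it to wherever anaconda is installed on your cluster):
# In the SLURM job script, after the #SBATCH lines:
source /apps/languages/anaconda3/etc/profile.d/conda.sh   # path is cluster-specific
conda activate tf_gpu
python cifar100-vgg16.py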
I'm trying to create a Docker image which sets a custom Tomcat port (I know you can set an external port with the docker flag "-p 8888:8080", but for my use case I want to change the internal port as well).
When I try to start catalina.sh, the run argument is ignored for some reason.
Dockerfile:
# Tomcat 8 alpine dockerfile copied here (URL below)... minus the CMD line at the end
# https://github.com/docker-library/tomcat/blob/5f1abae99c0b1ebbd4f020bc4b5696619d948cfd/8.0/jre8-alpine/Dockerfile
ADD server.xml $CATALINA_HOME/conf/server.xml
ADD start-tomcat.sh /start-tomcat.sh
RUN chmod +x /start-tomcat.sh
ENTRYPOINT ["/bin/sh","/start-tomcat.sh"]
The tomcat file, server.xml, is the same as the default except for the line:
<Connector port="${port.http.nonssl}" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="8443" />
start-tomcat.sh:
#!/bin/sh
export JAVA_OPTS=-Dport.http.nonssl=${PORT}
catalina.sh run
The image builds successfully, but when I run with
docker run -p 8888:8888 -e PORT=8888 customtomcat
I just get a list of catalina.sh commands as if I didn't give it an argument. I've also tried
/usr/local/tomcat/bin/catalina.sh run
sh -c "catalina.sh run"
sh -c "/usr/local/tomcat/bin/catalina.sh run"
cd /usr/local/tomcat/bin
./catalina.sh run
I'm pretty sure I'm missing something simple here. I'd guess it's a syntax issue, but maybe it's something about Docker or Alpine that I'm not aware of. This is my first time using Alpine Linux.
---Edit 1---
To explain my use case... I'm setting PORT after the Docker image is created because it's being set by an Apache Mesos task. For my purposes I need to run the Docker container (from Marathon) in host mode, not bridged mode.
---Edit 2---
I modified things to focus only on my main issue. The Dockerfile now has only the following appended to the end:
ADD start-tomcat.sh /start-tomcat.sh
RUN chmod +x /start-tomcat.sh
ENTRYPOINT ["/bin/sh","/start-tomcat.sh"]
And start-tomcat.sh:
#!/bin/bash
catalina.sh run
Still no luck.
Update: if "catalina.sh run" fails as though it was given an invalid option, first check for Windows-style line endings (CRLF) in the script. They'll cause errors when the shell script is read in a Linux environment.
Looking at catalina.sh, I believe you want CATALINA_OPTS, not JAVA_OPTS:
# Control Script for the CATALINA Server
#
# Environment Variable Prerequisites
#
# Do not set the variables in this script. Instead put them into a script
# setenv.sh in CATALINA_BASE/bin to keep your customizations separate.
#
# CATALINA_HOME May point at your Catalina "build" directory.
#
# CATALINA_BASE (Optional) Base directory for resolving dynamic portions
# of a Catalina installation. If not present, resolves to
# the same directory that CATALINA_HOME points to.
#
# CATALINA_OUT (Optional) Full path to a file where stdout and stderr
# will be redirected.
# Default is $CATALINA_BASE/logs/catalina.out
#
# CATALINA_OPTS (Optional) Java runtime options used when the "start",
# "run" or "debug" command is executed.
# Include here and not in JAVA_OPTS all options, that should
# only be used by Tomcat itself, not by the stop process,
# the version command etc.
# Examples are heap size, GC logging, JMX ports etc.
#
# CATALINA_TMPDIR (Optional) Directory path location of temporary directory
# the JVM should use (java.io.tmpdir). Defaults to
# $CATALINA_BASE/temp.
#
# JAVA_HOME Must point at your Java Development Kit installation.
# Required to run the with the "debug" argument.
#
# JRE_HOME Must point at your Java Runtime installation.
# Defaults to JAVA_HOME if empty. If JRE_HOME and JAVA_HOME
# are both set, JRE_HOME is used.
#
# JAVA_OPTS (Optional) Java runtime options used when any command
# is executed.
# Include here and not in CATALINA_OPTS all options, that
# should be used by Tomcat and also by the stop process,
# the version command etc.
# Most options should go into CATALINA_OPTS.
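Based on that, here is a sketch of start-tomcat.sh using CATALINA_OPTS (the paths and the PORT variable come from the question; the exec and the CRLF note are general practice, not requirements):
#!/bin/sh
# Pass the port property via CATALINA_OPTS so it is applied to the "run" command.
export CATALINA_OPTS="-Dport.http.nonssl=${PORT}"
# If this file was ever edited on Windows, strip CRLF line endings first (e.g. with dos2unix),
# otherwise catalina.sh can behave as if it were given an unknown argument.
exec /usr/local/tomcat/bin/catalina.sh run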
I am trying to set up Hadoop on my local machine and was following this guide. I have also set HADOOP_HOME.
This is the command I am trying to run now
hduser#ubuntu:~$ /usr/local/hadoop/bin/start-all.sh
And this is the error I get
-su: /usr/local/hadoop/bin/start-all.sh: No such file or directory
This is what I added to my $HOME/.bashrc file
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
EDIT: After trying the solution given by mahendra, I am getting the following output
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hduser-namenode-mmt-HP-ProBook-430-G3.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hduser-datanode-mmt-HP-ProBook-430-G3.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hduser-secondarynamenode-mmt-HP-ProBook-430-G3.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-hduser-resourcemanager-mmt-HP-ProBook-430-G3.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hduser-nodemanager-mmt-HP-ProBook-430-G3.out
Try to run:
hduser#ubuntu:~$ /usr/local/hadoop/sbin/start-all.sh
start-all.sh and stop-all.sh are located in the sbin directory, while the hadoop binary is located in the bin directory.
Also update your .bashrc with:
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
so that you can run start-all.sh directly.
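For reference, a minimal sketch of the relevant .bashrc lines and the resulting call (assuming the /usr/local/hadoop layout from the question):
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# in a new shell, or after running: source ~/.bashrc
start-all.sh    # now resolves to /usr/local/hadoop/sbin/start-all.sh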
I am trying to create an alias on the Hadoop machine and run it from the Hive JVM.
When I explicitly run the command from Hive with the ! prefix it works; however, when I add the alias, source the .bashrc file, and call the alias from Hive, I get an error. Example:
.bashrc content:
# Environment variables required by hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export HADOOP_HOME_WARN_SUPPRESS=true
export HADOOP_HOME=/home/hadoop
export PATH=$PATH:/home/hadoop/bin
alias load-table='java -cp /home/hadoop/userlib/MyJar.jar com.MyClass.TableLoader';
Call on Hive:
!load-table;
Output:
Exception raised from Shell command Cannot run program "load-table": error=2, No such file or directory
Aliases have several limitations compared to shell functions (e.g. by default you cannot call an alias from a non-interactive shell).
Define in your ~/.bashrc:
function load-table() {
# Make sure the java executable is accessible
if which java > /dev/null 2>&1; then
java -cp /home/hadoop/userlib/MyJar.jar com.MyClass.TableLoader
else
echo "java not found! Check your PATH!"
fi
}
export -f load-table # to export the function (BASH specific)
Source your .bashrc to apply the changes. Then call load-table.
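A quick, self-contained illustration of why the function works where the alias does not (the names below are made up; assumes bash):
alias greet='echo hello from alias'
greet                 # works in the interactive shell that defined it
bash -c 'greet'       # fails: aliases are not inherited by non-interactive child shells
greet_fn() { echo hello from function; }
export -f greet_fn    # bash-specific: export the function to child processes
bash -c 'greet_fn'    # works: the exported function is visible to the child bash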
I'm trying to modify the hdfs script so that it still functions even though it is no longer located in $HADOOP_HOME/bin, but when I execute the modified hdfs I get:
hdfs: line 110: exec: org.apache.hadoop.fs.FsShell: not found
line 110 is:
exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$#"
I've highlighted the changes I made to the script:
bin=**"$HADOOP_HOME"/bin # was** `dirname "$0"`
bin=`cd "$bin"; pwd`
./**hdfs-config.sh # was .** "$bin"/hdfs-config.sh
$ hadoop version
Hadoop 0.20.3-SNAPSHOT
Subversion http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append -r 1041718
Compiled by hammer on Mon Dec 6 17:38:16 CET 2010
Why don't you simply put a second copy of Hadoop on the system and let it have a different value for HADOOP_HOME?
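A minimal sketch of that approach (the destination path is hypothetical):
cp -r "$HADOOP_HOME" /opt/hadoop-copy      # a second, independent Hadoop tree
export HADOOP_HOME=/opt/hadoop-copy        # point HADOOP_HOME at the copy
export PATH=$HADOOP_HOME/bin:$PATH
# The copy's unmodified scripts resolve hdfs-config.sh and the classpath
# relative to their own location, so the hdfs script needs no edits.
hadoop version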