SLURM sbatch multiple parent jobs in parallel, each with multiple child jobs

I want to run a Fortran code called orbits_01 on SLURM, with multiple jobs running simultaneously (i.e. parallelized over multiple cores). Each orbits_01 process calls another executable called optimizer, and the optimizer repeatedly calls a Python script called relax.py. When I submitted the jobs to SLURM with sbatch python main1.py, they failed to even call the optimizer. However, the whole scheme works fine when I run it locally. The local process status is shown below:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
shuha 39395 0.0 0.0 161540 3064 ? S Oct22 0:19 sshd: shuha@pts/72
shuha 39396 0.0 0.0 118252 5020 pts/72 Ss Oct22 0:11 \_ -bash
shuha 32351 0.3 0.0 318648 27840 pts/72 S 02:08 0:00 \_ python3 main1.py
shuha 32968 0.0 0.0 149404 1920 pts/72 R+ 02:10 0:00 \_ ps uxf
shuha 32446 0.0 0.0 10636 1392 pts/72 S 02:08 0:00 ../orbits_01.x
shuha 32951 0.0 0.0 113472 1472 pts/72 S 02:10 0:00 \_ sh -c ./optimizer >& log
shuha 32954 0.0 0.0 1716076 1376 pts/72 S 02:10 0:00 \_ ./optimizer
shuha 32955 0.0 0.0 113472 1472 pts/72 S 02:10 0:00 \_ sh -c python relax.py > relax.out
shuha 32956 99.6 0.0 749900 101944 pts/72 R 02:10 0:02 \_ python relax.py
shuha 32410 0.0 0.0 10636 1388 pts/72 S 02:08 0:00 ../orbits_01.x
shuha 32963 0.0 0.0 113472 1472 pts/72 S 02:10 0:00 \_ sh -c ./optimizer >& log
shuha 32964 0.0 0.0 1716076 1376 pts/72 S 02:10 0:00 \_ ./optimizer
shuha 32965 0.0 0.0 113472 1472 pts/72 S 02:10 0:00 \_ sh -c python relax.py > relax.out
shuha 32966 149 0.0 760316 111992 pts/72 R 02:10 0:01 \_ python relax.py
shuha 32372 0.0 0.0 10636 1388 pts/72 S 02:08 0:00 ../orbits_01.x
shuha 32949 0.0 0.0 113472 1472 pts/72 S 02:10 0:00 \_ sh -c ./optimizer >& log
shuha 32950 0.0 0.0 1716076 1376 pts/72 S 02:10 0:00 \_ ./optimizer
shuha 32952 0.0 0.0 113472 1472 pts/72 S 02:10 0:00 \_ sh -c python relax.py > relax.out
shuha 32953 100 0.0 749892 101936 pts/72 R 02:10 0:03 \_ python relax.py
I have a main Python script called main1.py, which uses a for loop to start multiple orbits_01 jobs at the same time and then waits for all of them to finish. Here 3 parent orbits_01 jobs are running in parallel, and each parent job has multiple child jobs. The heavy computation is done by the Python code relax.py, so each job should be able to run on a single core. What is the best way to submit and parallelize multiple parent jobs, each with multiple child jobs, over all cores of one node on SLURM?
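The launch-and-wait loop described above could be bounded by the number of cores the job was granted. The following is a minimal sketch, not a confirmed fix for the failure under sbatch: it assumes one core per parent run, reads the SLURM_CPUS_PER_TASK variable that SLURM sets when --cpus-per-task is requested, and assumes one working directory per parent run. The names core_count, run_one and run_all are hypothetical and do not come from the original main1.py.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def core_count(default=1):
    """Cores granted to this job; SLURM sets SLURM_CPUS_PER_TASK when --cpus-per-task is given."""
    return int(os.environ.get("SLURM_CPUS_PER_TASK", default))


def run_one(command, workdir):
    """Run one parent job to completion in its own directory and return its exit code."""
    return subprocess.run(command, cwd=workdir).returncode


def run_all(command, workdirs, max_parallel):
    """Run the same command in every directory, at most max_parallel at a time."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(lambda d: run_one(command, d), workdirs))


if __name__ == "__main__":
    # Assumed layout: one directory per parent run, passed on the command line.
    dirs = sys.argv[1:]
    codes = run_all(["../orbits_01.x"], dirs, core_count())
    print(dict(zip(dirs, codes)))
```

Threads are enough here because each one only waits on a child process; the actual work happens in orbits_01, optimizer and relax.py. A script like this might be submitted with something like sbatch --cpus-per-task=3 --wrap "python3 main1.py dir1 dir2 dir3", where the directory names are placeholders.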

Related

.bashrc somehow looping and sourcing itself (fork bomb)

I'm using a web host with an Apache terminal to host a NodeJS application. For the most part everything runs smoothly; however, when I open the terminal I often get bash: fork: retry: no child processes and bash: fork: retry: resource temporarily unavailable.
I've narrowed down the cause of the problem to my .bashrc file, as when using top I could see that the many excess processes being created were bash instances:
top - 13:41:13 up 71 days, 20:57, 0 users, load average: 1.82, 1.81, 1.72
Tasks: 14 total, 1 running, 2 sleeping, 11 stopped, 0 zombie
%Cpu(s): 11.7 us, 2.7 sy, 0.1 ni, 85.5 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 41034544 total, 2903992 free, 6525792 used, 31604760 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 28583704 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1001511 xxxxxxxx 20 0 11880 3692 1384 S 0.0 0.0 0:00.02 bash
1001578 xxxxxxxx 20 0 11880 2840 524 T 0.0 0.0 0:00.00 bash
1001598 xxxxxxxx 20 0 11880 2672 348 T 0.0 0.0 0:00.00 bash
1001599 xxxxxxxx 20 0 11880 2896 524 T 0.0 0.0 0:00.00 bash
1001600 xxxxxxxx 20 0 11880 2720 396 T 0.0 0.0 0:00.00 bash
1001607 xxxxxxxx 20 0 11880 2928 532 T 0.0 0.0 0:00.00 bash
1001613 xxxxxxxx 20 0 11880 2964 532 T 0.0 0.0 0:00.00 bash
1001618 xxxxxxxx 20 0 11880 2780 348 T 0.0 0.0 0:00.00 bash
1001619 xxxxxxxx 20 0 12012 3024 544 T 0.0 0.0 0:00.00 bash
1001620 xxxxxxxx 20 0 11880 2804 372 T 0.0 0.0 0:00.00 bash
1001651 xxxxxxxx 20 0 12012 2836 352 T 0.0 0.0 0:00.00 bash
1001653 xxxxxxxx 20 0 12016 3392 896 T 0.0 0.0 0:00.00 bash
1004463 xxxxxxxx 20 0 9904 1840 1444 S 0.0 0.0 0:00.00 bash
1005200 xxxxxxxx 20 0 56364 1928 1412 R 0.0 0.0 0:00.00 top
~/.bashrc consists of only:
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
# Uncomment the following line if you don't like systemctl's auto-paging feature:
# export SYSTEMD_PAGER=
# User specific aliases and functions
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh" # This loads nvm
[ -s "$NVM_DIR/bash_completion" ] && \. "$NVM_DIR/bash_completion" # This loads nvm bash_completion
If I comment out the last 3 lines like so:
#export NVM_DIR="$HOME/.nvm"
#[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh" # This loads nvm
#[ -s "$NVM_DIR/bash_completion" ] && \. "$NVM_DIR/bash_completion" # This loads nvm bash_completion
Then the terminal functions as expected and no excess processes are created. However, I obviously can't use nvm/npm commands while these lines are disabled, because nvm is never loaded.
I'm relatively inexperienced with bash and can't seem to figure out why this is happening. It seems that bash is somehow calling itself every time it opens, which creates the loop/fork bomb once the terminal is opened.
How can I prevent this while still being able to use nvm/npm?

Optimizing vncscreenshot scripts

Good day,
I'm using vncsnapshot http://vncsnapshot.sourceforge.net/ in a Debian 7 environment to capture screenshots of workstations so I can monitor staff desktop activity. It captures the screenshots via nmap and saves them to a location of my choice, which is accessed through an internal web page.
I have scripts like the one below, where x.x.x.x is the IP range of the network, so that all open workstations are captured.
#!/bin/bash
nmap -v -p5900 --script=vnc-screenshot-it --script-args vnc-screenshot.quality=30 x.x.x.x
The scripts are set up in crontab to run every 5 minutes.
As a result, the server has too many running processes. This is a sample of the ps output:
root 32696 0.0 0.0 4368 0 ? S Feb23 0:00 /bin/bash /var/www/vncsnapshot/.scripts/.account.sh
root 32708 0.0 0.0 14580 4 ? S Feb23 0:00 nmap -v -p5900,5901,5902 --script=vnc-screenshot-mb
root 32717 0.0 0.0 1952 60 ? S Apr10 0:00 sh -c vncsnapshot -cursor -quality 30 x.x.x.x
root 32719 0.0 0.1 11480 4892 ? S Apr10 0:00 vncsnapshot -cursor -quality 30 30 x.x.x.x /var/w
root 32720 0.0 0.0 1952 60 ? S Apr25 0:00 sh -c vncsnapshot -cursor -quality 30 30 x.x.x.x
root 32722 0.0 0.0 1952 4 ? Ss Feb09 0:00 /bin/sh -c /var/www/vncsnapshot/.scripts/.account.sh
root 32723 0.0 0.0 3796 140 ? S Apr25 0:00 vncsnapshot -cursor -quality 30 30 x.x.x.x /var/w
root 32730 0.0 0.0 1952 4 ? Ss Feb08 0:00 /bin/sh -c /var/www/vncsnapshot/.scripts/.account
root 32734 0.0 0.0 4364 0 ? S Feb08 0:00 /bin/bash /var/www/vncsnapshot/.scripts/.account.
root 32741 0.0 0.0 13700 4 ? S Feb08 0:00 nmap -v -p5900 --script=vnc-screenshot-account --
root 32755 0.0 0.0 1952 4 ? Ss Feb08 0:00 /bin/sh -c /var/www/vncsnapshot/.scripts/.account.sh
root 32757 0.0 0.0 1952 4 ? S Feb07 0:00 sh -c vncsnapshot -cursor -quality 30 30 x.x.x.x
root 32760 0.0 0.0 3796 0 ? S Feb07 0:00 vncsnapshot -cursor -quality 30 30 x.x.x.x /var/w
root 32762 0.0 0.0 4368 0 ? S Feb09 0:00 /bin/bash /var/www/vncsnapshot/.scripts/.account.sh
root 32764 0.0 0.0 4368 0 ? S Feb08 0:00 /bin/bash /var/www/vncsnapshot/.scripts/.account.sh
How can I optimize this set-up so that unnecessary processes that are still running get closed?
Thanks
I split the work into two parts: nmap, which regularly scans the network, and vncsnapshot, which grabs a screenshot from each host in the list produced by the previous scan.
In my opinion, things are cleaner this way.
I haven't tested this code.
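#!/bin/bash
## Capture the list of hosts with the VNC port open
list=/dev/shm/list
port=5900
network='192.168.1.*' # quoted so the shell does not glob-expand it
# -oG - prints greppable output; keep the address field of hosts with an open port
nmap -n -p"${port}" --open "${network}" -oG - | awk '/open\/tcp/ {print $2}' > "${list}"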
The other script uses a lock file to check whether a grab process is still alive for each host and, if not, launches the grab command:
#!/bin/bash
list=/dev/shm/list
run=/run/vncscreenshot
mkdir -p "${run}"
while read -r host
do
# Skip this host if its lock file names a PID that is still running
lock="${run}/${host}.lock"
test -e "${lock}" && ps -p "$(<"${lock}")" &>/dev/null && continue
# Grab command as written in the original answer; untested
vnc-screenshot-it vnc-screenshot.quality=30 "${host}" &
echo $! > "${lock}"
done < "${list}"
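The lock-file test in the second script can also be illustrated in Python. This is only a sketch of the same idea, assuming the lock file holds a single PID as in the script above. The helper names pid_alive, lock_is_held and take_lock are hypothetical, and the check-then-write step is not atomic, so it is only suitable when a single cron job writes the locks.

```python
import os
from pathlib import Path


def pid_alive(pid):
    """Return True if a process with this PID exists (signal 0 only checks, it sends nothing)."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one process
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def lock_is_held(lock_path):
    """True if the lock file names a PID that is still running."""
    try:
        pid = int(Path(lock_path).read_text().strip())
    except (FileNotFoundError, ValueError):
        return False
    return pid_alive(pid)


def take_lock(lock_path, pid):
    """Record pid in the lock file unless a live process already holds it."""
    if lock_is_held(lock_path):
        return False
    Path(lock_path).write_text(f"{pid}\n")
    return True
```

A stale lock left behind by a finished grab is simply overwritten, which is the same behaviour as the test -e ... && ps -p ... check in the shell version.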

Gitlab and redmine high memory usage

I have a VPS with 1GB of memory running Debian 7 stable, with Gitlab and Redmine installed and nothing else (apart from the normal system processes).
This configuration consumes more than 900MB of memory. I have already set the unicorn workers to 1, but that made no significant difference. The Redmine version is 2.5.1.stable and the Gitlab version is 6-9-stable.
Is there a way to reduce the memory consumption and CPU load? I could use nginx instead of apache2, or postgres instead of mysql. What else?
Any suggestion is really appreciated.
Here is the list of running processes:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.2 30176 2160 ? Ss 10:38 0:00 init
root 2 0.0 0.0 0 0 ? S 10:38 0:00 [kthreadd/1723]
root 3 0.0 0.0 0 0 ? S 10:38 0:00 [khelper/1723]
root 227 0.0 0.0 16988 884 ? S 10:38 0:00 upstart-udev-bridge --daemon
root 235 0.0 0.1 21300 1352 ? Ss 10:38 0:00 /sbin/udevd --daemon
root 283 0.0 0.0 21296 1024 ? S 10:38 0:00 /sbin/udevd --daemon
root 284 0.0 0.0 21296 1028 ? S 10:38 0:00 /sbin/udevd --daemon
root 428 0.0 0.0 14936 640 ? S 10:38 0:00 upstart-socket-bridge --daemon
root 1874 0.0 0.1 58740 1652 ? Sl 10:38 0:00 /usr/sbin/rsyslogd -c5
root 1920 0.0 0.0 57568 988 ? Ss 10:38 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 2
root 1922 0.0 0.0 57568 632 ? S 10:38 0:00 /usr/sbin/saslauthd -a pam -c -m /var/run/saslauthd -n 2
root 1988 0.0 0.2 72552 2648 ? Ss 10:38 0:00 sendmail: MTA: accepting connections
root 2029 0.0 0.1 49888 1244 ? Ss 10:38 0:00 /usr/sbin/sshd
root 2061 0.0 0.0 19520 964 ? Ss 10:38 0:00 /usr/sbin/xinetd -pidfile /var/run/xinetd.pid -stayalive -inetd_compat -inetd_ipv6
root 2113 0.0 2.1 301048 22244 ? Ss 10:38 0:00 /usr/sbin/apache2 -k start
root 2156 0.0 0.0 20364 1044 ? Ss 10:38 0:00 /usr/sbin/cron
root 2206 0.0 0.0 4136 712 ? S 10:38 0:00 /bin/sh /usr/bin/mysqld_safe
root 2322 0.0 0.1 23368 1968 ? Ssl 10:38 0:00 PassengerWatchdog
root 2337 0.5 0.2 100600 2652 ? Sl 10:38 0:19 PassengerHelperAgent
root 2348 0.0 0.9 46372 10412 ? Sl 10:38 0:00 Passenger spawn server
nobody 2353 0.0 0.3 81832 4168 ? Sl 10:38 0:00 PassengerLoggingAgent
mysql 2551 0.0 5.2 464312 55360 ? Sl 10:38 0:01 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysq
root 2552 0.0 0.0 4044 668 ? S 10:38 0:00 logger -t mysqld -p daemon.error
root 2684 0.0 0.7 58116 8300 ? S 10:38 0:00 python /usr/sbin/denyhosts --daemon --purge --config=/etc/denyhosts.conf
redis 2708 0.0 0.1 39964 1676 ? Ssl 10:38 0:00 /usr/bin/redis-server /etc/redis/redis.conf
git 2811 0.5 13.0 377628 136484 ? Sl 10:38 0:19 unicorn_rails master -D -c /home/git/gitlab/config/unicorn.rb -E production
git 2846 0.0 12.3 377628 129148 ? Sl 10:38 0:00 unicorn_rails worker[0] -D -c /home/git/gitlab/config/unicorn.rb -E production
git 2873 0.6 13.6 428528 143532 ? Sl 10:38 0:23 sidekiq 2.17.0 gitlab [0 of 25 busy]
root 2892 0.0 0.2 32712 2248 ? Ss 10:38 0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 106:111
root 2913 0.0 0.0 14532 876 tty1 Ss+ 10:38 0:00 /sbin/getty 38400 console
root 2915 0.0 0.0 14532 880 tty2 Ss+ 10:38 0:00 /sbin/getty 38400 tty2
admin 2976 0.1 8.6 265444 90444 ? Sl 10:39 0:03 Passenger ApplicationSpawner: /var/www/redmine
admin 2984 0.0 9.8 282032 103736 ? Sl 10:39 0:00 Rails: /var/www/redmine
admin 2992 0.0 8.1 265444 85744 ? Sl 10:39 0:00 Rails: /var/www/redmine
admin 2998 0.0 8.1 265444 85764 ? Sl 10:39 0:00 Rails: /var/www/redmine
admin 3004 0.0 8.1 265444 85760 ? Sl 10:39 0:00 Rails: /var/www/redmine
admin 3010 0.0 8.1 265444 85760 ? Sl 10:39 0:00 Rails: /var/www/redmine
admin 3016 0.0 9.9 282532 104400 ? Sl 10:39 0:01 Rails: /var/www/redmine
www-data 3026 0.0 1.6 301492 17416 ? S 10:39 0:00 /usr/sbin/apache2 -k start
git 3042 0.0 12.8 313152 134320 ? Sl 10:39 0:00 Rack: /home/git/gitlab
root 3794 0.0 0.3 71248 3628 ? Ss 11:23 0:00 sshd: admin [priv]
admin 3797 0.0 0.1 71248 1824 ? R 11:23 0:00 sshd: admin@pts/0
admin 3798 0.0 0.2 19428 2228 pts/0 Ss 11:23 0:00 -bash
www-data 3922 0.0 1.6 301520 17448 ? S 11:32 0:00 /usr/sbin/apache2 -k start
www-data 3926 0.0 1.6 301472 17328 ? S 11:32 0:00 /usr/sbin/apache2 -k start
www-data 3929 0.0 1.6 301472 17288 ? S 11:32 0:00 /usr/sbin/apache2 -k start
www-data 3930 0.0 1.5 301256 16220 ? S 11:32 0:00 /usr/sbin/apache2 -k start
root 4012 0.0 0.2 72552 2876 ? S 11:38 0:00 sendmail: MTA: ./s59ECXBN022245 example.com.: user open
and this is the result of "free -m":
total used free shared buffers cached
Mem: 1024 962 61 0 0 196
-/+ buffers/cache: 766 257
Swap: 1024 0 1024

Append Output results

I'm running validation software and I want all of its output sent to a text file, with the results for multiple files appended to that same file. I thought my code was working, but I just discovered that only the results for 1 file end up in the text file.
java -jar /Applications/epubcheck-3.0.1/epubcheck-3.0.1.jar ~/Desktop/Validator/*.epub 2>&1 | tee -a ~/Desktop/Validator/EPUBCHECK3_results.txt
open ~/Desktop/Validator/EPUBCHECK3_results.txt
EDIT
When I run the same .jar file from the Windows command line, it processes a batch of files and appends the results appropriately. I could just do that, but it would mean switching workstations and transferring files in order to validate them. I would like to get this running through the Unix shell on my Mac so that I don't have to do unnecessary work. The command line that IS working is below:
FOR /f %%1 in ('dir /b "C:\Users\scrawfo\Desktop\epubcheck\drop epubs here\*.epub"') do (
echo %%1 >> epubcheck.txt
java -jar "C:\Users\scrawfo\Desktop\epubcheck\epubcheck-3.0.jar" "C:\Users\scrawfo\Desktop\epubcheck\drop epubs here\%%1" 2>> epubcheck.txt
echo. >> epubcheck.txt)
notepad epubcheck.txt
del epubcheck.txt
The syntax you provided is correct, so the problem may lie in the Java output itself. Try running the command without the redirection. As a check, I started with a test file:
cat test
Output:
This is Test File ...............
Next, I ran a command with the same syntax:
ps l 2>&1 | tee -a test
Output:
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
4 0 3287 1 20 0 4060 572 n_tty_ Ss+ tty2 0:00 /sbin/mingetty /dev/tty2
4 0 3289 1 20 0 4060 572 n_tty_ Ss+ tty3 0:00 /sbin/mingetty /dev/tty3
4 0 3291 1 20 0 4060 576 n_tty_ Ss+ tty4 0:00 /sbin/mingetty /dev/tty4
4 0 3295 1 20 0 4060 576 n_tty_ Ss+ tty5 0:00 /sbin/mingetty /dev/tty5
4 0 3297 1 20 0 4060 572 n_tty_ Ss+ tty6 0:00 /sbin/mingetty /dev/tty6
4 0 19086 1 20 0 4060 572 n_tty_ Ss+ tty1 0:00 /sbin/mingetty /dev/tty1
4 0 20837 20833 20 0 108432 2148 wait Ss pts/0 0:00 -bash
4 0 21471 20837 20 0 108124 1036 - R+ pts/0 0:00 ps l
0 0 21472 20837 20 0 100908 664 pipe_w S+ pts/0 0:00 tee -a test
Then I checked the file:
cat test
Output (appended properly):
This is Test File ...............
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
4 0 3287 1 20 0 4060 572 n_tty_ Ss+ tty2 0:00 /sbin/mingetty /dev/tty2
4 0 3289 1 20 0 4060 572 n_tty_ Ss+ tty3 0:00 /sbin/mingetty /dev/tty3
4 0 3291 1 20 0 4060 576 n_tty_ Ss+ tty4 0:00 /sbin/mingetty /dev/tty4
4 0 3295 1 20 0 4060 576 n_tty_ Ss+ tty5 0:00 /sbin/mingetty /dev/tty5
4 0 3297 1 20 0 4060 572 n_tty_ Ss+ tty6 0:00 /sbin/mingetty /dev/tty6
4 0 19086 1 20 0 4060 572 n_tty_ Ss+ tty1 0:00 /sbin/mingetty /dev/tty1
4 0 20837 20833 20 0 108432 2148 wait Ss pts/0 0:00 -bash
4 0 21471 20837 20 0 108124 1036 - R+ pts/0 0:00 ps l
0 0 21472 20837 20 0 100908 664 pipe_w S+ pts/0 0:00 tee -a test
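The working Windows batch file in the question runs the validator once per file, whereas the Unix command passes every *.epub to a single invocation. If the tool only processes its first file argument (an assumption; the source does not confirm this), a per-file loop like the batch one would be needed. The sketch below mirrors that loop in Python; check_each is a hypothetical helper name, and the paths in the comment are the ones quoted in the question.

```python
import subprocess


def check_each(command, files, results_path):
    """Run command once per file, appending the file name and combined output to results_path."""
    with open(results_path, "a") as results:
        for path in files:
            results.write(f"{path}\n")
            results.flush()  # keep the file name ahead of the child's output
            subprocess.run(list(command) + [str(path)],
                           stdout=results, stderr=subprocess.STDOUT)
            results.write("\n")
            results.flush()


# Example call using the paths from the question (not executed here):
# from pathlib import Path
# folder = Path.home() / "Desktop" / "Validator"
# check_each(["java", "-jar", "/Applications/epubcheck-3.0.1/epubcheck-3.0.1.jar"],
#            sorted(folder.glob("*.epub")), folder / "EPUBCHECK3_results.txt")
```

Opening the results file in append mode and merging stderr into stdout plays the same role as 2>&1 | tee -a in the shell command, except that nothing is echoed to the terminal.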

How do I put an already running CHILD process under nohup

My question is very similar to that posted in: How do I put an already-running process under nohup?
Say I execute foo.sh from my command line, and it in turn executes another shell script, and so on. For example:
foo.sh
\_ bar.sh
\_ baz.sh
Now I press Ctrl+Z to suspend "foo.sh". It is listed in my "jobs -l".
How do I disown baz.sh so that it is no longer a grandchild of foo.sh? If I type "disown", only foo.sh is disowned from its parent, which isn't exactly what I want. I'd like to kill off the foo.sh and bar.sh processes and be left with only baz.sh.
My current workaround is to "kill -18" (resume) baz.sh and go on with my work, but I would prefer to kill the aforementioned processes. Thanks.
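Use ps to get the PID of bar.sh, and kill it. In the session below, killing bar.sh also makes foo.sh exit, and baz.sh is left running with PPID 1: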
imac:barmar $ ps -l -t p0 -ww
UID PID PPID F CPU PRI NI SZ RSS WCHAN S ADDR TTY TIME CMD
501 3041 3037 4006 0 31 0 2435548 760 - Ss 8c6da80 ttyp0 0:00.74 /bin/bash --noediting -i
501 68228 3041 4006 0 31 0 2435544 664 - S 7cbc2a0 ttyp0 0:00.00 /bin/bash ./foo.sh
501 68231 68228 4006 0 31 0 2435544 660 - S c135a80 ttyp0 0:00.00 /bin/bash ./bar.sh
501 68232 68231 4006 0 31 0 2435544 660 - S a64b7e0 ttyp0 0:00.00 /bin/bash ./baz.sh
501 68233 68232 4006 0 31 0 2426644 312 - S f9a1540 ttyp0 0:00.00 sleep 100
0 68243 3041 4106 0 31 0 2434868 480 - R+ a20ad20 ttyp0 0:00.00 ps -l -t p0 -ww
imac:barmar $ kill 68231
./foo.sh: line 3: 68231 Terminated ./bar.sh
[1]+ Exit 143 ./foo.sh
imac:barmar $ ps -l -t p0 -ww
UID PID PPID F CPU PRI NI SZ RSS WCHAN S ADDR TTY TIME CMD
501 3041 3037 4006 0 31 0 2435548 760 - Ss 8c6da80 ttyp0 0:00.74 /bin/bash --noediting -i
501 68232 1 4006 0 31 0 2435544 660 - S a64b7e0 ttyp0 0:00.00 /bin/bash ./baz.sh
501 68233 68232 4006 0 31 0 2426644 312 - S f9a1540 ttyp0 0:00.00 sleep 100
0 68248 3041 4106 0 31 0 2434868 480 - R+ 82782a0 ttyp0 0:00.00 ps -l -t p0 -ww
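To find which PIDs sit between the shell and the process you want to keep, the PID and PPID columns of the listing can be walked upwards. The sketch below does this on text in the form printed by ps -o pid= -o ppid= -o args=; the function names parse_ps and ancestors are hypothetical, and the sample rows are copied from the first listing above.

```python
def parse_ps(text):
    """Parse lines of `ps -o pid= -o ppid= -o args=` output into {pid: (ppid, command)}."""
    table = {}
    for line in text.strip().splitlines():
        pid, ppid, command = line.split(None, 2)
        table[int(pid)] = (int(ppid), command)
    return table


def ancestors(table, pid, stop_at):
    """PIDs between pid and stop_at (both excluded), nearest parent first."""
    chain = []
    parent = table[pid][0]
    while parent != stop_at and parent in table:
        chain.append(parent)
        parent = table[parent][0]
    return chain


# PID, PPID and command columns taken from the first listing above.
listing = """
 3041  3037 /bin/bash --noediting -i
68228  3041 /bin/bash ./foo.sh
68231 68228 /bin/bash ./bar.sh
68232 68231 /bin/bash ./baz.sh
68233 68232 sleep 100
"""
table = parse_ps(listing)
print(ancestors(table, 68232, stop_at=3041))  # [68231, 68228]
```

The first entry, 68231, is bar.sh, the process the answer kills; 68228 is foo.sh, which exited on its own in the session shown.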

Resources