Parallel building/deployment - shell

I'm not sure what tags to put on this:
I have an Xcode project with multiple schemes that outputs multiple apps. I have a script, archive_all.sh, that sets everything up to build and deploy each app (13 at the moment) to TestFlight by calling archive.sh. I tried (stupid me) running sh archive.sh & in the loop, but my laptop could hardly handle it, and I plan to have many more than 13 apps to deploy in the future.
Is there a way, preferably in a shell script, to set up a queue of executables to call? My laptop could probably handle 3-4 calls to archive.sh at a time.
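A minimal sketch of that kind of queue using only xargs (app_list.txt is a hypothetical file with one app name per line; -P caps how many archive.sh processes run at once and refills a slot as soon as one finishes):
xargs -n1 -P4 sh archive.sh < app_list.txt    # at most 4 archive.sh processes at a time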

Try ppss, which supports both Linux and Mac OS X. It will auto-detect the number of cores in your CPU and execute tasks efficiently across those cores.

GNU Parallel is made for this kind of job.
parallel archive.sh {} ::: app1 app2 ... app15
This will run archive.sh for each app, one job per CPU core.
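If the laptop can only handle 3-4 archives at a time, the -j option caps the number of simultaneous jobs (a sketch; the app names are placeholders carried over from the example above):
parallel -j4 archive.sh {} ::: app1 app2 ... app15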
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new process whenever one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Related

How to utilise GNU parallel efficiently?

I have a script, say parallelise.sh, whose contents are 10 different Python calls, shown below:
python3.8 script1.py
python3.8 script2.py
.
.
.
python3.8 script10.py
Now, I use GNU Parallel:
nohup parallel -j 5 < parallelise.sh &
It starts as expected: 5 jobs run at once, and the first 5 scripts, script1.py ... script5.py, are running. Now I notice that some of them (say two of them, script1.py and script2.py) complete very fast, whereas the others need more time.
Now there are unused resources (2 processors) sitting idle while waiting for the remaining 3 scripts (script3.py, script4.py, and script5.py) to complete before the next 5 can be loaded. Is there a way to use these resources by loading new scripts as existing ones complete?
For information: my OS is CentOS.
As @RenaudPacalet says, there is nothing else to do: GNU Parallel already starts a new job as soon as one of the running jobs finishes.
So there must be something in your scripts that causes this not to happen.
To help debug you can use:
parallel --lb --tag < parallelise.sh
and maybe add a "Starting X" line at the beginning of scriptX.py and a "Finishing X" line at the end of scriptX.py, so you can see that the scripts are indeed finishing.
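If you would rather not touch the Python scripts, a hedged alternative is to wrap each line of parallelise.sh in the shell instead, for example:
echo "Starting 1"; python3.8 script1.py; echo "Finishing 1"
echo "Starting 2"; python3.8 script2.py; echo "Finishing 2"
Each line is still a single job, and together with --tag/--lb you can see exactly when each job starts and ends.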
Without knowing anything about scriptX.py it is impossible to say what is causing this.
(Instead of nohup, consider using tmux or screen, so the jobs keep running in the background but you can still check in on them and see their output; nohup is not ideal for debugging.)
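A minimal tmux workflow, as a sketch (the session name pll is arbitrary):
tmux new -s pll                    # start a named session
parallel -j 5 < parallelise.sh     # run the jobs inside it; output stays visible
# detach with Ctrl-b d, reattach later with: tmux attach -t pll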

Julia Command Line Running Processes in Parallel

I have a Julia script that converts CSVs to a binary format. Trust me, it's great. I also have many (seemingly innumerable) CSVs that I want to process. It's a shared network, so I can only process five files at a clip without savagely burdening the CPU and making my coworkers irate and potentially unstable. Accordingly, I want to run the script in groups of five, wait for them to finish, and then run the next batch as background processes until it's Miller time, all using Julia's wonderful run() function, à la:
julia csvparse3.jl /home/file1.csv > /dev/null 2>&1 &
I'm fairly certain that I could sidestep all of this by using addprocs() and pmap() if I made my parsing script into a Julia module/function. However, the reason I'm asking is: what would I do if my original script were written in Fortran, or even worse, Python? Is there a way for me to achieve my aforementioned goals for an arbitrary number of external programs, ascertain when the processes are finished, and start anew in the context of a simple loop? Many thanks.
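For reference, a bare-bones shell sketch of the batch-of-five idea described in the question (assuming the CSVs live under /home as in the example; not a polished solution):
n=0
for f in /home/*.csv; do
    julia csvparse3.jl "$f" > /dev/null 2>&1 &   # start one conversion in the background
    n=$((n+1))
    if [ "$n" -ge 5 ]; then
        wait                                     # let the current batch of five finish
        n=0
    fi
done
wait                                             # wait for the final, possibly smaller batch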
With GNU Parallel you can run:
parallel -j5 julia csvparse3.jl ::: /home/*.csv > /dev/null 2>&1

running multiple instances of a single MPI executable in parallel on a cluster using mpirun options?

I'm trying to write a shell script to perform some kind of algorithm, and part of it requires parallel execution of an MPI executable across multiple input files on a grid engine cluster. From what I've read, it seems that mpirun supports MPMD execution by using the colon sign or by using an application context/schema file and then running mpirun --app my_appfile. Below is what my my_appfile looks like:
-np 12 /path/to/executable /path/to/dir1/input1
-np 12 /path/to/executable /path/to/dir2/input2
-np 12 /path/to/executable /path/to/dir3/input3
...
-np 12 /path/to/executable /path/to/dir10/input10
I was trying to execute 10 instances of the same executable in parallel and assign the cluster's resources accordingly (120 processes in this case, in SGE's orte parallel environment).
However, there was a problem. Each run is set up to write its output into the same directory as its particular input file. When I submitted the job (the submission script contains only the mpirun --app my_appfile line), only the output for input1 appeared in dir1, and none of the rest. So I wonder what the problem is here. Is it a problem with the mpirun options, or with how the cluster handles the job? Any help would be highly appreciated. Thank you!

shell script to loop and start processes in parallel?

I need a shell script that will loop over tasks read in from a file and start them in parallel...
Something along the lines of:
#!/bin/bash
mylist=/home/mylist.txt
for i in $(cat "$mylist")
do
    cp -rp "$i" /destination &   # or whatever the per-line task is
done
wait
So what I am trying to do is send a bunch of tasks into the background with "&", one per line in $mylist, and wait for them to finish before exiting.
However, there may be a lot of lines in there, so I want to control how many parallel background processes get started; I want to be able to cap it at, say, 5 or 10.
Any ideas?
Thank you
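A minimal sketch of one way to do this in plain bash, using wait -n (which requires bash 4.3 or newer) so that a new copy starts as soon as any slot frees up; the cap of 5 is arbitrary:
max=5
while read -r i; do
    cp -rp "$i" /destination &
    while [ "$(jobs -rp | wc -l)" -ge "$max" ]; do
        wait -n                  # wait for any one background job to finish
    done
done < /home/mylist.txt
wait                             # wait for whatever is still running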
Your task manager will make it seem like you can run many parallel jobs; how many you can actually run for maximum efficiency depends on your processor. Overall you don't have to worry about starting too many processes, because your system will schedule them for you. If you want to limit them anyway, because the number could get absurdly high, you could use something like this (provided you run a cp command every time):
...
while ...; do
    jobs=$(pgrep 'cp' | wc -l)
    [[ $jobs -gt 50 ]] && { sleep 100; continue; }
    ...
done
The number of running cp commands is stored in the jobs variable, and before starting a new iteration the loop checks whether there are already too many of them. Note that we jump to a new iteration, so you'd have to keep track of how many commands you have already executed. Alternatively, you could use wait.
Edit:
On a side note, you can pin a process to a specific CPU core using taskset; it may come in handy when you have a small number of heavier commands.
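For example (the core number 0 is chosen arbitrarily):
taskset -c 0 cp -rp "$i" /destination &   # pin this cp to CPU core 0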
You are probably looking for something like this using GNU Parallel:
parallel -j10 cp -rp {} /destination :::: /home/mylist.txt

Using maximum remote servers

I'm trying to distribute commands to 100 remote computers, but I noticed that the commands are only being sent to 16 of them. My local machine has 16 cores. Why is parallel only using 16 remote computers instead of 100?
parallel --eta --sshloginfile list_of_100_remote_computers.txt < list_of_commands.txt
I do believe you will need to specify the number of parallel jobs to be executed.
According to the parallel man page:
--jobs N
-j N
--max-procs N
-P N
Number of jobslots. Run up to N jobs in parallel. 0 means as many as possible. Default is 100% which will run one job per CPU core.
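As a sketch of that suggestion (the value 4 is arbitrary; note that when --sshloginfile is used, the -j limit applies per remote server, not in total):
parallel -j 4 --eta --sshloginfile list_of_100_remote_computers.txt < list_of_commands.txt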
And keep this in mind:
When you start more than one job with the -j option, it is reasonable
to assume that each job might not take exactly the same amount of time
to complete. If you care about seeing the output in the order that
file names were presented to Parallel (instead of when they
completed), use the --keeporder option.
— Parallel Multicore at the Command Line with GNU Parallel, Admin Magazine
If the remote machines are 32-core machines, then you are running 16*32 jobs. By default GNU Parallel uses one file handle for STDOUT and one for STDERR per job, in total 16*32*2 = 1024 file handles.
If you have a default GNU/Linux system, you will be hitting the 1024-file-handle limit.
If running with --ungroup starts more jobs, that is a clear indication that you have hit the file handle limit. Use ulimit -n to increase the limit.
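A minimal sketch of that workaround, raising the limit in the shell that launches parallel (the value 4096 is arbitrary and is subject to your hard limit):
ulimit -n 4096
parallel --eta --sshloginfile list_of_100_remote_computers.txt < list_of_commands.txt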
