Do multiple runs make it parallel? - parallel-processing

I have written a short Python script to process my big fastq files, which range in size from 5 GB to 35 GB. I am running the script on a Linux server that has many cores. The script is not parallelized at all and takes about 10 minutes on average to finish a single file.
If I run the same script on several files, like
$ python my_script.py file1 &
$ python my_script.py file2 &
$ python my_script.py file3 &
using the & sign to send each process to the background,
do those scripts run in parallel, and will I save some time?
It doesn't seem so to me: I am using the top command to check processor usage, and each process's usage drops as I add new runs. Shouldn't each one stay close to 100%?
So if they are not running in parallel, is there a way to make the OS run them in parallel?
Thanks for any answers.

Commands executed this way do indeed run in parallel. The reason they're not using 100% of your CPU time might be that they're I/O bound rather than CPU bound. The description of what the script does (big fastq files of 5 GB to 35 GB) suggests that this might well be the case.
If you look at the process list given by ps, though, you should see three python processes there - unless one or more of them has terminated by the time you run ps.
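For instance, a minimal wrapper along these lines (file names taken from the question) launches the runs, shows them in the process list, and then blocks until all of them are done:
#!/bin/bash
# Start the three runs in the background; each becomes its own process.
python my_script.py file1 &
python my_script.py file2 &
python my_script.py file3 &

# All three should show up here while they are still running.
ps -C python -o pid,pcpu,args

# Block until every background job has finished.
wait
echo "all runs done"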

Time spent waiting on I/O operations is accounted for as a different kind of CPU usage, usually shown as %wa. You are probably just looking at %us (user CPU time).
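If you want to confirm that, watch the wa column while the runs are going; a couple of stock commands (iostat comes from the sysstat package, so it may not be installed):
vmstat 2      # the "wa" column is the share of time spent waiting on I/O
iostat -x 2   # per-disk utilisation view, from the sysstat package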

Related

How to monitor and control background processes in shell script

I need to write a shell (bash) script that will execute several Hive queries.
Each of the queries will produce a directory with a lot of files.
After all queries are finished I need to process all these files in a specific order.
I want to run the Hive queries in parallel as background processes, since each one might take a couple of hours.
I would also like to parallelize the resulting file processing, but there are some complications that I don't know how to handle. For example, I can start processing the results of the first and second queries as soon as each one finishes, but for the third, I need to hold off until the first two processors are done. Similarly for the fourth and fifth.
I wouldn't have any problems writing such a program in Java, but how to do it in shell beats me.
If someone can give me a hint on how to monitor the execution of these components in the shell script, I would appreciate it greatly.
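For reference, one minimal way to sketch those dependencies with plain bash, $! and wait (the hive -f invocations and process_result.sh are placeholders for whatever actually runs the queries and processes their output):
#!/bin/bash
# Each query and the processing of its own results run as one background chain.
( hive -f query1.hql && ./process_result.sh 1 ) & chain1=$!
( hive -f query2.hql && ./process_result.sh 2 ) & chain2=$!
hive -f query3.hql & query3=$!

# The third processor may only start once its own query and processors 1 and 2 are done.
wait "$chain1" "$chain2" "$query3"
./process_result.sh 3
# Queries and processors 4 and 5 would follow the same pattern.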

Correct use of the Bash wait command with an unknown number of processes

I'm writing a bash script that essentially fires off a Python script that takes roughly 10 hours to complete, followed by an R script that checks the outputs of the Python script for anything I need to be concerned about. Here is what I have:
ProdRun="python scripts/run_prod.py"
echo "Commencing Production Run"
$ProdRun #Runs python script
wait
DupCompare="R CMD BATCH --no-save ../dupCompareTD.R" #Runs R script
$DupCompare
Now my issue is that the Python script can generate a whole heap of different processes on our Linux server depending on its input, with lots of different PIDs, AND we have heaps of workers using the same server firing off scripts. As far as I can tell from reading, the wait command waits either for all processes to finish or for a specific PID to finish, but when I cannot tell what or how many PIDs will be assigned, or how many processes will run, how exactly do I use it?
EDIT: Thank you to all who helped; here is what caused my dilemma, for anyone searching for this. I broke up the ProdRun Python script into the individual scripts it was itself calling, but still had the issue. I then found that one of these scripts was calling yet another smaller script with a "&" at the end of the command, which made it ignore any attempt to wait on it inside the Python script itself. Simply removing the "&" and invoking that script with os.system() allowed all the code to run sequentially.
It sounds like you are trying to implement a job scheduler, possibly with some complex dependencies between tasks. I recommend using an actual job scheduler instead. It lets you specify how those jobs should run while also giving you features like monitoring, handling of exceptional cases, error recovery, and so on.
Examples are the open-source Rundeck (https://github.com/rundeck/rundeck) and the commercial Control-M (http://www.bmcsoftware.uk/it-solutions/control-m.html).
Make your Python program wait on the children it spawns; that's the proper way to fix this scenario. Then there is nothing left to wait for once the Python process itself finishes.
(Also, don't put your commands in variables.)
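To see why a shell-level wait cannot help once the Python script backgrounds its own children, here is a quick demonstration:
sleep 5 &              # a direct child of this shell: wait can see it
bash -c 'sleep 30 &'   # the inner shell exits at once; its sleep becomes a detached grandchild
wait                   # returns after roughly 5 seconds
echo "done waiting"    # ...while the detached sleep 30 is still running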

Reading file in parallel from multiple processes

I'm running multiple processes in parallel, and each of these processes reads the same file. It looks like some of the processes see a corrupted version of the file if I increase the number of processes to more than 15 or so. What is the recommended way of handling such a scenario?
More details:
The file being read in parallel is actually a Perl script. The multiple jobs are Python processes, and each of them launches this Perl script independently with different input parameters. When the number of jobs is increased, some of these jobs give errors that the Perl script has invalid syntax (which is not true). Hence, I suspect that some of these jobs are reading corrupted versions of the Perl script.
I'm running all of this on a 32-core machine.
If any process is also writing to the file, then you need to enforce some synchronization, for example with a global named mutex.
If there is no asynchronous writing going on, I would not expect to see corruption during the reads. Are you opening the files with "r" access? If you're still encountering trouble, it might be worth experimenting with a smaller read buffer size, or calling out to a native Win32 API for the file access.
Good luck!
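If something does rewrite the script while jobs are launching, and this is on Linux, flock(1) is one way to get that synchronization; a sketch with placeholder paths:
# Writer: hold an exclusive lock while replacing the script.
flock -x /tmp/myscript.lock -c 'cp new_version.pl scripts/myscript.pl'

# Each launcher: hold a shared lock while the script is read and executed.
flock -s /tmp/myscript.lock -c 'perl scripts/myscript.pl input_params'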

Can I choose a free CPU (I have multiple) to run my shell script on?

I have a shell script. In that script I am starting 6 new processes. My system has 4 CPUs.
If I run the shell script, the spawned processes are automatically allocated to one of the CPUs by the operating system. Now I want to reduce the total running time of my script. Is there a way I can check a processor's utilization and then choose one on which to run my process?
I do not want to run a process on a CPU that is more than 75% utilized. I would rather wait and run it on a CPU that is less than 75% utilized.
I need to program my script in such a way that it checks the four CPUs' utilization and then runs the process on a chosen CPU.
Can someone please help me with an example?
I recommend GNU Parallel:
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables. A job can also be a command that reads from a pipe. GNU parallel can then split the input and pipe it into commands in parallel.
In addition, use nice.
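For example, something along these lines (the job script and input names are placeholders; --load is GNU parallel's own load-average throttle, which matches the 75% threshold you mention):
# At most 4 jobs at a time, no new job while the load is above 75%,
# and nice keeps them polite towards other users of the box.
nice -n 10 parallel -j 4 --load 75% ./myjob.sh ::: input1 input2 input3 input4 input5 input6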
You can tell the scheduler that a certain CPU should be used, using the taskset command:
taskset -c 1 process
will tell the scheduler that process should run on CPU1.
However, I think in most cases the built-in Linux scheduler should work well.

Trouble with parallel make not always starting another job when one finishes

I'm working on a system with four logical CPUs (two dual-core CPUs, if it matters). I'm using make to parallelize twelve trivially parallelizable tasks and doing it from cron.
The invocation looks like:
make -k -j 4 -l 3.99 -C [dir] [12 targets]
The trouble I'm running into is that sometimes one job will finish but the next one won't start up, even though it shouldn't be blocked by the load-average limiter. Each target takes about four hours to complete, and I'm wondering if this might be part of the problem.
Edit: Sometimes a target does fail, but I use the -k option to have the rest of the make still run. I haven't noticed any correlation between jobs failing and the next job not starting.
I'd drop the '-l'
If all you plan to run on the system is this build, I think -j 4 does what you want.
If I remember correctly, anything else running (crond?) can push the load average over 4.
GNU make ref
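For what it's worth, a quick way to sanity-check this is to compare the load averages against the cap, then try the same invocation without it (the bracketed placeholders are from the question):
uptime                                # shows the 1-, 5- and 15-minute load averages
make -k -j 4 -C [dir] [12 targets]    # same run, with the -l cap dropped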
Does make think one of the targets is failing? If so, it will stop the make after the running jobs finish. You can use -k to tell it to continue even if an error occurs.
#BCS
I'm 99.9% sure that the -l isn't causing the problem, because I can watch the load average on the machine and it drops down to about three, and sometimes as low as one (!), without the next job starting.

Resources