Parallel processing using xargs takes too much time (~8 hrs) on some servers
I have a script that scans an entire file system and does some processing on a selective bunch of files. I am using xargs to do this in parallel. I am using xargs rather than GNU parallel because I will have to run this script on hundreds of servers, and installing the utility on all of them is not an option.
All the servers have the below configuration
Architecture: x86_64
CPU(s): 24
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
I tried increasing the number of processes, but beyond a point that doesn't help. I read somewhere that if the script is I/O bound, it's better to keep the number of processes equal to the number of cores. Is that true?
find . -type f ! -empty -print0 | xargs -L1 -P 10 -0 "./process.sh"
Am I right that the above code will make my script I/O bound? I have to scan the entire file system, so how do I optimize the code to significantly reduce the processing time?
Also, my code only needs to handle parallel processing of files in a file system. Processing the servers in parallel is taken care of.
You need to find out where your bottleneck is.
From your question it is unclear whether you have found it.
If it is the CPU, then you can use your hundreds of servers with GNU Parallel without installing GNU Parallel on all of them (are you, by the way, aware of parallel --embed, available since version 20180322? There is a sketch of it after the next command.)
You simply prefix the sshlogins with the number of CPU threads and a /. So for 24 threads:
find ... |
parallel -S 24/server1,24/server2,24/server3 command
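If installing GNU Parallel on the servers is the concern, --embed lets you carry it inside your own script. Here is a minimal sketch of that workflow (my own illustration, not part of the original answer; process_all.sh is a placeholder name, and process.sh is the script from the question):
parallel --embed > process_all.sh          # writes a script containing a copy of GNU Parallel
cat >> process_all.sh <<'EOF'
find . -type f ! -empty -print0 | parallel -0 ./process.sh
EOF
chmod +x process_all.sh                    # copy process_all.sh to the servers and run it there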
If your bottleneck is your disk, then using more servers will not help; in that case it is better to get faster storage (e.g. an SSD, mirrored disks, a RAM disk, or similar).
The optimal number of threads to use against a disk cannot in practice be predicted; it can only be measured. I have had a 40-spindle RAID system where the optimal number was 10.
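If it helps, one rough way to find that number for your own workload (a sketch of my own, reusing the process.sh from the question) is simply to time the same run with a few different -P values and keep whichever finishes fastest:
# compare wall-clock times for a few job counts
for p in 4 8 12 16 24; do
    echo "== -P $p =="
    time (find . -type f ! -empty -print0 | xargs -0 -n 1 -P "$p" ./process.sh)
done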
Related
I am trying to use the following command:
ls -1d a/*.jpg | parallel convert -resize 300x300 {} '{=s/\..*//=}'.png
However, one problem that I didn't succeed in solving is how to have the files output to folder b rather than the same folder.
I spent quite some time looking for an answer but didn't find one where the files are piped through the ls command (thousands of pictures). I would like to keep the same tools (ls piped into parallel, and convert, or mogrify if better).
First, with mogrify:
mkdir -p b # ensure output directory exists
magick mogrify -path b -resize 300x300 a/*.jpg
This creates a single mogrify process that does all the files without the overhead of creating a new process for each image. It is likely to be faster if you have a smallish number of images. The advantage of this method is that it doesn't require you to install GNU Parallel. The disadvantage is that there is no parallelism.
Second, with GNU Parallel:
mkdir -p b # ensure output directory exists
parallel --dry-run magick {} -resize 300x300 b/{/} ::: a/*.jpg
Here {/} means "the filename with the directory part removed", and GNU Parallel does it all nicely and simply for you. The --dry-run flag just prints the commands that would be run so you can check them; remove it to actually do the conversion.
If your images are large, say 8-100 megapixels, it will definitely be worth using the JPEG "shrink-on-load" feature to reduce disk I/O and memory pressure, like this:
magick -define jpeg:size=512x512 ...
in the above command.
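For instance, combined with the GNU Parallel command above, that would look roughly like this (my own combination of the two snippets, so treat it as a sketch rather than a tested command):
parallel magick -define jpeg:size=512x512 {} -resize 300x300 b/{/} ::: a/*.jpg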
This creates a new magick process for each image and is likely to be faster if you have lots of CPU cores and lots of images. If you have 12 CPU cores, it will keep all 12 busy till all your images are done; you can change the number or percentage of cores used with the -j parameter. The slight performance hit is that a new process is created for each image.
Probably the most performant option is to use GNU Parallel for parallelism along with mogrify to amortize process creation across more images, say 32, like this:
mkdir -p b
parallel -n 32 magick mogrify -path b -resize 300x300 ::: a/*.jpg
Note: You should try to avoid parsing the output of ls, as it is error-prone. I mean, avoid this:
ls file*.jpg | parallel
You should prefer feeding in filenames like this:
parallel ... ::: file*.jpg
Note: There is a -X option for GNU Parallel which is a bit esoteric and likely to only come into its own with hundreds, thousands, or millions of images. It would pass as many filenames as possible (in view of command-line length limitations) to each mogrify process, amortising the process startup cost across more files. For 99% of use cases the answers I have given should be performant enough.
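If you do want to try it, that -X variant would look something like this (a sketch only, under the same assumptions as the commands above):
mkdir -p b
parallel -X magick mogrify -path b -resize 300x300 ::: a/*.jpg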
Note: If your machine doesn't have multiple cores, or your images are very large compared to the installed RAM, or your disk subsystem is slow, your mileage will vary and it may not be worth parallelising your code. Measure and see!
Sorry if this is a repeat question and I know there are a lot of similar questions out there, but I am really struggling to find a simple answer that works.
I want to run an executable many times eg.
seq 100 | xargs -Iz ./program
However, I would like to run this over multiple cores on my machine (currently a MacBook Pro, so 4 cores) to speed things up.
I have tried using GNU parallel by looking at other answers on here, as that seems to be what I want, and I have it installed, but I can't work out how parallel works and what arguments I need in what order. None of the readme is helping, as it is trying to do much more complicated things than I want.
Could anyone help me?
Thanks
So, in order to run ./program 100 times, with GNU Parallel all you need is:
parallel -N0 ./program ::: {1..100}
If your CPU has 8 cores, it will keep 8 jobs running in parallel till all 100 are done. If you want to run, say, 12 in parallel:
parallel -j 12 -N0 ./program ::: {1..100}
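If you would rather stick with the xargs form from your question, a roughly equivalent invocation (a sketch; the 4 matches the 4 cores you mention) is:
seq 100 | xargs -Iz -P 4 ./program   # -Iz consumes one number per run without passing it to ./program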
Given: 2 Ubuntu 16.04 machines with multiple CPU cores.
I want to execute multiple instances of program fixed_arg arg2 on the machines, passing one file name per call as arg2 to the program.
So far, working with xargs, this works on a single machine:
find . -iname "*.ext" -print | xargs -n1 -P12 program fixed_arg
(This will find all files with the extension "ext" in the current directory (.), print one file per line (-print), and call xargs to run program up to 12 times in parallel (-P12) with only one argument, arg2, per call (-n1). Note the white space at the end of the whole command.)
I want to use multiple machines on which I installed the "mpich" package from the official Ubuntu 16.04 repositories.
I just do not know how to make mpiexec run my program with only one argument on multiple machines.
I do know that mpiexec will accept a list of arguments, but my list will be in the range of 800 to 2000 files, which so far has been too long for any program.
Any help is appreciated.
You have just picked the wrong instrument (or you should give us more details about your target program). MPI (the mpich implementation, with the mpiexec and mpirun commands) is not for starting unrelated programs on multiple hosts; it is for starting one program, built from the same source code, in such a way that each copy knows how many copies there are (up to hundreds of thousands) and can do well-defined point-to-point and collective message passing between the copies. It is an instrument for parallelising scientific codes, such as a computation over a huge array that can't be computed on a single machine or doesn't even fit into its memory.
A better instrument for you may be GNU parallel (https://www.gnu.org/software/parallel/). If you only have one or two machines, or it is just a few runs, it is also easy to manually split your file list into two parts and run parallel or xargs on each machine (by hand, or over ssh using authorized_keys). I'll assume that all files are accessible from both machines at the same path (an NFS share or something similar; no magic tool like MPI or GNU parallel will forward the files for you, though some modern batch processing systems may):
find . -iname "*.ext" -print > list      # build the list of files once
l=$(wc -l < list)
sp=$(( (l + 1) / 2 ))                    # round up so split produces exactly two chunks
split -l "$sp" list                      # creates xaa and xab
cat xaa | xargs -n1 -P12 program fixed_arg &
cat xab | ssh SECOND_HOST xargs -n1 -P12 program fixed_arg &
wait                                     # wait for both halves to finish
Or just learn about multi-host usage of GNU parallel: https://www.gnu.org/software/parallel/man.html
-S #hostgroup Distribute jobs to remote computers. The jobs will be run on a list of remote computers. GNU parallel will determine the number of CPU cores on the remote computers and run the number of jobs as specified by -j.
See also the manual section "EXAMPLE: Using remote computers".
It also has the magic of sending files to the remote machine with the --transferfile filename option, if you have no shared FS between the two Ubuntu machines.
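For your concrete case that could look roughly like this (a sketch; SECOND_HOST is a placeholder, and you can drop --transferfile {} if both machines already see the files at the same path):
# ':' is GNU parallel's name for the local machine, so jobs are spread over both hosts
find . -iname "*.ext" -print |
    parallel -S :,SECOND_HOST --transferfile {} program fixed_arg {}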
After reading answer to this question:
Make "make" default to "make -j 8"
I am wondering if there is a way to make the -j option automatically use the correct number of compile threads.
So I just say make, and the make command itself uses 6 or 4 or 8 threads depending on the hardware?
make does not look up the number of cores by itself; if you just use make -j, it puts no limit on the number of jobs at all. However, you should be able to determine the number of cores with
grep -c "^processor" /proc/cpuinfo
or (as per Azor-Ahai's comment, if available on your system)
nproc
Hence:
make -j $(nproc)
See "How How to obtain the number of CPUs/cores in Linux from the command line?" for more details. Also, see GNU make: should the number of jobs equal the number of CPU cores in a system?
A frequent problem I encounter is having to run some script with 50 or so different parameterizations. In the old days, I'd write something like (e.g.)
for i in `seq 1 50`
do
./myscript $i
done
In the modern era, though, all my machines can handle 4 or 8 threads at once. The scripts aren't multithreaded, so what I want to be able to do is run 4 or 8 parameterizations at a time and automatically start new jobs as the old ones finish. I can rig up a haphazard system myself (and have in the past), but I suspect there must be a Linux utility that does this already. Any suggestions?
GNU parallel does this. With it, your example becomes:
parallel ./myscript ::: {1..50}
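If GNU Parallel is not installed on a given machine, xargs -P (available in both GNU and BSD xargs) does much the same job; a sketch:
# run ./myscript for each of 1..50, at most 8 at a time
seq 1 50 | xargs -n 1 -P 8 ./myscript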