How to convert images with parallel from one directory to another - parallel-processing

I am trying to use the following command:
ls -1d a/*.jpg | parallel convert -resize 300x300 {}'{=s/\..*//=}'.png
However, one problem I have not managed to solve is getting the output files written to folder b rather than the same folder.
I spent quite some time looking for an answer but didn't find one where the files are piped in through ls (there are thousands of pictures). I would like to keep the same tools (an ls pipe, parallel, and convert, or mogrify if that's better).

First, with mogrify:
mkdir -p b # ensure output directory exists
magick mogrify -path b -resize 300x300 a/*.jpg
This creates a single mogrify process that does all the files without the overhead of creating a new process for each image. It is likely to be faster if you have a smallish number of images. The advantage of this method is that it doesn't require you to install GNU Parallel. The disadvantage is that there is no parallelism.
Second, with GNU Parallel:
mkdir -p b # ensure output directory exists
parallel --dry-run magick {} -resize 300x300 b/{/} ::: a/*.jpg
Here {/} means "the filename with the directory part removed", and GNU Parallel does it all nicely and simply for you. The --dry-run option just prints the commands that would be run; remove it once you are happy with what you see.
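If you want PNG output, as in your original attempt, the {/.} replacement string ("the basename with the directory and extension removed") lets you change the extension at the same time; a minimal sketch along the same lines:
parallel magick {} -resize 300x300 b/{/.}.png ::: a/*.jpg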
If your images are large, say 8-100 megapixels, it will definitely be worth using the JPEG "shrink-on-load" feature to reduce disk i/o and memory pressure like this:
magick -define jpeg:size=512x512 ...
in the above command.
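Concretely, that looks something like this (a sketch of the parallel variant; note that the -define has to come before the input filename so it takes effect while the JPEG is being decoded):
parallel magick -define jpeg:size=512x512 {} -resize 300x300 b/{/} ::: a/*.jpg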
This creates a new magick process for each image, and is likely to be faster if you have lots of CPU cores and lots of images. If you have 12 CPU cores it will keep all 12 busy until all your images are done; you can change the number or percentage of cores used with the -j parameter. The slight performance hit is the cost of starting a new process per image.
Probably the most performant option is to use GNU Parallel for parallelism along with mogrify to amortize process creation across more images, say 32, like this:
mkdir -p b
parallel -n 32 magick mogrify -path b -resize 300x300 ::: a/*.jpg
Note: You should try to avoid parsing the output of ls, it is error prone. I mean avoid this:
ls file*.jpg | parallel
You should prefer feeding in filenames like this:
parallel ... ::: file*.jpg
Note: There is a -X option for GNU Parallel which is a bit esoteric and likely to only come into its own with hundreds/thousands/millions of images. That would pass as many filenames as possible (in view of command-line length limitations) to each mogrify process. And amortise the process startup costs across more files. For 99% of use cases the answers I have given should be performant enough.
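For example, a sketch of the -X variant with the same resize settings:
parallel -X magick mogrify -path b -resize 300x300 ::: a/*.jpg
With -X, each mogrify invocation gets as many filenames as will fit on a command line, while the work is still spread across your CPU cores.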
Note: If your machine doesn't have multiple cores, or your images are very large compared to the installed RAM, or your disk subsystem is slow, your mileage will vary and it may not be worth parallelising your code. Measure and see!

Related

How can I pass a file argument to my bash script twice and modify the filename

We have a large number of files in a directory which need to be processed by a program called process, which takes two arguments, infile and outfile.
We want to name the outfile after infile and add a suffix. E.g. for processing a single file, we would do:
process somefile123 somefile123-processed
How can we process all files at once from the command line?
(This is in a bash command line)
As @Cyrus says in the comments, the usual way to do that would be:
for f in *; do process "$f" "${f}-processed" & done
However, that may be undesirable and lead to a "thundering herd" of processes if you have thousands of files, so you might consider GNU Parallel, which is:
more controllable,
can give you progress reports,
is easier to type,
and by default runs one process per CPU core to keep them all busy. So, that would become:
parallel process {} {}-processed ::: *
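For the progress reporting mentioned above, GNU Parallel's --bar and --eta options work directly with this command; a small sketch:
parallel --bar process {} {}-processed ::: *
--bar draws a progress bar on stderr; --eta instead prints an estimated time of completion.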

Parallel processing using xargs - optimizing shell script

Parallel processing using xargs - takes too much time ( ~8 hrs) on some servers
I have a script that scans an entire file system and does some processing on a selective bunch of files. I am using xargs to do this in parallel. I am using xargs rather than GNU Parallel because I will have to run this script on hundreds of servers, and installing the utility on all of them is not an option.
All the servers have the below configuration
Architecture: x86_64
CPU(s): 24
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
I tried increasing the number of processes, but beyond a point that doesn't help. I read somewhere that if the script is I/O bound, it's better to keep the number of processes equal to the number of cores. Is that true?
find . -type f ! -empty -print0 | xargs -L1 -P 10 -0 "./process.sh"
I believe the above code will make my script I/O bound?
I have to scan the entire file system. How do I optimize the code so I can significantly reduce the processing time?
Also, my code only needs to handle parallel processing of files in a file system. Processing the servers in parallel is taken care of.
You need to find where your bottleneck is.
From your question it is unclear whether you have found where your bottleneck is.
If it is CPU, then you can use your 100 servers with GNU Parallel without installing GNU Parallel on all of them (are you, by the way, aware of parallel --embed, available since 20180322?).
You simply prefix the sshlogins with the number of CPU threads and a slash. So for 24 threads:
find ... |
parallel -S 24/server1,24/server2,24/server3 command
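If installing GNU Parallel on every server is the sticking point, the --embed option mentioned above generates a self-contained shell script with GNU Parallel baked in; a sketch (the script name is just a placeholder, and you append your own pipeline at the end of the generated file):
parallel --embed > myscript.sh
# edit myscript.sh: add your find | parallel pipeline at the bottom, then copy that one file to the servers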
If your bottleneck is your disk, then using more servers will not help.
Then it is better to get a faster disk (e.g. SSD, mirrored disks, RAM-disks and similar).
The optimal number of threads to use on a disk can in practice not be predicted. It can only be measured. I have had a 40 spindle RAID system where the optimal number was 10 threads.
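Since it can only be measured, one crude way to measure it is to time the same workload at a few different parallelism levels and pick the fastest; a rough sketch using the xargs command from the question:
for p in 4 8 12 24 48
do
    echo "== -P $p =="
    time (find . -type f ! -empty -print0 | xargs -0 -n1 -P "$p" ./process.sh)
done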

mpich pass argument per CPU used

Given: 2 Ubuntu 16.04 machines with multiple CPU cores.
I want to execute multiple instances of program fixed_arg arg2 on the machines, passing one file name per call as arg2 to the program.
So far, working with xargs, this works on a single machine:
find . -iname "*.ext" -print | xargs -n1 -P12 program fixed_arg
(This will find all files with extension "ext" in the current directory (.), print one file per line (-print), and use xargs to run up to 12 instances of program in parallel (-P12), with only one argument arg2 per call (-n1). Note the white space at the end of the whole command.)
I want to use multiple machines on which I installed the "mpich" package from the official Ubuntu 16.04 repositories.
I just do not know how to make mpiexec to run my program with only one argument on multiple machines.
I do know that mpiexec will accept a list of arguments, but my list will be in the range of 800 to 2000 files, which so far has been too long for any program.
Any help is appreciated.
You have just selected the wrong instrument (or give us more details about your target program). MPI (the mpich implementation, with its mpiexec and mpirun commands) is not for starting unrelated programs on multiple hosts; it is for starting one program, with exactly the same source code, in such a way that the program knows how many copies there are (up to hundreds of thousands) and can do well-defined point-to-point and collective message passing between the copies. It is an instrument for parallelising scientific codes, such as a computation over a huge array that can't be computed on a single machine or doesn't even fit into its memory.
A better instrument for you may be GNU Parallel (https://www.gnu.org/software/parallel/). If you have only one or two machines, or it is just a few runs, it is easier to manually split your file list into two parts and run parallel or xargs on each machine (by hand, or over ssh using authorized_keys). I'll assume that all files are accessible from both machines at the same path (an NFS share or similar; no magic tool like MPI or GNU Parallel will forward files for you, although some modern batch processing systems may):
find . -iname "*.ext" -print > list     # build the full file list
l=$(wc -l < list)                       # count the files
sp=$((l/2))                             # half of them per machine
split -l $sp list                       # produces xaa and xab
cat xaa | xargs -n1 -P12 program fixed_arg &                    # first half locally
cat xab | ssh SECOND_HOST xargs -n1 -P12 program fixed_arg &    # second half on the other machine
wait
Or just learn about multi-host usage of GNU parallel: https://www.gnu.org/software/parallel/man.html
-S #hostgroup Distribute jobs to remote computers. The jobs will be run on a list of remote computers. GNU parallel will determine the number of CPU cores on the remote computers and run the number of jobs as specified by -j.
EXAMPLE: Using remote computers
It also has the magic of sending files to the remote machine with the --transferfile filename option if you have no shared filesystem between the two Ubuntu machines.
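A minimal multi-host sketch along those lines (SECOND_HOST is a placeholder; -S : adds the local machine to the pool, --transfer copies each input file to the remote side before the job runs, and --cleanup removes it afterwards):
find . -iname "*.ext" -print | parallel -S :,SECOND_HOST --transfer --cleanup program fixed_arg {}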

acroread pdf to postscript conversion too slow

I am converting a PDF file to a PostScript file using the acroread command.
The conversion is successful, but it is too slow and uses almost 100% of the CPU;
because of this my application hangs for some time and no user is able to do
anything.
The code I am using is:
processBuilder = new ProcessBuilder("bash","-c","acroread -toPostScript -size "+width+"x"+height+" -optimizeForSpeed sample.pdf");
pp = processBuilder.start();
pp.waitFor();
Is there a way to speed up the process and make it use a smaller percentage of the CPU? Please help!
I'd suggest you start by using strace on the command line to diagnose the problem.
strace -tt -f acroread -toPostScript -size 1000x2500 -optimizeForSpeed sample.pdf
I suspect you may find it spends a lot of time reading font files.
If you have a choice, then poppler or Xpdf or even Ghostscript would be better supported and more performant options, especially considering acroread is now unsupported on Linux.
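For comparison, the equivalent conversions with those tools look roughly like this (treat them as starting points; size and quality options differ from acroread's):
pdftops sample.pdf sample.ps
gs -q -dNOPAUSE -dBATCH -sDEVICE=ps2write -sOutputFile=sample.ps sample.pdf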

One-liner to split very large directory into smaller directories on Unix

How do you split a very large directory, containing potentially millions of files, into smaller directories with some custom-defined maximum number of files, such as 100 per directory, on UNIX?
Bonus points if you know of a way to have wget download files into these subdirectories automatically. So if there are 1 million .html pages at the top-level path at www.example.com, such as
/1.html
/2.html
...
/1000000.html
and we only want 100 files per directory, it will download them to folders something like
./www.example.com/1-100/1.html
...
./www.example.com/999901-1000000/1000000.html
I only really need to be able to run the UNIX command on the folder after wget has downloaded the files, but if it's possible to do this with wget as it's downloading, I'd love to know!
Another option:
i=1;while read l;do mkdir $i;mv $l $((i++));done< <(ls|xargs -n100)
Or using parallel:
ls|parallel -n100 mkdir {#}\;mv {} {#}
-n100 takes 100 arguments at a time and {#} is the sequence number of the job.
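A variant of the same idea that also copes with spaces or other odd characters in filenames (a sketch; it feeds NUL-delimited names from find instead of parsing ls):
find . -maxdepth 1 -type f -print0 | parallel -0 -n100 'mkdir -p {#}; mv {} {#}'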
You can run this through a couple of loops, which should do the trick (at least for the numeric part of the file name). I think that doing this as a one-liner is over-optimistic.
#! /bin/bash
for hundreds in {0..99}
do
    min=$(($hundreds*100+1))
    max=$(($hundreds*100+100))
    current_dir="$min-$max"
    mkdir $current_dir
    for ones_tens in {1..100}
    do
        current_file="$(($hundreds*100+$ones_tens)).html"
        #touch $current_file
        mv $current_file $current_dir
    done
done
I did performance testing by first commenting out mkdir $current_dir and mv $current_file $current_dir and uncommenting touch $current_file. This created 10000 files (one-hundredth of your target of 1000000 files). Once the files were created, I reverted to the script as written:
$ time bash /tmp/test.bash 2>&1
real 0m27.700s
user 0m26.426s
sys 0m17.653s
As long as you aren't moving files across file systems, the time for each mv command should be constant, so you should see similar or better performance. Scaling this up to a million files would give you around 2,770 seconds, i.e. roughly 46 minutes. There are several avenues for optimization, such as moving all files for a given directory in one command (see the sketch below), or removing the inner for loop.
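For the "move all files for a given directory in one command" idea, a sketch assuming the simple 1.html..1000000.html naming from the question:
for hundreds in {0..9999}
do
    min=$((hundreds*100+1))
    max=$((hundreds*100+100))
    dir="$min-$max"
    mkdir -p "$dir"
    mv $(seq -f '%g.html' "$min" "$max") "$dir"
done
One mv per 100 files cuts the number of processes to a small fraction of the per-file version.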
Doing the wget to grab a million files is going to take far longer than this, and is almost certainly going to require some optimization; conserving bandwidth via HTTP headers alone will cut down run time by hours. I don't think that a shell script is the right tool for that job; a library such as WWW::Curl from CPAN will be much easier to optimize.
To make ls|parallel more practical to use, add a variable assignment to the destination dir:
DST=../brokenup; ls | parallel -n100 mkdir -p $DST/{#}\;cp {} $DST/{#}
Note: cd <src_large_dir> before executing.
The DST defined above will contain a copy of the current directory's files, but a maximum of 100 per subdirectory.

Resources