Splitting command line args with GNU parallel - bash

Using GNU parallel: http://www.gnu.org/software/parallel/
I have a program that takes two arguments, e.g.
$ ./prog file1 file2
$ ./prog file2 file3
...
$ ./prog file23456 file23457
I'm using a script that generates the file name pairs. However, this poses a problem because the script's output is a single string, not a pair, like:
$ ./prog "file1 file2"
GNU parallel seems to have a slew of tricks up its sleeve; I wonder if there's one for splitting text around separators:
$ generate_file_pairs | parallel ./prog ?
# where ? is text under consideration, like "file1 file2"
The easy work around is to split the args manually in prog, but I'd like to know if it's possible in GNU parallel.

You are probably looking for --colsep.
generate_file_pairs | parallel --colsep ' ' ./prog {1} {2}
Read man parallel for more, and watch the intro video if you have not already done so: http://www.youtube.com/watch?v=OpaiGYxkSuQ
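A quick way to verify the splitting before running anything, assuming generate_file_pairs prints one space-separated pair per line (echo stands in for ./prog here, so the commands are printed rather than executed):
generate_file_pairs | parallel --colsep ' ' echo ./prog {1} {2}
# expected output, one command per pair, e.g.:
# ./prog file1 file2
# ./prog file2 file3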

You are looking for the -n option of parallel:
./generate_file_pairs | parallel -n 2 ./prog {}
Excerpt from GNU Parallel Doc:
-n max-args
Use at most max-args arguments per command line. Fewer than max-args
arguments will be used if the size (see the -s option) is exceeded,
unless the -x option is given, in which case GNU parallel will exit.
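Note that -n 2 groups two input lines per command, so this assumes the generator prints one file name per line. A quick sanity check, with echo standing in for ./prog (when no replacement string is given, the grouped arguments are appended at the end):
printf '%s\n' file1 file2 file3 file4 | parallel -k -n 2 echo ./prog
# ./prog file1 file2
# ./prog file3 file4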

Quite late to the party here, but I bump into this problem fairly often and have found a nice, easy solution.
Before passing the arg list to parallel, just replace all the spaces with newlines; I've found tr to be the fastest for this kind of thing.
Not working
echo "1 2 3 4 5" | parallel echo --
-- 1 2 3 4 5
Working
echo "1 2 3 4 5" | tr ' ' '\n' | parallel echo --
-- 1
-- 2
-- 3
-- 4
-- 5
Protip: before actually running the parallel command, I do two things to check that the arguments have been split correctly.
Prepend echo in front of your bash command. This means that any commands that will eventually be executed will be printed for you to check first.
Add a marker in the echo; this checks that the parallel split is actually working (see the sketch below).
Note: this works best with small/medium argument lists. If the argument list is very large, it's probably best to just use a for loop to echo each argument to parallel.
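Putting both checks together for the problem in the question, a sketch (-- is the marker and echo prints the commands instead of running them):
echo "file1 file2 file3 file4" | tr ' ' '\n' | parallel -k -n 2 echo ./prog --
# ./prog -- file1 file2
# ./prog -- file3 file4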

In Parallel's manual, it is said:
If no command is given, the line of input is executed ... GNU parallel can often be used as a substitute for xargs or cat | bash.
So give this a try:
generate command | parallel
Try to understand the output of this:
for i in {1..5};do echo "echo $i";done | parallel
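Applied to the question at the top, a sketch of the same idea (assuming generate_file_pairs prints lines like "file1 file2"): turn each line into a complete command and let parallel execute it.
generate_file_pairs | sed 's|^|./prog |' | parallel
# each input line becomes e.g. "./prog file1 file2" and is executed as-is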

Related

Terminal Command to run tests using GNU Parallel

I have a folder of problems which are like this:
problem1, domain 1
problem2, domain 2
problem3, domain 3
I want to use GNU Parallel to run a bunch of problems like this. This is a short version of what I have tried:
seq 01 20 | parallel -k -j6 java pddl/benchmarks_STRIPS/psr/p{}-domain.pddl -f pddl/benchmarks_STRIPS/psr/p{}.pddl
I want some sort of command that will tell GNU parallel that domain 1 is to be compiled with problem 1, domain 2 with problem 2, etc.
Is there a way to do this using GNU Parallel, or should I write each one out individually?
I think it may be a problem with zero-padding, as my seq command doesn't zero-pad numbers.
If you have bash 4+ (I think that's the correct version), you can use:
printf "%s\n" {01..20} | parallel ...
Or, if you have an older bash, you could use something like:
printf "%02d\n" {1..20} | parallel ...
I assume the pXX-domain.pddl files exist. You can use GNU Parallel's {= =} syntax to compute the pXX name:
parallel -k -j6 java {} -f '{= s/-domain(\.pddl)$/$1/ =}' ::: pddl/benchmarks_STRIPS/psr/p*-domain.pddl
Or if the opposite is true:
parallel -k -j6 java '{= s/(\.pddl)$/-domain$1/ =}' -f {} ::: pddl/benchmarks_STRIPS/psr/p??.pddl
Requires GNU Parallel 20140722.
This way you do not need to know in advance which files exist.
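To see the generated commands without running them, --dry-run works as a quick check; a sketch based on the first variant above:
parallel --dry-run -k -j6 java {} -f '{= s/-domain(\.pddl)$/$1/ =}' ::: pddl/benchmarks_STRIPS/psr/p*-domain.pddl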

How to split files up and process them in parallel and then stitch them back? unix

I have a text file infile.txt as such:
abc what's the foo bar.
foobar hello world, hhaha cluster spatio something something.
xyz trying to do this in parallel
kmeans you're mean, who's mean?
Each line in the file will be processed by this perl command into the out.txt
cat infile.txt | perl dosomething > out.txt
Imagine if the text file is 100,000,000 lines. I want to parallelize the bash command, so I tried something like this:
$ mkdir splitfiles
$ mkdir splitfiles_processed
$ cd splitfiles
$ split -n3 ../infile.txt
$ for i in $(ls); do cat "$i" | perl dosomething > "../splitfiles_processed/$i" & done
$ wait
$ cd ../splitfiles_processed
$ cat * > ../infile_processed.txt
But is there a less verbose way to do the same?
The answer from @Ulfalizer gives you a good hint about the solution, but it lacks some details.
You can use GNU parallel (apt-get install parallel on Debian)
So your problem can be solved using the following command:
cat infile.txt | parallel -l 1000 -j 10 -k --spreadstdin perl dosomething > result.txt
Here is the meaning of the arguments:
-l 1000: send blocks of 1000 lines to the command
-j 10: launch 10 jobs in parallel
-k: keep the sequence of output
--spreadstdin: send the above 1000-line blocks to the stdin of the command
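If your version of parallel does not support --spreadstdin, roughly the same effect can be had with --pipe, which also chunks stdin and feeds each chunk to the command's stdin (with --pipe, -N sets the number of records, i.e. lines, per chunk). A hedged equivalent:
cat infile.txt | parallel --pipe -N 1000 -j 10 -k perl dosomething > result.txt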
I've never tried it myself, but GNU parallel might be worth checking out.
Here's an excerpt from the man page (parallel(1)) that's similar to what you're currently doing. It can split the input in other ways too.
EXAMPLE: Processing a big file using more cores
To process a big file or some output you can use --pipe to split up
the data into blocks and pipe the blocks into the processing program.
If the program is gzip -9 you can do:
cat bigfile | parallel --pipe --recend '' -k gzip -9 >bigfile.gz
This will split bigfile into blocks of 1 MB and pass that to gzip -9
in parallel. One gzip will be run per CPU core. The output of gzip -9
will be kept in order and saved to bigfile.gz
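Adapted to the perl command from the question, the same pattern looks roughly like this (default 1 MB blocks, split at newlines; -k keeps the output in input order):
cat infile.txt | parallel --pipe -k perl dosomething > out.txt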
Whether this is worthwhile depends on how CPU-intensive your processing is. For simple scripts you'll spend most of the time shuffling data to and from the disk, and parallelizing won't get you much.
You can find some introductory videos by the GNU Parallel author here.
Assuming your limiting factor is NOT your disk, you can do this in perl with fork() and specifically Parallel::ForkManager:
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

my $max_forks = 8; # 2x procs is usually optimal

sub process_line {
    # do something with this line
}

my $fork_manager = Parallel::ForkManager->new($max_forks);

open( my $input, '<', 'infile.txt' ) or die $!;
while ( my $line = <$input> ) {
    $fork_manager->start and next;
    process_line($line);
    $fork_manager->finish;
}
close($input);
$fork_manager->wait_all_children();
The downside of doing something like this, though, is coalescing your output. Each parallel task doesn't necessarily finish in the sequence it started, so you have all sorts of potential problems regarding serialising the results.
You can work around these with something like flock, but you need to be careful, as too many locking operations can take away your parallel advantage in the first place. (Hence my first statement - if your limiting factor is disk IO, then parallelism doesn't help very much at all anyway.)
There are various possible solutions, though - so much so that there's a whole chapter on it in the perl docs: perlipc - but keep in mind you can retrieve data with Parallel::ForkManager too.

Joining every group of N lines into one with bash

I would like to join every group of N lines in the output of another command using bash.
Are there any standard Linux commands I can use to achieve this?
Example:
./command
46.219464 0.000993
17.951781 0.002545
15.770583 0.002873
87.431820 0.000664
97.380751 0.001921
25.338819 0.007437
Desired output:
46.219464 0.000993 17.951781 0.002545
15.770583 0.002873 87.431820 0.000664
97.380751 0.001921 25.338819 0.007437
If your output has a consistent number of fields, you can use xargs -n N to group N elements per line:
$ ...command... | xargs -n4
46.219464 0.000993 17.951781 0.002545
15.770583 0.002873 87.431820 0.000664
97.380751 0.001921 25.338819 0.007437
From man xargs:
-n max-args, --max-args=max-args
Use at most max-args arguments per command line. Fewer than max-args
arguments will be used if the size (see the -s option) is exceeded,
unless the -x option is given, in which case xargs will exit.
Seems like you're trying to join every two lines with the delimiter \t (tab). If so, you could try the paste command below:
command | paste -d'\t' - -
If you want a space as the delimiter, then use -d' ':
command | paste -d' ' - -
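paste consumes one input line per - placeholder (and reuses the delimiter list), so joining every N lines just means writing N dashes; e.g. to match the four-column output shown above:
command | paste -d' ' - - - -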

xargs input involving spaces

I am working on a Mac using OSX and I'm using bash as my shell. I have a script that goes something to the effect of:
VAR1="pass me into parallel please!"
VAR2="oh me too, and there's actually a lot of us, but its best we stay here too"
printf "%s\n" {0..249} | xargs -0 -P 8 -n 1 . ./parallel.sh
I get the error: xargs: .: Permission denied. The purpose is to run another script in parallel (called parallel.sh) which gets fed the numbers 0-249. Additionally, I want to make sure that parallel.sh can see and use VAR1 and VAR2. But when I try to source the script with . ./parallel.sh, xargs doesn't like that. The point of sourcing is that the script has other variables I wish parallel.sh to have access to.
I have read something about using -print0, since xargs separates its inputs by spaces, but I didn't really understand what -print0 does and how to use it. Thanks for any help you guys can offer.
If you want several processes running the script, then they can't be part of the parent process, and therefore they can't access the exact same variables. However, if you export your variables, then each process gets a copy of them:
export VAR1="pass me into parallel please!"
export VAR2="oh me too, and there's actually a lot of us, but its best we stay here too"
printf "%s\n" {0..249} | xargs -P 8 -n 1 ./parallel.sh
Now you can just drop the extra dot since you aren't sourcing the parallel.sh script, you are just running it.
Also there is no need to use -0 since your input is just a series of numbers, one on each line.
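For completeness, a minimal sketch of what parallel.sh could look like under this approach (the body here is purely illustrative; the real script is whatever you already have):
#!/bin/bash
# receives one number per invocation as $1; sees VAR1/VAR2 because the caller exported them
echo "task $1: $VAR1 / $VAR2"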
To avoid the space problem I'd use new line character as separator for xargs with the -d option:
xargs -d '\n' ...
I think you have permission issues; try setting execute permission on the file "parallel.sh".
The command works fine for me:
Kaizen ~/so_test $ printf "%s\n" {0..4} | xargs -0 -P 8 -n 1 echo
0
1
2
3
4
man find :
-print0
True; print the full file name on the standard output, followed by a
null character (instead of the newline character that -print uses).
This allows file names that contain newlines or other types of white space to be correctly interpreted by programs that process the find
output. This option corresponds to the -0 option of xargs.
For -print0 usage, check out this related Stack Overflow question:
Capturing output of find . -print0 into a bash array
The issue of passing arguments is related to xargs's interpretation of whitespace. From the xargs man page:
-0 Change xargs to expect NUL (``\0'') characters as separators, instead of spaces and newlines.
The issue of environment variables can be solved by using export to make the variables available to subprocesses:
say.sh
echo "$1 $V"
result
bash$ export V=whatevs
bash$ printf "%s\n" {0..3} | xargs -P 8 -n 1 ./say.sh
1 whatevs
2 whatevs
0 whatevs
3 whatevs

Using GNU Parallel With Split

I'm loading a pretty gigantic file into a postgresql database. To do this I first use split on the file to get smaller files (30 GB each) and then I load each smaller file into the database using GNU Parallel and psql copy.
The problem is that it takes about 7 hours to split the file, and only then does it start to load one file per core. What I need is a way to tell split to print the file name to standard output each time it finishes writing a file, so I can pipe it to Parallel and it starts loading the files as soon as split finishes writing them. Something like this:
split -l 50000000 2011.psv carga/2011_ | parallel ./carga_postgres.sh {}
I have read the split man pages and I can't find anything. Is there a way to do this with split or any other tool?
You could let parallel do the splitting:
<2011.psv parallel --pipe -N 50000000 ./carga_postgres.sh
Note that the manpage recommends using --block over -N; this will still split the input at record separators, \n by default, e.g.:
<2011.psv parallel --pipe --block 250M ./carga_postgres.sh
Testing --pipe and -N
Here's a test that splits a sequence of 100 numbers into 5 files:
seq 100 | parallel --pipe -N23 'cat > /tmp/parallel_test_{#}'
Check result:
wc -l /tmp/parallel_test_[1-5]
Output:
23 /tmp/parallel_test_1
23 /tmp/parallel_test_2
23 /tmp/parallel_test_3
23 /tmp/parallel_test_4
8 /tmp/parallel_test_5
100 total
If you use GNU split, you can do this with the --filter option:
‘--filter=command’
With this option, rather than simply writing to each output file, write through a pipe to the specified shell command for each output file. command should use the $FILE environment variable, which is set to a different output file name for each invocation of the command.
You can create a shell script that writes the file and then starts carga_postgres.sh on it in the background:
#! /bin/sh
# $FILE is set by split to the name of the current output file
cat >"$FILE"
./carga_postgres.sh "$FILE" &
and use that script as the filter
split -l 50000000 --filter=./filter.sh 2011.psv
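One practical note: --filter runs the command through a shell, so the filter script needs to be executable; and you can keep the output prefix from the original command:
chmod +x filter.sh carga_postgres.sh
split -l 50000000 --filter=./filter.sh 2011.psv carga/2011_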
