Terminal Command to run tests using GNU Parallel

I have a folder of problems which are like this:
problem1, domain 1
problem2, domain 2
problem3, domain 3
I want to use GNU Parallel to run a bunch of problems like this. This is a short version of what I have tried:
seq 01 20 | parallel -k -j6 java pddl/benchmarks_STRIPS/psr/p{}-domain.pddl -f pddl/benchmarks_STRIPS/psr/p{}.pddl
I want some sort of command that will tell GNU Parallel that domain 1 is to be compiled with problem 1, domain 2 with problem 2, and so on.
Is there a way to do this with GNU Parallel, or should I write each one out individually?

I think it may be a problem with zero-padding, as my seq command doesn't zero-pad numbers.
If you have bash 4+ (I think that's the correct version), you can use:
printf '%s\n' {01..20} | parallel ...
Or, if you have an older bash, you could use something like:
printf "%02d\n" {1..20} | parallel ...

I assume the pXX-domain.pddl files exist. You can use GNU Parallel's {= =} syntax to compute the pXX name:
parallel -k -j6 java {} -f '{= s/-domain(\.pddl)$/$1/ =}' ::: pddl/benchmarks_STRIPS/psr/p*-domain.pddl
Or, if instead it is the pXX.pddl files that exist:
parallel -k -j6 java '{= s/(\.pddl)$/-domain$1/ =}' -f {} ::: pddl/benchmarks_STRIPS/psr/p??.pddl
Requires GNU Parallel 20140722.
This way you do not need to know in advance which files exist.
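To verify the pairing before launching anything, --dry-run prints the generated commands instead of running them:
parallel --dry-run -k -j6 java {} -f '{= s/-domain(\.pddl)$/$1/ =}' ::: pddl/benchmarks_STRIPS/psr/p*-domain.pddl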


Is there a better way to 'use arguments in a pipes sequence' with xargs?

On several occasions I want to take xargs-generated arguments and use them in an elaborate command pipeline (piping the results through several steps).
I typically end up using bash -c with the xargs-generated argument as its only argument. E.g.:
find *.obj | xargs -I{} bash -c 'sha256sum "$0" | tee "$0.sha256"' {}
Is there any better or more concise way to do this?
I think it is more concise with GNU Parallel as follows:
find "*.obj" -print0 | parallel -0 sha256sum {} \| tee {}.sha256
In addition, it is:
potentially more performant since it works in parallel across all CPU cores
debuggable by using parallel --dry-run ...
more informative by using --bar or --eta to get a progress bar or ETA for completion
more flexible, since it gives you many predefined variables, such as {.} meaning the current file minus its extension, {/} meaning "base name" of current file, {//} meaning directory of current file and so on
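A quick way to see those replacement strings in action (the path here is made up):
parallel echo {} {.} {/} {//} ::: some/dir/file.obj
some/dir/file.obj some/dir/file file.obj some/dir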

bash: split ascii file into n parts; iterate over ONLY those files

I have an ASCII file of a few thousand lines, processed one line at a time by a bash script. Because the processing is embarrassingly parallel, I'd like to split the file into parts of roughly the same size, preserving line breaks, one part per CPU core. Unfortunately the file suffixes made by split -n r/$numberOfCores aren't easily iterated over.
split --numeric-suffixes=1 -n r/42 ... makes files foo.01, foo.02, ..., foo.42, which can be iterated over with for i in `seq -w 1 42` (because -w adds a leading zero). But if the 42 changes to something smaller than 10, the files still have the leading zero but the seq output doesn't, so it fails. This concern is valid, because nowadays some PCs have fewer than 10 cores and some have more. A ghastly workaround:
(( numOfCores < 10 )) && optionForSeq="" || optionForSeq="-w"
The naive solution for f in foo.* is risky: the wildcard might match files other than the ones that split made.
An ugly way to make the suffixes seq-friendly, but with the same risk:
split -n r/$numOfCores infile foo.
for i in `seq 1 $numOfCores`; do
    mv `ls foo.* | head -1` newPrefix.$i
done
for i in `seq 1 $numOfCores`; do
    ... newPrefix.$i ...
done
Is there a cleaner, robust way of splitting the file into n parts, where 1<=n<=64 isn't known until runtime, and then iterating over those parts? Should I split only into a freshly created directory?
(Edit: To clarify "if the 42 changes to something smaller than 10," the same code should work on a PC with 8 cores and on another PC with 42 cores.)
A seq-based solution is clunky. A wildcard-based solution is risky. Is there an alternative to split? (csplit with line numbers would be even clunkier.) A gawk one-liner?
How about using a format string with seq?
$ seq -f '%02g' 1 4
01
02
03
04
$ seq -f '%02g' 1 12
01
02
03
...
09
10
11
12
With GNU bash 4:
Use printf to format your numbers:
for ((i=1;i<=4;i++)); do printf -v num "%02d" $i; echo "$num"; done
Output:
01
02
03
04
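A sketch of how this plugs back into the split question, assuming split was run with --numeric-suffixes=1 and the prefix foo. so the parts are named foo.01, foo.02, ...:
for ((i=1; i<=numOfCores; i++)); do
    printf -v num "%02d" "$i"
    ... foo.$num ...
done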
Are you sure this is not a job for GNU Parallel?
cat file | parallel --pipe -N1 myscript_that_reads_one_line_from_stdin
This way you do not need to have the temporary files at all.
If your script can read more than one line (so it is in practice a UNIX filter), then this should be very close to optimal:
parallel --pipepart -k --roundrobin -a file myscript_that_reads_from_stdin
It will spawn one job per core and split the file into one part per core on the fly. If some lines are harder to process than others (i.e. you can get "stuck" for a while on a single line), then this solution might be better:
parallel --pipepart -k -a file myscript_that_reads_from_stdin
It will spawn one job per core and split the file into 10 parts per core on the fly, thus running on average 10 jobs per core in total.
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. GNU Parallel instead spawns a new job as soon as one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
Get the filenames with ls and then filter them with a regex:
for n in $(ls foo.* | grep '^foo\.[0-9][0-9]*$'); do ... "$n" ...; done
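Alternatively, the "freshly created directory" idea from the question sidesteps the wildcard risk entirely; a rough sketch, assuming GNU split and mktemp are available:
dir=$(mktemp -d)
split -n "r/$numOfCores" infile "$dir/part."
for f in "$dir"/part.*; do
    ... "$f" ...
done
rm -r "$dir"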

Fastest possible grep

I'd like to know if there is any tip to make grep as fast as possible. I have a rather large base of text files to search in the quickest possible way. I've made them all lowercase, so that I could get rid of the -i option. This makes the search much faster.
Also, I've found out that -F and -P modes are quicker than the default one. I use the former when the search string is not a regular expression (just plain text), the latter if regex is involved.
Does anyone have any experience in speeding up grep? Maybe compile it from scratch with some particular flag (I'm on Linux CentOS), organize the files in a certain fashion or maybe make the search parallel in some way?
Try with GNU parallel, which includes an example of how to use it with grep:
grep -r greps recursively through directories. On multicore CPUs GNU
parallel can often speed this up.
find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}
This will run 1.5 jobs per core, and give 1000 arguments to grep.
For big files, it can split the input into several chunks with the --pipe and --block arguments:
parallel --pipe --block 2M grep foo < bigfile
You could also run it on several different machines through SSH (ssh-agent needed to avoid passwords):
parallel --pipe --sshlogin server.example.com,server2.example.net grep foo < bigfile
If you're searching very large files, then setting your locale can really help.
GNU grep goes a lot faster in the C locale than with UTF-8.
export LC_ALL=C
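For a one-off search you can also scope the locale to a single invocation (the pattern and file name here are placeholders):
LC_ALL=C grep -F foo bigfile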
Ripgrep claims to now be the fastest.
https://github.com/BurntSushi/ripgrep
Also includes parallelism by default
-j, --threads ARG
The number of threads to use. Defaults to the number of logical CPUs (capped at 6). [default: 0]
From the README
It is built on top of Rust's regex engine. Rust's regex engine uses
finite automata, SIMD and aggressive literal optimizations to make
searching very fast.
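A minimal invocation sketch (pattern, thread count and path are placeholders):
rg foo
rg -j 4 foo src/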
Apparently using --mmap can help on some systems:
http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
Not strictly a code improvement but something I found helpful after running grep on 2+ million files.
I moved the operation onto a cheap SSD drive (120GB). At about $100, it's an affordable option if you are crunching lots of files regularly.
If you don't care about which files contain the string, you might want to separate reading and grepping into two jobs, since spawning grep once for each small file can be costly.
If you have one very large file:
parallel -j100% --pipepart --block 100M -a <very large SEEKABLE file> grep <...>
Many small compressed files (sorted by inode)
ls -i | sort -n | cut -d' ' -f2 | fgrep \.gz | parallel -j80% --group "gzcat {}" | parallel -j50% --pipe --round-robin -u -N1000 grep <..>
I usually compress my files with lz4 for maximum throughput.
If you want just the filename with the match:
ls -i | sort -n | cut -d' ' -f2 | fgrep \.gz | parallel -j100% --group "gzcat {} | grep -lq <..> && echo {}"
Building on the response by Sandro, I looked at the reference he provided and played around with BSD grep vs. GNU grep. My quick benchmark results showed: GNU grep is way, way faster.
So my recommendation to the original question "fastest possible grep": Make sure you are using GNU grep rather than BSD grep (which is the default on MacOS for example).
I personally use the ag (silver searcher) instead of grep and it's way faster, also you can combine it with parallel and pipe block.
https://github.com/ggreer/the_silver_searcher
Update:
I now use https://github.com/BurntSushi/ripgrep which is faster than ag depending on your use case.
One thing I've found faster for using grep to search (especially for changing patterns) in a single big file is to use split + grep + xargs with its parallel flag. For instance:
Having a file of ids you want to search for in a big file called my_ids.txt
Name of bigfile bigfile.txt
Use split to split the file into parts:
# Use split to split the file into x number of files, consider your big file
# size and try to stay under 26 split files to keep the filenames
# easy from split (xa[a-z]), in my example I have 10 million rows in bigfile
split -l 1000000 bigfile.txt
# Produces output files named xa[a-t]
# Now use split files + xargs to iterate and launch parallel greps with output
for id in $(cat my_ids.txt); do ls xa* | xargs -n 1 -P 20 grep "$id" >> matches.txt; done
# Here you can tune your parallel greps with -P, in my case I am being greedy
# Also be aware that there's no point in allocating more greps than x files
In my case this cut what would have been a 17-hour job down to 1 hour 20 minutes. I'm sure there's some sort of bell curve on efficiency here, and obviously going over the available cores won't do you any good, but this was a much better solution than the approaches above for my requirements. It also has the benefit over GNU Parallel of using mostly native (Linux) tools.
cgrep, if it's available, can be orders of magnitude faster than grep.
MCE 1.508 includes a dual chunk-level {file, list} wrapper script supporting many C binaries: agrep, grep, egrep, fgrep, and tre-agrep.
https://metacpan.org/source/MARIOROY/MCE-1.509/bin/mce_grep
https://metacpan.org/release/MCE
One does not need to convert to lowercase when wanting -i to run fast. Simply pass --lang=C to mce_grep.
Output order is preserved. The -n and -b output is also correct. Unfortunately, that is not the case for GNU parallel mentioned on this page. I was really hoping for GNU Parallel to work here. In addition, mce_grep does not sub-shell (sh -c /path/to/grep) when calling the binary.
Another alternate is the MCE::Grep module included with MCE.
A slight deviation from the original topic: the indexed-search command line utilities from the Google Code Search project are way faster than grep: https://github.com/google/codesearch
Once you compile it (Go is needed), you can index a folder with:
# index current folder
cindex .
The index will be created under ~/.csearchindex
Now you can search:
# search folders previously indexed with cindex
csearch eggs
I'm still piping the results through grep to get colorized matches.
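That last step is an ordinary pipe, something like:
csearch eggs | grep --color=auto eggs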

Splitting command line args with GNU parallel

Using GNU parallel: http://www.gnu.org/software/parallel/
I have a program that takes two arguments, e.g.
$ ./prog file1 file2
$ ./prog file2 file3
...
$ ./prog file23456 file23457
I'm using a script that generates the file name pairs; however, this poses a problem because the result of the script is a single string, not a pair, like:
$ ./prog "file1 file2"
GNU parallel seems to have a slew of tricks up its sleeve; I wonder if there's one for splitting text around separators:
$ generate_file_pairs | parallel ./prog ?
# where ? is text under consideration, like "file1 file2"
The easy workaround is to split the args manually in prog, but I'd like to know if it's possible in GNU parallel.
You are probably looking for --colsep.
generate_file_pairs | parallel --colsep ' ' ./prog {1} {2}
Read man parallel for more. And watch the intro video if you have not already done so http://www.youtube.com/watch?v=OpaiGYxkSuQ
You are looking for -n option of parallel. This is what you are looking for:
./generate_file_pairs | parallel -n 2 ./prog {}
Excerpt from GNU Parallel Doc:
-n max-args
Use at most max-args arguments per command line. Fewer than max-args
arguments will be used if the size (see the -s option) is exceeded,
unless the -x option is given, in which case GNU parallel will exit.
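To see how the grouping works before touching prog, you can substitute echo and a hand-made list of names (both are stand-ins here); this should print:
printf '%s\n' file1 file2 file3 file4 | parallel -k -n 2 echo ./prog {}
./prog file1 file2
./prog file3 file4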
Quite late to the party here, but I bump into this problem fairly often and have found a nice, easy solution: before passing the arg list to parallel, just replace all the spaces with newlines. I've found tr to be the fastest for this kind of thing.
Not working
echo "1 2 3 4 5" | parallel echo --
-- 1 2 3 4 5
Working
echo "1 2 3 4 5" | tr ' ' '\n' | parallel echo --
-- 1
-- 2
-- 3
-- 4
-- 5
Protip: before actually running the parallel command, I do 2 things to check that the arguments have been split correctly.
Prepend echo in front of your bash command. This means that any commands that will eventually be executed will be printed for you to check first
Add a marker in the echo, this checks that the parallel split is actually working
> Note, this works best with small/medium argument lists. If the argument list is very large, probably best to just use a for loop to echo each argument to parallel
In GNU Parallel's manual, it is said:
If no command is given, the line of input is executed ... GNU parallel can often be used as a substitute for xargs or cat | bash.
So take a try of:
generate command | parallel
Try to understand the output of this:
for i in {1..5};do echo "echo $i";done | parallel
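The loop on its own just prints five command lines:
for i in {1..5};do echo "echo $i";done
echo 1
echo 2
echo 3
echo 4
echo 5
parallel then executes each of those lines, so the final output is the numbers 1 through 5 (not necessarily in that order unless you add -k).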

Is it possible to get the segment number in an xargs invocation

Xargs can be used to cut up the contents of standard input into manageable pieces and invoke a command on each such piece. But is it possible to know which piece it is? To give an example:
seq 1 10 | xargs -P 2 -n 2 mycommand
will call
mycommand 1 2 &
mycommand 3 4 &
mycommand 5 6 &
mycommand 7 8 &
mycommand 9 10 &
But I would like to know in my "mycommand" script that
mycommand 1 2
is processing the first piece/segment, and so on. Is it possible to access that information?
P.S. In the simple example above I can just look at the numbers and tell, but for arbitrary lists, how does one access that information without actually injecting the piece number into the input stream?
I only see a way to do this if you change your input and add the sequence number:
seq 1 10 | perl -ne '$. % 2 and print (($.+1)/2,"\n"); print' | xargs -n3 ...
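Substituting echo for the real command shows what each invocation receives; the first argument of every group is the piece number:
seq 1 10 | perl -ne '$. % 2 and print (($.+1)/2,"\n"); print' | xargs -n3 echo
1 1 2
2 3 4
3 5 6
4 7 8
5 9 10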
It is unclear why you need this, but if your final goal is to keep the output in the same order as the input, it may be easier to use GNU Parallel:
seq 1 10 | parallel -j+0 -n2 -k mycommand
Watch the intro video for GNU Parallel to learn more: http://www.youtube.com/watch?v=OpaiGYxkSuQ
Since version 20101113 GNU Parallel has $PARALLEL_SEQ which is set to the sequence number of the command:
seq 1 10 | parallel -j+0 -n2 -k mycommand \$PARALLEL_SEQ
This may be exactly what you are looking for.

Resources