Is it possible to get the segment number in an xargs invocation

xargs can be used to cut the contents of standard input into manageable pieces and invoke a command on each such piece. But is it possible to know which piece is being processed? To give an example:
seq 1 10 | xargs -P 2 -n 2 mycommand
will call
mycommand 1 2 &
mycommand 3 4 &
mycommand 5 6 &
mycommand 7 8 &
mycommand 9 10 &
But I would like to know in my "mycommand" script that
mycommand 1 2
is processing the first piece/segment, and so on. Is it possible to access that information?
P.S. In the simple example above I can just look at the numbers and tell. But for arbitrary lists, how does one access that information without actually injecting the piece number into the input stream?

As far as I can see, you can only do this by changing your input to add the sequence number:
seq 1 10 | perl -ne '$. % 2 and print(($.+1)/2, "\n"); print' | xargs -n3 ...
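To see what xargs then receives, run the filter by itself; every pair is now preceded by its piece number, and -n3 hands each triple to the command (echo stands in for mycommand here):
seq 1 10 | perl -ne '$. % 2 and print(($.+1)/2, "\n"); print' | xargs -n3 echo mycommand
mycommand 1 1 2
mycommand 2 3 4
mycommand 3 5 6
mycommand 4 7 8
mycommand 5 9 10
Your script then treats its first argument as the segment number and the rest as the real payload.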
It is unclear why you need this, but if your final goal is to keep the output in the same order as the input, it may be easier to use GNU Parallel:
seq 1 10 | parallel -j+0 -n2 -k mycommand
Watch the intro video for GNU Parallel to learn more: http://www.youtube.com/watch?v=OpaiGYxkSuQ
Since version 20101113 GNU Parallel has $PARALLEL_SEQ which is set to the sequence number of the command:
seq 1 10 | parallel -j+0 -n2 -k mycommand \$PARALLEL_SEQ
This may be exactly what you are looking for.
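Since GNU Parallel also sets $PARALLEL_SEQ in each job's environment, mycommand can read it there instead of taking it as an argument. A minimal sketch of such a script (the output format is just for illustration):
#!/bin/bash
# PARALLEL_SEQ is exported by GNU Parallel for every job it starts
echo "piece $PARALLEL_SEQ: $@"
which would then be invoked without the explicit argument:
seq 1 10 | parallel -j+0 -n2 -k mycommand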

Related

`sort -t` doesn't work properly with string input

I want to sort some space-separated numbers in bash. The following doesn't work, however:
sort -dt' ' <<< "3 5 1 4"
Output is:
3 5 1 4
Expected output is:
1
3
4
5
As I understand it, the -t option should use its argument as a delimiter. Why isn't my code working? I know I can tr the spaces to newlines, but I'm working on a code golf thing and want to be able to do it without any other utility.
EDIT: Everybody is answering by splitting the spaces into lines. I do not want to do this; I already know how to do it with other utilities. I am specifically asking how to do this with sort, and sort only. If -t doesn't delimit input, what does it do?
Use process substitution with printf to put each input number on a separate line; otherwise sort gets only one line to sort:
sort <(printf "%s\n" 3 5 1 4)
1
3
4
5
Note that when doing it this way, -dt' ' is not needed.
After searching around, I have discovered what -t is for. It delimits fields within a line when you want to sort by a certain part of each line; e.g., if you have
Hello,56
Cat,81
Book,14
Nope,62
and you want to sort by the number, you would use -t',' to split on the comma and then -k to select which field to sort by. It is for field delimiting, not record delimiting as I had thought.
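For example, with that data saved in a file (hotels.txt is just a made-up name here), sorting numerically on the second comma-separated field:
sort -t',' -k2 -n hotels.txt
Book,14
Hello,56
Nope,62
Cat,81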
Since sort only separates fields within a single line, you have no choice but to feed the data to sort one item per line, for example by letting word splitting hand each number to printf separately:
#!/bin/bash
var="8 3 5 1 4 7 2 9 6"
main() {
    # $var is deliberately unquoted: word splitting on IFS (whitespace)
    # makes printf receive each number as a separate argument, so it
    # emits one number per line for sort to work on
    printf "%s\n" $var | sort -d
}
main
This will give an obvious output of:
1
2
3
4
5
6
7
8
9
If this is not the way you wish to use sort, well, you have already answered your own question by doing a bit of digging into the issue; doing that digging earlier would have saved time both for the people answering and for yourself.

tracking status/progress in gnu parallel

I've implemented parallel in one of our major scripts to perform data migrations between servers. Presently, the output is presented all at once (-u) in pretty colors, with periodic echoes of status from the function being executed, depending on which step is being run (e.g. 5/20: $username: rsyncing homedir or 5/20: $username: restoring account). These are all echoed directly to the terminal running the script and accumulate there. Depending on how long a command runs, however, output can end up well out of order, and long-running rsync commands can be lost in the shuffle. But I don't want to wait for long-running processes to finish before getting the output of the processes that follow.
In short, my issue is keeping track of which arguments are being processed and are still running.
What I would like to do is send parallel into the background with (parallel args command {#} {} ::: $userlist) & and then track progress of each of the running functions. My initial thought was to use ps and grep liberally along with tput to rewrite the screen every few seconds. I usually run three jobs in parallel, so I want to have a screen that shows, for instance:
1/20: user1: syncing homedir
current file: /home/user1/www/cache/file12589015.php
12/20: user12: syncing homedir
current file: /home/user12/mail/joe/mailfile
5/20: user5: collecting information
current file:
I can certainly get the above status output together no problem, but my current hangup is separating the output from the individual parallel processes into three different... pipes? variables? files? so that it can be parsed into the above information.
Not sure if this is much better:
echo hello im starting now
sleep 1
# start parallel and send the job to the background
# (assumes a `background` function is defined and exported, as in the answer below)
temp=$(mktemp -d)
parallel --rpl '{log} $_="Working on @arg"' -j3 background {} {#} ">$temp/{log} 2>&1; rm $temp/{log}" ::: foo bar baz foo bar baz one two three one two three :::+ 5 6 5 3 4 6 7 2 5 4 6 2 &
while kill -0 $! 2>/dev/null ; do
cd "$temp"
clear
tail -vn1 *
sleep 1
done
rm -rf "$temp"
It makes a logfile for each job, tails all the logfiles every second, and removes a job's logfile when that job is done. The logfiles are named 'Working on ...', and since tail -v prints each file name as a header, the file name itself serves as the status line.
I believe that this is close to what I need, though it isn't very tidy and probably isn't optimal:
#!/bin/bash
background() { # dummy load. $1 is text, $2 is number, $3 is position
    echo "$3: starting sleep..."
    sleep "$2"
    echo "$3: $1 slept for $2"
}
progress() {
    echo "starting progress loop for pid $1..."
    while [ -d "/proc/$1" ]; do
        clear
        tput cup 0 0
        runningprocs=$(ps faux | grep background | egrep -v '(parallel|grep)')
        numprocs=$(echo "$runningprocs" | wc -l)
        for each in $(seq 1 "$numprocs"); do
            line=$(echo "$runningprocs" | head -n "$each" | tail -n 1)
            # the job's sequence number is the third field from the end of the ps line
            seq=$(echo $line | rev | awk '{print $3}' | rev)
            # print select elements from the ps output
            echo working on $(echo $line | rev | awk '{print $3, $4, $5}' | rev)
            # print the last line of the log for that sequence number
            grep "^$seq:" logfile.log | tail -n 1
            echo
        done
        sleep 1
    done
}
echo hello im starting now
sleep 1
export -f background
# start parallel and send the job to the background
parallel -u -j3 background {} {#} '>>' logfile.log ::: foo bar baz foo bar baz one two three one two three :::+ 5 6 5 3 4 6 7 2 5 4 6 2 &
pid=$!
progress $pid
echo finished!
I'd rather not depend on scraping all the information from ps and would prefer to get the actual line output of each parallel process, but a guy's gotta do what a guy's gotta do. Regular output is sent to a logfile for parsing later on.
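As an aside, GNU Parallel itself offers coarse progress reporting that needs no ps scraping at all: --progress and --bar print summary progress, and --joblog FILE appends the runtime and exit status of each finished job. None of these show a job's live output, so they complement rather than replace the approach above. A hypothetical invocation (the log path is arbitrary):
parallel --bar --joblog /tmp/migration.joblog -j3 background {} {#} ::: foo bar baz :::+ 5 6 5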

Terminal Command to run tests using GNU Parallel

I have a folder of problems which are like this:
problem1, domain 1
problem2, domain 2
problem3, domain 3
I want to use GNU Parallel to run a bunch of problems like this. This is a short version of what I have tried:
seq 01 20 | parallel -k -j6 java pddl/benchmarks_STRIPS/psr/p{}-domain.pddl -f pddl/benchmarks_STRIPS/psr/p{}.pddl
I want some sort of command that will tell GNU Parallel that domain 1 is to be compiled with problem 1, domain 2 with problem 2, etc.
Is there a way to do this using GNU Parallel, or should I write each one out individually?
I think it may be a problem with zero-padding, as my seq command doesn't zero-pad numbers.
If you have bash 4+ (I think that's the correct version), you can use zero-padded brace expansion; print one number per line, since parallel reads its arguments line by line:
printf '%s\n' {01..20} | parallel ...
Or, if you have an older bash, you could use something like:
printf "%02d\n" {1..20} | parallel ...
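Alternatively, if your seq is GNU seq, its -w flag pads the numbers with leading zeros to equal width, so no particular bash version is needed:
seq -w 1 20 | parallel ...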
I assume the pXX-domain.pddl files exist. You can use GNU Parallel's {= =} syntax to compute the pXX name:
parallel -k -j6 java {} -f '{= s/-domain(\.pddl)$/$1/ =}' ::: pddl/benchmarks_STRIPS/psr/p*-domain.pddl
Or, if it is the plain pXX.pddl files that exist:
parallel -k -j6 java '{= s/(\.pddl)$/-domain$1/ =}' -f {} ::: pddl/benchmarks_STRIPS/psr/p??.pddl
Requires GNU Parallel 20140722.
This way you do not need to know in advance which files exist.
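To check the pairing before launching anything, GNU Parallel's --dry-run option prints the composed command lines instead of executing them, e.g.:
parallel --dry-run -k -j6 java {} -f '{= s/-domain(\.pddl)$/$1/ =}' ::: pddl/benchmarks_STRIPS/psr/p*-domain.pddl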

Print out a statement before each output of my script

I have a script that checks each file in a folder for the word "Author" and then prints the number of occurrences, one line per file, ordered from highest to lowest. In total I have 825 files. An example output would be
53
22
17
I want to print something before each number on every line. This will be of the form hotel_$i, so the above example would become:
hotel_1 53
hotel_2 22
hotel_3 17
I have tried doing this using a for loop in my shell script:
for i in {1..825}
do
echo "hotel_$i"
find . -type f -exec bash -c 'grep -wo "Author" {} | wc -l' \; | sort -nr
done
but this basically prints hotel_1, then does the search and sort across all 825 files, then prints hotel_2 and repeats the search and sort, and so on. How do I make it print a label before each line of output?
You can use the paste command, which combines lines from different files:
paste <(printf 'hotel_%d\n' {1..825}) \
<(find . -type f -exec bash -c 'grep -wo "Author" {} | wc -l' \; | sort -nr)
(This is split over two lines for readability; it can be a one-liner without the \.)
This combines paste with process substitution, making the output of a command look like a file (a named pipe) to paste.
The first command prints hotel_1, hotel_2 etc. on a separate line each, and the second command is your find command.
For short input files, the output looks like this:
hotel_1 7
hotel_2 6
hotel_3 4
hotel_4 3
hotel_5 3
hotel_6 2
hotel_7 1
hotel_8 0
hotel_9 0

Splitting command line args with GNU parallel

Using GNU parallel: http://www.gnu.org/software/parallel/
I have a program that takes two arguments, e.g.
$ ./prog file1 file2
$ ./prog file2 file3
...
$ ./prog file23456 file23457
I'm using a script that generates the file name pairs; however, this poses a problem because the result of the script is a single string, not a pair, like:
$ ./prog "file1 file2"
GNU parallel seems to have a slew of tricks up its sleeves, I wonder if there's one for splitting text around separators:
$ generate_file_pairs | parallel ./prog ?
# where ? is text under consideration, like "file1 file2"
The easy work around is to split the args manually in prog, but I'd like to know if it's possible in GNU parallel.
You are probably looking for --colsep.
generate_file_pairs | parallel --colsep ' ' ./prog {1} {2}
Read man parallel for more. And watch the intro video if you have not already done so http://www.youtube.com/watch?v=OpaiGYxkSuQ
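For example, with echo standing in for ./prog:
echo "file1 file2" | parallel --colsep ' ' echo ./prog {1} {2}
./prog file1 file2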
The -n option of parallel is what you are looking for:
./generate_file_pairs | parallel -n 2 ./prog {}
Excerpt from GNU Parallel Doc:
-n max-args
Use at most max-args arguments per command line. Fewer than max-args
arguments will be used if the size (see the -s option) is exceeded,
unless the -x option is given, in which case GNU parallel will exit.
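A quick way to see the grouping (again with echo standing in for ./prog):
printf '%s\n' file1 file2 file3 file4 | parallel -n 2 echo ./prog
./prog file1 file2
./prog file3 file4
(The two lines may appear in either order; add -k to force input order.)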
Quite late to the party here, but I bump into this problem fairly often and found a nice, easy solution.
Before passing the arg list to parallel, just replace all the spaces with newlines. I've found tr to be the fastest for this kind of stuff.
Not working
echo "1 2 3 4 5" | parallel echo --
-- 1 2 3 4 5
Working
echo "1 2 3 4 5" | tr ' ' '\n' | parallel echo --
-- 1
-- 2
-- 3
-- 4
-- 5
Protip: before actually running the parallel command, I do two things to check that the arguments have been split correctly (see the example after the note below):
1. Prepend echo to your command. Any command that would eventually be executed is then merely printed for you to check first.
2. Add a marker to the echo; this checks that the parallel split is actually working.
Note: this works best with small or medium argument lists. If the argument list is very large, it is probably best to use a for loop to echo each argument to parallel.
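For example, the check from the protip could look like this; nothing but echo is ever executed:
echo "1 2 3 4 5" | tr ' ' '\n' | parallel echo ./prog --marker
./prog --marker 1
./prog --marker 2
./prog --marker 3
./prog --marker 4
./prog --marker 5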
GNU Parallel's manual says:
If no command is given, the line of input is executed ... GNU parallel can often be used as a substitute for xargs or cat | bash.
So take a try of:
generate command | parallel
Try to understand the output of this:
for i in {1..5};do echo "echo $i";done | parallel
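The loop writes the five lines "echo 1" through "echo 5"; since no command is given, parallel runs each input line as a command, so it simply prints:
1
2
3
4
5
(possibly out of order; add -k to preserve input order).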
