LCov option processing for removing multiple patterns - bash

I need to create a set of regex patterns to be used within the --remove option of the lcov command, in order to remove some pointless entries in the coverage file.
Reading the manpage of lcov, it seems the list of patterns should be handed to it as space-separated single-quoted strings (as in 'patte*n1' 'pa*ter*2' 'p*3' and so on).
I was able to write a small bash script which generates exactly the list I need, in the form required by the command. If I issue
export LIST=$( ./myscript.sh )
and then do an echo $LIST, I get the list I expect:
'path/file1' 'path/file2' 'path/file3'
(the regex patterns list is comprised of the ending part of some patterns to be removed from the analysis).
The problem arises when I pass this list to the command:
lcov --remove coverage_report.info '/usr/*' $( echo $LIST ) --output-file coverage_report.info.cleaned
in order to remove both /usr/* and the files from my list, but it does not work as expected: no path from my list is actually removed. I tried different forms of it:
echo $LIST | xargs -i lcov --remove coverage_report.info {} --output-file coverage_report.info.cleaned
If I take the output of echo $LIST and copy/paste it directly onto the command line, the command actually works and removes all the paths I'd like to get rid of. My impression is that I'm not aware of all the inner aspects of option processing and of the order of evaluation of nested commands.
Thanks in advance to any people willing to help!

Ok,
I finally got it done by issuing
echo $LIST | xargs lcov --output-file coverage_report.info.cleaned --remove coverage_report.info
I switched to the echo ... | xargs way of doing things in order to force the evaluation order of the commands, but the -i variant of it was not appropriate. I had the idea that the purpose of the -i switch was simply to substitute the output of the previous command wherever the {} token appears in the current one, but apparently it also changes how the input is split into arguments.
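A minimal sketch of that difference, assuming GNU xargs (the LIST value below is just an illustration): -i, a deprecated spelling of -I {}, reads one input line at a time and substitutes the whole line for {} as a single argument, so lcov sees the entire pattern list as one bogus pattern; plain xargs splits the input into individual items (stripping the shell-style quotes) and appends each one as its own argument.
# Hypothetical list, for illustration only:
LIST="'path/file1' 'path/file2'"
# -i / -I {}: one lcov run, the whole line passed as ONE pattern argument:
echo $LIST | xargs -i lcov --remove coverage_report.info {} --output-file coverage_report.info.cleaned
# plain xargs: one lcov run, each pattern appended as a separate argument:
echo $LIST | xargs lcov --output-file coverage_report.info.cleaned --remove coverage_report.info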

With lcov 1.12 and bash 4.3.46(1), the suggestion above did not work for me. The list needed to be expanded as follows:
echo ${LIST[@]} | xargs lcov -o coverage_report.info.cleaned -r coverage_report.info
Moreover, my list was a list of double-quoted strings, e.g.
LIST=( '"path/file1"' '"path/file2"' '"path/file3"' )

How to get optional quotes for rsync in script

I'm trying to programmatically use rsync and all is well until you have spaces in the paths. Then I have to quote the path in the script which is also fine until it's optional.
In this case --link-dest can be optional and I've tried variations to accommodate it in both cases of it being there and not but I'm having problems when paths need to be quoted.
One option would be to use a case statement and just call rsync with two different lines but there are other rsync options that may or may not be there and the combinations could add up quickly so a better solution would be nice. Since this is my machine and no one but me has access to it I'm not concerned about security so if an eval or something is the only way that's fine.
#!/bin/bash
src='/home/x/ll ll'
d1='/home/x/rsy nc/a'
d2='/home/x/rsy nc/b'
rsync -a "$src" "$d1"
# Try 1
x1='--link-dest='
x2='/home/x/rsy nc/a'
rsync -a $x1"$x2" "$src" "$d2"
This works until you don't have a --link-dest which will look like this:
x1=''
x2=''
rsync -a $x1"$x2" "$src" "$d2"
And rsync takes the empty "" as the source so it fails.
# Try 2
x1='--link-dest="'
x2='/home/x/rsy nc/a"'
rsync -a $x1$x2 "$src" "$d2"
This fails because the double quotes in x1 and x2 are taken as part of the path, which produces a path-not-found error, but it does of course work like this:
x1=''
x2=''
rsync -a $x1$x2 "$src" "$d2"
I've tried a couple of other variations but they all come back to the two issues above.
You can use Bash arrays to solve this kind of problem.
# inputs come from wherever
src='/home/x/ll ll'
d1='/home/x/rsy nc/a'
d2='/home/x/rsy nc/b'
x2='something'
# or
x2=''
# build the command
my_cmd=(rsync -a)
if [[ $x2 ]]; then
    my_cmd+=("--link-dest=$x2")
fi
my_cmd+=("$src" "$d2")
# run the command
"${my_cmd[#]}"
First I'm constructing the command bit by bit, with quotes carefully used to preserve multi-word options as such, storing each option/argument as an element in my array, including the rsync command itself.
Then I invoke it with the weird "${my_cmd[@]}" syntax, which means expand all the elements of the array, without re-splitting multi-word elements. Note that the quotes are important: if you remove them, the multi-word values in the array get split again and you're back to square one.
The syntax "${my_cmd[*]}" also exists, but that one joins the whole array into a single word, which again is not what you want. You need the [@] to get each array element expanded as its own word in the result.
Google "bash arrays" for more details.
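A quick way to see the difference between those expansions (the demo array here is just an illustration):
demo=(one "two words" three)
printf '<%s>\n' "${demo[@]}"   # three words: <one> <two words> <three>
printf '<%s>\n' "${demo[*]}"   # one joined word: <one two words three>
printf '<%s>\n' ${demo[@]}     # unquoted: re-split into <one> <two> <words> <three>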
Edit: conditional pipes
If you want to pipe the output through some other command on some condition, you can just pipe it into an if statement. This works:
"${my_cmd[#]}" |
if [[ $log_file]]; then
tee "$log_file"
fi
and tees the output to $log_file if that variable is defined and non-empty.
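If you also want the output to pass through when $log_file is empty, an else branch does it; this is a small extension of the snippet above, not part of the original:
"${my_cmd[@]}" |
if [[ $log_file ]]; then
    tee "$log_file"
else
    cat
fi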

Replication and expansion of program flags in BASH script

I am working with a program that combines individuals files, and I am incorporating this program into a BASH pipeline that I'm putting together. The program requires a flag for each file, like so:
program -V file_1.g.vcf -V file_2.g.vcf -V file_3.g.vcf -O combined_output.g.vcf
In order to allow the script to work with any number of samples, I would like to read the individual files names within a directory, and expand the path for each file after a '-V' flag.
I have tried adding the file paths to a variable with the following, but have not had success with proper expansion:
GVCFS=('-V' `ls gvcfs/*.g.vcf`)
Any help is greatly appreciated!
You can do this by using a loop to populate an array with the options:
options=()
for file in gvcfs/*.g.vcf; do # Don't parse ls, just use a direct wildcard expression
    options+=(-V "${file##*/}") # If you want the full path, leave off ##*/
done
program "${options[@]}" -O combined_output.g.vcf
printf can help:
options=( $(printf -- "-V %s " gvcfs/*.g.vcf ) )
Though this will not deal gracefully with whitespace in filenames.
Also consider realpath to generate absolute filenames.
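If absolute names are what the program needs, the loop above combines naturally with realpath (assuming realpath is available on your system):
options=()
for file in gvcfs/*.g.vcf; do
    options+=(-V "$(realpath "$file")")
done
program "${options[@]}" -O combined_output.g.vcf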

Is Recursive Grep Really Better?; How to Improve PBS-based Bash Script?; and other Questions

I work in a research group and we use the PBS queuing system. I'm no PBS master, but I wanted to script a check for whether a job is running. To do this I first grab a string of all the jobs by using the results of a qstat call as my argument to qstat -f, then take the detailed list of all jobs and search it for the submitted file path. The current kludge stands as follows:
dump=`qstat -f `qstat``
if grep -q \
"/${compounds[$i]}/D0_${j}_z_$((k*30))/scripts/jobscript_minim" \
<<<$dump; then
echo "Minimize is running!"
fi
Suggestions for improvement?
Also, I've been told that $() is cleaner than ``. But when I try:
dump="$(qstat -f "$(qstat)")"
...my program fails. Why is this? Am I misunderstanding how to nest shell calls with $()?? Or is it something to do with how I'm passing the list of queue jobs from qstat to qstat -f? Should I be using awk or something to grab the jobs from the qstat command and then somehow pass them as args to qstat -f?
Also, should I be using recursive grep? Some people tell me it's "saner" but I'm not sure what that means. Is it more portable? Is it faster? Does it need fewer trips to the therapist?
What is the reason you should use it?
Alright... managed to come up with a clean solution...
search_dir="${compounds[${i}]}/D0_${j}_z_$[30*k]"
if [ ! -z "$(qstat -f $(qstat | grep -F jmick | awk '{print $1}')|\
grep -F "$search_dir"|head -n 1)" ]
then
...since the directory I'm searching for is kind of long I assign it to a variable. I run the inner command substitution to get only the jobs with my user name, then run the outer command substitution to print full details on those jobs and then grep through those details for my directory. In case it finds it early I included a head to try to short circuit the command.
The question of what's the point of recursive grep, though, still stands.
A recursive grep will search multiple files in all the subdirectories. Without using recursion it will search a file or files only in the current (or specified) directory. I can't see how one would be any "saner" than the other. They each have their particular applications.
By the way, you should really split your questions into specific issues rather than posting them together - even if they have something in common. This site works better when you do it that way.
Try without the quotes:
dump=$(qstat -f $(qstat))
Since backquotes don't nest, dump=`qstat -f `qstat`` is equivalent to dump=$(qstat -f )qstat$(), which is equivalent to dump="$(qstat -f)qstat".
qstat -f "$(qstat)" calls qstat with two arguments: the option -f, and the output from qstat lumped together as a single word. dump="$(qstat -f "$(qstat)")" sets dump to the output of the outer qstat command.
qstat -f $(qstat) calls qstat with any number of arguments starting from 1, depending on the output from qstat: first the output of qstat is split into separate words at each whitespace sequence, then each word that looks like a glob pattern (i.e. contains *, ? or [) that matches at least one file is replaced by the list of matching file names. All these words and file names become individual arguments to the outer qstat.
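A small demonstration of that difference, independent of qstat:
# Quoted: the multi-line output stays a single argument.
set -- "$(printf 'job1\njob2\n')"
echo $#        # prints 1
# Unquoted: the output is split into separate words (and glob-expanded).
set -- $(printf 'job1\njob2\n')
echo $#        # prints 2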

shell scripting: search/replace & check file exist

I have a perl script (or any executable) E which will take a file foo.xml and write a file foo.txt. I use a Beowulf cluster to run E for a large number of XML files, but I'd like to write a simple job server script in shell (bash) which doesn't overwrite existing txt files.
I'm currently doing something like
#!/bin/sh
PATTERN="[A-Z]*0[1-2][a-j]"; # this matches foo in all cases
todo=`ls *.xml | grep $PATTERN -o`;
isdone=`ls *.txt | grep $PATTERN -o`;
whatsleft=todo - isdone; # what's the unix magic?
#tack on the .xml prefix with sed or something
#and then call the job server;
jobserve E "$whatsleft";
and then I don't know how to get the difference between $todo and $isdone. I'd prefer using sort/uniq to something like a for loop with grep inside, but I'm not sure how to do it (pipes? temporary files?)
As a bonus question, is there a way to do lookahead search in bash grep?
To clarify/extend the problem:
I have a bunch of programs that take input from sources like (but not necessarily) data/{branch}/special/{pattern}.xml and write output to another directory results/special/{branch}-{pattern}.txt (or data/{branch}/intermediate/{pattern}.dat, e.g.). I want to check in my jobfarming shell script if that file already exists.
So E transforms data/{branch}/special/{pattern}.xml->results/special/{branch}-{pattern}.dat, for instance. I want to look at each instance of the input and check if the output exists. One (admittedly simpler) way to do this is just to touch *.done files next to each input file and check for those results, but I'd rather not manage those, and sometimes the jobs terminate improperly so I wouldn't want them marked done.
N.B. I don't need to check concurrency yet or lock any files.
So a simple, clear way to solve the above problem (in pseudocode) might be
for i in `/bin/ls *.xml`
do
replace xml suffix with txt
if [that file exists]
add to whatsleft list
end
done
but I'm looking for something more general.
#!/bin/bash
shopt -s extglob # allow extended glob syntax, for matching the filenames
LC_COLLATE=C # use a sort order comm is happy with
IFS=$'\n' # so filenames can have spaces but not newlines
# (newlines don't work so well with comm anyhow;
# shame it doesn't have an option for null-separated
# input lines).
files_todo=( **([A-Z])0[1-2][a-j]*.xml )
files_done=( **([A-Z])0[1-2][a-j]*.txt )
files_remaining=( \
    $(comm -23 --nocheck-order \
        <(printf "%s\n" "${files_todo[@]%.xml}") \
        <(printf "%s\n" "${files_done[@]%.txt}") ))
echo jobserve E $(for f in "${files_remaining[@]%.xml}"; do printf "%s\n" "${f}.txt"; done)
This assumes that you want a single jobserve E call with all the remaining files as arguments; it's rather unclear from the specification if such is the case.
Note the use of extended globs rather than parsing ls, which is considered very poor practice.
To transform input to output names without using anything other than shell builtins, consider the following:
if [[ $in_name =~ data/([^/]+)/special/([^/]+).xml ]] ; then
    out_name=results/special/${BASH_REMATCH[1]}-${BASH_REMATCH[2]}.dat
else
    : # ...handle here the fact that you have a noncompliant name...
fi
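For example, with a made-up input name the capture groups end up as follows:
in_name='data/branchA/special/sample01.xml'
# After the [[ ... =~ ... ]] test above succeeds:
#   ${BASH_REMATCH[1]} is branchA
#   ${BASH_REMATCH[2]} is sample01
#   out_name becomes results/special/branchA-sample01.dat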
The question title suggests that you might be looking for:
set -o noclobber
The question content indicates a wholly different problem!
It seems you want to run 'jobserve E' on each '.xml' file without a matching '.txt' file. You'll need to assess the TOCTOU (Time of Check, Time of Use) problems here because you're in a cluster environment. But the basic idea could be:
todo=""
for file in *.xml
do [ -f ${file%.xml}.txt ] || todo="$todo $file"
done
jobserve E $todo
This will work with Korn shell as well as Bash. In Bash you could explore making 'todo' into an array; that will deal with spaces in file names better than this will.
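A sketch of that array variant, assuming Bash (jobserve E is invoked just as in the snippet above):
todo=()
for file in *.xml
do  [ -f "${file%.xml}.txt" ] || todo+=("$file")
done
jobserve E "${todo[@]}"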
If you have processes still generating '.txt' files for '.xml' files while you run this check, you will get some duplicated effort (because this script cannot tell that the processing is happening). If the 'E' process creates the corresponding '.txt' file as it starts processing, that minimizes the chance of duplicated effort. Or, maybe consider separating the processed files from the unprocessed files, so the 'E' process moves the '.xml' file from the 'to-be-done' directory to the 'done' directory (and writes the '.txt' file to the 'done' directory too). If done carefully, this can avoid most of the multi-processing problems. For example, you could link the '.xml' to the 'done' directory when processing starts, and ensure appropriate cleanup with an 'atexit()' handler (if you are moderately confident your processing program does not crash). Or other trickery of your own devising.
whatsleft=$( ls *.xml *.txt | grep $PATTERN -o | sort | uniq -u )
Note this actually gets a symmetric difference.
I am not exactly sure what you want, but you can check for the existence of the file first and, if it exists, create a new name. (Or you could do this check in your E Perl script.)
if [ -f "$file" ]; then
    newname="...."
fi
...
jobserve E .... > $newname
If that's not what you want, describe more clearly in your question what you mean by "don't overwrite files".
For posterity's sake, this is what I found to work:
TMPA='neverwritethis.tmp'
TMPB='neverwritethat.tmp'
ls *.xml | grep $PATTERN -o > $TMPA;
ls *.txt | grep $PATTERN -o > $TMPB;
whatsleft=$(sort $TMPA $TMPB | uniq -u | sed 's/$/.xml/');
rm $TMPA $TMPB;

calling grep from a bash script

I'm new to bash scripts (and the *nix shell altogether) but I'm trying to write this script to make grepping a codebase easier.
I have written this
#!/bin/bash
args=("$@");
for arg in args
grep arg * */* */*/* */*/*/* */*/*/*/*;
done
when I try to run it, this is what happens:
~/Work/richmond $ ./f.sh "\$_REQUEST\['a'\]"
./f.sh: line 4: syntax error near unexpected token `grep'
./f.sh: line 4: ` grep arg * */* */*/* */*/*/* */*/*/*/*;'
~/Work/richmond $
How do I do this properly?
And, I think a more important question is, how can I make grep recurse through subdirectories properly like this?
Any other tips and/or pitfalls with shell scripting and using bash in general would also be appreciated.
The syntax error is because you're missing do. As for searching recursively if your grep has the -R option you would do:
#!/bin/bash
for arg in "$@"; do
    grep -R "$arg" *
done
Otherwise you could use find:
#!/bin/bash
for arg in "$@"; do
    find . -exec grep "$arg" {} +
done
In the latter example, find will execute grep, replacing the {} braces with the file names it finds, starting in the current directory (.).
(Notice that I also changed arg to "$arg". You need the dollar sign to get the variable's value, and the quotes tell the shell to treat its value as one big word, even if $arg contains spaces or newlines.)
On recursive grepping:
Depending on your grep version, you can pass -R to your grep command to have it search Recursively (in subdirectories).
The best solution is stated above, but try putting your statement in back ticks:
`grep ...`
You should use 'find' plus 'xargs' to do the file searching.
for arg in "$@"
do
    find . -type f -print0 | xargs -0 grep "$arg" /dev/null
done
The '-print0' and '-0' options assume you're using GNU find and GNU xargs, and they ensure that the script works even if there are spaces or other unexpected characters in your path names. Using xargs like this is more efficient than having find execute grep for each file; the /dev/null appears in the argument list so grep always reports the name of the file containing the match.
You might decide to simplify life - perhaps - by combining all the searches into one using either egrep or grep -E. An optimization would be to capture the output from find once and then feed that to xargs on each iteration.
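A sketch of that optimization (the temporary file name is just illustrative; GNU find and xargs assumed as above): run find once, save the null-separated list, and reuse it for each pattern.
filelist=$(mktemp)
find . -type f -print0 > "$filelist"
for arg in "$@"
do
    xargs -0 grep "$arg" /dev/null < "$filelist"
done
rm -f "$filelist"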
Have a look at the findrepo script, which may give you some pointers.
If you just want a better grep and don't want to do anything yourself, use ack, which you can get at http://betterthangrep.com/.
