Bash variable contains a value from a set of values - bash

I am developing a "wrapper script" to use as a "logging aid" in Bash.
It should print out information about the call stack at the time it was invoked.
I have done work on it, which follows, but a couple of questions/doubts remain and I'd like to get the best possible answer on them from the experts here.
My code:
################################################################################
# Formats a logging message.
function my_function_format_logging_message() {
local -r TIMESTAMP="$(date '+%H:%M:%S')"
local -r PROCESS="$$" # Deliberately not using $BASHPID, focus: parent process
local -r CALLER="${FUNCNAME[1]}"
local -i call_stack_position=1
if [[ "${CALLER}" == 'my_function_log_trace' ||
"${CALLER}" == 'my_function_log_debug' ||
"${CALLER}" == 'my_function_log_info' ||
"${CALLER}" == 'my_function_log_warning' ||
"${CALLER}" == 'my_function_log_error' ||
"${CALLER}" == 'my_function_log_critical' ]]
then
call_stack_position=$((call_stack_position++))
fi
local -r SOURCE="$(basename "${BASH_SOURCE[$call_stack_position]}")"
local -r FUNCTION="${FUNCNAME[$call_stack_position]}"
local -r LINE="${BASH_LINENO[$call_stack_position-1]}" # Previous function
local -r SEVERITY="$1"
local -r MESSAGE="$2"
# TODO: perform argument validation
printf '%s [PID %s] %s %s %s:%s - %s\n' \
"${TIMESTAMP}" \
"${PROCESS}" \
"${SEVERITY}" \
"${SOURCE}" \
"${FUNCTION}" \
"${LINE}" \
"${MESSAGE}"
}
################################################################################
Usage example:
my_function_format_logging_message CRITICAL Temporarily increasing energy level to 9001
or:
my_function_log_info Dropping back to power level 42
My doubts:
call_stack_position=$((call_stack_position++))
I can't think of a better way to increment this variable, is there a nicer/more readable form of this?
Can I use a better construct to detect if the call was made by a logging method? (e.g. trace, debug, info..). All of those if statements make my eyes hurt.
Am I reinventing the wheel / misusing the tool I'd like to learn? (i.e. shell scripting)
I might be reinventing the wheel, sure, but this is self-training.. to one day stop being a toll booth night-shift worker.
NOTE
I am looking for a match to the specified my_function_log_* names and no others. It is not ok to assume I have that degree of freedom (the many ifs are there for exactly that reason and I am looking for some syntactic sugar or better use of language features to do that type of "set membership" test).

I can suggest this for your first two questions:
if [[ "${CALLER}" == my_function_log_* ]]
then
let call_stack_position++
fi
If you just want a set of values after log_:
if [[ "${CALLER}" =~ ^my_function_log_(trace|debug|info|warning|error|critical)$ ]]
then
let call_stack_position++
fi

Bash’s type system, if you even want to call it that, is very rudimentary: strings and integers are its only first class citizens, arrays are a tacked on afterthought whose functionality is nowhere near that of Python sets or Ruby arrays. This being said, there is a poor man’s in operator for arrays that relies on string matching. Given an array of function names:
log_functions=(my_function_log_trace my_function_log_debug my_function_log_info my_function_log_warning my_function_log_error my_function_log_critical)
this:
[[ ${log_functions[*]} =~ \\b$CALLER\\b ]]
will match only members of the array. And as we are talking poor man’s constructs, you can combine the above pattern with boolean control operators into a poor man’s ternary assignment to skip the numerical evaluation altogether:
local -i call_stack_position=$([[ ${log_functions[*]} =~ \\b$CALLER\\b ]] && echo 2 || echo 1)
Caveat: on systems that do not support the GNU extensions to regcomp() (notably OS X and Cygwin), word boundary matching needs to use the somewhat more verbose character class form, i.e.
[[ ${log_functions[*]} =~ [[:\<:]]$CALLER[[:\>:]] ]]
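If the word-boundary portability issue is a concern, a plain loop over the array avoids regular expressions altogether. This is a different technique from the regex match above, shown only as a minimal sketch; the helper name in_array is hypothetical and not part of the original answer.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the first argument equals one of the others.
in_array() {
  local needle=$1 item
  shift
  for item in "$@"; do
    # Quoting the right-hand side forces a literal comparison, not a glob match.
    [[ $item == "$needle" ]] && return 0
  done
  return 1
}

log_functions=(my_function_log_trace my_function_log_debug my_function_log_info
               my_function_log_warning my_function_log_error my_function_log_critical)

if in_array my_function_log_info "${log_functions[@]}"; then
  echo 'called from a logging function'
fi
```

The loop costs one comparison per element, which should be negligible for six names.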
Notes: seeing your code and noting you mentioned you are learning shell scripting, I’d offer two observations unrelated to the question proper:
The brace notation for variable expansion is only required for array access, expansion operations and to disambiguate var names in string concatenation. It is not needed in other cases, i.e. in both your tests and your printf command.
Using expansion string operations is much faster than using externals and thus recommended wherever possible. Instead of using basename, use ${var##*/}.
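A small sketch of the second observation, assuming a made-up path (the file name below is only for illustration). Note that the two approaches differ when the path ends in a slash: basename strips the trailing slash first, while the expansion returns an empty string.

```bash
#!/usr/bin/env bash
# Hypothetical path, used only for illustration.
path='/usr/local/lib/my_logging.sh'

# Remove the longest prefix matching "*/", leaving the last path component.
name=${path##*/}
printf '%s\n' "$name"

# The external command gives the same answer here, at the cost of a fork.
printf '%s\n' "$(basename "$path")"
```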

A more readable way of doing an increment is by incrementing it in numerical context:
(( call_stack_position++ ))
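As far as I can tell, the form used in the question does not increment at all: the post-increment yields the old value, and that old value is then assigned back to the variable. The following sketch, using a throwaway variable n, shows the difference.

```bash
#!/usr/bin/env bash
n=1
n=$((n++))    # n++ evaluates to the old value (1), which is assigned back
echo "$n"     # 1

(( n++ ))     # increments in place
echo "$n"     # 2

n=$((n + 1))  # explicit assignment form, also fine
echo "$n"     # 3
```

One thing to keep in mind: (( expr )) returns a non-zero exit status when the expression evaluates to 0, so (( n++ )) with n equal to 0 reports failure, which matters in scripts running under set -e.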
For the matching, you can use a glob in bash:
[[ $CALLER == my_function_log_* ]]
As far as reinventing the wheel, you can use syslog logging from bash using the logger command. The local syslog daemon will handle formatting the log message and writing it to a file.
logger -p local0.info "CRITICAL Temporarily increasing energy level to 9001"
Update, based on comments. You can use an associative array to be more explicit about what you are looking for. It requires bash v4 or higher.
declare -A arr=(
['my_function_log_trace']=1
['my_function_log_debug']=1
['my_function_log_info']=1
['my_function_log_warning']=1
['my_function_log_error']=1
['my_function_log_critical']=1
);
if [[ ${arr[$CALLER]} ]]; then
...
fi
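A minimal sketch of how the associative array could be wrapped in a membership test. The names LOG_FUNCTIONS and is_log_function are hypothetical; the guard against an empty key is there because bash rejects an empty associative-array subscript.

```bash
#!/usr/bin/env bash
# Requires bash 4 or later for associative arrays.
declare -A LOG_FUNCTIONS=(
  [my_function_log_trace]=1
  [my_function_log_debug]=1
  [my_function_log_info]=1
  [my_function_log_warning]=1
  [my_function_log_error]=1
  [my_function_log_critical]=1
)

# Hypothetical helper: succeeds only for exact members of the set.
is_log_function() {
  [[ -n $1 && -n ${LOG_FUNCTIONS[$1]:-} ]]
}

if is_log_function my_function_log_info; then
  echo 'member'
fi
```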
You could also use extended globbing for the pattern matching, similar to the regex in perreal's answer, but without regex:
shopt -s extglob
if [[ $CALLER == my_function_log_@(trace|debug|info|warning|error|critical) ]]; then
...
fi

Related

getopts & preventing accidentally interpreting short options as arguments

Here's a toy example that shows what I mean:
while getopts "sf:e:" opt; do foundOpts="${foundOpts}${opt}"; done
echo "$foundOpts"
The problem is that getopts isn't particularly smart when it comes to arguments. For example, ./myscript.sh -ssssf will output option requires an argument -- f. That's good.
But if someone does ./myscript.sh -ssfss by mistake, it parses that like -ss -f ss, which is NOT desirable behavior!
How can I detect this? Ideally, I'd like to force that f parameter to be defined separately, and allow -f foo or -f=foo, but not -sf=foo or -sf foo.
Is this possible without doing something hairy? I imagine I could do something with a RegEx, along the lines of match $@ against something like -[^ef[:space:]]*[ef]{1}[ =](.*), but I'm worried about false positives (that regex might not be exactly right).
Thanks!
Here's what I've come up with. It seems to work, but I imagine the RegEx could be simplified. And I imagine there are edge cases I've overlooked.
validOpts="sf:e:"
optsWithArgs=$(echo "$validOpts" | grep -o .: | tr -d ':\n')
# ensure that options that require (or have optional) arguments are not combined with other short options
invalidOptArgFormats=( "[^$optsWithArgs[:space:]]+[$optsWithArgs]" "[$optsWithArgs]([^ =]|[ =]$|$)" )
IFS="|" && invalidOptArgRegEx="-(${invalidOptArgFormats[*]})"
[[ "$@" =~ $invalidOptArgRegEx ]] && echo -e "An option you provided takes an argument; it is required to have its option\n on a separate - to prevent accidentally passing an opt as an argument.\n (You provided ${BASH_REMATCH[0]})" && exit 1
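An alternative to matching one regex against the joined argument list is to inspect each argument separately with a glob before running getopts. This is only a sketch under stated assumptions: the option string is "sf:e:" as in the question, and the function name reject_bundled_arg_opts is hypothetical. It has its own false positives: an option value that itself starts with a dash (for example -f -sf) is flagged as well.

```bash
#!/usr/bin/env bash
# Hypothetical pre-check, meant to run before the getopts loop.
# Assumption: f and e are the only options that take an argument.
reject_bundled_arg_opts() {
  local arg
  for arg in "$@"; do
    [[ $arg == -- ]] && break   # everything after -- is an operand
    # Matches a single-dash cluster whose first letter is not f or e
    # but which contains f or e further along, e.g. -sf, -ssfss, -sf=foo.
    if [[ $arg == -[!fe-]*[fe]* ]]; then
      printf 'Option cluster %s hides an option that takes an argument\n' "$arg" >&2
      return 1
    fi
  done
  return 0
}

reject_bundled_arg_opts -ss -f foo && echo 'arguments look fine'
```

In a script this would be called as reject_bundled_arg_opts "$@" || exit 1 ahead of the while getopts loop.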

Why won't my for/if loop work for renaming sequence names in a fasta file in bash

I am currently working on a bioinformatics project, and have been assigned the role of editing some genetic sequence files (fasta/.fa) to be viable for the next stage of processing. I am doing this on the command line linux with bash.
With how the files have been obtained, each read within the file has been assigned an arbitrary name following this format for 1-1587663 (denoted x) V1_x.
For the next step of my reads, I need to format these names within the file following a specific naming pattern. This is where all empty spaces must contain a 0. For example, V1_1 must be reformatted to V1_0000001, V1_15 must be reformatted to V1_0000015, V1_1050 must be formatted to V1_0001050, eventually ending with V1_1587663.
I will give an example of how one file is laid out:
V1_1 flag=1 multi=9.0000 len=342\
AAGGAGTGATGGCATGGCGTGGGACTTCTCCACCGACCCCGAGTTCCAGGAGAAGCTCGACTGGGTCGAGCGGTTCTGCCAGGAAAGGGTCGAGCCGCTCGACTATGTGTTTCCCCACGCGGTGCGCTGGCCAGACCCGGTGGTAAAGGCGTACGTCCGCGAACTCCAGCAGGAGGTCAAGGACCAGGGCCTGTGGGCGATCTTCCTCGACCGGGAACTAGGTGGCCCGGGCTTCGGACAGCTCAGGCTGGCTCTGCTCAACGAGGTGATCGGCCGCTATCCCGGCGCGCCCGCGATGTTCGGTGCCGCGGCGCCCGATACCGGGAA
V1_2 flag=1 multi=9.0000 len=330
ATCTTCACCCAGCTCGGCAGCATGTTTCCCGTGGCGATGGAGTGCAGCATCGAGCCCAGGCAGATCACCAGCCCGGCGTCTTTCAACTGCGCGGCGTAGGCGTCCTGCGCCGCGTTCATATCGGTAATCGTATCGGGCAGCGGGCCGTCGTCGCGCAGGCTGCCCGCCAGCACGAACGGAATCCCAGAGCGCACGCATTCGTACAGGATGCCTTCCCGCAGGCATCCGCCCTCCACGGCCTGCCGGACGCTCCCGGCGCGATAGATCGCATTGATGGCGCGCATGTGATTGCGGTGCCCGTGCTCTTCCTGCCTCCCGTCGCTCAGCCGC\
I am currently trying to write a loop which would do this all in one go, as it is a lot of reads and I have multiple of these genetic sequence fasta files.
I don't want to ruin my file so I have created a copy of the file with the first 5000 reads in to test my code.
The code I have been trying to make work is as follows
for i in {1..5000}
do
if [ "$i" -le "9"]; then
sed -i 's/V1_i/V1_000000i/' testfile.fa
elif [["$i" -gt "9"] && ["i" -le "99"]]; then
sed -i s/V1_i/V1_00000i/' testfile.fa
elif [["i" -gt "99"] && ["i" -le "999"]]; then
sed -i s/V1_i/V1_0000i/' testfile.fa
elif [["i" -gt "999"] && ["i" -le "9999"]]; then
sed -i s/V1_i/V1_000i/' testfile.fa
fi
done
I will rewrite the code below to explain what I think each line should be doing
for i in {1..5000} - **Denoting that it should be ran with i standing as 1-5000**
do
if [ "$i" -le "9"]; then **If 'i' is less than 9 then do...**
sed -i 's/V1_i/V1_000000i/' testfile.fa **replace V1_i with V1_000000i within testfile.fa**
elif [["$i" -gt "9"] && ["i" -le "99"]]; then **else if 'i' is more than 9 but equal to or less than 99 then do....**
sed -i s/V1_i/V1_00000i/' testfile.fa **replace V1_i with V1_000000i within testfile.fa**
elif [["i" -gt "99"] && ["i" -le "999"]]; then
sed -i s/V1_i/V1_0000i/' testfile.fa
elif [["i" -gt "999"] && ["i" -le "9999"]]; then
sed -i s/V1_i/V1_000i/' testfile.fa
fi
done
The result I get every time is 4 lots of 'command not found' as pasted below, per number in the range.
[1: command not found
[[1: command not found
[[1: command not found
[[1: command not found
[2: command not found
[[2: command not found
[[2: command not found
[[2: command not found
etc until 5000
I assume I must have something wrong with how I've written the code, but as someone who is new to this, I can't see what is wrong.
Thank you for reading, if you can help that is very much appreciated. If you need anymore details, I will gladly try and help to the best of my ability. Unfortunately, I can't share the exact files I'm working on (I know this isn't helpful sorry) as I do not have permission.
Shell syntax
The result I get every time is 4 lots of 'command not found' as pasted
below, per number in the range.
[1: command not found
[[1: command not found
[[1: command not found
[[1: command not found
[2: command not found
[[2: command not found
[[2: command not found
[[2: command not found
etc until 5000
The [ character is not special to the shell. [ and [[ are not operators, but rather an ordinary command and a reserved word, respectively. They have no involvement in splitting command lines into words. The same applies to ] and ]] -- the shell does not automatically break words on either side of them.
The " character is special to the shell, but it does not create a word boundary. The shell has quoting, but it does not have quote-delimited strings as a syntactic unit in the sense that some other languages do.
With that in mind, consider this code fragment:
elif [["$i" -gt "9"] && ["i" -le "99"]]; then
Because neither [[ nor " produce a word break, [["$i" expands to a single word, for example [[1, which, given its position, is interpreted as the name of a command to execute. There being no built-in command by that name and no program by that name in the path, executing that command fails with "command not found".
You need to insert whitespace to make separate words separate (but see also below):
elif [[ "$i" -gt "9" ] && [ "i" -le "99" ]]; then
Moreover, again, [ is a command and [[ is a reserved word naming a built-in command. ] is an argument with special meaning to the [ command, and ]] is an argument with special significance to the [[ built-in. Although they (intentionally) have a similar appearance, these are not analogous to parentheses. You don't need to impose grouping here anyway. The && operator already separates commands, and the overall pipeline does not need to be explicitly demarcated as a group. This would be correct and more natural:
elif [[ "$i" -gt "9" ]] && [[ "$i" -le "99" ]]; then
Furthermore, although it is not wrong, it is unnecessary and a bit weird to quote your numbers. The case is a more nuanced one for the expansions of $i, since its values are fully under your control, but "always quote your variable expansions" is a pretty good rule until your shell scripting is strong enough for you to decide for yourself when you can do otherwise. So, this is where we arrive:
elif [[ "$i" -gt 9 ]] && [[ "$i" -le 99 ]]; then
You will want to do likewise throughout your script.
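Applied to all four branches, the corrected tests might look like the sketch below. The helper name zero_prefix_for is hypothetical, and it only prints the zero prefix each branch would use; it does not address the sed expressions, where i would also need to be written as $i inside double quotes for the shell to expand it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the zero prefix for a read number below 10000.
zero_prefix_for() {
  local i=$1
  if [[ "$i" -le 9 ]]; then
    echo 000000
  elif [[ "$i" -gt 9 ]] && [[ "$i" -le 99 ]]; then
    echo 00000
  elif [[ "$i" -gt 99 ]] && [[ "$i" -le 999 ]]; then
    echo 0000
  elif [[ "$i" -gt 999 ]] && [[ "$i" -le 9999 ]]; then
    echo 000
  fi
}

echo "V1_$(zero_prefix_for 15)15"
```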
But wait, there's more!
I think the changes described above would make your script work, but it would be extremely slow, because it will make 5000 passes through the whole file. And on the whole 1.5M entry file, you would need a version that made 1.5M passes through the whole half-gigabyte-ish of data. It would take years to complete.
That approach is not viable, not really even for the 5000 lines. You need something that will make only a single pass through the file, or at worst a small, fixed number of passes. I think a one-pass approach would be possible with sed, but it would take a complex and very cryptic sed expression. I'm a sed fan, but I would recommend awk for this, or even shell without any external tool.
A pure-shell version could be built with the read and printf built-in commands combined with some of the shell's other features. An awk version could be expressed as a not-overly-complex one-liner. The details of either option depend on the file syntax, however, which, as I commented on the question, I think you have misrepresented.
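A rough sketch of the pure-shell, single-pass idea, under loudly stated assumptions: each read name sits at the start of a line in the form V1_<number>, optionally preceded by the > that FASTA headers normally carry, and the target width is 7 digits. The function name pad_read_names is hypothetical. A shell read loop is still likely to be much slower than awk on very large files, so treat this as an illustration of the approach rather than a tuned solution.

```bash
#!/usr/bin/env bash
# Hypothetical filter: reads a file on stdin, writes the renamed version to stdout.
# Assumption: names look like V1_<digits> at the start of a line, with an optional ">".
pad_read_names() {
  local line
  local re='^(>?V1_)([0-9]+)(.*)$'
  # The "|| [[ -n $line ]]" part keeps a final line that lacks a trailing newline.
  while IFS= read -r line || [[ -n $line ]]; do
    if [[ $line =~ $re ]]; then
      # 10# forces base 10 so that digits with leading zeros are not read as octal.
      printf '%s%07d%s\n' "${BASH_REMATCH[1]}" "$((10#${BASH_REMATCH[2]}))" "${BASH_REMATCH[3]}"
    else
      printf '%s\n' "$line"
    fi
  done
}

printf 'V1_15 flag=1 multi=9.0000 len=4\nACGT\n' | pad_read_names
```

It would be used as pad_read_names < testfile.fa > testfile.padded.fa, writing to a new file so the original stays untouched.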

Do test operators -a and -o short circuit?

Do test operators -a and -o short circuit?
I tried if [ 0 -eq 1 -a "" -eq 0 ]; then ... which complained about the syntax of the second conditional. But I can't tell if that's because
-a does not short circuit
or test wants everything properly formatted before it begins and it still short circuits.
The result is leading me to create a nested if when really what I wanted was a situation where the first conditional would guard against executing the second if a particular var had not yet been set...
edit: As for why am I using obsolescent operators, the code has to work everywhere in my environment and I just found a machine where
while [ -L "$file" ] && [ "$n" -lt 10 ] && [ "$m" -eq 0 ]; do
is an infinite loop and changing to the obsolete -a yields good behavior:
while [ -L "$file" -a "$n" -lt 10 -a "$m" -eq 0 ]; do
What should I do? The first expression works on many machines but not this machine which appears to require the second expression instead...
Per the POSIX specification for test:
>4 arguments:
The results are unspecified.
Thus, barring XSI extensions, POSIX says nothing about how this behaves.
Moreover, even on a system with XSI extensions:
expression1 -a expression2: True if both expression1 and expression2 are true; otherwise, false. The -a binary primary is left associative. It has a higher precedence than -o. [Option End]
expression1 -o expression2: True if either expression1 or expression2 is true; otherwise, false. The -o binary primary is left associative. [Option End]
There's no specification with respect to short-circuiting.
If you want short-circuiting behavior -- or POSIX-defined behavior at all -- use && and || to connect multiple, separate test invocations.
Quoting again, from later in the document:
APPLICATION USAGE
The XSI extensions specifying the -a and -o binary primaries and the '(' and ')' operators have been marked obsolescent. (Many expressions using them are ambiguously defined by the grammar depending on the specific expressions being evaluated.) Scripts using these expressions should be converted to the forms given below. Even though many implementations will continue to support these obsolescent forms, scripts should be extremely careful when dealing with user-supplied input that could be confused with these and other primaries and operators. Unless the application developer knows all the cases that produce input to the script, invocations like:
test "$1" -a "$2"
should be written as:
test "$1" && test "$2"
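The guard the question asks for can be sketched with two separate [ invocations joined by &&. Here the shell itself does the short-circuiting, so the numeric test never sees an empty operand. The function name is_zero_if_set is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the value is non-empty and numerically zero.
is_zero_if_set() {
  local value=$1
  # The second test runs only when the first one succeeds.
  [ -n "$value" ] && [ "$value" -eq 0 ]
}

if is_zero_if_set ""; then
  echo 'zero'
else
  echo 'unset or non-zero'
fi
```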
Well, you already know the behaviour, so this question is really about how to interpret those results. But TBH, there aren't many real-world scenarios where you'll observe different behaviour.
I created a small test case to check what's going on (at least, on my system, since the other answer suggests it's not standardized):
strace bash -c 'if [ 0 -eq 1 -a -e "/nosuchfile" ]; then echo X; fi'
If you check the output you'll see that bash looks for the file, so the answer is:
The operators don't short circuit.

Comparing strings composed from numbers

I'm trying to compare values of two variables both containing strings-as-numbers. For example:
var1="5.4.7.1"
var2="6.2.4.5"
var3="1-4"
var4="1-5"
var5="2.3-3"
var6="2.3.4"
Sadly, I don't even know where to start... Any help will be appreciated!
What I meant is how would I go about comparing the value of $var5 to $var6 and determine which one of them is higher.
EDIT: Better description of the problem.
You can use [[ ${str1} < ${str2} ]] style test. This should work:
function max()
{
[[ "$1" < "$2" ]] && echo "$2" || echo "$1"
}
max=$(max "${var5}" "${var6}")
echo "max=${max}."
It depends on the required portability of the solution. If you don't care about that and you use a deb-based distribution, you can use the dpkg --compare-versions feature.
However, if you need to run your script on distros without dpkg, I would use the following approach.
The value you need to compare consist of first (the first element) and the rest (all others). The first is usually called the head and the rest - tail, but I deliberately use names first and rest, to not confuse with head(1) and tail(1) tools available on Unix systems.
If first($var1) is not equal to first($var2), you just compare those first elements. If the firsts are equal, recursively run the compare function on rest($var1) and rest($var2). As a border case you need to decide what to do if the values are like:
var1="2.3.4"
var2="2.3"
and at some step you will compare an empty first with a non-empty one.
Hint for implementing first and rest functions:
foo="2.3-4.5"
echo ${foo%%[^0-9][0-9]*}
echo ${foo#[0-9]*[^0-9]}
If those are unclear to you, read man bash section titled Parameter Expansion. Searching the manual for ## string will show you the exact section immediately.
Also, make sure you are comparing elements numerically, not in lexical order. For example, compare the results of the following commands:
[[ 9 > 10 ]]; echo $?
[[ 9 -gt 10 ]]; echo $?
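Putting the first/rest idea together, a comparison function could look like the sketch below. It uses a loop instead of recursion, and it rests on two assumptions that the question does not settle: every non-digit character (. or -) is treated as the same kind of separator, and a missing element counts as 0, so 2.3 and 2.3.0 compare equal. The function name compare_versions is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints -1, 0 or 1 when $1 is lower than, equal to,
# or higher than $2.
compare_versions() {
  local a=$1 b=$2 first_a first_b
  while [[ -n $a || -n $b ]]; do
    # first: the leading run of digits (everything before the first non-digit)
    first_a=${a%%[!0-9]*}
    first_b=${b%%[!0-9]*}
    # Assumption: a missing element counts as 0.
    first_a=${first_a:-0}
    first_b=${first_b:-0}
    # Compare numerically; 10# avoids octal interpretation of leading zeros.
    if (( 10#$first_a < 10#$first_b )); then echo -1; return; fi
    if (( 10#$first_a > 10#$first_b )); then echo 1; return; fi
    # rest: drop the first element together with its separator
    if [[ $a == *[!0-9]* ]]; then a=${a#*[!0-9]}; else a=; fi
    if [[ $b == *[!0-9]* ]]; then b=${b#*[!0-9]}; else b=; fi
  done
  echo 0
}

compare_versions "2.3-3" "2.3.4"
```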

What are the different uses of the different types of bracing used for conditionals in shell scripts?

I know of at least 4 ways to test conditions in shell scripts.
[ <cond> ];
[[ <cond> ]];
(( <cond> ));
test <cond>;
I would like to have a comprehensive overview of what the differences between these methods are, and also when to use which of the methods.
I've tried searching the web for a summary but didn't find anything decent. It'd be great to have a decent list up somewhere (stack overflow to the rescue!).
Let's describe them here.
First of all, there are basically 3 different test methods
[ EXPRESSION ], which is exactly the same as test EXPRESSION
[[ EXPRESSION ]]
(( EXPRESSION )), which is exactly the same as let "EXPRESSION"
Let's go into the details:
test
This is the grandfather of test commands. Even if your shell does not support it, there's still a /usr/bin/test command on virtually every unix system. So calling test will either run the built-in or the binary as a fallback. Enter $ type test to see which version is used. Likewise for [.
In most basic cases, this should be sufficient to do your testing.
if [ "$a" = test -o "$a" = Test ];
if test "$a" = test -o "$a" = Test;
If you need more power, then there's...
[[]]
This is a bash special. Not every shell needs to support this, and there's no binary fallback. It provides a more powerful comparison engine, notably pattern matching and regular expression matching.
if [[ "$a" == [Tt]es? ]]; # pattern
if [[ "$a" =~ ^[Tt]es.$ ]]; # RE
(())
This is a bash special used for arithmetic expressions, and is true if the result of the calculation is non-zero. Not every shell needs to support this, and there's no binary fallback.
if (( x * (1 + x++) ));
if let "x * (1 + x++)";
Note that you can omit the $ sign when referencing variables within (( ... )) or let.
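The three forms can be seen side by side in the following sketch. The function name describe and its messages are made up for illustration; the [ tests are joined with || rather than -o, which behaves the same here.

```bash
#!/usr/bin/env bash
# Hypothetical demo: reports which kinds of test succeed for a string and a number.
describe() {
  local a=$1 x=$2
  # [ ... ]: plain string equality, two separate invocations
  if [ "$a" = test ] || [ "$a" = Test ]; then echo 'test: literal match'; fi
  # [[ ... ]]: glob pattern on the right-hand side
  if [[ $a == [Tt]es? ]]; then echo '[[ ]]: pattern match'; fi
  # [[ ... ]]: extended regular expression
  if [[ $a =~ ^[Tt]es.$ ]]; then echo '[[ ]]: regex match'; fi
  # (( ... )): true when the arithmetic result is non-zero; no $ needed on x
  if (( x * 2 )); then echo '(( )): non-zero'; fi
}

describe Test 3
```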
On the site linked here, if you scroll down to the [ special character, you will see a separate entry for [[, with a link to the discussion of the differences between them. There is also an entry for (( below those. Hope that helps!
