Idiomatic use of process substitution - bash

I have learned Bash process substitution from Bash's man page. Unfortunately, my unskilled usage of the feature is ugly.
DEV=<(some commands that produce lines of data) && {
while read -u ${DEV##*/} FIELD1 FIELD2 FIELD3; do
some commands that consume the fields of a single line of data
done
}
Do skilled programmers have other ways to do this?
If an executable sample is desired, try this:
DEV=<(echo -ne "Cincinnati Hamilton Ohio\nAtlanta Fulton Georgia\n") && {
while read -u ${DEV##*/} FIELD1 FIELD2 FIELD3; do
echo "$FIELD1 lies in $FIELD2 County, $FIELD3."
done
}
Sample output:
Cincinnati lies in Hamilton County, Ohio.
Atlanta lies in Fulton County, Georgia.
In my actual application, the "some commands" are more complicated, but the above sample captures the essence of the question.
Process substitution <() is required. Alternatives to process substitution would not help.

Redirect into the loop's stdin with the operator <.
while read city county state; do
echo "$city lies in $county County, $state."
done < <(echo -ne "Cincinnati Hamilton Ohio\nAtlanta Fulton Georgia\n")
Output:
Cincinnati lies in Hamilton County, Ohio.
Atlanta lies in Fulton County, Georgia.
Note that in this example, a pipe works just as well.
echo -ne "Cincinnati Hamilton Ohio\nAtlanta Fulton Georgia\n" |
while read city county state
do
echo "$city lies in $county County, $state."
done
Also, uppercase variable names should be reserved for environment variables (like PATH) and other special variables (like RANDOM). And descriptive variable names are always good.
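One difference between the two forms above matters as soon as the loop has to set variables for later use. The following is a minimal sketch, assuming bash with its default options (the lastpipe option is off); the counter names are made up for illustration. With a pipe, the loop runs in a subshell and its assignments are lost, while with process substitution the loop runs in the current shell.

```bash
#!/usr/bin/env bash
# Piped loop: runs in a subshell (unless lastpipe is active), so the
# increments are lost once the loop ends.
piped_count=0
printf 'a\nb\n' | while read -r line; do
    piped_count=$((piped_count + 1))
done

# Same loop fed by process substitution: it runs in the current shell,
# so the counter keeps its value after the loop.
subst_count=0
while read -r line; do
    subst_count=$((subst_count + 1))
done < <(printf 'a\nb\n')

echo "pipe: $piped_count, process substitution: $subst_count"
```

If the loop only prints its results, as in the question, either form is fine.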

There are a few alternatives that will be portable. The 'right' choice depends on the specific case: in particular, on the time it takes to produce the input data and on the size of the input.
If it takes a lot of time to produce or process the data, you want the data generation and the 'while' loop to run in parallel. This gives incremental processing: the loop can start consuming lines without waiting for all of the input to be generated.
If the input is very large (and does not fit into a shell variable), you may have no choice but to use an actual pipe. The same is true when the data is binary or similar (a bash variable cannot hold NUL bytes, for example).
Mapping to the original question: PRODUCE is echo Cincinnati ..., and CONSUME is echo "$city ...".
For the trivial case (small input, fast produce/consume), the following will work. Bash will run them SEQUENTIALLY: PRODUCE, then CONSUME.
while read ... ; do
CONSUME
done <<< "$(PRODUCE)"
For the complex case (large input, or slow produce and consume), the following can be used to request PARALLEL execution:
while read ... ; do
CONSUME
done < <(PRODUCE)
If the PRODUCE code is complex (loops, conditionals, etc.) or long (multiple lines), consider moving it into a function instead of inlining it into the loop command.
function produce {
PRODUCE
}
while read ... ; do
CONSUME
done < <(produce)
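The difference between the sequential and the parallel form can be observed with a deliberately slow producer. This is only a sketch: the produce function and its 2-second delay are invented for the demonstration, and the timings rely on bash's SECONDS variable, which has one-second resolution.

```bash
#!/usr/bin/env bash
# Hypothetical slow producer: one line immediately, another 2 seconds later.
produce() {
    echo first
    sleep 2
    echo second
}

# Here-string: $(produce) has to finish before the loop sees any input.
SECONDS=0
buffered_first=
while read -r line; do
    : "${buffered_first:=$SECONDS}"   # remember when the first line arrived
done <<< "$(produce)"

# Process substitution: the loop reads while produce is still running.
SECONDS=0
streamed_first=
while read -r line; do
    : "${streamed_first:=$SECONDS}"
done < <(produce)

echo "first line seen after ${buffered_first}s (here-string)"
echo "first line seen after ${streamed_first}s (process substitution)"
```

With the here-string the first line only arrives after the producer has finished (roughly 2 seconds here); with process substitution it arrives almost immediately.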

Related

Iterating an unusual incremental stepping pattern of numbers in a bash loop

I have some IDs that follow this unusual incremental stepping pattern, as follows:
eg600100.etc
eg600101.etc
eg600102.etc
...
eg600109.etc
eg600200.etc
...
eg600209.etc
eg600300.etc
...
eg600909.etc
eg601000.etc
eg601001.etc
...
eg601009.etc
eg601100.etc
...
eg601200.etc
...
eg601909.etc
As far as I can tell, it's broken up like this:
60|01-19|00-09
I'm wanting to build a loop that can iterate over each potential ID incrementally up to the end of the range (which is 601909).
How do I break the number up into those 3 segments to manage the unusual stepping for a loop?
I've looked at seq, but I can't figure out how to make it accept the unusual stepping here so that it does not give me numbers between increments that do not exist as potential IDs as above.
#!/bin/bash
for id in $(seq -w 601909)
do
echo "Testing current ID, which is eg$id.etc"
done
Any ideas?
Try using Brace Expansion:
printf %s\\n eg60{01..19}{00..09}.etc >file
The command above just writes the IDs to a file as a test; you can also iterate over such an expansion directly:
for id in eg60{01..19}{00..09}.etc; do echo Testing ID: $id; done
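To check that the expansion covers exactly the intended range, it can be stored in an array first. A small sketch, assuming bash 4 or newer (needed for zero-padded ranges such as {01..19}); the array name ids is arbitrary.

```bash
#!/usr/bin/env bash
# Each {..} range is expanded in turn, the rightmost one varying fastest,
# which matches the 60|01-19|00-09 stepping described in the question.
ids=( eg60{01..19}{00..09}.etc )

echo "count: ${#ids[@]}"                 # 19 * 10 = 190 IDs
echo "first: ${ids[0]}"                  # eg600100.etc
echo "last:  ${ids[${#ids[@]}-1]}"       # eg601909.etc
```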

Returning values from functions when efficiency matters

It seems to me, there are several ways to return a value from a Bash function.
Approach 1: Use a "local-global" variable, which is defined as local in the caller:
func1() {
a=10
}
parent1() {
local a
func1
a=$(($a + 1))
}
Approach 2: Use command substitution:
func2() {
echo 10
}
parent2() {
a=$(func2)
a=$(($a + 1))
}
How much speedup could one expect from using approach 1 over approach 2?
And, I know that it is not good programming practice to use global variables like in approach 1, but could it at some point be justified due to efficiency considerations?
The single most expensive operation in shell scripting is forking. Any operation involving a fork, such as command substitution, will be 1-3 orders of magnitude slower than one that doesn't.
For example, here's a straightforward approach for a loop that reads a bunch of generated file names of the form file-1234 and strips off the file- prefix using sed, requiring a total of three forks per iteration (command substitution + two-stage pipeline):
$ time printf "file-%s\n" {1..10000} |
while read line; do n=$(echo "$line" | sed -e "s/.*-//"); done
real 0m46.847s
Here's a loop that does the same thing with parameter expansion, requiring no forks:
$ time printf "file-%s\n" {1..10000} |
while read line; do n=${line#*-}; done
real 0m0.150s
The forky version takes 300x longer.
Therefore, the answer to your question is yes: if efficiency matters, you have solid justification for factoring out or replacing forky code.
When the fork count is constant with respect to the input (or it's too messy to make it constant), and the code is still too slow, that's when you should rewrite it in a faster language.
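To measure the two approaches from the question directly, something like the following sketch could be used. The wrapper functions, the variable name result and the iteration count of 1000 are arbitrary choices for the benchmark, and the timings will vary from machine to machine.

```bash
#!/usr/bin/env bash
# Approach 1: the callee assigns to a variable that the caller declared local.
func1() { result=10; }
# Approach 2: the callee prints the value; the caller forks to capture it.
func2() { echo 10; }

run_approach1() {
    local result i
    for ((i = 0; i < 1000; i++)); do func1; done
    total1=$((result + 1))
}

run_approach2() {
    local result i
    for ((i = 0; i < 1000; i++)); do result=$(func2); done
    total2=$((result + 1))
}

time run_approach1    # no forks
time run_approach2    # one fork per iteration
echo "both approaches computed: $total1 and $total2"
```

Both functions compute the same value; only the time reported for each call differs.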
Approach 1 is surely much faster than approach 2, because it triggers no interrupt (which in turn may need several kernel crossings to service) and needs only a single memory access.

Parsing the output of Bash's time builtin

I'm running a C program from a Bash script, and running it through a command called time, which outputs some time statistics for the running of the algorithm.
If I were to perform the command
time $ALGORITHM $VALUE $FILENAME
It produces the output:
real 0m0.435s
user 0m0.430s
sys 0m0.003s
The values depend on the run of the algorithm.
However, what I would like to be able to do is to take the 0.435 and assign it to a variable.
I've read into awk a bit, enough to know that if I pipe the above command into awk, I should be able to grab the 0.435 and place it in a variable. But how do I do that?
Many thanks
You must be careful: there's the Bash builtin time and there's the external command time, usually located in /usr/bin/time (run type -a time to list all the versions of time available on your system).
If your shell is Bash, when you issue
time stuff
you're calling the builtin time. You can't directly catch the output of time without some minor trickery. This is because time doesn't want to interfere with possible redirections or pipes you'll perform, and that's a good thing.
To get time output on standard out, you need:
{ time stuff; } 2>&1
(grouping and redirection).
Now, about parsing the output: parsing the output of a command is usually a bad idea, especially when it's possible to do without. Fortunately, Bash's time command accepts a format string. From the manual:
TIMEFORMAT
The value of this parameter is used as a format string specifying how the timing information for pipelines prefixed with the time reserved word should be displayed. The % character introduces an escape sequence that is expanded to a time value or other information. The escape sequences and their meanings are as follows; the braces denote optional portions.
%%
A literal `%`.
%[p][l]R
The elapsed time in seconds.
%[p][l]U
The number of CPU seconds spent in user mode.
%[p][l]S
The number of CPU seconds spent in system mode.
%P
The CPU percentage, computed as (%U + %S) / %R.
The optional p is a digit specifying the precision, the number of fractional digits after a decimal point. A value of 0 causes no decimal point or fraction to be output. At most three places after the decimal point may be specified; values of p greater than 3 are changed to 3. If p is not specified, the value 3 is used.
The optional l specifies a longer format, including minutes, of the form MMmSS.FFs. The value of p determines whether or not the fraction is included.
If this variable is not set, Bash acts as if it had the value
$'\nreal\t%3lR\nuser\t%3lU\nsys\t%3lS'
If the value is null, no timing information is displayed. A trailing newline is added when the format string is displayed.
So, to fully achieve what you want:
var=$(TIMEFORMAT='%R'; { time $ALGORITHM $VALUE $FILENAME; } 2>&1)
As @glennjackman points out, if your command sends any messages to standard output or standard error, you must take care of that too. For that, some extra plumbing is necessary:
exec 3>&1 4>&2
var=$(TIMEFORMAT='%R'; { time $ALGORITHM $VALUE $FILENAME 1>&3 2>&4; } 2>&1)
exec 3>&- 4>&-
Source: BashFAQ032 on the wonderful Greg's wiki.
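The same TIMEFORMAT technique extends to several values at once. The following is a sketch rather than a drop-in solution: sleep 0.2 stands in for $ALGORITHM $VALUE $FILENAME, the variable names real, user and sys are arbitrary, and it assumes the timed command itself prints nothing (otherwise the extra plumbing above is needed).

```bash
#!/usr/bin/env bash
# TIMEFORMAT is set inside the process substitution, so the setting does not
# leak into the rest of the script. The three values are printed on one line
# and split into three variables by read.
read -r real user sys < <(
    TIMEFORMAT='%R %U %S'
    { time sleep 0.2; } 2>&1
)

echo "real=$real user=$user sys=$sys"
```

Each value is a plain number of seconds with three decimals, so no further parsing is needed.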
You could try the awk command below, which uses the split function to split the second field on a digit followed by m, or on the trailing s.
$ foo=$(awk '/^real/{split($2,a,"[0-9]m|s$"); print a[2]}' file)
$ echo "$foo"
0.435
You can use this awk:
var=$(awk '$1=="real"{gsub(/^[0-9]+[hms]|[hms]$/, "", $2); print $2}' file)
echo "$var"
0.435
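Both awk commands above read from a file that already holds the output of time. As a sketch of how this could be wired up without the intermediate file, the grouped time command can be piped straight into awk. This assumes TIMEFORMAT is unset (so the default real/user/sys layout is printed) and uses sleep 0.2 as a stand-in for the real command. Note that the minutes part is simply discarded, so this only gives the right number for runs shorter than one minute.

```bash
#!/usr/bin/env bash
unset TIMEFORMAT    # make sure the default "real 0m0.435s" layout is used

real_seconds=$(
    { time sleep 0.2; } 2>&1 |
        awk '$1 == "real" { gsub(/^[0-9]+m|s$/, "", $2); print $2 }'
)

echo "$real_seconds"
```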

script to find similar email users

We have a mail server and I am trying to write a script that will find all users with similar names, to prevent malicious users from impersonating legitimate users. For example, a legit user may have the name james2014@domain.com but a malicious user may register as james20l4@domain.com. The difference, if you look carefully, is that I replaced the number 'one' with the letter 'l' (el). So I am trying to write something that can consult my /var/vmail/domain/* and find similar names and alert me (the administrator). I will then take the necessary steps to do what I need. Really appreciate any help.
One hacky way to do this is to derive "normalized" versions of your usernames, put those in an associative array as keys mapping to the original input, and use those to find problems.
The example I posted below uses bash associative arrays to store the mapping from normalized name to original name, and tr to switch some characters for other characters (and delete other characters entirely).
I'm assuming that your list of users will fit into memory; you'll also need to tweak the mapping of modified and removed characters to hit your favorite balance between effectiveness and false positives. If your list can't fit in memory, you can use a single file or the filesystem to approximate it, but honestly if you're processing that many names you're probably better off with a non-shell programming language.
Input:
doc
dopey
james2014
happy
bashful
grumpy
james20l4
sleepy
james.2014
sneezy
Script:
#!/bin/bash
# stdin: A list of usernames. stdout: Pairs of names that match.
CHARS_TO_REMOVE="._\\- "
CHARS_TO_MAP_FROM="OISZql"
CHARS_TO_MAP_TO="0152g1"
normalize() {
    # stdin: A word. stdout: A modified version of the same word.
    exec tr "$CHARS_TO_MAP_FROM" "$CHARS_TO_MAP_TO" \
        | tr --delete "$CHARS_TO_REMOVE" \
        | tr "A-Z" "a-z"
}
declare -A NORMALIZED_NAMES
while read NAME; do
    NORMALIZED_NAME=$(normalize <<< "$NAME")
    # -n tests for non-empty strings, as it would be if the name were set already.
    if [[ -n ${NORMALIZED_NAMES[$NORMALIZED_NAME]} ]]; then
        # This name has been seen before! Print both of them.
        echo "${NORMALIZED_NAMES[$NORMALIZED_NAME]} $NAME"
    else
        # This name has not been seen before. Store it.
        NORMALIZED_NAMES["$NORMALIZED_NAME"]="$NAME"
    fi
done
Output:
james2014 james20l4
james2014 james.2014
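The question mentions /var/vmail/domain/* as the source of the names. As a rough sketch of how the same normalization idea could be applied to a directory listing, the following assumes one subdirectory per user (an assumption about the mail layout) and uses a temporary directory as a stand-in for the real mail root. The names mail_root, first_seen and lookalikes are invented for the example.

```bash
#!/usr/bin/env bash
# Stand-in for /var/vmail/domain, with one directory per user (assumed layout).
mail_root=$(mktemp -d)
mkdir "$mail_root"/doc "$mail_root"/james2014 "$mail_root"/james20l4

declare -A first_seen     # normalized name -> first original name seen
lookalikes=()             # collected "original lookalike" pairs

for path in "$mail_root"/*/; do
    name=$(basename "$path")
    # Same idea as normalize above: map confusable characters, drop
    # punctuation, fold case.
    key=$(tr 'OISZql' '0152g1' <<< "$name" | tr -d '._ -' | tr 'A-Z' 'a-z')
    if [[ -n ${first_seen[$key]:-} ]]; then
        lookalikes+=( "${first_seen[$key]} $name" )
    else
        first_seen[$key]=$name
    fi
done

printf '%s\n' "${lookalikes[@]}"
rm -rf "$mail_root"
```

Like the script above, this needs bash 4 or newer for associative arrays, and it forks three tr processes per name, so it is only suitable for modest numbers of users.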

bash while loop with command as part of the expression?

I am trying to read part of a file and stop at a particular line, using bash. I am not very familiar with bash, but I've been reading the manual and various references, and I don't understand why something like the following does not work (but instead produces a syntax error):
while { read -u 4 line } && (test "$line" != "$header_line")
do
echo in loop, line=$line
done
I think I could write a loop that tests a "done" variable, and then do my real tests inside the loop and set "done" appropriately, but I am curious as to 1) why the above does not work, and 2) is there some small correction that would make it work? I am still fairly confused about when to use [, (, {, or ((, so perhaps some other combination would work, though I have tried several.
(Note: The "read -u 4 line" works fine when I call it above the loop. I have opened a file on file descriptor 4.)
I think what you want is more like this:
while read -u 4 line && test "$line" != "$header_line"
do
...
done
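For completeness, here is a self-contained version of that loop that can be run as is. The temporary input file, its contents and the header text are made up for the illustration; only the loop condition comes from the answer above.

```bash
#!/usr/bin/env bash
# Hypothetical input: two data lines, then the header line to stop at.
input=$(mktemp)
printf '%s\n' 'alpha' 'beta' '## HEADER ##' 'gamma' > "$input"
header_line='## HEADER ##'

exec 4< "$input"          # open the file on file descriptor 4
seen=()
while read -r -u 4 line && test "$line" != "$header_line"
do
    seen+=( "$line" )
    echo "in loop, line=$line"
done
exec 4<&-                 # close file descriptor 4
rm -f "$input"
```

The loop stops as soon as either read fails (end of file) or the header line is reached, so gamma is never processed.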
Braces (the {} characters) are used to separate variables from other parts of a string when whitespace cannot be used. For example, echo "${var}x" will print the value of the variable var followed by an x, but echo "$varx" will print the value of the variable varx.
Brackets (the [] characters) are used as a shortcut for the test program. [ is another name for test, but when test detects that it was called as [ it requires a ] as its last argument. The point is clarity.
Parentheses (the () characters) are used in a number of different situations. They generally start subshells, although not always (I'm not really certain in case #3 here):
1. Retrieving a single exit code from a series of processes, or a single output stream from a sequence of commands. For example, (echo "Hi" ; echo "Bye") | sed -e "s/Hi/Hello/" will print two lines, "Hello" and "Bye". It is the easiest way to get multiple echo statements to produce a single stream.
2. Evaluating commands as if they were variables: $(expr 1 + 1) will act like a variable, but will produce the value 2.
3. Performing math: $((5 * 4 / 3 + 2 % 1)) will evaluate like a variable, but will compute the result of that mathematical expression.
The && operator is a list operator: it separates two commands and only executes the second when the first succeeds. In this case, however, the first command is the while itself, which is still expecting its do part; by the time the parser reaches do, the while is already history.
Your intention is to make the && part of the condition, so group it with (). For example, this is a solution with just a small change:
while ( read -u 4 line && test "$line" != "$header_line" )
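One caveat with the parenthesized form: the condition runs in a subshell, so the value that read assigns to line is not visible in the loop body. The braces from the question also work once a ; (or a newline) is placed before the closing }, because } is only recognized where a new command could start; a brace group runs in the current shell. The sketch below contrasts the two, using made-up input and array names.

```bash
#!/usr/bin/env bash
header_line='STOP'

# Parentheses: read runs in a subshell, so $line is empty in the loop body.
exec 4< <(printf '%s\n' one STOP)
subshell_lines=()
while ( read -r -u 4 line && test "$line" != "$header_line" )
do
    subshell_lines+=( "line=$line" )
done
exec 4<&-

# Braces: note the ';' before '}'. The group runs in the current shell,
# so $line keeps the value that read assigned.
exec 4< <(printf '%s\n' one STOP)
group_lines=()
while { read -r -u 4 line; } && test "$line" != "$header_line"
do
    group_lines+=( "line=$line" )
done
exec 4<&-

printf '%s\n' "parentheses: ${subshell_lines[*]}" "braces: ${group_lines[*]}"
```

Both loops run exactly once, but only the brace version sees the line that was read.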

Resources