bash script to modify and extract information

I am creating a bash script to modify and summarize information with grep and sed. But it gets stuck.
#!/bin/bash
# This script extracts some basic information
# from text files and prints it to screen.
#
# Usage: ./myscript.sh </path/to/text-file>
#Extract lines starting with ">#HWI"
ONLY=`grep -v ^\>#HWI`
#replaces A and G with R in lines
ONLYR=`sed -e s/A/R/g -e s/G/R/g $ONLY`
grep R $ONLYR | wc -l

The correct way to write a shell script to do what you seem to be trying to do is:
awk '
!/^>#HWI/ {
    gsub(/[AG]/,"R")
    if (/R/) {
        ++cnt
    }
}
END { print cnt+0 }
' "$@"
Just put that in the file myscript.sh and execute it as you do today.
To be clear - the bulk of the above code is an awk script, the shell script part is the first and last lines where the shell just calls awk and passes it the input file names.
If you WANT to have intermediate variables then you can create/print them with:
awk '
!/^>#HWI/ {
    only = $0
    onlyR = only
    gsub(/[AG]/,"R",onlyR)
    print "only:", only
    print "onlyR:", onlyR
    if (/R/) {
        ++cnt
    }
}
END { print cnt+0 }
' "$@"
The above will work robustly, portably, and efficiently on all UNIX systems.

First of all, as @fedorqui commented, you're not providing grep with a source of input against which it will perform line matching.
Second, there are some problems in your script, which will result in unwanted behavior in the future, when you decide to manipulate some data:
Store matching lines in an array, or a file from which you'll later read values. The variable ONLY is not the right data structure for the task.
By convention, environment variables (PATH, EDITOR, SHELL, ...) and internal shell variables (BASH_VERSION, RANDOM, ...) are fully capitalized. All other variable names should be lowercase. Since variable names are case-sensitive, this convention avoids accidentally overriding environment and internal variables.
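A quick illustration of why the convention matters (my example, not from the original answer):

PATH="backups"   # clobbers the real PATH: ls, grep, sed can no longer be found
path="backups"   # safe: a lowercase name cannot collide with the shell's own variables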
Here's a better version of your script, considering these points, but with an open question regarding what you were trying to do in the last line : grep R $ONLYR | wc -l :
#!/bin/bash
# This script extracts some basic information
# from text files and prints it to screen.
#
# Usage: ./myscript.sh </path/to/text-file>
input_file=$1
# Read lines not matching the provided regex, from $input_file
mapfile -t only < <(grep -v '^\>#HWI' "$input_file")
#replaces A and G with R in lines
for((i=0;i<${#only[@]};i++)); do
only[i]="${only[i]//[AG]/R}"
done
# DEBUG
printf '%s\n' "Here are the lines, after relpace:"
printf '%s\n' "${only[#]}"
# I'm not sure what you were trying to do here. Am I gueesing right that you wanted
# to count the number of R's in ALL lines ?
# grep R $ONLYR | wc -l
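# (A guess at the intent, not from the original answer:) if the goal was to count
# the lines that contain at least one R after the replacement, one option would be:
# printf '%s\n' "${only[@]}" | grep -c R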

Update version number in property file using bash

I am new to bash scripting and I need help with awk. The thing is that I have a property file with a version inside and I want to update it.
version=1.1.1.0
and I use awk to do that
file="version.properties"
awk -F'["]' -v OFS='"' '/version=/{
split($4,a,".");
$4=a[1]"."a[2]"."a[3]"."a[4]+1
}
;1' $file > newFile && mv newFile $file
but I am getting a strange result: version="1.1.1.0""...1
Could someone please help me with this?
You mentioned in your comment you want to update the file in place. You can do that in a one-liner with perl:
perl -pe '/^version=/ and s/(\d+\.\d+\.\d+\.)(\d+)/$1 . ($2+1)/e' -i version.properties
Explanation
-e is followed by a script to run. With -p and -i, the effect is to run that script on each line, and modify the file in place if the script changes anything.
The script itself, broken down for explanation, is:
/^version=/ and # Do the following on lines starting with `version=`
s/ # Make a replacement on those lines
(\d+\.\d+\.\d+\.)(\d+)/ # Match x.y.z.w, and set $1 = `x.y.z.` and $2 = `w`
$1 . ($2+1)/ # Replace x.y.z.w with a copy of $1, followed by w+1
e # This tells Perl the replacement is Perl code rather
# than a text string.
Example run
$ cat foo.txt
version=1.1.1.2
$ perl -pe '/^version=/ and s/(\d+\.\d+\.\d+\.)(\d+)/$1 . ($2+1)/e' -i foo.txt
$ cat foo.txt
version=1.1.1.3
This is not the best way, but here's one fix.
Test case
I am assuming the input file has at least one line that is exactly version=1.1.1.0.
$ awk -F'["]' -v OFS='"' '/version=/{
> split($4,a,".");
> $4=a[1]"."a[2]"."a[3]"."a[4]+1
> }
> ;1' <<<'version=1.1.1.0'
Output:
version=1.1.1.0"""...1
The """ is because you are assigning to field 4 ($4). When you do that, awk adds field separators (OFS) between fields 1 and 2, 2 and 3, and 3 and 4. Three OFS => """, in your example.
Minimal change
$ awk -F'["]' -v OFS='"' '/version=/{
split($1,a,".");
$1=a[1]"."a[2]"."a[3]"."a[4]+1;
print
}
' <<<'version=1.1.1.0'
version=1.1.1.1
Two changes:
Change $4 to $1
Since the input field separator (-F) is ["], $4 is whatever would be after the third " (if there were any in the input). Therefore, split($4, ...) splits an empty field. The contents of the line, before the first " (if any), are in $1.
print at the end instead of ;1
The 1 after the closing curly brace is the next condition, and there is no action specified. The default action is to print the current line, as modified, so the 1 triggers printing. Instead, just print within your action when you are done processing. That way your action is self-contained. (Of course, if you needed to do other processing, you might want to print later, after that processing.)
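To see the default action in isolation (illustrative only):

$ awk '1' <<<$'a\nb'   # "1" is always true; with no action, awk prints each line
a
b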
You can use the = as the delimiter, like this:
awk -F= -v v=1.0.1 '$1=="version"{printf "version=\"%s\"\n", v}' file.properties

Bash - shorten script with function

In my script I need to get the highest number of a file two times, so i wanted to create a function. This is the command in the script:
First time:
highest=$( ls $path.bak.* | sort -t"." -k2 -n | tail -n1 | sed -r 's/.*\.(.*)/\1/')
Second time:
newhighest=$(ls $path.bak.* | sort -t"." -k2 -n | tail -n1 | sed -r 's/.*\.(.*)/\1/')
Now my question:
How can I shorten this with a function?
Here my Input-Files:
test.bak.1
test.bak.2
test.bak.3
test.bak.4
test.bak.5
test.bak.6
test.bak.7
test.bak.8
test.bak.9
test.bak.10
test.bak.11
Expected return: 11
Written out for readability:
#!/usr/bin/env bash
# ^^^^ - Ensure that this script is run with bash, not /bin/sh
# Enable "extended globs", so we can exclude names that don't end with digits
shopt -s extglob
# since your files are test.bak.*
path=test
get_highest() {
# set the function's argument list
set -- "$path".bak.+([[:digit:]])
# if we have just one valid filename, we know the glob expanded successfully.
# otherwise, no such files exist, so exit the function immediately
[[ -e $1 || -L $1 ]] || return 1
# stream our list of extensions into sort, and let awk find the highest number
printf '%s\n' "${###*.}" | awk '$0>last{last=$0}END{print last}'
}
highest=$(get_highest) || { echo "No backup files found" >&2; exit 1; }
new_highest=$(get_highest) || { echo "No backup files on 2nd pass" >&2; exit 1; }
Note:
Expansions need to be quoted; "$path"/*, not $path/*, or else path="Directory With Spaces/test" would look for files in Spaces/test, after emitting Directory and With as results.
ls should never be used programmatically.
extglob syntax allows regex-like capabilities for matching groups of files, letting us assert here that we only consider filenames that end in .bak. followed by a digit.
In general, you should write your scripts to be easy to read and understand as a higher priority than writing them to be short. Your future self (and others who need to maintain code in the future) will thank you.
Because filenames can contain newlines, newlines are unsafe to use to separate filenames in a stream; only the NUL character is safe for this use when names are not otherwise quoted or escaped. Thus, when emitting a stream of arbitrary names, they should be formatted with the string %s\0 and sorted with the -z argument. However, we're only printing the numeric extensions here, making newlines safe.
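For the general case, a sketch of that NUL-delimited pattern (assumes GNU sort; not needed for the answer above):

# Emit names NUL-delimited, sort them safely, and read them back one by one
while IFS= read -r -d '' name; do
    printf 'found: %q\n' "$name"
done < <(printf '%s\0' "$path".bak.* | sort -z -t. -k3,3 -n)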

Bash command to read a line based on the parameters I pass - perform column-based lookups

I have a file links.txt:
1 a.sh
3 b.sh
6 c.sh
4 d.sh
So, if I pass 1,4 as parameters to another file (master.sh), a.sh and d.sh should be stored in a variable.
sed '3!d' would print the 3rd line, but not the line that starts with 3. For that, you need sed '/^3 /!d'. The problem is you can't combine them for more lines, as this means "Delete everything that doesn't start with a 3", which means all other lines will be missed. So, use sed -n '/^3 /p' instead, i.e. don't print by default and tell sed what lines to print, not what lines to delete.
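For instance, with the links.txt above (my illustration):

$ sed '3!d' links.txt        # third line, not the line starting with 3
6 c.sh
$ sed -n '/^3 /p' links.txt  # the line whose first field is 3
3 b.sh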
You can loop over the argument and create a sed script from them that prints the lines, then run sed using this output:
#!/bin/bash
file=$1
shift
for id in "$#" ; do
echo "/^$id /p"
done | sed -nf- "$file"
Run as script.sh filename 3 4.
If you want to remove the id from the output, you can either use
cut -f2 -d' '
or you can modify the generated sed script to do the work
echo "/^$id /s/.* //p"
i.e. only print if the substitution was successful.
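For instance, ./script.sh links.txt 3 4 would generate this sed script (my illustration):

/^3 /s/.* //p
/^4 /s/.* //p

which, run against links.txt, prints b.sh and d.sh.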
This loops through each argument and greps for it in the links file. The result is piped into cut, where we specify the delimiter as a space with the -d flag and the field number as 2 with the -f flag. Finally, this is appended to the array called files.
links="links.txt"
files=()
for arg in "$@"; do
files=("${files[@]}" `grep "^$arg" "$links" | cut -d" " -f2`)
done;
echo ${files[@]}
Usage:
$ ./master.sh 1 4
a.sh d.sh
Edit:
As pointed out by mklement0, the solution above reads the file once per arg. The following first builds the pattern then reads the file just once.
links="links.txt"
pattern="^$1\s"
for arg in ${@:2}; do
pattern+="|^$arg\s"
done
files=$(grep -E "$pattern" "$links" | cut -d" " -f2)
echo ${files[@]}
Usage:
$ ./master.sh 1 4
a.sh d.sh
Here is another example with grep and cut:
#!/bin/bash
for line in $(grep "$1\|$2" links.txt|cut -d' ' -f2)
do
echo $line
done
Example of usage:
./master.sh 1 4
a.sh
d.sh
Why not just store the values and call them at will:
items=()
while read -r num file
do
items[num]="$file"
done<links.txt
for arg
do
echo "${items[arg]}"
done
Now you can use the items array any time you like :)
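With the sample links.txt from the question, a run would look like this (my example):

$ ./master.sh 1 4
a.sh
d.sh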
The following awk solution:
preserves the argument order; that is, the results reflect the order in which the lookup values were specified (as opposed to the order in which the lookup values happen to occur in the file).
If that is not important (i.e., if outputting the results in file order is acceptable), the readarray technique below can be combined with this one-liner, which is a generalized variant of Panta's answer:
grep -f <(printf "^%s\n" "$@") links.txt | cut -d' ' -f2-
performs well, because the input file is only read once; the only requirement is that all key-value pairs fit into memory as a whole (as a single associative Awk array (dictionary)).
works with any lookup values that don't have embedded whitespace.
Similarly, the assumption is that the output column values (containing values such as a.sh in the sample input) have no embedded whitespace. awk doesn't handle quoted fields well, so more work would be needed.
#!/bin/bash
readarray -t files < <(
awk -v idList="$*" '
BEGIN { count=split(idList, idArr); for (i in idArr) idDict[idArr[i]]++ }
$1 in idDict { idDict[$1] = $2 }
END { for (i=1; i<=count; ++i) print idDict[idArr[i]] }
' links.txt
)
# Print results.
printf '%s\n' "${files[#]}"
readarray -t files reads stdin input (<) line by line into array variable files.
Note: readarray requires Bash v4+; on Bash 3.x, such as on macOS, replace this part with
IFS=$'\n' read -d '' -ra files
<(...) is a Bash process substitution that, loosely speaking, presents the output from the enclosed command as if it were a (self-deleting) temporary file.
This technique allows readarray to run in the current shell (as opposed to a subshell if a pipeline had been used), which is necessary for the files variable to remain defined in the remainder of the script.
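A minimal illustration of the difference (my example, with default shell options):

printf 'x\n' | readarray -t files       # readarray runs in a subshell; files stays empty here
readarray -t files < <(printf 'x\n')    # runs in the current shell; files survives
echo "${#files[@]}"                     # prints 1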
The awk command breaks down as follows:
-v idList="$*" passes the space-separated list of all command-line arguments as a single string to Awk variable idList.
Note that this assumes that the arguments have no embedded spaces, which is indeed the case here and also generally the case with identifiers.
BEGIN { ... } is only executed once, before the individual lines are processed:
split(idList, idArr) splits the input ID list into an array by whitespace and stores the result in idArr.
for (i in idArr) idDict[idArr[i]]++ then converts the (conceptually regular) array into associative array idDict (dictionary), whose keys are the input IDs - this enables efficient lookup by ID later, and also allows storing the lookup result for each ID.
$1 in idDict { idDict[$1] = $2 } is processed for every input line:
Pattern $1 in idDict returns true if the line's first whitespace-separated field ($1) - e.g., 6 - is among the keys (in) of associative array idDict, and, if so, executes the associated action ({...}).
Action { idDict[$1] = $2 } then assigns the second field ($2) - e.g., c.sh - to the idDict entry for key $1.
END { ... } is executed once, after all input lines have been processed:
for (i=1; i<=count; ++i) print idDict[idArr[i]] loops over all input IDs in order and prints each ID's lookup result, which is the value of the dictionary entry with that ID.

Looking for a regex pattern, passing that pattern to a script, and replacing the pattern with the output of the script

For every time the pattern shows up (In this example the case of a 2 digit number) I want to pass that pattern to a script and replace that pattern with the output of a script.
I'm using sed an example of what it should look like would be
echo 'siedi87sik65owk55dkd' | sed 's/[0-9][0-9]/.\/script.sh/g'
Right now this returns
siedi./script.shsik./script.showk./script.shdkd
But I would like it to return
siedi!!!87!!!sik!!!65!!!owk!!!55!!!dkd
This is what is in ./script.sh
#!/bin/bash
echo "!!!$1!!!"
It has to be replaced with the output. In this example I know I could just use a normal sed substitution but I don't want that as an answer.
sed is for simple substitutions on individual lines, that is all. Anything else, even if it can be done, requires arcane language constructs that became obsolete in the mid-1970s when awk was invented and are used today purely for the mental exercise. Your problem is not a simple substitution so you shouldn't try to use sed to solve it.
You're going to want something like:
awk '{
head = ""
tail = $0
while ( match(tail,/[0-9]{2}/) ) {
tgt = substr(tail,RSTART,RLENGTH)
cmd = "./script.sh " tgt
if ( (cmd | getline line) > 0) {
tgt = line
}
close(cmd)
head = head substr(tail,1,RSTART-1) tgt
tail = substr(tail,RSTART+RLENGTH)
}
print head tail
}'
e.g. using an echo in place of your script.sh command:
$ echo 'siedi87sik65owk55dkd' |
awk '{
head = ""
tail = $0
while ( match(tail,/[0-9]{2}/) ) {
tgt = substr(tail,RSTART,RLENGTH)
cmd = "echo !!!" tgt "!!!"
if ( (cmd | getline line) > 0) {
tgt = line
}
close(cmd)
head = head substr(tail,1,RSTART-1) tgt
tail = substr(tail,RSTART+RLENGTH)
}
print head tail
}'
siedi!!!87!!!sik!!!65!!!owk!!!55!!!dkd
Ed's awk solution is obviously the way to go here.
For fun, I tried to come up with a sed solution, and here is (a convoluted GNU sed) one that takes the pattern and the script to be run as parameters; the input is either read from standard input (i.e., you can pipe to it) or from a file supplied as the third argument.
For your example, we'd have infile with contents
siedi87sik65owk55dkd
siedi11sik22owk33dkd
(two lines to demonstrate how this works for multiple lines), then script with contents
#!/bin/bash
echo "!!!${1}!!!"
and finally the solution script itself, named so. Usage is
./so pattern script [input]
where pattern is an extended regular expression as understood by GNU sed (with the -r option), script is the name of the command you want to run for each match, and the optional input is the name of the input file if input is not standard input.
For your example, this would be
./so '[[:digit:]]{2}' script infile
or, as a filter,
cat infile | ./so '[[:digit:]]{2}' script
with output
siedi!!!87!!!sik!!!65!!!owk!!!55!!!dkd
siedi!!!11!!!sik!!!22!!!owk!!!33!!!dkd
This is what so looks like:
#!/bin/bash
pat=$1 # The pattern to match
script=$2 # The command to run for each pattern
infile=${3:-/dev/stdin} # Read from standard input if not supplied
# Use sed and have $pattern and $script expand to the supplied parameters
sed -r "
:build_loop # Label to loop back to
h # Copy pattern space to hold space
s/.*($pat).*/.\/\"$script\" \1/ # (1) Extract last match and prepare command
# Replace pattern space with output of command
e
G # (2) Append hold space to pattern space
s/(.*)$pat(.*)/\1~~~\2/ # (3) Replace last match of pattern with ~~~
/\n[^\n]*$pat[^\n]*$/b build_loop # Loop if string contains match
:fill_loop # Label for second loop
s/(.*\n)(.*)\n([^\n]*)~~~([^\n]*)$/\1\3\2\4/ # (4) Replace last ~~~
t fill_loop # Loop if there was a replacement
s/(.*)\n(.*)~~~(.*)$/\2\1\3/ # (5) Final ~~~ replacement
" < "$infile"
The sed command works with two loops. The first one copies the pattern space to the hold space, then removes everything but the last match from the pattern space and prepares the command to be run. After the substitution with (1) in its comment, the pattern space looks like this:
./script 55
The e command (a GNU extension) then replaces the pattern space with the output of this command. After this, G appends the hold space to the pattern space (2). The pattern space now looks like this:
!!!55!!!
siedi87sik65owk55dkd
The substitution at (3) replaces the last match with a string hopefully not equal to the pattern and we get
!!!55!!!
siedi87sik65owk~~~dkd
The loop repeats if the last line of the pattern space still has a match for the pattern. After three loops, the pattern space looks like this:
!!!87!!!
!!!65!!!
!!!55!!!
siedi~~~sik~~~owk~~~dkd
The second loop now replaces the last ~~~ with the second to last line of the pattern space with substitution (4). The command uses lots of "not a newline" ([^\n]) to make sure we're not pulling the wrong replacement for ~~~.
Because of the way command (4) is written, the loop ends with one last substitution to go, so before command (5), we have this pattern space:
!!!87!!!
siedi~~~sik!!!65!!!owk!!!55!!!dkd
Command (5) is a simpler version of command (4), and after it, the output is as desired.
This seems to be fairly robust and can deal with spaces in the name of the script to be run as long as it's properly quoted when calling:
./so '[[:digit:]]{2}' 'my script' infile
This would fail if
The input file contains ~~~ (solvable by replacing all occurrences at the start, putting them back at the end)
The output of script contains ~~~
The pattern contains ~~~
i.e., the solution very much depends on ~~~ being unique.
Because nobody asked: so as a one-liner.
#!/bin/bash
sed -re ":b;h;s/.*($1).*/.\/\"$2\" \1/;e" -e "G;s/(.*)$1(.*)/\1~~~\2/;/\n[^\n]*$1[^\n]*$/bb;:f;s/(.*\n)(.*)\n([^\n]*)~~~([^\n]*)$/\1\3\2\4/;tf;s/(.*)\n(.*)~~~(.*)$/\2\1\3/" < "${3:-/dev/stdin}"
Still works!
A conceptually simpler multi-utility solution:
Using GNU utilities:
echo 'siedi87sik65owk55dkd' |
sed 's|[0-9]\{2\}|$(./script.sh &)|g' |
xargs -d'\n' -I% sh -c 'echo '\"%\"
Using BSD utilities (also works with GNU utilities):
echo 'siedi87sik65owk55dkd' |
sed 's|[0-9]\{2\}|$(./script.sh &)|g' | tr '\n' '\0' |
xargs -0 -I% sh -c 'echo '\"%\"
The idea is to use sed to translate the tokens of interest lexically into a string containing shell command substitutions that invoke the target script with the token, and then pass the result to the shell for evaluation.
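For instance, the output of the sed stage alone would be (illustrative):

$ echo 'siedi87sik65owk55dkd' | sed 's|[0-9]\{2\}|$(./script.sh &)|g'
siedi$(./script.sh 87)sik$(./script.sh 65)owk$(./script.sh 55)dkd

which the final sh -c invocation then evaluates.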
Note:
Any embedded " and $ characters in the input must be \-escaped.
xargs -d'\n' (GNU) and tr '\n' '\0' / xargs -0 (BSD) are only needed to correctly preserve whitespace in the input - if that is not needed, the following POSIX-compliant solution will do:
echo 'siedi87sik65owk55dkd' |
sed 's|[0-9]\{2\}|$(./script.sh &)|g' | tr '\n' '\0' |
xargs -I% sh -c 'printf "%s\n" '\"%\"

Modify a shell variable inside awk block of code

Is there any way to modify a shell variable inside awk block of code?
--------- [shell_awk.sh]---------------------
#!/bin/bash
shell_variable_1=<value A>
shell_variable_2=<value B>
shell_variable_3=<value C>
awk 'function A(X)
{ return X+1 }
{ a=A('$shell_variable_1')
b=A('$shell_variable_2')
c=A('$shell_variable_3')
shell_variable_1=a
shell_variable_2=b
shell_variable_3=c
}' FILE.TXT
--------- [shell_awk.sh]---------------------
This is a very simple example; the real script loads a file and makes some changes using functions. I need to keep each value from before the change in a specific variable, so that I can then record the before and after values in MySQL.
The after value is received from parameters ($1, $2 and so on).
The value before I already know how to get it from the file.
Everything works fine, except for setting the shell variable from an awk variable. Outside the awk block it is easy to set, but is it possible from inside?
No program -- in awk, shell, or any other language -- can directly modify a parent process's memory. That includes variables. However, of course, your awk can write contents to stdout, and the parent shell can read that content and modify its own variables itself.
Here's an example of awk that writes key/value pairs out to be read by bash. It's not perfect -- read the caveats below.
#!/bin/bash
shell_variable_1=10
shell_variable_2=20
shell_variable_3=30
run_awk() {
awk -v shell_variable_1="$shell_variable_1" \
-v shell_variable_2="$shell_variable_2" \
-v shell_variable_3="$shell_variable_3" '
function A(X) { return X+1 }
{ a=A(shell_variable_1)
b=A(shell_variable_2)
c=A(shell_variable_3) }
END {
print "shell_variable_1=" a
print "shell_variable_2=" b
print "shell_variable_3=" c
}' <<<""
}
while IFS="=" read -r key value; do
printf -v "$key" '%s' "$value"
done < <(run_awk)
for var in shell_variable_{1,2,3}; do
printf 'New value for %s is %s\n' "$var" "${!var}"
done
Advantages
Doesn't use eval. Content such as $(rm -rf ~) in the output from awk won't be executed by your shell.
Disadvantages
Can't handle variable contents with newlines. (You could fix this by NUL-delimiting output from your awk script, and adding -d '' to the read command; see the sketch after this list.)
A hostile awk script could modify PATH, LD_LIBRARY_PATH, or other security-sensitive variables. (You could fix this by reading variables into an associative array, rather than the global namespace, or by enforcing a prefix on their names).
The code above uses several ksh extensions also available in bash; however, it will not run with POSIX sh. Thus, be sure not to run this via sh scriptname (which only guarantees POSIX functionality).
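Regarding the newline caveat, a hedged sketch of the NUL-delimited variant (the printf merely stands in for NUL-terminated awk output; not part of the original answer):

while IFS="=" read -r -d '' key value; do
    printf -v "$key" '%s' "$value"   # value may now safely contain newlines
done < <(printf '%s=%s\0' shell_variable_1 $'multi\nline')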
