Using sed to search large number of files for specific string and replace it - bash

What I am trying to do is search a large number of source files for a particular pattern and put another expression in front of this pattern. The files I am looking in all have the same extension, *.F90.
My first step is to use grep and find all lines of those files containing allocate but not allocated, so I have:
grep -I " allocate *(" *.F90 | grep -v allocated
The first problem that I have is that the bracket might be preceded by one or more spaces. I can have
allocate(
or allocate (
or allocate   (
This is why I need the “*” in the grep command.
The general rule however (besides the spaces) says that the allocate is followed by “(” and then comes the thing that is being allocated. So I have:
allocate ( array_name ( ....
again the spaces are optional
So what I would like to do is find this string, and put in front of it the following:
If( allocated(array_name) ) deallocate(array_name)
and immediately after this, on the next line, I would like to have the original string allocate(array( … .
Please note that array_name is an alphanumeric string which, after the substitution, appears in more than one place. It is the name of the array being allocated.
I would be very grateful if someone can give me a hint how to do this. I am stuck and have no idea how to do it.

I assume you mean you want to replace allocate ( array_name ) with If( allocated(array_name) ) deallocate(array_name) allocate ( array_name ).
In GNU or BSD sed you can do the following:
sed -i.bk -e '/allocated/b' \
-e 's/allocate *( *\([A-Za-z0-9_]*\) *)/If( allocated(\1) ) deallocate(\1) &/' \
*.F90
This will search and replace matching lines in *.F90, skipping lines that contain allocated. The original of each file is kept as *.F90.bk.
As @Anders Johansson mentioned, there can be other cases where the argument to allocate is not purely alphanumeric-plus-underscore. You can search for these before you run the search and replace:
for i in *.F90; do
echo "$i"
sed -n '/.*allocate *( *\([^ )]*\) *).*/{h; s//\1/; /^[A-Za-z0-9_]*$/t
x; p;}' "$i"
done
(note the newline after t; BSD sed interprets everything after t as a label). Press ctrl+v ctrl+j in bash to input a newline on the command line.
/a\(b\)c/ find line with matching string
h *h*old the match abc into hold space
s//\1/ *s*ubstitute last match abc with first group b
/^[a-z]*$/t if b matches ^[a-z]*$, then branch to end of script
x e*x*change hold space abc and pattern space b
p *p*rint pattern space b

cat old_file.txt | sed 's/allocate *( *\([a-zA-Z0-9_]*\)/If( allocated(\1) ) deallocate(\1)\
allocate(\1/' > new_file.txt
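The two ideas above (leaving lines that mention allocated alone, and emitting the guard on its own line) can be combined. The following is only a sketch: guard_allocate is a made-up helper name, the sample lines are invented, and it assumes lowercase keywords and a plain alphanumeric array name. It reads from standard input so that nothing is modified in place while experimenting.

```shell
#!/usr/bin/env bash
# guard_allocate: hypothetical filter (not from the answers above).
# Reads Fortran source on stdin and writes the guarded source to stdout.
guard_allocate() {
  # 1st expression: lines containing "allocated" are passed through untouched.
  # 2nd expression: \1 is the indentation, \2 the array name; a backslash
  # followed by a real newline in the replacement emits a line break.
  sed -e '/allocated/b' \
      -e 's/^\( *\)allocate *( *\([A-Za-z0-9_]*\)/\1if (allocated(\2)) deallocate(\2)\
\1allocate(\2/'
}

# Invented sample input: one allocate statement, one line to be left alone.
printf '%s\n' '  allocate ( work(n, m) )' \
              '  if (allocated(tmp)) print *, 1' | guard_allocate
```

Once the output looks right on a few files, the same two expressions could be used with -i.bk as in the first answer.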

Related

Build a variable made with 2 substrings of another variable in bash

Here is a script I use:
for dir in $(find . -type d -name "single_copy_busco_sequences"); do
sppname=$(dirname $(dirname $(dirname $dir))| sed 's#./##g');
for file in ${dir}/*.faa; do name=$(basename $file); cp $file /Users/admin/Documents/busco_aa/${sppname}_${name}; sed -i '' 's#>#>'${sppname}'|#g' /Users/admin/Documents/busco_aa/${sppname}_${name}; cut -f 1 -d ":" /Users/admin/Documents/busco_aa/${sppname}_${name} > /Users/admin/Documents/busco_aa/${sppname}_${name}.1;
done;
done
The sppname variable is something like Gender_species
Do you know how I could add a line to my script to create a new variable called abbrev which transforms Gender_species into Genspe, i.e. the first 3 letters concatenated with the first 3 letters after _?
Examples:
Homo_sapiens gives Homsap
Canis_lupus gives Canlup
etc
Thanks for your help :)
You can achieve this using a regular expression with sed:
echo "Homo_sapiens" | sed -e s'/^\(...\).*_\(...\).*/\1\2/'
Homsap
start, get 3 chars (kept in \1), anything up to the last _, get 3 chars (kept in \2), anything
Replace echo "Homo_sapiens" by your $dir thing
PS: will fail if you have less than 3 chars in one word
You can do it all with bash built-in parameter expansions. Specifically, string indexes and substring removal.
$ a=Homo_sapiens; prefix=${a:0:3}; a=${a#*_}; postfix=${a:0:3}; echo $prefix$postfix
Homsap
$ a=Canis_lupus; prefix=${a:0:3}; a=${a#*_}; postfix=${a:0:3}; echo $prefix$postfix
Canlup
Using bash built-ins is generally more efficient than spawning separate subshell(s) to invoke utilities to accomplish the same thing.
Explanation
Your string index form (bash only) allows you to index characters from within a string, e.g.
* ${parameter:offset:length} ## indexes are zero based, ${a:0:2} is 1st 2 chars
Where parameter is simply the variable name holding the string.
(you can index from the end of a string by using a negative offset preceded by a space or enclosed in parentheses, e.g. a=12345; echo ${a: -3:2} outputs "34")
prefix=${a:0:3} ## save the first 3 characters in prefix
a=${a#*_} ## remove the front of the string through '_' (see below)
postfix=${a:0:3} ## save the first 3 characters after '_'
Your substring removal forms (POSIX) are:
${parameter#word} trim to 1st occurrence of word from parameter from left
${parameter##word} trim to last occurrence of word from parameter from left
and
${parameter%word} trim to 1st occurrence of word from parameter from right
${parameter%%word} trim to last occurrence of word from parameter from right
(word can contain globbing to expand to a pattern as well)
a=${a#*_} ## trim from left up to (and including) the first '_'
See bash(1) - Linux manual page for full details.
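For use inside the loop from the question, the expansions above could be wrapped in a small function. This is a sketch only: the function name abbrev_of is made up for illustration, and it assumes the input always contains an underscore.

```shell
#!/usr/bin/env bash
# abbrev_of: hypothetical helper built from the expansions explained above.
abbrev_of() {
  local name=$1
  local genus=${name%%_*}     # everything before the first '_'
  local species=${name#*_}    # everything after the first '_'
  # ${var:0:3} keeps at most the first three characters (bash only)
  printf '%s%s\n' "${genus:0:3}" "${species:0:3}"
}

abbrev_of Homo_sapiens   # prints Homsap
abbrev_of Canis_lupus    # prints Canlup
```

In the script this would be called as abbrev=$(abbrev_of "$sppname"). If the name has no underscore, both halves come from the same word, so a check for that case may be worth adding.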

What ##*/ does in bash? [duplicate]

I have a string like this:
/var/cpanel/users/joebloggs:DNS9=domain.example
I need to extract the username (joebloggs) from this string and store it in a variable.
The format of the string will always be the same with exception of joebloggs and domain.example so I am thinking the string can be split twice using cut?
The first split would split by : and we would store the first part in a variable to pass to the second split function.
The second split would split by / and store the last word (joebloggs) into a variable
I know how to do this in PHP using arrays and splits but I am a bit lost in bash.
To extract joebloggs from this string in bash using parameter expansion without any extra processes...
MYVAR="/var/cpanel/users/joebloggs:DNS9=domain.example"
NAME=${MYVAR%:*} # retain the part before the colon
NAME=${NAME##*/} # retain the part after the last slash
echo $NAME
Doesn't depend on joebloggs being at a particular depth in the path.
Summary
An overview of a few parameter expansion modes, for reference...
${MYVAR#pattern} # delete shortest match of pattern from the beginning
${MYVAR##pattern} # delete longest match of pattern from the beginning
${MYVAR%pattern} # delete shortest match of pattern from the end
${MYVAR%%pattern} # delete longest match of pattern from the end
So # means match from the beginning (think of a comment line) and % means from the end. One instance means shortest and two instances means longest.
You can get substrings based on position using numbers:
${MYVAR:3} # Remove the first three chars (leaving 4..end)
${MYVAR::3} # Return the first three characters
${MYVAR:3:5} # The next five characters after removing the first 3 (chars 4-8)
You can also replace particular strings or patterns using:
${MYVAR/search/replace}
The pattern is in the same format as file-name matching, so * (any characters) is common, often followed by a particular symbol like / or .
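A quick sketch of the replace form, assuming bash. The variable names first_only and every_one are invented for the example; doubling the first slash (${var//search/replace}) replaces every match instead of only the first one.

```shell
#!/usr/bin/env bash
# Sample value; the slash in the search pattern is escaped as \/
entry="users/joebloggs/domain.example"

first_only=${entry/\//:}    # replace only the first "/" with ":"
every_one=${entry//\//:}    # replace every "/" with ":"

printf '%s\n%s\n' "$first_only" "$every_one"
# users:joebloggs/domain.example
# users:joebloggs:domain.example
```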
Examples:
Given a variable like
MYVAR="users/joebloggs/domain.example"
Remove the path leaving file name (all characters up to a slash):
echo ${MYVAR##*/}
domain.example
Remove the file name, leaving the path (delete shortest match after last /):
echo ${MYVAR%/*}
users/joebloggs
Get just the file extension (remove all before last period):
echo ${MYVAR##*.}
example
NOTE: To do two operations, you can't combine them, but have to assign to an intermediate variable. So to get the file name without path or extension:
NAME=${MYVAR##*/} # remove part before last slash
echo ${NAME%.*} # from the new var remove the part after the last period
domain
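Putting the four removal forms together, here is a sketch of a helper that splits a path in one go. The name describe_path and the output layout are made up for this example; it relies on an intermediate variable for the two-step case, as noted above.

```shell
#!/usr/bin/env bash
# describe_path: hypothetical helper combining the expansions from the summary.
describe_path() {
  local path=$1
  local dir=${path%/*}      # drop the shortest "/..." suffix  -> directory part
  local file=${path##*/}    # drop the longest "...​/" prefix    -> file name
  file=${file}              # intermediate variable reused for the next two steps
  local stem=${file%.*}     # drop the shortest ".ext" suffix  -> name without extension
  local ext=${file##*.}     # drop everything up to the last "." -> extension
  printf 'dir=%s file=%s stem=%s ext=%s\n' "$dir" "$file" "$stem" "$ext"
}

describe_path "users/joebloggs/domain.example"
# dir=users/joebloggs file=domain.example stem=domain ext=example
```

If the argument contains no slash or no period, the corresponding pattern does not match and the value is returned unchanged, so dir or ext may end up equal to the whole name.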
Define a function like this:
getUserName() {
echo $1 | cut -d : -f 1 | xargs basename
}
And pass the string as a parameter:
userName=$(getUserName "/var/cpanel/users/joebloggs:DNS9=domain.example")
echo $userName
What about sed? That will work in a single command:
sed 's#.*/\([^:]*\).*#\1#' <<<$string
The # are being used for regex dividers instead of / since the string has / in it.
.*/ grabs the string up to the last slash.
\( .. \) marks a capture group. This is \([^:]*\).
The [^:] says any character except a colon, and the * means zero or more.
.* means the rest of the line.
\1 means substitute what was found in the first (and only) capture group. This is the name.
Here's the breakdown matching the string with the regular expression:
/var/cpanel/users/ joebloggs :DNS9=domain.example joebloggs
sed 's#.*/ \([^:]*\) .* #\1 #'
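One thing to watch with this expression: because .*/ is greedy, a slash after the colon would move the match. The sketch below wraps the original command in a function and adds an anchored variant that only looks at the part before the first colon. Both function names and the second test string are invented for illustration.

```shell
#!/usr/bin/env bash
# user_from_entry: the sed command from above, wrapped in a hypothetical function.
user_from_entry() {
  printf '%s\n' "$1" | sed 's#.*/\([^:]*\).*#\1#'
}

# user_from_entry_strict: hypothetical variant. ^[^:]*/ can only consume
# characters before the first colon, so slashes in the value are ignored.
user_from_entry_strict() {
  printf '%s\n' "$1" | sed 's#^[^:]*/\([^:]*\):.*#\1#'
}

user_from_entry        "/var/cpanel/users/joebloggs:DNS9=domain.example"      # joebloggs
user_from_entry_strict "/var/cpanel/users/joebloggs:DNS9=domain.example"      # joebloggs
user_from_entry_strict "/var/cpanel/users/joebloggs:DNS9=domain.example/sub"  # joebloggs
```

With the fixed input format described in the question the two behave the same; the stricter form only matters if the part after the colon can contain a slash.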
Using a single Awk:
... | awk -F '[/:]' '{print $5}'
That is, using as field separator either / or :, the username is always in field 5.
To store it in a variable:
username=$(... | awk -F '[/:]' '{print $5}')
A more flexible implementation with sed that doesn't require username to be field 5:
... | sed -e 's/:.*//' -e 's?.*/??'
That is, delete everything from : onward, and then delete everything up to the last /. The expressions are quoted so the shell cannot treat * and ? as glob characters. sed is probably faster than awk too, so this alternative may well be the better choice.
Using a single sed
echo "/var/cpanel/users/joebloggs:DNS9=domain.example" | sed 's/.*\/\(.*\):.*/\1/'
I like to chain together awk using different delimiters set with the -F argument. First, split the string on /users/ and then on :
txt="/var/cpanel/users/joebloggs:DNS9=domain.com"
echo $txt | awk -F"/users/" '{print$2}' | awk -F: '{print $1}'
$2 gives the text after the delim, $1 the text before it.
I know I'm a little late to the party and there's already good answers, but here's my method of doing something like this.
DIR="/var/cpanel/users/joebloggs:DNS9=domain.example"
echo ${DIR} | rev | cut -d'/' -f 1 | rev | cut -d':' -f1

Replace and increment letters and numbers with awk or sed

I have a string that contains
fastcgi_cache_path /var/run/nginx-cache15 levels=1:2 keys_zone=MYSITEP:100m inactive=60m;
One of the goals of this script is to increment the two-digit number after nginx-cache, based on the value found in the previous file. To do that I used this code:
# Replace cache_path
PREV=$(ls -t /etc/nginx/sites-available | head -n1) #find the previous cache_path number
CACHE=$(grep fastcgi_cache_path $PREV | awk '{print $2}' |cut -d/ -f4) #take the string to change
SUB=$(echo $CACHE |sed "s/nginx-cache[0-9]*[0-9]/&#/g;:a {s/0#/1/g;s/1#/2/g;s/2#/3/g;s/3#/4/g;s/4#/5/g;s/5#/6/g;s/6#/7/g;s/7#/8/g;s/8#/9/g;s/9#/#0/g;t a};s/#/1/g") #increment number
sed -i "s/nginx-cache[0-9]*/$SUB/g" $SITENAME #replace number
Maybe not so elegant, but it works.
The other goal is to increment the last letter of all occurrences of MYSITEx: MYSITEP, in this case, should become MYSITEQ, and so on. Once MYSITEZ is reached, another letter should be added, giving MYSITEAA, MYSITEAB, etc.
I thought something like:
sed -i "s/MYSITEP[A-Z]*/MYSITEGG/g" $SITENAME
but it can't work, because MYSITEGG is a static value.
How can I calculate the last letter, increment it to the next one and, once the letter Z is reached, add another letter?
Thank you!
Perl's autoincrement will work on letters as well as digits, in exactly the manner you describe
We may as well tidy your nginx-cache increment as well while we're at it
I assume SITENAME holds the name of the file to be modified?
It would look like this. I have to assign the capture $1 to an ordinary variable $n to increment it, as $1 is read-only
perl -i -pe 's/nginx-cache\K(\d+)/ ++($n = $1) /e; s/MYSITE\K(\w+)/ ++($n = $1) /e;' $SITENAME
If you wish, this can be done in a single substitution, like this
perl -i -pe 's/(?:nginx-cache|MYSITE)\K(\w+)/ ++($n = $1) /ge' $SITENAME
Note: The solution below is needlessly complicated, because as Borodin's helpful answer demonstrates (and @stevesliva's comment on the question hinted at), Perl directly supports incrementing letters alphabetically in the manner described in the question, by applying the ++ operator to a variable containing a letter (sequence); e.g.:
$ perl -E '$letters = "ZZ"; say ++$letters'
AAA
The solution below may still be of interest as an annotated showcase of how Perl's power can be harnessed from the shell, showing techniques such as:
use of s///e to determine the replacement string with an expression.
splitting a string into a character array (split //, "....")
use of the ord and chr functions to get the codepoint of a char., and convert a(n incremented) codepoint back to a char.
string replication (x operator)
array indexing and slices:
getting an array's last element ($chars[-1])
getting all but the last element of an array (@chars[0..$#chars-1])
A perl solution (in effect a re-implementation of what ++ can do directly):
perl -pe 's/\bMYSITE\K([A-Z]+)/
@chars = split qr(), $1; $chars[-1] eq "Z" ?
"A" x (1 + scalar @chars)
:
join "", @chars[0..$#chars-1], chr (1 + ord $chars[-1])
/e' <<'EOF'
...=MYSITEP:...
...=MYSITEZP:...
...=MYSITEZZ:...
EOF
yields:
...=MYSITEQ:... # P -> Q
...=MYSITEZQ:... # ZP -> ZQ
...=MYSITEAAA:... # ZZ -> AAA
You can use perl's -i option to replace the input file with the result
(perl -i -pe '...' "$SITENAME").
As Borodin's answer demonstrates, it's not hard to solve all tasks in the question using perl alone.
The s function's /e option allows use of a Perl expression for determining the replacement string, which enables sophisticated replacements:
$1 references the current MYSITE suffix in the expression.
@chars = split qr(), $1 splits the suffix into a character array.
$chars[-1] eq "Z" tests if the last suffix char. is Z
If so: The suffix is replaced with all As, with an additional A appended
("A" x (1 + scalar @chars)).
Otherwise: The last suffix char. is replaced with the following letter in the alphabet
(join "", @chars[0..$#chars-1], chr (1 + ord $chars[-1]))

Looking for a regex pattern, passing that pattern to a script, and replacing the pattern with the output of the script

For every time the pattern shows up (In this example the case of a 2 digit number) I want to pass that pattern to a script and replace that pattern with the output of a script.
I'm using sed an example of what it should look like would be
echo 'siedi87sik65owk55dkd' | sed 's/[0-9][0-9]/.\/script.sh/g'
Right now this returns
siedi./script.shsik./script.showk./script.shdkd
But I would like it to return
siedi!!!87!!!sik!!!65!!!owk!!!55!!!dkd
This is what is in ./script.sh
#!/bin/bash
echo "!!!$1!!!"
It has to be replaced with the output. In this example I know I could just use a normal sed substitution but I don't want that as an answer.
sed is for simple substitutions on individual lines, that is all. Anything else, even if it can be done, requires arcane language constructs that became obsolete in the mid-1970s when awk was invented and are used today purely for the mental exercise. Your problem is not a simple substitution so you shouldn't try to use sed to solve it.
You're going to want something like:
awk '{
head = ""
tail = $0
while ( match(tail,/[0-9]{2}/) ) {
tgt = substr(tail,RSTART,RLENGTH)
cmd = "./script.sh " tgt
if ( (cmd | getline line) > 0) {
tgt = line
}
close(cmd)
head = head substr(tail,1,RSTART-1) tgt
tail = substr(tail,RSTART+RLENGTH)
}
print head tail
}'
e.g. using an echo in place of your script.sh command:
$ echo 'siedi87sik65owk55dkd' |
awk '{
head = ""
tail = $0
while ( match(tail,/[0-9]{2}/) ) {
tgt = substr(tail,RSTART,RLENGTH)
cmd = "echo !!!" tgt "!!!"
if ( (cmd | getline line) > 0) {
tgt = line
}
close(cmd)
head = head substr(tail,1,RSTART-1) tgt
tail = substr(tail,RSTART+RLENGTH)
}
print head tail
}'
siedi!!!87!!!sik!!!65!!!owk!!!55!!!dkd
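If awk is not an option, the same head/tail loop can be written in bash itself with [[ =~ ]] and BASH_REMATCH. This is a different technique from the awk script above, shown only as a sketch: replace_with_command and wrap are invented names, wrap stands in for ./script.sh, and the pattern is assumed to be an unanchored extended regex.

```shell
#!/usr/bin/env bash
# wrap: hypothetical stand-in for ./script.sh from the question.
wrap() { echo "!!!$1!!!"; }

# replace_with_command REGEX CMD: hypothetical helper. For every line on stdin,
# each match of REGEX is replaced by the output of CMD called with the match.
replace_with_command() {
  local regex=$1 cmd=$2 line rest out match
  while IFS= read -r line || [ -n "$line" ]; do
    rest=$line
    out=
    while [[ $rest =~ $regex ]]; do
      match=${BASH_REMATCH[0]}
      [ -n "$match" ] || break               # guard against empty matches looping forever
      out+=${rest%%"$match"*}                # text before the first match
      out+=$("$cmd" "$match")                # output of the command for this match
      rest=${rest#*"$match"}                 # continue after the match
    done
    printf '%s\n' "$out$rest"
  done
}

echo 'siedi87sik65owk55dkd' | replace_with_command '[0-9][0-9]' wrap
# siedi!!!87!!!sik!!!65!!!owk!!!55!!!dkd
```

Like the awk version, this starts one process per match when the command is an external script, so it is not fast on large inputs.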
Ed's awk solution is obviously the way to go here.
For fun, I tried to come up with a sed solution, and here is (a convoluted GNU sed) one that takes the pattern and the script to be run as parameters; the input is either read from standard input (i.e., you can pipe to it) or from a file supplied as the third argument.
For your example, we'd have infile with contents
siedi87sik65owk55dkd
siedi11sik22owk33dkd
(two lines to demonstrate how this works for multiple lines), then script with contents
#!/bin/bash
echo "!!!${1}!!!"
and finally the solution script itself, so. Usage is
./so pattern script [input]
where pattern is an extended regular expression as understood by GNU sed (with the -r option), script is the name of the command you want to run for each match, and the optional input is the name of the input file if input is not standard input.
For your example, this would be
./so '[[:digit:]]{2}' script infile
or, as a filter,
cat infile | ./so '[[:digit:]]{2}' script
with output
siedi!!!87!!!sik!!!65!!!owk!!!55!!!dkd
siedi!!!11!!!sik!!!22!!!owk!!!33!!!dkd
This is what so looks like:
#!/bin/bash
pat=$1 # The pattern to match
script=$2 # The command to run for each pattern
infile=${3:-/dev/stdin} # Read from standard input if not supplied
# Use sed and have $pat and $script expand to the supplied parameters
sed -r "
:build_loop # Label to loop back to
h # Copy pattern space to hold space
s/.*($pat).*/.\/\"$script\" \1/ # (1) Extract last match and prepare command
# Replace pattern space with output of command
e
G # (2) Append hold space to pattern space
s/(.*)$pat(.*)/\1~~~\2/ # (3) Replace last match of pattern with ~~~
/\n[^\n]*$pat[^\n]*$/b build_loop # Loop if string contains match
:fill_loop # Label for second loop
s/(.*\n)(.*)\n([^\n]*)~~~([^\n]*)$/\1\3\2\4/ # (4) Replace last ~~~
t fill_loop # Loop if there was a replacement
s/(.*)\n(.*)~~~(.*)$/\2\1\3/ # (5) Final ~~~ replacement
" < "$infile"
The sed command works with two loops. The first one copies the pattern space to the hold space, then removes everything but the last match from the pattern space and prepares the command to be run. After the substitution with (1) in its comment, the pattern space looks like this:
./script 55
The e command (a GNU extension) then replaces the pattern space with the output of this command. After this, G appends the hold space to the pattern space (2). The pattern space now looks like this:
!!!55!!!
siedi87sik65owk55dkd
The substitution at (3) replaces the last match with a string hopefully not equal to the pattern and we get
!!!55!!!
siedi87sik65owk~~~dkd
The loop repeats if the last line of the pattern space still has a match for the pattern. After three loops, the pattern space looks like this:
!!!87!!!
!!!65!!!
!!!55!!!
siedi~~~sik~~~owk~~~dkd
The second loop now replaces the last ~~~ with the second to last line of the pattern space with substitution (4). The command uses lots of "not a newline" ([^\n]) to make sure we're not pulling the wrong replacement for ~~~.
Because of the way command (4) is written, the loop ends with one last substitution to go, so before command (5), we have this pattern space:
!!!87!!!
siedi~~~sik!!!65!!!owk!!!55!!!dkd
Command (5) is a simpler version of command (4), and after it, the output is as desired.
This seems to be fairly robust and can deal with spaces in the name of the script to be run as long as it's properly quoted when calling:
./so '[[:digit:]]{2}' 'my script' infile
This would fail if
The input file contains ~~~ (solvable by replacing all occurrences at the start, putting them back at the end)
The output of script contains ~~~
The pattern contains ~~~
i.e., the solution very much depends on ~~~ being unique.
Because nobody asked: so as a one-liner.
#!/bin/bash
sed -re ":b;h;s/.*($1).*/.\/\"$2\" \1/;e" -e "G;s/(.*)$1(.*)/\1~~~\2/;/\n[^\n]*$1[^\n]*$/bb;:f;s/(.*\n)(.*)\n([^\n]*)~~~([^\n]*)$/\1\3\2\4/;tf;s/(.*)\n(.*)~~~(.*)$/\2\1\3/" < "${3:-/dev/stdin}"
Still works!
A conceptually simpler multi-utility solution:
Using GNU utilities:
echo 'siedi87sik65owk55dkd' |
sed 's|[0-9]\{2\}|$(./script.sh &)|g' |
xargs -d'\n' -I% sh -c 'echo '\"%\"
Using BSD utilities (also works with GNU utilities):
echo 'siedi87sik65owk55dkd' |
sed 's|[0-9]\{2\}|$(./script.sh &)|g' | tr '\n' '\0' |
xargs -0 -I% sh -c 'echo '\"%\"
The idea is to use sed to translate the tokens of interest lexically into a string containing shell command substitutions that invoke the target script with the token, and then pass the result to the shell for evaluation.
Note:
Any embedded " and $ characters in the input must be \-escaped.
xargs -d'\n' (GNU) and tr '\n' '\0' / xargs -0 (BSD) are only needed to correctly preserve whitespace in the input - if that is not needed, the following POSIX-compliant solution will do:
echo 'siedi87sik65owk55dkd' |
sed 's|[0-9]\{2\}|$(./script.sh &)|g' |
xargs -I% sh -c 'printf "%s\n" '\"%\"

why does one less space in regex makes my sed go weird?

Here is an example of some regex I am trying to figure out. The goal is to strip out extra spaces and make it only one space between words via sed. The sample given has three spaces between sdf and sdk:
test@ubuntu:~/addr_book_script$ echo "est sdf   sdk" | sed 's/  */ /g'
est sdf sdk
test@ubuntu:~/addr_book_script$ echo "est sdf   sdk" | sed 's/ */ /g'
e s t s d f s d k
You will notice that the two sed statements differ only in the number of spaces before the *. The first statement had two spaces and behaved exactly as I wanted.
The second statement had one space before the * and it stuck a space between each letter and word.
I know the * means any number of occurrences of whatever-it-is-that-I-am-looking-for. What I don't understand is why the one space sed replace behaves the way it does.
Thanks
sed 's/ */ /g'
The regex " *" (a space followed by *) matches 0 or more occurrences of a space.
At the start of the string a 0 space match is found and replaced by a single space.
After the first letter another 0 space match is found and replaced by a single space, and so forth.
After est, one or more spaces are found and replaced by a single space.
And so forth.
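The steps above are easy to reproduce. In this sketch squeeze_spaces is a made-up function name; the second command uses a visible replacement character so that the zero-length matches can be seen.

```shell
#!/usr/bin/env bash
# squeeze_spaces: hypothetical wrapper. Two spaces before the * means
# "one space, then zero or more spaces", so an empty match is impossible.
squeeze_spaces() {
  sed 's/  */ /g'
}

printf 'est sdf   sdk\n' | squeeze_spaces
# est sdf sdk

# x* can match the empty string at every position, so a "-" is inserted
# around every character (GNU sed prints -a-b-c-).
printf 'abc\n' | sed 's/x*/-/g'
```

If the goal is only to squeeze runs of spaces, tr -s ' ' does the same job without a regex.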
Another example:
~ >>> echo "est sdf sdk" | sed 's/a*/ /g'
e s t s d f s d k
The replacements occur because of zero-character matches.
" *" (space-star) in regex means 0 or more occurrences of space, so it replaces every instance of 0 or more spaces with a space.
"  *" (space-space-star) forces there to be at least one space.
" +" (space-plus) would accomplish the same thing in some regular expression flavors, but not BRE

Resources