How to find/replace and increment a matched number with sed/awk? - bash

Straight to the point, I'm wondering how to use grep/find/sed/awk to match a certain string (that ends with a number) and increment that number by 1. The closest I've come is to concatenate a 1 to the end (which works well enough) because the main point is to simply change the value. Here's what I'm currently doing:
find . -type f | xargs sed -i 's/\(\?cache_version\=[0-9]\+\)/\11/g'
Since I couldn't figure out how to increment the number, I captured the whole thing and just appended a "1". Before, I had something like this:
find . -type f | xargs sed -i 's/\?cache_version\=\([0-9]\+\)/?cache_version=\11/g'
So at least I understand how to capture what I need.
Instead of explaining what this is for, I'll just explain what I want it to do. It should find text in any file, recursively from the current directory (that part isn't important; it could be any directory, so I'd configure it later), that matches "?cache_version=" followed by a number. It should then increment that number and write it back to the file.
The stuff I have above currently works; it's just that I can't increment the matched number at the end. It would be nicer to increment instead of appending a "1", so that future values wouldn't be "11", "111", "1111", "11111", and so on.
I've gone through dozens of articles/explanations, and often enough, the suggestion is to use awk, but I cannot for the life of me mix them. The closest I came to using awk, which doesn't actually replace anything, is:
grep -Pro '(?<=\?cache_version=)[0-9]+' . | awk -F: '{ print "match is", $2+1 }'
I'm wondering if there's some way to pipe into sed at the end and pass along the original file name, so that sed has the file name and the incremented number (from the awk), or whatever else it needs that xargs provides.
Technically, this number has no importance; the replacement mainly has to guarantee that a new number is there, definitely different from the last one. So as I was writing this question, I realized I might as well use the system time - seconds since the epoch (the technique AJAX often uses to defeat caching of subsequent "identical" requests). I ended up with this, and it seems perfect:
CXREPLACETIME=`date +%s`; find . -type f | xargs sed -i "s/\(\?cache_version\=\)[0-9]\+/\1$CXREPLACETIME/g"
(I store the value first so all files get the same value, in case it spans multiple seconds for whatever reason)
But I would still love an answer to the original question about incrementing a matched number. I'm guessing an easy solution would be to make it a bash script, but still, I thought there would be an easier way than looping through every file recursively, checking its contents for a match, and then replacing, since it's simply incrementing a matched number...not much other logic. I just don't want to write to any other files or anything like that - it should edit in place, like sed does with the "-i" option.

I think finding the files isn't the difficult part for you, so I'll go straight to the point: the +1 calculation. If you have GNU sed, it can be done this way:
sed -r 's/(.*)(\?cache_version=)([0-9]+)(.*)/echo "\1\2$((\3+1))\4"/ge' file
Let's take an example:
kent$ cat test
ello
barbaz?cache_version=3fooooo
bye
kent$ sed -r 's/(.*)(\?cache_version=)([0-9]+)(.*)/echo "\1\2$((\3+1))\4"/ge' test
ello
barbaz?cache_version=4fooooo
bye
You could add the -i option if you like.
Edit
The e flag allows you to pass the matched part to an external command and substitute in the result of its execution. GNU sed only.
See this example, where the external commands/tools echo and bc are used:
kent$ echo "result:3*3"|sed -r 's/(result:)(.*)/echo \1$(echo "\2"\|bc)/ge'
gives output:
result:9
You could use other powerful external commands, like cut, sed (again), awk, and so on.

Pure sed version:
This version has no dependencies on other commands or environment variables.
It uses explicit carrying. For the carry I use the # symbol, but another symbol can be used if you like; pick something that is not present in your input file.
First it finds SEARCHSTRING<number> and appends a # to it.
Then it repeatedly increments digits that have a pending carry (that is, a digit followed by the carry symbol: [0-9]#).
If a 9 is incremented, that increment yields a carry itself, and the process repeats until there are no more pending carries.
Finally, carries that were yielded but not yet added to a digit are replaced by 1.
sed "s/SEARCHSTRING[0-9]*[0-9]/&#/g;:a {s/0#/1/g;s/1#/2/g;s/2#/3/g;s/3#/4/g;s/4#/5/g;s/5#/6/g;s/6#/7/g;s/7#/8/g;s/8#/9/g;s/9#/#0/g;t a};s/#/1/g" numbers.txt

This perl command will search all files in the current directory (without traversing subdirectories; you would need the File::Find module or similar for that more complex task) and will increment the number on any line that matches cache_version=. It uses the /e flag of the regular expression, which evaluates the replacement part.
perl -i.bak -lpe 'BEGIN { sub inc { my ($num) = @_; ++$num } } s/(cache_version=)(\d+)/$1 . (inc($2))/eg' *
I tested it with file in current directory with following data:
hello
cache_version=3
bye
It backs up the original file (ls -1):
file
file.bak
And file now with:
hello
cache_version=4
bye
I hope this is useful for what you are looking for.
UPDATE to use File::Find for traversing directories. The command still accepts * as an argument, but discards those names in favor of the files found by File::Find. The directory where the search begins is the current directory at execution; it is hardcoded in the line find( \&wanted, "." ).
perl -MFile::Find -i.bak -lpe '
  BEGIN {
    sub inc {
      my ($num) = @_;
      ++$num
    }
    sub wanted {
      if ( -f && ! -l ) {
        push @ARGV, $File::Find::name;
      }
    }
    @ARGV = ();
    find( \&wanted, "." );
  }
  s/(cache_version=)(\d+)/$1 . (inc($2))/eg
' *

This is ugly (I'm a little rusty), but here's a start using sed:
orig="something1" ;
text=`echo $orig | sed "s/\([^0-9]*\)\([0-9]*\)/\1/"` ;
num=`echo $orig | sed "s/\([^0-9]*\)\([0-9]*\)/\2/"` ;
echo $text$(($num + 1))
With an original string ($orig) of "something1", sed splits off the text and numeric portions into $text and $num, and these are then combined in the final line with an incremented number, resulting in something2.
Just a start since it doesn't consider cases with numbers within the file name or names with no number at the end, but hopefully helps with your original goal of using sed.
This can actually be simplified within sed by using buffers, I believe (sed can operate recursively), but I'm really rusty with that aspect of it.

perl -pi -e 's/(\?cache_version=)(\d+)/$1.($2+1)/ge' FILE [FILE...]
or for a complete solution:
find . -type f | xargs perl -pi -e 's/(\?cache_version=)(\d+)/$1.($2+1)/ge'
perl substitution operator
/e modifier evaluates the replacement as if it were a Perl statement, using its return value as the replacement text.
. operator concatenates strings in Perl. The parentheses ensure that the arithmetic operation $2+1 takes precedence over concatenation.
/g modifier applies the substitution to all matches within the line
perl options
-p ensures that perl will execute the command on every line of each file
-i ensures that each file will be edited in place
-e specifies the perl command(s) that are executed (in this case, the substitution operation)
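For example, on a hypothetical file (the file name and contents are invented for illustration):
$ cat page.html
src="app.js?cache_version=41"
$ perl -pe 's/(\?cache_version=)(\d+)/$1.($2+1)/ge' page.html
src="app.js?cache_version=42"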

Related

Can I do a Bash wildcard expansion (*) on an entire pipeline of commands?

I am using Linux. I have a directory of many files, and I want to use grep, tail, and wildcard expansion * in tandem to print the last occurrence of <pattern> in each file:
Input: <some command>
Expected Output:
<last occurrence of pattern in file 1>
<last occurrence of pattern in file 2>
...
<last occurrence of pattern in file N>
What I am trying now is grep "pattern" * | tail -n 1, but the output contains only one line, which is the last occurrence of pattern in the last file. I assume the reason is that the * wildcard expansion happens before the pipeline of commands, so the tail runs only once.
Does there exist some Bash syntax so that I can achieve the expected outcome, i.e. let tail run for each file?
I know I can always use a for-loop to solve the problem. I'm just curious if the problem can be solved with a more condensed command.
I've also tried grep -m1 "pattern" <(tac *), and it seems the aforementioned reasoning still applies: wildcard expansion applies only to the immediate command it is associated with, and the "outer" command runs only once.
Wildcards are expanded on the command line before any command runs. For example if you have files foo and bar in your directory and run grep pattern * | tail -n1 then bash transforms this into grep pattern foo bar | tail -n1 and runs that. Since there's only one stream of output from grep, there's only one stream of input to tail and it prints the last line of that stream.
If you want to search each file and print the last line of grep's output separately you can use a loop:
for file in * ; do
  grep pattern "${file}" | tail -n1
done
The problem with non-loop solutions is that tail doesn't inherently know where the output of one file ends and the output of another begins, or indeed that there are even files involved on the other end of the pipe. It just knows input is coming in from somewhere and it has to print the last line of that input. If you didn't want a loop, you'd have to use a more powerful tool like awk, perhaps exploiting the fact that grep prepends the names of matched files (when multiple files are matched, or with -H) to delimit the start and end of each file's output. But the work of writing an awk program that tracks the current file, so it knows when that file's output ends and can print its last line, is probably more effort than it's worth when the loop solution is so simple.
You can achieve what you want using xargs. For your example it would be:
ls * | xargs -n 1 sh -c 'grep "pattern" $0 | tail -n 1'
Can save you from having to write a loop.
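If the file names may contain spaces or other special characters, a safer variant would be a sketch like this, using find instead of parsing ls (not part of the original answer):
find . -maxdepth 1 -type f -exec sh -c 'grep "pattern" "$1" | tail -n 1' sh {} \;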
You can do this with awk, although (as tjm3772 pointed out in their answer) it's actually more complicated than the shell for loop. For the record, here's what I came up with:
awk -v pattern="YourPatternHere" '(FNR==1 && line!="") {print line; line=""}; $0~pattern {line=$0}; END {if (line!="") print line}' *
Explanation: when it finds a matching line ($0~pattern), it stores that line in the line variable ({line=$0}). This means that at the end of each file, line will hold the last matching line of that file.
(Note: if you want to just include a literal pattern in the program, remove the -v pattern="YourPatternHere" part and replace $0~pattern with just /YourPatternHere/)
There's no simple trigger to print a match at the end of each file, so that part's split into two pieces: if it's the first line of a file AND line is set because of a match in the previous file ((FNR==1 && line!="")), print line and then clear it so it's not mistaken for a match in the current file ({print line; line=""}). Finally, at the end of the final file (END), print a match found in that last file if there was one ({if (line!="") print line}).
Also, note that the print-at-beginning-of-new-file test must be before the check for a matching line, or else it'll get very confused if the first line of the new file matches.
So... yeah, a shell for loop is simpler (and much easier to get right).

sed: Can't replace latest text occurrence including "-" dashes using variables

Trying to replace a text to another with sed, using a variable. It works great until the variable's content includes a dash "-" and sed tries to interpret it.
It is to be noted that in this context I need to replace only the last occurrence of the source variable ${source}, which is why my sed command looks like this:
sed -e "s:${source}([^${source}]*)$:${dest}\1:"
"sed" is kind of new to me, I always got my results with "replace" or "awk" whenever possible, but here I'm trying to make the code as versatile as possible, hence using sed. If you think of another solution, that is viable as well.
Example for the issue:
# mkdir "/home/youruser/TEST-master"
# source="TEST-master" ; dest="test-master" ; find /home/youruser/ -depth -type d -name '*[[:upper:]]*' | grep "TEST" | sed -e "s:${source}([^${source}]*)$:${dest}\1:"
sed: -e expression #1, char 46: Invalid range end
Given that I don't know how many dashes every single variable may contain, does any sed expert know how could I make this work?
Exact context: Open source project LinuxGSM for which I'm rewriting a function to recursively lowercase files and directories.
Bash function I'm working on and comment here: https://github.com/GameServerManagers/LinuxGSM/issues/1868#issuecomment-996287057
If I'm understanding the context right, the actual goal is to take a path that contains some uppercase characters in its last element, and create a version with the last element lowercased. For example, /SoMe/PaTh/FiLeNaMe would be converted to /SoMe/PaTh/filename. If that's the case, rather than using string substitution, use dirname and basename to split the path into components, lowercase the last one, then reassemble it:
parentdir=$(dirname "$src")
filename=$(basename "$src")
lowername=$(echo "$filename" | tr '[:upper:]' '[:lower:]')
dst="$parentdir/$lowername"
(Side note: it's important to quote the parameters to tr, to make sure the shell doesn't treat them as filename wildcards and replace them with lists of matching files.)
As long as the paths contain at least one "/" and do not end with "/", you can use bash substitutions instead of dirname and basename:
parentdir="${src%/*}"
filename="${src##*/}"
As long as you're using bash v4.0 or later, you can also use a builtin substitution to do the lowercasing:
lowername="${filename,,}"

Sed through files without using for loop?

I have a small script which basically generates a menu of all the scripts in my ~/scripts folder and displays next to each of them a sentence describing it, that sentence being the commented-out third line of the script. I then plan to pipe this into fzf or dmenu to select it and start editing it or whatever.
1 #!/bin/bash
2
3 # a script to do
So it would look something like this
foo.sh a script to do X
bar.sh a script to do Y
Currently I have it run a for loop over all the files in the scripts folder and then run sed -n 3p on all of them.
for i in $(ls -1 ~/scripts); do
  echo -n "$i"
  sed -n 3p "$HOME/scripts/$i"
  echo
done | column -t -s '#' | ...
I was wondering if there is a more efficient way of doing this that did not involve a for loop and only used sed. Any help will be appreciated. Thanks!
Instead of a loop that parses ls output and runs sed per file, you may try this awk command:
awk 'FNR == 3 {
  f = FILENAME; sub(/^.*\//, "", f); print f, $0; nextfile
}' ~/scripts/* | column -t -s '#' | ...
Yes there is a more efficient way, but no, it doesn't only use sed. This is probably a silly optimization for your use case, but it may be worthwhile nonetheless.
The inefficiency is that you're using ls to read the directory and then parsing its output. For large directories, that causes a lot of overhead for keeping that list in memory even though you only traverse it once. Also, it's not done correctly; consider filenames with special characters that the shell interprets.
The more efficient way is to use find in combination with its -exec option, which starts a second program with each found file in turn.
BTW: If you didn't rely on line numbers but maybe a tag to mark the description, you could also use grep -r, which avoids an additional process per file altogether.
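For illustration, a sketch of what the find approach might look like, mirroring the loop from the question (the + batches files into as few shell invocations as possible):
find ~/scripts -maxdepth 1 -type f -exec sh -c '
  for f; do
    printf "%s" "${f##*/}"   # the script name without its directory
    sed -n 3p "$f"           # the third line holding the description
    echo
  done' sh {} + | column -t -s '#'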
This might work for you (GNU sed):
sed -sn '1h;3{H;g;s/\n/ /p}' ~/scripts/*
Use the -s option to reset the line number addresses for each file.
Copy line 1 to the hold space.
Append line 3 to the hold space.
Swap the hold space for the pattern space.
Replace the newline with a space and print the result.
All files in the directory ~/scripts will be processed.
N.B. You may wish to replace the space delimiter by a tab or pipe the results to the column command.
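Applied to the example script from the question, each file contributes one line of output like:
#!/bin/bash # a script to do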

degenerate positions in motifs - bash

I am new to coding and writing a shell script which searches for motifs in protein sequence files and prints their location if present.
But these motifs have degenerate positions.
For example,
A motif can be (psi, psi, x, psi), where psi = (I, L, or V) and x can be any of the 20 amino acids.
I would search a set of sequences for the occurrence of this motif. However, my protein sequences are exact sequences, i.e. they have no ambiguity, like:
>
MSGIALSRLAQERKAWRKDHPFGFVAVPTKNPDGTMNLMNWECAIPGKKGTPWEGGL
I would like to search for all possible exact instances of the motif in the protein sequences present in the fasta files.
I have some rough code which I know is wrong.
#!/usr/bin/bash
x=(A C G H I L M P S T V D E F K N Q R W Y)
psi=(I L V)
alpha=(D E)
motif1=($psi,$psi,$x,$psi)
for f in *.fasta ; do
  if grep -q "$motif1" <$f ; then
    echo $f
    grep "^>" $f | tr -d ">"
    grep -v ">" $f | grep -aob "$motif1"
  fi
done
Appreciate any help in finding my way.
Thanks in advance!
The shell is an excellent tool for orchestrating other tools, but it's not particularly well suited to analyzing the contents of files.
A common arrangement is to use the shell to run Awk over a set of files, and do the detection logic in Awk instead. (Other popular tools are Python and Perl; I would perhaps tackle this in Python if I were to start from scratch.)
Regardless of the language of your script, you should avoid code duplication; refactor to put parameters in variables and then run the code with those parameters, or perhaps move the functionality to a function and then call it with different parameters. For example,
scan () {
  local f
  for f in *.fasta; do
    # Inefficient: refactor to do the grep only once, then decide whether you want to show the output or not
    if grep -q "$1" "$f"; then
      # Always, always use double quotes around file names
      echo "$f"
      grep "^>" "$f" | tr -d ">"
      grep -v ">" "$f" | grep -aob "$1"
    fi
  done
}
case $motif in
  1) scan "$SIM_Type_1";; # Notice typo in variable name
  2) scan "$SIM_Type_2";; # Ditto
  3) scan "$SIM_Type_3";; # Ditto
  4) scan "$SIM_Type_4";; # Ditto
  5) scan "$SIM_TYPE_5";; # Notice inconsistent variable name
  alpha) scan "$SIM_Type_alpha";;
  beta) scan "$SIM_Type_beta";;
esac
You are declaring the *_Type_* variables (or occasionally *_TYPE_* -- the shell is case sensitive, and you should probably use the same capitalization for all the variables just to make it easier for yourself) as arrays, but then you are apparently attempting to use them as regular scalars. I can only guess as to what you intend for the variables to actually contain; but I'm guessing you want something like
# These are strings which contain regular expressions
x='[ACGHILMPSTVDEFKNQRWY]'
psi='[ILV]'
psi_1='[IV]'
alpha='[DE]'
# These are strings which contain sequences of the above regexes
# The ${variable} braces are not strictly necessary, but IMHO help legibility
SIM_Type_1="${psi}${psi}${x}${psi}"
SIM_Type_2="${psi}${x}${psi}${psi}"
SIM_Type_3="${psi}${psi}${psi}${psi}"
SIM_Type_4="${x}${psi}${x}${psi}"
SIM_TYPE_5="${psi}${alpha}${psi}${alpha}${psi}"
SIM_Type_alpha="${psi_1}${x}${psi_1}${psi}"
SIM_Type_beta="${psi_1}${psi_1}.${x}${psi}"
# You had an empty spot here ^ I guessed you want to permit any character?
If you really wanted these to be arrays, the way to access the contents of an array is "${array[@]}", but that would not produce something we can directly pass to grep or Awk, so I went with declaring these as strings containing regular expressions for the motifs.
But to reiterate, Awk is probably a better language for this, so let's refactor scan to be an Awk script.
# This replaces the function definition above
scan () {
    awk -v re="$1" '
        # Reset the printed flag and the byte offset at the start of each file
        FNR == 1 { printed = 0; len = 0 }
        { if (/^>/) label = $0
          else if (idx = match($0, re, result)) {   # three-argument match() is GNU awk
              if (! printed) { print FILENAME; printed = 1 }
              print len + idx ":" result[0]
          }
          len += 1 + length($0)   # one for the newline
        }' *.fasta
}
The main complication here is reimplementing the grep -b byte offset calculation. If that is not strictly necessary (perhaps a line number would suffice?) then the Awk script can be reduced to a somewhat more trivial one.
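For the record, a sketch of that simpler variant, printing line numbers instead of byte offsets (plain POSIX awk suffices here):
scan () {
    awk -v re="$1" '
        FNR == 1 { printed = 0 }     # new file: allow printing its name again
        !/^>/ && $0 ~ re {
            if (!printed) { print FILENAME; printed = 1 }
            print FNR ": " $0        # line number and the matching line
        }' *.fasta
}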
Your use of grep -a suggests that perhaps your input files contain DOS newlines. I think this will work fine regardless of this.
The immediate benefit of this refactoring is that we avoid scanning the potentially large input file twice. We only scan the file once, and print the file name on the first match.
If you want to make the script more versatile, this is probably also a better starting point than the grep | tr solution you had before. But if the script does what you need, and the matches are often near the beginning of the input file, or the input files are not large, perhaps you don't actually want to switch to Awk after all.
Notice also that like your grep logic, this will not work if a sequence is split over several lines in the FASTA file and the match happens to straddle one of the line breaks.
Finally, making the script prompt for interactive input is a design wart. I would suggest you accept the user choice as a command-line argument instead.
motif=$1
So you'd use this as ./scriptname alpha to run the alpha regex against your FASTA files.
Another possible refactoring would be to read all the motif regexes into a slightly more complex Awk script and print all matches for all of them in a form which then lets you easily pick the ones you actually want to examine in more detail. If you have a lot of data to process, looping over it only once is a huge win.

Using BASH, how to increment a number that uniquely only occurs once in most lines of an HTML file?

The target is always going to be between two characters, 'E' and '/'; there will only ever be one occurrence of this combination (e.g. 'E01/') per line, it appears in most lines of the HTML file, and the number will always be between '01' and '90'.
So, I need to programmatically read the file and replace each occurrence of 'Enn/', where 'nn' is between '01' and '90', incrementing the existing number by 1 throughout the HTML file while maintaining the leading '0' for numbers '01' to '09'.
Is this doable and if so how best to go about it?
Edit: Target lines will be in one of the following two formats:
<DT>ProgramName
<DT>Program Name
You can use sed inside BASH as a fantastic one-liner, either:
sed -ri 's/(.*E)([0-9]{2})(\/.*)/printf "\1%02u\3" $((10#\2+(10#\2>=90?0:1)))/ge' FILENAME
or if you are guaranteed the number is lower than 100:
sed -ri 's/(.*E)([0-9]{2})(\/.*)/printf "\1%02u\3" $((10#\2+1))/ge' FILENAME
Basically, you'll be doing an in-place search and replace. The first command will not increment past 90 (since you didn't specify the exact nature of the overflow condition). So E89/ -> E90/, E90/ -> E90/, and if by chance you have E91/, it will remain E91/. Add this line inside a loop for multiple files.
A small explanation of the above command:
-r states that you'll be using a regular expression
-i states to write back to the same file (be careful with overwriting!)
s/search/replace/ge this is the regex command you'll be using
s/ states you'll be using a string search
(.*E) first grouping: everything up to the last E (greedy, case sensitive)
([0-9]{2}) second grouping: exactly two digits 0 through 9 (fixed width)
(\/.*) third grouping: the escaped trailing slash and everything after it
/ (slash separator) denotes the end of the search pattern and the beginning of the replacement pattern
printf "format" var this is the expression used for each replacement
\1 place the first grouping found here
%02u the replacement format for the var (unsigned, zero-padded to two digits)
\3 place the third grouping found here
$((expression)) BASH arithmetic expression to use in the printf format
10#\2 force the second grouping to be treated as a base-10 number
+(10#\2>=90?0:1) add 0 or 1 to the second grouping depending on whether it is >= 90 (as used in the first command)
+1 add 1 to the second grouping (see the second command)
/ge flags: global replacement, with the replacement evaluated as an expression
GNU sed and awk are very powerful tools to do this sort of thing.
You can use the following perl one-liner to increment the numbers while preserving the leading zeros.
perl -pe 's/E\K([0-9]+)/sprintf "%02d", 1+$1/e' file
$ cat file
<DT>ProgramName
<DT>Program Name
<DT>Program Name
<DT>Program Name
$ perl -pe 's/E\K([0-9]+)/sprintf "%02d", 1+$1/e' file
<DT>ProgramName
<DT>Program Name
<DT>Program Name
<DT>Program Name
You can add the -i option to make changes in-place. I would recommend creating backup before doing so.
Not as elegant as a one-line sed!
Breaking the work into multiple commands makes it possible to debug your bash, grep, or sed at each step.
# find the number
# use -o to grep to just return pattern
# use head -n1 for safety to just get 1 number
n=$(grep -o "E[0-9][0-9]\/" file.html |grep -o "[0-9][0-9]"|head -n1)
# octal is the problem: bash would otherwise parse 08 and 09 as (invalid) octal, so force base 10
n1=10#$n
echo Debug n1=$n1 n=$n
n2=$n1
# bash arithmetic is done inside (( ))
# as ever with bash bracketing, whitespace is needed
(( n2++ ))
echo debug n2=$n2
# use sed with -i -e for an in-place edit to replace the number
sed -i -e "s/E$n\//E$(printf '%02d' $n2)\//" file.html
grep "E[0-9][0-9]" file.html
awk might be better; maybe it could even be done in one awk command, something like the sketch below.
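A rough single-command sketch with GNU awk (this assumes gawk 4.1+ for -i inplace, and is untested against real HTML files):
gawk -i inplace '{
    out = ""
    while (match($0, /E[0-9][0-9]\//)) {
        n = substr($0, RSTART + 1, 2) + 1             # the two digits plus one (awk reads "09" as decimal 9)
        out = out substr($0, 1, RSTART) sprintf("%02d/", n)
        $0 = substr($0, RSTART + RLENGTH)
    }
    print out $0
}' file.html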
The sed one-liner in the other answer is awesome :-)
This works in bash.
http://unixhelp.ed.ac.uk/CGI/man-cgi?grep
