Bash regex to match multiple blocks of indented content and print all of them - bash

I'm trying to do some regex matching in bash.
I'd like to match multiple blocks of indented (space or tab) content, where each block starts with a keyword.
Some other content could be present in the file.
Using this sample content :
keyword aaa match1
Some other content
keyword ccc match2
    indentend content
    matching
Some other content
    with indendation
keyword ddd match2
    indented content still matching
I came up with this pattern: (^keyword.*(?:\n^\h+.*)*), which seems to work; everything matches as expected:
https://regex101.com/r/kvMlKK/1
The expected output is every match:
keyword aaa match1
keyword ccc match2
    indentend content
    matching
keyword ddd match2
    indented content still matching
Unfortunately, I did not find a way to print all matches in bash. I can use grep/sed/awk/perl without any problem (edit: I meant that I have access to all these commands in the environment I am working with).
Edit:
grep -E --include \*.md '(^keyword.*(?:\n^\h+.*)*)' $(dirname "$0")/../_inbox/draft.md
With grep, only the first line of each match is returned, I guess because grep matches one line at a time.
I am not familiar with awk/sed and did not get any meaningful results with them (even though they seem better suited to multi-line matching).
Edit: if that could work on multiple files, that would be awesome.
Thanks for your help!

You can do it in pure bash with a loop that tracks state across lines, since each regex test in the loop only sees one line.
#!/bin/bash
# Flag to track whether we are inside an indented block
indented=0
# [[:blank:]] matches a space or a tab; "\t" inside a double-quoted
# bash string is a literal backslash followed by "t", not a tab
keyword_re='^[[:blank:]]*keyword'
indent_re='^[[:blank:]]+'
# Read input line by line
while IFS= read -r line; do
    if [[ $line =~ $keyword_re ]]; then
        # Keyword line: print it and open a block
        printf '%s\n' "$line"
        indented=1
    elif [[ $indented -eq 1 && $line =~ $indent_re ]]; then
        # Indented line inside an open block: print it
        printf '%s\n' "$line"
    else
        # Any other line closes the block
        indented=0
    fi
done < "input"
You can do it in awk too:
awk '/^[ \t]*keyword/{print; while ((getline line) > 0) { if (line ~ /^[ \t]/ || line ~ /^keyword/) print line; else break }}' input
Or use sed (GNU sed, because of \t and the one-line label syntax):
sed -n ':top;/^[ \t]*keyword/{p;:a;n;/^[ \t]/{p;ba;};b top;}' input

Using awk:
$ awk '!/^[\t ]/{p=0} /^keyword/{p=1} p' file
keyword aaa match1
keyword ccc match2
    indentend content
    matching
keyword ddd match2
    indented content still matching
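The question also asks for something that works on several files. A minimal sketch, assuming the same awk flag approach as above: when awk reads several files, FNR restarts at 1 for each of them, so the flag can be reset there to keep a block at the end of one file from swallowing indented lines at the top of the next. The function name print_blocks and the temporary file names are made up for illustration.

```shell
#!/bin/sh
# print_blocks (hypothetical name): print each "keyword" line plus the
# indented lines that directly follow it, for every file given.
print_blocks() {
    awk '
        FNR == 1   { p = 0 }   # first line of a new file: close any open block
        !/^[ \t]/  { p = 0 }   # unindented line: close the block
        /^keyword/ { p = 1 }   # keyword line: open a block
        p                      # print while a block is open
    ' "$@"
}

# Demo on two throwaway files (names are illustrative only).
tmpdir=$(mktemp -d) || exit 1
printf 'keyword aaa match1\nSome other content\nkeyword ccc match2\n\tindented content\n\tmatching\n' > "$tmpdir/a.md"
printf '  stray indented line\nkeyword ddd match2\n  indented content still matching\n' > "$tmpdir/b.md"

blocks_out=$(print_blocks "$tmpdir/a.md" "$tmpdir/b.md")
printf '%s\n' "$blocks_out"
rm -rf "$tmpdir"
```

In a script this could be called as print_blocks ./_inbox/*.md. Without the FNR == 1 rule, the stray indented line at the top of the second demo file would be printed as if it belonged to the last block of the first file.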

Related

sed insert line after a match only once [duplicate]

UPDATED:
Using sed, how can I insert (NOT SUBSTITUTE) a new line on only the first match of keyword for each file.
Currently I have the following, but it inserts before every line containing Matched Keyword; I want it to insert New Inserted Line only at the first match found in each file:
sed -ie '/Matched Keyword/ i\New Inserted Line' *.*
For example:
Myfile.txt:
Line 1
Line 2
Line 3
This line contains the Matched Keyword and other stuff
Line 4
This line contains the Matched Keyword and other stuff
Line 6
changed to:
Line 1
Line 2
Line 3
New Inserted Line
This line contains the Matched Keyword and other stuff
Line 4
This line contains the Matched Keyword and other stuff
Line 6
You can sort of do this in GNU sed:
sed '0,/Matched Keyword/s//New Inserted Line\n&/'
But it's not portable. Since portability is good, here it is in awk:
awk '/Matched Keyword/ && !x {print "Text line to insert"; x=1} 1' inputFile
Or, if you want to pass a variable to print:
awk -v "var=$var" '/Matched Keyword/ && !x {print var; x=1} 1' inputFile
These both insert the text line before the first occurrence of the keyword, on a line by itself, per your example.
Remember that with both sed and awk, the matched keyword is a regular expression, not just a keyword.
UPDATE:
Since this question is also tagged bash, here's a simple solution that is pure bash and doesn't require sed:
#!/bin/bash
n=0
# IFS= and -r keep leading whitespace and backslashes intact
while IFS= read -r line; do
    if [[ "$line" =~ 'Matched Keyword' && $n = 0 ]]; then
        echo "New Inserted Line"
        n=1
    fi
    echo "$line"
done
As it stands, this works as a filter in a pipe. You can easily wrap it in something that acts on files instead.
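One way such a file-based wrapper could look, as a rough sketch rather than a definitive implementation: it reuses the portable awk one-liner from above, writes to a temporary file, and then copies the result back. The helper name insert_before_first is hypothetical, and note that awk -v interprets backslash escapes in the values it receives.

```shell
#!/bin/sh
# insert_before_first (hypothetical helper): for each file, insert TEXT on its
# own line before the first line matching the regex PATTERN.
insert_before_first() {
    pattern=$1
    text=$2
    shift 2
    for file in "$@"; do
        tmp=$(mktemp) || return 1
        # "seen" starts out unset for every file because awk runs once per file.
        if awk -v pat="$pattern" -v txt="$text" \
            '$0 ~ pat && !seen { print txt; seen = 1 } 1' "$file" > "$tmp"; then
            cat "$tmp" > "$file"   # overwrite contents, keep the file's permissions
        fi
        rm -f "$tmp"
    done
}

# Demo on a throwaway file.
demo=$(mktemp) || exit 1
printf 'Line 1\nhas the Matched Keyword\nLine 3\nhas the Matched Keyword again\n' > "$demo"
insert_before_first 'Matched Keyword' 'New Inserted Line' "$demo"
insert_out=$(cat "$demo")
printf '%s\n' "$insert_out"
rm -f "$demo"
```

Running awk once per file is what gives the "first match in each file" behaviour; a single awk call over all files would need the flag reset at FNR == 1 instead.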
If you want one with sed*:
sed '0,/Matched Keyword/s/^.*Matched Keyword/New Inserted Line\n&/' myfile.txt
*only works with GNU sed
This might work for you:
sed -i -e '/Matched Keyword/{i\New Inserted Line' -e ':a;n;ba}' file
You're nearly there! Just create a loop to read from the Matched Keyword to the end of the file.
After inserting a line, the remainder of the file can be printed out by:
Introducing a loop place holder :a (here a is an arbitrary name).
Print the current line and fetch the next into the pattern space with the n command.
Redirect control back using the ba command, which is essentially a goto to the a place holder. The end-of-file condition is naturally taken care of by the n command, which terminates any further sed commands if it tries to read past the end of the file.
With a little help from bash, a true one liner can be achieved:
sed $'/Matched Keyword/{iNew Inserted Line\n:a;n;ba}' file
Alternative:
sed 'x;/./{x;b};x;/Matched Keyword/h;//iNew Inserted Line' file
This uses the Matched Keyword as a flag in the hold space and once it has been set any processing is curtailed by bailing out immediately.
If you want to append a line after first match only, use AWK instead of SED as below
awk '{print} /Matched Keyword/ && !n {print "New Inserted Line"; n++}' myfile.txt
Output:
Line 1
Line 2
Line 3
This line contains the Matched Keyword and other stuff
New Inserted Line
Line 4
This line contains the Matched Keyword and other stuff
Line 6

How to grep a specific pattern before match?

I'm currently working on multiple configuration files which use the following format:
[Stanza1]
action.script=1
action.ping=0
action.lookup=1
action.notable.param=0
action.script.filename=script.pl
[Stanza2]
action.script=0
action.ping=0
action.lookup=1
[Stanza3]
action.script=1
action.ping=0
action.lookup=0
action.script.filename=script.pl
I want to know which stanzas include "action.script.filename=script.pl", so the expected result would be
[Stanza1]
[Stanza3]
Using something like:
grep -B 10 "action.script.filename = script.pl" file
doesn't work for cases where the stanza name is more than 10 lines before the match, and proves quite cumbersome to use.
Any suggestions on how to do this?
The following sed command would do the trick :
sed -n '/^\[/h;/^action\.script\.filename=script\.pl$/{x;p}'
When it encounters a line that starts with "[", it stores it in its hold buffer. When it encounters an "action.script.filename=script.pl" line, it prints the content of the hold buffer.
I'm not sure this can be done purely with grep. I would recommend a small bash script:
while IFS= read -r line
do
    if [[ $line =~ ^\[ ]]; then
        # save stanza for later
        stanza=$line
    fi
    if [[ $line == "action.script.filename=script.pl" ]]; then
        # quote the variable: [Stanza1] is also a valid glob pattern
        printf '%s\n' "$stanza"
    fi
done < file
With awk
$ awk '/action\.script\.filename=script\.pl/{print h} /^\[/{h=$0}' ip.txt
[Stanza1]
[Stanza3]
/^\[/ lines starting with [ character, you can also use something like /Stanza/ as long as it uniquely identifies header lines
h=$0 for such lines, save the content ($0) to variable h
/action\.script\.filename=script\.pl/ if input line matches the given search criteria
print h print the value of h variable
if you are matching whole line, then you can also use string match $0 == "action.script.filename=script.pl" instead of regex match
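The string-match variant mentioned in the last point could look like the sketch below. The helper name stanzas_with and the extra action_script line in the demo data are made up; that line is there only to show that a plain == comparison does not treat the dots as wildcards.

```shell
#!/bin/sh
# stanzas_with (hypothetical helper): print the header of every stanza that
# contains a line exactly equal to TARGET.
stanzas_with() {
    target=$1
    shift
    awk -v target="$target" '
        /^\[/         { h = $0 }    # remember the most recent header
        $0 == target  { print h }   # plain string comparison, no regex
    ' "$@"
}

# Demo on a throwaway file.
conf=$(mktemp) || exit 1
printf '%s\n' '[Stanza1]' 'action.script=1' 'action.script.filename=script.pl' \
    '[Stanza2]' 'action.script=0' 'action_script.filename=script.pl' \
    '[Stanza3]' 'action.script.filename=script.pl' > "$conf"
stanza_out=$(stanzas_with 'action.script.filename=script.pl' "$conf")
printf '%s\n' "$stanza_out"
rm -f "$conf"
```

With a regex such as /action.script.filename=script.pl/ and unescaped dots, [Stanza2] would be reported as well.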
This line of code works for me
grep '^\[Stanza\|^action.script.filename=script.pl$' fileName | grep -B1 'action.script.filename=script.pl' | grep -v 'action.script.filename=script.pl\|\-\-'
Explanation:
grep '^\[Stanza\|^action.script.filename=script.pl$' fileName
matches either [Stanza]* lines or action.script.filename=script.pl ones. Output is something like this
[Stanza1]
action.script.filename=script.pl
[Stanza2]
[Stanza3]
action.script.filename=script.pl
Adding this filter | grep -B1 'action.script.filename=script.pl' will result in this
[Stanza1]
action.script.filename=script.pl
--
[Stanza3]
action.script.filename=script.pl
Now you just need to clean the output from unwanted parts
| grep -v 'action.script.filename=script.pl\|\-\-'
This is the final output
[Stanza1]
[Stanza3]
awk '/^\[.*\]$/{stanza=$0;next} /action.script.filename=script.pl/{print stanza}' filename
[Stanza1]
[Stanza3]
You can store each stanza in a variable called stanza and move to next line. Whenever you see the string action.script.filename=script.pl , print the variable stanza.

sed removing # and ; comments from files up to certain keyword

I have files from which comments and blank lines need to be removed, but only up to a keyword. The line number of the keyword varies. Is it possible to limit several chained sed substitutions to the lines before a keyword?
This removes all comments and empty lines from the whole file:
sed -i -e 's/#.*$//' -e 's/;.*$//' -e '/^$/d' file
For example something like this :
# string1
# string2
some string
; string3
; string4
####
<Keyword_Keep_this_line_and_comments_white_space_after_this>
# More comments that need to be here
; etc.
sed -i '1,/Keyword/{/^[#;]/d;/^$/d;}' file
I would suggest using awk and setting a flag when you reach your keyword:
awk '/Keyword/ { stop = 1 } stop || !/^[[:blank:]]*([;#]|$)/' file
Set stop to true when the line contains Keyword. Do the default action (print the line) when stop is true or when the line doesn't match the regex. The regex matches lines whose first non-blank character is a semicolon or hash, or blank lines. It's slightly different to your condition but I think it does what you want.
The command prints to standard output so you should redirect to a new file and then overwrite the original to achieve an "in-place edit":
awk '...' input > tmp && mv tmp input
Use grep -n keyword to get the line number that contains the keyword.
Use sed -i -e '1,N s/#..., where N is the line number that contains the keyword, to remove comments only on lines 1 to N.
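The two steps above can be sketched as one small function. This is only an illustration under a few assumptions: the helper name strip_until is made up, the result goes to standard output instead of being edited in place, the keyword line itself falls inside the 1,N range (as in the description above), and a file without the keyword is passed through unchanged.

```shell
#!/bin/sh
# strip_until (hypothetical helper): remove "#" and ";" comments and empty
# lines from line 1 through the first line matching KEYWORD; later lines are
# left alone. Prints to standard output.
strip_until() {
    keyword=$1
    file=$2
    # Line number of the first match, e.g. "7" from "7:<Keyword...>"
    n=$(grep -n -- "$keyword" "$file" | head -n 1 | cut -d: -f1)
    if [ -z "$n" ]; then
        cat "$file"    # assumption: no keyword means nothing to strip
        return
    fi
    sed -e "1,${n}s/#.*\$//" -e "1,${n}s/;.*\$//" -e "1,${n}{/^\$/d;}" "$file"
}

# Demo on a throwaway file.
sample=$(mktemp) || exit 1
printf '%s\n' '# string1' '# string2' 'some string' '; string3' '; string4' '####' \
    '<Keyword_Keep_this_line>' '# More comments that need to be here' '; etc.' > "$sample"
strip_out=$(strip_until 'Keyword' "$sample")
printf '%s\n' "$strip_out"
rm -f "$sample"
```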

Replace some lines in fasta file with appended text using while loop and if/else statement

I am working with a fasta file and need to add line-specific text to each of the headers. So for example if my file is:
>TER1
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>TER2
AGCATGCTAGCTAGACGACTCGATCGCATGCTC
>URC1
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>URC2
AGCATGCTACCTAGTCGACTCGATCGCATGCTC
>UCR3
AGCATGCTAGCTAGTCGACTCGATGGCATGCTC
I want a while loop that will read through each line; for those with a > at the start, I want to append |population: plus the first three characters after the >. So line one would be:
>TER1|population:TER
etc.
I can't figure out how to make this work. Here is my best attempt so far.
filename="testfasta.fa"
while read -r line
do
if [[ "$line" == ">"* ]]; then
id=$(cut -c2-4<<<"$line")
printf $line"|population:"$id"\n" >>outfile
else
printf $line"\n">>outfile
fi
done <"$filename"
This produces a file with the original headers and following line each on a single line.
Can someone tell me where I'm going wrong? My if and else loop aren't working at all!
Thanks!
You could use a while loop if you really want,
but sed would be simpler:
sed -e 's/^>\(...\).*/&|population:\1/' "$filename"
That is, for lines starting with > (pattern: ^>),
capture the next 3 characters (with \(...\)),
and match the rest of the line (.*),
replace with the line as it was (&),
and the fixed string |population:,
and finally the captured 3 characters (\1).
This will produce for your input:
>TER1|population:TER
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>TER2|population:TER
AGCATGCTAGCTAGACGACTCGATCGCATGCTC
>URC1|population:URC
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>URC2|population:URC
AGCATGCTACCTAGTCGACTCGATCGCATGCTC
>UCR3|population:UCR
AGCATGCTAGCTAGTCGACTCGATGGCATGCTC
Or you can use this awk, also producing the same output:
awk '{sub(/^>.*/, $0 "|population:" substr($0, 2, 3))}1' "$filename"
You can do this quickly in awk:
awk '$1~/^>/{$1=$1"|population:"substr($1,2,3)}{}1' infile.txt > outfile.txt
$ awk '$1~/^>/{$1=$1"|population:"substr($1,2,3)}{}1' testfile
>TER1|population:TER
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>TER2|population:TER
AGCATGCTAGCTAGACGACTCGATCGCATGCTC
>URC1|population:URC
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>URC2|population:URC
AGCATGCTACCTAGTCGACTCGATCGCATGCTC
>UCR3|population:UCR
AGCATGCTAGCTAGTCGACTCGATGGCATGCTC
Here awk will:
Test whether the record starts with >. $1 looks at the first field, but $0 (the entire record) would work just as well in this case. The ~ performs a regex test, and ^> means "starts with >", which makes the test $1~/^>/.
If so, set the first field to the output you are looking for, using substr() to get the part of the string you want: {$1=$1"|population:"substr($1,2,3)}
Finally, print the entire record (with the changes, if applicable): the trailing 1 is an always-true pattern with the default action, which is shorthand for {print $0}; the empty {} before it does nothing.
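For completeness, the while loop the question asked for can also be made to work. The version below is a sketch, not the original poster's code: it reads from standard input, always passes printf a fixed format string with the data as quoted arguments, and uses case instead of [[ ]] so it also runs in plain sh. The function name add_population is made up.

```shell
#!/bin/sh
# add_population (hypothetical name): append "|population:XXX" to FASTA header
# lines, where XXX is the three characters that follow the leading ">".
add_population() {
    # "|| [ -n "$line" ]" also handles a last line without a trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            '>'*)
                id=$(printf '%s' "$line" | cut -c2-4)
                printf '%s|population:%s\n' "$line" "$id"
                ;;
            *)
                printf '%s\n' "$line"
                ;;
        esac
    done
}

fasta_out=$(printf '>TER1\nAGCATGCTAGC\n>URC2\nAGCATGCTACC\n' | add_population)
printf '%s\n' "$fasta_out"
```

On a real file it would be used as add_population < testfasta.fa > outfile.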

Comment out line, only if previous line contains matching string

Looking for a solution for a bash script using sed or awk to comment out a line, only if the previous line contains a matching string.
For example, a file containing:
...
if [ $V1 -gt 100 ]; then
some specific commands
else
some other specific commands
fi
...
I'd like to comment out the line containing else but ONLY if the previous line contains specific.
I've attempted piping multiple sed commands along with grep commands to no avail.
sed -E '/specific/{n;s/^([[:blank:]]*)else$/\1#else/}'
Output
...
if [ $V1 -gt 100 ]; then
some specific commands
#else
some other commands
fi
...
Explanation
/specific/ looks for a line containing the pattern specific.
n prints the current pattern space (unless -n is given) and then replaces it with the next line of input.
The s command checks whether that next line is else preceded by optional blanks; if so, it replaces the line with the same leading blanks followed by #else. The parentheses capture the leading blanks and \1 reuses them in the replacement.
-E enables extended regex.
Add -i to edit the actual file in place.
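To see that only an else directly after a matching line is touched, the command can be tried on a small input with two if blocks. This is just a demonstration wrapper: the function name comment_else_after is made up, it reads standard input, and it assumes the pattern contains no / character.

```shell
#!/bin/sh
# comment_else_after (hypothetical name): comment out an "else" line only when
# the line right before it matches the regex in $1.
comment_else_after() {
    sed -E "/$1/{n;s/^([[:blank:]]*)else\$/\1#else/;}"
}

else_out=$(printf '%s\n' \
    'if a; then' '  some specific commands' 'else' '  other' 'fi' \
    'if b; then' '  plain' 'else' '  other' 'fi' | comment_else_after 'specific')
printf '%s\n' "$else_out"
```

Only the first else comes out as #else; the second one follows a line without specific and is left as it is.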
You can use this awk solution:
awk '/specific/{p=NR} p && NR==p+1{p=0; if (/^[[:blank:]]*else/) $0 = "#" $0} 1' file
if [ $V1 -gt 100 ]; then
some specific commands
#else
some other commands
fi
In the first block, /specific/{p=NR}, we find specific and store the current line number in p.
The next block runs only for the line immediately after it, because of the NR==p+1 condition.
There we reset p=0 and, if that line starts with else after optional whitespace, comment it out.
