Replace text in a file if a previous line matches another text - macOS

My file looks like this:
FooBarA
foo bar
foo = bar
FooBarB
foo bar
foo = bar
FooBarC
foo bar
foo = bar
...
What I would like to do is to write a script that replaces the bar in foo = bar but only if it belongs to FooBarB. So in the example above only the second bar out of all foo = bar lines should be replaced.
I've played around with sed but I just can't get it done right. I would also like to avoid installing any tools that aren't necessarily pre-installed on the system (I'm on Mac OS), since the script will be used by other team members too.

One way to do it with sed (tested with macOS's sed and GNU sed) would be this:
replace.sed
#!/usr/bin/env sed -Ef
/FooBarB/,/^FooBar/ {
s/(foo[[:space:]]*=[[:space:]]*).+/\1new-value/
}
Here's what it does:
/FooBarB/,/^FooBar/ matches a range of lines where the first line matches the regex /FooBarB/ and the last line matches the regex /^FooBar/ (which is the start of the next "group"). The comma between the two regexes is the syntax for range matching in sed.
s/(foo[[:space:]]*=[[:space:]]*).+/\1new-value/ — [s]ubstitutes (in the matched range of lines) whatever matches the regex (foo[[:space:]]*=[[:space:]]*).+ with \1new-value, where \1 references the first capturing group in the search regex. The search regex looks for foo followed by optional whitespace, followed by an = sign, followed again by whitespace and then whatever else is there, which in your case is the old value.
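For reference, here is what a run might look like, assuming the sample from the question is saved as temp.md (the same file name used in the one-liner below) and the script is invoked through sed -E -f rather than relying on the shebang:
$ sed -E -f replace.sed temp.md
FooBarA
foo bar
foo = bar
FooBarB
foo bar
foo = new-value
FooBarC
foo bar
foo = bar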
You could do it all in just one line, but I wanted to show a version that's a bit more digestible (as far as sed goes, in any case):
sed -E '/FooBarB/,/^FooBar/s/(foo[[:space:]]*=[[:space:]]*).+/\1new-value/' temp.md

This might work for you (GNU sed):
sed '/FooBarB/{:a;n;/^$/b;/foo = bar/!ba;s//foo = baz/}' file
Match on the string FooBarB and start a loop.
Fetch the next line and study it.
If the line is empty the stanza is done, so break out of the loop.
If the line does not contain the string foo = bar, fetch the next line and continue the loop.
Otherwise, substitute the new value for bar and finish the loop.
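As a rough check (GNU sed; the semicolon-separated label form is a GNU-ism, see the alternative below), assuming the sample from the question is saved as file:
$ sed '/FooBarB/{:a;n;/^$/b;/foo = bar/!ba;s//foo = baz/}' file
FooBarA
foo bar
foo = bar
FooBarB
foo bar
foo = baz
FooBarC
foo bar
foo = bar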
Alternative (which may work for macOS users?):
sed -e '/FooBarB/{:a' -e 'n;/^$/b;/foo = bar/!ba;s//foo = baz/;}' file
Since the OP changed the input data in the question, here is another solution. It copies the most recent FooBar heading into the hold space (h), appends that heading to every line (G), performs the substitution only when the pattern space contains FooBarB (i.e. the current group is FooBarB), then prints the original, possibly modified, line (P) and starts the next cycle (d):
sed '/FooBar/h;G;/FooBarB/s/foo = bar/foo = baz/;P;d' file

Using any awk in any shell on every Unix box:
$ awk -v tgt='FooBarB' -v val='whatever' '
NF==1{tag=$0} (NF>1) && (tag==tgt) && sub(/=.*/,"= "){$0=$0 val}
1' file
FooBarA
foo bar
foo = bar
FooBarB
foo bar
foo = whatever
FooBarC
foo bar
foo = bar

For reference, the GNU awk variant:
awk -v v="newvalue" 'BEGIN{FS=OFS="\n";RS=ORS="\n\n"}$1=="FooBarB"{$3="foo = " v}1' file
By using the option -v, the variable v holds the wanted string.
The BEGIN block sets the input and output field separators to a single newline, and the input and output record separators to two newlines (a blank line between blocks).
That way each record is the whole block of lines starting with FooBar[ABC].
The last statement sets the new value by rewriting the third line.
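For illustration, this relies on the blocks being separated by blank lines (that is what the two-newline record separator matches); assuming such an input is saved as file:
$ awk -v v="newvalue" 'BEGIN{FS=OFS="\n";RS=ORS="\n\n"}$1=="FooBarB"{$3="foo = " v}1' file
FooBarA
foo bar
foo = bar

FooBarB
foo bar
foo = newvalue

FooBarC
foo bar
foo = bar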

Related

Speed up searching a large file using sed or an alternative

I have several large files in which I need to find a specific string and take everything between the line which contains the string and the next date at the beginning of a line. This file looks like this:
20220520-11:53:01.242: foofoobar
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: blablabla
20220520-11:53:01.257: bla this_i_need bla
20220520-11:53:01.258: barbarfooo
The output I need is this:
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: bla this_i_need bla
Now I'm using sed '/'"$string"'/,/'"$date"'/!d' which works as intended except it also takes the next row with the date even if it doesn't contain the string, but it's not a big problem.
The problem is that it takes a really long time searching the files.
Is it possible to edit the sed command so it will run faster or is there any other option to get a better runtime? Maybe using awk or grep?
EDIT: I forgot to add that the expected results occur multiple times in one file, so exiting after one match is not suitable. I am looping through multiple files in a for loop with the same $string and same $date. There are a lot of factors slowing the script down that I can't change (extracting files one by one from a 7z, searching and removing them after search in one loop).
Using sed you might use:
sed -n '/this_i_need/{:a;N;/\n20220520/!ba;p;q}' file
Explanation
-n Prevent default printing of a line
/this_i_need/ When matching this_i_need
:a Set a label a to be able to jump back to
N pull the next line into the pattern space
/\n20220520/! If not matching a newline followed by the date
ba Jump back to the label (like a loop and process what is after the label again)
p When we do match a newline and the date, then print the pattern space
q Exit sed
Output
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: blablabla
With sed it has to delete all the lines outside the matching ranges from the buffer, which is inefficient when the file is large.
You can instead use awk to output the desired lines directly by setting a flag upon matching the specific string and clearing the flag when matching a date pattern, and outputting the line when the flag is set:
awk '/[0-9]{8}/{f=0}/this_i_need/{f=1}f' file
Demo: https://ideone.com/J2ISVD
You might use the exit statement to instruct GNU AWK to stop processing, which should give a speed gain if the lines you are looking for end well before the end of the file. Let file.txt content be
20220520-11:53:01.242: foofoobar
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: blablabla
20220520-11:53:01.257: bla this_i_need bla
20220520-11:53:01.258: barbarfooo
then
awk 's&&/^[[:digit:]]{8}.*this_i_need/{print;exit}/this_i_need/{p=1;s=1;next}p&&/^[[:digit:]]{8}/{p=0}p{print}' file.txt
gives output
what
to
do
20220520-11:53:01.257: bla this_i_need bla
Explanation: I use two flag variables, p for printing and s for seen. I inform GNU AWK to
print current line and exit if seen and line starts with 8 digits followed by 0 or more any characters followed by this_i_need
set p flag to 1 (true) and s flag to 1 (true) and go to next line if this_i_need was found in line
set p flag to 0 (false) if p flag is 1 and line starts with 8 digits
print current line if p flag is set to 1
Note that order of actions is crucial.
Disclaimer: this solution assumes that any line starting with 8 digits is a line beginning with a date; if this is not the case, adjust the regular expression according to your needs.
(tested in gawk 4.2.1)
Assumptions:
start printing when we find the desired string
stop printing when we read a line that starts with any date (ie, any 8-digit string)
One awk idea:
string='this_i_need'
awk -v ptn="${string}" ' # pass bash variable "$string" in as awk variable "ptn"
/^[0-9]{8}/ { printme=0 } # clear printme flag if line starts with 8-digit string
$0 ~ ptn { printme=1 } # set printme flag if we find "ptn" in the current line
printme # only print current line if printme==1
' foo.dat
Or as a one-liner sans comments:
awk -v ptn="${pattern}" '/^[0-9]{8}/ {printme=0} $0~ptn {printme=1} printme' foo.dat
NOTE: OP can rename the awk variables (ptn, printme) as desired as long as they are not a reserved keyword (see 'Keyword' in awk glossary)
This generates:
20220520-11:53:01.244: foo_bar blah: this_i_need
what
to
do
20220520-11:53:01.257: bla this_i_need bla

How to replace a whole line (between 2 words) using sed?

Suppose I have text as:
This is a sample text.
I have 2 sentences.
text is present there.
I need to replace whole text between two 'text' words. The required solution should be
This is a sample text.
I have new sentences.
text is present there.
I tried using the below command but its not working:
sed -i 's/text.*?text/text\
\nI have new sentence/g' file.txt
With your shown samples, please try the following. sed doesn't support lazy matching in regex. With awk's RS you could do the substitution for your shown samples only. You need to create a variable val which holds the new value; then a simple substitution in awk will do the rest to get your expected output.
awk -v val="your_new_line_Value" -v RS="" '
{
sub(/text\.\n*[^\n]*\n*text/,"text.\n"val"\ntext")
}
1
' Input_file
The above code prints its output to the terminal. Once you are happy with the results and want to save the output into Input_file itself, try the following code.
awk -v val="your_new_line_Value" -v RS="" '
{
sub(/text\.\n*[^\n]*\n*text/,"text.\n"val"\ntext")
}
1
' Input_file > temp && mv temp Input_file
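As a sketch of the expected result, assuming the three sample lines from the question are saved as Input_file and val is set to the wanted sentence:
$ awk -v val="I have new sentences." -v RS="" '{sub(/text\.\n*[^\n]*\n*text/,"text.\n"val"\ntext")} 1' Input_file
This is a sample text.
I have new sentences.
text is present there.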
You have already solved your problem using awk, but in case anyone else will be looking for a sed solution in the future, here's a sed script that does what you needed. Granted, the script is using some advanced sed features, but that's the fun part of it :)
replace.sed
#!/usr/bin/env sed -nEf
# This pattern determines the start marker for the range of lines where we
# want to perform the substitution. In our case the pattern is any line that
# ends with "text." — the `$` symbol meaning end-of-line.
/text\.$/ {
# [p]rint the start-marker line.
p
# Next, we'll read lines (using `n`) in a loop, so mark this point in
# the script as the beginning of the loop using a label called `loop`.
:loop
# Read the next line.
n
# If the last read line doesn't match the pattern for the end marker,
# just continue looping by [b]ranching to the `:loop` label.
/^text/! {
b loop
}
# If the last read line matches the end marker pattern, then just insert
# the text we want and print the last read line. The net effect is that
# all the previous read lines will be replaced by the inserted text.
/^text/ {
# Insert the replacement text
i\
I have a new sentence.
# [print] the end-marker line
p
}
# Exit the script, so that we don't hit the [p]rint command below.
b
}
# Print all other lines.
p
Usage
$ cat lines.txt
foo
This is a sample text.
I have many sentences.
I have many sentences.
I have many sentences.
I have many sentences.
text is present there.
bar
$
$ ./replace.sed lines.txt
foo
This is a sample text.
I have a new sentence.
text is present there.
bar
Substitute
sed -i 's/I have 2 sentences./I have new sentences./g'
sed -i 's/[A-Z]\s[a-z].*/I have new sentences./g'
Insert
sed -i -e '2iI have new sentences.' -e '2d'
I need to replace whole text between two 'text' words.
If I understand correctly, the first text. (with a dot) is at the end of the first line and the second text is at the beginning of the third line. With awk you can get the required solution by accumulating values in the variable s:
awk -v s='\nI have new sentences.\n' '/text.?$/ {s=$0 s;next} /^text/ {s=s $0;print s;s=""}' file
This is a sample text.
I have new sentences.
text is present there.

Removing lines between tags in a text file

I have many text files containing annotations. The original text is marked with lines containing the words:
START OF TEXT OF PASSAGE 1
END OF TEXT OF PASSAGE 1
Obviously I can search each document for the phrase START OF TEXT and delete everything up to it. Then search for END OF TEXT and start selecting text for deletion until I get to the next START OF TEXT.
I have come up with this design so far:
#!/bin/bash
a="START OF PROJECT"
b="END OF PROJECT"
while read line; do
if line contains a; do
while read line; do
'if line does not contain b'
'append the line to output.txt'; fi
done
done
fi
done
Perhaps there is an easier way using sed, awk, grep and pipes?
'for every document' 'loop through it doing this' ('find the original text between START and END' | >> output.txt)
Unfortunately I am poor at bash and ignorant of sed/awk.
The reason for this is that I am assembling a huge text document that is a concatenation of thousands of marked up documents – each of which contains some annotated passages.
In Python:
import re
with open('in.txt') as f, open('out.txt', 'w') as output:
    output.write('\n'.join(re.findall(r'START OF TEXT(.*?)END OF TEXT', f.read(), flags=re.DOTALL)))
This reads the input, searches for all matches that begin and end with the necessary markers, captures the text of interest in a group, joins all those groups on a linefeed, and writes that to the result file.
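Since each passage spans several lines, the re.DOTALL flag is what lets . match newlines; a quick check from the shell with a made-up two-line passage shows the difference:
$ python3 <<'EOF'
import re
text = "START OF TEXT\nfoo\nbar\nEND OF TEXT\n"
# without re.DOTALL, '.' stops at newlines and nothing is found
print(re.findall(r'START OF TEXT(.*?)END OF TEXT', text))
print(re.findall(r'START OF TEXT(.*?)END OF TEXT', text, flags=re.DOTALL))
EOF
[]
['\nfoo\nbar\n']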
Pretty easy to do with awk. You would create a script (I'll call it yank.awk) containing this:
#!/usr/bin/awk -f
/START OF PROJECT/ { capture = 1; next }
/END OF PROJECT/ { capture = 0 }
capture == 1 { print }
and then run it like so (after making it executable with chmod +x yank.awk):
./yank.awk in.txt > output.txt
Could also do with sed and grep:
sed -ne '/START OF PROJECT/,/END OF PROJECT/p' in.txt | grep -vE '(START|END) OF PROJECT' > output.txt
(Another Python solution)
You can have itertools.groupby group lines together based on a boolean value - just use a global flag to keep track of whether you are in a block or not, and then use groupby to group the lines that are in or out of blocks. Then just discard the ones that are not blocks:
sample_lines = """
lskdjflsdkjf
sldkjfsdlkjf
START OF TEXT
Asdlkfjlsdkfj
Bsldkjf
Clsdkjf
END OF TEXT
sldkfjlsdkjf
sdlkjfdklsjf
sdlkfjdlskjf
START OF TEXT
Dsdlkfjlsdkfj
Esldkjf
Flsdkjf
END OF TEXT
sldkfjlsdkjf
sdlkjfdklsjf
sdlkfjdlskjf
""".splitlines()
from itertools import groupby
in_block = False
def is_in_block(line):
    global in_block
    if line.startswith("END OF TEXT"):
        in_block = False
    ret = in_block
    if line.startswith("START OF TEXT"):
        in_block = True
    return ret
for lines_are_text, lines in groupby(sample_lines, key=is_in_block):
    if lines_are_text:
        print(list(lines))
gives:
['Asdlkfjlsdkfj', 'Bsldkjf', 'Clsdkjf']
['Dsdlkfjlsdkfj', 'Esldkjf', 'Flsdkjf']
See that first group has the lines that start with A, B, and C, and the second group is made up of those lines starting with D, E, and F.
It sounds like the specific solution you need is:
awk '/END OF TEXT OF PASSAGE/{f=0} f; /START OF TEXT OF PASSAGE/{f=1}' file
See https://stackoverflow.com/a/18409469/1745001 for other ways to select text from files.
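As a quick sanity check on a made-up file with two passages (contents invented for illustration):
$ cat file
some annotation
START OF TEXT OF PASSAGE 1
first passage text
END OF TEXT OF PASSAGE 1
more annotation
START OF TEXT OF PASSAGE 2
second passage text
END OF TEXT OF PASSAGE 2
$ awk '/END OF TEXT OF PASSAGE/{f=0} f; /START OF TEXT OF PASSAGE/{f=1}' file
first passage text
second passage text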
Use Perl's Flip-Flop Operator to Print Text Between Markers
Given a corpus like:
START OF TEXT OF PASSAGE 1
foo
END OF TEXT OF PASSAGE 1
START OF TEXT OF PASSAGE 2
bar
END OF TEXT OF PASSAGE 2
you can use the Perl flip-flop operator to process within a range of lines. For example, from the shell prompt:
$ perl -ne 'if (/^START OF TEXT/ ... /^END OF TEXT/) {
next if /^(?:START|END)/;
print;
}' /tmp/corpus
foo
bar
Basically, this short Perl script loops through your input. When it finds your start and end tags, it throws away the tags themselves and prints everything else in between.
Usage Notes
The line breaks between passages in the corpus are for readability. It doesn't matter if your real corpus has no line breaks between passages, so long as the text markers always start at the beginning of the line as shown in your original post. If that assumption doesn't hold true, then you will need to adjust the regular expressions used to identify the start and end of your passages.
You can pass multiple files to the Perl script. Again, it makes no practical difference as long as you don't exceed the length limit of your shell.
If you want the final output to go to somewhere other than standard output, just use shell redirection. For example:
perl -ne 'if (/^START OF TEXT/ ... /^END OF TEXT/) {
next if /^(?:START|END)/;
print;
}' /tmp/file1 /tmp/file2 /tmp/file3 > /tmp/output
You can use sed as follows:
sed -n '/^START OF TEXT/,/^END OF TEXT/{/^\(START\|END\) OF TEXT/!p}' infile
or, with extended regular expressions (-r):
sed -rn '/^START OF TEXT/,/^END OF TEXT/{/^(START|END) OF TEXT/!p}' infile
-n prevents sed from printing as a default. The rest works as follows:
/^START OF TEXT/,/^END OF TEXT/ { # For lines between these two matches
/^\(START\|END\) OF TEXT/!p # If the line does NOT match, print it
}
This works with GNU sed and might require some tweaking to run with other seds.

SED incorrectly replaces only the first instance of a pattern on a line

Hello: I have tab separated data of the form
customer-item description-purchase price-category
e.g. a.out contains:
1\t400 Bananas\t3.00\tfruit
2\t60 Oranges\t0.00\tfruit
3\tNULL\t3.0\tfruit
4\tCarrots\tNULL\tfruit
5\tNULL\tNULL\tfruit
I'm attempting to get rid of all the NULL fields. I can't rely on the simple replacement of the string "NULL" as it may be a substring; so I am attempting
sed -i 's:\tNULL\t:\t\t:g' a.out
when I do this, I end up with
1\t400 Bananas\t3.00\tfruit
2\t60 Oranges\t0.00\tfruit
3\t\t3.0\tfruit
4\tCarrots\t\tfruit
5\t\tNULL\tfruit
what's wrong here is that #5 has only suffered a replacement of the first instance of the search string on each line.
If I run my sed command twice, I end up with the result I want:
1\t400 Bananas\t3.00\tfruit
2\t60 Oranges\t0.00\tfruit
3\t\t3.0\tfruit
4\tCarrots\t\tfruit
5\t\t\tfruit
where you can see that line 5 has both of the NULLs removed
But I don't understand why I'm suffering this?
awk -F'\t' -v OFS='\t' '{
    for (i = 1; i <= NF; ++i) {
        if ($i == "NULL") {
            $i = "";
        }
    }
    print
}' test.txt
The straightforward solution is to use \t as a field separator and then loop over all of the fields looking for an exact match of "NULL". No substringing.
Here's the same thing as a one liner:
awk -F'\t' -v OFS='\t' '{for(i=1;i<=NF;++i) if($i=="NULL") $i=""} 1' test.txt
Since tabs can't appear inside your field values (a tab would start a new field), you might be able to do what you want simply by doing this:
sed ':start ; s/\tNULL\(\t\|$\)/\t\1/ ; t start' a.out
First, the inner part s/\tNULL\(\t\|$\)/\t\1/ searches for a tab, NULL, and then either a tab or end of line ($), and replaces that with a tab followed by whatever did appear after NULL (this last part is done using \1). This also explains why a single s///g pass misses the second NULL on line 5: substitutions cannot overlap, so the first match consumes the tab between the two NULLs and the second NULL is left without a leading tab to match. We'll call that whole substitution expression.
We now have:
sed ':start ; expression ; t start' a.out
This is effectively a loop (like goto). :start is a label. ; acts as a statement delimiter. I have described what expression does above. t start says that IF the expression did any substitution that a jump will be made to label start. The buffer will contain the substituted text. This loop occurs until no substitution can be done on the line and then processing continues.
Information on sed flow control and other useful tidbits can be found here
awk makes it simpler:
awk -F '\tNULL\\>' -v OFS='\t' '{$1=$1}1' file
1\t400 Bananas\t3.00\tfruit
2\t60 Oranges\t0.00\tfruit
3\t\t3.0\tfruit
4\tCarrots\t\tfruit
5\t\t\tfruit
From grep(1) on a recent Linux:
The Backslash Character and Special Expressions
The symbols \< and \> respectively match the empty string at the
beginning and end of a word. The symbol \b matches the empty string at
the edge of a word [...]
--
So, how about:
sed -i 's:\<NULL\>::g' a.out

Delete lines before and after a match in bash (with sed or awk)?

I'm trying to delete two lines either side of a pattern match from a file full of transactions. I.e. find the match, then delete the two lines before it, the two lines after it, and then the match itself. Then write this back to the original file.
So the input data is
D28/10/2011
T-3.48
PINITIAL BALANCE
M
^
and my pattern is
sed -i '/PINITIAL BALANCE/,+2d' test.txt
However this is only deleting two lines after the pattern match and then deleting the pattern match. I can't work out any logical way to delete all 5 lines of data from the original file using sed.
an awk one-liner may do the job:
awk '/PINITIAL BALANCE/{for(x=NR-2;x<=NR+2;x++)d[x];}{a[NR]=$0}END{for(i=1;i<=NR;i++)if(!(i in d))print a[i]}' file
test:
kent$ cat file
######
foo
D28/10/2011
T-3.48
PINITIAL BALANCE
M
x
bar
######
this line will be kept
here
comes
PINITIAL BALANCE
again
blah
this line will be kept too
########
kent$ awk '/PINITIAL BALANCE/{for(x=NR-2;x<=NR+2;x++)d[x];}{a[NR]=$0}END{for(i=1;i<=NR;i++)if(!(i in d))print a[i]}' file
######
foo
bar
######
this line will be kept
this line will be kept too
########
Here is some explanation:
awk '/PINITIAL BALANCE/{for(x=NR-2;x<=NR+2;x++)d[x];} #if match found, add the line and +- 2 lines' line number in an array "d"
{a[NR]=$0} # save all lines in an array with line number as index
END{for(i=1;i<=NR;i++)if(!(i in d))print a[i]}' #finally print only those index not in array "d"
file # your input file
sed will do it:
sed '/\n/!N;/\n.*\n/!N;/\n.*\n.*PINITIAL BALANCE/{$d;N;N;d};P;D'
It works this way:
if sed has only one line in the pattern space, it appends another one
if there are only two, it appends a third one
if the pattern space matches LINE + LINE + LINE-containing-BALANCE, it appends the two following lines, deletes them all and goes back to the beginning
if not, it prints the first line of the pattern space, deletes it and goes back to the beginning without wiping the rest of the pattern space
To prevent the appearance of pattern on the first string you should modify the script:
sed '1{/PINITIAL BALANCE/{N;N;d}};/\n/!N;/\n.*\n/!N;/\n.*\n.*PINITIAL BALANCE/{$d;N;N;d};P;D'
However, it fails if another PINITIAL BALANCE appears within the lines that are going to be deleted. Then again, the other solutions fail in that case too =)
For such a task, I would probably reach for a more advanced tool like Perl:
perl -ne 'push @x, $_;
    if (@x > 4) {
        if ($x[2] =~ /PINITIAL BALANCE/) { undef @x }
        else { print shift @x }
    }
    END { print @x }' input-file > output-file
This will remove 5 lines from the input file. These lines will be the 2 lines before the match, the matched line, and the two lines afterwards. You can change the total number of lines being removed by modifying @x > 4 (this removes 5 lines) and the line being matched by modifying $x[2] (this makes the match on the third line to be removed and so removes the two lines before the match).
A more simple and easy to understand solution might be:
awk '/PINITIAL BALANCE/ {print NR-2 "," NR+2 "d"}' input_filename \
| sed -f - input_filename > output_filename
awk is used to make a sed script that deletes the lines in question, and the result is written to output_filename.
This uses two processes which might be less efficient than the other answers.
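To make the mechanics concrete: the awk pass only emits a tiny sed script, which sed -f - then reads from standard input. For example, if the only match were on line 5 of input_filename (a made-up position, not real data), the intermediate script would be:
$ awk '/PINITIAL BALANCE/ {print NR-2 "," NR+2 "d"}' input_filename
3,7d
sed -f - input_filename then deletes lines 3 through 7, i.e. the match plus the two lines on either side.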
This might work for you (GNU sed):
sed ':a;$q;N;s/\n/&/2;Ta;/\nPINITIAL BALANCE$/!{P;D};$q;N;$q;N;d' file
save this code into a file grep.sed
H
s:.*::
x
s:^\n::
:r
/PINITIAL BALANCE/ {
N
N
d
}
/.*\n.*\n/ {
P
D
}
x
d
and run a command like this:
`sed -i -f grep.sed FILE`
You can also use it as a one-liner:
sed -i 'H;s:.*::;x;s:^\n::;:r;/PINITIAL BALANCE/{N;N;d;};/.*\n.*\n/{P;D;};x;d' FILE
