match repeated character in sed on mac - bash

I am trying to find all instances of 3 or more new lines and replace them with only 2 new lines (imagine a file with wayyy too much white space). I am using sed, but OK with an answer using awk or the like if that's easier.
Note: I'm on a Mac, so sed is slightly different from the one on Linux (BSD vs. GNU).
My actual goal is new lines, but I can't get it to work at all so for simplicity I'm trying to match 3 or more repetitions of bla and replace that with BLA.
Make an example file called stupid.txt:
$ cat stupid.txt
blablabla
$
My understanding is that you match i or more things using regex syntax thing{i,}.
I have tried variations of this to match the 3 blas with no luck:
cat stupid.txt | sed 's/bla{3,}/BLA/g' # simplest way
cat stupid.txt | sed 's/bla\{3,\}/BLA/g' # escape curly brackets
cat stupid.txt | sed -E 's/bla{3,}/BLA/g' # use extended regular expressions
cat stupid.txt | sed -E 's/bla\{3,\}/BLA/g' # use -E and escape brackets
Now I am out of ideas for what else to try!

thing{3,} matches thinggg: the quantifier applies only to the character right before it. Use (...) to group what the quantifier should apply to (write {3,} instead of {3} for three or more):
$ echo blablabla | sed -E 's/(bla){3}/BLA/g'
BLA
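To make the difference in quantifier scope concrete, here is a small sketch. The two input lines and the replacement markers are made up for illustration; -E is accepted by both BSD and GNU sed.

```shell
# Two hypothetical inputs: one with a repeated "a", one with a repeated "bla".
# The first expression only fires on a run of a's after "bl";
# the second only fires on a run of the whole group "bla".
scope_demo=$(printf 'blaaa\nblablabla\n' |
  sed -E -e 's/bla{3,}/[a-run]/' -e 's/(bla){3,}/[bla-run]/')
printf '%s\n' "$scope_demo"
# prints:
# [a-run]
# [bla-run]
```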

If slurping the whole file is acceptable:
perl -0777pe 's/(\n){3,}/\n\n/g' newlines.txt
Where you should replace \n with whatever newline sequence is appropriate.
-0777 tells perl to not break each line into its own record, which allows a regex that works across lines to function.
If you are satisfied with the result, -i causes perl to replace the file in-place rather than output to stdout:
perl -i -0777pe 's/(\n){3,}/\n\n/g' newlines.txt
You can also use -i~ to create a backup file with the given suffix (~ in this case).
If slurping the whole file is not acceptable:
perl -ne 'if (/^$/) {$i++}else{$i=0}print if $i<3' newlines.txt
This prints any line that is not the third (or higher) consecutive empty line. -i works the same way here.
P.S. macOS comes with perl installed.
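Since the question says awk would be fine too, the same line-by-line counting idea can be sketched in awk. This is only a sketch: the sample file and the variable name max are assumptions, not part of the answers above. With max=1 at most one empty line survives, which leaves two newlines in a row.

```shell
# Hypothetical sample file: "a", three empty lines, "b".
tmp=$(mktemp)
printf 'a\n\n\n\nb\n' > "$tmp"

# Count consecutive empty lines in n; skip an empty line once n exceeds max.
# Any non-empty line resets the counter.
squeezed=$(awk -v max=1 '
  /^$/  { if (++n > max) next }
  !/^$/ { n = 0 }
  { print }
' "$tmp")
printf '%s\n' "$squeezed"
rm -f "$tmp"
```

Setting max=2 would mirror the perl one-liner, which keeps up to two consecutive empty lines.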

sed -E 's/bla{3,}/BLA/g'
The above matches bl followed by three or more repetitions of a. This is not what you want. It appears that you actually want three or more repetitions of bla. If that is the case, then replace:
$ sed -E 's/bla{3,}/BLA/g' stupid.txt
blablabla
With:
$ sed -E 's/(bla){3,}/BLA/g' stupid.txt
BLA
The above, though, doesn't directly help with your task of replacing newlines because, by default, sed reads in only one line at a time.
Replacing newlines
Let's consider this file which has 3 newlines between the 1 and 3:
$ cat file.txt
1


3
To replace any occurrence of three or more newlines with a single newline:
$ sed -E 'H;1h;$!d;x; s/\n{3,}/\n/g' file.txt
1
3
How it works:
H;1h;$!d;x
This complex series of commands reads in the whole file. It is probably
simplest to think of this as an idiom. If you really want to know
the gory details:
H - Append current line to hold space
1h - If this is the first line, overwrite the hold space
with it
$!d - If this is not the last line, delete pattern space
and jump to the next line.
x - Exchange hold and pattern space to put whole file in
pattern space
s/\n{3,}/\n/g
This replaces all sequences of three or more newlines with a single newline.
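To see the H;1h;$!d;x idiom on its own, this sketch slurps a made-up three-line input and then turns the embedded newlines into commas. It uses y/\n/,/ instead of s so that nothing depends on how a particular sed treats \n in a replacement.

```shell
# H;1h;$!d;x leaves the whole input in the pattern space on the last line.
# y/\n/,/ then makes the embedded newlines visible as commas.
joined=$(printf 'a\nb\nc\n' | sed 'H;1h;$!d;x;y/\n/,/')
printf '%s\n' "$joined"
# prints: a,b,c
```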
Alternate
The above solution reads in the whole file at once. For large (gigabyte) files that could be a disadvantage. This alternate approach avoids that:
$ sed -E '/^$/{:a; N; /\n$/ba; s/\n{3,}([^\n]*)/\1/}' file.txt # GNU only
1
3
How it works:
/^$/{...}
This selects blank lines. For blank lines and only blank lines, the commands in braces are executed and they are:
:a
This defines a label a.
N
This reads in the next line from the file into the pattern space, separated from the previous by a newline.
/\n$/ba
If the last line read in is empty, branch (jump) to label a.
s/\n{3,}([^\n]*)/\1/
If we didn't branch, then this substitution is performed which removes the excess newlines.
BSD Version: I don't have a BSD system to test this on but I am guessing:
sed -E -e '/^$/{:a' -e N -e '/\n$/ba' -e 's/\n{3,}([^\n]*)/\1/}' file.txt

To keep only 2 newlines, you can try this sed
sed '
/^$/!b
N
/../b
h
:A
y/\n/#/
/^#$/!bB
s/#//
$bB
N
bA
:B
s/^#//
/./ {
x
G
b
}
g
' infile
/^$/!b If the line is not empty, branch to the end of the script (the line is printed); only an empty line goes on
N get a new line
/../b if this new line is not empty print the 2 lines
h keep the 2 empty lines in the hold buffer
:A label A
At this point there are always 2 lines in the pattern buffer and the first is empty
y/\n/#/ substitute \n by # (you can choose another char not present in your file)
/^#$/!bB If the second line is not empty jump to B
s/#// remove the #
$bB If it's the last line jump to B
At this point there is 1 empty line in the pattern space
N get the next line
bA jump to A
:B label B
s/^#// remove the # at the start of the line
/./ { If the last line is not empty
x exchange pattern and hold buffer
G add the hold buffer to the pattern space
b jump to end
}
g replace the pattern space (empty) by the hold space
print the pattern space

Related

How to refresh the line numbers in sed

How can you refresh the line numbers of a sed output inside the same sed command?
I have a sed script as follows -
#!/usr/bin/sed -f
/pattern/i #inserting a line
1~10i ####
What this does is that it inserts lines wherever the pattern is matched and then inserts #### every ten lines. The problem is that it inserts the hashes every 10 lines according to the line numbers of the original file before inserting the lines for the matching pattern. I want to refresh the line numbers after inserting the lines and use them for inserting the 4 hashes every 10 lines.
Is there any way this can be done without piping the output into a new sed?
Interesting challenge. If your file is not too large, the following may work for you (tested with GNU sed):
#!/usr/bin/sed -nEf
:a; N; $!ba
{
s/([^\n]*pattern[^\n]*\n)/#inserting a line\n\1/g
s/\n/ \n/g
s/\`/####\n/
:b
s/(.*####\n([^\n]* \n){9}[^\n]*) \n/\1\n####\n/
tb
s/ \n/\n/g
p
}
Explanations, line by line:
No print, extended RE mode (-nE).
Loop around label a to concatenate the whole file in the pattern space (reason why its size matters).
Add #inserting a line\n before each line containing pattern.
Add a space before all endline characters.
Insert ####\n before the first line.
Label b.
Append ####\n after any ####\n that is followed by 10 space-terminated lines, removing the final space (to prevent subsequent matches).
Goto b if there was a substitution.
Remove all spaces at the end of a line.
print.
Note: if your file does not contain NUL characters the -z option of GNU sed saves a few commands:
#!/usr/bin/sed -Ezf
s/([^\n]*pattern[^\n]*\n)/#inserting a line\n\1/g
s/\n/ \n/g
s/\`/####\n/
:a
s/(.*####\n([^\n]* \n){9}[^\n]*) \n/\1\n####\n/
ta
s/ \n/\n/g
Note: with the hold space we could probably do the same on the fly, instead of storing the whole file in the pattern space.
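As a point of comparison, counting the output lines on the fly is straightforward in awk. This is a different tool, not the sed-only solution asked for, and the function name emit and the test input are hypothetical. Unlike the sed scripts above, this sketch only prints a #### line when another output line follows it.

```shell
# Hypothetical input: the numbers 1..11 with line 3 replaced by the word "pattern".
numbered=$(seq 1 11 | sed 's/^3$/pattern/' | awk '
  # emit() prints one output line and counts it; before output lines
  # 1, 11, 21, ... (inserted lines included) it first prints ####.
  function emit(line) {
    if (count % 10 == 0) print "####"
    print line
    count++
  }
  /pattern/ { emit("#inserting a line") }
  { emit($0) }
')
printf '%s\n' "$numbered"
```

Here the second #### lands after ten emitted lines (1, 2, the inserted line, pattern, 4 through 9), so the inserted line is counted.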
This might work for you (GNU sed):
sed -zE 's/.*pattern/# insert line\n&/mg
s/([^\n]*\n){10}/&####\n/g
s/^/####\n/' file
Slurp the file into memory.
Insert desired text before lines containing pattern.
Insert #### every 10 lines and before the first line.

How to get all lines from a file after the last empty line?

Having a file like foo.txt with content
1
2
3
4
5
How do I get the lines starting with 4 and 5 out of it (everything after the last empty line), assuming the number of lines can vary?
Updated
Let's try a slightly simpler approach with just sed.
$: sed -n '/^$/{g;D;}; N; $p;' foo.txt
4
5
-n says don't print unless I tell you to.
/^$/{g;D;}; says on each blank line, clear it all out with this:
g : Replace the contents of the pattern space with the contents of the hold space. Since we never put anything in, this erases the (possibly long accumulated) pattern space. Note that I could have used z since this is GNU, but I wanted to break it out for non-GNU sed's below, and in this case this works for both.
D : remove the now empty line from the pattern space, and go read the next.
Now previously accumulated lines have been wiped if (and only if) we saw a blank line. The D loops back to the beginning, so N will never see a blank line.
N : Add a newline to the pattern space, then append the next line of input to the pattern space. This is done on every line except blanks, after which the pattern space will be empty.
This accumulates all nonblanks until either 1) a blank is hit, which will clear and restart the buffer as above, or 2) we reach EOF with a buffer intact.
Finally, $p says on the LAST line (which will already have been added to the pattern space unless the last line was blank, which will have removed the pattern space...), print the pattern space. The only time this will have nothing to print is if the last line of the file was a blank line.
So the whole logic boils down to: clean the buffer on empty lines, otherwise pile the non-empty lines up and print at the end.
If you don't have GNU sed, just put the commands on separate lines.
sed -n '
/^$/{
g
D
}
N
$p
' foo.txt
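The "clear the buffer on empty lines, otherwise accumulate, print at the end" logic can also be sketched in awk, which may be easier to follow. The variable name buf and the temporary sample file are assumptions; the file contents mirror foo.txt with one empty line after the 3.

```shell
# Hypothetical copy of foo.txt: 1, 2, 3, an empty line, 4, 5.
sample=$(mktemp)
printf '1\n2\n3\n\n4\n5\n' > "$sample"

# An empty line resets buf; any other line is appended to it.
# At the end buf holds only what followed the last empty line.
tail_block=$(awk '
  /^$/ { buf = ""; next }
  { buf = buf $0 "\n" }
  END { printf "%s", buf }
' "$sample")
printf '%s\n' "$tail_block"
rm -f "$sample"
# prints:
# 4
# 5
```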
Alternate
The method above is efficient, but could potentially build up a very large pattern buffer on certain data sets. If that's not an issue, go with it.
Or, if you want it in simple steps, don't mind more processes doing less work each, and prefer less memory consumed:
last=$( sed -n '/^$/=' foo.txt | tail -n 1 ) # find the last blank
next=$(( ${last:-0} + 1 )) # get the number of the line after
cmd="$next,\$p" # compose the range command to print
sed -n "$cmd" foo.txt # run it to print the range you wanted
This runs a lot of small, simple tasks outside of sed so that it can give sed the simplest, most direct and efficient description of the task possible. It will read the target file twice, but won't have to manage filling, flushing, and refilling the accumulation of data in the pattern buffer with records before a blank line. Still likely slower unless you are memory bound, I'd think.
Reverse the file, print everything up to the first blank line, reverse it again.
$ tac foo.txt | awk '/^$/{exit}1' | tac
4
5
Using GNU awk:
awk -v RS='\n\n' 'END{printf "%s",$0}' file
RS is the record separator set to empty line.
The END statement prints the last record.
try this:
tail -n +$(($(grep -nE '^$' test.txt | tail -n1 | sed -e 's/://g')+1)) test.txt
grep your input file for empty lines.
get last line with tail => 5:
remove unnecessary :
add 1 to 5 => 6
tail starting from 6
You can try with sed :
sed -n ':A;$bB;/^$/{x;s/.*//;x};H;n;bA;:B;H;x;s/^..//;p' infile
With GNU sed:
sed ':a;/$/{N;s/.*\n\n//;ba;}' file

Sed range and removing last matching line

I have this data (the lowercase lines are indented with leading whitespace):
One
two
three
Four
five
six
Seven
eight
And this command:
sed -n '/^Four$/,/^[^[:blank:]]/p'
I get the following output:
Four
five
six
Seven
How can I change this sed expression to not match the final line of the output? So the ideal output should be:
Four
five
six
I've tried many things involving exclamation points but haven't managed to get close to getting this working.
Use a "do..while()" loop:
sed -n '/^Four$/{:a;p;n;/^[[:blank:]]/ba}'
details:
/^Four$/ {
:a # define the label "a"
p # print the pattern-space
n # load the next line in the pattern space
/^[[:blank:]]/ba # if the pattern succeeds, go to label "a"
}
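A runnable sketch of this loop follows, assuming the lowercase lines in the data are indented by one space (the exact indentation is an assumption). The script is written one command per line because some sed implementations read everything after :a as part of the label.

```shell
# Hypothetical reconstruction of the data with one leading space on the lowercase lines.
block=$(printf 'One\n two\n three\nFour\n five\n six\nSeven\n eight\n' |
  sed -n '/^Four$/{
:a
p
n
/^[[:blank:]]/ba
}')
printf '%s\n' "$block"
# prints:
# Four
#  five
#  six
```

The line that ends the loop (Seven here) is read by n but never printed, and it is also not tested against /^Four$/ again.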
You may pipe to another sed and skip last line:
sed -n '/^Four$/,/^[^[:blank:]]/p' file | sed '$d'
Four
five
six
Alternatively you may use:
sed -n '/^Four$/,/^[^[:blank:]]/{/^Four$/p; /^[^[:blank:]]/!p;}' file
You're using the wrong tool. sed is for doing s/old/new, that is all. Just use awk:
$ awk '/^[^[:blank:]]/{f=/^Four$/} f' file
Four
five
six
How it works: Every time it finds a line that doesn't start with spaces (/^[^[:blank:]]/) it sets a flag f (for "found") to 1 if that line is exactly Four and 0 otherwise (f=/^Four$/). Whenever f is non-zero, that is interpreted as a true condition and so invokes awk's default behavior, which is to print the current line. So when it hits a block starting with Four it prints every line in that block because f is 1/true, and for every other block it doesn't print since f is 0/false.
Following awk may help you here.
awk '!/^ /{flag=""} /Four/{flag=1} flag' Input_file
Output will be as follows.
Four
five
six
Also, in case you need to save the output into Input_file itself, append > temp_file && mv temp_file Input_file to the above code.
grep -Pzo '\n\KFour\n(\s.+\n)+' input.txt
Output
Four
five
six
This might work for you (GNU sed):
sed '/^Four/{:a;n;/^\s/ba};d' file
If the line begins with Four print it and any following lines beginning with a space.
Another way:
sed '/^\S/h;G;/^Four/MP;d' file
If a line begins with a non-space, copy it to the hold space (HS). Append the HS to each line and if either line begins with Four print the first line and delete the rest. This will delete all lines other than the section beginning with Four.

What is the meaning of "0,/xxx" in sed?

A sed command is used in a script as follows:
sed -i "0,/^ENABLE_DEBUG.*/s/^ENABLE_DEBUG.*/ENABLE_DEBUG = YES/" MakeConfig
I know that
s/^ENABLE_DEBUG.*/ENABLE_DEBUG = YES/
replaces any line that starts with
ENABLE_DEBUG with ENABLE_DEBUG = YES
But I have no idea about the meaning of
0,/^ENABLE_DEBUG.*/
Can anyone help me?
0,/^ENABLE_DEBUG.*/ means that the substitution will only occur on lines from the beginning, 0, to the first line that matches /^ENABLE_DEBUG.*/. No substitution will be made on subsequent lines even if they match /^ENABLE_DEBUG.*/
Other examples of ranges
This will substitute only on lines 2 through 5:
sed '2,5 s/old/new/'
This will substitute from line 2 to the first line after it which includes something:
sed '2,/something/ s/old/new/'
This will substitute from the first line that contains something to the end of the file:
sed '/something/,$ s/old/new/'
POSIX vs. GNU ranges: the meaning of line "0"
Consider this test file:
$ cat test.txt
one
two
one
three
Now, let's apply sed over the range 1,/one/:
$ sed '1,/one/ s/one/Hello/' test.txt
Hello
two
Hello
three
The range starts with line 1 and ends with the first line after line 1 that matches one. Thus two substitutions are made above.
Suppose that we only wanted the first one replaced. With POSIX sed, this cannot be done with ranges. As NeronLeVelu points out, GNU sed offers an extension for this case: it allows us to specify the range as 0,/one/. This range ends with the first occurrence of one in the file:
$ sed '0,/one/ s/one/Hello/' test.txt
Hello
two
one
three
Thus, the range 0,/^ENABLE_DEBUG/ ends with the first line that begins with ENABLE_DEBUG even if that line is the first line. This requires GNU sed.
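Where GNU sed is not available, one portable way to change only the first matching line is awk. This is a sketch of an alternative tool, not a sed range, and the flag name done is an assumption. The input mirrors test.txt from above.

```shell
# Substitute on the first line that matches, then set a flag so later matches are left alone.
first_only=$(printf 'one\ntwo\none\nthree\n' |
  awk '!done && /one/ { sub(/one/, "Hello"); done = 1 } { print }')
printf '%s\n' "$first_only"
# prints:
# Hello
# two
# one
# three
```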

Empty regular expression in sed script

I found the following sed script to reverse the characters in each line, from the famous "sed one liners", and I am not able to follow the //D command in the script
sed '/\n/!G;s/\(.\)\(.*\n\)/&\2\1/;//D;s/.//'
Suppose the initial file had two lines to start with, say,
apple
banana
After the first command,
/\n/!G
pattern space would be,
apple
banana
[a new line introduced after each line. Code tag removing the last new line here. So it is not shown].
After the second command,
s/\(.\)\(.*\n\)/&\2\1/
pattern space would be,
apple
pple
a
banana
anana
b
How does the third command work after this? Also, I understand empty regular expression(//) matches the previously matched regexp. But in this case, what that will be? \n from the 1st command or the regexp substituted by the 2nd command? Any help would be much appreciated. Thanks.
Using the suggestion from my own comment above
this is what happens:
After /\n/!G pattern space would be
apple¶
banana¶
After s/\(.\)\(.*\n\)/&\2\1/ pattern space would be
apple¶pple¶a
banana¶anana¶b
then comes the D command. from man sed:
D Delete up to the first embedded newline in the pattern space.
Start next cycle, but skip reading from the input if there is
still data in the pattern space.
so the first word and the first ¶ are deleted. Then sed starts again from the
1st command, but since the pattern space contains a ¶, the address /\n/
matches and the negated command /\n/!G is not executed.
The 2nd command leads to
pple¶ple¶pa
anana¶nana¶ab
can you continue from there?
D means: delete the first line (up to the first \n) and restart the cycle without reading new input if there is still something in the buffer
// is a shortcut for the previously used pattern (it reuses the last pattern to search for)
$ echo "123" | sed -n 's/2/other/;// p'
$
No match (because the substitution changed the content that the pattern would have matched)
$ echo "123" | sed -n 's/.2/&still/;// p'
12still3
$
Pattern .2 is found also when // p is used because it is the equivalent to /.2/ p
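Applied to the script from the question, the empty regex in //D reuses the last regular expression that was applied at run time, which at that point is the one from the s command and not the /\n/ address. Running the one-liner on the two sample lines gives the following; this sketch assumes GNU sed, where the semicolon-separated form is accepted.

```shell
# //D repeats the s command's regex: while it still matches, D drops the first
# line of the pattern space and the script restarts without reading new input.
# Once it no longer matches, s/.// removes the leading newline and the line is printed.
reversed=$(printf 'apple\nbanana\n' |
  sed '/\n/!G;s/\(.\)\(.*\n\)/&\2\1/;//D;s/.//')
printf '%s\n' "$reversed"
# prints:
# elppa
# ananab
```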
