Merge two blank lines into one - bash

I am looking for a way to turn File-A into File-B, which requires merging two consecutive blank lines into one.
File-A:
// Comment 1
// Comment 2
// Comment 3
// Comment 4
// Comment 5
File-B:
// Comment 1
// Comment 2
// Comment 3
// Comment 4
// Comment 5
From this post, I know how to delete empty lines; what I am wondering is how to merge two consecutive blank lines into one.
PS: "blank" means the line may be empty OR may contain only tabs or spaces.

sed -r 's/^\s+$//' infile | cat -s > outfile
sed strips the whitespace from any whitespace-only line, leaving it empty. The -s option to cat then squeezes consecutive empty lines into one.
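Both -r and \s are GNU sed extensions. As a rough sketch, the same pipeline can be written with a POSIX character class instead. The temporary sample file and the variable names below are made up for illustration, and cat -s itself is not in POSIX, although GNU and BSD cat both provide it.

```shell
# Build a small sample: two comments separated by three "blank" lines,
# one of which holds a space and a tab.
sample=$(mktemp)
printf '// Comment 1\n\n \t\n\n// Comment 2\n' > "$sample"

# Empty out whitespace-only lines, then let cat -s squeeze the run.
squeezed=$(sed 's/^[[:space:]]*$//' "$sample" | cat -s)
printf '%s\n' "$squeezed"

rm -f "$sample"
```

The three blank lines between the two comments should come out as a single empty line.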

This might work for you (GNU sed):
sed '$!N;s/^\s*\n\s*$//;P;D' file
This will convert 2 blank lines into one.
If you want to squeeze multiple blank lines into one:
sed ':a;$!N;s/^\s*\n\s*$//;ta;P;D' file
On reflection a far simpler solution is:
sed ':a;N;s/\n\s*$//;ta' file
Which squeezes one or more blank lines to a single blank line.
An even easier solution uses the range condition:
sed '/\S/,/^\s*$/!d' file
This deletes any blank lines other than those following a non-blank line.
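For seds that lack \S and \s, the range idea might be sketched with POSIX bracket expressions. The function name squeeze_range is hypothetical, and the sample input is invented for the demonstration.

```shell
# Hypothetical helper: squeeze runs of blank (empty or whitespace-only)
# lines down to one, using only POSIX character classes.
squeeze_range() {
  # Every line from a non-blank line through the next blank line is kept;
  # anything outside such a range (the extra blank lines) is deleted.
  sed '/[^[:space:]]/,/^[[:space:]]*$/!d' "$@"
}

# Demo: one run of two blank lines and one run of three.
printf 'a\n\n \t\nb\n\n\n\nc\n' | squeeze_range
```

Note that blank lines before the first non-blank line fall outside any range, so they are removed entirely.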

Here is a simple solution with awk:
awk '!NF && !a++; NF {print;a=0}' file
// Comment 1
// Comment 2
// Comment 3
// Comment 4
// Comment 5
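NF is the number of fields on the line; a line composed entirely of spaces and tabs has zero fields, so it counts as a blank line, too.
a counts consecutive blank lines: the first blank line of a run is printed, any further ones are skipped, and a is reset to 0 on the next non-blank line.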
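Spelled out over several lines, the same logic might look like the sketch below. The function name squeeze_blank and the variable name blanks are only illustrative; the behavior is meant to match the one-liner above.

```shell
squeeze_blank() {
  awk '
    NF { print; blanks = 0; next }   # at least one field: print and reset the counter
    !blanks++                        # blank line: print it only if it is the first of its run
  ' "$@"
}

# Demo: three blank lines (empty, a space, a tab) between two comments.
printf '// Comment 1\n\n \n\t\n// Comment 2\n' | squeeze_blank
```

With no file arguments the function reads standard input, so it can sit in the middle of a pipeline.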

This page might come in handy. TL;DR as follows:
# delete all CONSECUTIVE blank lines from file except the first; also
# deletes all blank lines from top and end of file (emulates "cat -s")
sed '/./,/^$/!d' # method 1, allows 0 blanks at top, 1 at EOF
sed '/^$/N;/\n$/D' # method 2, allows 1 blank at top, 0 at EOF

This should work:
sed 'N;s/^\([[:space:]]*\)\n\([[:space:]]*\)$/\1\2/;P;D' file

awk -v RS='([[:blank:]]*\n){2,}' -v ORS="\n\n" 1 file
I had hoped to produce a shorter Perl version, but Perl does not use regular expressions for its record separator.
awk does not edit in-place. You would have to do this:
awk -v RS='([[:blank:]]*\n){2,}' -v ORS="\n\n" 1 file > tmp && mv tmp file
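The temporary-file step can be wrapped in a small function. This is only a sketch: squeeze_in_place is a made-up name, and it swaps in the simpler NF-based awk filter so that it does not depend on an awk that accepts a regular expression in RS.

```shell
# Hypothetical wrapper: squeeze blank lines in FILE and write the result back.
squeeze_in_place() {
  file=$1
  tmp=$(mktemp) || return 1
  # Filter into the temporary file first; only overwrite FILE if awk succeeded.
  # Copying back with cat (rather than mv) keeps the original permissions.
  awk 'NF { print; b = 0; next } !b++' "$file" > "$tmp" && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return "$status"
}
```

Usage would be something like squeeze_in_place file, after which file holds the squeezed text.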

Related

How to get all lines from a file after the last empty line?

Having a file like foo.txt with content
1
2
3
4
5
How do I get the lines starting with 4 and 5 out of it (everything after the last empty line), assuming the number of lines can be different?
Updated
Let's try a slightly simpler approach with just sed.
$: sed -n '/^$/{g;D;}; N; $p;' foo.txt
4
5
-n says don't print unless I tell you to.
/^$/{g;D;}; says on each blank line, clear it all out with this:
g : Replace the contents of the pattern space with the contents of the hold space. Since we never put anything in, this erases the (possibly long accumulated) pattern space. Note that I could have used z since this is GNU, but I wanted to break it out for non-GNU seds below, and in this case this works for both.
D : remove the now empty line from the pattern space, and go read the next.
Now previously accumulated lines have been wiped if (and only if) we saw a blank line. The D loops back to the beginning, so N will never see a blank line.
N : Add a newline to the pattern space, then append the next line of input to the pattern space. This is done on every line except blanks, after which the pattern space will be empty.
This accumulates all nonblanks until either 1) a blank is hit, which will clear and restart the buffer as above, or 2) we reach EOF with a buffer intact.
Finally, $p says on the LAST line (which will already have been added to the pattern space unless the last line was blank, which will have removed the pattern space...), print the pattern space. The only time this will have nothing to print is if the last line of the file was a blank line.
So the whole logic boils down to: clean the buffer on empty lines, otherwise pile the non-empty lines up and print at the end.
If you don't have GNU sed, just put the commands on separate lines.
sed -n '
/^$/{
g
D
}
N
$p
' foo.txt
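A variant sketch of the same "clear on blank, pile up otherwise, print at the end" idea keeps the accumulated lines in the hold space instead. sed preserves the hold space between cycles, whereas the pattern space is replaced whenever a new cycle reads a line. The function name last_block is hypothetical.

```shell
last_block() {
  sed -n '
    # Empty line: overwrite the hold space with it (clearing the buffer)
    # and start the next cycle.
    /^$/{
      h
      d
    }
    # Any other line: append it to the hold space.
    H
    # On the last line: fetch the buffer, drop the leading newline, print.
    ${
      x
      s/^\n//
      p
    }
  ' "$@"
}

printf '1\n\n2\n3\n\n4\n5\n' | last_block
```

If the file ends with an empty line, the d command fires before the final block is reached and nothing is printed, which matches "everything after the last empty line".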
Alternate
The method above is efficient, but could potentially build up a very large pattern buffer on certain data sets. If that's not an issue, go with it.
Or, if you want it in simple steps, don't mind more processes doing less work each, and prefer less memory consumed:
last=$( sed -n '/^$/=' foo.txt | tail -n 1 ) # find the last blank
next=$(( ${last:-0} + 1 )) # get the number of the line after
cmd="$next,\$p" # compose the range command to print
sed -n "$cmd" foo.txt # run it to print the range you wanted
This runs a lot of small, simple tasks outside of sed so that it can give sed the simplest, most direct and efficient description of the task possible. It will read the target file twice, but won't have to manage filling, flushing, and refilling the accumulation of data in the pattern buffer with records before a blank line. Still likely slower unless you are memory bound, I'd think.
Reverse the file, print everything up to the first blank line, reverse it again.
$ tac foo.txt | awk '/^$/{exit}1' | tac
4
5
Using GNU awk:
awk -v RS='\n\n' 'END{printf "%s",$0}' file
RS, the record separator, is set to two consecutive newlines, i.e. an empty line.
The END statement prints the last record.
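A multi-character RS is not available in every awk. As a hedged alternative, the sketch below buffers lines by hand and should work in any POSIX awk; the function name after_last_blank and the variable buf are invented for the example.

```shell
after_last_blank() {
  awk '
    /^$/ { buf = ""; next }     # empty line: forget everything collected so far
         { buf = buf $0 ORS }   # any other line: append it to the buffer
    END  { printf "%s", buf }   # what remains is the block after the last empty line
  ' "$@"
}

printf '1\n\n2\n3\n\n4\n5\n' | after_last_blank
```

Only the current block is held in memory, so a long file with many empty lines stays cheap.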
Try this:
tail -n +$(($(grep -nE '^$' test.txt | tail -n1 | sed -e 's/://g')+1)) test.txt
Grep the input file for empty lines; -n prefixes each match with its line number.
Take the last match with tail, e.g. 5:
Remove the unnecessary : with sed.
Add 1 to 5, giving 6.
Run tail starting from line 6.
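The steps above can be sketched one per line. The function name after_last_blank_tail is hypothetical, and cut stands in for the sed call that strips the colon.

```shell
after_last_blank_tail() {
  file=$1
  # Line number of the last empty line; stays empty if there is none.
  last=$(grep -n '^$' "$file" | tail -n 1 | cut -d: -f1)
  # Print from the line after it (from line 1 when no empty line exists).
  tail -n "+$(( ${last:-0} + 1 ))" "$file"
}
```

It would be called as after_last_blank_tail test.txt. The file is read twice, once by grep and once by tail.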
You can try with sed:
sed -n ':A;$bB;/^$/{x;s/.*//;x};H;n;bA;:B;H;x;s/^..//;p' infile
With GNU sed:
sed ':a;/$/{N;s/.*\n\n//;ba;}' file

Sed range and removing last matching line

I have this data:
One
    two
    three
Four
    five
    six
Seven
    eight
And this command:
sed -n '/^Four$/,/^[^[:blank:]]/p'
I get the following output:
Four
    five
    six
Seven
How can I change this sed expression to not match the final line of the output? So the ideal output should be:
Four
    five
    six
I've tried many things involving exclamation points but haven't managed to get close to getting this working.
Use a "do..while()" loop:
sed -n '/^Four$/{:a;p;n;/^[[:blank:]]/ba}'
Details:
/^Four$/ {
    :a                 # define the label "a"
    p                  # print the pattern space
    n                  # load the next line into the pattern space
    /^[[:blank:]]/ba   # if the pattern matches, go to label "a"
}
You may pipe to another sed and skip the last line:
sed -n '/^Four$/,/^[^[:blank:]]/p' file | sed '$d'
Four
    five
    six
Alternatively you may use:
sed -n '/^Four$/,/^[^[:blank:]]/{/^Four$/p; /^[^[:blank:]]/!p;}' file
You're using the wrong tool. sed is for doing s/old/new, that is all. Just use awk:
$ awk '/^[^[:blank:]]/{f=/^Four$/} f' file
Four
    five
    six
How it works: every time awk finds a line that doesn't start with a blank (/^[^[:blank:]]/), it sets the flag f (for "found") to 1 if that line is exactly Four and to 0 otherwise (f=/^Four$/). Whenever f is non-zero, the bare condition f is true and invokes awk's default action, which is to print the current line. So when awk reaches a block starting with Four it prints every line in that block because f is 1/true, and for every other block it prints nothing because f is 0/false.
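The flag logic can be sketched as a small reusable function. The name print_block and the header variable are made up, and the sample input assumes the lowercase lines are indented with spaces. The bracket expression [^ \t] is used in place of [^[:blank:]] only because some older awks lack POSIX character classes.

```shell
print_block() {
  # $1 is the exact header line to look for; any further arguments are input files.
  header=$1
  shift
  awk -v header="$header" '
    /^[^ \t]/ { f = ($0 == header) }   # unindented line: a new block starts here
    f                                  # print while inside the wanted block
  ' "$@"
}

printf 'One\n    two\nFour\n    five\n    six\nSeven\n    eight\n' | print_block Four
```

Because the flag is recomputed on every unindented line, the block ends by itself as soon as the next header (Seven here) appears.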
The following awk may help you here.
awk '!/^ /{flag=""} /Four/{flag=1} flag' Input_file
Output will be as follows.
Four
    five
    six
Also, if you need to save the output back into Input_file itself, append > temp_file && mv temp_file Input_file to the above command.
grep -Pzo '\n\KFour\n(\s.+\n)+' input.txt
Output
Four
    five
    six
This might work for you (GNU sed):
sed '/^Four/{:a;n;/^\s/ba};d' file
If the line begins with Four, print it and any following lines that begin with whitespace; delete everything else.
Another way:
sed '/^\S/h;G;/^Four/MP;d' file
If a line begins with a non-space, copy it to the hold space (HS). Append the HS to each line and if either line begins with Four print the first line and delete the rest. This will delete all lines other than the section beginning with Four.

How could I put these lines in range format?

I have a text file with 826,838 lines. Text file looks like this (sorry, couldn't get the image uploader to work).
I'm using sed (sed -n '2p;$p') to print the second and last line but can't figure out how to put the lines in range format.
Current output:
1 3008.00 7380.00 497724.00 3158482.00 497724.00 3158482.00
826838 4744.00 7409.00 480729.00 3207718.00 480729.00 3207718.00
Desired output:
1-826838 3008.00-4744.00 7380.00-7409.00 497724.00-480729.00 3158482.00-3207718.00 497724.00-480729.00 3158482.00-3207718.00
Thank you for your help!
This might work for you (GNU sed):
sed -r '2H;$!d;H;x;:a;s/\n\s*(\S+)\s*(.*\n)\s*(\S+\s*)/\1-\3\n\2/;ta;P;d' file
Store line 2 and the last line in the hold space (HS). Following the last line, swap to the HS and then repeatedly move the first fields of the second and third lines to the first line. Finally print the first line only.
With a single awk expression (it picks out the needed lines and builds the ranges):
awk 'NR==2{ split($0,a) }END{ for(i=1;i<=NF;i++) printf("%s\t",a[i]"-"$i); print "" }' file
The output:
1-826838 3008.00-4744.00 7380.00-7409.00 497724.00-480729.00 3158482.00-3207718.00 497724.00-480729.00 3158482.00-3207718.00
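A slightly more defensive sketch of the same idea saves the last line explicitly instead of relying on $0 inside END, and avoids the trailing tab. The names range_line, lo, hi and last are invented for illustration, and the sample rows are shortened to two columns.

```shell
range_line() {
  awk '
    NR == 2 { split($0, lo) }   # fields of the second line
            { last = $0 }       # remember the most recent line
    END {
      n = split(last, hi)       # fields of the last line
      for (i = 1; i <= n; i++)
        printf "%s%s-%s", (i > 1 ? "\t" : ""), lo[i], hi[i]
      print ""
    }
  ' "$@"
}

printf 'id x\n1 3008.00\n2 1.00\n826838 4744.00\n' | range_line
```

The output fields are separated by tabs; changing the "\t" literal to a space gives space-separated ranges instead.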

Select full block of text delimited by some chars

I have a very large text file (40GB gzipped) where blocks of data are separated by //.
How can I select blocks of data where a certain line matches some criterion? That is, can I grep a pattern and extend the selection in both directions to the // delimiter? I can make no assumptions on the size of the block and the position of the line.
not interesting 1
not interesting 2
//
get the whole block 1
MATCH THIS LINE
get the whole block 2
get the whole block 3
//
not interesting 1
not interesting 2
//
I want to select the block of data with MATCH THIS LINE:
get the whole block 1
MATCH THIS LINE
get the whole block 2
get the whole block 3
I tried sed but can't get my head around the pattern definition. This for example should match from // to MATCH THIS LINE:
sed -n -e '/\/\//,/MATCH THIS LINE/ p' file.txt
But it fails to match the //.
Is it possible to achieve this with GNU command line tools?
With GNU awk (due to multi-char RS), you can set the record separator to //, so that every record is a //-delimited set of characters:
$ awk -v RS="//" '/MATCH THIS LINE/' file
get the whole block 1
MATCH THIS LINE
get the whole block 2
get the whole block 3
Note this leaves an empty line above and below because it catches the new line just after // and prints it back, as well as the last one before the // at the end. To remove them you can pipe to awk 'NF'.
To print the separator between blocks of data you can say (thanks 123):
awk -v RS="//" '/MATCH THIS LINE/{print RT $0 RT}' file
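Where a multi-character RS is not available, a line-by-line sketch along the following lines might do instead. It assumes each // delimiter sits on a line of its own, as in the sample; the function name select_blocks and the variables block, hit and pat are hypothetical.

```shell
select_blocks() {
  # $1 is the pattern to look for; any further arguments are input files.
  pattern=$1
  shift
  awk -v pat="$pattern" '
    $0 == "//" {                         # delimiter line closes the current block
      if (hit) printf "%s", block
      block = ""
      hit = 0
      next
    }
    { block = block $0 ORS }             # collect the line
    $0 ~ pat { hit = 1 }                 # remember that this block matched
    END { if (hit) printf "%s", block }  # the last block may lack a closing //
  ' "$@"
}

printf 'not interesting 1\n//\nget the whole block 1\nMATCH THIS LINE\nget the whole block 2\n//\nnot interesting 2\n//\n' |
  select_blocks 'MATCH THIS LINE'
```

Only one block is buffered at a time, so for the gzipped file the input can be streamed, for example gzip -dc data.gz | select_blocks 'MATCH THIS LINE' (data.gz being a placeholder name).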

Removing newlines between tokens

I have a file that contains some information spanning multiple lines. In order for certain other bash scripts I have to work properly, I need this information to all be on a single line. However, I obviously don't want to remove all newlines in the file.
What I want to do is replace newlines, but only between all pairs of STARTINGTOKEN and ENDINGTOKEN, where these two tokens are always on different lines (but never get jumbled up together, it's impossible for instance to have two STARTINGTOKENs in a row before an ENDINGTOKEN).
I found that I can remove newlines with
tr "\n" " "
and I also found that I can match patterns over multiple lines with
sed -e '/STARTINGTOKEN/,/ENDINGTOKEN/!d'
However, I can't figure out how to combine these operations while leaving the remainder of the file untouched.
Any suggestions?
Are you looking for this?
awk '/STARTINGTOKEN/{f=1} /ENDINGTOKEN/{f=0} {if(f)printf "%s",$0;else print}' file
example:
kent$ cat file
foo
bar
STARTINGTOKEN xx
1
2
ENDINGTOKEN yy
3
4
STARTINGTOKEN mmm
5
6
7
nnn ENDINGTOKEN
8
9
kent$ awk '/STARTINGTOKEN/{f=1} /ENDINGTOKEN/{f=0} {if(f)printf "%s",$0;else print}' file
foo
bar
STARTINGTOKEN xx12ENDINGTOKEN yy
3
4
STARTINGTOKEN mmm567nnn ENDINGTOKEN
8
9
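The tr example in the question turns each newline into a space rather than deleting it. A variant sketch of the same flag approach that does this is shown below; the function name join_between_tokens and the variable joining are made up for the example.

```shell
join_between_tokens() {
  awk '
    /STARTINGTOKEN/ { joining = 1 }
    /ENDINGTOKEN/   { joining = 0 }
    # Inside a token pair, end the line with a space instead of a newline.
    { printf "%s%s", $0, (joining ? " " : ORS) }
  ' "$@"
}

printf 'foo\nSTARTINGTOKEN xx\n1\n2\nENDINGTOKEN yy\n3\n' | join_between_tokens
```

Lines outside a token pair pass through unchanged, and the line carrying ENDINGTOKEN ends with a normal newline.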
This seems to work:
sed -ne '/STARTINGTOKEN/{ :next ; /ENDINGTOKEN/!{N;b next;}; s/\n//g;p;}' "yourfile"
Once it finds the starting token it loops, picking up lines until it finds the ending token, then removes all the embedded newlines and prints it. Then repeats.
Using awk:
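awk '/STARTINGTOKEN/ || l {l = l $0}   # inside a token pair: append to the buffer
/ENDINGTOKEN/ {print l; l = ""; next}  # closing token: print the joined line
!l                                     # outside a pair: print the line unchanged
' input.file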
This might work for you (GNU sed):
sed '/STARTINGTOKEN/!b;:a;$bb;N;/ENDINGTOKEN/!ba;:b;s/\n//g' file
or:
sed -r '/(START|END)TOKEN/,//{/STARTINGTOKEN/{h;d};H;/ENDINGTOKEN/{x;s/\n//gp};d}' file

Resources