Removing newlines between tokens - bash

I have a file that contains some information spanning multiple lines. In order for certain other bash scripts I have to work property, I need this information to all be on a single line. However, I obviously don't want to remove all newlines in the file.
What I want to do is replace newlines, but only between all pairs of STARTINGTOKEN and ENDINGTOKEN, where these two tokens are always on different lines (but never get jumbled up together, it's impossible for instance to have two STARTINGTOKENs in a row before an ENDINGTOKEN).
I found that I can remove newlines with
tr "\n" " "
and I also found that I can match patterns over multiple lines with
sed -e '/STARTINGTOKEN/,/ENDINGTOKEN/!d'
However, I can't figure out how to combine these operations while leaving the remainder of the file untouched.
Any suggestions?

are you looking for this?
awk '/STARTINGTOKEN/{f=1} /ENDINGTOKEN/{f=0} {if(f)printf "%s",$0;else print}' file
example:
kent$ cat file
foo
bar
STARTINGTOKEN xx
1
2
ENDINGTOKEN yy
3
4
STARTINGTOKEN mmm
5
6
7
nnn ENDINGTOKEN
8
9
kent$ awk '/STARTINGTOKEN/{f=1} /ENDINGTOKEN/{f=0} {if(f)printf "%s",$0;else print}' file
foo
bar
STARTINGTOKEN xx12ENDINGTOKEN yy
3
4
STARTINGTOKEN mmm567nnn ENDINGTOKEN
8
9

This seems to work:
sed -ne '/STARTINGTOKEN/{ :next ; /ENDINGTOKEN/!{N;b next;}; s/\n//g;p;}' "yourfile"
Once it finds the starting token it loops, picking up lines until it finds the ending token, then removes all the embedded newlines and prints it. Then repeats.

Using awk:
awk '$0 ~ /STARTINGTOKEN/ || l {l=sprintf("%s%s", l, $0)}
/ENDINGTOKEN/{print l; l=""}' input.file

This might work for you (GNU sed):
sed '/STARTINGTOKEN/!b;:a;$bb;N;/ENDINGTOKEN/!ba;:b;s/\n//g' file
or:
sed -r '/(START|END)TOKEN/,//{/STARTINGTOKEN/{h;d};H;/ENDINGTOKEN/{x;s/\n//gp};d}' file

Related

3 Is there anyway to insert new lines in-between two patterns

Is there anyway to insert new lines in-between 2 specific patterns of characters? I want to insert a new line every time "butterfly" occurs in a text file, however I want this new line to be inserted between the "butter" and "fly". For example butter\nfly
I also want to find the length of each line after splitting.
Eg:
if textfile contains:
fgsccgewvdhbejbecbecboubutterflybvdcvhkebcjl
vdjchvhecbihbutterflyglehblejkbedkbutterflyr
Then, I want a result like the following:
29 fgsccgewvdhbejbecbecboubutter
33 flybvdcvhkebcjlvdjchvhecbihbutter
22 flyglehblejkbedkbutter
4 flyr
I believe one way to tackle it would be to insert a new line using "sed" everywhere "butter" occurs and is followed by "fly". Strip out all blank line using grep with a -v flag. Then get the length of each line. However, even after trying a lot, I am unable to get the correct answer.
The Sed 's' sub-command + awk can work together:
sed -e "s/butterfly/butter\\nfly/g" < input.txt | awk '{ print length, $0 }'
This might work for you (GNU sed & bash):
sed -Ez 's/\n//g;s/(butter)(fly)/\1\n\2/g;s/^.*$/l=&;printf "%d %s\n" ${#l} &/meg' file
Slurp the file into memory using the -z sed option. Remove all existing newlines and then insert new ones between butter and fly. Using the m, g and e flags of the sed substitute command, split into separate lines and using bash make a variable l and via printf print the required format.

match repeated character in sed on mac

I am trying to find all instances of 3 or more new lines and replace them with only 2 new lines (imagine a file with wayyy too much white space). I am using sed, but OK with an answer using awk or the like if that's easier.
note: I'm on a mac, so sed is slightly different than on linux (BSD vs GNU)
My actual goal is new lines, but I can't get it to work at all so for simplicity I'm trying to match 3 or more repetitions of bla and replace that with BLA.
Make an example file called stupid.txt:
$ cat stupid.txt
blablabla
$
My understanding is that you match i or more things using regex syntax thing{i,}.
I have tried variations of this to match the 3 blas with no luck:
cat stupid.txt | sed 's/bla{3,}/BLA/g' # simplest way
cat stupid.txt | sed 's/bla\{3,\}/BLA/g' # escape curly brackets
cat stupid.txt | sed -E 's/bla{3,}/BLA/g' # use extended regular expressions
cat stupid.txt | sed -E 's/bla\{3,\}/BLA/g' # use -E and escape brackets
Now I am out of ideas for what else to try!
thing{3,} matches thinggg. Use (..) to group things to make the quantifier apply to what you want:
$ echo blablabla | sed -E 's/(bla){3}/BLA/g'
BLA
If slurping the whole file is acceptable:
perl -0777pe 's/(\n){3,}/\n\n/g' newlines.txt
Where you should replace \n with whatever newline sequence is appropriate.
-0777 tells perl to not break each line into its own record, which allows a regex that works across lines to function.
If you are satisfied with the result, -i causes perl to replace the file in-place rather than output to stdout:
perl -i -0777pe 's/(\n){3,}/\n\n/g' newlines.txt
You can also do as so: -i~ to create a backup file with the given suffix (~ in this case).
If slurping the whole file is not acceptable:
perl -ne 'if (/^$/) {$i++}else{$i=0}print if $i<3' newlines.txt
This prints any line that is not the third (or higher) consecutive empty line. -i works with this the same.
ps--MacOS comes with perl installed.
sed -E 's/bla{3,}/BLA/g'
The above matches bl followed by three or more repetitions of a. This is not what you want. It appears that you actually want three or more repetitions of bla. If that is the case, then replace:
$ sed -E 's/bla{3,}/BLA/g' stupid.txt
blablabla
With:
$ sed -E 's/(bla){3,}/BLA/g' stupid.txt
BLA
The above, though, doesn't directly help with your task of replacing newlines because, by default, sed reads in only one line at a time.
Replacing newlines
Let's consider this file which has 3 newlines between the 1 and 2:
$ cat file.txt
1
3
To replace any occurrence of three or more newlines with a single newline:
$ sed -E 'H;1h;$!d;x; s/\n{3,}/\n/g' file.txt
1
3
How it works:
H;1h;$!d;x
This complex series of commands reads in the whole file. It is probably
simplest to think of this as an idiom. If you really want to know
the gory details:
H - Append current line to hold space
1h - If this is the first line, overwrite the hold space
with it
$!d - If this is not the last line, delete pattern space
and jump to the next line.
x - Exchange hold and pattern space to put whole file in
pattern space
s/\n{3,}/\n/g
This replaces all sequences of three or more newlines with a single newline.
Alternate
The above solution reads in the whole file at once. For large (gigabyte) files that could be a disadvantage. This alternate approach avoids that:
$ sed -E '/^$/{:a; N; /\n$/ba; s/\n{3,}([^\n]*)/\1/}' file.txt # GNU only
1
3
How it works:
/^$/{...}
This selects blank lines. For blank lines and only blank lines, the commands in braces are executed and they are:
:a
This defines a label a.
N
This reads in the next line from the file into the pattern space, separated from the previous by a newline.
/\n$/ba
If the last line read in is empty, branch (jump) to label a.
s/\n{3,}([^\n]*)/\1/
If we didn't branch, then this substitution is performed which removes the excess newlines.
BSD Version: I don't have a BSD system to test this on but I am guessing:
sed -E -e '/^$/{:a' -e N -e '/\n$/ba' -e 's/\n{3,}([^\n]*)/\1/}' file.txt
To keep only 2 newlines, you can try this sed
sed '
/^$/!b
N
/../b
h
:A
y/\n/#/
/^#$/!bB
s/#//
$bB
N
bA
:B
s/^#//
/./ {
x
G
b
}
g
' infile
/^$/!b If it's a empty line don't print it
N get a new line
/../b if this new line is not empty print the 2 lines
h keep the 2 empty lines in the hold buffer
:A label A
At this point there is always 2 lines in the pattern buffer and the first is empty
y/\n/#/ substitute \n by # (you can choose another char not present in your file)
/^#$/!bB If the second line is not empty jump to B
s/#// remove the #
$bB If it's the last line jump to B
At this point there is 1 empty line in the pattern space
N get the last line
bA jump to A
:B label B
s/^#// remove the # at the start of the line
/./ { If the last line is not empty
x exchange pattern and hold buffer
G add the hold buffer to the pattern space
b jump to end
}
g replace the pattern space (empty) by the hold space
print the pattern space

Replacing newlines with commas at every third occurrence using AWK?

For example: a given file has the following lines:
1
alpha
beta
2
charlie
delta
10
text
test
I'm trying to get the following output using awk:
1,alpha,beta
2,charlie,delta
10,text,test
Fairly simple. Use the output record separator as follows. Specify the comma delimiter when the line number is not divisible by 3 and the newline otherwise:
awk 'ORS=NR%3?",":"\n"' file
awk can handle this easily by manipulating ORS:
awk '{ORS=","} !(NR%3){ORS="\n"} 1' file
1,alpha,beta
2,charlie,delta
10,text,test
there is a tool for this kind of text processing pr
$ pr -3ats, file
1,alpha,beta
2,charlie,delta
10,text,test
You can also use xargs with sed to coalesce multiple lines into single lines, useful to know:
cat file|xargs -n3|sed 's/ /,/g'

Sed/Awk to delete second occurence of string - platform independent

I'm looking for a line in bash that would work on both linux as well as OS X to remove the second line containing the desired string:
Header
1
2
...
Header
10
11
...
Should become
Header
1
2
...
10
11
...
My first attempt was using the deletion option of sed:
sed -i '/^Header.*/d' file.txt
But well, that removes the first occurence as well.
How to delete the matching pattern from given occurrence suggests to use something like this:
sed -i '/^Header.*/{2,$d} file.txt
But on OS X that gives the error
sed: 1: "/^Header.*/{2,$d}": extra characters at the end of d command
Next, i tried substitution, where I know how to use 2,$, and subsequent empty line deletion:
sed -i '2,$s/^Header.*//' file.txt
sed -i '/^\s*$/d' file.txt
This works on Linux, but on OS X, as mentioned here sed command with -i option failing on Mac, but works on Linux , you'd have to use
sed -i '' '2,$s/^Header.*//' file.txt
sed -i '' '/^\s*$/d' file.txt
And this one in return doesn't work on Linux.
My question then, isn't there a simple way to make this work in any Bash? Doesn't have to be sed, but should be as shell independent as possible and i need to modify the file itself.
Since this is file-dependent and not line-dependent, awk can be a better tool.
Just keep a counter on how many times this happened:
awk -v patt="Header" '$0 == patt && ++f==2 {next} 1' file
This skips the line that matches exactly the given pattern and does it for the second time. On the rest of lines, it prints normally.
I would recommend using awk for this:
awk '!/^Header/ || !f++' file
This prints all lines that don't start with "Header". Short-circuit evaluation means that if the left hand side of the || is true, the right hand side isn't evaluated. If the line does start with Header, the second part !f++ is only true once.
$ cat file
baseball
Header and some other stuff
aardvark
Header for the second time and some other stuff
orange
$ awk '!/^Header/ || !f++' file
baseball
Header and some other stuff
aardvark
orange
This might work for you (GNU sed):
sed -i '1b;/^Header/d' file
Ignore the first line and then remove any occurrence of a line beginning with Header.
To remove subsequent occurrences of the first line regardless of the string, use:
sed -ri '1h;1b;G;/^(.*)\n\1$/!P;d' file

How to remove a long line with special characters from a big file in bash

I have a file having 200 lines and I want to remove 5 long lines( each line contains special characters).
$cat abc
............
comments[asci?_203] part of jobs where to delete
5 similar lines
.....
I tried sed to remove these 5 lines, using line numbers(nl) on the file, but did not work.
Thanks
Have you tried to remove the lines with awk? This is untested with special characters, but maybe it could work
awk '{if (length($0)<55) print $0}' < abc
Replace 55 with the maximum line length you want to keep

Resources