How to find consecutive blank lines and convert them to one - shell

I have a file, a, which contains some runs of consecutive blank lines (more than one in a row), see below:
cat a
1


2


3


4


5
First, I want to know whether consecutive blank lines exist. I tried
cat a | grep '\n\n\n'
but it outputs nothing, so I had to fall back on this:
vi a
:set list
/\n\n\n
Is there another shell command that can do this more easily?
Then, wherever two or more consecutive blank lines appear, I want to convert them to a single blank line, see below:
1

2

3

4

5
At first I tried this:
sed 's/\n\n\(\n\)*/\n\n/g' a
but it does not work. Then I tried this:
cat a | tr '\n' '$' | sed 's/$$\(\$\)*/$$/g' | tr '$' '\n'
and this time it works. Is there another way to do this?

Well, if your cat implementation supports
-s, --squeeze-blank
suppress repeated empty output lines
then it is as simple as
$ cat -s a
1

2

3

4

5
Also, both -s and -n (for numbering lines) are likely to be available with the less command as well.
remark: lines containing only blanks will not be suppressed.
If your cat does not support -s then you could use:
awk 'NF||p; {p=NF}'
or if you want to guarantee a blank line after every record, including at the end of the output even if none was present in the input, then:
awk -v RS= -v ORS='\n\n' '1'
If your input contains lines of all whitespace and you want them to be treated just like lines of non-whitespace (which is what cat -s does), then:
awk '/./||p; {p=/./}'
and to guarantee a blank line at the end of the output:
awk '/./||p; {p=/./} END{if (p) print ""}'
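For instance, the first of these one-liners spelled out with comments (the same command, just expanded; applied to the file a from the question):
awk '
    NF || p     # print this line if it is non-blank, or if the previous line was
    { p = NF }  # remember whether this line was blank (NF == 0) or not
' a
Only the first blank line of each run survives. Note that a run of blank lines at the very top of the file is removed entirely, whereas cat -s would keep one.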

This awk command should work; it reprints each record and restores a single blank line wherever the input had one or more:
awk -v RS= '{printf "%s%s", $0, ORS (RT ~ /\n{2,}/ ? ORS : "")}' file
1

2

3

4

5
This awk is using:
-v RS=: sets an empty input record separator (paragraph mode), so that runs of blank lines act as the record separator
printf "%s%s", $0, ORS: prints each record followed by a single line break
(RT ~ /\n{2,}/ ? ORS : ""): prints an additional line break if the record terminator contained two or more line breaks (RT requires GNU awk)
You may use perl as well in slurp mode:
perl -0777 -pe 's/\R{2,}/\n\n/g' file
1

2

3

4

5
Command breakup:
-0777 Slurp mode to read entire file
's/\R{2,}/\n\n/g' Match 2 or more line breaks and replace by 2 line breaks

You can use tr with --squeeze-repeats (-s) to collapse the newlines and then use sed 'G' to append a single blank line after each line (note that this appends a blank line after every line of output, including lines that had no blank after them in the input):
<a tr -s '\n' | sed 'G'
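As an aside, for the detection part of the question: the grep '\n\n\n' attempt fails because grep matches one line at a time, so a pattern can never span a newline. Assuming GNU grep with PCRE support (-P) and null-data mode (-z), something along these lines can test whether consecutive blank lines exist:
grep -qPz '\n\n\n' a && echo "a contains consecutive blank lines"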

remark: This is a copy from my answer here
A very quick way is to use awk:
awk 'BEGIN{RS="";ORS="\n\n"}1'
How does this work:
awk knows the concept of records (which are lines by default) and you can define a record by its record separator RS. If you set the value of RS to an empty string, it will treat any sequence of blank lines as a record separator. The value ORS is the output record separator; it states which separator should be printed between two consecutive records, and here it is set to two <newline> characters. Finally, the statement 1 is a shorthand for {print $0}, which prints the current record followed by the output record separator ORS.
note: Just like cat -s, this will keep lines containing only blanks as actual lines and will not suppress them.
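For instance, a quick check with printf-generated input (adjacent non-blank lines stay adjacent, and a trailing blank line is emitted because ORS also follows the last record):
$ printf 'a\nb\n\n\n\nc\n' | awk 'BEGIN{RS="";ORS="\n\n"}1'
a
b

c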

Another awk solution:
awk 'NF' ORS="\n\n" a
1

2

3

4

5
It checks whether the line is non-empty by testing whether NF (number of fields) is not zero. If it matches, the line is printed as the default action. ORS (the output record separator) is set to 2 newline characters, so every non-empty line is followed by an empty line. Note that this inserts an empty line after every line, even where the input had none, and leaves a trailing blank line at the end.
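Compare with the RS-based answers above: because ORS follows every non-empty line, this one also inserts a blank line between lines that were adjacent in the input, for example:
$ printf 'a\nb\n\n\n\nc\n' | awk -v ORS='\n\n' 'NF'
a

b

c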

1) awk solution
$ printf "a\n\n\nb\n\n\nc\n\n\n" | awk 'BEGIN{b=0} /^$/{b=1;next} {printf "%s%s\n", b==1?"\n":"",$0} {b=0} END{printf "%s",b==1?"\n":""}'
a

b

c

$
2) sed solution
sed '
/^$/{ ${ p; d; }; H; d; }
/^$/!{ x; s/^\(\n\{1,\}\)$/\1/; ts; Tf; }
:s { x; s/\(.*\)/\n\1/; x; s/.*//; x; p; d; }
:f { x; p; d; }
'
SED Explanation:
/^$/{ ${ p; d; }; H; d; }
-- If the input line is blank: if it is the last line, just print it; otherwise append it to the hold space, delete the pattern space and start a new cycle.
/^$/!{ x; s/^\(\n\{1,\}\)$/\1/; ts; Tf; }
-- If the input line is not blank, exchange the pattern space and hold space and check whether the hold space consists of newlines. If yes, jump to s; if not, jump to f.
:s { x; s/\(.*\)/\n\1/; x; s/.*//; x; p; d; }
-- If blank lines were collected in the hold space, prepend a \n to the line (so a single blank line precedes it), clear the hold space, then print the pattern space and delete it.
:f { x; p; d; }
-- If no blank lines were collected in the hold space, just print the pattern space and delete it.
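For comparison, the classic sed one-liner for squeezing runs of blank lines down to one is much shorter, though its handling of blank lines at the very start and end of the file differs slightly from the scripts above:
sed '/^$/N;/\n$/D' a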

Related

Bash: how to put each line of a column below the same-row-line of another column?

I'm working with some data in bash and I need this kind of input file:
Col1 Col2
A B
C D
E F
G H
to turn into this output file:
Col1
A
B
C
D
E
F
G
H
I tried some commands but they didn't work. Any suggestions would be very much appreciated!
As with many problems, there are many solutions. Here is one using awk:
awk 'NR > 1 {print $1; print $2}' inputfile.txt
The NR > 1 expression says to execute the following block for all line numbers greater than one. (NR is the current record number which is the same as line number by default.)
The {print $1; print $2} code block says to print the first field, then print the second field. The advantage of using awk in this case is that it doesn't matter if the fields are separated by space characters, tabs, or a combination; the fields just have to be separated by some number of whitespace characters.
If the field values on each line are only separated by a single space character, then this would work:
tail -n +2 inputfile.txt | tr ' ' '\n'
In this solution, tail -n +2 is used to print all lines starting with the second line, and tr ' ' '\n' is used to replace all the space characters with newlines, as suggested previously.
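If the Col1 header shown in the desired output should be kept as well, a small variation on the same awk idea (a sketch) would be:
awk 'NR == 1 {print $1; next} {print $1; print $2}' inputfile.txt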

Remove first two lines, last two lines and space from file and add quotes on each line and replace newline with commas in shell script

I have an input.txt file which needs to be formatted by a shell script under the following conditions:
remove the first two lines and the last two lines
remove all spaces in each line (each line has two spaces at the beginning and one space at the end)
wrap each line in single quotes (' ')
finally, replace the newlines with commas.
(original)
input.txt
sql
--------
Abce
Bca
Efr
-------
Row (3)
Desired output file
output.txt
'Abce','Bca','Efr'
I have tried using the following commands:
Sed -i 1,2d input.txt > input.txt
Sed "$(( $(wc -l <input.txt) -2+1)), $ d" Input.txt > input.txt
Sed ':a;N;$!ba;s/\n/, /g' input.txt > output.txt
But I get a blank output.txt.
Would you please try the following:
mapfile -t ary < <(tail -n +3 input.txt | head -n -2 | sed -E "s/^[[:blank:]]*/'/; s/[[:blank:]]*$/'/")
(IFS=,; echo "${ary[*]}")
tail -n +3 outputs the lines starting from the 3rd line.
head -n -2 outputs lines excluding the last 2 lines.
sed -E "s/^[[:blank:]]*/'/" removes leading whitespaces and prepends
a single quote.
Similarly the sed command "s/[[:blank:]]*$/'/" removes trailing
whitespaces and appends a single quote.
The syntax <(command ..) is a process substitution and the
output of the commands within the parentheses is fed to the mapfile
via the redirect.
mapfile -t ary reads lines from the standard input into the array
variable named ary.
echo "${ary[*]}" expands to a single string containing the elements of
the array ary separated by the value of IFS, which has just been set
to a comma.
The assignment of IFS and the array expansion are enclosed in
parentheses so they are executed in a subshell. This prevents IFS
from being modified in the current process.
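Putting it together and writing the result to output.txt could look like this (simply the two commands above combined):
mapfile -t ary < <(tail -n +3 input.txt | head -n -2 | sed -E "s/^[[:blank:]]*/'/; s/[[:blank:]]*$/'/")
(IFS=,; echo "${ary[*]}") > output.txt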
With your shown samples, please try the following awk program. Written and tested in GNU awk; it should work with any version.
awk -v s1="'" -v lines="$(wc -l < Input_file)" '
BEGIN{ OFS="," }
FNR==(lines-1) {
print val
exit
}
FNR>2{
sub(/^[[:space:]]+/,"")
val=(val?val OFS:"") (s1 $0 s1)
}
' Input_file
Explanation: Adding detailed explanation for above code, this is only for explanation purposes.
awk -v s1="'" -v lines="$(wc -l < Input_file)" ' ##Starting awk program, setting s1 variable to ' and creating lines which has total number of lines in it, using wc -l command on Input_file file.
BEGIN{ OFS="," } ##Setting OFS to comma in BEGIN section of this program.
FNR==(lines-1) { ##Checking condition if it's the 2nd last line of Input_file.
print val ##Then printing val here.
exit ##exiting from program from here.
}
FNR>2{ ##Checking condition if FNR is greater than 2 then do following.
sub(/^[[:space:]]+/,"") ##Substituting initial spaces with NULL here.
val=(val?val OFS:"") (s1 $0 s1) ##Creating val which has ' current line ' in it and keep adding it in val.
}
' Input_file ##Mentioning Input_file name here.
If you know the input is small enough to fit in memory:
$ awk '
NR>4 { gsub(/^ *| *$/,"\047",p2); out=out sep p2; sep="," }
{ p2=p1; p1=$0 }
END { print out }
' input.txt
'Abce','Bca','Efr'
Otherwise:
$ awk '
NR>4 { gsub(/^ *| *$/,"\047",p2); printf "%s%s", sep, p2; sep="," }
{ p2=p1; p1=$0 }
END { print "" }
' input.txt
'Abce','Bca','Efr'
Either script will work using any awk in any shell on every Unix box.
This might work for you (GNU sed):
sed -E '1,2d;$!H;$!d;x;s/^\s*(.*)\s*$/'\''\1'\''/mg;s/\n[^\n]*$//;y/\n/,/' file
Delete the first two lines.
Append each line to the hold space, except for the last (this means the second from last line will still be present - see later).
Delete all lines except for the last.
Swap to the hold space.
Remove all spaces either side of the words on each line and surround those words by single quotes.
Remove the last line and its newline.
Replace all newlines by commas.
The first sed -i overwrites input.txt with an empty file. You can't write output back to the file you are reading, and sed -i does not produce any output anyway.
The minimal fix is to take out the -i and string together the commands into a pipeline; but of course, sed allows you to combine the commands into a single script.
len=$(wc -l <input.txt)
sed -e '1,2d' -e "$((len - 3))"',$d' \
-e ':a' \
-e 's/^ \(.*\) $/'"'\\1'/" \
-e N -e '$!ba' -e 's/\n/, /g' input.txt >output.txt
(Untested; if your sed does not allow multiple -e options, needs refactoring to use a single string with semicolons or newlines between the commands.)
This is hard to write and debug and brittle because of the ways you have to combine the quoting features of the shell with the requirements of sed and this particular script, but also more inherently because sed is a terse and obscure language.
A much more legible and maintainable solution is to switch to Awk, which allows you to express the logic in more human terms, and avoid having to pull in support from the shell for simple tasks like arithmetic and string formatting.
awk 'FNR > 2 { sub(/^ /, ""); sub(/ $/, "");
a[++i] = sprintf("\047%s\047,", $0); }
END { for(j=1; j < i-1; ++j) printf "%s", a[j] }' input.txt >output.txt
This literally replaces all newlines with commas; perhaps you would in fact like to print a newline instead of the comma on the last line?
awk 'FNR > 2 { sub(/^ /, ""); sub(/ $/, "");
a[++i] = sprintf("%s\047%s\047", sep, $0); sep="," }
END { for(j=1; j < i-1; ++j) printf "%s", a[j]; printf "\n" }' input.txt >output.txt
If the input file is really large, you might want to refactor this to not keep all the lines in memory. The array a collects the formatted output and we print all its elements except the last two in the END block.
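One way to avoid holding everything in memory (a sketch, not part of the original answer) is to keep only a two-line buffer, so the last two lines are simply never emitted:
awk '
FNR > 2 {
    if (n >= 2) {                        # the line stored two iterations ago is safe to print
        line = buf[n % 2]
        sub(/^ */, "", line); sub(/ *$/, "", line)
        printf "%s\047%s\047", sep, line
        sep = ","
    }
    buf[n % 2] = $0                      # overwrite the slot that was just flushed
    n++
}
END { if (sep != "") print "" }' input.txt > output.txt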
sed -E '
/^-+$/,/^-+$/!d
//d
s/^[[:space:]]*|[[:space:]]*$/'\''/g
' input.txt |
paste -sd ,
This uses a trick that doesn't work on all sed implementations to print the lines between two patterns (the dashes in this case), excluding those patterns.
On the plus side, if the ---- pattern is at a different line number, it still works. The downside is that it breaks if that pattern (a line containing only dashes) occurs an odd number of times (i.e. not in pairs wrapping the lines you want).
Then substitute the start and end of each line (including surrounding whitespace) with single quotes.
Finally pipe to paste to replace the newlines with commas, without a trailing comma.
Using sed
$ sed "1,2d; /-/,$ d; s/\s\+//;s/.*/'&'/" input_file | sed -z 's/\n/,/g;s/,$/\n/'
'Abce','Bca','Efr'
I'll post a sed solution which is rather light.
sed '$d' input.txt | sed "\$d; 1,2d; s/^\s*\|\s*$/'/g" | paste -sd ',' > output.txt
$d Remove last line with first sed
\$d Remove the last line. $ escaped with backslash as we are within double-quotes.
1,2d Remove the first two lines.
s/^\s*\|\s*$/'/g Replace all leading and trailing whitespace with single quotes.
Use paste to concatenate into a single, comma-delimited string.
If we know that the relevant lines always start with two spaces, then it can even be simplified further.
sed -n "s/\s*$/'/; s/^  /'/p" input.txt | paste -sd ',' > output.txt
-n suppress printing lines unless told to
s/\s*$/'/ replace trailing whitespace with single quotes
s/^  /'/p replace the two leading spaces and print lines that match
paste to concat
Then an awk solution:
awk -v i=1 -v q=\' 'FNR>2 {
gsub(/^[[:space:]]*|[[:space:]]*$/, q)
a[i++]=$0
} END {
for(i=1; i<=length(a)-3; i++)
printf "%s,", a[i]
print a[i++]
}' input.txt > output.txt
-v i=1 create an awk variable starting at one
-v q=\' create an awk variable for the single quote character
FNR>2 { ... tells it to only process line 3+
gsub(/^[[:space:]]*|[[:space:]]*$/, q) substitute leading and trailing whitespace with single quotes
a[i++]=$0 add line to array
END { ... Process the rest after reaching end of file
for(i=1; i<=length(a)-3; i++) take the length of the array but subtract three -- representing the last three lines
printf "%s,", a[i] print all but last three entries comma delimited
print a[i++] print next entry and complete the script (skipping the last two entries)
Not a one-liner, but it works:
sed "s/^ */\'/;s/\$/\',/;1,2d;N;\$!P;\$!D;\$d" input.txt | sed ' H;1h;$!d;x;s/\n//g;s/,$//'
Explanation:
s/^ */\'/;s/\$/\',/ ---> Adds single quotes and comma
N;$!P;$!D;$d ---> Deletes last two lines
H;1h;$!d;x;s/\n//g;s/,$//' ---> Loads the entire file, merges all lines and removes the last comma

bash remove block of text from file

Suppose I have an input file with lines of text:
line 1
line 2
line 3
line 4
line 2
now suppose I would like to check if my inputfile contains
line 2
line 3
and remove that block of text if it is found. This would give:
line 1
line 4
line 2
Note that I don't want to remove just every occurrence of line 2 or line 3; but only if they are found one after another. (In reality I want to check for a block of 5 lines, and not just any block of code between two placeholders, but let's keep the example simple).
I looked into awk but it gets complicated very quickly (I'm not done with this yet; I feel this is not the right approach and it will explode with 5 lines...):
awk '/line 2/ {if (line0) {print line0; line0=""}; line0=$0}' input.txt
One way with GNU awk for multi-char RS and RT:
$ awk -v RS='(^|\n)line 2\nline 3\n' '{ORS=(RT ~ /^\n/ ? "\n" : "")} 1' file
line 1
line 4
line 2
With any awk:
$ cat file
line 2
line 3
line 1
line 2
line 3
line 4
line 2
line 3
$ awk '
{ rec = rec $0 RS }
END {
rec = RS rec
gsub(/\nline 2\nline 3\n/,RS,rec)
gsub(/^\n|\n$/,"",rec)
print rec
}
' file
line 1
line 4
The above assumes you want to match using regexps since that's what your posted code does. If you want to do literal string matches instead that's do-able too with some massaging:
$ cat tst.awk
{ rec = rec $0 RS }
END {
while ( beg = index(RS rec,RS block RS) ) {
out = out substr(RS rec,1,beg-1)
rec = substr(RS rec,beg+length(block)+2)
}
print substr(out rec,2)
}
$ awk -v block='line 2\nline 3' -f tst.awk file
line 1
line 4
Not awk, but this is straightforward with Perl 5, as #triplee pointed out. With the five-line input file you showed above as foo.txt:
perl -0777 -pe 's{^line 2\nline 3\n}{}gm' foo.txt
produces the desired three-line output.
Explanation:
-0777 causes perl to read the entire input as one string (see perlrun).
The /m modifier on the regex causes ^ to match at the beginning of a line (see perlre).
Edit: ^ will also match at the beginning of the file, so you can detect blocks of lines even if there is no newline before them.
The separators between the lines are literal \ns because $ matches before the \n with the /m modifier. Therefore, it's easier just to match the \n.
Thanks to this U&L SE answer by Stéphane Chazelas for the basics.
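Since the question mentions that the real block is five lines long, the same slurp-mode approach extends directly. For example (a sketch; the block contents line 2 through line 6 are made up here purely for illustration):
perl -0777 -pe 's{^line 2\nline 3\nline 4\nline 5\nline 6\n}{}gm' foo.txt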
With GNU sed:
sed -z 's/line 2\nline 3\n//g;s/line 2\nline 3\n$//' infile
This might work for you (GNU sed):
sed '/^line 2$/!b;N;/^line 3$/Md;P;D' file
If a line does not match the string line 2, print it and begin the next cycle. Otherwise, append the following line and if that does match the string line 3, delete both lines. Otherwise, print then delete the first line and repeat.

Awk/sed replace newlines

Intro:
I have been given a CSV file in which the field delimiter is the pipe character (i.e., |).
This file has a pre-defined number of fields (say N). I can discover the value of N by reading the header of the CSV file, which we can assume to be correct.
Problem:
Some of the fields contain a newline character by mistake, which makes the line appear shorter than required (i.e., it has M fields, with M < N).
What I need to create is a sh script (not bash) to fix those lines.
Attempted solution:
I tried creating the following script to try fixing the file:
if [ $# -ne 1 ]
then
echo "Usage: $0 <filename>"
exit
fi
# get first line
first_line=$(head -n 1 $1)
# get number of fields
num_separators=$(echo "$first_line" | tr -d -c '|' | awk '{print length}')
cat $1 | awk -v numFields=$(( num_separators + 1 )) -F '|' '
{
totRecords = NF/numFields
# loop over lines
for (record=0; record < totRecords; record++) {
output = ""
# loop over fields
for (i=0; i<numFields; i++) {
j = (numFields*record)+i+1
# replace newline with question mark
sub("\n", "?", $j)
output = output (i > 0 ? "|" : "") $j
}
print output
}
}
'
However, the newline character is still present.
How can I fix that problem?
Example of the CSV:
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a
newline
Foo|Bar|Baz
Expected output:
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a * newline
Foo|Bar|Baz
* I don't care about the replacement, it could be a space, a question mark, whatever except a newline or a pipe (which would create a new field)
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==1 { reqdNF = NF; printf "%s", $0; next }
{ printf "%s%s", (NF < reqdNF ? " " : ORS), $0 }
END { print "" }
$ awk -f tst.awk file.csv
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a newline
Foo|Bar|Baz
If that's not what you want then edit your question to provide more truly representative sample input and associated output.
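If you want the ' * ' marker from the expected output rather than a plain space as the join character, that is just a change to the ternary in tst.awk above (a minor variation, not part of the original answer):
BEGIN { FS=OFS="|" }
NR==1 { reqdNF = NF; printf "%s", $0; next }
{ printf "%s%s", (NF < reqdNF ? " * " : ORS), $0 }
END { print "" }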
Based on the assumption that the last field may contain one newline, using tac and sed:
tac file.csv | sed -n '/|/!{h;n;x;H;x;s/\n/ * /p;b};p' | tac
Output:
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a * newline
Foo|Bar|Baz
How it works: read the file backwards (sed is easier without forward references). If a line has no '|' separator (/|/!), run the block of code in curly braces {}; otherwise just print the line (p). The block of code:
h; stores the delimiter-less line in sed's hold buffer.
n; fetches another line, since we're reading backwards, this is the line that should be appended to.
x; exchange hold buffer and pattern buffer.
H; append pattern buffer to hold buffer.
x; exchange again so the newly appended lines are in the pattern buffer; now there are two lines in one buffer.
s/\n/ * /p; replace the middle linefeed with a " * ", now there's only one longer line; and print.
b start again, leave the code block.
Re-reverse the file with tac; done.

Grab nth occurrence in between two patterns using awk or sed

I have an issue where I want to parse through the output in a file and grab the nth occurrence of text between two patterns, preferably using awk or sed.
category
1
s
t
done
category
2
n
d
done
category
3
r
d
done
category
4
t
h
done
Let's just say for this example I want to grab the third occurrence of text between category and done; the output would essentially be:
category
3
r
d
done
This might work for you (GNU sed):
sed -n '/category/{:a;N;/done/!ba;x;s/^/x/;/^x\{3\}$/{x;p;q};x}' file
Turn off automatic printing by using the -n option. Gather up lines between category and done. Store a counter in the hold space and when it reaches 3 print the collection in the pattern space and quit.
Or if you prefer awk:
awk '/^category/,/^done/{if(++m==1)n++;if(n==3)print;if(/^done/)m=0}' file
Try doing this:
awk -v n=3 '/^category/{l++} (l==n){print}' file.txt
Or, more cryptic:
awk -v n=3 '/^category/{l++} l==n' file.txt
If your file is big:
awk -v n=3 '/^category/{l++} l>n{exit} l==n' file.txt
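Note that, as discussed for the RS-based answer further down, these count-only versions print everything from the nth category line up to the next category line, so any stray text between done and the following category would also be printed. If that matters, one variation (a sketch) is to stop at the closing pattern as well:
awk -v n=3 '/^category/{l++} l==n; l==n && /^done/{exit}' file.txt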
If your file doesn't contain any null characters, here's one way using GNU sed. This will find the third occurrence of a pattern range; however, you can easily modify it to get any occurrence you'd like.
sed -n '/^category/ { x; s/^/\x0/; /^\x0\{3\}$/ { x; :a; p; /done/q; n; ba }; x }' file.txt
Results:
category
3
r
d
done
Explanation:
Turn off default printing with the -n switch. Match the word 'category' at the start of a line. Swap the pattern space with the hold space and add a null character to the start of the pattern; this counts how many 'category' blocks have been seen. When the pattern then contains three null characters (the third occurrence), swap back, then loop, printing the contents of the pattern space until the closing pattern is matched. When that pattern is found, sed will quit; if it's not found, sed reads the next line of input and continues the loop.
awk -v tgt=3 '
/^category$/ { fnd=1; rec="" }
fnd {
rec = rec $0 ORS
if (/^done$/) {
if (++cnt == tgt) {
printf "%s",rec
exit
}
fnd = 0
}
}
' file
With GNU awk you can set the record separator to a regular expression:
<file awk 'NR==n+1 { print rt, $0 } { rt = RT }' RS='\\<category' ORS='' n=3
Output:
category
3
r
d
done
RT is the matched record separator. Note that the record relative to n will be off by one as the first record refers to what precedes the first RS.
Edit
As per Ed's comment, this will not work when the records have other data in between them, e.g.:
category
1
s
t
done
category
2
n
d
done
foo
category
3
r
d
done
bar
category
4
t
h
done
One way to get around this is to clean up the input with a second (or first) awk:
<file awk '/^category$/,/^done$/' |
awk 'NR==n+1 { print rt, $0 } { rt = RT }' RS='\\<category' ORS='' n=3
Output:
category
3
r
d
done
Edit 2
As Ed has noted in the comments, the above methods do not search for the ending pattern. One way to do this, which hasn't been covered by the other answers, is with getline (note that there are some caveats with awk getline):
<file awk '
/^category$/ {
v = $0
while(!/^done$/) {
if(!getline)
exit
v = v ORS $0
}
if(++nr == n)
print v
}' n=3
On one line:
<file awk '/^category$/ { v = $0; while(!/^done$/) { if(!getline) exit; v = v ORS $0 } if(++nr == n) print v }' n=3
