Awk/sed replace newlines - shell

Intro:
I have been given a CSV file in which the field delimiter is the pipe characted (i.e., |).
This file has a pre-defined number of fields (say N). I can discover the value of N by reading the header of the CSV file, which we can assume to be correct.
Problem:
Some of the fields contain a newline character by mistake, which makes the line appear shorter than required (i.e., it has M fields, with M < N).
What I need to create is a sh script (not bash) to fix those lines.
Attempted solution:
I tried creating the following script to try fixing the file:
if [ $# -ne 1 ]
then
echo "Usage: $0 <filename>"
exit
fi
# get first line
first_line=$(head -n 1 $1)
# get number of fields
num_separators=$(echo "$first_line" | tr -d -c '|' | awk '{print length}')
cat $1 | awk -v numFields=$(( num_separators + 1 )) -F '|' '
{
totRecords = NF/numFields
# loop over lines
for (record=0; record < totRecords; record++) {
output = ""
# loop over fields
for (i=0; i<numFields; i++) {
j = (numFields*record)+i+1
# replace newline with question mark
sub("\n", "?", $j)
output = output (i > 0 ? "|" : "") $j
}
print output
}
}
'
However, the newline character is still present.
How can I fix that problem?
Example of the CSV:
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a
newline
Foo|Bar|Baz
Expected output:
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a * newline
Foo|Bar|Baz
* I don't care about the replacement, it could be a space, a question mark, whatever except a newline or a pipe (which would create a new field)

$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==1 { reqdNF = NF; printf "%s", $0; next }
{ printf "%s%s", (NF < reqdNF ? " " : ORS), $0 }
END { print "" }
$ awk -f tst.awk file.csv
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a newline
Foo|Bar|Baz
If that's not what you want then edit your question to provide more truly representative sample input and associated output.

Based on the assumption that the last field may contain one newline. Using tac and sed:
tac file.csv | sed -n '/|/!{h;n;x;H;x;s/\n/ * /p;b};p' | tac
Output:
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a * newline
Foo|Bar|Baz
How it works. Read the file backwards, sed is easier without forward references. If a line has no '|' separator, /|/!, run the block of code in curly braces {};, otherwise just p print the line. The block of code:
h; stores the delimiter-less line in sed's hold buffer.
n; fetches another line, since we're reading backwards, this is the line that should be appended to.
x; exchange hold buffer and pattern buffer.
H; append pattern buffer to hold buffer.
x; exchange newly appended lines to pattern buffer, now there's two lines in one buffer.
s/\n/ * /p; replace the middle linefeed with a " * ", now there's only one longer line; and print.
b start again, leave the code block.
Re-reverse the file with tac; done.

Related

Remove first two lines, last two lines and space from file and add quotes on each line and replace newline with commas in shell script

I have to input.txt file which needs to be formatted by shell script with following condition
remove first two lines and
last two lines
remove all spaces in each
lines(each line have two spaces at
beginning and one space at end)
Each line should be within single
quotes(' ')
At last replace newline($) with
commas.
(original)
input.txt
sql
--------
Abce
Bca
Efr
-------
Row (3)
Desired output file
output.txt
'Abce','Bca','Efr'
I have tried using following commands
Sed -i 1,2d input.txt > input.txt
Sed "$(( $(wc -l <input.txt) -2+1)), $ d" Input.txt > input.txt
Sed ':a;N;$!ba;s/\n/, /g' input.txt > output.txt
But i get blank output.txt
Would you please try the following:
mapfile -t ary < <(tail -n +3 input.txt | head -n -2 | sed -E "s/^[[:blank:]]*/'/; s/[[:blank:]]*$/'/")
(IFS=,; echo "${ary[*]}")
tail -n +3 outputs lines after the 3rd line, inclusive.
head -n -2 outputs lines excluding the last 2 lines.
sed -E "s/^[[:blank:]]*/'/" removes leading whitespaces and prepends
a single quote.
Similarly the sed command "s/[[:blank:]]*$/'/" removes trailing
whitespaces and appends a single quote.
The syntax <(command ..) is a process substitution and the
output of the commands within the parentheses is fed to the mapfile
via the redirect.
mapfile -t ary reads lines from the standard input into the array
variable named ary.
echo "${ary[*]}" expands to a single string with the contents of
the array ary separated by the value of IFS, which is just assigned
to a comma.
The assignment of IFS and the array expansion are enclosed with
parentheses to be executed in the subshell. This prevents the IFS
to be modified in the current process.
With your shown samples, please try following awk program. Written and tested in GNU awk, should work with any version.
awk -v s1="'" -v lines="$(wc -l < Input_file)" '
BEGIN{ OFS="," }
FNR==(lines-1) {
print val
exit
}
FNR>2{
sub(/^[[:space:]]+/,"")
val=(val?val OFS:"") (s1 $0 s1)
}
' Input_file
Explanation: Adding detailed explanation for above code, this is only for explanation purposes.
awk -v s1="'" -v lines="$(wc -l < Input_file)" ' ##Starting awk program, setting s1 variable to ' and creating lines which has total number of lines in it, using wc -l command on Input_file file.
BEGIN{ OFS="," } ##Setting OFS to comma in BEGIN section of this program.
FNR==(lines-1) { ##Checking condition if its 2nd last line of Input_file.
print val ##Then printing val here.
exit ##exiting from program from here.
}
FNR>2{ ##Checking condition if FNR is greater than 2 then do following.
sub(/^[[:space:]]+/,"") ##Substituting initial spaces with NULL here.
val=(val?val OFS:"") (s1 $0 s1) ##Creating val which has ' current line ' in it and keep adding it in val.
}
' Input_file ##Mentioning Input_file name here.
If you know the input is small enough to fit in memory:
$ awk '
NR>4 { gsub(/^ *| *$/,"\047",p2); out=out sep p2; sep="," }
{ p2=p1; p1=$0 }
END { print out }
' input.txt
'Abce','Bca','Efr'
Otherwise:
$ awk '
NR>4 { gsub(/^ *| *$/,"\047",p2); printf "%s%s", sep, p2; sep="," }
{ p2=p1; p1=$0 }
END { print "" }
' input.txt
'Abce','Bca','Efr'
Either script will work using any awk in any shell on every Unix box.
This might work for you (GNU sed):
sed -E '1,2d;$!H;$!d;x;s/^\s*(.*)\s*$/'\''\1'\''/mg;s/\n[^\n]*$//;y/\n/,/' file
Delete the first two lines.
Append each line to the hold space, except for the last (this means the second from last line will still be present - see later).
Delete all lines except for the last.
Swap to the hold space.
Remove all spaces either side of the words on each line and surround those words by single quotes.
Remove the last line and its newline.
Replace all newlines by commas.
The first sed -i overwrites input.txt with an empty file. You can't write output back to the file you are reading, and sed -i does not produce any output anyway.
The minimal fix is to take out the -i and string together the commands into a pipeline; but of course, sed allows you to combine the commands into a single script.
len=$(wc -l <input.txt)
sed -e '1,2d' -e "$((len - 3))"',$d' \
-e ':a' \
-e 's/^ \(.*\) $/'"'\\1'/" \
-e N -e '$!ba' -e 's/\n/, /g' input.txt >output.txt
(Untested; if your sed does not allow multiple -e options, needs refactoring to use a single string with semicolons or newlines between the commands.)
This is hard to write and debug and brittle because of the ways you have to combine the quoting features of the shell with the requirements of sed and this particular script, but also more inherently because sed is a terse and obscure language.
A much more legible and maintainable solution is to switch to Awk, which allows you to express the logic in more human terms, and avoid having to pull in support from the shell for simple tasks like arithmetic and string formatting.
awk 'FNR > 2 { sub(/^ /, ""); sub(/ $/, "");
a[++i] = sprintf("\047%s\047,", $0); }
END { for(j=1; j < i-1; ++j) printf "%s", a[j] }' input.txt >output.txt
This literally replaces all newlines with commas; perhaps you would in fact like to print a newline instead of the comma on the last line?
awk 'FNR > 2 { sub(/^ /, ""); sub(/ $/, "");
a[++i] = sprintf("%s\047%s\047", sep, $0); sep="," }
END { for(j=1; j < i-1; ++j) printf "%s", a[j]; printf "\n" }' input.txt >output.txt
If the input file is really large, you might want to refactor this to not keep all the lines in memory. The array a collects the formatted output and we print all its elements except the last two in the END block.
sed -E '
/^-+$/,/^-+$/!d
//d
s/^[[:space:]]*|[[:space:]]*$/'\''/g
' input.txt |
paste -sd ,
This uses a trick that doesn't work on all sed implementations, to print the lines between two patterns (the dashes in this case), excluding those patterns.
On the plus side if the ---- pattern is at a different line number, it still works. Down side is it breaks, if that pattern (a line containing only dashes) occurs an odd number of times (ie. not in pairs, that wrap the lines you want).
Then sub line start and end (including white space) with single quotes.
Finally pipe to paste to sub the new lines with commas, excluding a trailing comma.
Using sed
$ sed "1,2d; /-/,$ d; s/\s\+//;s/.*/'&'/" input_file | sed -z 's/\n/,/g;s/,$/\n/'
'Abce','Bca','Efr'
I'll post a sed solution which is rather light.
sed '$d' input.txt | sed "\$d; 1,2d; s/^\s*\|\s*$/'/g" | paste -sd ',' > output.txt
$d Remove last line with first sed
\$d Remove the last line. $ escaped with backslash as we are within double-quotes.
1,2d Remove the first two lines.
s/^\s*\|\s*$/'/g Replace all leading and trailing whitespace with single quotes.
Use paste to concatenate to a single, comma delimited strings.
If we know that the relevant lines always start with two spaces, then it can even be simplified further.
sed -n "s/\s*$/'/; s/^ /'/p" input.txt | paste -sd ',' > output.txt
-n suppress printing lines unless told to
s/\s*$/'/ replace trailing whitespace with single quotes
s/^ /'/p replace two leading spaces and print lines that match
paste to concat
Then an awk solution:
awk -v i=1 -v q=\' 'FNR>2 {
gsub(/^[[:space:]]*|[[:space:]]*$/, q)
a[i++]=$0
} END {
for(i=1; i<=length(a)-3; i++)
printf "%s,", a[i]
print a[i++]
}' input.txt > output.txt
-v i=1 create an awk variable starting at one
-v q=\' create an awk variable for the single quote character
FNR>2 { ... tells it to only process line 3+
gsub(/^[[:space:]]*|[[:space:]]*$/, q) substitute leading and trailing whitespace with single quotes
a[i++]=$0 add line to array
END { ... Process the rest after reaching end of file
for(i=1; i<=length(a)-3; i++) take the length of the array but subtract three -- representing the last three lines
printf "%s,", a[i] print all but last three entries comma delimited
print a[i++] print next entry and complete the script (skipping the last two entries)
Not a one liner but works
sed "s/^ */\'/;s/\$/\',/;1,2d;N;\$!P;\$!D;\$d" | sed ' H;1h;$!d;x;s/\n//g;s/,$//'
Explanation:
s/^ */\'/;s/\$/\',/ ---> Adds single quotes and comma
N;$!P;$!D;$d ---> Deletes last two lines
H;1h;$!d;x;s/\n//g;s/,$//' ---> Loads entire file and merge all lines and remove last comma

Selectively reformatting a file with spaces and \n

I have multiple files in the following format. This one has 3 sequences (number of sequences vary in all files, but always end in ".") with 40 positions each, as indicated by the numbers in the first line. From the beginning of the lines (except the first one) there are the names of the sequences:
3 40
00076284. ATGTCTGTGG TTCTTTAACC
00892634. TTGTCTGAGG TTCGTAAACC
00055673. TTGTCTGAGG TCCGTGAACC
GCCGGGAACA TCCGCAAAAA
ACCGTGAAAC GGGGTGAACT
TCCCCCGAAC TCCCTGAACG
I need to convert it to this format, where the sequences are continuous, with no spaces nor \n, and on a new line after their names.The only spaces that should remain are between the two numbers in the first line.
3 40
00076284.
ATGTCTGTGGTTCTTTAACCGCCGGGAACATCCGCAAAAA
00892634.
TTGTCTGAGGTTCGTAAACCACCGTGAAACGGGGTGAACT
00055673.
TTGTCTGAGGTCCGTGAACCTCCCCCGAACTCCCTGAACG
Tried sed to delete spaces and \n's but don't know how to apply it after the first line and how to avoid making one huge line.
Thanks
Here's a shell script that may provide what you need:
head -1 input
awk '
NR == 1 { sequences = $1 ; positions = $2 ; next }
{
if ( $1 ~ /^[0-9]/ ) {
sid = $1 ; $1 = "" ; sequence_name[ NR - 1 ] = sid
sequence[ NR - 1 ] = $0
} else {
sequence[ ( NR - 1 ) % ( sequences + 1 ) ] = sequence[ (NR-1) % ( sequences + 1 ) ] " " $0
}
}
END {
for ( x = 1 ; x <= length( sequence_name ) ; x++ )
{
print sequence_name[x]
print sequence[x]
}
}' input | tr -d ' '
I added head -1 to the top of the shell just to get the first line out of your file. I couldn't output the first line within the awk script because of the pipe to tr -d ' '.
I think this should work, but my output is longer since if I actually concat all the last "orphan" sequences I get a way longer line.
cat input.txt | awk '/^[0-9]+ [0-9]+$/{printf("%s\n",$0); next} /[0-9]+[.]/{ printf("\n%s\n",$1);for(i=2; i<=NF;i++){printf("%s",$i)}; next} /^ */{ for(i=1; i<=NF;i++){printf("%s",$i)}; next;}'
3 40
Please try and let me know.
Remember the position of empty line and merge the lines before empty line with those after:
awk '
NR==1{print;next}
NR!=1 && !empty{arr[NR]=$1 "\n" $2 $3}
/^$/{empty=NR-1;next}
NR!=1 && empty{printf "%s%s%s\n", arr[NR-empty], $1, $2}
' file
My second solution without awk: Merge the file with itself using empty line as separator
cat >file <<EOF
3 40
00076284. ATGTCTGTGG TTCTTTAACC
00892634. TTGTCTGAGG TTCGTAAACC
00055673. TTGTCTGAGG TCCGTGAACC
GCCGGGAACA TCCGCAAAAA
ACCGTGAAAC GGGGTGAACT
TCCCCCGAAC TCCCTGAACG
EOF
head -n1 file
paste <(sed -n '1!{ /^$/q;p; }' file) <(sed -n '1!{ /^$/,//{/^$/!p}; }' file) |
sed 's/[[:space:]]//g; s/\./.\n/'
Will output:
3 40
00076284.
ATGTCTGTGGTTCTTTAACCGCCGGGAACATCCGCAAAAA
00892634.
TTGTCTGAGGTTCGTAAACCACCGTGAAACGGGGTGAACT
00055673.
TTGTCTGAGGTCCGTGAACCTCCCCCGAACTCCCTGAACG
:
head -n1 file output first line
sed -n '1!{ /^$/q;p; }' file
1! - don't output first line
/^$/q - quit when empty line
p print everything else
sed -n '1!{ /^$/,//{/^$/!p}; }' file
1! - ignore first line
/^$/,// - from empty line until the end
/^$/!p - output if not an empty tline
paste <(..) <(...) - merge the two seds with a tab
sed 's/[[:space:]]//g; s/\./.\n/
s/[[:space:]]//g; remove all spaces
s/\./.\n/ replace a comma with a comma and a newline.

printing contents of variable to a specified line in outputfile with sed/awk

I have been working on a script to concatenate multiple csv files into a single, large csv. The csv's contain names of folders and their respective sizes, in a 2-column setup with the format "Size, Projectname"
Example of a single csv file:
49747851728,ODIN
32872934580,_WORK
9721820722,LIBRARY
4855839655,BASELIGHT
1035732096,ARCHIVE
907756578,USERS
123685100,ENV
3682821,SHOTGUN
1879186,SALT
361558,SOFTWARE
486,VFX
128,DNA
For my current test I have 25 similar files, with different numbers in the first column.
I am trying to get this script to do the following:
Read each csv file
For each Project it sees, scan the outputfile if that Project was already printed to the file. If not, print the Projectname
For each file, for each Project, if the Project was found, print the Size to the output csv.
However, I need the Projects to all be on textline 1, comma separated, so I can use this outputfile as input for a javascript graph. The Sizes should be added in the column below their projectname.
My current script:
csv_folder=$(echo "$1" | sed 's/^[ \t]*//;s/\/[ \t]*$//')
csv_allfiles="$csv_folder/*.csv"
csv_outputfile=$csv_folder.csv
echo -n "" > $csv_outputfile
for csv_inputfile in $csv_allfiles; do
while read line && [[ $line != "" ]]; do
projectname=$(echo $line | sed 's/^\([^,]*\),//')
projectfound1=$(cat $csv_outputfile | grep -w $projectname)
if [[ ! $projectfound1 ]]; then
textline=1
sed "${textline}s/$/${projectname}, /" >> $csv_outputfile
for csv_foundfile in $csv_allfiles; do
textline=$(echo $textline + 1 | bc )
projectfound2=$(cat $csv_foundfile | grep -w $projectname)
projectdata=$(echo $projectfound2 | sed 's/\,.*$//')
if [[ $projectfound2 ]]; then
sed "${textline}s/$/$projectdata, /" >> $csv_outputfile
fi
done
fi
done < $csv_inputfile
done
My current script finds the right information (projectname, projectdata) and if I just 'echo' those variables, it prints the correct data to a file. However, with echo it only prints in a long list per project. I want it to 'jump back' to line 1 and print the new project at the end of the current line, then run the loop to print data at the end of each next line.
I was thinking this should be possible with sed or awk. sed should have a way of inserting text to a specific line with
sed '{n}s/search/replace/'
where {n} is the line to insert to
awk should be able to do the same thing with something like
awk -v l2="$textline" -v d="$projectdata" 'NR == l2 {print d} {print}' >> $csv_outputfile
However, while replacing the sed commands in the script with
echo $projectname
echo $projectdata
spit out the correct information (so I know my variables are filled correctly) the sed and awk commands tend to spit out the entire contents of their current inputcsv; not just the line that I want them to.
Pastebin outputs per variant of writing to file
https://pastebin.com/XwxiAqvT - sed output
https://pastebin.com/xfLU6wri - echo, plain output (single column)
https://pastebin.com/wP3BhgY8 - echo, detailed output per variable
https://pastebin.com/5wiuq53n - desired output
As you see, the sed output tends to paste the whole contents of inputcsv, making the loop stop after one iteration. (since it finds the other Projects after one loop)
So my question is one of these;
How do I make sed / awk behave the way I want it to; i.e. print only the info in my var to the current textline, instead of the whole input csv. Is sed capable of this, printing just one line of variable? Or
Should I output the variables through 'echo' into a temp file, then loop over the temp file to make sed sort the lines the way I want them to? (Bear in mind that more .csv files will be added in the future, I can't just make it loop x times to sort the info)
Is there a way to echo/print text to a specific text line without using sed or awk? Is there a printf option I'm missing? Other thoughts?
Any help would be very much appreciated.
A way to accomplish this transposition is to save the data to an associative array.
In the following example, we use a two dimensional array to keep track of our data. Because ordering seems to be important, we create a col array and create a new increment whenever we see a new projectname -- this col array ends up being our first index into our data. We also create a row array which we increment whenever we see a new data for the current column. The row number is our second index into data. At the end, we print out all the records.
#! /usr/bin/awk -f
BEGIN {
FS = ","
OFS = ", "
rows=0
cols=0
head=""
split("", data)
split("", row)
split("", col)
}
!($2 in col) { # new project
if (head == "")
head = $2
else
head = head OFS $2
i = col[$2] = cols++
row[i] = 0
}
{
i = col[$2]
j = row[i]++
data[i,j] = $1
if (j > rows)
rows = j
}
END {
print head
for (j=0; j<=rows; ++j) {
if ((0,j) in data)
x = data[0,j]
else
x = ""
for (i=1; i<cols; ++i) {
if ((i,j) in data)
x = x OFS data[i,j]
else
x = x OFS
}
print x
}
}
As a bonus, here is a script to reproduce the detailed output from one of your pastebins.
#! /usr/bin/awk -f
BEGIN {
FS = ","
split("", data) # accumulated data for a project
split("", line) # keep track of textline for data
split("", idx) # index into above to maintain input order
sz = 0
}
$2 in idx { # have seen this projectname
i = idx[$2]
x = ORS "textline = " ++line[i]
x = x ORS "textdata = " $1
data[i] = data[i] x
next
}
{ # new projectname
i = sz++
idx[$2] = i
x = "textline = 1"
x = x ORS "projectname = " $2
x = x ORS "textline = 2"
x = x ORS "projectdata = " $1
data[i] = x
line[i] = 2
}
END {
for (i=0; i<sz; ++i)
print data[i]
}
Fill parray with project names and array with values, then print them with bash printf, You can choose column width in printf command (currently 13 characters - %13s)
#!/bin/bash
declare -i index=0
declare -i pindex=0
while read project; do
parray[$pindex]=$project
index=0
while read;do
array[$pindex,$index]="$REPLY"
index+=1
done <<< $(grep -h "$project" *.csv|cut -d, -f1)
pindex+=1
done <<< $(cat *.csv|cut -d, -f 2|sort -u)
maxi=$index
maxp=$pindex
for (( pindex=0; $pindex < $maxp ; pindex+=1 ));do
STR="%13s $STR"
VAL="$VAL ${parray[$pindex]}"
done
printf "$STR\n" $VAL
for (( index=0; $index < $maxi;index+=1 ));do
STR=""; VAL=""
for (( pindex=0; $pindex < $maxp;pindex+=1 )); do
STR="%13s $STR"
VAL="$VAL ${array[$pindex,$index]}"
done
printf "$STR\n" $VAL
done
If you are OK with the output being sorted by name this one-liner might be of use:
awk 'BEGIN {FS=",";OFS=","} {print $2,$1}' * | sort | uniq
The files have to be in the same directory. If not a list of files replaces the *. First it exchanges the two fields. Awk will take a list of files and do the concatenation. Then sort the lines and print just the unique lines. This depends on the project size always being the same.
The simple one-liner above gives you one line for each project. If you really want to do it all in awk and use awk write the two lines, then the following would be needed. There is a second awk at the end that accumulates each column entry in an array then spits it out at the end:
awk 'BEGIN {FS=","} {print $2,$1}' *| sort |uniq | awk 'BEGIN {n=0}
{p[n]=$1;s[n++]=$2}
END {for (i=0;i<n;i++) printf "%s,",p[i];print "";
for (i=0;i<n;i++) printf "%s,",s[i];print ""}'
If you have the rs utility then this can be simplified to
awk 'BEGIN {FS=","} {print $2,$1}' *| sort |uniq | rs -C',' -T

how to find continuous blank lines and convert them to one

I have a file -- a, and exist some continues blank line(more than one), see below:
cat a
1
2
3
4
5
So first I want to know if exist continues blank lines, I tried
cat a | grep '\n\n\n'
nothing output. So I have to use below manner
vi a
:set list
/\n\n\n
So I want to know if exist other shell command could easily implement this?
then if exist two and more blank lines I want to convert them to one? see below
1
2
3
4
5
at first I tried below shell
sed 's/\n\n\(\n\)*/\n\n/g' a
it does not work, then I tried this shell
cat a | tr '\n' '$' | sed 's/$$\(\$\)*/$$/g' | tr '$' '\n'
this time it works. And also I want to know if exist other manner could implement this?
Well, if your cat implementation supports
-s, --squeeze-blank
suppress repeated empty output lines
then it is as simple as
$ cat -s a
1
2
3
4
5
Also, both -s and -n for numbering lines is likely to be available with less command as well.
remark: lines containing only blanks will not be suppressed.
If your cat does not support -s then you could use:
awk 'NF||p; {p=NF}'
or if you want to guarantee a blank line after every record, including at the end of the output even if none was present in the input, then:
awk -v RS= -v ORS='\n\n' '1'
If your input contains lines of all white space and you want them to be treated just like lines of non white space (like cat -s does, see the comments below) then:
awk '/./||p; {p=/./}'
and to guarantee a blank line at the end of the output:
awk '/./||p; {p=/./} END{if (p) print ""}'
This awk command should work to produce an output with 2 line breaks at each line:
awk -v RS= '{printf "%s%s", $0, ORS (RT ~ /\n{2,}/ ? ORS : "")}' file
1
2
3
4
5
This awk is using:
-v RS=: sets empty input record separator so that each empty line becomes record separator
printf "%s%s", $0, ORS: prints each line with single line break
(RT ~ /\n{2,}/ ? ORS : ""): prints additional line break if input record separator has more than 2 line breaks
You may use perl as well in slurp mode:
perl -0777 -pe 's/\R{2,}/\n\n/g' file
1
2
3
4
5
Command breakup:
-0777 Slurp mode to read entire file
's/\R{2,}/\n\n/g' Match 2 or more line breaks and replace by 2 line breaks
You can --squeeze-repeats with tr and then use sed to insert just a new line:
<a tr -s '\n' | sed 'G'
remark: This is a copy from my answer here
A very quick way is using awk
awk 'BEGIN{RS="";ORS="\n\n"}1'
How does this work:
awk knowns the concept records (which is by default lines) and you can define a record by its record separator RS. If you set the value of RS to an empty string, it will match any multitude of empty lines as a record separator. The value ORS is the output record separator. It states which separator should be printed between two consecutive records. This is set to two <newline> characters. Finally, the statement 1 is a shorthand for {print $0} which prints the current record followed by the output record-separator ORS.
note: This will, just as cat -s keep lines with only blanks as actual lines and will not suppress them.
Another awk solution:
awk 'NF' ORS="\n\n" a
1
2
3
4
5
It checks if the line is not empty by testing if NF (number of fields) is not zero. It it matches, print the line as default action. ORS (output record separator) is set to 2 newline characters, so there is an empty line between non-empty lines.
1) awk solution
$ echo "a\n\n\nb\n\n\nc\n\n\n" | awk 'BEGIN{b=0} /^$/{b=1;next} {printf "%s%s\n", b==1?"\n":"",$0} {b=0} END{printf "%s",b==1?"\n":""}'
a
b
c
$
2) sed solution
sed '
/^$/{ ${ p; d; }; H; d; }
/^$/!{ x; s/^\(\n\{1,\}\)$/\1/; ts; Tf; }
:s { x; s/\(.*\)/\n\1/; x; s/.*//; x; p; d; }
:f { x; p; d; }
'
SED Explanation:
/^$/{ ${ p; d; }; H; d; }
--If input is blank, if it is the last line, just print, else append to the holdspace and delete the pattern space and start new cycle
/^$/!{ x; s/^\(\n\{1,\}\)$/\1/; ts; Tf; }
--If input is not blank, exchange content of the p space and h space and check if h space contains \n. if yes, jump to s, if not jump to f
:s { x; s/\(.*\)/\n\1/; x; s/.*//; x; p; d; }
--If blank lines are present in h space, then append \n to p space, then clear hold space , then print p space and delete p space
:f { x; p; d; }
--If blank lines are absent in h space, then print p space and delete p space

Shell command for inserting a newline every nth element of a huge line of comma separated strings

I have a one line csv containing a lot of elements. Now I want to insert a newline after every n-th element in a bash/shell script.
Bonus: I'd like to prepend a line with descriptors and using the count of descriptors as 'n'.
Example:
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221","94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713", (...)
into
"id","lon","lat"
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713"
(...)
Edit: I made a first attempt, but the comma delimiters are missing then:
(...) | xargs --delimiter=',' -n3
"4908041eee3d4bf98e606140b21ebc89.16" "7.38974601030349731" "45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16" "7.38845318555831909" "45.31425320325949713"
trying to replace the " " with ","
(...) | xargs --delimiter=',' -n3 -i echo ${{}//" "/","}
-bash: ${{}//\": bad substitution
I would go with Perl for that!
Let's assume this outputs something like your file:
printf "1,2,3,4,5,6,7,8,9,10"
1,2,3,4,5,6,7,8,9,10
Then you could use this if you wanted every 4th comma replaced:
printf "1,2,3,4,5,6,7,8,9,10" | perl -pe 's{,}{++$n % 4 ? $& : "\n"}ge'
1,2,3,4
5,6,7,8
9,10
cat data.txt | xargs -n 3 -d, | sed 's/ /,/g'
With n=3 here and input filename is called data.txt
Note: What distinguishes this solution is that it derives the number of output columns from the number of columns in the header line.
Assuming that the fields in your CSV input have no embedded , instances (in which case you'd need a proper CSV parser), try awk:
awk -v RS=, -v header='"id","lon","lat"' '
BEGIN {
print header
colCount = 1 + gsub(",", ",", header)
}
{
ORS = NR % colCount == 0 ? "\n" : ","
print
}
' file.csv
Note that if the input file ends with a newline (as is typical), you'll get an extra newline trailing the output.
With GNU Awk or Mawk (but not BSD/OSX Awk, which only supports literal, single-character RS values), you can fix this as follows:
awk -v RS='[,\n]' -v header='"id","lon","lat"' '
BEGIN {
print header
colCount = 1 + gsub(",", ",", header)
}
{
ORS = NR % colCount == 0 ? "\n" : ","
print
}
' file.csv
BSD/OSX Awk workaround: stick with -v RS=, and replace file.csv with <(tr -d '\n' < file.csv) in order to remove all newlines from the input first.
Assuming your input file is named input:
echo id,lon,lat; awk '{ORS=NR%3?",":"\n"}1' RS=, input

Resources