use sed to merge lines and add comma - bash

I found several related questions, but none of them fits what I need, and since I am a real beginner, I can't figure it out.
I have a text file with entries like this, separated by a blank line:
example entry &with/ special characters
next line (any characters)

next %*entry
more words
I would like the output to merge the lines, put a comma between them, and delete the empty lines. That is, the example should look like this:
example entry &with/ special characters, next line (any characters)
next %*entry, more words
I would prefer sed, because I know it a little bit, but am also happy about any other solution on the linux command line.

Improved per Kent's elegant suggestion:
awk 'BEGIN{RS="";FS="\n";OFS=","}{$1=$1}7' file
which allows any number of lines per block, rather than the 2 rigid lines per block I had. Thank you, Kent. Note: The 7 is Kent's trademark... any non-zero expression will cause awk to print the entire record, and he likes 7.
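A small sketch of why the `$1=$1` assignment is there: assigning to a field makes awk rebuild the record with OFS between the fields, whereas without it the record is printed untouched. The function names and the sample input are made up for illustration.

```shell
# Illustrative helpers; the names are not part of the answer above.

# Without the assignment, each record is printed as it was read,
# so the lines of a block stay on separate lines.
without_rebuild() {
  awk 'BEGIN{RS="";FS="\n";OFS=","} 1' "$@"
}

# With $1=$1, awk rebuilds the record and joins the fields with OFS.
with_rebuild() {
  awk 'BEGIN{RS="";FS="\n";OFS=","} {$1=$1} 1' "$@"
}

printf 'a\nb\n\nc\nd\n' | without_rebuild   # a, b, c, d each on its own line
printf 'a\nb\n\nc\nd\n' | with_rebuild      # a,b and c,d
```

The trailing `1` plays the same role as the `7`: any true pattern with no action prints the record.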
You can do this with awk:
awk 'BEGIN{RS="";FS="\n";OFS=","}{print $1,$2}' file
That sets the record separator to blank lines, the field separator to newlines and the output field separator to a comma.
Output:
example entry &with/ special characters,next line (any characters)
next %*entry,more words
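The output above has no space after the comma, while the question asks for one. A minimal variation, assuming the same blank-line-separated input, is to set OFS to a comma plus a space (the function name `merge_blocks` is only for illustration):

```shell
# Sketch: same paragraph-mode idea, with ", " as the output field separator.
merge_blocks() {
  awk 'BEGIN{RS=""; FS="\n"; OFS=", "} {$1=$1} 1' "$@"
}

printf '%s\n' \
  'example entry &with/ special characters' \
  'next line (any characters)' \
  '' \
  'next %*entry' \
  'more words' | merge_blocks
# example entry &with/ special characters, next line (any characters)
# next %*entry, more words
```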

Simple sed command,
sed ':a;N;$!ba;s/\n/, /g;s/, , /\n/g' file
:a;N;$!ba;s/\n/, /g -> According to this answer, this code replaces all the newlines with ", " (a comma and a space).
So after running only the first command, the output would be
example entry &with/ special characters, next line (any characters), , next %*entry, more words
s/, , /\n/g -> Replacing ", , " with a newline in the above output gives the desired result.
example entry &with/ special characters, next line (any characters)
next %*entry, more words
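The two stages can be run separately to see the intermediate result. This is a sketch assuming GNU sed (for `\n` in the replacement and `;` after labels); the function names are invented for the example.

```shell
# Stage 1: slurp the whole input and turn every newline into ", ".
join_all() {
  sed ':a;N;$!ba;s/\n/, /g' "$@"
}

# Stage 2: a blank line now shows up as ", , "; turn it back into a newline.
split_blocks() {
  sed 's/, , /\n/g'
}

printf 'a\nb\n\nc\nd\n' | join_all                  # a, b, , c, d
printf 'a\nb\n\nc\nd\n' | join_all | split_blocks   # a, b and c, d
```

One consequence of this design is that a line which already contains ", , " would also be split in stage 2.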

This might work for you (GNU sed):
sed ':a;$!N;/.\n./s/\n/, /;ta;/^[^\n]/P;D' file
Append the next line to the current line and if there are characters either side of the newline substitute the newline with a comma and a space and then repeat. Eventually an empty line or the end-of-file will be reached, then only print the next line if it is not empty.
Another version, a little more sophisticated (allowing for white space in the empty line), would be:
sed ':a;$!N;/^\s*$/M!s/\n/, /;ta;/\`\s*$/M!P;D' file

sed -n '1h;1!H
$ {x
s/\([^[:cntrl:]]\)\n\([^[:cntrl:]]\)/\1, \2/g
s/\(\n\)\n\{1,\}/\1/g
p
}' YourFile
This applies all the changes after loading the whole file into the buffer. It could also be done "on the fly" while reading the file, depending on whether the current line is empty or not.
use -e on GNU sed
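An "on the fly" variant might look roughly like the following. It is sketched in awk rather than sed for readability, so it is a swapped-in technique and not the author's script; the function name is hypothetical. Lines are collected until a blank line closes the block, so the whole file never has to sit in memory.

```shell
# Sketch: join the lines of each block with ", " as they are read.
merge_on_the_fly() {
  awk '
    NF == 0 {                       # blank (or whitespace-only) line: block ends
      if (buf != "") print buf
      buf = ""
      next
    }
    { buf = (buf == "" ? $0 : buf ", " $0) }   # append the line to the block
    END { if (buf != "") print buf }           # flush the last block
  ' "$@"
}

printf 'a\nb\n\n\nc\nd\ne\n' | merge_on_the_fly
# a, b
# c, d, e
```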

Related

Using shell scripts to remove all commas except for the first on each line

I have a text file consisting of lines which all begin with a numerical code, followed by one or several words, a comma, and then a list of words separated by commas. I need to delete all commas in every line apart from the first comma. For example:
1.2.3 Example question, a, question, that, is, hopefully, not, too, rudimentary
which should be changed to
1.2.3 Example question, a question that is hopefully not too rudimentary
I have tried using sed and shell scripts to solve this, and I can figure out how to delete the first comma on each line (1) and how to delete all commas (2), but not how to delete only the commas after the first comma on each line.
(1)
while read -r line
do
echo "${line/,/}"
done <"filename.txt" > newfile.txt
mv newfile.txt filename.txt
(2)
sed 's/,//g' filename.txt > newfile.txt
You need to capture the first comma, and then remove the others. One option is to change the first comma into some otherwise unused character (Control-A for example), then remove the remaining commas, and finally replace the replacement character with a comma:
sed -e $'s/,/\001/; s/,//g; s/\001/,/'
(using Bash ANSI C quoting — the \001 maps to Control-A).
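For shells without ANSI C quoting, the same placeholder idea can be sketched by generating the Control-A byte with printf. The function and variable names are illustrative, and the sketch assumes the input never contains a Control-A itself.

```shell
# Sketch: protect the first comma with a placeholder byte, drop the rest,
# then restore the protected comma.
keep_first_comma() {
  ctrl_a=$(printf '\001')   # placeholder, assumed absent from the input
  sed -e "s/,/$ctrl_a/" -e 's/,//g' -e "s/$ctrl_a/,/" "$@"
}

printf '%s\n' '1.2.3 Example question, a, question, that, is' | keep_first_comma
# 1.2.3 Example question, a question that is
```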
An alternative mechanism uses sed's labels and branches, as illustrated by Wiktor Stribiżew's answer.
If using GNU sed, you can specify a number in the flags of sed's s/// command along with g to indicate which match to start replacing at:
$ sed 's/,//2g' <<<'1.2.3 Example question, a, question, that, is, hopefully, not, too, rudimentary'
1.2.3 Example question, a question that is hopefully not too rudimentary
Its manual says:
Note: the POSIX standard does not specify what should happen when you mix the g and NUMBER modifiers, and currently there is no widely agreed upon meaning across sed implementations. For GNU sed, the interaction is defined to be: ignore matches before the NUMBERth, and then match and replace all matches from the NUMBERth on.
so if you're using a different sed, your mileage may vary. (OpenBSD and NetBSD seds raise an error instead, for example).
You can use
sed ':a; s/^\([^,]*,[^,]*\),/\1/;ta' filename.txt > newfile.txt
Details
:a - sets an a label
s/^\([^,]*,[^,]*\),/\1/ - finds 0+ non-commas at the start of string, a comma and again 0+ non-commas, capturing this substring into Group 1, and then just matching a , and replacing the match with the contents of Group 1 (removes the non-first comma)
ta - upon a successful replacement, jumps back to the a label location.
See an online sed demo:
s='1.2.3 Example question, a, question, that, is, hopefully, not, too, rudimentary'
sed ':a; s/^\([^,]*,[^,]*\),/\1/;ta' <<< "$s"
# => 1.2.3 Example question, a question that is hopefully not too rudimentary
awk 'NF>1 {$1=$1","} 1' FS=, OFS= filename.txt
sed ':a;s/,//2;t a' filename.txt
sed 's/,/\
/;s/,//g;y/\n/,/' filename.txt
This might work for you (GNU sed):
sed 's/,/&\n/;h;s/,//g;H;g;s/\n.*\n//' file
Append a newline to the first comma.
Copy the current line to the hold space.
Remove all commas in the current line.
Append the current line to the hold space.
Replace the current line with the contents of the hold space.
Remove everything between the introduced newlines.
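To see what the combined buffer looks like just before the last substitution, sed's `l` command can print it unambiguously (newlines appear as `\n`, the end of the buffer as `$`). This is only a tracing sketch for GNU sed; the function names are made up.

```shell
# Print the pattern space right after `g`, before the final s/\n.*\n//.
show_buffers() {
  sed -n 's/,/&\n/;h;s/,//g;H;g;l' "$@"
}

# The full command from the answer, wrapped for comparison.
strip_later_commas() {
  sed 's/,/&\n/;h;s/,//g;H;g;s/\n.*\n//' "$@"
}

printf 'A, b, c\n' | show_buffers         # A,\n b, c\nA\n b c$
printf 'A, b, c\n' | strip_later_commas   # A, b c
```

The final substitution removes everything from the first newline to the last one, which leaves the original text up to the first comma followed by the comma-free remainder.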

BASH: Find newlines in between text and replace with two newlines

I am looking to programmatically edit the newlines of .txt files. The desired behavior is that any single newline in between lines of text will become two newlines.
edit (clarification by @kaan): Lines separated by one newline should be separated by two newlines. Any lines that are already separated by two or more newlines should be left as is.
edit (context): I am working with the .fountain syntax and an npm module called afterwriting that exports text files into a script format as a PDF. Lines of text separated by only one newline are not spaced properly when printed to PDF using the package. So I want to convert single newlines into double ones automatically, because I don't want to have to add two newlines by hand in all of the files I am converting.
For instance an example of an input would look like:
File with text in it
A new line
Another new line
Line with three new lines above
One last new line
would become
File with text in it
A new line
Another new line
Line with three new lines above
One last new line
Any ideas on how this could be achieved in a bash script would be appreciated.
This might work for you (GNU sed):
sed '/\S/b;N;//{P;b};:a;n;//!ba' file
This solution appends another line to the first empty line encountered. If the appended line is not empty, it prints the first and bails out, thus doubling the empty line. Otherwise, if the appended line is empty, it prints them both and then prints any further empty lines until it encounters a non-empty line.
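The same behaviour can be sketched in awk, assuming the reading used in this answer: a lone empty line between text is doubled, and runs of two or more empty lines are left alone. The function name is hypothetical.

```shell
# Sketch: count consecutive empty lines; when text follows exactly one,
# emit one extra empty line before the text.
double_single_blanks() {
  awk '
    NF { if (blanks == 1) print ""; blanks = 0; print; next }   # text line
    { blanks++; print }                                         # empty line
  ' "$@"
}

printf 'a\n\nb\n\n\nc\n' | double_single_blanks
# a, two empty lines, b, two empty lines, c
```

Whitespace-only lines count as empty here, which matches the `/\S/` test in the sed version.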
Here is a way to do it using sed:
read the whole file into one buffer (sed normally works line by line with the trailing newline stripped, so a pattern containing \n would otherwise never match)
look for a word boundary (\b) followed by two newlines (\n\n – one for ending the current line, then one that's the single blank line), then one more word boundary (\b)
for any matches, add one extra newline in there.
With your sample text inside data.txt, it looks like this:
sed -n 'H; ${x; s/\b\n\n\b/\n\n\n/g; p}' < data.txt | tail -n +2
(Edit: added | tail -n +2 to remove the extra newline that's inserted at the beginning)

replacing specific characters in a line shell script

I have the following contents in a file
{"Hi","Hello","unix":["five","six"]}
I would like to replace comma within the square brackets only to semi colon. Rest of the comma's in the line should not be changed.
Output should be
{"Hi","Hello","unix":["five";"six"]}
I have tried using sed but it is not working. Below is the command I tried. Kindly help.
sed 's/:\[*\,*\]/;/'
Thanks
If your Input_file looks the same as the sample shown, the following may help.
sed 's/\([^[]*\)\([^,]*\),\(.*\)/\1\2;\3/g' Input_file
Output will be as follows.
{"Hi","Hello","unix":["five";"six"]}
EDIT: Adding an explanation as well. The breakdown below is for explanation only; run the command above to get the output.
sed 's/\([^[]*\)\([^,]*\),\(.*\)/\1\2;\3/g' Input_file
s ##The substitute command in sed.
\([^[]*\) ##Creates the first capture group, holding everything from the start of the line up to (but not including) the first [; it is referenced as \1 later.
\([^,]*\) ##Creates the second capture group, holding everything from the [ (where the first group stopped) up to the first , after it.
, ##Matches that , literally.
\(.*\) ##Creates the third capture group, holding everything after the , up to the end of the current line.
/\1\2;\3/g ##The replacement refers to the groups by number; a ; (semicolon) is placed between \2 and \3, since the OP asked for a semicolon in place of the comma.
Awk would also be useful here
awk -F'[][]' '{gsub(/,/,";",$2); print $1"["$2"]"$3}' file
by using gsub, you can replace all occurrences of matched symbol inside a specific field
Input File
{"Hi","Hello","unix":["five","six"]}
{"Hi","Hello","unix":["five","six","seven","eight"]}
Output
{"Hi","Hello","unix":["five";"six"]}
{"Hi","Hello","unix":["five";"six";"seven";"eight"]}
You should definitely use RavinderSingh13's answer instead of mine (it's less likely to break or exhibit unexpected behavior given very complex input) but here's a less robust answer that's a little easier to explain than his:
sed -r 's/(:\[.*),(.*\])/\1;\2/g' test
() is a capture group. You can see there are two in the search. In the replace, they are referred to as \1 and \2. This allows you to put chunks of your search back in the replace expression. -r keeps the ( and ) from needing to be escaped with a backslash. [ and ] are special and need to be escaped for literal interpretation. Oh, and you wanted .* not *. In a regex, * repeats the preceding item zero or more times; it only means "anything" on its own as a glob in bash and other shells.
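edit: and /g allows the replacement to happen multiple times per line. Because .* is greedy, though, this pattern usually matches only once per line, so with more than two items only the last comma inside the brackets is changed.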
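If every comma inside the brackets should become a semicolon with sed alone, one possible sketch is a label-and-branch loop that changes one comma per pass. It assumes the brackets are not nested; the function name is illustrative.

```shell
# Sketch: repeatedly replace the first comma that follows a "[" with no
# "]" or "," in between, until no such comma is left.
semicolons_in_brackets() {
  sed -e ':a' -e 's/\(\[[^],]*\),/\1;/' -e 'ta' "$@"
}

printf '%s\n' '{"Hi","Hello","unix":["five","six","seven"]}' | semicolons_in_brackets
# {"Hi","Hello","unix":["five";"six";"seven"]}
```

Commas outside the brackets are never touched, because every match has to start at a `[` and cannot cross a `]`.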

sed print more than one matches in a line

I have a file, including some strings and variables, like:
${cat.mouse.dog}
bird://localhost:${xfire.port}/${plfservice.url}
bird://localhost:${xfire.port}/${spkservice.synch.url}
bird://localhost:${xfire.port}/${spkservice.asynch.request.url}
${soabp.protocol}://${hpc.reward113.host}:${hpc.reward113.port}
${configtool.store.folder}/config/hpctemplates.htb
I want to print all the strings between "{}". In some lines there are more than one such string and in this case they should remain in the same line. The output should be:
cat.mouse.dog
xfire.port plfservice.url
xfire.port spkservice.synch.url
xfire.port spkservice.asynch.request.url
soabp.protocol hpc.reward113.host hpc.reward113.port
configtool.store.folder
I tried the following:
sed -n 's/.*{//;s/}.*//p' filename
but it printed only the last occurrence of each line. How can I get all the occurrences, remaining in the same line, as in the original file?
This might work for you (GNU sed):
sed -n 's/${/\n/g;T;s/[^\n]*\n\([^}]*\)}[^\n]*/\1 /g;s/ $//p' file
Replace all ${ by newlines, and if there are none, move on, as there is nothing to process. If there are newlines, then remove the non-newline characters to the left, and the non-newline characters to the right of the next }, globally. To finish off, remove the extra space introduced in the RHS of the global substitution.
If you're not against awk, you can try the following:
awk -v RS='{|}' -v ORS=' ' '/\n/{printf "\n"} (NR+1)%2' file
The record separator RS is set to either { or }. This splits the wanted pattern from the rest. (A regular expression as RS is an extension found in gawk and mawk; POSIX only defines single-character separators.)
The script then displays 1 record out of 2 with the statement (NR+1)%2.
In order to keep the alignment as expected, the output record separator is set to a space with ORS=' ', and every time a newline is encountered the statement /\n/{printf "\n"} inserts one.
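For awk implementations without a regex RS, a per-line loop with match() is another possible sketch. It is not the method from either answer, and the function name is made up. It pulls each ${...} out of the line in turn and prints the names space-separated.

```shell
# Sketch: collect the text inside every ${...} on a line.
extract_vars() {
  awk '{
    rest = $0
    out = ""
    # [$][{] and [}] are bracket expressions, so no backslash escapes are needed
    while (match(rest, /[$][{][^}]*[}]/)) {
      name = substr(rest, RSTART + 2, RLENGTH - 3)   # strip "${" and "}"
      out = (out == "" ? name : out " " name)
      rest = substr(rest, RSTART + RLENGTH)          # continue after the match
    }
    if (out != "") print out                         # skip lines with no match
  }' "$@"
}

printf '%s\n' 'bird://localhost:${xfire.port}/${plfservice.url}' | extract_vars
# xfire.port plfservice.url
```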

replace any line that starts with the symbol # using sed, awk, cut

this is simple but I was hoping for a quick command (using sed, cut, awk or something in BASH preferably) to do this:
replace any line that starts with the symbol #:
#<text, on one line, including numbers, letters and colons>
with
#<text, on one line, including numbers, letters and colons>/1
The # is always consistent, the <text, on one line, including numbers, letters and colons> changes. (It's Fastq format for the bioinformaticians out there).
Example:
#HWI-D00193:58:H73UEADXX:1:1101:1516:2209 1:N:0:ATCACG
change to
#HWI-D00193:58:H73UEADXX:1:1101:1516:2209 1:N:0:ATCACG/1
I know this is simple sorry.
With sed, you can do as below:
sed "/^#/ s/$/\/1/g" file
This matches lines that start with # and then appends (substitution at the end to be precise) the /1 on all the matching lines.
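A quick way to check the behaviour on a header line and a non-header line; the wrapper name is hypothetical, and a | delimiter is used here only to avoid escaping the slash:

```shell
# Sketch: append /1 to lines starting with #, leave all other lines alone.
add_read_suffix() {
  sed '/^#/ s|$|/1|' "$@"
}

printf '%s\n' '#HWI-D00193:58:H73UEADXX:1:1101:1516:2209 1:N:0:ATCACG' 'ACGT' | add_read_suffix
# #HWI-D00193:58:H73UEADXX:1:1101:1516:2209 1:N:0:ATCACG/1
# ACGT
```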
Using awk
awk '/^#/ {$0=$0"/1"}1' file
