Converting traditional line breaks to Markdown double-space newlines - macos

I've just learned how to do real line breaks in Markdown, with two spaces at the end of the line. I have a lot of files that I want to convert to this way of doing things because getting used to it is going to make my life a lot easier when using Markdown tools such as Pandoc.
These files currently look like this:
Roses are red
Violets are blue

Bananas are yellow

Oranges are orange
I'd like to transform paragraphs with more than one line so that the result would look like this:
Roses are red<space><space>
Violets are blue

Bananas are yellow

Oranges are orange
Sadly my Linux fu is not up to the task. My files have \n line endings. Here's how I would start:
for i in *; do sed -e 's/\n/ /g' "$i"; done
I have no idea how to differentiate line breaks followed by empty lines, which shouldn't be modified (line 2), from line breaks followed by text, which should be modified by sed (line 1). Also, empty lines themselves (line 3) should be left alone. Could someone please help me?

To do this reliably, you need a markdown parser. (I believe the awk-based solutions will insert spaces at the end of lines in code blocks, too, which you don't want.) Using pandoc 1.11.1 or later, you can do this:
pandoc -f markdown_strict+hard_line_breaks -t markdown_strict
Note that if you plan to use pandoc as your markdown processor, you can simply leave your files as they are, and use either markdown+hard_line_breaks or markdown_strict+hard_line_breaks as your input format.
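If you do decide to convert the files on disk, a batch conversion is just a loop over pandoc; here is a sketch, assuming the files end in .md and writing the results to new files so the originals are untouched (the .out.md suffix is only an example):
for f in *.md; do
    pandoc -f markdown_strict+hard_line_breaks -t markdown_strict "$f" -o "${f%.md}.out.md"
done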

$ awk '
    {
        if (NF) {
            head = tail
            tail = "<space><space>"
        }
        else {
            head = ""
            tail = ""
        }
        printf "%s%s%s", head, (NR>1?ORS:""), $0
    }
    END { print "" }
' file
Roses are red<space><space>
Violets are blue

Bananas are yellow

Oranges are orange
Just change tail = "<space><space>" to tail = "  " (two literal spaces).
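To apply it to every file the way the question's for-loop does, here is a sketch with the two literal spaces written in, writing each result to a .new file so the originals are left alone (the suffix is only an example):
for i in *; do
    awk '{if(NF){head=tail;tail="  "}else{head="";tail=""};printf "%s%s%s",head,(NR>1?ORS:""),$0}END{print ""}' "$i" > "$i.new"
done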

If empty lines may be changed too
Do you mean this? I used xx to make it easier to see in the output:
kent$ awk '{$0=$0"xx"}7' f
Roses are redxx
Violets are bluexx
xx
Bananas are yellowxx
xx
Oranges are orangexx
So each newline is effectively preceded by two x characters. If this is what you are looking for, the real command is the same thing with two literal spaces:
awk '{$0=$0"  "}7' file
Without changing empty lines
If you want to leave empty lines untouched (no substitution on them), check this out:
kent$ awk '$0{$0=$0"xx"}7' f
Roses are redxx
Violets are bluexx

Bananas are yellowxx

Oranges are orangexx
As you can see, the double x does not show up on the empty lines. The command you would use is (again with two literal spaces):
awk '$0{$0=$0"  "}7' file
EDIT
kent$ awk 'NR==1{p=$0;next}{p=p&&$0?p"xx":p; print p;p=$0}END{print $0}' f
Roses are redxx
Violets are blue

Bananas are yellow

Oranges are orange
In the one-liner above, empty lines and the line immediately before an empty line are left unchanged, and so is the last line of the file.

This might work for you (GNU sed):
sed '$!N;/^\s*\n\|\n\s*$/!s/\n/<space><space>&/;P;D' file
This keeps two lines in the pattern space. If the first or second line is empty, i.e. at the start or end of a paragraph, it prints the first line unchanged. Otherwise, it prefixes the newline with the desired string.
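For reference, here is the same command with the two literal trailing spaces written out (GNU sed, as above):
sed '$!N;/^\s*\n\|\n\s*$/!s/\n/  &/;P;D' file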

Related

How to replace a whole line (between 2 words) using sed?

Suppose I have text as:
This is a sample text.
I have 2 sentences.
text is present there.
I need to replace whole text between two 'text' words. The required solution should be
This is a sample text.
I have new sentences.
text is present there.
I tried using the below command but it's not working:
sed -i 's/text.*?text/text\
\nI have new sentence/g' file.txt
With your shown samples, please try the following. sed doesn't support lazy (non-greedy) matching in its regex. With awk's RS you can do this substitution for your shown samples. Create a variable val containing the new value; a simple substitution in awk then does the rest to produce your expected output.
awk -v val="your_new_line_Value" -v RS="" '
{
sub(/text\.\n*[^\n]*\n*text/,"text.\n"val"\ntext")
}
1
' Input_file
The above code prints the output on the terminal. Once you are happy with the results and want to save the output into Input_file itself, try the following code.
awk -v val="your_new_line_Value" -v RS="" '
{
sub(/text\.\n*[^\n]*\n*text/,"text.\n"val"\ntext")
}
1
' Input_file > temp && mv temp Input_file
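If you have GNU awk 4.1 or later, its inplace extension can replace the temp-file step; here is a sketch under that assumption (the awk code itself is unchanged):
gawk -i inplace -v val="your_new_line_Value" -v RS="" '
{
  sub(/text\.\n*[^\n]*\n*text/,"text.\n"val"\ntext")
}
1
' Input_file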
You have already solved your problem using awk, but in case anyone else is looking for a sed solution in the future, here's a sed script that does what you needed. Granted, the script uses some advanced sed features, but that's the fun part of it :)
replace.sed
#!/usr/bin/env sed -nEf
# This pattern determines the start marker for the range of lines where we
# want to perform the substitution. In our case the pattern is any line that
# ends with "text." — the `$` symbol meaning end-of-line.
/text\.$/ {
    # [p]rint the start-marker line.
    p
    # Next, we'll read lines (using `n`) in a loop, so mark this point in
    # the script as the beginning of the loop using a label called `loop`.
    :loop
    # Read the next line.
    n
    # If the last read line doesn't match the pattern for the end marker,
    # just continue looping by [b]ranching to the `:loop` label.
    /^text/! {
        b loop
    }
    # If the last read line matches the end marker pattern, then just insert
    # the text we want and print the last read line. The net effect is that
    # all the previously read lines will be replaced by the inserted text.
    /^text/ {
        # Insert the replacement text
        i\
I have a new sentence.
        # [p]rint the end-marker line
        p
    }
    # Exit the script, so that we don't hit the [p]rint command below.
    b
}
# Print all other lines.
p
Usage
$ cat lines.txt
foo
This is a sample text.
I have many sentences.
I have many sentences.
I have many sentences.
I have many sentences.
text is present there.
bar
$
$ ./replace.sed lines.txt
foo
This is a sample text.
I have a new sentence.
text is present there.
bar
Substitute
sed -i 's/I have 2 sentences./I have new sentences./g' file.txt
sed -i 's/[A-Z]\s[a-z].*/I have new sentences./g' file.txt
Insert
sed -i -e '2iI have new sentences.' -e '2d' file.txt
I need to replace whole text between two 'text' words.
If I understand correctly, the first text. (with a dot) is at the end of the first line and the second text is at the beginning of the third line. With awk you can get the required result by accumulating values in the variable s:
awk -v s='\nI have new sentences.\n' '/text.?$/ {s=$0 s;next} /^text/ {s=s $0;print s;s=""}' file
This is a sample text.
I have new sentences.
text is present there.

Sed replace regex with regex

Given the following file.txt:
this is line 1
# this is line 2
this is line 3
I would like to use sed to replace the lines with # at the beginning with \e[31m${lineContent}\e[0m. This will color that particular line. Additionally, I need the color \e[31m to be in a variable, color. (The desired output of this example would be having the second line colored). I have the following:
function colorLine() {
    cat file.txt | sed ''/"$1"/s//"$(printf \e[31m $1 \e[0m)"/g''
}
colorLine "#.*"
The variable color is not included in what I have so far, as I am not sure how to go about that part.
The output of this is (with the second line being red):
this is line 1
#.*
this is line 3
It is apparently treating the replacement string literally. My question is: how do I use the matched line to generate the replacement string?
I understand that I could do something much easier like appending \e[31m to the beginning of all lines that start with #, but it is important to use sed with the regexes.
colorLine() {
    sed "/$1/s//"$'\e[31m&\e[0m/' file.txt
}
colorLine "#.*"
There are multiple fixes here: it uses $1 to identify the pattern from the arguments to the function, then uses ANSI-C quoting to encode the escape sequences, and fixes the colour reset sequence, which was originally missing the [ after the escape character. It also avoids the charge of UUoC (Useless Use of cat).
The fixed file name is not exactly desirable, but fixing it is left as an exercise for the reader.
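If you want to see the ANSI-C quoting on its own, this prints a red word and then resets the colour (assuming a terminal that honours SGR escape codes):
printf '%s\n' $'\e[31m'"this is red"$'\e[0m'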
What if I needed \e[31m to be a variable, $color? How do I change the quoting?
I have a colour-diff script which contains (in Perl notation — I've translated it to Bash notation using ANSI C quoting as before):
reset=$'\e[0m'
black=$'\e[30;1m' # 30 = Black, 1 = bold
red=$'\e[31;1m' # 31 = Red, 1 = bold
green=$'\e[32;1m' # 32 = Green, 1 = bold
yellow=$'\e[33;1m' # 33 = Yellow, 1 = bold
blue=$'\e[34;1m' # 34 = Blue, 1 = bold
magenta=$'\e[35;1m' # 35 = Magenta, 1 = bold
cyan=$'\e[36;1m' # 36 = Cyan, 1 = bold
white=$'\e[37;1m' # 37 = White, 1 = bold
With those variables around, you can create your function as you wish:
colorLine() {
    sed "/$1/s//$blue&$reset/" file.txt
}
Where you set those variables depends on where you define your function. For myself, I'd probably make a script rather than a function, with full-scale argument parsing, and go from there. YMMV
Take a look at List of ANSI color escape sequences to get a more comprehensive list of colours (and other effects — including background and foreground colours) and the escape sequence used to generate it.
With GNU sed and Kubuntu 16.04.
foo="#.*"
sed 's/'"$foo"'/\x1b[31m&\x1b[0m/' file
I'd trick grep into doing it for me this way:
function colorLine() {
    GREP_COLORS="mt=31" grep --color=always --context=$(wc -l <file.txt) --no-filename "$1" file.txt
}
Breakdown of the trick:
GREP_COLORS="mt=31": the SGR substring for matching non-empty text in any matching line. Here it emits \e[31m (red) before the matched string and a reset to the default colour after it.
--color=always: always colourise, even in a non-interactive shell.
--context=$(wc -l <file.txt): output as many context lines as there are lines in the file (i.e. all lines).
--no-filename: do not print the file name.
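If you also want the colour to come from a variable, as asked in the question, the same trick works when the variable holds just the numeric SGR code; here is a sketch assuming something like color=31 (red) has been set beforehand:
function colorLine() {
    GREP_COLORS="mt=${color:-31}" grep --color=always --context=$(wc -l <file.txt) --no-filename "$1" file.txt
}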
An awk version
black='\033[30;1m'
red='\033[31;1m'
green='\033[32;1m'
yellow='\033[33;1m'
blue='\033[34;1m'
magenta='\033[35;1m'
cyan='\033[36;1m'
white='\033[37;1m'
color=$cyan
colorLine() { awk -v s="$1" -v c=$color '$0~s {$0=c$0"\033[0m"}1' file.txt; }
colorLine "#.*"
You can add the file as a variable as well:
color=$cyan
file="file.txt"
colorLine() { awk -v s="$1" -v c=$color '$0~s {$0=c$0"\033[0m"}1' $file; }
colorLine "#.*"
In awk, \e is written as \033.
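A quick way to check that from awk itself:
awk 'BEGIN{print "\033[31m" "red text" "\033[0m"}'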
A more dynamic version:
colorLine() {
    temp=$2
    col=${!temp}
    awk -v s="$1" -v c=$col '$0~s {$0=c$0"\033[0m"}1' $3
}
colorLine "#.*" green file.txt
Then you have colorLine pattern color file

Replace the last character in string

How can I just replace the last character (it's a }) from a string? I need everything before the last character but replace the last character with some new string.
I tried many things with awk and sed but didn't succeed.
For example:
...\\tx4535\\tx5102\\tx5669\\tx6236\\tx6803\\pardirnatural
\\f0
}'
should become:
...\\tx4535\\tx5102\\tx5669\\tx6236\\tx6803\\pardirnatural
\\f0
\\cf2 Its red now
}'
This replaces the last occurrence of:
}
with
\\cf2 Its red now
}
sed would do this:
# replace '}' in the end
echo '\tx4535\tx5102\tx5669\tx6236\tx6803\pardirnatural \f0 }' | sed 's/}$/\\cf2 Its red now}/'
# replace any last character
echo '\tx4535\tx5102\tx5669\tx6236\tx6803\pardirnatural \f0 }' | sed 's/\(.\)$/\\cf2 Its red now\1/'
Replacing the trailing } could be done like this (with $ as the PS1 prompt and > as the PS2 prompt):
$ str="...\\tx4535\\tx5102\\tx5669\\tx6236\\tx6803\\pardirnatural
> \\f0
> }"
$ echo "$str"
...\tx4535\tx5102\tx5669\tx6236\tx6803\pardirnatural
\f0
}
$ echo "${str%\}}\cf2 It's red now
}"
...\tx4535\tx5102\tx5669\tx6236\tx6803\pardirnatural
\f0
\cf2 It's red now
}
$
The first 3 lines assign your string to my variable str. The next 4 lines show what's in the string. The 2 lines:
echo "${str%\}}\cf2 It's red now
}"
contain a (grammar-corrected) substitution of the material you asked for, and the last lines echo the substituted value.
Basically, ${str%tail} removes the string tail from the end of $str; I remember % ends in 't' for tail (and the analogous ${str#head} has hash starting with 'h' for head).
See shell parameter expansion in the Bash manual for the remaining details.
If you don't know the last character, you can use a ? metacharacter to match the end instead:
echo "${str%?}and the extra"
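A quick illustration of those operators with a throwaway string (the string itself is arbitrary):
v='one.two.three'
echo "${v%.*}"    # one.two      -- shortest match of '.*' removed from the tail
echo "${v#*.}"    # two.three    -- shortest match of '*.' removed from the head
echo "${v%?}"     # one.two.thre -- unknown last character removed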
First make a string with newlines
str=$(printf "%s\n%s\n%s" '\\tx4535\\tx5102\\tx5669\\tx6236\\tx6803\\pardirnatural' '\\f0' "}'")
Now you look for the } on the last line of your string and replace it, including a newline.
The $ address makes sure the replacement only happens on the last line, and & stands for the matched string.
echo "${str}" |sed '$ s/}[^}]$/\\\\cf2 Its red now\n&/'
The above solution only works when the } is at the last line. It becomes more difficult when you also want to support str2:
str2=$(printf "Extra } here.\n%s\nsome other text" "${str}")
You can no longer simply match the } on the last line. Removing the $ address for the last line would result in replacing all } characters (I added a } at the beginning of str2), but you only want to replace the last one.
Replacing only once is forced with .../1. Replacing the last occurrence rather than the first is done by reversing the order of the lines with tac. Since you will run tac again after the replacement, you need to use the reversed order in your sed replacement string.
echo "${str2}" | tac |sed 's/}[^}]$/&\n\\\\cf2 Its red now/1' |tac
In awk:
$ awk ' BEGIN { RS=OFS=FS="" } $NF="\\\\cf2 Its red now\n}"' file
RS="" sets the record separator to empty so the whole block is read as one record (change it to suit your needs)
OFS=FS="" splits the record into one character per field
$NF="\\\\cf2 Its red now\n}" replaces the last field (the closing }) with the quoted text
awk '{sub(/\\f0/,"\\f0\n\\\\cf2 Its red now")}1' file
...\\tx4535\\tx5102\\tx5669\\tx6236\\tx6803\\pardirnatural
\\f0
\\cf2 Its red now
}'

Removing lines between tags in a text file

I have many text files containing annotations. The original text is marked with lines containing the words:
START OF TEXT OF PASSAGE 1
END OF TEXT OF PASSAGE 1
Obviously I can search each document for the phrase START OF TEXT and delete everything up to it. Then search for END OF TEXT and start selecting text for deletion until I get to the next START OF TEXT.
I have come up with this design so far:
#!/bin/bash
a="START OF PROJECT"
b="END OF PROJECT"
while read line; do
if line contains a; do
while read line; do
'if line does not contain b'
'append the line to output.txt'; fi
done
done
fi
done
Perhaps there is an easier way using sed, awk, grep and pipes?
'for every document' 'loop through it doing this' ('find the original text between START and END' | >> output.txt)
Unfortunately I am poor at bash and ignorant of sed/awk.
The reason for this is that I am assembling a huge text document that is a concatenation of thousands of marked up documents – each of which contains some annotated passages.
In Python:
import re

with open('in.txt') as f, open('out.txt', 'w') as output:
    # re.DOTALL lets .*? match across newlines, since each passage spans several lines
    output.write('\n'.join(re.findall(r'START OF TEXT(.*?)END OF TEXT', f.read(), flags=re.DOTALL)))
This reads the input, searches for all matches that begin and end with the necessary markers, captures the text of interest in a group, joins all those groups on a linefeed, and writes that to the result file.
Pretty easy to do with awk. You would create a script (I'll call it yank.awk) containing this:
#!/usr/bin/awk -f
/START OF PROJECT/ { capture = 1; next }
/END OF PROJECT/ { capture = 0 }
capture == 1 { print }
and then run it like so:
yank.awk in.txt > output.txt
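That assumes yank.awk has been made executable and can be found by the shell, for example:
chmod +x yank.awk && ./yank.awk in.txt > output.txt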
Could also do with sed and grep:
sed -ne '/START OF PROJECT/,/END OF PROJECT/p' in.txt | grep -vE '(START|END) OF PROJECT' > output.txt
(Another Python solution)
You can have itertools.groupby group lines together based on a boolean value - just use a global flag to keep track of whether you are in a block or not, and then use groupby to group the lines that are in or out of blocks. Then just discard the ones that are not blocks:
sample_lines = """
lskdjflsdkjf
sldkjfsdlkjf
START OF TEXT
Asdlkfjlsdkfj
Bsldkjf
Clsdkjf
END OF TEXT
sldkfjlsdkjf
sdlkjfdklsjf
sdlkfjdlskjf
START OF TEXT
Dsdlkfjlsdkfj
Esldkjf
Flsdkjf
END OF TEXT
sldkfjlsdkjf
sdlkjfdklsjf
sdlkfjdlskjf
""".splitlines()
from itertools import groupby

in_block = False

def is_in_block(line):
    global in_block
    if line.startswith("END OF TEXT"):
        in_block = False
    ret = in_block
    if line.startswith("START OF TEXT"):
        in_block = True
    return ret

for lines_are_text, lines in groupby(sample_lines, key=is_in_block):
    if lines_are_text:
        print(list(lines))
gives:
['Asdlkfjlsdkfj', 'Bsldkjf', 'Clsdkjf']
['Dsdlkfjlsdkfj', 'Esldkjf', 'Flsdkjf']
See that first group has the lines that start with A, B, and C, and the second group is made up of those lines starting with D, E, and F.
It sounds like the specific solution you need is:
awk '/END OF TEXT OF PASSAGE/{f=0} f; /START OF TEXT OF PASSAGE/{f=1}' file
See https://stackoverflow.com/a/18409469/1745001 for other ways to select text from files.
Use Perl's Flip-Flop Operator to Print Text Between Markers
Given a corpus like:
START OF TEXT OF PASSAGE 1
foo
END OF TEXT OF PASSAGE 1
START OF TEXT OF PASSAGE 2
bar
END OF TEXT OF PASSAGE 2
you can use the Perl flip-flop operator to process within a range of lines. For example, from the shell prompt:
$ perl -ne 'if (/^START OF TEXT/ ... /^END OF TEXT/) {
next if /^(?:START|END)/;
print;
}' /tmp/corpus
foo
bar
Basically, this short Perl script loops through your input. When it finds your start and end tags, it throws away the tags themselves and prints everything else in between.
Usage Notes
The line breaks between passages in the corpus are for readability. It doesn't matter if your real corpus has no line breaks between passages, so long as the text markers always start at the beginning of the line as shown in your original post. If that assumption doesn't hold true, then you will need to adjust the regular expressions used to identify the start and end of your passages.
You can pass multiple files to the Perl script. Again, it makes no practical difference as long as you don't exceed your shell's command-line length limit.
If you want the final output to go to somewhere other than standard output, just use shell redirection. For example:
perl -ne 'if (/^START OF TEXT/ ... /^END OF TEXT/) {
next if /^(?:START|END)/;
print;
}' /tmp/file1 /tmp/file2 /tmp/file3 > /tmp/output
You can use sed as follows:
sed -n '/^START OF TEXT/,/^END OF TEXT/{/^\(START\|END\) OF TEXT/!p}' infile
or, with extended regular expressions (-r):
sed -rn '/^START OF TEXT/,/^END OF TEXT/{/^(START|END) OF TEXT/!p}' infile
-n prevents sed from printing by default. The rest works as follows:
/^START OF TEXT/,/^END OF TEXT/ { # For lines between these two matches
/^\(START\|END\) OF TEXT/!p # If the line does NOT match, print it
}
This works with GNU sed and might require some tweaking to run with other seds.

Delete lines before and after a match in bash (with sed or awk)?

I'm trying to delete the two lines either side of a pattern match from a file full of transactions, i.e. find the match, then delete the two lines before it, delete the two lines after it, and finally delete the match itself. Then write this back to the original file.
So the input data is
D28/10/2011
T-3.48
PINITIAL BALANCE
M
^
and my attempt so far is:
sed -i '/PINITIAL BALANCE/,+2d' test.txt
However, this only deletes the pattern match and the two lines after it. I can't work out any logical way to delete all 5 lines of data from the original file using sed.
an awk one-liner may do the job:
awk '/PINITIAL BALANCE/{for(x=NR-2;x<=NR+2;x++)d[x];}{a[NR]=$0}END{for(i=1;i<=NR;i++)if(!(i in d))print a[i]}' file
test:
kent$ cat file
######
foo
D28/10/2011
T-3.48
PINITIAL BALANCE
M
x
bar
######
this line will be kept
here
comes
PINITIAL BALANCE
again
blah
this line will be kept too
########
kent$ awk '/PINITIAL BALANCE/{for(x=NR-2;x<=NR+2;x++)d[x];}{a[NR]=$0}END{for(i=1;i<=NR;i++)if(!(i in d))print a[i]}' file
######
foo
bar
######
this line will be kept
this line will be kept too
########
Some explanation:
awk '/PINITIAL BALANCE/{for(x=NR-2;x<=NR+2;x++)d[x];}   # if a match is found, record the line numbers of the match and the two lines either side of it in array "d"
{a[NR]=$0}                                              # save every line in array "a", indexed by line number
END{for(i=1;i<=NR;i++)if(!(i in d))print a[i]}'         # finally, print only those lines whose index is not in "d"
file                                                    # your input file
sed will do it:
sed '/\n/!N;/\n.*\n/!N;/\n.*\n.*PINITIAL BALANCE/{$d;N;N;d};P;D'
It works this way:
if sed has only one line in the pattern space, it appends another one
if there are only two, it appends a third
if the pattern space matches LINE + LINE + LINE containing BALANCE, it appends the two following lines, deletes them all and starts over from the beginning
if not, it prints the first line in the pattern space, deletes it and starts over from the beginning without wiping the pattern space
To handle the pattern appearing on the first line, you should modify the script:
sed '1{/PINITIAL BALANCE/{N;N;d}};/\n/!N;/\n.*\n/!N;/\n.*\n.*PINITIAL BALANCE/{$d;N;N;d};P;D'
However, it fails if another PINITIAL BALANCE occurs within the lines that are about to be deleted. Then again, the other solutions fail in that case too =)
For such a task, I would probably reach for a more advanced tool like Perl:
perl -ne 'push @x, $_;
          if (@x > 4) {
            if ($x[2] =~ /PINITIAL BALANCE/) { undef @x }
            else { print shift @x }
          }
          END { print @x }' input-file > output-file
This will remove 5 lines from the input file: the 2 lines before the match, the matched line, and the 2 lines after it. You can change the total number of lines being removed by modifying @x > 4 (this removes 5 lines), and which line is matched by modifying $x[2] (this matches against the third line in the window, so the two lines before the match are also removed).
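For instance, here is a sketch of the same approach trimmed to just one line either side of the match (a 3-line window), under the same assumptions:
perl -ne 'push @x, $_;
          if (@x > 2) {
            if ($x[1] =~ /PINITIAL BALANCE/) { undef @x }
            else { print shift @x }
          }
          END { print @x }' input-file > output-file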
A simpler and easier-to-understand solution might be:
awk '/PINITIAL BALANCE/ {print NR-2 "," NR+2 "d"}' input_filename \
| sed -f - input_filename > output_filename
awk is used to generate a sed script that deletes the lines in question, and the result is written to output_filename.
This uses two processes which might be less efficient than the other answers.
This might work for you (GNU sed):
sed ':a;$q;N;s/\n/&/2;Ta;/\nPINITIAL BALANCE$/!{P;D};$q;N;$q;N;d' file
Save this code into a file called grep.sed:
H
s:.*::
x
s:^\n::
:r
/PINITIAL BALANCE/ {
    N
    N
    d
}
/.*\n.*\n/ {
    P
    D
}
x
d
and run a command like this:
sed -i -f grep.sed FILE
Or you can use it directly as a one-liner:
sed -i 'H;s:.*::;x;s:^\n::;:r;/PINITIAL BALANCE/{N;N;d;};/.*\n.*\n/{P;D;};x;d' FILE
