Search and replace a multi-line pattern with sed - bash

I've seen a lot of examples regarding this problem, but have yet to see one tailored to my use case. Although I'm very familiar with sed, I'm ashamed to say that I'm a noob when it comes to its more advanced features. Here's the problem at hand.
I have a multi-line pattern that I can successfully match with sed like so,
sed -n '/Favorite Animals/,/!/p'
Favorite Animals
Monkey
Penguin
Cat
!
Favorite Things
Shoe
Dog
Wheel
Moth
!
and what I personally like about this expression is that I can match a variable number of lines up to the exclamation character. Now let's say I wanted to do a search and replace on that same pattern. Basically, I would like to replace the multi-line pattern demonstrated above with any string of my choosing. Any ideas? I'm hoping for syntax similar to my sed command above, but beggars can't be choosers.
The idea is so that I can replace one of the groups delimited by the exclamation with a string. I'll call them "entries". I want to be able to update or overwrite these entries. If I had a new updated version of favorite animals I would like to be able to replace the old entry with a new one like this.
Favorite Animals
Sloth
Platypus
Badger
Dog
!
Favorite Things
Shoe
Dog
Wheel
Moth
!
As you can see I'm no longer a fan of monkeys now.

There are a variety of options — the i, c, and a commands can all be used.
Amended answer
This amended answer deals with the modified data file now in the question. Here's a mildly augmented version of the modified data file:
There's material at the start of the file then the key information:
Favourite Animals
Monkey
Penguin
Cat
!
Favourite Things
Shoe
Wheel
Moth
!
and some material at the end of the file too.
All three of these sed scripts produce the same output:
sed '/Favourite Animals/,/!/c\
Favourite Animals\
Sloth\
Platypus\
Badger\
!
' data
sed '/Favourite Animals/i\
Favourite Animals\
Sloth\
Platypus\
Badger\
!
/Favourite Animals/,/!/d' data
sed '/Favourite Animals/a\
Favourite Animals\
Sloth\
Platypus\
Badger\
!
/Favourite Animals/,/!/d' data
Sample output:
There's material at the start of the file then the key information:
Favourite Animals
Sloth
Platypus
Badger
!
Favourite Things
Shoe
Wheel
Moth
!
and some material at the end of the file too.
It is crucial that the scripts all key off the unique string /Favourite Animals/ and not off the repeated trailing delimiter /!/. If the i or a commands use /!/ instead of /Favourite Animals/, the outputs change, and not for the better.
/!/i:
There's material at the start of the file then the key information:
Favourite Animals
Sloth
Platypus
Badger
!
Favourite Things
Shoe
Wheel
Moth
Favourite Animals
Sloth
Platypus
Badger
!
!
and some material at the end of the file too.
/!/a:
There's material at the start of the file then the key information:
Favourite Animals
Sloth
Platypus
Badger
!
Favourite Things
Shoe
Wheel
Moth
!
Favourite Animals
Sloth
Platypus
Badger
!
and some material at the end of the file too.
Extra request
Would it be possible to select a range within a range using sed? Basically, what if I wanted to change or remove one or many of my favorite animals within the previously specified range? That is, /Favorite Animals/,/!/... change something within this range.
Yes, of course. For a single mapping:
sed '/Favourite Animals/,/!/ s/Monkey/Gorilla/'
For multiple mappings:
sed '/Favourite Animals/,/!/ {
s/Monkey/Gorilla/
s/Penguin/Zebra/
s/Cat/Dog/
}'
You can also combine those onto a single line if you wish — use semicolons to separate them:
sed '/Favourite Animals/,/!/ { s/Monkey/Gorilla/; s/Penguin/Zebra/; s/Cat/Dog/; }'
Be aware that GNU sed and BSD (Mac OS X) sed have different views on the necessity for the last semicolon — what's shown works with both.
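Nested addresses give a literal range-within-a-range as well: a second /start/,/end/ pair inside the braces only sees lines that are already inside the outer range. A sketch (assuming the amended data file is saved as data):

```shell
# Delete the Monkey..Cat sub-range, but only while the outer
# /Favourite Animals/,/!/ range is active.
sed '/Favourite Animals/,/!/ {
  /Monkey/,/Cat/ d
}' data
```

A Monkey line that happened to sit inside the Favourite Things entry would be left untouched, because the inner range never activates outside the outer one.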
The original answer works with a simpler input file.
Original answer
Consider the file data containing:
There's material
at the start of the file
then the key information:
Favourite Animals
Monkey
Penguin
Cat
!
and material at the end of the file too.
Using c, you might write:
$ sed '/Favourite Animals/,/!/c\
> Replacement material\
> for the favourite animals
> ' data
There's material
at the start of the file
then the key information:
Replacement material
for the favourite animals
and material at the end of the file too.
$
Using i, you would use:
$ sed '/Favourite Animals/i\
> Replacement material\
> for the favourite animals
> /Favourite Animals/,/!/d' data
There's material
at the start of the file
then the key information:
Replacement material
for the favourite animals
and material at the end of the file too.
$
Using a, you might write:
$ sed '/!/a\
> Replacement material\
> for the favourite animals
> /Favourite Animals/,/!/d' data
There's material
at the start of the file
then the key information:
Replacement material
for the favourite animals
and material at the end of the file too.
$
Note that with:
c — you change the whole range
i — you insert before the first pattern in the range before you delete the entire range
a — you append after the last pattern in the range before you delete the entire range
Though, come to think of it, you could insert before the last pattern in the range before deleting the entire range, or append after the first pattern in the range before deleting the entire range. So, the key with i and a is to put the 'replacement text' operation before the range-based delete. But c is most succinct.
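For completeness, the "insert before the last pattern, then delete the range" variant mentioned above looks like this on the simple data file (a sketch; it relies on i emitting its text as soon as the ! line is reached, before the d removes that line):

```shell
# Insert the replacement when the trailing ! is seen, then delete the range.
sed '/!/i\
Replacement material\
for the favourite animals
/Favourite Animals/,/!/d' data
```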

The i (insert) command has ugly syntax, but it works:
sed '/Favorite Animals/i\
some new text\
some more new text\
and a little more
/Favorite Animals/,/!/d'

Related

Randomly shuffling lines in multiple text files but keeping them as separate files using a command or bash script

I have several text files in a directory. All of them are unrelated. Words don't repeat within each file. Each line has 1 to 3 words in it, such as:
apple
potato soup
vitamin D
banana
guinea pig
life is good
I know how to randomize each file:
sort -R file.txt > file-modified.txt
That's great but I want to do this in over 500+ files in a directory and it would take me ages. There must be something better.
I would like to do something like:
sort -R *.txt -o KEEP-SAME-NAME-AS-ORIGINAL-FILE-ADD-SUFFIX-TO-ALL.txt
Maybe this is possible with a script that goes through each file in the directory until finished.
Very importantly, every file should only randomize the lines within itself and not mix with the other files.
Thank you.
Something like this one-liner:
for file in !(*-modified).txt; do shuf "$file" > "${file%.txt}-modified.txt"; done
Just loop over the files and shuffle each one in turn.
The !(*-modified).txt pattern uses bash's extended pattern matching to not match .txt files that already have -modified at the end of the name so you don't shuffle a pre-existing already shuffled output file and end up with file-modified-modified.txt. Might require a shopt -s extglob first, though that's usually turned on already in an interactive shell session.
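If extglob isn't available (for instance in a plain sh script), a hedged equivalent of the same loop skips the -modified files with an ordinary case statement instead:

```shell
for file in *.txt; do
  case $file in
    *-modified.txt) continue ;;   # skip already-shuffled output files
  esac
  shuf "$file" > "${file%.txt}-modified.txt"
done
```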

Regex and/or sed to replace lowercase

I have a text file with a single column of data. Take the following data for example
united states
germany
france
canada
Of which I am trying to generate all possible mixed case variations. For example the new file might look like this
United states
uNited states
unIted states
uniTed states
unitEd states
uniteD states
united States
united sTates
united stAtes
united staTes
united statEs
united stateS
UNited states
And so on until all possible case variations of each word have been generated.
Given the above input and expected output I have three questions
Is regex and sed the right tool for this job?
What alternatives do I have to regex and sed for this task?
If I did use regex and sed what might the correct syntax look like?
1) No
2) Awk and substr()
3) You wouldn't
Start with this:
$ echo 'foo' |
awk '{
for (i=1;i<=length($0);i++) {
print substr($0,1,i-1) toupper(substr($0,i,1)) substr($0,i+1)
}
}'
Foo
fOo
foO
and massage to suit with the obvious logic.
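If "all possible case variations" really means every combination (2^n variants for a word with n letters, not just one changed letter per line), the same idea extends to a recursive awk sketch:

```shell
echo 'foo' | awk '
function vary(prefix, rest,   c) {
  if (rest == "") { print prefix; return }
  c = substr(rest, 1, 1)
  if (c ~ /[A-Za-z]/) {
    vary(prefix tolower(c), substr(rest, 2))   # branch: lowercase this letter
    vary(prefix toupper(c), substr(rest, 2))   # branch: uppercase this letter
  } else {
    vary(prefix c, substr(rest, 2))            # non-letters pass through unchanged
  }
}
{ vary("", $0) }'
```

For 'foo' this prints all 8 combinations, from foo through FOO.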
For the fun of sed.
1) Yes. (e.g. GNU sed version 4.2.1)
2) Maybe awk, perl
3) See code below
sed -E "s/^.*$/\n&#\n/;:a;s/\n([^#\n]*)([^#\n])#([^#\n]*)\n/\n\1#\u\2\3\n\1#\l\2\3\n/;ta;s/(^\n#|\n$)//g;s/\n#/\n/g;"
This does assume that "#" is not part of the strings found in the file.
create a certain pattern
(start and end with newline; mark the cursor with #)
start a loop
replace text between newlines and containing the cursor by same text twice,
once with uppercase before cursor, once with lower case
move cursor one towards the start
loop if that replaced something
remove newlines at start and end and cursors
Note that # is not special. It just needs to be a character which will not occur in the input and not in the desired output. Hopefully you can find a special character.
If you can have all characters, it gets complicated. Look at the comments to this answer. There probably is a discussion going on.
Output (for input "foo"):
FOO
fOO
FoO
foO
FOo
fOo
Foo
foo

Bash script / utility to convert UK English to US spellings in TeX document [closed]

I'm looking for a quick Bash script to convert British / New Zealand spellings to American in a TeX document (for working with US-based academics and journal submission). This is a formal mathematical biology paper with very little regional terminology or grammar: prior work is given as formulae rather than quotes.
e.g.,
Generalise -> Generalize
Colour -> Color
Centre -> Center
Figure there must be sed or awk based script to substitute most of the common spelling differences.
See the related TeX forum question for more detail.
https://tex.stackexchange.com/questions/312138/converting-uk-to-us-spellings
n.b. I currently compile PDFLaTeX with kile on Ubuntu 16.04 or Elementary OS 0.3 Freya but I can use another TeX compiler/package if there's a built-in fix elsewhere.
Thanks for your assistance.
I think you need to have a list of substitutions handy and call it for translation. You would have to enrich the dictionary file to translate text files effectively.
sourceFile=$1
dict=$2
while read -r line
do
word=$(echo "$line" | awk '{print $1}')
updatedWord=$(grep -i "$word" "$dict" | awk '{print $2}')
sed -i "s/$word/$updatedWord/g" "$sourceFile" 2> /dev/null
done < "$dict"
Run the above script like:
./scriptName source.txt dictionary.txt
Here is one sample dictionary I used:
>cat dict
characterize characterise
prioritize prioritise
specialize specialise
analyze analyse
catalyze catalyse
size size
exercise exercise
behavior behaviour
color colour
favor favour
contour contour
center centre
fiber fibre
liter litre
parameter parameter
ameba amoeba
anesthesia anaesthesia
diarrhea diarrhoea
esophagus oesophagus
leukemia leukaemia
cesium caesium
defense defence
practice practice
license licence
defensive defensive
advice advice
aging ageing
acknowledgment acknowledgement
judgment judgement
analog analogue
dialog dialogue
fulfill fulfil
enroll enrol
skill, skillful skill, skilful
labeled labelled
signaling signalling
propelled propelled
revealing revealing
Execution result :
cat source
color of this fiber is great and we should analyze it.
./ScriptName source.txt dict.txt
cat source
colour of this fibre is great and we should analyse it.
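A lighter alternative (a sketch, not the script above) is to load the dictionary once and translate in a single awk pass, instead of running sed once per dictionary line; note it replaces whole whitespace-delimited words only, so punctuation attached to a word defeats it:

```shell
# Column 1 = US spelling, column 2 = UK spelling, as in the dictionary above.
awk 'NR == FNR { uk[$1] = $2; next }     # first file: load the dictionary
     { for (i = 1; i <= NF; i++)
         if ($i in uk) $i = uk[$i]       # swap whole words only
     } 1' dict source
```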
Here is my solution with awk, which I think is more flexible than sed.
This program leaves LaTeX commands alone (words beginning with "\") and preserves the capitalized first letter of words.
The arguments of LaTeX commands (and normal text) are substituted using the dictionary file.
When the program's optional third parameter [rev] is given, it performs a reverse substitution using the same dictionary file.
Any non-alphabetic character acts as a word separator (this is required in a LaTeX source file).
The program writes its output to the screen (stdout), so you need to redirect it to a file (>output_f).
(I assume the inputencoding of your LaTeX source is 1 byte/char.)
> cat dic.sh
#!/bin/bash
(($#<2))&& { echo "Usage $0 dictionary_file latex_file [rev]"; exit 1; }
((d= $#==3 ? 0:1))
awk -v d=$d '
BEGIN {cm=fx=0; fn="";}
fn!=FILENAME {fx++; fn=FILENAME;}
fx==1 {if(!NF)next; if(d)a[$1]=$2; else a[$2]=$1; next;} #read dict or rev dict file into an associative array
fx==2 { for(i=1; i<=length($0); i++)
{c=substr($0,i,1); #read characters from a given line of LaTeX source
if(cm){printf("%s",c); if(c~"[^A-Za-z0-9\\\]")cm=0;} # currently inside a LaTeX command name
else if(c~"[A-Za-z]")w=w c; else{pr(); printf("%s",c); if(c=="\\")cm=1;} # collect letters; otherwise flush the word
}
pr(); printf("\n"); #handle collected last word in the line
}
function pr( s){ # print collected word or its substitution by dictionary and recreates first letter case
if(!length(w))return;
s=tolower(w);
if(!(s in a))printf("%s",w);
else printf("%s", s==w ? a[s] : toupper(substr(a[s],1,1)) substr(a[s],2));
w="";}
' $1 $2
Dictionary file:
> cat dictionary
apple lemon
raspberry cherry
pear banana
Input LaTeX source:
> cat src.txt
Apple123pear,apple "pear".
\Apple123pear{raspberry}{pear}[apple].
Raspberry12Apple,pear.
Execution result :
> ./dic.sh
Usage ./dic.sh dictionary_file latex_file [rev]
> ./dic.sh dictionary src.txt >out1.txt; cat out1.txt
Lemon123banana,lemon "banana".
\Apple123pear{cherry}{banana}[lemon].
Cherry12Lemon,banana.
> ./dic.sh dictionary out1.txt >out2.txt rev; cat out2.txt
Apple123pear,apple "pear".
\Apple123pear{raspberry}{pear}[apple].
Raspberry12Apple,pear.
> diff src.txt out2.txt # they are identical

gsub issue with awk (gawk)

I need to search a text file for a string, and make a replacement that includes a number that increments with each match.
The string to be "found" could be a single character, or a word, or a phrase.
The replacement expression will not always be the same (as it is in my examples below), but will always include a number (variable) that increments.
For example:
1) I have a test file named "data.txt". The file contains:
Now is the time
for all good men
to come to the
aid of their party.
2) I placed the awk script in a file named "cmd.awk". The file contains:
/f/ {sub ("f","f(" ++j ")")}1
3) I use awk like this:
awk -f cmd.awk data.txt
In this case, the output is as expected:
Now is the time
f(1)or all good men
to come to the
aid of(2) their party.
The problem comes when there is more than one match on a line. For example, if I was searching for the letter "i" like:
/i/ {sub ("i","i(" ++j ")")}1
The output is:
Now i(1)s the time
for all good men
to come to the
ai(2)d of their party.
which is wrong because it doesn't include the "i" in "time" or "their".
So, I tried "gsub" instead of "sub" like:
/i/ {gsub ("i","i(" ++j ")")}1
The output is:
Now i(1)s the ti(1)me
for all good men
to come to the
ai(2)d of thei(2)r party.
Now it makes the replacement for all occurrences of the letter "i", but the inserted number is the same for all matches on the same line.
The desired output should be:
Now i(1)s the ti(2)me
for all good men
to come to the
ai(3)d of thei(4)r party.
Note: The number won't always begin with "1" so I might use awk like this:
awk -f cmd.awk -v j=26 data.txt
To get the output:
Now i(27)s the ti(28)me
for all good men
to come to the
ai(29)d of thei(30)r party.
And just to be clear, the number in the replacement will not always be inside parenthesis. And the replacement will not always include the matched string (actually it would be quite rare).
The other problem I am having with this is...
I want to use an awk-variable (not environment variable) for the "search string", so I can specify it on the awk command line.
For example:
1) I placed the awk script in a file named "cmd.awk". The file contains something like:
/??a??/ {gsub (a,a "(" ++j ")")}1
2) I would use awk like this:
awk -f cmd.awk -v a=i data.txt
To get the output:
Now i(1)s the ti(2)me
for all good men
to come to the
ai(3)d of thei(4)r party.
The question here is: how do I represent the variable "a" in the /search/ expression?
awk version:
awk '{for(i=2; i<=NF; i++)$i="(" ++k ")" $i}1' FS=i OFS=i
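To unpack that one-liner: FS=i splits each line at every "i", prepending "(" ++k ")" to fields 2 through NF re-inserts the counter where each "i" was removed, and OFS=i glues the fields back together with the "i" restored. For example (note k keeps counting across lines):

```shell
printf 'Now is the time\nfor all good men\naid of their party.\n' |
awk '{for(i=2; i<=NF; i++)$i="(" ++k ")" $i}1' FS=i OFS=i
# Now i(1)s the ti(2)me
# for all good men
# ai(3)d of thei(4)r party.
```

Lines with no "i" have NF == 1, so the loop never runs and they pass through unchanged.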
gensub() sounds ideal here, as it allows you to replace the Nth match, so one apparent solution is to iterate over the string in a do{}while() loop, replacing one match at a time and incrementing j. This simple gensub() approach won't work if the replacement does not contain the original text (or worse, contains it multiple times); see below.
So in awk, lacking perl's "s///e" evaluation feature, and its stateful regex /g modifier (as used by Steve) the best remaining option is to break the lines into chunks (head, match, tail) and stick them back together again:
BEGIN {
if (j=="") j=1
if (a=="") a="f"
}
match($0,a) {
str=$0; newstr=""
do {
newstr=newstr substr(str,1,RSTART-1) # head
mm=substr(str,RSTART,RLENGTH) # extract match
sub(a,a"("j++")",mm) # replace
newstr=newstr mm
str=substr(str,RSTART+RLENGTH) # tail
} while (match(str,a))
$0=newstr str
}
{print}
This uses match() as an expression instead of a // pattern so that you can use a variable. (You can also just use "($0 ~ a) { ... }", but the results of match() are used in this code, so don't try that here.)
You can define j and a on the command line.
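Putting it together as a one-shot command, with both a and j supplied via -v (inlining the chunk-splitting loop above, minus the BEGIN defaults, against the question's sample data):

```shell
printf 'Now is the time\nfor all good men\nto come to the\naid of their party.\n' |
awk -v a=i -v j=27 '
match($0, a) {
  str = $0; newstr = ""
  do {
    newstr = newstr substr(str, 1, RSTART - 1)   # head before the match
    mm = substr(str, RSTART, RLENGTH)            # extract the matched text
    sub(a, a "(" j++ ")", mm)                    # replace within the match
    newstr = newstr mm
    str = substr(str, RSTART + RLENGTH)          # tail still to scan
  } while (match(str, a))
  $0 = newstr str
}
{ print }'
```

This produces the i(27) through i(30) output shown earlier in the question.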
gawk supports \y which is the equivalent of perlre's \b, and also supports \< and \> to explicitly match the start and end of a word; just take care to add extra escapes from a unix command line (I'm not quite sure what Windows might require or permit).
Limited gensub() version
As referenced above:
match($0,a) {
idx=1; str=$0
do {
prev=str
str=gensub(a,a"(" j ")",idx++,prev)
} while (str!=prev && j++)
$0=str
}
The problems here are:
if you replace substring "i" with substring "k" or "k(1)" then the gensub() index for the next match will be off by 1. You could work around this if you either know that in advance, or work backward through the string instead.
if you replace substring "i" with substring "ii" or "ii(i)" then a similar problem arises (resulting in an infinite loop, because gensub() keeps finding a new match)
Dealing with both conditions robustly is not worth the code.
I'm not saying this can't be done using awk, but I would strongly suggest moving to a more powerful language. Use perl instead.
To include a count of the letter i beginning at 26, try:
perl -spe 's:i:$&."(".++$x.")":ge' -- -x=26 data.txt
This could also be a shell var:
var=26
perl -spe 's:i:$&."(".++$x.")":ge' -- -x=$var data.txt
Results:
Now i(27)s the ti(28)me
for all good men
to come to the
ai(29)d of thei(30)r party.
To include a count of specific words, add word boundaries (i.e. \b) around the words, try:
perl -spe 's:\bthe\b:$&."(".++$x.")":ge' -- -x=5 data.txt
Results:
Now is the(6) time
for all good men
to come to the(7)
aid of their party.

in bash, bash remove punctuation between pattern matches?

I am struggling with a conversion of a data file to csv when there is punctuation in the title field.
I have a bash script that obtains the file and processes it, and it almost works. What gets me is when there are commas in a free text title field, which then create extra fields.
I have tried some sed examples to replace between patterns but I have not gotten any of them to work. What I want to do is work between two patterns and replace commas with either nothing or perhaps a semicolon.
Taking this string:
name:A100040,title:Oatmeal is better with raisins, dates, and sugar,current_balance:50000,
Replacing with this:
name:A100040,title:Oatmeal is better with raisins dates and sugar,current_balance:50000,
I should probably use "title:" and ",current_" to denote the start and end of the block where I want to make the change to avoid situations like this:
name:A100040,title:Re-title current periodicals, recent books,current_balance:50000,
So far I have not gotten the substitution to match. In this case I am using !! to make the change obvious:
teststring="name:A100040,title:Oatmeal is better with raisins, dates, and sugar,current_balance:50000,"
echo $teststring |sed '/title:/,/current_/s/,/!!/g'
name:A100040!!title:Oatmeal is better with raisins!! dates!! and sugar!!current_balance:50000!!
Any help appreciated.
This is one way which could undoubtedly be refined:
perl -ple 'm/(.*?)(title:.*?)(current_balance:.*)/; $save = $part = $2; $part =~ s/,/!!/g; s/$save/$part/'
First, using sed or awk to parse CSV is almost always the wrong thing to do, because neither tool understands quoted field delimiters. That said, it seems like a better approach would be to quote the fields so that your output would be:
name:"A100040",title:"Oatmeal ... , dates, and sugar",current_balance:50000
Using sed you can try: (this is fragile)
sed 's/:\([^:]*\),\([^,:]*\)/:"\1",\2/g'
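Run against the sample string it gives the following (a sketch of its behaviour; note it also quotes the numeric balance field, which differs slightly from the idealized output above):

```shell
echo 'name:A100040,title:Oatmeal is better with raisins, dates, and sugar,current_balance:50000,' |
sed 's/:\([^:]*\),\([^,:]*\)/:"\1",\2/g'
# prints: name:"A100040",title:"Oatmeal is better with raisins, dates, and sugar",current_balance:"50000",
```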
If you insist on trying to parse the CSV with "standard" tools, and you consider perl to be standard, you could try:
perl -pe '1 while s/,([^,:]*),/ $1,/g'
