Regex and/or sed to replace lowercase - bash

I have a text file with a single column of data. Take the following data for example
united states
germany
france
canada
From this I am trying to generate all possible mixed case variations. For example, the new file might look like this:
United states
uNited states
unIted states
uniTed states
unitEd states
uniteD states
united States
united sTates
united stAtes
united staTes
united statEs
united stateS
UNited states
And so on until all possible case variations of each word have been generated.
Given the above input and expected output, I have three questions:
Are regex and sed the right tools for this job?
What alternatives do I have to regex and sed for this task?
If I did use regex and sed what might the correct syntax look like?

1) No
2) Awk and substr()
3) You wouldn't
Start with this:
$ echo 'foo' |
awk '{
    for (i=1; i<=length($0); i++) {
        print substr($0,1,i-1) toupper(substr($0,i,1)) substr($0,i+1)
    }
}'
Foo
fOo
foO
and massage to suit with the obvious logic.
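One possible way to massage it into the full 2^n set (every combination at once, rather than one flipped letter per line) is a small recursive function; this is only a sketch to adapt:
echo 'foo' |
awk '
# print every upper/lower-case combination of the line
function variations(prefix, rest,    c) {
    if (rest == "") { print prefix; return }
    c = substr(rest, 1, 1)
    variations(prefix tolower(c), substr(rest, 2))
    if (toupper(c) != tolower(c))     # skip duplicates for spaces, digits, ...
        variations(prefix toupper(c), substr(rest, 2))
}
{ variations("", $0) }'
For 'foo' this prints all eight variants, from foo through FOO.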

For the fun of sed.
1) Yes. (e.g. GNU sed version 4.2.1)
2) Maybe awk, perl
3) See code below
sed -E "s/^.*$/\n&#\n/;:a;s/\n([^#\n]*)([^#\n])#([^#\n]*)\n/\n\1#\u\2\3\n\1#\l\2\3\n/;ta;s/(^\n#|\n$)//g;s/\n#/\n/g;"
This does assume that "#" is not part of the strings found in the file.
create a certain pattern
(start and end with a newline; mark the cursor with #)
start a loop
replace the text between newlines that contains the cursor with the same text twice,
once with the character before the cursor uppercased, once with it lowercased,
moving the cursor one position towards the start
loop if that replaced something
remove the newlines at start and end, and the cursor markers
Note that # is not special. It just needs to be a character which will not occur in the input or in the desired output. Hopefully you can find such a character.
If the input can contain every possible character, it gets more complicated.
Output (for input "foo"):
FOO
fOO
FoO
foO
FOo
fOo
Foo
foo
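For readability, the same script can also be spelled out across several lines with comments (the logic is identical; "file" is a placeholder for the input file, and GNU sed is assumed as above):
sed -E '
  # wrap the line in newlines and plant the # cursor at the end
  s/^.*$/\n&#\n/
  :a
  # for the character just left of the cursor, emit the line twice:
  # once with that character uppercased, once lowercased; the cursor
  # moves one position towards the start
  s/\n([^#\n]*)([^#\n])#([^#\n]*)\n/\n\1#\u\2\3\n\1#\l\2\3\n/
  # repeat while the previous s/// replaced something
  ta
  # strip the wrapping newlines and the cursor markers
  s/(^\n#|\n$)//g
  s/\n#/\n/g
' file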

Related

How to sort comma separated words in Vim

In Python code, I frequently run into import statements like this:
from foo import ppp, zzz, abc
Is there any Vim trick, like :sort for lines, to sort to this:
from foo import abc, ppp, zzz
Yep, there is:
%s/import\s*\zs.*/\=join(sort(split(submatch(0), '\s*,\s*')),', ')
The key elements are:
:h :substitute
:h /\zs
:h s/\=
:h submatch()
:h sort()
:h join()
:h split()
To answer the comment, if you want to apply the substitution on a visual selection, it becomes:
'<,'>s/\%V.*\%V\@!/\=join(sort(split(submatch(0), '\s*,\s*')), ', ')
The new key elements are this time:
:h /\%V that says the next character matched shall belong to the visual selection
:h /\@! that I use, in order to express (combined with \%V), that the next character shall not belong to the visual selection. That next character isn't kept in the matched expression.
BTW, we can also use s and i_CTRL-R_= interactively, or put it in a mapping (here triggered on µ):
:xnoremap µ s<c-r>=join(sort(split(@", '\s*,\s*')), ', ')<cr><esc>
Alternatively, you can do the following steps:
Move the words you want to sort to the next line:
from foo import
ppp, zzz, abc
Add a comma at the end of the words list:
from foo import
ppp, zzz, abc,
Select the word list for example with Shift-v. Now hit : and then enter !xargs -n1 | sort | xargs. It should look like this:
:'<,'>!xargs -n1 | sort | xargs
Hit Enter.
from foo import
abc, ppp, zzz,
Now remove the trailing comma and merge the word list back to the original line (for example with Shift-j).
from foo import abc, ppp, zzz
There are Vim plugins which might be useful to you:
AdvancedSorters : Sorting of certain areas or by special needs.
I came here looking for a fast way to sort comma separated lists, in general, e.g.
relationships = {
'project', 'freelancer', 'task', 'managers', 'team'
}
My habit was to search/replace spaces with newlines and invoke shell sort but that's such a pain.
I ended up finding Chris Toomey's sort-motion plugin, which is just the ticket: https://github.com/christoomey/vim-sort-motion. Highly recommended.
Why not try vim-isort? https://github.com/fisadev/vim-isort
I use that and vim-yapf-format to beautify the code :) https://github.com/pignacio/vim-yapf-format
Select the comma-separated text in visual mode, press :, and run this command:
'<,'>!tr ',' '\n' | sort -f | paste -sd ','
🎩-tip to this comment
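If the list uses comma-plus-space between items, a variation that normalizes the spacing as well might be (a sketch that assumes the items themselves contain no spaces):
'<,'>!tr -d ' ' | tr ',' '\n' | sort -f | paste -sd ',' | sed 's/,/, /g'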

Search and replace a multi-line pattern with sed

I've seen a lot of examples regarding this problem, but have yet to see the one that is tailored for my use. Although I'm very familiar with sed I'm ashamed to say that I'm a noob when it comes to more advanced features. Here's the problem at hand.
I have a multi-line pattern that I can successfully match with sed like so,
sed -e '/Favorite Animals/, /!/p'
Favorite Animals
Monkey
Penguin
Cat
!
Favorite Things
Shoe
Dog
Wheel
Moth
!
and what I personally like about this expression is that I can match a variable number of lines up to the exclamation character. Now let's say I wanted to do a search and replace on that same pattern. Basically I would like to replace that multi-line pattern that was previously demonstrated with any string of my choosing. Any Ideas? I'm hoping for similar syntax to my demonstrated sed command, but beggars can't be choosers.
The idea is so that I can replace one of the groups delimited by the exclamation with a string. I'll call them "entries". I want to be able to update or overwrite these entries. If I had a new updated version of favorite animals I would like to be able to replace the old entry with a new one like this.
Favorite Animals
Sloth
Platypus
Badger
Dog
!
Favorite Things
Shoe
Dog
Wheel
Moth
!
As you can see I'm no longer a fan of monkeys now.
There are a variety of options — the i, c, and a commands can all be used.
Amended answer
This amended answer deals with the modified data file now in the question. Here's a mildly augmented version of the modified data file:
There's material at the start of the file then the key information:
Favourite Animals
Monkey
Penguin
Cat
!
Favourite Things
Shoe
Wheel
Moth
!
and some material at the end of the file too.
All three of these sed scripts produce the same output:
sed '/Favourite Animals/,/!/c\
Favourite Animals\
Sloth\
Platypus\
Badger\
!
' data
sed '/Favourite Animals/i\
Favourite Animals\
Sloth\
Platypus\
Badger\
!
/Favourite Animals/,/!/d' data
sed '/Favourite Animals/a\
Favourite Animals\
Sloth\
Platypus\
Badger\
!
/Favourite Animals/,/!/d' data
Sample output:
There's material at the start of the file then the key information:
Favourite Animals
Sloth
Platypus
Badger
!
Favourite Things
Shoe
Wheel
Moth
!
and some material at the end of the file too.
It is crucial that the scripts all key off the unique string /Favourite Animals/ and not the repeated trailing context /!/. If the i or a commands use /!/ instead of /Favourite Animals/, the outputs change, and not for the better.
/!/i:
There's material at the start of the file then the key information:
Favourite Animals
Sloth
Platypus
Badger
!
Favourite Things
Shoe
Wheel
Moth
Favourite Animals
Sloth
Platypus
Badger
!
!
and some material at the end of the file too.
/!/a:
There's material at the start of the file then the key information:
Favourite Animals
Sloth
Platypus
Badger
!
Favourite Things
Shoe
Wheel
Moth
!
Favourite Animals
Sloth
Platypus
Badger
!
and some material at the end of the file too.
Extra request
Would it be possible to select a range within a range using sed? Basically, what if I wanted to change or remove one/many of my favorite animals within the previously specified range. That is /Favorite Animals/,/!/... change something within this range.
Yes, of course. For a single mapping:
sed '/Favourite Animals/,/!/ s/Monkey/Gorilla/'
For multiple mappings:
sed '/Favourite Animals/,/!/ {
s/Monkey/Gorilla/
s/Penguin/Zebra/
s/Cat/Dog/
}'
You can also combine those onto a single line if you wish — use semicolons to separate them:
sed '/Favourite Animals/,/!/ { s/Monkey/Gorilla/; s/Penguin/Zebra/; s/Cat/Dog/; }'
Be aware that GNU sed and BSD (Mac OS X) sed have different views on the necessity for the last semicolon — what's shown works with both.
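The extra request also mentions removing entries; a line delete restricted to the same range does that, for example to drop just the Monkey line:
sed '/Favourite Animals/,/!/ { /Monkey/d; }' data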
The original answer works with a simpler input file.
Original answer
Consider the file data containing:
There's material
at the start of the file
then the key information:
Favourite Animals
Monkey
Penguin
Cat
!
and material at the end of the file too.
Using c, you might write:
$ sed '/Favourite Animals/,/!/c\
> Replacement material\
> for the favourite animals
> ' data
There's material
at the start of the file
then the key information:
Replacement material
for the favourite animals
and material at the end of the file too.
$
Using i, you would use:
$ sed '/Favourite Animals/i\
> Replacement material\
> for the favourite animals
> /Favourite Animals/,/!/d' data
There's material
at the start of the file
then the key information:
Replacement material
for the favourite animals
and material at the end of the file too.
$
Using a, you might write:
$ sed '/!/a\
> Replacement material\
> for the favourite animals
> /Favourite Animals/,/!/d' data
There's material
at the start of the file
then the key information:
Replacement material
for the favourite animals
and material at the end of the file too.
$
Note that with:
c — you change the whole range
i — you insert before the first pattern in the range before you delete the entire range
a — you append after the last pattern in the range before you delete the entire range
Though, come to think of it, you could insert before the last pattern in the range before deleting the entire range, or append after the first pattern in the range before deleting the entire range. So, the key with i and a is to put the 'replacement text' operation before the range-based delete. But c is most succinct.
The i (insert) command has ugly syntax, but it works:
sed '/Favorite Animals/i\
some new text\
some more new text\
and a little more
/Favorite Animals/,/!/d'
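If the replacement block is long, another option is to keep the commands in a script file and run it with sed -f; a sketch, where replace.sed and data.txt are placeholder names:
# replace.sed -- insert the new entry, then delete the old one
/Favorite Animals/i\
some new text\
some more new text\
and a little more
/Favorite Animals/,/!/d
and run it with:
sed -f replace.sed data.txt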

How to split a file containing non-ascii characters into words, in bash?

For example, I have a file with normal text, like:
"Word1 Kuͦn, buͤtten; word4:"
I want to get a file with 1 word per line, keeping the punctuation, and sorted:
,
:
;
Word1
Kuͦn
buͤtten
word4
The code I use:
grep -Eo '\w+|[^\w ]' input.txt | sort -f >> output.txt
This code works almost perfectly, except for one thing: it splits diacritical characters apart from the letters they belong to, as if they were separate words:
,
:
;
Word1
Ku
ͦ
n
bu
ͤ
tten
word4
The letters uͦ, uͤ and others with the same diacritics are not in the ASCII table. How can I split my file correctly without deleting or replacing these characters?
Edit:
locale output:
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
Unfortunately, U+0366 (COMBINING LATIN SMALL LETTER O) is not an alphabetic character. It is a non-spacing mark, Unicode category Mn, which generally maps to the Posix ctype cntrl.
Roughly speaking, an alphabetic grapheme is an alphabetic character possibly followed by one or more combining characters. It's possible to write that as a regex pattern if you have a regex library which implements Unicode general categories. Gnu grep is usually compiled with an interface to the popular pcre (Perl-compatible regular expression) library, which has reasonably good Unicode support. So if you have Gnu grep, you're in luck.
To enable "perl-like" regular expressions, you need to invoke grep with the -P option (or as pgrep). However, that is not quite enough because by default grep will use an 8-bit encoding even if the locale specifies a UTF-8 encoding. So you need to put the regex system into "UTF-8" mode in order to get it to recognize your character encoding.
Putting all that together, you might end up with something like the following:
grep -Po '(*UTF8)(\p{L}\p{M}*|\p{N})+|[\p{P}\p{S}]'
-P patterns are "perl-compatible"
-o output each substring matched
(*UTF8) If the pattern starts with exactly this sequence,
pcre is put into UTF-8 mode.
\p{...} Select a character in a specified Unicode general category
\P{...} Select a character not in a specified Unicode general category
\p{L} General category L: letters
\p{N} General category N: numbers
\p{M} General category M: combining marks
\p{P} General category P: punctuation
\p{S} General category S: symbols
\p{L}\p{M}* A letter possibly followed by various combining marks
\p{L}\p{M}*|\p{N} ... or a number
More information on Unicode general categories and Unicode regular expression matching in general can be found in Unicode Technical Report 18 on regular expression matching. But beware that the syntax described in that TR is a recommendation and is not exactly implemented by most regex libraries. In particular, pcre does not support the useful notation \p{L|N} (letter or number). Instead, you need to use [\p{L}\p{N}].
Documentation about pcre is probably available on your system (man pcre); if not, have a link on me.
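Plugged back into the pipeline from the question, the whole command might look like this (assuming GNU grep with pcre support and the question's file names):
grep -Po '(*UTF8)(\p{L}\p{M}*|\p{N})+|[\p{P}\p{S}]' input.txt | sort -f >> output.txt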
If you don't have Gnu grep or in the unlikely case that your version was compiled without pcre support, you might be able to use perl, python or other languages with regex capabilities. However, doing so is surprisingly difficult. After some experimentation, I found the following Perl incantation which seems to work:
perl -CIO -lne 'print $& while /(\p{L}\p{M}*|\p{N})+|[\p{P}\p{S}]/g'
Here, -CIO tells Perl that input and output are in UTF-8, and -lne is a standard incantation meaning: automatically output a newline after each print (-l), and loop through every line of the input (-n), executing the supplied code for each one (-e).

gsub issue with awk (gawk)

I need to search a text file for a string, and make a replacement that includes a number that increments with each match.
The string to be "found" could be a single character, or a word, or a phrase.
The replacement expression will not always be the same (as it is in my examples below), but will always include a number (variable) that increments.
For example:
1) I have a test file named "data.txt". The file contains:
Now is the time
for all good men
to come to the
aid of their party.
2) I placed the awk script in a file named "cmd.awk". The file contains:
/f/ {sub ("f","f(" ++j ")")}1
3) I use awk like this:
awk -f cmd.awk data.txt
In this case, the output is as expected:
Now is the time
f(1)or all good men
to come to the
aid of(2) their party.
The problem comes when there is more than one match on a line. For example, if I was searching for the letter "i" like:
/i/ {sub ("i","i(" ++j ")")}1
The output is:
Now i(1)s the time
for all good men
to come to the
ai(2)d of their party.
which is wrong because it doesn't include the "i" in "time" or "their".
So, I tried "gsub" instead of "sub" like:
/i/ {gsub ("i","i(" ++j ")")}1
The output is:
Now i(1)s the ti(1)me
for all good men
to come to the
ai(2)d of thei(2)r party.
Now it makes the replacement for all occurrences of the letter "i", but the inserted number is the same for all matches on the same line.
The desired output should be:
Now i(1)s the ti(2)me
for all good men
to come to the
ai(3)d of thei(4)r party.
Note: The number won't always begin with "1" so I might use awk like this:
awk -f cmd.awk -v j=26 data.txt
To get the output:
Now i(27)s the ti(28)me
for all good men
to come to the
ai(29)d of thei(30)r party.
And just to be clear, the number in the replacement will not always be inside parenthesis. And the replacement will not always include the matched string (actually it would be quite rare).
The other problem I am having with this is...
I want to use an awk-variable (not environment variable) for the "search string", so I can specify it on the awk command line.
For example:
1) I placed the awk script in a file named "cmd.awk". The file contains something like:
/??a??/ {gsub (a,a "(" ++j ")")}1
2) I would use awk like this:
awk -f cmd.awk -v a=i data.txt
To get the output:
Now i(1)s the ti(2)me
for all good men
to come to the
ai(3)d of thei(4)r party.
The question here is: how do I represent the variable "a" in the /search/ expression?
awk version (split each line on the search character with FS, prepend the counter to every field after the first, then rejoin with OFS):
awk '{for(i=2; i<=NF; i++)$i="(" ++k ")" $i}1' FS=i OFS=i
gensub() sounds ideal here: it allows you to replace the Nth match, so what sounds like a solution is to iterate over the string in a do{}while() loop, replacing one match at a time and incrementing j. This simple gensub() approach won't work if the replacement does not contain the original text (or worse, contains it multiple times); see below.
So in awk, lacking perl's "s///e" evaluation feature, and its stateful regex /g modifier (as used by Steve) the best remaining option is to break the lines into chunks (head, match, tail) and stick them back together again:
BEGIN {
    if (j=="") j=1
    if (a=="") a="f"
}
match($0,a) {
    str=$0; newstr=""
    do {
        newstr=newstr substr(str,1,RSTART-1)  # head
        mm=substr(str,RSTART,RLENGTH)         # extract match
        sub(a,a"("j++")",mm)                  # replace
        newstr=newstr mm
        str=substr(str,RSTART+RLENGTH)        # tail
    } while (match(str,a))
    $0=newstr str
}
{print}
This uses match() as an expression instead of a // pattern so you can use a variable. (You can also just use "($0 ~ a) { ... }", but the results of match() are used in this code, so don't try that here.)
You can define j and a on the command line.
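For instance, to search for "i" and start the numbering at 27 with this script, an invocation might look like this (note that, unlike the ++j in the question, this code uses j before incrementing, so you pass the first number you actually want):
awk -f cmd.awk -v a=i -v j=27 data.txt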
gawk supports \y which is the equivalent of perlre's \b, and also supports \< and \> to explicitly match the start and end of a word; just take care to add extra escapes from a unix command line (I'm not quite sure what Windows might require or permit).
Limited gensub() version
As referenced above:
match($0,a) {
    idx=1; str=$0
    do {
        prev=str
        str=gensub(a,a"(" j ")",idx++,prev)
    } while (str!=prev && j++)
    $0=str
}
The problems here are:
if you replace substring "i" with substring "k" or "k(1)" then the gensub() index for the next match will be off by 1. You could work around this if you either know that in advance, or work backward through the string instead.
if you replace substring "i" with substring "ii" or "ii(i)" then a similar problem arises (resulting in an infinite loop, because gensub() keeps finding a new match)
Dealing with both conditions robustly is not worth the code.
I'm not saying this can't be done using awk, but I would strongly suggest moving to a more powerful language. Use perl instead.
To include a count of the letter i beginning at 26, try:
perl -spe 's:i:$&."(".++$x.")":ge' -- -x=26 data.txt
This could also be a shell var:
var=26
perl -spe 's:i:$&."(".++$x.")":ge' -- -x=$var data.txt
Results:
Now i(27)s the ti(28)me
for all good men
to come to the
ai(29)d of thei(30)r party.
To include a count of specific words, add word boundaries (i.e. \b) around the words, try:
perl -spe 's:\bthe\b:$&."(".++$x.")":ge' -- -x=5 data.txt
Results:
Now is the(6) time
for all good men
to come to the(7)
aid of their party.
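To pass the search string on the command line as well, the same switch mechanism can be used for the pattern (a sketch; the -r switch name is arbitrary here, and its value is treated as a regex):
perl -spe 's:$r:$&."(".++$x.")":ge' -- -r=i -x=26 data.txt
which gives the same i(27) through i(30) output as above.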

adding word to a field common to all records in a file

I want to add a word to field 1 of a file, separated by a space. For example,
my file contains three colon-separated fields:
apple:fruit:tree
orange:fruit:tree
mango:fruit:tree
brinjal:vegetable:plant
potato:vegetable:root
Now I want to add the word "family", separated by a space, to field 1.
Therefore, the resultant file should look something like this:
apple family:fruit:tree
orange family:fruit:tree
mango family:fruit:tree
brinjal family:vegetable:plant
potato family:vegetable:root
Any ideas on this would be appreciated.
Thanks,
You can use sed:
sed 's/:/ family:/' yourfile.txt
This will replace the first : on each line with " family:", which achieves the desired result. You might have to adjust the regular expression though, in case : also appears earlier in the text.
Update: I am not sure what you want with "I want to add a word 'active' to field 1 of a file separated by space", as you give no example for that.
Update 2:
It will only replace the first occurrence of :. However, if you want to replace something in the middle, you just have to capture the data before the delimiter:
sed 's/^\(.*:.*\):/\1 family:/' test.txt
This example adds family before the third field. \(.*:.*\) captures the characters before and after the first : (i.e. the values of the first and second fields). The following : will be replaced by these characters (\1 refers to the first capture group) followed by family:. The rest of the line stays untouched.
Use awk
$ awk -F":" '{$1=$1" family"}1' OFS=":" file
apple family:fruit:tree
orange family:fruit:tree
mango family:fruit:tree
brinjal family:vegetable:plant
potato family:vegetable:root
If you want to add to the 2nd field, just do $2=$2" family", as shown below. It's easier than building a regex, as you would with sed.
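For example, with the output for the sample file:
$ awk -F":" '{$2=$2" family"}1' OFS=":" file
apple:fruit family:tree
orange:fruit family:tree
mango:fruit family:tree
brinjal:vegetable family:plant
potato:vegetable family:root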
Now may I redirect you to the awk manual to learn more about awk. Try to do it yourself next time.
