Capitalize strings in sed or awk - bash

I have three types of strings that I'd like to capitalize in a bash script. I figured sed/awk would be my best bet, but I'm not sure. What's the best way given the following requirements?
single word
e.g. taco -> Taco
multiple words separated by hyphens
e.g. my-fish-tacos -> My-Fish-Tacos
multiple words separated by underscores
e.g. my_fish_tacos -> My_Fish_Tacos

There's no need to use capture groups (although & is one, in a way):
echo "taco my-fish-tacos my_fish_tacos" | sed 's/[^ _-]*/\u&/g'
The output:
Taco My-Fish-Tacos My_Fish_Tacos
The escaped lower case "u" capitalizes the next character in the matched sub-string.
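To make the difference between the case-conversion escapes concrete, here is a small sketch. It assumes GNU sed, since \u, \U and \E are GNU extensions; the sample word and the variable names are only illustrative.

```shell
# Assumes GNU sed: \u, \U and \E are GNU extensions.
# \u uppercases only the next character of the replacement.
first_upper=$(echo "taco" | sed 's/.*/\u&/')
# \U uppercases everything that follows, until \E or the end of the replacement.
all_upper=$(echo "taco" | sed 's/.*/\U&/')
# \E switches the conversion off again, so the appended text keeps its case.
mixed=$(echo "taco" | sed 's/.*/\U&\E-shell/')
printf '%s\n' "$first_upper" "$all_upper" "$mixed"
```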

Using awk:
echo 'test' | awk '{
for ( i=1; i <= NF; i++) {
sub(".", substr(toupper($i), 1,1) , $i);
print $i;
# or
# print substr(toupper($i), 1,1) substr($i, 2);
}
}'
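The awk loop above only splits on whitespace, so my-fish-tacos would become My-fish-tacos. A possible variant that also treats - and _ as word boundaries is sketched below; the function name capitalize_words is made up for this example, and it walks the line character by character instead of relying on fields.

```shell
# Hypothetical helper: capitalize the first letter of every word, where words
# are separated by a space, a hyphen or an underscore.
capitalize_words() {
  awk '{
    out = ""
    start = 1                      # 1 while the next character begins a word
    for (i = 1; i <= length($0); i++) {
      c = substr($0, i, 1)
      out = out (start ? toupper(c) : c)
      start = (c ~ /[ _-]/)        # a separator means the next char starts a word
    }
    print out
  }'
}

echo "taco my-fish-tacos my_fish_tacos" | capitalize_words
```

This uses only toupper(), substr() and length(), so it should not depend on GNU extensions.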

Try the following:
sed 's/\([a-z]\)\([a-z]*\)/\U\1\L\2/g'
It works for me using GNU sed, but I don't think BSD sed supports \U and \L.

Here is a solution that does not rely on \u, which not all sed implementations support.
Save the following script as capitalize.sed, then run sed -i -f capitalize.sed FILE
s:^:.:
h
y/qwertyuiopasdfghjklzxcvbnm/QWERTYUIOPASDFGHJKLZXCVBNM/
G
s:$:\n:
:r
/^.\n.\n/{s:::;p;d}
/^[^[:alpha:]][[:alpha:]]/ {
s:.\(.\)\(.*\):x\2\1:
s:\n\(..\):\nx:
tr
}
/^[[:alpha:]][[:alpha:]]/ {
s:\n.\(.\)\(.*\)$:\nx\2\1:
s:..:x:
tr
}
/^[^\n]/ {
s:^.\(.\)\(.*\)$:.\2\1:
s:\n..:\n.:
tr
}

alinsoar's mind-blowing solution doesn't work at all in Plan9 sed, or correctly in busybox sed. But you should still try to figure out how it's supposed to do its thing: you will learn a lot about sed.
Here's a not-as-clever but easier to understand version which works in at least Plan9, busybox, and GNU sed (and probably BSD and MacOS). Plan9 sed needs backslashes removed in the match part of the s command.
#! /bin/sed -f
y/PYFGCRLAOEUIDHTNSQJKXBMWVZ/pyfgcrlaoeuidhtnsqjkxbmwvz/
s/\(^\|[^A-Za-z]\)a/\1A/g
s/\(^\|[^A-Za-z]\)b/\1B/g
s/\(^\|[^A-Za-z]\)c/\1C/g
s/\(^\|[^A-Za-z]\)d/\1D/g
s/\(^\|[^A-Za-z]\)e/\1E/g
s/\(^\|[^A-Za-z]\)f/\1F/g
s/\(^\|[^A-Za-z]\)g/\1G/g
s/\(^\|[^A-Za-z]\)h/\1H/g
s/\(^\|[^A-Za-z]\)i/\1I/g
s/\(^\|[^A-Za-z]\)j/\1J/g
s/\(^\|[^A-Za-z]\)k/\1K/g
s/\(^\|[^A-Za-z]\)l/\1L/g
s/\(^\|[^A-Za-z]\)m/\1M/g
s/\(^\|[^A-Za-z]\)n/\1N/g
s/\(^\|[^A-Za-z]\)o/\1O/g
s/\(^\|[^A-Za-z]\)p/\1P/g
s/\(^\|[^A-Za-z]\)q/\1Q/g
s/\(^\|[^A-Za-z]\)r/\1R/g
s/\(^\|[^A-Za-z]\)s/\1S/g
s/\(^\|[^A-Za-z]\)t/\1T/g
s/\(^\|[^A-Za-z]\)u/\1U/g
s/\(^\|[^A-Za-z]\)v/\1V/g
s/\(^\|[^A-Za-z]\)w/\1W/g
s/\(^\|[^A-Za-z]\)x/\1X/g
s/\(^\|[^A-Za-z]\)y/\1Y/g
s/\(^\|[^A-Za-z]\)z/\1Z/g

This might work for you (GNU sed):
echo "aaa bbb ccc aaa-bbb-ccc aaa_bbb_ccc aaa-bbb_ccc" | sed 's/\<.\|_./\U&/g'
Aaa Bbb Ccc Aaa-Bbb-Ccc Aaa_Bbb_Ccc Aaa-Bbb_Ccc

Related

Convert first character to capital along with special character separator

I would like to capitalize the first character and every character that follows a dash (-), using bash.
I can split the string on - and convert the individual elements with
echo "string" | tr '[:lower:]' '[:upper:]'
and then join them again, but that doesn't seem effective. Is there an easy way to do this in a single line?
Input string:
JASON-CONRAD-983636
Expected string:
Jason-Conrad-983636
I recommend using Python for this:
python3 -c 'import sys; print("-".join(s.capitalize() for s in sys.stdin.read().strip().split("-")))'
Usage:
capitalize() {
python3 -c 'import sys; print("-".join(s.capitalize() for s in sys.stdin.read().strip().split("-")))'
}
echo JASON-CONRAD-983636 | capitalize
Output:
Jason-Conrad-983636
In pure bash (v4+) without any third party utils
str=JASON-CONRAD-983636
IFS=- read -ra raw <<<"$str"
final=()
for str in "${raw[@]}"; do
first=${str:0:1}
rest=${str:1}
final+=( "${first^^}${rest,,}" )
done
and print the result
( IFS=- ; printf '%s\n' "${final[*]}" ; )
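The same idea can be wrapped in a reusable function. The sketch below assumes bash 4 or later (for the ^ and ,, expansions); the name capitalize_dashed is hypothetical and not part of the answer above.

```shell
#!/usr/bin/env bash
# Hypothetical helper, bash 4+ only: capitalize each dash-separated part.
capitalize_dashed() {
  local IFS=-                 # split and join on "-" inside this function only
  local part
  local -a parts out
  out=()
  read -ra parts <<<"$1"
  for part in "${parts[@]}"; do
    part=${part,,}            # lowercase the whole part
    out+=("${part^}")         # uppercase its first character
  done
  printf '%s\n' "${out[*]}"   # "${out[*]}" joins with the first character of IFS
}

capitalize_dashed "JASON-CONRAD-983636"
```

Keeping IFS local avoids the subshell that the original snippet uses for printing.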
This might work for you (GNU sed):
sed 's/.*/\L&/;s/\b./\u&/g' file
Lowercase everything. Uppercase first characters of words.
Alternative:
sed -E 's/\b(.)((\B.)*)/\u\1\L\2/g' file
You could try the following, if you are OK with awk:
var="JASON-CONRAD-983636"
echo "$var" | awk -F'-' '{for(i=1;i<=NF;i++){$i=toupper(substr($i,1,1)) tolower(substr($i,2))}} 1' OFS="-"
Although the party is mostly over, please let me join with a perl solution:
perl -pe 's/(^|-)([^-]+)/$1 . ucfirst lc $2/ge' <<<"JASON-CONRAD-983636"
It may be cunning to use the ucfirst function :)

Bash: replace 4 occurrences of a string if they exist

I have a string that is sometimes
xxx.11_222_33_44_555.yyy
and sometimes
xxx.11_222_33_44.yyy
I would like to:
Check whether it has 4 occurrences of _ (I figured out how to do that).
If so, remove the string's _33 part (the 33 changes and can be any number), so that I am left with xxx.11_222_44_555.yyy.
Using sed :
sed 's/\(_[0-9]*\)_[0-9]*\(_[0-9]*_[0-9]*\)/\1\2/'
It matches the four underscores and replaces the whole match with the parts to keep.
Test run :
$ echo "xxx.11_222_33_44_555.yyy" | sed 's/\(_[0-9]*\)_[0-9]*\(_[0-9]*_[0-9]*\)/\1\2/'
xxx.11_222_44_555.yyy
$ echo "xxx.11_222_33_44.yyy" | sed 's/\(_[0-9]*\)_[0-9]*\(_[0-9]*_[0-9]*\)/\1\2/'
xxx.11_222_33_44.yyy
perhaps something like this
echo "xxx.11_222_33_44.yyy" | sed -e's/\.\([0-9]\+\)_\([0-9]\+\)_\([0-9]\+\)_\([0-9]\+\)\./.\1_\2_\4./'
which checks if there are 4 groups of numbers separated by _ between the two dots and if yes, it leaves out the third group
try this;
echo "xxx.11_222_33_44_555.yyy" | awk -F'_' 'NF>4{print $1"_"$2"_"$4"_"$5};'
Solution using perl with \K and a lookahead:
$ a="xxx.11_222_33_44_555.yyy"
$ perl -pe 's/\.\d+_\d+_\K\d+_(?=\d+_\d+\.)//' <<< "$a"
xxx.11_222_44_555.yyy
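For completeness, the same check-and-remove step can be sketched in bash alone using [[ =~ ]] and BASH_REMATCH. The function name drop_third_group is invented for this example, and the regex treats "at least 4 underscores" as the trigger, which is an assumption about what the question wants.

```shell
#!/usr/bin/env bash
# Hypothetical helper: if the string has at least 4 underscores,
# drop the third underscore-separated group; otherwise print it unchanged.
drop_third_group() {
  local s=$1
  # Two leading groups, the group to drop, then at least two more groups.
  local re='^([^_]*_[^_]*)_[^_]*(_[^_]*_.*)$'
  if [[ $s =~ $re ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}${BASH_REMATCH[2]}"
  else
    printf '%s\n' "$s"
  fi
}

drop_third_group "xxx.11_222_33_44_555.yyy"
drop_third_group "xxx.11_222_33_44.yyy"
```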

Replace strings in multiple files with corresponding caps using bash on MacOSX

I have multiple .txt files, in which I want to replace the strings
old -> new
Old -> New
OLD -> NEW
The first step is to only replace one string Old->New. Here is my current code, but it does not do the job (the files remain unchanged). The sed line works only if I replace the variables with the actual strings.
#!/bin/bash
old_string="Old"
new_string="New"
sed -i '.bak' 's/$old_string/$new_string/g' *.txt
Also, how do I convert a string to all upper-caps and all lower-caps?
Thank you very much for your advice!
To complement @merlin2011's helpful answer:
If you wanted to create the case variants dynamically, try this:
# Define search and replacement strings
# as all-lowercase.
old_string='old'
new_string='new'
# Loop 3 times and create the case variants dynamically.
# Build up a _single_ sed command that performs all 3
# replacements.
sedCmd=
for (( i = 1; i <= 3; i++ )); do
case $i in
1) # as defined (all-lowercase)
old_string_variant=$old_string
new_string_variant=$new_string
;;
2) # initial capital
old_string_variant="$(tr '[:lower:]' '[:upper:]' <<<"${old_string:0:1}")${old_string:1}"
new_string_variant="$(tr '[:lower:]' '[:upper:]' <<<"${new_string:0:1}")${new_string:1}"
;;
3) # all-uppercase
old_string_variant=$(tr '[:lower:]' '[:upper:]' <<<"$old_string")
new_string_variant=$(tr '[:lower:]' '[:upper:]' <<<"$new_string")
;;
esac
# Append to the sed command string. Note the use of _double_ quotes
# to ensure that variable references are expanded.
sedCmd+="s/$old_string_variant/$new_string_variant/g; "
done
# Finally, invoke sed.
sed -i '.bak' "$sedCmd" *.txt
Note that bash 4 supports case conversions directly (as part of parameter expansion), but OS X, as of 10.9.3, is still on bash 3.2.51.
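For readers who do have bash 4 or later, the case conversions mentioned above look roughly like this. This is only a sketch; the variable names are illustrative, and it will fail with a "bad substitution" error on bash 3.2.

```shell
#!/usr/bin/env bash
# Requires bash 4 or later; bash 3.2 rejects these expansions.
old_string='old'
initial_cap=${old_string^}     # first character uppercased
all_upper=${old_string^^}      # every character uppercased
all_lower=${all_upper,,}       # every character lowercased
printf '%s %s %s\n' "$initial_cap" "$all_upper" "$all_lower"
```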
Alternative solution, using awk to create the case variants and synthesize the sed command:
Aside from being shorter, it is also more robust: through appropriate escaping, it correctly handles strings that contain regex metacharacters (characters with special meaning in a regular expression, e.g., *) or characters with special meaning in the replacement-string parameter of sed's s function (e.g., \). Without escaping, the sed command would not work as expected.
Caveat: Doesn't support strings with embedded \n chars. (though that could be fixed, too).
# Define search and replacement strings as all-lowercase literals.
old_string='old'
new_string='new'
# Synthesize the sed command string, utilizing awk and its tolower() and toupper()
# functions to create the case variants.
# Note the need to escape \ chars to prevent awk from interpreting them.
sedCmd=$(awk \
-v old_string="${old_string//\\/\\\\}" \
-v new_string="${new_string//\\/\\\\}" \
'BEGIN {
printf "s/%s/%s/g; s/%s/%s/g; s/%s/%s/g",
old_string, new_string,
toupper(substr(old_string,1,1)) substr(old_string,2), toupper(substr(new_string,1,1)) substr(new_string,2),
toupper(old_string), toupper(new_string)
}')
# Invoke sed with the synthesized command.
# The inner sed command ensures that all regex metacharacters in the strings
# are escaped so that sed treats them as literals.
sed -i '.bak' "$(sed 's#[][(){}^$.*?+\]#\\&#g' <<<"$sedCmd")" *.txt
If you want to do bash variable expansion inside the argument to sed, you need to use double quotes " instead of single quotes '.
sed -i '.bak' "s/$old_string/$new_string/g" *.txt
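The effect of the quoting can be checked without touching any files. The following sketch pipes a made-up sample sentence through sed with both quoting styles; only the double-quoted version lets the shell expand the variables.

```shell
old_string="Old"
new_string="New"
# Single quotes: sed receives the literal text $old_string, which matches nothing.
unexpanded=$(echo "Old habits" | sed 's/$old_string/$new_string/g')
# Double quotes: the shell substitutes the values before sed sees the script.
expanded=$(echo "Old habits" | sed "s/$old_string/$new_string/g")
printf '%s\n' "$unexpanded" "$expanded"
```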
In terms of getting matches on all three of the literal substitutions, the cleanest solution may be just to run sed three times in a loop like this.
declare -a olds=(old Old OLD)
declare -a news=(new New NEW)
for i in `seq 0 2`; do
sed -i "s/${olds[$i]}/${news[$i]}/g" *.txt
done;
Update: The solution above works on Linux, but apparently OS X has different requirements. Additionally, as @mklement0 mentioned, my for loop is silly. Here is an improved version for OS X.
declare -a olds=(old Old OLD)
declare -a news=(new New NEW)
for (( i = 0; i < ${#olds[@]}; i++ )); do
sed -i '.bak' "s/${olds[$i]}/${news[$i]}/g" *.txt
done;
Assuming each string is separated by spaces from your other strings and that you don't want partial matches within longer strings and that you don't care about preserving white space on output and assuming that if an "old" string matches on a "new" string after a previous conversion operation, then the string should be changed again:
$ cat tst.awk
BEGIN {
split(tolower(old),oldStrs)
split(tolower(new),newStrs)
}
{
for (fldNr=1; fldNr<=NF; fldNr++) {
for (stringNr=1; stringNr in oldStrs; stringNr++) {
oldStr = oldStrs[stringNr]
if (tolower($fldNr) == oldStr) {
newStr = newStrs[stringNr]
split(newStr,newChars,"")
split($fldNr,fldChars,"")
$fldNr = ""
for (charNr=1; charNr in fldChars; charNr++) {
fldChar = fldChars[charNr]
newChar = newChars[charNr]
$fldNr = $fldNr ( fldChar ~ /[[:lower:]]/ ?
newChar : toupper(newChar) )
}
}
}
}
print
}
.
$ cat file
The old Old OLD smOLDering QuICk brown FoX jumped
$ awk -v old="old" -v new="new" -f tst.awk file
The new New NEW smOLDering QuICk brown FoX jumped
Note that the "old" in "smOLDering" did not get changed. Is that desirable?
$ awk -v old="QUIck Fox" -v new="raBid DOG" -f tst.awk file
The old Old OLD smOLDering RaBId brown DoG jumped
$ awk -v old="THE brown Jumped" -v new="FEW dingy TuRnEd" -f tst.awk file
Few old Old OLD smOLDering QuICk dingy FoX turned
Think about whether or not this is your expected output:
$ awk -v old="old new" -v new="new yes" -f tst.awk file
The yes Yes YES smOLDering QuICk brown FoX jumped
A few lines of sample input and expected output in the question would be useful to avoid all the guessing and assumptions.

Bash command to extract characters in a string

I want to write a small script to generate the location of a file in an NGINX cache directory.
The format of the path is:
/path/to/nginx/cache/d8/40/32/13febd65d65112badd0aa90a15d84032
Note the last 6 characters: d8 40 32, are represented in the path.
As an input I give the md5 hash (13febd65d65112badd0aa90a15d84032) and I want to generate the output: d8/40/32/13febd65d65112badd0aa90a15d84032
I'm sure sed or awk will be handy, but I don't know yet how...
This awk can make it:
awk 'BEGIN{FS=""; OFS="/"}{print $(NF-5)$(NF-4), $(NF-3)$(NF-2), $(NF-1)$NF, $0}'
Explanation
BEGIN{FS=""; OFS="/"}. FS="" sets the input field separator to be "", so that every char will be a different field. OFS="/" sets the output field separator as /, for print matters.
print ... $(NF-1)$NF, $0 prints the penultimate field and the last one all together; then, the whole string. The comma is "filled" with the OFS, which is /.
Test
$ awk 'BEGIN{FS=""; OFS="/"}{print $(NF-5)$(NF-4), $(NF-3)$(NF-2), $(NF-1)$NF, $0}' <<< "13febd65d65112badd0aa90a15d84032"
d8/40/32/13febd65d65112badd0aa90a15d84032
Or with a file:
$ cat a
13febd65d65112badd0aa90a15d84032
13febd65d65112badd0aa90a15f1f2f3
$ awk 'BEGIN{FS=""; OFS="/"}{print $(NF-5)$(NF-4), $(NF-3)$(NF-2), $(NF-1)$NF, $0}' a
d8/40/32/13febd65d65112badd0aa90a15d84032
f1/f2/f3/13febd65d65112badd0aa90a15f1f2f3
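As far as I know, splitting into one field per character with FS="" is a common awk extension rather than something POSIX guarantees. A variant that avoids it by using substr() and length() might look like the sketch below; the function name cache_rel_path is made up for this example.

```shell
# Hypothetical helper: read hashes on stdin, print <aa>/<bb>/<cc>/<hash>,
# where aa, bb, cc are the last three pairs of characters.
cache_rel_path() {
  awk '{
    n = length($0)
    print substr($0, n - 5, 2) "/" substr($0, n - 3, 2) "/" substr($0, n - 1, 2) "/" $0
  }'
}

echo "13febd65d65112badd0aa90a15d84032" | cache_rel_path
```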
With sed:
echo '13febd65d65112badd0aa90a15d84032' | \
sed -n 's/\(.*\([0-9a-f]\{2\}\)\([0-9a-f]\{2\}\)\([0-9a-f]\{2\}\)\)$/\2\/\3\/\4\/\1/p;'
With GNU sed you can simplify the pattern further using the -r option, which removes the need to escape {} and (). Using ~ as the regex delimiter lets you write the path separator / without escaping it:
sed -nr 's~(.*([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2}))$~\2/\3/\4/\1~p;'
Output:
d8/40/32/13febd65d65112badd0aa90a15d84032
Put simply, the pattern matches
(whole line (chars n-5 to n-4) (chars n-3 to n-2) (chars n-1 to n))
and replaces it with
\2/\3/\4/\1
You can use a regular expression to separate each of the last 3 bytes from the rest of the hash.
hash=13febd65d65112badd0aa90a15d84032
[[ $hash =~ (..)(..)(..)$ ]]
new_path="/path/to/nginx/cache/${BASH_REMATCH[1]}/${BASH_REMATCH[2]}/${BASH_REMATCH[3]}/$hash"
Base="/path/to/nginx/cache/"
echo '13febd65d65112badd0aa90a15d84032' | \
sed "s|\(.*\(..\)\(..\)\(..\)\)|${Base}\2/\3/\4/\1|"
# or
# sed "s|.*\(..\)\(..\)\(..\)$|${Base}\1/\2/\3/&|"
This assumes the input is a valid MD5 string and nothing else.
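If that assumption needs to be checked first, a small guard could be placed in front of the sed call. The sketch below is one way to do it in plain sh; the name is_md5 is hypothetical, and it accepts only lowercase hex, which is an assumption about the form the hash arrives in.

```shell
# Hypothetical guard: succeed only for a string of exactly 32 lowercase hex digits.
is_md5() {
  [ "${#1}" -eq 32 ] || return 1
  case $1 in
    *[!0-9a-f]*) return 1 ;;   # any character outside 0-9a-f disqualifies it
  esac
  return 0
}

if is_md5 "13febd65d65112badd0aa90a15d84032"; then
  echo "looks like an MD5 hash"
fi
```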
First of all - thanks to all of the responders - this was extremely quick!
I also did my own scripting meantime, and came up with this solution:
Run this script with a parameter of the URL you're looking for (www.example.com/article/76232?q=hello for example)
#!/bin/bash
path=$1
md5=$(echo -n "$path" | md5sum | cut -f1 -d' ')
p3=${md5:0-2:2}
p2=${md5:0-4:2}
p1=${md5:0-6:2}
echo "/path/to/nginx/cache/$p1/$p2/$p3/$md5"
This assumes the NGINX cache has a key structure of 2:2:2.

sed right align a group of text

This question originated from "string pattern-matching using awk". Basically, we split a line of text into multiple groups based on a regex pattern and then print only two of the groups. The question now is: can we right-align a group while printing through sed?
below is an example
$cat input.txt
it is line one
it is longggggggg one
itttttttttt is another one
now
$sed -e 's/\(.*\) \(.*\) \(.*\) \(.*\)/\1 \3/g' input.txt
it splits and prints group 1 and 3, but the output is
it line
it longggggggg
itttttttttt another
My question is: can we do this with sed so that the output comes out with the second group right-aligned, as in
it             line
it      longggggggg
itttttttttt another
I did it with awk, but I feel it can be done with sed. However, I can't work out how to get the length of the second group and then pad with the correct number of spaces between the groups. I am open to any suggestions to try out.
This might work for you (GNU sed):
sed -r 's/^(.*) .* (.*) .*$/\1 \2/;:a;s/^.{1,40}$/ &/;ta;s/^( *)(\S*)/\2\1/' file
or:
sed -r 's/^(.*) .* (.*) .*$/printf "%-20s%20s" \1 \2/e' file
You can use looping in sed to achieve what you want:
#!/bin/bash
echo 'aa bb cc dd
11 22 33333333 44
ONE TWO THREEEEEEEEE FOUR' | \
sed -e 's/\(.*\) \(.*\) \(.*\) \(.*\)/\1 \3/g' \
-e '/\([^ ]*\) \([^ ]*\)/ { :x ; s/^\(.\{1,19\}\) \(.\{1,19\}\)$/\1  \2/g ; tx }'
The two 19's control the width of your columns. The :x is a label which is looped to by tx whenever the preceding substitution succeeded, and each pass inserts one more space between the groups. (You could add a p; before tx to "debug" it.)
It is easiest to use awk in this case...
You could also use a bash loop to calculate the number of spaces and run this command on each line:
while IFS= read -r; do
# ... calculate $SPACES ...
echo "$REPLY" | sed "s/\([^\ ]*\)\ *[^\ ]*\ *\([^\ ]*\)/\1$SPACES\2/g"
done < file
But I prefer to use awk for all of that (or another scripting language such as Perl, Python, PHP in shell mode, ...).
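Since awk keeps coming up as the easier route, here is a minimal sketch of what that could look like. The function name right_align and the default width of 20 are assumptions made for illustration; it prints fields 1 and 3 and pads so that every output line has the same total width, assuming the width is larger than both words combined.

```shell
# Hypothetical helper: print field 1, then field 3 right-aligned so that the
# whole line is exactly "width" characters wide.
right_align() {
  awk -v width="${1:-20}" '{
    # Build a format such as "%s%17s\n": the pad width depends on field 1.
    fmt = "%s%" (width - length($1)) "s\n"
    printf fmt, $1, $3
  }'
}

printf '%s\n' "it is line one" "it is longggggggg one" "itttttttttt is another one" |
  right_align 19
```

Computing the pad width per line in awk avoids the sed loop entirely, at the cost of fixing the total width up front.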
# A run of spaces as wide as the output line (40 here; pick the width you need)
TemplateSpace=$(printf '%40s' '')
TemplateSize=${#TemplateSpace}
sed "
# split your group (based on words here, but this depends on your real need)
s/^ *\(\w\w*\) \(\w\w*\) \(\w\w*\) \(\w\w*\).*$/\1 \3/
# align
s/$/${TemplateSpace}/
s/^\(.\{${TemplateSize}\}\).*$/\1/
s/\(\w\w*\) \(\w\w*\)\( *\)/\1 \3\2/
"
or, more simply, to avoid TemplateSize (provided the content contains no dots), use
TemplateSpace="............................................................."
and replace
s/^\(.\{${TemplateSize}\}\).*$/\1/
by
s/^\(${TemplateSpace}\).*$/\1/
s/\./ /g
Delete columns 2 and 4, then right-justify the resulting column 2 at a line length of 23 characters.
sed -e '
s/[^ ]\+/                       /4;
s/[^ ]\+//2;
s/^\(.\{23\}\).*$/\1/;
s/\(^[^ ]\+[ ]\+\)\([^ ]\+\)\([ ]\+\)/\1\3\2/;
'
or gnu sed with extended regex:
sed -r '
s/\W+\w+\W+(\w+)\W+\w+$/ \1                       /;
s/^(.{23}).*/\1/;
s/(\W+)(\w+)(\W+)$/\1\3\2/
'
This question is old, but I like to see it as a puzzle.
While I love the loop solution for its brevity, here is one without a loop or shell help.
sed -E "s/ \w+ (\w+) \w+$/ \1/;h;s/./ /g;s/$/#                       /;s/( *)#\1//;x;H;x;s/\n//;s/^( *)(\w+)/\2\1/"
or without extended regex
sed "s/ .* \(.*\) .*$/ \1/;h;s/./ /g;s/$/#                       /;s/\( *\)#\1//;x;H;x;s/\n//;s/^\( *\)\([^ ]*\)/\2\1/"

Resources