sed help - convert a string of form ABC_DEF_GHI to AbcDefGhi - shell

How can I convert a string of the form ABC_DEF_GHI to AbcDefGhi using a one-line command such as sed?

Here's a one-liner using gawk:
echo ABC_DEF_GHI | gawk 'function cap(s){return toupper(substr(s,1,1))tolower(substr(s,2))}{n=split($0,x,"_");for(i=1;i<=n;i++)o=o cap(x[i]); print o}'
AbcDefGhi

Optimized awk 1-liner
awk -v RS=_ '{printf "%s%s", substr($0,1,1), tolower(substr($0,2))}'
Optimized sed 1-liner
sed 's/\(.\)\(..\)_\(.\)\(..\)_\(.\)\(..\)/\1\L\2\U\3\L\4\U\5\L\6/'

Edit:
Here's a gawk version:
gawk -F_ '{for (i=1;i<=NF;i++) printf "%s%s",substr($i,1,1),tolower(substr($i,2)); printf "\n"}'
Original:
Using sed for this is pretty scary:
sed -r 'h;s/(^|_)./\n/g;y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/;x;s/((^|_)(.))[^_]*/\3\n/g;G;:a;s/(^.*)([^\n])\n\n(.*)\n([^\n]*)$/\1\n\2\4\3/;ta;s/\n//g'
Here it is broken down:
# make a copy in hold space
h;
# replace all the characters which will remain upper case with newlines
s/(^|_)./\n/g;
# lowercase all the remaining characters
y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/;
# swap the copy into pattern space and the lowercase characters into hold space
x;
# discard all but the characters which will remain upper case
s/((^|_)(.))[^_]*/\3\n/g;
# append the lower case characters to the end of pattern space
G;
# top of the loop
:a;
# shuffle the lower case characters back into their proper positions (see below)
s/(^.*)([^\n])\n\n(.*)\n([^\n]*)$/\1\n\2\4\3/;
# if a replacement was made, branch to the top of the loop
ta;
# remove all the newlines
s/\n//g
Here's how the shuffle works:
At the time it starts, this is what pattern space looks like:
A
D
G
bc
ef
hi
The shuffle loop picks up the string that's between the last newline and the end and moves it to the position before the two consecutive newlines (actually three) and moves the extra newline so it's before the character that it previously followed.
After the first step through the loop, this is what pattern space looks like:
A
D
Ghi
bc
ef
And processing proceeds similarly until there's nothing before the extra newline at which point the match fails and the loop branch is not taken.
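To see the state described above for yourself, the script can be cut off just before the loop and the pattern space dumped with sed's l command, which prints embedded newlines as \n and marks the end with $. This is only a sketch, assuming GNU sed (for -r and for \n in the replacement); show_pattern_space is a hypothetical helper name, not part of the original answer.

```shell
# Hypothetical helper: run the script up to (but not including) the
# shuffle loop, then dump pattern space unambiguously and stop.
show_pattern_space() {
    # h ... G : same commands as in the full script
    # l       : print pattern space with newlines shown as \n
    # d       : discard pattern space so nothing else is printed
    sed -r 'h;s/(^|_)./\n/g;y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/;x;s/((^|_)(.))[^_]*/\3\n/g;G;l;d'
}
```

Piping ABC_DEF_GHI through show_pattern_space should show the three capitals that are kept, then the three consecutive newlines, then the lowercase remainders.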
If you want to title case a sequence of words separated by spaces, the script would be similar:
$ echo 'BEST MOVIE THIS YEAR' | sed -r 'h;s/(^| )./\n/g;y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/;x;s/((^| ).)[^ ]*/\1\n/g;G;:a;s/(^.*)( [^\n]*)\n\n(.*)\n([^\n]*)$/\1\n\2\4\3/;ta;s/^([^\n]*)(.*)\n([^\n]*)$/\1\3\2/;s/\n//g'
Best Movie This Year

One liner using perl:
$ echo 'ABC_DEF_GHI' | perl -pe 's/([A-Z])([^_]+)_?/$1\L$2\E/g;'
AbcDefGhi

This might work for you:
echo "ABC_DEF_GHI" |
sed 'h;s/\(.\)[^_]*\(_\|$\)/\1/g;x;y/'$(printf "%s" {A..Z} / {a..z})'/;G;:a;s/\(\(^[a-z]\)\|_\([a-z]\)\)\([^\n]*\n\)\(.\)/\5\4/;ta;s/\n//'
AbcDefGhi
Or using GNU sed:
echo "ABC_DEF_GHI" | sed 's/\([A-Z]\)\([^_]*\)\(_\|$\)/\1\L\2/g'
AbcDefGhi

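Less scary sed version with tr (note that this only strips the underscores and lowercases everything, so it prints abcdefghi rather than AbcDefGhi):
echo ABC_DEF_GHI | sed -e 's/_//g' - | tr 'A-Z' 'a-z'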
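For a version that does not depend on GNU extensions, the per-field idea from the awk answers can be wrapped in a small function. This is a sketch, assuming only POSIX awk (toupper, tolower, substr); to_camel is a hypothetical name, and it also capitalizes input that arrives in lowercase.

```shell
# Hypothetical helper: convert FOO_BAR_BAZ (or foo_bar_baz) to FooBarBaz.
to_camel() {
    printf '%s\n' "$1" | awk -F_ '{
        out = ""
        for (i = 1; i <= NF; i++)   # one field per underscore-separated word
            out = out toupper(substr($i, 1, 1)) tolower(substr($i, 2))
        print out
    }'
}
```

Call it as to_camel ABC_DEF_GHI. Resetting out for every record avoids carrying text over from one input line to the next.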

Related

POSIX: abcdef to ab bc cd de ef

Using POSIX sed or awk, I would like to duplicate every second character in every pair of neighboring characters and list every newly-formed pair on a new line.
example.txt:
abcd 10001.
Expected result:
ab
bc
cd
d
1
10
00
00
01
1.
So far, this is what I have (N.B. omit "--posix" if on macOS). For some reason, adding a literal newline character before \2 does not produce the expected result. Removing the first group and using \1 has the same effect. What am I missing?
sed --posix -E -e 's/(.)(.)/&\2\
/g' example.txt
abb
cdd
100
000
1..
Try:
$ echo "abcd 10001." | awk '{for(i=1;i<length($0);i++) print substr($0,i,2)}'
ab
bc
cd
d
1
10
00
00
01
1.
You may use
sed --posix -e 's/./&\
&/g' example.txt | sed '1d;$d'
The first sed command finds every char in the string and replaces with the same char, then a newline and then the same char again. Since it replaces first and last chars, the first and last resulting lines must be removed, which is achieved with sed '1d;$d'.
Had sed supported lookarounds, one could have used (?!^).(?!$) (any char but not at the start or end of the string), and the last sed command would not have been necessary, but sed has no lookarounds. You could use the pattern in perl, though: perl -pe 's/(?!^).(?!$)/$&\n$&/g' example.txt ($& in the RHS is the same as the & placeholder in sed: the whole match).
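The two stages of the sed pipeline can be looked at separately to see why the first and last lines have to go. The sketch below assumes a single input line, as in the question; dup_with_newline and trim_ends are hypothetical names chosen for illustration.

```shell
# Stage 1: replace every character c with "c, newline, c".
# The newline is written as backslash + literal newline, which POSIX sed accepts.
dup_with_newline() {
    sed 's/./&\
&/g'
}

# Stage 2: the first and last output lines hold a single leftover
# character each, so delete them.
trim_ends() {
    sed '1d;$d'
}
```

For the input abcd, dup_with_newline alone yields the lines a, ab, bc, cd, d; piping that through trim_ends leaves only the pairs.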
With GNU awk, you could try the following. It was written and tested with the shown samples, and also tested at this link:
https://ideone.com/qahp0S
awk '
BEGIN{
FS=""
}
{
for(i=1;i<=(NF-1);i++){
print $i$(i+1)
}
}
' Input_file
Explanation: the BEGIN section sets the field separator to the empty string, so every character becomes its own field. The main block then loops from the first field to the second-to-last field and prints each field together with the one after it.
Using same routine, it can be done in bash itself:
s='abcd 10001.'
for((i=0; i<${#s}-1; i++)); do echo "${s:i:2}"; done
ab
bc
cd
d
1
10
00
00
01
1.
Just for fun, a single sed consisting of 3 substitutions:
$ echo "abcd 10001." | sed 's/./&&/g;s/\(^.\|.$\)//g;s/../&\n/g'
The first part duplicates all characters, the second part removes the first and last character, the third part adds a newline character after each character-pair.
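If you want to be POSIX compliant you have to do:
$ echo "abcd 10001." | sed -e 's/./&&/g' -e 's/^.//' -e 's/.$//' -e 's/../&\
/g'
Here we had to split the second substitution in two, because the alternation \| in \(^.\|.$\) is a GNU extension and POSIX BREs have no alternation. The newline in the last replacement is written as a backslash followed by a literal newline, since POSIX does not define \n in the replacement text.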
This might work for you (GNU sed):
sed 's/.\(.\)/&\n\1/;/../P;D' file
Replace the first two characters by the first two characters, a newline and the second character.
Print the first line if it is two characters long, delete the first line and repeat.
Alternative, more long winded:
sed -E ':a;s/^(([^\n]{2}\n)*[^\n])([^\n])([^\n])/\1\3\n\3\4/;ta' file
Or, with no hardcoded new line:
sed -E '/.../{G;s/^(.(.))(.*)(.)/\1\4\2\3/;P;D}' file
Lastly:
sed 's/./&\n&/g;s/^..\|..$//g' file
Process substitution isn't specified by POSIX. The POSIX requirement was only specified for awk and sed, so maybe the next solution is acceptable:
paste -d '\0' <(echo; fold -w1 example.txt) <(fold -w1 example.txt) | grep ..
or
while IFS= read -r -n1 ch; do
printf "%s\n%s" "${ch}" "${ch}"
done < example.txt | grep ..
or
sed 's/./&&/g;s/.//' example.txt | grep -o ..

If a line has a length less than a number, append to its previous line

I have a file that looks like this:
ABCDEFGH
ABCDEFGH
ABC
ABCDEFGH
ABCDEFGH
ABCD
ABCDEFGH
Most of the lines have a fixed length of 8. But there are some lines in between that have a length less than 8. I need a simple line of code that appends each of those short lines to its previous line.
I have tried the following code but it takes lots of memory when working with large files.
cat FILENAME | awk 'BEGIN{OFS=FS="\t"}{print length($1), $1}' | tr '\n' '\t' | sed 's/8/\n/g' | awk 'BEGIN{OFS="";FS="\t"}{print $2, $4}'
The output I expect:
ABCDEFGH
ABCDEFGHABC
ABCDEFGH
ABCDEFGHABCD
ABCDEFGH
If perl is your option, please try:
perl -0777 -pe 's/(\n)(.{1,7})$/$2/mg' filename
The -0777 option tells perl to slurp the whole file at once.
The pattern (\n)(.{1,7})$ matches a line shorter than 8 characters together with the newline before it, capturing the newline in $1 and the short line in $2.
The replacement $2 omits that preceding newline, so the short line is appended to the previous line.
sed <FILENAME 'N;/\n.\{8\}/!s/\n//;P;D'
N; - append next line to pattern space
/\n.\{8\}/ - does second line contain 8 characters?
!s/\n//; - no: join the two lines
P - print first line of pattern space
D - delete first line of pattern space, start next cycle
By default, print each line without a trailing newline; when the current line has length 8, print a newline before it so that it starts a new output line.
The first line and the end of the input are special cases.
awk 'NR==1 {printf "%s", $0; next}
length($0)==8 {printf "\n"}
{printf("%s",$0)}
END { printf "\n" }' FILENAME
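The same streaming idea can be written with the line width as a parameter. It keeps only the current line in memory, which matters for the large files mentioned in the question. This is a sketch rather than a drop-in replacement: join_short is a hypothetical name, and it assumes that any line at least as long as the given width starts a new output line.

```shell
# Hypothetical helper: join_short WIDTH < file
# Lines shorter than WIDTH are appended to the preceding output line.
join_short() {
    awk -v w="$1" '
        NR > 1 && length($0) >= w { printf "\n" }   # full line: close the previous output line
        { printf "%s", $0 }                         # print without a newline so short lines can follow
        END { if (NR > 0) printf "\n" }             # terminate the last output line
    '
}
```

For the sample file in the question it would be called as join_short 8 < FILENAME.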
If you have GNU sed 4.2.2 or later (which supports the -z option), you can also try the following, although it is the inferior solution:
sed -rz 's/\n(.{0,7})\n/\1\n/g' FILENAME
If you like old traditional tools, you can use ed, the standard text editor:
printf '%s\n' 'g/^.\{,7\}$/-,.j' wq | ed -s filename

sed pattern parts as input for other bash function

I'm trying to replace floating-point numbers like 1.2e+3 with their integer value 1200. For this I use sed in the following way:
echo '"1.2e+04"' | sed "s/\"\([0-9]\+\.[0-9]\+\)e+\([0-9]\+\)\"/$(echo \1*10^\2|bc -l)/"
but the pattern parts \1 and \2 don't get evaluated in the echo.
Is there a way to solve this problem with sed?
Thanks in advance
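The shell expands $(...) before sed ever runs, so the command substitution cannot see sed's back-references: the unquoted \1 and \2 inside it are interpreted by the shell as literal 1 and 2.
Extra backslashes would keep them intact for echo, but the command still runs before sed has matched anything, so back-references cannot be passed to a command substitution this way.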
If you are using GNU sed, you can instead say something like:
echo '"1.2e+04"' | sed "s/\"\([0-9]\+\.[0-9]\+\)e+\([0-9]\+\)\"/echo \"\\1*10^\\2\"|bc -l/;e"
which yields:
12000.0
If you want to chop off the decimal point, you'll know what to do ;-).
If you are happy with awk, a command like this can do the work:
echo 1.2e+4|awk '{printf "%d",$0}'
It is perhaps better to use perl (or other typed language) to manage the variable types:
echo '"1.2e+04"' | perl -lane 'my $a=$_;$a=~ s/"//g;print sprintf("%.10g",$a);print $a;'
In any case, your sed expression is incorrect, it should be:
echo '"1.2e+04"' | sed "s/\"\([0-9]\+\.[0-9]\+\)e+\([0-9]\+\)\"/$(echo \1*10^\3 + \2*10^$(echo \3 - 1 | bc -l)|bc -l)/"
The best way to solve the problem properly is to combine the solutions from @tshiono and @Romeo:
sed "s/\(.*\)\([0-9]\+\.[0-9]\+e+[0-9]\+\)\(.*\)/printf '\1'\; echo \2 |awk '{printf \"%d\",\$0}'\;printf '\3'\;/e"
So it is possible to convert all such floats into arbitrary contexts.
for example:
echo '"1.2e+04"' | sed "s/\(.*\)\([0-9]\+\.[0-9]\+e+[0-9]\+\)\(.*\)/printf '\1'\; echo \2 |awk '{printf \"%d\",\$0}'\;printf '\3'\;/e"
outputs
"12000"
and
echo 'abc"1.2e+04"def' | sed "s/\(.*\)\([0-9]\+\.[0-9]\+e+[0-9]\+\)\(.*\)/printf '\1'\; echo \2 |awk '{printf \"%d\",\$0}'\;printf '\3'\;/e"
outputs
abc"12000"def
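If GNU sed's e flag is not available, or you would rather not execute a command line assembled from the input, the same in-context replacement can be sketched with POSIX awk alone. expand_floats is a hypothetical name; the sketch assumes the numbers have the form digits.digits e+digits, as in the question, and that the values fit awk's %d conversion.

```shell
# Hypothetical helper: replace every float like 1.2e+04 on stdin
# with its integer value, leaving the surrounding text untouched.
expand_floats() {
    awk '{
        out = ""
        rest = $0
        # Repeatedly find the next float, convert it, and move past it.
        while (match(rest, /[0-9]+\.[0-9]+e\+[0-9]+/)) {
            num = substr(rest, RSTART, RLENGTH)
            out = out substr(rest, 1, RSTART - 1) sprintf("%d", num + 0)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }'
}
```

It is used as a filter, for example echo 'abc"1.2e+04"def' | expand_floats.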

Replacing/removing excess white space between columns in a file

I am trying to parse a file with similar contents:
I am a string 12831928
I am another string 41327318
A set of strings 39842938
Another string 3242342
I want the out file to be tab delimited:
I am a string\t12831928
I am another string\t41327318
A set of strings\t39842938
Another string\t3242342
I have tried the following:
sed 's/\s+/\t/g' filename > outfile
I have also tried cut, and awk.
Just use awk:
$ awk -F'  +' -v OFS='\t' '{sub(/ +$/,""); $1=$1}1' file
I am a string 12831928
I am another string 41327318
A set of strings 39842938
Another string 3242342
Breakdown:
-F'  +' # tell awk that input fields (FS) are separated by 2 or more blanks (the regex is two spaces followed by +)
-v OFS='\t' # tell awk that output fields are separated by tabs
'{sub(/ +$/,""); # remove all trailing blank spaces from the current record (line)
$1=$1} # recompile the current record (line) replacing FSs by OFSs
1' # idiomatic: any true condition invokes the default action of "print"
I highly recommend the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
The difficulty comes in the varying number of words per-line. While you can handle this with awk, a simple script reading each word in a line into an array and then tab-delimiting the last word in each line will work as well:
#!/bin/bash
fn="${1:-/dev/stdin}"
while read -r line || test -n "$line"; do
    arr=( $(echo "$line") )    # split the line into words
    nword=${#arr[@]}           # number of words on this line
    for ((i = 0; i < nword - 1; i++)); do
        test "$i" -eq '0' && word="${arr[i]}" || word=" ${arr[i]}"
        printf "%s" "$word"
    done
    printf "\t%s\n" "${arr[i]}"    # last word, preceded by a tab
done < "$fn"
Example Use/Output
(using your input file)
$ bash rfmttab.sh < dat/tabfile.txt
I am a string 12831928
I am another string 41327318
A set of strings 39842938
Another string 3242342
Each number is tab-delimited from the rest of the string. Look it over and let me know if you have any questions.
sed -E 's/[ ][ ]+/\\t/g' filename > outfile
NOTE: [ ] is an open bracket, a space, and a close bracket.
-E enables extended regular expression support.
The doubled bracket expression [ ][ ]+ ensures that only runs of two or more consecutive spaces are replaced.
Tested on MacOS and Ubuntu versions of sed.
Your input has spaces at the end of each line, which makes things a little more difficult than without. This sed command would replace the spaces before that last column with a tab:
$ sed 's/[[:blank:]]*\([^[:blank:]]*[[:blank:]]*\)$/\t\1/' infile | cat -A
I am a string^I12831928 $
I am another string^I41327318 $
A set of strings^I39842938 $
Another string^I3242342 $
This matches – anchored at the end of the line – blanks, non-blanks and again blanks, zero or more of each. The last column and the optional blanks after it are captured.
The blanks before the last column are then replaced by a single tab, and the rest stays the same – see output piped to cat -A to show explicit line endings and ^I for tab characters.
If there are no blanks at the end of each line, this simplifies to
sed 's/[[:blank:]]*\([^[:blank:]]*\)$/\t\1/' infile
Notice that some seds, notably BSD sed as found in MacOS, can't use \t for tab in a substitution. In that case, you have to use either '$'\t'' or '"$(printf '\t')"' instead.
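One way to sidestep the \t difference is to put a real tab character into a shell variable and splice it into the sed script. The following is a sketch under a few assumptions: last_col_tab is a hypothetical name, the capture group is moved so that trailing blanks are dropped as well, and the script is double-quoted so the variable expands (hence the \$ for the end-of-line anchor).

```shell
# Hypothetical helper: replace the blanks before the last column with
# one tab and strip trailing blanks, without relying on \t in sed.
last_col_tab() {
    tab=$(printf '\t')    # a literal tab character
    sed "s/[[:blank:]]*\([^[:blank:]]*\)[[:blank:]]*\$/${tab}\1/"
}
```

It reads from standard input, for example last_col_tab < infile > outfile.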
another approach, with gnu sed and rev
$ rev file | sed -r 's/ +/\t/1' | rev
You have trailing spaces on each line, so you can apply two sed expressions in one go (the second pattern is two spaces followed by +, so it only matches a run of two or more spaces):
$ sed -E -e 's/ +$//' -e $'s/  +/\t/' /tmp/file
I am a string 12831928
I am another string 41327318
A set of strings 39842938
Another string 3242342
Note the $'s/  +/\t/': This tells bash to replace \t with an actual tab character prior to invoking sed.
To show that these deletions and \t insertions are in the right place you can do:
$ sed -E -e 's/ +$/X/' -e $'s/  +/Y/' /tmp/file
I am a stringY12831928X
I am another stringY41327318X
A set of stringsY39842938X
Another stringY3242342X
Simple and without invisible semantic characters in the code:
perl -lpe 's/\s+$//; s/\s\s+/\t/' filename
Explanation:
Options:
-l: remove LF during processing (in this case)
-p: loop over records (like awk) and print
-e: code follows
Code:
remove trailing whitespace
change two or more whitespace to tab
Tested on OP data. The trailing spaces are removed for consistency.

Bash index of first character not given

So basically something like expr index '0123 some string' '012345789' but reversed.
I want to find the index of the first character that is not one of the given characters...
I'd rather not use RegEx, if it is possible...
You can remove chars with tr and pick the first from what is left
left=$(tr -d "012345789" <<< "0123_some string"); echo ${left:0:1}
_
Once you have the character, find its index the same way:
expr index "0123_some string" "${left:0:1}"
5
Using gnu awk and FPAT you can do this:
str="0123 some string"
awk -v FPAT='[012345789]+' '{print length($1)}' <<< "$str"
4
awk -v FPAT='[02345789]+' '{print length($1)}' <<< "$str"
1
awk -v FPAT='[01345789]+' '{print length($1)}' <<< "$str"
2
awk -v FPAT='[0123 ]+' '{print length($1)}' <<< "$str"
5
I know this is in Perl but I got to say that I like it:
$ perl -pe '$i++while s/^\d//;$_=$i' <<< '0123 some string'
4
In case of 1-based index you can use $. which is initialized at 1 when dealing with single lines:
$ perl -pe '$.++while s/^\d//;$_=$.' <<< '0123 some string'
5
I'm using \d because I assume that you left the number 6 out of the list 012345789 by mistake.
The index currently points to the space:
0123 some string
    ^ this space
Even if shell globbing might look similar, it is not a regex.
It could be done in two steps: cut the string, count characters (length).
#!/bin/dash
a="$1" ### string to process
b='0-9' ### range of characters not desired.
c=${a%%[!$b]*} ### cut the string at the first (not) "$b".
echo "${#c}" ### Print the value of the position index (from 0).
It is written to work on many shells (including bash, of course).
Use as:
$ script.sh "0123_some string"
4
$ script.sh "012s3_some string"
3
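The same parameter expansion can be wrapped in a function that returns a 1-based position, to mirror expr index. This is only a sketch: first_not_in is a hypothetical name, the second argument is assumed to be valid inside a bracket expression (for example 0-9), and returning 0 when every character is allowed is a convention borrowed from expr index.

```shell
# Hypothetical helper: first_not_in STRING SET
# Prints the 1-based index of the first character of STRING that is
# not in SET, or 0 if all characters are in SET.
first_not_in() {
    prefix=${1%%[!$2]*}                 # longest leading run of allowed characters
    if [ "${#prefix}" -eq "${#1}" ]; then
        echo 0                          # nothing was cut off: no foreign character
    else
        echo $(( ${#prefix} + 1 ))      # position just past the allowed prefix
    fi
}
```

For example, first_not_in '0123 some string' '0-9' reports the position of the space.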
