using awk and gensub to remove the part in a string ending with "character+number+S" - shell

My goal is to remove the end "1S" as well as the letter immediately before it, in this case "M". How do I achieve that? My non-working code :
echo "14M3856N61M1S" | gawk '{gensub(/([^(1S)]*)[a-zA-Z](1S$)/, "\\1", "g") ; print $0}'
>14M3856N61M1S
The desired results should be
>14M3856N61
Some additional information: 1. I do not think substr will work here, since my actual target strings come in various lengths. 2. I prefer not to define a special delimiter, because this will be used together with "if" as part of an awk conditional operation, and the delimiter is already defined globally.
Thank you in advance!

Why not use a simple substitution that matches the 1S at the end of the string together with the single character before it?
echo "14M3856N61M1S" | awk '{sub(/[[:alnum:]]{1}1S$/,"")}1'
14M3856N61
Here [[:alnum:]] is the POSIX character class that matches alphanumeric characters (digits and letters), and {1} means exactly one occurrence (that is the default, so it can be omitted). If you are sure that only letters can occur before the 1S, replace [[:alnum:]] with [[:alpha:]].
To answer OP's question about putting the result into a separate variable: use match(), because sub() does not return the substituted string, only the number of substitutions made.
echo "14M3856N61M1S" | awk 'match($0,/[[:alnum:]]{1}1S$/){str=substr($0,1,RSTART-1); print str}'
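To make the difference between the return value and the modified string concrete, here is a minimal sketch that runs sub() on a copy of the line and prints both the count it returns and the resulting string. The function name strip_tail is hypothetical, and [A-Za-z] is used instead of a POSIX class only so that the sketch also runs on older awk implementations.

```shell
# Hypothetical helper: strip a trailing "<letter>1S" and report what sub() returned.
strip_tail() {
  printf '%s\n' "$1" | awk '{
    s = $0                           # work on a copy so $0 stays untouched
    n = sub(/[A-Za-z]1S$/, "", s)    # sub() edits s in place and returns 0 or 1
    print n, s                       # count first, then the edited copy
  }'
}

strip_tail "14M3856N61M1S"   # prints: 1 14M3856N61
strip_tail "14M3856N61M2S"   # prints: 0 14M3856N61M2S
```

The count can double as a condition, for example if (sub(...)) { ... }, which fits the "if" usage mentioned in the question.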

EDIT: As per OP's comment I am adding solutions where OP could get the result into a bash variable too as follows.
var=$(echo "14M3856N61M1S" | awk 'match($0,/[a-zA-Z]1S$/){print substr($0,1,RSTART-1)}' )
echo "$var"
14M3856N61
Could you please try following too.
echo "14M3856N61M1S" | awk 'match($0,/[a-zA-Z]1S$/){$0=substr($0,1,RSTART-1)} 1'
14M3856N61
Explanation of above command:
echo "14M3856N61M1S" | ##Print the sample string with echo and pipe its standard output to awk as standard input.
awk ' ##Start the awk command.
match($0,/[a-zA-Z]1S$/){ ##Use awk's match function to match 1S at the end of the line together with one letter (lower or upper case) before it.
$0=substr($0,1,RSTART-1) ##If a match is found, rebuild the current line from position 1 up to RSTART-1; match sets the built-in variables RSTART and RLENGTH.
} ##Close the match block.
1' ##The 1 prints every line, edited or not.

echo "14M3856N61M1S" | awk -F '.1S$' '{print $1}'
Output:
14M3856N61

Related

Print part of a comma-separated field using AWK

I have a line containing this string:
$DLOAD , 123 , Loadcase name=SUBCASE_1
I am trying to only print SUBCASE_1. Here is my code, but I get a syntax error.
awk -F, '{n=split($3,a,"="); a[n]} {printf(a[1]}' myfile
How can I fix this?
1st solution: If you only want the last field (the value after the =), then with your shown samples please try the following:
awk -F',[[:space:]]+|=' '{print $NF}' Input_file
2nd solution: Or, if you specifically want the 3rd field's value after the =, try the following awk code. It uses a comma followed by space(s) as the field separator, splits the 3rd field on = into the array arr, and prints the 2nd element of arr.
awk -F',[[:space:]]+' '{split($3,arr,"=");print arr[2]}' Input_file
Possibly the shortest solution would be:
awk -F= '{print $NF}' file
Where you simply use '=' as the field-separator and then print the last field.
Example Use/Output
Using your sample input in a heredoc, with the delimiter quoted to prevent expansion of $DLOAD, you would have:
$ awk -F= '{print $NF}' << 'eof'
> $DLOAD , 123 , Loadcase name=SUBCASE_1
> eof
SUBCASE_1
(of course in this case it probably doesn't matter whether $DLOAD was expanded or not, but for completeness, in case $DLOAD included another '=' ...)
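The two approaches only differ when a line contains more than one '='. The sketch below compares them on the sample line and on a made-up line with an extra trailing field; the function names and the note=a field are assumptions added for illustration, and the comma separator is written as ' *, *' so that spaces on both sides of the comma are trimmed.

```shell
# Hypothetical helpers for comparing the two approaches.
last_eq_value() {
  # Split on "=" only and print whatever follows the last "=" on the line.
  printf '%s\n' "$1" | awk -F= '{ print $NF }'
}

third_field_value() {
  # Split on commas (with optional surrounding spaces), then split field 3 on "=".
  printf '%s\n' "$1" | awk -F' *, *' '{ split($3, arr, "="); print arr[2] }'
}

sample='$DLOAD , 123 , Loadcase name=SUBCASE_1'
extra='$DLOAD , 123 , Loadcase name=SUBCASE_1 , note=a'   # made-up line with a later "="

last_eq_value "$sample"       # SUBCASE_1
third_field_value "$sample"   # SUBCASE_1
last_eq_value "$extra"        # a
third_field_value "$extra"    # SUBCASE_1
```

If the value is always the last thing on the line, -F= is the shorter choice; splitting the 3rd field is more robust when other fields may contain '='.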

How to delete certain characters after a pattern using sed or awk?

I have a text file containing a number of lines formatted like below
001_A.wav;112.680;115.211;;;Ja. Hello; Hi:
My goal is to clean whatever comes after ;;;, meaning to delete the following characters: ,;()~?
I know I can do something like sed 's/[,.;()~?,]//g'. However, if I do that, it gives me something like
001_Awav112.680115211Ja Hello Hi
However I would like to delete those character only after ;;; so I would get
001_A.wav;112.680;115.211;;;Ja Hello Hi
How can I accomplish this task?
1st solution: Could you please try the following, written and tested with the shown samples in GNU awk (assuming ;;; occurs only once per line).
awk '
match($0,/.*;;;/){
laterPart=substr($0,RSTART+RLENGTH)
gsub(/[,.:;()~?]/,"",laterPart)
print substr($0,RSTART,RLENGTH) laterPart
}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/.*;;;/){ ##Using match function to match everything up to and including ;;; here.
laterPart=substr($0,RSTART+RLENGTH) ##Creating variable laterPart which has rest of the line apart from matched regex part above.
gsub(/[,.:;()~?]/,"",laterPart) ##Globally substituting ,.:;()~? with NULL in laterPart variable.
print substr($0,RSTART,RLENGTH) laterPart ##Printing sub string of matched regex and laterPart var here.
}' Input_file ##Mentioning Input_file name here.
2nd solution: If you have multiple occurrences of ;;; in a line and want to strip the characters from every field after the 1st occurrence of ;;;, try the following.
awk 'BEGIN{FS=OFS=";;;"} {for(i=2;i<=NF;i++){gsub(/[,.:;()~?,]/,"",$i)}} 1' Input_file
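A possible variation on the 1st solution, as a sketch only: it locates the first ;;; with index() instead of a greedy regex, and it prints lines that contain no ;;; unchanged rather than dropping them. The function name clean_after_marker is hypothetical, and the character set follows the expected output in the question (it includes . and :).

```shell
# Hypothetical helper: strip punctuation only in the part after the first ";;;".
clean_after_marker() {
  awk '{
    p = index($0, ";;;")              # position of the first ;;; (0 if absent)
    if (p) {
      head = substr($0, 1, p + 2)     # everything up to and including ;;;
      tail = substr($0, p + 3)        # the part that should be cleaned
      gsub(/[,.:;()~?]/, "", tail)
      $0 = head tail
    }
    print                             # lines without ;;; pass through untouched
  }' "$@"
}

printf '%s\n' '001_A.wav;112.680;115.211;;;Ja. Hello; Hi:' 'no marker; here.' |
  clean_after_marker
# 001_A.wav;112.680;115.211;;;Ja Hello Hi
# no marker; here.
```

Because "$@" is passed through to awk, the function reads either the files given as arguments or standard input.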
You can use
sed ':a; s/\(;;;[^,.:;()~?,]*\)[,.:;()~?,]/\1/; ta' file > newfile
sed ':a; s/\(;;;[^[:punct:]]*\)[[:punct:]]/\1/; ta' file > newfile
Details
:a sets a label
\(;;;[^,.:;()~?,]*\)[,.:;()~?,] matches and captures into Group 1 a ;;; substring and then any zero or more chars other than ,.:;()~?, chars, and then just matches a char from the ,.:;()~?, set
[^[:punct:]]* matches any 0 or more chars other than punctuation chars
[[:punct:]] matches any punctuation char
\1 is the replacement, the contents of Group 1
ta branches back to a label on a successful replacement.
See the online sed demo:
s='001_A.wav;112.680;115.211;;;Ja. Hello; Hi:'
sed ':a; s/\(;;;[^,.:;()~?,]*\)[,.:;()~?,]/\1/; ta' <<< "$s"
# => 001_A.wav;112.680;115.211;;;Ja Hello Hi
sed ':a; s/\(;;;[^[:punct:]]*\)[[:punct:]]/\1/; ta' <<< "$s"
# => 001_A.wav;112.680;115.211;;;Ja Hello Hi
Didn't read your question correctly, but I've changed it now.
I suggest using perl instead, since it supports lookahead groups.
$ perl -pe 's/^((?:(?!;;;).)*;;;)|[:,.;\(\)~\?,]/\1/g' file.txt
^ is the beginning of the line.
((?:(?!;;;).)*;;;) is the multi-character equivalent of [^;]*: it consumes everything up to and including the first ;;; and captures it in group 1.
|[:,\.;\(\)~\?,] alternatively matches one of the characters :,.;()~? and, since group 1 is empty for that alternative, replaces it with nothing (thus leaving "Ja" without the dot).
You can use the combination of some sed commands with
echo '001_A.wav;112.680;115.211;;;Ja. Hello; Hi:' |
sed 's/;;;/;;;\n\r/' |
sed '/^\r/ s/[,;():~?]//g' |
sed -z 's/;;;\n\r/;;;/g'
Different GNU AWK-solution:
echo "001_A.wav;112.680;115.211;;;Ja. Hello; Hi:" | awk 'BEGIN{FS=OFS=";;;"}{print $1,gensub(/[,;()~?]/,"","g",substr($0,length($1)+1))}'
output:
001_A.wav;112.680;115.211;;;Ja. Hello Hi:
This assumes your description has precedence over the example (only ,;()~? will be removed). Explanation: I use ;;; as the field separator and output separator, then I print the 1st column (what is before ;;;) and get the rest by taking the substring that starts at the length of the 1st column plus 1, then remove all specified characters from that part and print it.
If example has precedence over description then you might use [[:punct:]] set of characters, namely:
echo "001_A.wav;112.680;115.211;;;Ja. Hello; Hi:" | awk 'BEGIN{FS=OFS=";;;"}{print $1,gensub(/[[:punct:]]/,"","g",substr($0,length($1)+1))}'
will give
001_A.wav;112.680;115.211;;;Ja Hello Hi

Efficient coding to count capital characters in file

I want to count all the capital characters A-Z in a file.
I take the file as an argument, then I search the whole file for each letter and sum the results. My code is working fine, but is there another way to make it more efficient, without using a loop?
sum=0
for var in {A..Z}
do
foo="$(grep -o $var "$1"| wc -l)"
sum=$((sum+foo))
done
I tried to do it like this, but it gives me wrong results because it counts every character, including spaces and line endings.
cat "$1" | wc -m
You can do it with a single grep command similar to what you're already doing:
grep -o "[A-Z]" "$1" | wc -l
We can avoid chaining multiple programs to count capital letters in a file: this can be done with a single awk, which saves a few cycles and should be faster too.
Could you please try the following.
awk '
{
count+=gsub(/[A-Z]/,"&")
}
END{
print "Total number of capital letters in file are: " count
}
' Input_file
If you want to run it as a script that takes Input_file as an argument, change Input_file to "$1".
Explanation: the annotated version below is for explanation only, not for running.
awk ' ##Start the awk program.
{
count+=gsub(/[A-Z]/,"&") ##Add to the variable count the number of substitutions gsub makes on the current line.
##gsub is awk's built-in global substitution function; it returns the number of substitutions made.
##Each capital letter is replaced with itself (&), so the line is unchanged and only the count matters.
}
END{ ##The END block runs once all of Input_file has been read.
print "Total number of capital letters in file are: " count ##Print the total number of capital letters held in count.
}
' Input_file ##Input_file name goes here.
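As a sketch of how the awk counter could be wrapped for scripting: the function name count_capitals is hypothetical, and LC_ALL=C is an added assumption, since depending on the locale and the tool version a range such as [A-Z] may not be limited to the 26 ASCII capitals. Printing count + 0 makes empty input report 0 instead of an empty line.

```shell
# Hypothetical wrapper: count ASCII capital letters in files or on stdin.
count_capitals() {
  # gsub() replaces each capital with itself and returns how many it replaced.
  LC_ALL=C awk '{ count += gsub(/[A-Z]/, "&") } END { print count + 0 }' "$@"
}

printf 'Hello World\nABC def\n' | count_capitals   # prints: 5
```

Inside a script that receives the file name as its first argument, it would be called as count_capitals "$1".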

change date format from DD/MM/YYYY to YYYY-MM-DD with sed

I'm doing:
sed -e 's|\([0-9][0-9]\)/\([0-2][0-9]\)/\([0-9][0-9][0-9][0-9]\)|\3-\2-\1|g'
but when I run the program I'm getting this error:
"sed: -e expression #4, char 62: invalid reference \3 on `s' command's RHS"
This error generally occurs when capture groups have not been escaped properly.
That said, the parens for the capture groups are escaped in your command, so perhaps you ran it with the Extended Regular Expressions flag (sed -r or sed -E), in which case they must not be escaped.
Note that, for readability, you can combine character ranges with numerical quantifiers:
sed -E "s|([0-9]{2})/([0-9]{2})/([0-9]{4})|\3-\2-\1|" file
It's because of the parens: you need to add \ to make them work as capture groups:
echo 03/11/2018|sed -e 's|\([0-9][0-9]\)/\([0-2][0-9]\)/\([0-9][0-9][0-9][0-9]\)|\3-\2-\1|g'
2018-11-03
In some sed versions, like GNU sed, you can add the -E or -r switch, and then the escaping works the opposite way:
echo 03/11/2018|sed -E 's|([0-9][0-9])/([0-2][0-9])/([0-9][0-9][0-9][0-9])|\3-\2-\1|g'
2018-11-03
By default, you need to use \(...\) to capture a group, and () matches parens literally.
With the -E or -r switch, however, (...) captures groups and \(\) matches parens literally.
Btw, the same applies to {}: \{n\} is a quantifier by default, while {n} is a quantifier with -E or -r.
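The two escaping conventions can be put side by side. This is a small sketch with hypothetical function names (to_iso_bre, to_iso_ere); both read from standard input and should produce the same result, and the month pattern is widened to two arbitrary digits for simplicity.

```shell
# Basic regular expressions (default): groups and intervals need backslashes.
to_iso_bre() {
  sed 's|\([0-9]\{2\}\)/\([0-9]\{2\}\)/\([0-9]\{4\}\)|\3-\2-\1|g'
}

# Extended regular expressions (-E): groups and intervals are written bare.
to_iso_ere() {
  sed -E 's|([0-9]{2})/([0-9]{2})/([0-9]{4})|\3-\2-\1|g'
}

echo '03/11/2018' | to_iso_bre   # 2018-11-03
echo '03/11/2018' | to_iso_ere   # 2018-11-03
```

Mixing the two, for example -E together with \( \), leaves the pattern without any capture groups, which is what triggers the "invalid reference \3" error.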
Could you please try following.
echo "xxx 03/11/2018" |
awk '
match($0,/[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9]/){
if(RSTART!=1){
val=substr($0,1,RSTART-1)
}
val=val substr($0,RSTART+6,4)"/"substr($0,RSTART+3,2)"/"substr($0,RSTART,2)
print val substr($0,RSTART+RLENGTH)
}'
Explanation: Adding explanation for above code now.
echo "xxx 03/11/2018" | ##Print the sample string with echo and pipe its output to awk.
awk ' ##Start the awk program.
match($0,/[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9]/){ ##Use the match function to match the regex 2 digits / 2 digits / 4 digits.
if(RSTART!=1){ ##If the match does not start at position 1, do the following.
val=substr($0,1,RSTART-1) ##Set val to the substring from 1 to RSTART-1; RSTART and RLENGTH are variables that awk sets when the match function finds a match.
} ##Close the if block.
val=val substr($0,RSTART+6,4)"/"substr($0,RSTART+3,2)"/"substr($0,RSTART,2) ##Append the date to val in YYYY/MM/DD order by rearranging substrings of the match.
print val substr($0,RSTART+RLENGTH) ##Print val followed by the rest of the line after the match (from RSTART+RLENGTH to the end).
}' ##Close the match block.
This also preserves any text before and after the matching date. In the example above I added xxx, and it appears before the date as follows.
xxx 2018/11/03
Or, if you want to print the other lines along with the YYYY/MM/DD lines, try the following.
echo "xxx 03/11/2018" |
awk '
match($0,/[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9]/){
if(RSTART!=1){
val=substr($0,1,RSTART-1)
}
val=val substr($0,RSTART+6,4)"/"substr($0,RSTART+3,2)"/"substr($0,RSTART,2)
print val substr($0,RSTART+RLENGTH)
next
}
1 '
NOTE: Since my awk version is old, I am using match($0,/[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9]/), but I believe you could use match($0,/[0-9]{2}\/[0-9]{2}\/[0-9]{4}/) if your awk is a newer version.
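For the hyphen-separated YYYY-MM-DD form the question asks for, the same match()/substr() idea can be put in a loop so that several dates on one line are all converted. This is only a sketch: the function name dates_to_iso is hypothetical, it assumes every DD/MM/YYYY-shaped run of digits is a date, and lines without a date are printed unchanged.

```shell
# Hypothetical helper: rewrite every DD/MM/YYYY on a line as YYYY-MM-DD.
dates_to_iso() {
  awk '{
    out = ""
    while (match($0, /[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9]/)) {
      d = substr($0, RSTART, RLENGTH)                    # the matched DD/MM/YYYY
      out = out substr($0, 1, RSTART - 1) \
                substr(d, 7, 4) "-" substr(d, 4, 2) "-" substr(d, 1, 2)
      $0 = substr($0, RSTART + RLENGTH)                  # continue after the match
    }
    print out $0                                         # append whatever is left
  }' "$@"
}

echo 'from 03/11/2018 to 25/12/2018' | dates_to_iso
# from 2018-11-03 to 2018-12-25
```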

print first 3 characters and / rest of the string with stars

I have input like this:
John:boofoo
I want to keep only the first 3 characters after the colon and print the rest of the string as stars.
The output should look like this:
John:boo***
This is my command:
awk -F ":" '{print $1,$2 ":***"}'
I want to use only the print command if possible. Thanks
With GNU sed:
echo 'John:boofoo' | sed -E 's/(:...).*/\1***/'
Output:
John:boo***
With GNU awk for gensub():
$ awk 'BEGIN{FS=OFS=":"} {print $1, substr($2,1,3) gensub(/./,"*","g",substr($2,4))}' file
John:boo***
With any awk:
awk 'BEGIN{FS=OFS=":"} {tl=substr($2,4); gsub(/./,"*",tl); print $1, substr($2,1,3) tl}' file
John:boo***
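The portable variant generalizes easily if the number of visible characters should be configurable. A minimal sketch, assuming a hypothetical function mask_tail that takes the number of characters to keep as an optional first argument (default 3) and reads name:value lines from standard input:

```shell
# Hypothetical helper: keep the first N characters of field 2, star out the rest.
mask_tail() {
  awk -v keep="${1:-3}" 'BEGIN { FS = OFS = ":" } {
    tl = substr($2, keep + 1)      # the part of field 2 to hide
    gsub(/./, "*", tl)             # one star per hidden character
    print $1, substr($2, 1, keep) tl
  }'
}

echo 'John:boofoo' | mask_tail      # John:boo***
echo 'John:boofoo' | mask_tail 1    # John:b*****
echo 'John:bo'     | mask_tail      # John:bo
```

A value shorter than the kept length is printed as is, with no stars added.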
Could you please try the following. It keeps the first 3 characters of the 2nd field as they are and prints one star for each character that follows them.
awk '
BEGIN{
FS=OFS=":"
}
{
stars=""
val=substr($2,1,3)
for(i=4;i<=length($2);i++){
stars=stars"*"
}
$2=val stars
}
1
' Input_file
Output will be as follows.
John:boo***
Explanation: Adding explanation for above code too here.
awk '
BEGIN{ ##Starting BEGIN section from here.
FS=OFS=":" ##Setting FS and OFS value as : here.
} ##Closing block of BEGIN section here.
{ ##Here starts main block of awk program.
stars="" ##Nullifying variable stars here.
val=substr($2,1,3) ##Creating variable val whose value is 1st 3 letters of 2nd field.
for(i=4;i<=length($2);i++){ ##Starting a for loop from 4 (because we need the 4th character through the last one in the 2nd field) up to the length of the 2nd field.
stars=stars"*" ##Keep concatenating stars variable to its own value with *.
}
$2=val stars ##Assigning value of variable val and stars to 2nd field here.
}
1 ##Mentioning 1 here to print edited/non-edited lines for Input_file here.
' Input_file ##Mentioning Input_file name here.
Or even with good old sed
$ echo "John:boofoo" | sed 's/...$/***/'
Output:
John:boo***
(note: this just replaces the last 3 characters of any string with "***", so if you need to key off the ':', see the GNU sed answer from Cyrus.)
Another awk variant:
awk -F ":" '{print $1 FS substr($2, 1, 3) "***"}' <<< 'John:boofoo'
John:boo***
Since we have the tags awk, bash and sed: for completeness sake here is a bash only solution:
INPUT="John:boofoo"
printf "%s:%s\n" ${INPUT%%:*} $(TMP1=${INPUT#*:};TMP2=${TMP1:3}; echo "${TMP1:0:3}${TMP2//?/*}")
It passes two arguments to printf after the format string. The first one is INPUT with the : and everything after it stripped. Let's break down the second argument $(TMP1=${INPUT#*:};TMP2=${TMP1:3}; echo "${TMP1:0:3}${TMP2//?/*}"):
$(...) the string is run as a bash command and its output is substituted as the last argument to printf
TMP1=${INPUT#*:}; removes everything up to and including the :, and stores the result in TMP1.
TMP2=${TMP1:3}; gets all characters of TMP1 from offset 3 to the end and stores them in TMP2.
echo "${TMP1:0:3}${TMP2//?/*}" outputs the temporary strings: the first three chars of TMP1 unmodified and all chars of TMP2 as *
the output of this echo is the last argument to printf
Here is the bash -x output:
+ INPUT=John:boofoo
++ TMP1=boofoo
++ TMP2=foo
++ echo 'boo***'
+ printf '%s:%s\n' John 'boo***'
John:boo***
Another sed : replace all chars after the third by *
sed -E ':A;s/([^:]*:...)(.*)[^*]([*]*)/\1\2\3*/;tA'
Some more awk
awk 'BEGIN{FS=OFS=":"}{s=sprintf("%0*d",length(substr($2,4)),0); gsub(/0/,"*",s);print $1,substr($2,1,3) s}' infile
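You can use the %* form of printf, which takes the field width from an argument. If you print the value 0 right-aligned with zero padding on the left, you get a string of zeros of the desired length, which gsub then turns into stars. Note that a width of 0 still prints a single 0, so a second field of three characters or fewer gets one extra star.
More readable: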
awk 'BEGIN{
FS=OFS=":"
}
{
s=sprintf("%0*d",length(substr($2,4)),0);
gsub(/0/,"*",s);
print $1,substr($2,1,3) s
}
' infile
Test Results:
$ awk --version
GNU Awk 3.1.7
Copyright (C) 1989, 1991-2009 Free Software Foundation.
$ cat f
John:boofoo
$ awk 'BEGIN{FS=OFS=":"}{s=sprintf("%0*d",length(substr($2,4)),0); gsub(/0/,"*",s);print $1,substr($2,1,3) s}' f
John:boo***
Another pure Bash, using the builtin regular expression predicate.
input="John:boofoo"
if [[ $input =~ ^([^:]*:...)(.*)$ ]]; then
printf '%s%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]//?/*}"
else
echo >&2 "String doesn't match pattern"
fi
We split the string in two parts: the first part being everything up to (and including) the three chars found after the first colon (stored in ${BASH_REMATCH[1]}), the second part being the remaining part of string (stored in ${BASH_REMATCH[2]}). If the string doesn't match this pattern, we just insult the user.
We then print the first part unchanged, and the second part with every character replaced with *.
