Feed literal string bash variable to awk and gsub - bash

I want to edit a column in a text file by feeding a bash variable containing a literal string to awk and gsub.
I have tried various versions of the command below. It works for a variable that does not contain any special characters, but not for one that needs to be interpreted as a literal string.
#create initial file
echo -e "SOD1:c.112G>A(p.[G38R])"'\t'"SOD1:c.112G>A(p.[G38R]);NA" > testfile
#set variable
var="SOD1:c.112G>A(p.[G38R])"
#test awk
more testfile | awk -F '\t' -v OFS='\t' -v var="${var}" '{gsub(var,"",$2)}1'
I want to delete the variable only in the second column not in the first.
Thanks in advance for your help

You can just put your var definition and your awk command in one line like this:
var='SOD1:c.112G>A\(p.\[G38R\]\)'; awk -F '\t' -v OFS='\t' -v var="$var" '{gsub(var,"",$2)}1' testfile
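If you'd rather not hand-escape the pattern, another option (just a sketch, untested) is to escape the regex metacharacters inside awk itself, so the shell variable stays a plain literal string:
var="SOD1:c.112G>A(p.[G38R])"
awk -F '\t' -v OFS='\t' -v var="${var}" '
  BEGIN { gsub(/[][^$.*?+\\()|{}]/, "\\\\&", var) }  # backslash-escape every ERE metacharacter in var
  { gsub(var, "", $2) } 1' testfile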

Related

awk match by variable with dot in it

I have a script that will iterate over a file containing domains (google.com, youtube.com, etc). The purpose of the script is to check how many times each domain is included in the 12th column of a tab separated value file.
while read domain; do
awk -F '\t' '$12 == '$domain'' data.txt | wc -l
done < domains.txt
However awk seems to be interpreting the dots in the domains as a special character. The following error message is shown:
awk: syntax error at source line 1
context is
$12 ~ >>> google. <<< com
awk: bailing out at source line 1
I am a beginner in bash so any help would be greatly appreciated!
When you write:
domain='google.com'
awk -F '\t' '$12 == '$domain'' data.txt
the $domain is outside of any quotes:
awk -F '\t' '$12 == '$domain'' data.txt
            ^       ^       ^^
            start   end     start/end
and so it is exposed to the shell for interpretation first, and THEN the result becomes part of the body of the awk script before awk sees it. So what awk sees is:
awk -F '\t' '$12 == google.com' data.txt
and google.com is not a valid symbol name (e.g. a variable or function), nor a string, nor a number. What you MEANT to do was:
awk -F '\t' '$12 == "'"$domain"'"' data.txt
so the shell would see "$domain" instead of just $domain (see https://mywiki.wooledge.org/Quotes for why that's important) and awk would finally see:
awk -F '\t' '$12 == "google.com"' data.txt
which is fine, as "google.com" is now a string, not a symbol. BUT you should never allow shell variables to expand to become part of an awk script, as there are other caveats, so what you should really have done is:
awk -F '\t' -v dom="$domain" '$12 == dom' data.txt
See How do I use shell variables in an awk script? for more information.
By the way, even after fixing the above problem do not do this:
while read domain; do
awk -F '\t' -v dom="$domain" '$12 == dom' data.txt | wc -l
done < domains.txt
as it'll be immensely slow and contains insidious bugs (see why-is-using-a-shell-loop-to-process-text-considered-bad-practice). Do something like this instead (untested):
awk -F'\t' '
NR==FNR {
    cnt[$1] = 0
    next
}
$12 in cnt {
    cnt[$12]++
}
END {
    for ( dom in cnt ) {
        print dom, cnt[dom]
    }
}
' domains.txt data.txt
That will be far more efficient, robust, and portable than calling awk inside a shell read loop.
See What are NR and FNR and what does "NR==FNR" imply? for how that awk script works. Get the book Effective AWK Programming, 5th Edition, by Arnold Robbins to learn awk.
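As a quick illustration of the NR==FNR idiom (with hypothetical files a.txt and b.txt), the condition is true only while awk is reading its first file argument:
awk 'NR==FNR { print "first file:", $0; next } { print "second file:", $0 }' a.txt b.txt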
awk -F '\t' '$12 == '$domain'' data.txt | wc -l
The single quotes are building an awk program. They are not something visible to awk. So awk sees this:
$12 == google.com
Since there aren't any quotes around google.com, that is a syntax error. You just need to add quotation marks.
awk -F '\t' '$12 == "'"$domain"'"' data.txt
The quotes jammed together like that are a little confusing, but it's just this:
'....' stuff to send to awk. Single quotes are for the shell.
'..."...' a double quote inside the awk program for awk to see
'...'"..." stuff in double quotes _outside_ the awk program for the shell
We can combine those like this:
'..."'"$var"'"...'
That's a bunch of literal awk code ending in a double quote, followed by the expansion of the shell parameter var, which is double-quoted as usual in the shell for safety, followed by more literal awk code starting with a double quote. So the end result is a string passed to awk that includes the value of var inside double quotes.
But you don't have to be so fancy or confusing since awk provides the -v option to set variables from the shell:
awk -v domain="$domain" '$12 == domain' data.txt
Since the domain is not quoted inside the awk code, it is interpreted as the name of a variable. (Periods are not legal in variable names, which is why you got a syntax error with your domains; if you hadn't, though, awk would have treated them as empty and been looking for lines whose twelfth field was likewise blank.)
Use a combination of cut to print the 12th column of the TAB-delimited file, sort and uniq to count the items:
cut -f12 data.txt | sort | uniq -c
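If you only want counts for the domains listed in domains.txt, you can filter the column before counting (a sketch; -x makes grep match whole lines, -F treats the domains as fixed strings):
cut -f12 data.txt | grep -Fxf domains.txt | sort | uniq -c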
This should give the count of how many lines of the input have "google.com" in $12:
{m,g}awk -v __="${domain}" '
BEGIN { _*=\
( _ ="\t[^\t]*")*gsub(".",(_)_,_)*sub(".","",_)*\
gsub("[.:&=/-]","[&]",__)*sub("[[][^[]+$",__"\t?",_)*(\
FS=_ } { _+=NF } END { print _-NR }'

grep text after keyword with unknown spaces and remove comments

I am having trouble saving variables from a file using grep/sed/awk.
The text in file.txt is of the form:
NUM_ITER = 1000 # Number of iterations
NUM_STEP = 1000
And I would like to save these to bash variables without the comments.
So far, I have attempted this:
grep -oP "^NUM_ITER[ ]*=\K.*#" file.txt
which yields
1000 #
Any suggestions?
I would use awk, like this:
awk -F'[=[:blank:]#]+' '$1 == "NUM_ITER" {print $2}' file
To store it in a variable:
NUM_ITER=$(awk -F'[=[:blank:]#]+' '$1 == "NUM_ITER" {print $2}' file)
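If you need both values, a single awk pass can collect them; this is just a sketch using bash process substitution and assuming each keyword appears at most once in file:
read NUM_ITER NUM_STEP < <(awk -F'[=[:blank:]#]+' '
  $1 == "NUM_ITER" { i = $2 }
  $1 == "NUM_STEP" { s = $2 }
  END { print i, s }' file)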
As long as a line can only contain a single match, this is easy with sed.
sed -n '# Remove comments
s/[ ]*#.*//
# If keyword found, remove keyword and print value
s/^NUM_ITER[ ]*=[ ]*//p' file.txt
This can be trimmed down to a one-liner if you remove the comments.
sed -n 's/[ ]*#.*//;s/^NUM_ITER[ ]*=[ ]*//p' file.txt
The -n option turns off automatic printing, and the p flag after the final substitution prints the line only if that substitution was successful.
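The same one-liner drops straight into command substitution if you want the values in bash variables (a sketch following the pattern above):
NUM_ITER=$(sed -n 's/[ ]*#.*//;s/^NUM_ITER[ ]*=[ ]*//p' file.txt)
NUM_STEP=$(sed -n 's/[ ]*#.*//;s/^NUM_STEP[ ]*=[ ]*//p' file.txt)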

awk: Preserve multiple field separators

I'm using awk to swap fields in a filename using two different field separators.
I want to know if it's possible to preserve both separators, '/' and '_', in the correct positions in the output.
Example:
I want to change this:
/path/to/example_file_123.txt
into this:
/path/to/file_example_123.txt
I've tried:
awk -F "[/_]" '{ t=$3; $3=$4; $4=t;print}' file.txt
but the field separators are missing from the output:
path to file example 123.txt
I've tried preserving the field separators:
awk -F "[/_]" '{t=$3; $3=$4; $4=t; OFS=FS; print}' file.txt
but I get this:
[/_]path[/_]to[/_]file[/_]example[/_]123.txt
Is there a way of preserving the correct original field separator in awk when you're dealing with multiple separators?
Here is one solution:
awk -F/ '{n=split($NF,a,"_");b=a[1];a[1]=a[2];a[2]=b;$NF=a[1];for (i=2;i<=n;i++) $NF=$NF"_"a[i]}1' OFS=/ file
/path/to/file_example_123.txt
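If you have GNU awk, the fourth argument of split() captures the separators themselves, so you can swap the two fields and rebuild the line with every original '/' and '_' in place (a sketch, untested, assuming gawk 4.0 or later):
gawk '{
  n = split($0, f, "[/_]", seps)            # seps[i] is the separator that followed f[i]
  t = f[n-2]; f[n-2] = f[n-1]; f[n-1] = t   # swap the two parts before the trailing number
  out = f[1]
  for (i = 2; i <= n; i++) out = out seps[i-1] f[i]
  print out
}' file.txt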
You can always use Perl.
Given:
$ echo $e
/path/to/example_file_123.txt
Then:
$ echo $e | perl -ple 's/([^_\/]+)_([^_\/]+)/\2_\1/'
/path/to/file_example_123.txt
$ cat /tmp/1
/path/to/example_file_123.txt
/path/to/example_file_345.txt
$ awk -F'_' '{split($1,a,".*/"); gsub(a[2],"",$1);print $1$2"_"a[2]"_"$3}' /tmp/1
/path/to/file_example_123.txt
/path/to/file_example_345.txt

How to retrieve digits including the separator "."

I am using grep to get a string like this: ANS_LENGTH=266.50, then I use sed to get only the digits: 266.50
This is my full command: grep --text 'ANS_LENGTH=' log.txt | sed -e 's/[^[[:digit:]]]*//g'
The result is: 26650
How can this line be changed so the result still shows the separator: 266.50
You don't need grep if you are going to use sed. Just use sed's // address to match the lines you need to print.
sed -n '/ANS_LENGTH/s/[^=]*=\(.*\)/\1/p' log.txt
-n suppresses the automatic printing of every input line, so nothing is printed unless we ask for it.
Using a captured group we print the value after the = sign.
The p flag at the end prints the line only when our // address matched and the substitution succeeded.
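To store the result in a shell variable, wrap the same command in command substitution (assuming log.txt contains a single ANS_LENGTH line):
ANS_LENGTH=$(sed -n '/ANS_LENGTH/s/[^=]*=\(.*\)/\1/p' log.txt)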
If your grep happens to support -P option then you can do:
grep -oP '(?<=ANS_LENGTH=).*' log.txt
(?<=...) is a look-behind construct: the match succeeds only when it is immediately preceded by ANS_LENGTH=, and that prefix is not included in the output. This requires the -P option.
-o allows us to print only the value part.
You need to match a literal dot as well as the digits.
Try sed -e 's/[^[:digit:].]*//g'
Outside a bracket expression a dot matches any single character and would need a backslash to be literal; inside the bracket expression here it is already literal, so it only needs to be added to the set of characters to keep.
Here are some awk examples:
cat file:
some data ANS_LENGTH=266.50 other=22
not mye data=43
gnu awk (due to RS)
awk '/ANS_LENGTH/ {f=NR} f&&NR-1==f' RS="[ =]" file
266.50
awk '/ANS_LENGTH/ {getline;print}' RS="[ =]" file
266.50
Plain awk
awk -F"[ =]" '{for(i=1;i<=NF;i++) if ($i=="ANS_LENGTH") print $(i+1)}' file
266.50
awk '{for(i=1;i<=NF;i++) if ($i~"ANS_LENGTH") {split($i,a,"=");print a[2]}}' file
266.50

How to execute awk command in shell script

I have an awk command that extracts the 16th column from the 3rd line in a csv file and prints the first 4 characters.
awk -F"," 'NR==3{print $16}' sample.csv|sed -e 's/^[ \t]*//'|awk '{print substr($0,0,4)}'
This works fine.
But when I execute it from a shell script, I get an error.
#!/bin/ksh
YEAR=awk -F"," 'NR==3{print $16}' sample.csv|sed -e 's/^[ \t]*//'|awk '{print substr($0,0,4)}'
Error message:
-F,: not found
Use command substitution to assign the output of a command to a variable, as shown below:
YEAR=$(awk -F"," 'NR==3{print $16}' sample.csv|sed -e 's/^[ \t]*//'|awk '{print substr($0,0,4)}')
You are asking the shell to do:
VAR=value command [arguments...]
which means: launch command, but with the variable VAR set to value in its environment first.
(For example, LC_ALL=C grep '[0-9]*' /some/file.txt greps for a number in file.txt with the LC_ALL variable set to C just for the duration of that grep call.)
So here you are asking the shell to launch the -F"," command (i.e. -F, once the shell has removed the quotes from ","), with arguments 'NR==3.........., and with the variable YEAR set to the value awk for the duration of that command.
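A quick way to see the difference between the two forms (illustrative only):
FOO=bar env | grep '^FOO='    # runs env with FOO=bar in its environment; FOO is not set in this shell
FOO=$(echo bar); echo "$FOO"  # command substitution: FOO now holds the output "bar"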
Just replace it with:
#!/bin/ksh
YEAR="$(awk -F',' 'NR==3{print $16}' sample.csv|sed -e 's/^[ \t]*//'|awk '{print substr($0,1,4)}')"
(I didn't try it, but I hope it works for you and your sample.csv file.)
(Note that your original command uses "0" as the starting position for substr(), which many awk implementations treat as 1, but not all; using 1 is the portable choice.)
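A quick check of awk's 1-based string indexing:
echo abcdef | awk '{print substr($0,1,4)}'   # prints "abcd"; with a start of 0 the result varies by implementation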
From your description, it looks like you want to extract the year from the 16th field, which might contain leading spaces. You can accomplish it by calling AWK once:
YEAR=$(awk -F, 'NR==3{sub(/^[ \t]*/, "", $16); print substr($16,1,4)}' sample.csv)
Better yet, you don't even have to use awk. Since you are already writing shell script, let's do it all in shell script:
{ read line; read line; read line; } < sample.csv  # Get the third line
oldIFS=$IFS
IFS=,                     # Split on commas
set -- $line              # Breaks line into comma-separated fields
IFS=$oldIFS               # Restore default whitespace splitting
set -- ${16}              # Trick to remove leading spaces: field 16 becomes field 1
YEAR=${1:0:4}             # Extract the first 4 chars from field 1
Do this:
year=$(awk -F, 'NR==3{sub(/^[ \t]+/,"",$16); print substr($16,1,4); exit }' sample.csv)
