Parsing iw wlan0 scan output - bash

I wrote a WLAN manager script to handle open/ad-hoc/WEP/WPA2 networks. Now I'm trying to parse iw wlan0 scan output to add a nice scan feature to my script. My goal is to get output like this:
SSID channel signal encryption
wlan-ap 6 70% wpa2-psk
test 1 55% wep
What I have achieved so far is output like this:
$ iw wlan0 scan | grep 'SSID\|freq\|signal\|capability' | tac
SSID: Koti783
signal: -82.00 dBm
capability: ESS Privacy ShortPreamble SpectrumMgmt ShortSlotTime (0x0531)
freq: 2437
I have been studying bash/sed/awk but haven't yet found a way to achieve what I'm after. So what is a good way to do that?

Here is my final solution, based on Sudo_O's answer:
$1 == "BSS" {
MAC = $2
wifi[MAC]["enc"] = "Open"
}
$1 == "SSID:" {
wifi[MAC]["SSID"] = $2
}
$1 == "freq:" {
wifi[MAC]["freq"] = $NF
}
$1 == "signal:" {
wifi[MAC]["sig"] = $2 " " $3
}
$1 == "WPA:" {
wifi[MAC]["enc"] = "WPA"
}
$1 == "WEP:" {
wifi[MAC]["enc"] = "WEP"
}
END {
printf "%s\t\t%s\t%s\t\t%s\n","SSID","Frequency","Signal","Encryption"
for (w in wifi) {
printf "%s\t\t%s\t\t%s\t%s\n",wifi[w]["SSID"],wifi[w]["freq"],wifi[w]["sig"],wifi[w]["enc"]
}
}
Output:
$ sudo iw wlan0 scan | awk -f scan.awk
SSID Frequency Signal Encryption
netti 2437 -31.00 dBm Open
Koti783 2437 -84.00 dBm WPA
WLAN-AP 2462 -85.00 dBm WPA

It's generally bad practice to try to parse the complex output of programs intended for humans to read (rather than for machines to parse).
For example, the output of iw might change depending on the language settings of the system and/or the version of iw, leaving you with a "manager" that only works on your development machine.
Instead, you might use the same interface that iw uses to get its information: the library backend libnl.
You might also want to have a look at the wireless-tools (iwconfig, iwlist, ...), which use the libiw library.

Here is a GNU awk script to get you going; it grabs the SSID and the channel for each unique BSS:
/^BSS / {
MAC = $2
}
/SSID/ {
wifi[MAC]["SSID"] = $2
}
/primary channel/ {
wifi[MAC]["channel"] = $NF
}
# Insert new block here
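# For example, blocks like the following could go here. This is only a
# sketch based on the sample "signal:" and "capability:" lines shown in
# the question; the printf calls in END would also need the extra columns.
/signal:/ {
wifi[MAC]["signal"] = $2 " " $3
}
/capability:/ {
wifi[MAC]["enc"] = ($0 ~ /Privacy/) ? "encrypted" : "open"
}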
END {
printf "%s\t\t%s\n","SSID","channel"
for (w in wifi) {
printf "%s\t\t%s\n",wifi[w]["SSID"],wifi[w]["channel"]
}
}
Fleshing out those blocks for signal and encryption should be easy considering all the studying you have been doing.
Save the script to a file such as wifi.awk and run it like:
$ sudo iw wlan0 scan | awk -f wifi.awk
The output will be in the format requested:
SSID channel
wlan-ap 6
test 1

Here is a simple Bash function which uses exclusively Bash internals and spawns only one sub-shell:
#!/bin/bash
function iwScan() {
# disable globbing to avoid surprises
set -o noglob
# make temporary variables local to our function
local AP S
# read stdin of the function into AP variable
while read -r AP; do
## print lines only containing needed fields
[[ "${AP//'SSID: '*}" == '' ]] && printf '%b' "${AP/'SSID: '}\n"
[[ "${AP//'signal: '*}" == '' ]] && ( S=( ${AP/'signal: '} ); printf '%b' "${S[0]},";)
done
set +o noglob
}
iwScan <<< "$(iw wlan0 scan)"
Output:
-66.00,FRITZ!Box 7312
-56.00,ALICE-WLAN01
-78.00,o2-WLAN93
-78.00,EasyBox-7A2302
-62.00,dlink
-74.00,EasyBox-59DF56
-76.00,BELAYS_Network
-82.00,o2-WLAN20
-82.00,BPPvM
The function can easily be modified to provide additional fields by adding the necessary filter to the while read -r AP loop, e.g.:
[[ "${AP//'last seen: '*}" == '' ]] && ( S=( ${AP/'last seen: '} ); printf '%b' "${S[0]},";)
Output:
-64.00,1000,FRITZ!Box 7312
-54.00,492,ALICE-WLAN01
-76.00,2588,o2-WLAN93
-78.00,652,LN8-Gast
-72.00,2916,WHITE-BOX
-66.00,288,ALICE-WLAN
-78.00,800,EasyBox-59DF56
-80.00,720,EasyBox-7A2302
-84.00,596,ALICE-WLAN08
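In the same way, a frequency field could be added with one more filter line; this is just a sketch following the same pattern, untested here:
[[ "${AP//'freq: '*}" == '' ]] && ( S=( ${AP/'freq: '} ); printf '%b' "${S[0]},";)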

I am using the following solution on OpenWrt:
wlan_scan.sh
#!/bin/sh
sudo iw dev wlan0 scan | awk -f wlan_scan.awk | sort
wlan_scan.awk
/^BSS/ {
mac = gensub ( /^BSS[[:space:]]*([0-9a-fA-F:]+).*?$/, "\\1", "g", $0 );
}
/^[[:space:]]*signal:/ {
signal = gensub ( /^[[:space:]]*signal:[[:space:]]*(\-?[0-9.]+).*?$/, "\\1", "g", $0 );
}
/^[[:space:]]*SSID:/ {
ssid = gensub ( /^[[:space:]]*SSID:[[:space:]]*([^\n]*).*?$/, "\\1", "g", $0 );
printf ( "%s %s %s\n", signal, mac, ssid );
}
result
-62.00 c8:64:c7:54:d9:05 a
-72.00 70:72:3c:1c:af:17 b
-81.00 78:f5:fd:be:33:cb c
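If you also want the frequency, a block following the same gensub pattern could be added before the SSID block (a sketch, untested; a matching %s would also have to be added to the printf):
/^[[:space:]]*freq:/ {
freq = gensub ( /^[[:space:]]*freq:[[:space:]]*([0-9]+).*$/, "\\1", "g", $0 );
}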

There is a bug in the awk script above.
The following code will not work if the SSID has spaces in the name. The received result will be the first token of the SSID name only.
$1 == "SSID:" {
wifi[MAC]["SSID"] = $2
}
When printing $0, $1, $2:
$0: SSID: DIRECT-82-HP OfficeJet 8700
$1: SSID:
$2: DIRECT-82-HP
One possible solution is to take a substr of $0, which contains the leading spaces, the token "SSID: ", and the multi-token network name.
Any other suggestions?
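For example, a minimal sketch of that substr approach (it assumes the line always contains the literal "SSID: " label):
$1 == "SSID:" {
wifi[MAC]["SSID"] = substr($0, index($0, "SSID: ") + 6)
}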

I've taken the awk code from Ari Malinen and reworked it a bit, because iw output is not stable and changes between versions, and there are other issues such as spaces in SSIDs. I put it on GitHub in case I change it in the future.
#!/usr/bin/env awk -f
$1 ~ /^BSS/ {
if($2 !~ /Load:/) { #< Skip the "BSS Load:" line
gsub("(\\(.*|:)", "", $2)
MAC = toupper($2)
wifi[MAC]["enc"] = "OPEN"
wifi[MAC]["WPS"] = "no"
wifi[MAC]["wpa1"] = ""
wifi[MAC]["wpa2"] = ""
wifi[MAC]["wep"] = ""
}
}
$1 == "SSID:" {
# Work around spaces in SSID
FS=":" #< Change the field separator to ":", which should be
# a forbidden character in an SSID name
$0=$0
sub(" ", "", $2) #< remove first whitespace
wifi[MAC]["SSID"] = $2
FS=" "
$0=$0
}
$1 == "capability:" {
for(i=2; i<=NF; i++) {
if($i ~ /0x[0-9]{4}/) {
gsub("(\\(|\\))", "", $i)
if (and(strtonum($i), 0x10))
wifi[MAC]["wep"] = "WEP"
}
}
}
$1 == "WPA:" {
wifi[MAC]["wpa1"] = "WPA1"
}
$1 == "RSN:" {
wifi[MAC]["wpa2"] = "WPA2"
}
$1 == "WPS:" {
wifi[MAC]["WPS"] = "yes"
}
$1 == "DS" {
wifi[MAC]["Ch"] = $5
}
$1 == "signal:" {
match($2, /-([0-9]{2})\.00/, m)
wifi[MAC]["Sig"] = m[1]
}
$1 == "TSF:" {
gsub("(\\(|d|,)", "", $4)
match($5, /([0-9]{2}):([0-9]{2}):/, m)
day = $4
hour = m[1]
min = m[2]
wifi[MAC]["TSF"] = day"d"hour"h"min"m"
}
END {
for (w in wifi) {
if (wifi[w]["wep"]) {
if (wifi[w]["wpa1"] || wifi[w]["wpa2"])
wifi[w]["enc"] = wifi[w]["wpa1"]wifi[w]["wpa2"]
else
wifi[w]["enc"] = "WEP"
}
printf "%s:%s:%s:%s:%s:%s:%s\n", w, wifi[w]["SSID"], wifi[w]["enc"], \
wifi[w]["WPS"], wifi[w]["Ch"], wifi[w]["Sig"], wifi[w]["TSF"]
}
}
Output:
A5FEF2C499BB:test-ssid2:OPEN:no:9:43:0d00h00m
039EFACA9A8B:test-ssid2:WPA1:no:9:33:0d00h00m
038BF3C1988B:test-ssid2:WPA2:no:9:35:0d00h00m
028EF3C2997B:test-ssid2:WPA1:no:9:35:0d00h03m
If you wonder what if($2 !~ /Load:/) does: on some routers there may be a "BSS Load:" line.
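Assuming the script is saved as, say, iwparse.awk (the name is only an example), it needs GNU awk for and(), strtonum(), the array form of match(), and arrays of arrays, and can be run like:
$ sudo iw dev wlan0 scan | gawk -f iwparse.awk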

Related

Generic "append to file if not exists" function in Bash

I am trying to write a util function in a bash script that can take a multi-line string and append it to the supplied file if it does not already exist.
This works fine using grep if the pattern does not contain \n.
if grep -qF "$1" $2
then
return 1
else
echo "$1" >> $2
fi
Example usage
append 'sometext\nthat spans\n\tmultiple lines' ~/textfile.txt
I am on macOS, by the way, which has presented some problems: some of the solutions I've seen posted elsewhere are very Linux-specific. I'd also like to avoid installing any other tools to achieve this if possible.
Many thanks
If the files are small enough to slurp into a Bash variable (you should be OK up to a megabyte or so on a modern system), and don't contain NUL (ASCII 0) characters, then this should work:
IFS= read -r -d '' contents <"$2"
if [[ "$contents" == *"$1"* ]]; then
return 1
else
printf '%s\n' "$1" >>"$2"
fi
In practice, the speed of Bash's built-in pattern matching might be more of a limitation than ability to slurp the file contents.
See the accepted, and excellent, answer to Why is printf better than echo? for an explanation of why I replaced echo with printf.
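For completeness, here is how that snippet might sit inside the append function from the question. This is only a sketch; the function name and call follow the question's usage, and the $'...' quoting is one way to pass real newlines.
append() {
    local contents
    [[ -e "$2" ]] || touch "$2"          # make sure the file exists before reading
    IFS= read -r -d '' contents <"$2"    # slurp the whole file (no NUL bytes assumed)
    if [[ "$contents" == *"$1"* ]]; then
        return 1
    else
        printf '%s\n' "$1" >>"$2"
    fi
}

append $'sometext\nthat spans\n\tmultiple lines' ~/textfile.txt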
Using awk (the script reads the pattern lines first, then scans the target file counting consecutive matching lines; if a complete match is never seen, the END block appends the pattern):
awk '
BEGIN {
n = 0 # length of pattern in lines
m = 0 # number of matching lines
}
NR == FNR {
pat[n++] = $0
next
}
{
if ($0 == pat[m])
m++
else if (m > 0 && $0 == pat[0])
m = 1
else
m = 0
}
m == n {
exit
}
END {
if (m < n) {
for (i = 0; i < n; i++)
print pat[i] >>FILENAME
}
}
' - "$2" <<EOF
$1
EOF
If necessary, one would need to properly escape any metacharacters inside FS / OFS:
jot 7 9 |
{m,g,n}awk 'BEGIN { FS = OFS = "11\n12\n13\n"
_^= RS = (ORS = "") "^$" } _<NF || ++NF'
9
10
11
12
13
14
15
jot 7 -2 | (... awk stuff ...)
-2
-1
0
1
2
3
4
11
12
13

Processing a delimited line in bash

I am given a single line of input with 'n' space-delimited arguments. The arguments themselves are variable, and the input comes from an external file.
I want to move specific elements to variables depending on regular expressions. As such, I was thinking of declaring a pointer variable first to keep track of where on the line I am. In addition, the assignment to variable is independent of numerical order, and depending on input some variables may be skipped entirely.
My current method is to use
awk '{print $1}' file.txt
However, not all elements are fixed and I need to account for elements that may be absent, or may have multiple entries.
UPDATE: I found another method.
file=$(cat /file.txt)
for i in ${file[@]}; do
echo $i >> split.txt;
done
This way, instead of a single line with multiple arguments, we get multiple lines with a single argument. As such, we can now use var#=$(grep --regexp="[pattern]" split.txt). Now I just need to figure out how best to use regular expressions to filter this mess.
Let me take an example.
My input strings are:
RON KKND 1534Z AUTO 253985G 034SRT 134OVC 04/32
RON KKND 5256Z 143623G72K 034OVC 074OVC 134SRT 145PRT 13/00
RON KKND 2234Z CON 342523G CLS 01/M12 RMK
So the variable assignment for each of the above would be:
var1=RON var2=KKND var3=1534Z var4=TRUE var5=FALSE var6=253985G varC=2 varC1=034SRT varC2=134OVC var7=04/32
var1=RON var2=KKND var3=5256Z var4=FALSE var5=FALSE var6=143623G72K varC=4 varC1=034OVC varC2=074OVC varC3=134SRT varC4=145PRT var7=13/00
var1=RON var2=KKND var3=2234Z var4=FALSE var5=TRUE var6=342523G varC=0 var7=01/M12
So, the fourth argument might be var4, var5, or var6.
The fifth argument might be var5, var6, or match another criteria.
The sixth argument may or may not be var6. The boundary between var6 and var7 can be determined by matching each argument against */*.
Boiling this down even more: the positions of var1, var2 and var3 in the input are fixed, but after that I need to compare, order, and assign. In addition, the arguments themselves can vary in character length. The relative position of each section is fixed in relation to its neighbors: var7 will never come before var6 in the input, for example, and if var4 and var5 are true, then the 4th and 5th arguments will always be 'AUTO CON'. Some segments will always be one argument, others more than one. The relative position of each is known. As for the patterns themselves, some have a specific character in a specific location, and others have no flag identifying them aside from their position in the sequence.
So I need awk to recognize a pointer variable, since every argument needs to be checked until a specific match is found:
#Check to see if var4 or var5 exists. if so, flag and increment pointer
pointer=4
if (awk '{print $$pointer}' file.txt) == "AUTO" ; then
var4="TRUE"
pointer=$pointer+1
else
var4="FALSE"
fi
if (awk '{print $$pointer}' file.txt) == "CON" ; then
var5="TRUE"
pointer=$pointer+1
else
var5="FALSE"
fi
#position of var6 is fixed once var4 and var5 are determined
var6=$(awk '{print $$pointer}' file.txt)
pointer=$pointer+1
#Count the arguments between var6 and var7 (there may be up to ten)
#and separate each to decode later. varC[0-9] is always three upcase
# letters followed by three numbers. Use this counter later when decoding.
varC=0
until (awk '{print $$pointer}' file.txt) == "*/*" ; do
varC($varC+1)=(awk '{print $$pointer}' file.txt)
varC=$varC+1
pointer=$pointer+1
done
#position of var7 is fixed after all arguments of varC are handled
var7=$(awk '{print $$pointer}' file.txt)
pointer=$pointer+1
I know the above syntax is incorrect. The question is how do I fix it.
var7 is not always at the end of the input line. Arguments after var7 however do not need to be processed.
Actually interpreting the patterns I haven't gotten to yet. I intend to handle that using case statements comparing the variables against regular expressions. I don't want awk to interpret the patterns directly, as that would get very messy. I have contemplated using for n in $string, but that would mean comparing every argument against every possible combination directly (and there are multiple segments, each with multiple patterns), which is impractical. I'm trying to make this a two-step process.
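A rough sketch of the kind of case statement I have in mind for the second step (the variable arg and these patterns are only illustrative, not the real criteria):
case "$arg" in
    AUTO) echo "auto flag" ;;
    CON) echo "con flag" ;;
    */*) echo "var7-style token (contains a slash)" ;;
    [0-9][0-9][0-9][A-Z][A-Z][A-Z]) echo "varC-style token (3 digits, 3 letters)" ;;
    *) echo "something else" ;;
esac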
Please try the following:
#!/bin/bash
# template for variable names
declare -a namelist1=( "var1" "var2" "var3" "var4" "var5" "var6" "varC" )
declare -a ary
# read each line and assign ary to the elements
while read -r -a ary; do
if [[ ${ary[3]} = AUTO ]]; then
ary=( "${ary[#]:0:3}" "TRUE" "FALSE" "${ary[4]}" "" "${ary[#]:5:3}" )
elif [[ ${ary[3]} = CON ]]; then
ary=( "${ary[#]:0:3}" "FALSE" "TRUE" "${ary[4]}" "" "${ary[#]:5:3}" )
else
ary=( "${ary[#]:0:3}" "FALSE" "FALSE" "${ary[3]}" "" "${ary[#]:4:5}" )
fi
# initial character of the 7th element
ary[6]=${ary[7]:0:1}
# locate the index of */* entry in the ary and adjust the variable names
for (( i=0; i<${#ary[#]}; i++ )); do
if [[ ${ary[$i]} == */* ]]; then
declare -a namelist=( "${namelist1[#]}" )
for (( j=1; j<=i-7; j++ )); do
namelist+=( "$(printf "varC%d" "$j")" )
done
namelist+=( "var7" )
fi
done
# assign variables to array elements
for (( i=0; i<${#ary[#]}; i++ )); do
# echo -n "${namelist[$i]}=${ary[$i]} " # for debugging
declare -n p="${namelist[$i]}"
p="${ary[$i]}"
done
# echo "var1=$var1 var2=$var2 var3=$var3 ..." # for debugging
done < file.txt
Note that the script above just assigns bash variables and does not print anything
unless you explicitly echo or printf the variables.
Updated: this code shows how to decide a variable's value based on pattern matching, applied multiple times.
One code block is in pure bash and the other is in the gawk manner.
The bash code block requires associative array support, which is not available in very early versions.
grep is also required for the pattern matching.
Tested with GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu) and grep (GNU grep) 2.20.
I stick to printf rather than echo after learning from why-is-printf-better-than-echo.
When using bash I consider it good practice to be more defensive.
#!/bin/bash
declare -ga outVars
declare -ga lineBuf
declare -g NF
#force valid index starts from 1
#consistent with var* name pattern
outVars=(unused var1 var2 var3 var4 var5 var6 varC var7)
((numVars=${#outVars[@]} - 1))
declare -gr numVars
declare -r outVars
function e_unused {
return
}
function e_var1 {
printf "%s" "${lineBuf[1]}"
}
function e_var2 {
printf "%s" "${lineBuf[2]}"
}
function e_var3 {
printf "%s" "${lineBuf[3]}"
}
function e_var4 {
if [ "${lineBuf[4]}" == "AUTO" ] ;
then
printf "TRUE"
else
printf "FALSE"
fi
}
function e_var5 {
if [ "${lineBuf[4]}" == "CON" ] ;
then
printf "TRUE"
else
printf "FALSE"
fi
}
function e_varC {
local var6_idx=4
if [ "${lineBuf[4]}" == "AUTO" -o "${lineBuf[4]}" == "CON" ] ;
then
var6_idx=5
fi
local var7_idx=$NF
local i
local count=0
for ((i=NF;i>=1;i--));
do
if [ $(grep -cE '^.*/.*$' <<<${lineBuf[$i]}) -eq 1 ];
then
var7_idx=$i
break
fi
done
((varC = var7_idx - var6_idx - 1))
if [ $varC -eq 0 ];
then
printf 0
return;
fi
local cFamily=""
local append
for ((i=var6_idx;i<=var7_idx;i++));
do
if [ $(grep -cE '^[0-9]{3}[A-Z]{3}$' <<<${lineBuf[$i]}) -eq 1 ];
then
((count++))
cFamily="$cFamily varC$count=${lineBuf[$i]}"
fi
done
printf "%s %s" $count "$cFamily"
}
function e_var6 {
if [ "${lineBuf[4]}" == "AUTO" -o "${lineBuf[4]}" == "CON" ] ;
then
printf "%s" "${lineBuf[5]}"
else
printf "%s" "${lineBuf[4]}"
fi
}
function e_var7 {
local i
for ((i=NF;i>=1;i--));
do
if [ $(grep -cE '^.*/.*$' <<<${lineBuf[$i]}) -eq 1 ];
then
printf "%s" "${lineBuf[$i]}"
return
fi
done
}
while read -a lineBuf ;
do
NF=${#lineBuf[@]}
lineBuf=(unused ${lineBuf[@]})
for ((i=1; i<=numVars; i++));
do
printf "%s=" "${outVars[$i]}"
(e_${outVars[$i]})
printf " "
done
printf "\n"
done <file.txt
The gawk-specific extension Indirect Function Call is used in the awk code below.
The code assigns a function name to every desired output variable.
A different pattern or other transformation can be applied in each variable's own function;
doing so avoids tons of if-else-if-else
and is also easier to read and extend.
For the special varC family, the function pick_varC plays a trick:
after varC is determined, its value consists of multiple output fields.
If varC=2, the value of varC is returned as 2 varC1=034SRT varC2=134OVC,
that is, the actual value of varC with all following members appended.
gawk '
BEGIN {
keys["var1"] = "pick_var1";
keys["var2"] = "pick_var2";
keys["var3"] = "pick_var3";
keys["var4"] = "pick_var4";
keys["var5"] = "pick_var5";
keys["var6"] = "pick_var6";
keys["varC"] = "pick_varC";
keys["var7"] = "pick_var7";
}
function pick_var1 () {
return $1;
}
function pick_var2 () {
return $2;
}
function pick_var3 () {
return $3;
}
function pick_var4 () {
for (i=1;i<=NF;i++) {
if ($i == "AUTO") {
return "TRUE";
}
}
return "FALSE";
}
function pick_var5 () {
for (i=1;i<=NF;i++) {
if ($i == "CON") {
return "TRUE";
}
}
return "FALSE";
}
function pick_varC () {
for (i=1;i<=NF;i++) {
if (($i=="AUTO" || $i=="CON")) {
break;
}
}
var6_idx = 5;
if ( i!=4 ) {
var6_idx = 4;
}
var7_idx = NF;
for (i=1;i<=NF;i++) {
if ($i~/.*\/.*/) {
var7_idx = i;
}
}
varC = var7_idx - var6_idx - 1;
if ( varC == 0) {
return varC;
}
count = 0;
cFamily = "";
for (i = 1; i<=varC;i++) {
if ($(var6_idx+i)~/[0-9]{3}[A-Z]{3}/) {
cFamily = sprintf("%s varC%d=%s",cFamily,i,$(var6_idx+i));
count++;
}
}
varC = sprintf("%d %s",count,cFamily);
return varC;
}
function pick_var6 () {
for (i=1;i<=NF;i++) {
if (($i=="AUTO" || $i=="CON")) {
break;
}
}
if ( i!=4 ) {
return $4;
} else {
return $5
}
}
function pick_var7 () {
for (i=1;i<=NF;i++) {
if ($i~/.*\/.*/) {
return $i;
}
}
}
{
for (k in keys) {
pickFunc = keys[k];
printf("%s=%s ",k,#pickFunc());
}
printf("\n");
}
' file.txt
test input
RON KKND 1534Z AUTO 253985G 034SRT 134OVC 04/32
RON KKND 5256Z 143623G72K 034OVC 074OVC 134SRT 145PRT 13/00
RON KKND 2234Z CON 342523G CLS 01/M12 RMK
script output
var1=RON var2=KKND var3=1534Z var4=TRUE var5=FALSE varC=2 varC1=034SRT varC2=134OVC var6=253985G var7=04/32
var1=RON var2=KKND var3=5256Z var4=FALSE var5=FALSE varC=4 varC1=034OVC varC2=074OVC varC3=134SRT varC4=145PRT var6=143623G72K var7=13/00
var1=RON var2=KKND var3=2234Z var4=FALSE var5=TRUE varC=0 var6=342523G var7=01/M12

awk: More elegant way to filter a file with another one

I've recently approached the incredibly fast awk since I needed to parse very big files.
I had to parse this kind of input...
ID 001R_FRG3G Reviewed; 256 AA.
AC Q6GZX4;
[...]
SQ SEQUENCE 256 AA; 29735 MW; B4840739BF7D4121 CRC64;
MAFSAEDVLK EYDRRRRMEA LLLSLYYPND RKLLDYKEWS PPRVQVECPK APVEWNNPPS
EKGLIVGHFS GIKYKGEKAQ ASEVDVNKMC CWVSKFKDAM RRYQGIQTCK IPGKVLSDLD
AKIKAYNLTV EGVEGFVRYS RVTKQHVAAF LKELRHSKQY ENVNLIHYIL TDKRVDIQHL
EKDLVKDFKA LVESAHRMRQ GHMINVKYIL YQLLKKHGHG PDGPDILTVK TGSKGVLYDD
SFRKIYTDLG WKFTPL
//
ID 002L_FRG3G Reviewed; 320 AA.
AC Q6GZX3;
[...]
SQ SEQUENCE 320 AA; 34642 MW; 9E110808B6E328E0 CRC64;
MSIIGATRLQ NDKSDTYSAG PCYAGGCSAF TPRGTCGKDW DLGEQTCASG FCTSQPLCAR
IKKTQVCGLR YSSKGKDPLV SAEWDSRGAP YVRCTYDADL IDTQAQVDQF VSMFGESPSL
AERYCMRGVK NTAGELVSRV SSDADPAGGW CRKWYSAHRG PDQDAALGSF CIKNPGAADC
KCINRASDPV YQKVKTLHAY PDQCWYVPCA ADVGELKMGT QRDTPTNCPT QVCQIVFNML
DDGSVTMDDV KNTINCDFSK YVPPPPPPKP TPPTPPTPPT PPTPPTPPTP PTPRPVHNRK
VMFFVAGAVL VAILISTVRW
//
ID 004R_FRG3G Reviewed; 60 AA.
AC Q6GZX1; dog;
[...]
SQ SEQUENCE 60 AA; 6514 MW; 12F072778EE6DFE4 CRC64;
MNAKYDTDQG VGRMLFLGTI GLAVVVGGLM AYGYYYDGKT PSSGTSFHTA SPSFSSRYRY
...filter it with a file like this...
Q6GZX4
dog
...to get an output like this:
Q6GZX4 MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLDAKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDDSFRKIYTDLGWKFTPL 256
dog MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY 60
To do this, I came up with this code:
BEGIN{
while(getline<"filterFile.txt">0)B[$1];
}
{
if ($1=="ID")
len=$4;
else{
if ($1=="AC"){
acc=0;
line = substr($0,6,length($0)-6);
split(line,A,"; ");
for (i in A){
if (A[i] in B){
acc=A[i];
}
}
if (acc){
printf acc"\t";
}
}
if (acc){
if(substr($0, 1, 5) == " "){
printf $1$2$3$4$5$6;
}
if ($1 == "//"){
print "\t"len
}
}
}
}
However, since I've seen many examples of similar tasks done with awk, I think there probably is a much more elegant and efficient way to do it. But I can't really grasp the super-compact examples usually found around the internet.
Since this is my input, my output and my code, I think this is a good occasion to understand more about awk optimization in terms of performance and coding style, if some awk guru has time and patience to spend on this task.
Perl to the rescue:
#!/usr/bin/perl
use warnings;
use strict;
open my $FILTER, '<', 'filterFile.txt' or die $!;
my %wanted; # Hash of the wanted ids.
chomp, $wanted{$_} = 1 for <$FILTER>;
$/ = "//\n"; # Record separator.
while (<>) {
my ($id_string) = /^ AC \s+ (.*) /mx;
my @ids = split /\s*;\s*/, $id_string;
if (my ($id) = grep $wanted{$_}, @ids) {
print "$id\t";
my ($seq) = /^ SQ \s+ .* $ ((?s:.*)) /mx;
$seq =~ s/\s+//g; # Remove whitespace.
$seq =~ s=//$==; # Remove the final //.
print "$seq\t", length $seq, "\n";
}
}
An awk solution with a different field separator (in this way, you avoid to use substr and split):
BEGIN {
while (getline<"filterFile.txt">0) filter[$1] = 1;
FS = "[ \t;]+"; OFS = ""; ORS = "";
}
{
if (flag) {
if (len)
if ($1 == "//") {
print "\t" len "\n";
flag = 0; len = 0;
} else {
$1 = $1;
print;
}
else if ($1 == "SQ") len = $3;
} else if ($1 == "AC") {
for (i = 1; ++i < NF;)
if (filter[$i]) {
flag = 1;
print $i "\t";
break;
}
}
}
END { if (flag) print "\t" len }
Note: this code is not designed to be short but to be fast. That's why I didn't try to remove nested if/else conditions, but I did try to reduce as much as possible the overall number of tests for the whole file.
However, after several changes since my first version and after several benchmarks, I must admit that choroba's Perl version is a little faster.
For that kind of task, an idea is to pipe your second file through awk or sed in order to create on the fly a new awk script parsing the big file. As an example:
Control file (f1):
test
dog
Data (f2):
tree 5
test 2
nothing
dog 1
An idea to start with:
sed 's/^\(.*\)$/\/\1\/ {print $2}/' f1 | awk -f - f2
(where -f - means: read the awk script from the standard input rather than from a named file).
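For the control file above, the sed command expands each line into a pattern-action rule, so the awk program generated on the fly would be:
/test/ {print $2}
/dog/ {print $2}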
This may not be much shorter than the original, but multiple awk scripts make the code simpler: the first awk generates the records of interest, the second extracts the information, and the third formats the output.
$ awk 'NR==FNR{keys[$0];next}
{RS="//";
for(k in keys)
if($0~k)
{print "key",k; print $0}}' keys file
| awk '/key/{key=$2;f=0;;next}
/SQ/{f=1;print "\n\n"key,$3;next}
f{gsub(" ","");printf $0}
END{print}'
| awk -vRS= -vOFS="\t" '{print $1,$3,$2}'
will print
Q6GZX4 MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLDAKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDDSFRKIYTDLGWKFTPL 256
dog MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY 60
Your code looks almost OK as-is. Keep it simple, single-pass like that.
Only a couple suggestions:
1) The business around the split is too messy/brittle. Maybe try it this way:
acc="";
n=split($0,A,"[; ]+");
for (i=2;i<=n;++i){
if (A[i] in B){
acc=A[i];
break;
}
}
2) Don't use input data in the first argument to your printfs. You never know when something that looks like printf formatting might come in and really mess things up:
printf "%s\t",acc";
printf "%s%s%s%s%s%s",$1,$2,$3,$4,$5,$6;
Update with one more possible "elegance":
3) The awk style of pattern{action} is already a form of if/then, so you can avoid a lot of your outer if/then nesting:
$1="ID" {len=$4}
$1="AC" {
acc="";
...
}
acc {
if(substr($0, 1, 5) == " "){
...
}
In Vim it's actually a one-liner to find the pattern:
/^AC.\{-}Q6GZX4;\_.\{-}\nSQ\_.\{-}\n\zs\_.\{-}\ze\/\//
where Q6GZX4; is your pattern to find in order to match the sequence characters.
The above basically will do:
Search for the line with AC at the beginning (^) which is followed by Q6GZX4;.
Follow across multiple lines (\_.\{-}) to the line starting with SQ (\nSQ).
Then follow to the next line ignoring what's in the current (\_.\{-}\n).
Now start selecting the main pattern (\zs) which is basically everything across multiple lines (\_.\{-}) until (\ze) the // pattern if found.
Then execute normal Vim commands (norm) which select the pattern (gn) and yank it into the x register ("xy).
You may now print the register (echo @x) or remove whitespace characters from it.
This can be extended into Ex editor script as below (e.g. cmd.ex):
let s="Q6GZX4"
exec '/^AC.\{-}' . s . ';\_.\{-}\nSQ\_.\{-}\n\zs\_.\{-}\ze\/\//norm gn"xy'
let @x=substitute(@x,'\W','','g')
silent redi>>/dev/stdout
echon s . " " . #x
redi END
q!
Then run from the command-line as:
$ ex inputfile < cmd.ex
Q6GZX4 MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLDAKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDDSFRKIYTDLGWKFTPL
The above example can be further extended for multiple files or matches.
awk 'FNR == NR { aFilter[ $1 ";"] = $1; next }
/^AC/ {
if (String !~ /^$/) print Taken "\t" String "\t" Len
Taken = ""; String = ""
for ( i = 2; i <= NF && Taken ~ /^$/; i++) {
if( $i in aFilter) Taken = aFilter[ $i]
}
Take = Taken !~ /^$/
next
}
Take && /^SQ/ { Len = $3; next }
Take && /^[[:blank:]]/ {
gsub( /[[:blank:]]*/, "")
String = String $0
}
END { if( String !~ /^$/) print Taken "\t" String "\t" Len }
' filter.txt YourFile
Not really shorter, maybe a bit more generic. The heavy part is extracting the value that serves as the filter from the line.

Substituting variables in a text string

I have a text string in a variable in bash which looks like this:
filename1.txt
filename2.txt
varname1 = v1value
$(varname1)/filename3.txt
$(varname1)/filename4.txt
varname2 = $(varname1)/v2value
$(varname2)/filename5.txt
$(varname2)/filename6.txt
I want to substitute all of the variables in place, producing this:
filename1.txt
filename2.txt
v1value/filename3.txt
v1value/filename4.txt
v1value/v2value/filename5.txt
v1value/v2value/filename6.txt
Can anyone suggest a clean way to do this in the shell?
In awk:
BEGIN {
FS = "[[:space:]]*=[[:space:]]*"
}
NF > 1 {
map[$1] = $2
next;
}
function replace( count)
{
for (key in map) {
count += gsub("\\$\\("key"\\)", map[key])
}
return count
}
{
while (replace() > 0) {}
print
}
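Assuming the awk script above is saved as, say, subst.awk and the text is in a shell variable (mytext here is just a stand-in for whatever variable holds the block from the question), it could be run like:
printf '%s\n' "$mytext" | awk -f subst.awk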
In lua:
local map = {}
--for line in io.lines("file.in") do -- To read from a file.
for line in io.stdin:lines() do -- To read from standard input.
local key, value = line:match("^(%w*)%s*=%s*(.*)$")
if key then
map[key] = value
else
local count
while count ~= 0 do
line, count = line:gsub("%$%(([^)]*)%)", map)
end
print(line)
end
end
I found a reasonable solution using m4:
function make_substitutions() {
# first all $(varname)s are replaced with ____varname____
# then each assignment statement is replaced with an m4 define macro
# finally this text is then passed through m4
echo "$1" |\
sed 's/\$(\([[:alnum:]][[:alnum:]]*\))/____\1____/' | \
sed 's/ *\([[:alnum:]][[:alnum:]]*\) *= *\(..*\)/define(____\1____, \2)/' | \
m4
}
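It would then be called with the text in a variable, for example (mytext again stands in for whatever variable holds the block from the question):
make_substitutions "$mytext"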
Perhaps
echo "$string" | perl -nlE 'm/(\w+)\s*=\s*(.*)(?{$h{$1}=$2})/&&next;while(m/\$\((\w+)\)/){$x=$1;s/\$\($x\)/$h{$x}/e};say$_'
prints
filename1.txt
filename2.txt
v1value/filename3.txt
v1value/filename4.txt
v1value/v2value/filename5.txt
v1value/v2value/filename6.txt

Extracting multiple parts of a string using bash

I have a caret delimited (key=value) input and would like to extract multiple tokens of interest from it.
For example: Given the following input
$ echo -e "1=A00^35=D^150=1^33=1\n1=B000^35=D^150=2^33=2"
1=A00^35=D^22=101^150=1^33=1
1=B000^35=D^22=101^150=2^33=2
I would like the following output
35=D^150=1^
35=D^150=2^
I have tried the following
$ echo -e "1=A00^35=D^150=1^33=1\n1=B000^35=D^150=2^33=2"|egrep -o "35=[^/^]*\^|150=[^/^]*\^"
35=D^
150=1^
35=D^
150=2^
My problem is that egrep returns each match on a separate line. Is it possible to get one line of output for one line of input? Please note that due to the constraints of the larger script, I cannot simply do a blind replace of all the \n characters in the output.
Thank you for any suggestions. This script is for bash 3.2.25. Any egrep alternatives are welcome. Please note that the tokens of interest (35 and 150) may change and I am already generating the egrep pattern in the script, hence a one-liner (if possible) would be great.
You have two options. Option 1 is to change the "white space character" and use set --:
OFS=$IFS
IFS="^ "
line='1=A00^35=D^150=1^33=1'
set -- $line # No quotes here!!
IFS="$OFS"
Now you have your values in $1, $2, etc.
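From there, one way to pick out just the wanted keys might be (a sketch; 35 and 150 are the example tokens from the question):
out=""
for f in "$@"; do
    case $f in
        35=*|150=*) out="$out$f^" ;;
    esac
done
printf '%s\n' "$out"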
Or you can use an array:
tmp=$(echo "1=A00^35=D^150=1^33=1" | sed -e 's:\([0-9]\+\)=: [\1]=:g' -e 's:\^ : :g')
eval value=($tmp)
echo "35=${value[35]}^150=${value[150]}"
To get rid of the newline, you can just echo it again:
$ echo $(echo "1=A00^35=D^150=1^33=1"|egrep -o "35=[^/^]*\^|150=[^/^]*\^")
35=D^ 150=1^
If that's not satisfactory (I think it may give you one line for the whole input file), you can use awk:
pax> echo '
1=A00^35=D^150=1^33=1
1=a00^35=d^157=11^33=11
' | awk -vLIST=35,150 -F^ ' {
sep = "";
split (LIST, srch, ",");
for (i = 1; i <= NF; i++) {
for (idx in srch) {
split ($i, arr, "=");
if (arr[1] == srch[idx]) {
printf sep "" arr[1] "=" arr[2];
sep = "^";
}
}
}
if (sep != "") {
print sep;
}
}'
35=D^150=1^
35=d^
pax> echo '
1=A00^35=D^150=1^33=1
1=a00^35=d^157=11^33=11
' | awk -vLIST=1,33 -F^ ' {
sep = "";
split (LIST, srch, ",");
for (i = 1; i <= NF; i++) {
for (idx in srch) {
split ($i, arr, "=");
if (arr[1] == srch[idx]) {
printf sep "" arr[1] "=" arr[2];
sep = "^";
}
}
}
if (sep != "") {
print sep;
}
}'
1=A00^33=1^
1=a00^33=11^
This one allows you to use a single awk script and all you need to do is to provide a comma-separated list of keys to print out.
And here's the one-liner version :-)
echo '1=A00^35=D^150=1^33=1
1=a00^35=d^157=11^33=11
' | awk -vLST=1,33 -F^ '{s="";split(LST,k,",");for(i=1;i<=NF;i++){for(j in k){split($i,arr,"=");if(arr[1]==k[j]){printf s""arr[1]"="arr[2];s="^";}}}if(s!=""){print s;}}'
given a file 'in' containing your strings :
$ for i in $(cut -d^ -f2,3 < in);do echo $i^;done
35=D^150=1^
35=D^150=2^
