awk shell variable with field separator - bash

I am trying to create a hash:
awk -F ';' '/DHCP/ {for(i=1; i<=5; i++) {getline; print $2$1}}' file \
| awk '{print $1"=>\"0000:0000:0000:1000::"$2"/64\""}'
returns the following:
host1=>"0000:0000:0000:1000::2/64"
host2=>"0000:0000:0000:1000::3/64"
host3=>"0000:0000:0000:1000::4/64"
host4=>"0000:0000:0000:1000::5/64"
host5=>"0000:0000:0000:1000::6/64"
This is all fine, but notice the 5 in the for loop in awk. How can I retrieve the total number of lines of the file into that for loop?
I can put wc -l into a variable, but how do I use the shell variable and the ; field separator together with awk?
EDIT
This is what the file looks like:
#rest are dynamically assigned by DHCP server
2 ; host1 ; server1 ; ; ;
3 ; host2 ; sX ;;
4 ; host3 ; plic ;;
5 ; host4 ; cluc ;;
6 ; host6 ; blah ;;

awk -F'[ \t]*;[ \t]*' 'NR > 1 && NF > 1 { print $2"=>\"0000:0000:0000:1000::"$1"/64\"" }' file
I've gotten rid of the check for DHCP -- I just test if we're past the first line. And NF > 1 makes sure that we don't do anything on a blank line.
I combined the two uses of awk into one by using a more elaborate field separator. It matches ; and any whitespace around it.
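For reference, against the sample file above I'd expect this to print the following (the last entry is host6, matching the sample data rather than the output shown in the question):
host1=>"0000:0000:0000:1000::2/64"
host2=>"0000:0000:0000:1000::3/64"
host3=>"0000:0000:0000:1000::4/64"
host4=>"0000:0000:0000:1000::5/64"
host6=>"0000:0000:0000:1000::6/64"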

awk -v IT=$(cat file1|wc -l) -F ';' '/DHCP/ {for(i=1; i<=IT; i++) {getline; print $2$1}}' file \
| awk '{print $1"=>\"0000:0000:0000:1000::"$2"/64\""}'
The -v flag passes external variables to awk.
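Since all that's taken from file1 is its line count, the count could equally come from a redirection instead of cat; a minimal variant of the same command (same logic, just a different way to fill IT):
awk -v IT="$(wc -l < file1)" -F ';' '/DHCP/ {for(i=1; i<=IT; i++) {getline; print $2$1}}' file \
| awk '{print $1"=>\"0000:0000:0000:1000::"$2"/64\""}'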
@Ed Morton on my system catting executed faster
(precise)cronkilla@localhost:/tmp$ time wc -l < file1
4
real 0m0.003s
user 0m0.001s
sys 0m0.002s
(precise)cronkilla@localhost:/tmp$ time cat file1 | wc -l
4
real 0m0.003s
user 0m0.001s
sys 0m0.001s

Related

If else script in bash using grep and awk

I am trying to make a script that checks whether the snp value (column $4 of the test file) is present in another file (the map file). If so, print the snp value and the distance value taken from the map file (distance is column $4 of the map file). If the snp value from the test file is not present in the map file, print the snp value but put a 0 (zero) in the second column as the distance value.
The script is:
for chr in {1..22};
do
for snp in awk '{print $4}' test$chr.bim
i=$(grep $snp map$chr.txt | wc -l | awk '{print $1}')
if [[ $i == "0" ]]
then
echo "$snp 0" >> position.$chr
else
distance=$(grep $snp map$chr.txt | awk '{print $4}')
echo "$snp $distance" >> position.$chr
fi
done
done
my map file is made like this:
Chromosome Position(bp) Rate(cM/Mb) Map(cM)
chr22 16051347 8.096992 0.000000
chr22 16052618 8.131520 0.010291
chr22 16053624 8.131967 0.018472
and so on..
my test file is made like this:
22 16051347 0 16051347 C A
22 16052618 0 16052618 G T
22 17306184 0 17306184 T G
and so on..
I'm getting the following syntax errors:
position.sh: line 6: syntax error near unexpected token `i=$(grep $snp map$chr.txt | wc -l | awk '{print $1}')'
position.sh: line 6: `i=$(grep $snp map$chr.txt | wc -l | awk '{print $1}')'
Any tip?
The attempt to use awk as the argument to for is basically a syntax error, and you have a number of syntax problems and inefficiencies here.
Try this:
for chr in {1..22}; do
awk '{print $4}' "test$chr.bim" |
while IFS="" read -r snp; do
if ! grep -q "$snp" "map$chr.txt"; then
echo "$snp 0"
else
awk -v snp="$snp" '
$0 ~ snp { print snp, $4 }' "map$chr.txt"
fi >> "position.$chr"
done
done
The entire thing could probably be further refactored to a single Awk script.
for chr in {1..22}; do
awk 'NR == FNR { ++a[$4]; next }
$2 in a { print $2, $4; ++found[$2] }
END { for(k in a) if (!found[k]) print k, 0 }' \
"test$chr.bim" "map$chr.txt" >> "position.$chr"
done
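With chr=22 and just the sample rows from the question, I'd expect position.22 to end up containing something along these lines (the unmatched snp comes last, from the END block):
16051347 0.000000
16052618 0.010291
17306184 0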
The correct for syntax for what I'm guessing you wanted would look like
for snp in $(awk '{print $4}' "test$chr.bim"); do
but this has other problems; see don't read lines with for

awk ignores line with 0 only

I don’t know why I can not give only 0 to awk in a direct statement, e.g. if I want to output the square of a number:
$ echo 4 | awk '$0=$1*$1'
16
$ echo 3 | awk '$0=$1*$1'
9
$ echo 0 | awk '$0=$1*$1'
Why do I get nothing on the last try?
PS. it works if I write $1 in a bracketed statement:
$ echo 0 | awk '{print $1*$1}'
0
No, awk does not ignore a line with 0.
However, your awk command: $0=$1*$1 does not do what you think.
By default, awk prints $0 when a pattern evaluates to true (a non-zero number or a non-empty string). Here $0=$1*$1 is an assignment used as a pattern: its value is the newly computed $0, so when the square is 0 the pattern is false and nothing is printed.
So, this will always print $0:
awk '1'
And this will never print $0:
awk '0'
To do what you want: to always print $0 after it has been re-calculated, you need to do:
awk '{$0=$1*$1; print}'
And so:
$ echo "0" | awk '{$0=$1*$1; print}'
0
$ echo "2" | awk '{$0=$1*$1; print}'
4
Or, without changing the value of $0, do:
$ echo "2" | awk '{print $0*$0}'
Or (shorter but less readable):
$ echo "2" | awk '{$0=$0*$0}1'
4
And, even shorter:
$ echo "4" | awk '{$0*=$0}1'
16
This last awk script is actually composed of two pattern-action rules:
awk '
<default pattern> { $0*=$0 }
1 { <default action> }
'
Which becomes, once the default action is written out as print and the default pattern as /.*/ (match every line):
awk ' /.*/{$0*=$0}
1 {print $0}'
Both rules are applied to every input line: $0 is first squared, and then print $0 is executed.
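As a quick sanity check of that expanded form (a throwaway example, not from the original post; output as I'd expect from any POSIX awk):
$ printf '0\n3\n' | awk '{$0*=$0}1'
0
9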

Different ways of grepping for large amounts of data

So I have a huuuuge file and a big list of items that I want to grep out from that file.
For the sake of this example, let the files be denoted thus-
seq 1 10000 > file.txt #file.txt contains numbers from 1 to 10000
seq 1 5 10000 > list #list contains every fifth number from 1 to 10000
My question is: which is the best way to grep out the lines corresponding to 'list' from 'file.txt'?
I tried it in two ways-
time while read i ; do grep -w "$i" file.txt ; done < list > output
That command took - real 0m1.300s
time grep -wf list file.txt > output
This one was slower, clocking in at- real 0m1.402s.
Is there a better (faster) way to do this? Is there a best way that I'm missing?
You're comparing apples and oranges
this command greps words from list in file.txt
time for i in `cat list`; do grep -w "$i" file.txt ; done > output
this command greps patterns from file.txt in list
time grep -f file.txt list > output
You need to fix one file as the source of strings to match and the other file as the target data in which to match them, and use the same grep options (like -w or -F) in both cases.
It sounds like list is the source of patterns and file.txt is the target data file. Here are my timings for the adjusted original commands, plus one awk and two sed solutions; the sed solutions differ in whether the patterns are given as separate sed commands or as one extended regex.
timings
one grep
real 0m0.016s
user 0m0.001s
sys 0m0.001s
2000 output1
loop grep
real 0m10.120s
user 0m0.060s
sys 0m0.212s
2000 output2
awk
real 0m0.022s
user 0m0.007s
sys 0m0.000s
2000 output3
sed
real 0m4.260s
user 0m4.211s
sys 0m0.022s
2000 output4
sed -r
real 0m0.144s
user 0m0.085s
sys 0m0.047s
2000 output5
script
n=10000
seq 1 $n >file.txt
seq 1 5 $n >list
echo "one grep"
time grep -Fw -f list file.txt > output1
wc -l output1
echo "loop grep"
time for i in `cat list`; do grep -Fw "$i" file.txt ; done > output2
wc -l output2
echo "awk"
time awk 'ARGIND==1 {list[$1]; next} $1 in list' list file.txt >output3
wc -l output3
echo "sed"
sed 's/^/\/^/;s/$/$\/p/' list >list.sed
time sed -n -f list.sed file.txt >output4
wc -l output4
echo "sed -r"
tr '\n' '|' <list|sed 's/^/\/^(/;s/|$/)$\/p/' >list.sedr
time sed -nr -f list.sedr file.txt >output5
wc -l output5
You can try awk:
awk 'NR==FNR{a[$1];next} $1 in a' file.txt list
On my system, awk is faster than grep with the sample data.
Test:
$ time grep -f file.txt list > out
real 0m1.231s
user 0m1.056s
sys 0m0.175s
$ time awk 'NR==FNR{a[$1];next} $1 in a' file.txt list > out1
real 0m0.068s
user 0m0.067s
sys 0m0.001s
Faster or not, you have a useless use of cat up there.
Why not:
grep -f list file.txt # aren't the files meant to be the other way around?
Or use a slightly more customized awk:
awk 'NR==FNR{a[$1];next} $1 in a{print $1;next}' list file.txt
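For reference, a quick way to recreate the test data from the question and sanity-check the match count (the 2000 lines seen in the timings above):
seq 1 10000 > file.txt   # numbers 1 to 10000
seq 1 5 10000 > list     # every fifth number: 2000 lines
awk 'NR==FNR{a[$1];next} $1 in a' list file.txt | wc -l   # should print 2000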

how to delete all characters starting from the nth position for every word using bash?

I have a file containing 1,700,000 words. I want to do naive stemming of the words: if a word's length is more than 6 characters, I delete all characters after the 6th position. For example:
Input:
Everybody is around
Everyone keeps talking
Output:
Everyb is around
Everyo keeps talkin
I wrote the following script:
INPUT=train.txt
while read line; do
for word in $line; do
new="$(echo $word | awk '{print substr($0,1,6);exit}')"
echo -n $new >> train_stem_6.txt
echo -n ' ' >> train_stem_6.txt
done
echo ' ' >> train_stem_6.txt
done < "$INPUT"
This answers the question perfectly, but it is extremely slow, and since I have 1,700,000 words, it takes forever.
Is there a faster way to do this with a bash script?
Thanks a lot,
You can use this GNU awk command with a custom RS:
awk -v RS='[[:space:]]' '{ORS=RT; print substr($0, 1, 6)}' file
Everyb is around
Everyo keeps talkin
Timings of 3 commands on 11 MB input file:
sed:
time sed -r 's/([a-zA-Z]{6})[a-zA-Z]+/\1/g' file >/dev/null
real 0m2.913s
user 0m2.878s
sys 0m0.020s
awk command by @andlrc:
time awk '{for(i=1;i<=NF;i++){$i=substr($i, 1, 6)}}1' file >/dev/null
real 0m1.191s
user 0m1.174s
sys 0m0.011s
My suggested awk command:
time awk -v RS='[[:space:]]' '{ORS=RT; print substr($0, 1, 6)}' file >/dev/null
real 0m1.926s
user 0m1.905s
sys 0m0.013s
So both awk commands take pretty much the same time to finish the job, and sed tends to be slower on bigger files.
The 3 commands on a 167 MB file:
$ time awk -v RS='[[:space:]]+' 'RT{ORS=RT} {$1=substr($1, 1, 6)} 1' test > /dev/null
real 0m29.070s
user 0m28.898s
sys 0m0.060s
$ time awk '{for(i=1;i<=NF;i++){$i=substr($i, 1, 6)}}1' test >/dev/null
real 0m13.897s
user 0m13.805s
sys 0m0.036s
$ time sed -r 's/([a-zA-Z]{6})[a-zA-Z]+/\1/g' test > /dev/null
real 0m40.525s
user 0m40.323s
sys 0m0.064s
Would you consider using sed?
sed -r 's/([a-zA-Z]{6})[a-zA-Z]+/\1/g'
You can use awk for this:
awk '{for(i=1;i<=NF;i++){$i=substr($i, 1, 6)}}1' train.txt
Breakdown:
{
for(i=1;i<=NF;i++) { # Iterate over each word
$i = substr($i, 1, 6); # Shrink it to a maximum of 6 characters
}
}
1 # Print the row
This will, however, treat Awesome, (trailing comma included) as a single word and truncate it to Awesom, dropping the comma along with the extra letters; see the sketch below for one way around that.
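If keeping such trailing punctuation matters, one possible tweak (my own sketch, not part of this answer) is to truncate only the leading run of letters, in the spirit of the sed regex above:
awk '{
  for (i = 1; i <= NF; i++) {
    match($i, /^[[:alpha:]]+/)                           # leading run of letters
    if (RLENGTH > 6)
      $i = substr($i, 1, 6) substr($i, RSTART + RLENGTH) # keep anything after the letters
  }
} 1' train.txt
Like the other field-rewriting variants, this still squeezes repeated whitespace down to single spaces.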
Pure bash (i.e. not POSIX), as a one-liner:
while read x ; do set -- $x ; for f in $* ; do echo -n ${f:0:6}" " ; done ; echo ; done < train.txt
...and the same code reformatted for clarity:
while read x ; do
set -- $x
for f in $* ; do
echo -n ${f:0:6}" "
done
echo
done < train.txt
Note: repeated whitespace becomes a single space.
Test run: first make a function from the above code that reads standard input:
len6() { while read x ; do set -- $x ; for f in $* ; do echo -n ${f:0:6}" " ; done ; echo ; done ; }
Invoke:
COLUMNS=90 man bash | tail | head -n 5 | len6
Output:
gracef when proces suspen is attemp When a proces is stoppe the
shell immedi execut the next comman in the sequen It suffic to
place the sequen of comman betwee parent to force it into a subshe
which may be stoppe as a unit.

How can I specify a row in awk in for loop?

I'm using the following awk command:
my_command | awk -F "[[:space:]]{2,}+" 'NR>1 {print $2}' | egrep "^[[:alnum:]]"
which successfully returns my data like this:
fileName1
file Name 1
file Nameone
f i l e Name 1
So as you can see some file names have spaces. This is fine as I'm just trying to echo the file name (nothing special). The problem is calling that specific row within a loop. I'm trying to do it this way:
i=1
for num in $rows
do
fileName=$(my_command | awk -F "[[:space:]]{2,}+" 'NR==$i {print $2}' | egrep "^[[:alnum:]])"
echo "$num $fileName"
$((i++))
done
But my output is always null
I've also tried using awk -v record=$i and then printing $record but I get the below results.
f i l e Name 1
EDIT
Sorry for the confusion: rows is a variable that lists ids like this: 11 12 13
and each one of those ids ties to a file name. My command without doing any parsing looks like this:
id File Info OS
11 File Name1 OS1
12 Fi leNa me2 OS2
13 FileName 3 OS3
I can only use the id field to run the command that I need, but I want to use the File Info field to notify the user of the actual file that the command is being executed against.
I think your $i does not expand as expected. You should quote your arguments this way:
fileName=$(my_command | awk -F "[[:space:]]{2,}+" "NR==$i {print \$2}" | egrep "^[[:alnum:]]")
And you forgot the other ).
EDIT
As an update to your requirement, you could just pass the rows to a single awk command instead of a repetitive one inside a loop:
#!/bin/bash
ROWS=(11 12)
function my_command {
# This function just emulates my_command and should be removed later.
echo " id File Info OS
11 File Name1 OS1
12 Fi leNa me2 OS2
13 FileName 3 OS3"
}
awk -- '
BEGIN {
input = ARGV[1]
while (getline line < input) {
sub(/^ +/, "", line)
split(line, a, / +/)
for (i = 2; i < ARGC; ++i) {
if (a[1] == ARGV[i]) {
printf "%s %s\n", a[1], a[2]
break
}
}
}
exit
}
' <(my_command) "${ROWS[@]}"
That awk command could be condensed to one line as:
awk -- 'BEGIN { input = ARGV[1]; while (getline line < input) { sub(/^ +/, "", line); split(line, a, / +/); for (i = 2; i < ARGC; ++i) { if (a[1] == ARGV[i]) {; printf "%s %s\n", a[1], a[2]; break; }; }; }; exit; }' <(my_command) "${ROWS[@]}"
Or better yet just use Bash instead as a whole:
#!/bin/bash
ROWS=(11 12)
shopt -s extglob # the +( ) pattern in the substitution below needs extglob
while IFS=$' ' read -r LINE; do
IFS='|' read -ra FIELDS <<< "${LINE// +( )/|}"
for R in "${ROWS[@]}"; do
if [[ ${FIELDS[0]} == "$R" ]]; then
echo "${R} ${FIELDS[1]}"
break
fi
done
done < <(my_command)
It should give an output like:
11 File Name1
12 Fi leNa me2
Shell variables aren't expanded inside single-quoted strings. Use the -v option to set an awk variable to the shell variable:
fileName=$(my_command | awk -v i=$i -F "[[:space:]]{2,}+" 'NR==i {print $2}' | egrep "^[[:alnum:]]")
This method avoids having to escape all the $ characters in the awk script, as required in konsolebox's answer.
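A tiny standalone illustration of the -v approach (made-up input, just to show the shell value reaching awk):
$ i=3
$ printf 'a\nb\nc\nd\n' | awk -v i="$i" 'NR==i {print $0}'
c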
As you already heard, you need to populate an awk variable from your shell variable to be able to use the desired value within the awk script, so this:
awk -F "[[:space:]]{2,}+" 'NR==$i {print $2}' | egrep "^[[:alnum:]]"
should be this:
awk -v i="$i" -F "[[:space:]]{2,}+" 'NR==i {print $2}' | egrep "^[[:alnum:]]"
Also, though, you don't need awk AND grep since awk can do anything grep can do, so you can change this part of your script:
awk -v i="$i" -F "[[:space:]]{2,}+" 'NR==i {print $2}' | egrep "^[[:alnum:]]"
to this:
awk -v i="$i" -F "[[:space:]]{2,}+" '(NR==i) && ($2~/^[[:alnum:]]/){print $2}'
and you don't need a + after a numeric range so you can change {2,}+ to just {2,}:
awk -v i="$i" -F "[[:space:]]{2,}" '(NR==i) && ($2~/^[[:alnum:]]/){print $2}'
Most importantly, though, instead of invoking awk once for every invocation of my_command, you can just invoke it once for all of them, i.e. instead of this (assuming this does what you want):
i=1
for num in $rows
do
fileName=$(my_command | awk -v i="$i" -F "[[:space:]]{2,}" '(NR==i) && ($2~/^[[:alnum:]]/){print $2}')
echo "$num $fileName"
$((i++))
done
you can do something more like this:
for num in $rows
do
my_command
done |
awk -F '[[:space:]]{2,}' '$2~/^[[:alnum:]]/{print NR, $2}'
I say "something like" because you don't tell us what "my_command", "rows" or "num" are so I can't be precise but hopefully you see the pattern. If you give us more info we can provide a better answer.
It's pretty inefficient to rerun my_command (and awk) every time through the loop just to extract one line from its output. Especially when all you're doing is printing out part of each line in order. (I'm assuming that my_command really is exactly the same command and produces the same output every time through your loop.)
If that's the case, this one-liner should do the trick:
paste -d' ' <(printf '%s\n' $rows) <(my_command |
awk -F '[[:space:]]{2,}+' '($2 ~ /^[[:alnum:]]/) {print $2}')
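Assuming my_command prints the table shown in the question (and adding NR > 1, as in the original command, so the header row is skipped), I'd expect output like:
$ rows="11 12 13"
$ paste -d' ' <(printf '%s\n' $rows) <(my_command |
    awk -F '[[:space:]]{2,}' 'NR > 1 && $2 ~ /^[[:alnum:]]/ {print $2}')
11 File Name1
12 Fi leNa me2
13 FileName 3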
