bash variable created with awk moves in echo output

My while loop reads a file that looks like this:
Chr start stop value
chr1 12345 4345666 -1
and compares it against another file (probes.txt) to compute the mean of the matching values:
chr1 12345 12345 0.124
chr1 12346 12346 0.421
The code goes:
cat $file | while read line
do
first=$(echo $line | awk '{print $1}' )
second=$(echo $line | awk '{print $2}')
third=$(echo $line | awk '{print $3}')
logsum=$(awk -v first=$first -v second=$second -v third=$third '$1==first && $2>=second && $3<=third { sum += $4; n++ } END { print sum / n }' probes.txt)
echo "$line" "$logsum"
done
The output I am expecting would be:
chr1 12345 4345666 -1 0.232
but instead $logsum ends up at the front, overwriting part of $line:
0.232345 4345666 -1 0.232
I have also tried printf and get the same issue with
printf "%s %s \n" "$line" "$logsum"
I think the problem is the $logsum variable, as the output looks fine if I
echo "$logsum" "$line"
instead.
Does anyone know what is happening here and how to fix it?
Edit: I am working on a Mac, in case that is an issue.
Fixed with dos2unix.
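For anyone hitting the same thing, here is a quick way to spot and strip the carriage returns with standard tools (a sketch; tr is POSIX, so no dos2unix install is needed):
cat -v file.txt | head    # CRLF line endings show up as a trailing ^M on each line
tr -d '\r' < file.txt > file.unix.txt    # same effect as dos2unix for this case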

Apart from checking for \r\n characters, as suggested by @kvantour, I'd recommend doing all of this in a single AWK script. This will be more efficient.
Say you save this as script.awk:
NR == 1 { print $0, "logsum"; next }
{
    sum = 0; n = 0; avg = 0;
    while ((getline line < fn) > 0) {
        split(line, arr);
        if (arr[1] == $1 && arr[2] >= $2 && arr[3] <= $3) {
            sum += arr[4]; n++;
        }
    }
    close(fn);  # rewind fn so the next record re-reads probes.txt from the start
    if (n > 0) avg = sum / n;
    print $0, avg;
}
You can call it like this:
awk -v fn=probes.txt -f script.awk YOURFILE.txt
Example output:
Chr start stop value logsum
chr1 12345 4345666 -1 0.2725

Related

How to get a word, given an X word and a Y word?

Ch SSID BSSID Security Signal(%) W-Mode ExtCH NT
52 xxxxxx-F3F6BD 12:13:31:xx:xx:xx WPA2PSK/AES 100 11a/n/ac ABOVE In
112 ROGER 92:02:db:xx:xx:xx WPAPSKWPA2PSK/TKIPAES 73 11a/n/ac BELOW In
112 router 11:22:33:xx:xx:xx WPA2PSK/AES 73 11a/n/ac BELOW In
36 TIM-9xxxxx b4:a5:ef:xx:xx:xx WPA2PSK/AES 55 11a/n/ac ABOVE In
36 TIM-27xxxxxx 12:13:31:xx:xx:xx WPA2PSK/AES 44 11a/n/ac ABOVE In
I want to get a specific value/word given an X word and a Y word. The Y word is always from the first line: Ch SSID BSSID Security....
For example, if the X word is ROGER and the Y word is BSSID, I will get 92:02:db:xx:xx:xx.
If the X word is router and the Y word is Ch, I will get 112.
Try this awk program:
NR == 1 {
    # Memorize each column name from the first line, with its position
    i = 1
    while (i <= NF) {
        hash_cols[$i] = i
        i++
    }
    next
}
NR > 1 && $0 ~ S_PATTERN && hash_cols[S_COL] > 0 {
    print $hash_cols[S_COL]
}
Run it on the command line like this:
$ cat route.txt | awk -f route.awk -v S_PATTERN="ROGER" -v S_COL="BSSID"
92:02:db:xx:xx:xx
Here cat route.txt could be replaced by any command you want.
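For example, awk can also read the file directly, with no cat at all:
awk -f route.awk -v S_PATTERN="ROGER" -v S_COL="BSSID" route.txt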
Also, you could place the awk command in a shell script:
#!/bin/bash
cat \
| awk \
    -v S_PATTERN="$1" \
    -v S_COL="$2" \
    '
    NR == 1 {
        i = 1
        while (i <= NF) {
            hash_cols[$i] = i
            i++
        }
        next
    }
    NR > 1 && $0 ~ S_PATTERN && hash_cols[S_COL] > 0 {
        print $hash_cols[S_COL]
    }
    '
And launch it:
$ cat route.txt | ./route.sh "ROGER" "BSSID"
92:02:db:xx:xx:xx
Try this awk.
awk -v row="ROGER" -v column="BSSID" ' match($0,column);match($0,row) ' jing.txt | \
awk -v column="BSSID" ' {for(i=1;i<=NF;i++) if($i==column) break; getline; print $i } '
92:02:db:xx:xx:xx
Or even shorter
$ awk -v row="ROGER" ' NR==1;match($0,row) ' jing.txt |\
awk -v column="BSSID" ' {for(i=1;i<=NF;i++) if($i==column) break; getline; print $i } '
92:02:db:xx:xx:xx

Extract last part of the string when separator is not always the same using awk

I have a file with lines that look like this:
ID=4;Dbxref=766;Name=LOC2;gene_biotype=protein_coding
ID=5;Dbxref=800;Name=LOC3;gene_biotype=lncRNA
ID=6;Dbxref=900;Name=LOC4;gene_biotype=protein_coding;partial=true;start_range=.,338076
ID=7;Dbxref=905;Name=LOC5;gene_biotype=pseudogene;pseudo=true
I'm trying to grab the last part of the string, but the ending isn't always consistent.
I've tried:
while read -r line ; do
    ID=`echo $line | awk -F"ID=" '{print $2}' | awk -F";" '{print $1}'`
    Biotype=`echo $line | awk -F"gene_biotype=" '{print $2}'`
    echo -e $ID"\t"$Biotype >> file.txt
done < <(grep $'\tgene\t' originalfile.txt)
Biotype is the part that isn't working. Ideally the output would look like:
4 protein_coding
5 lncRNA
6 protein_coding;partial=true;start_range=.,338076
7 pseudogene;pseudo=true
I've also tried:
Biotype=`echo $line | awk -F"gene_biotype=" '{print $NF}'`
But it ends up saving nothing. Any advice appreciated ...
Using a sed that understands -E to use EREs (e.g. GNU sed or OSX/BSD sed):
$ sed -E 's/[^=]*=([^;]*)(;[^;]*){2}[^=]*=/\1\t/' file
4 protein_coding
5 lncRNA
6 protein_coding;partial=true;start_range=.,338076
7 pseudogene;pseudo=true
With any POSIX sed:
$ sed 's/[^=]*=\([^;]*\)\(;[^;]*\)\{2\}[^=]*=/\1\t/' file
4 protein_coding
5 lncRNA
6 protein_coding;partial=true;start_range=.,338076
7 pseudogene;pseudo=true
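If the regex is hard to read, here is the same substitution annotated piece by piece (my annotation, not part of the original answer):
sed -E 's/[^=]*=([^;]*)(;[^;]*){2}[^=]*=/\1\t/' file
#        [^=]*=       everything up to and including the first "=" ("ID=")
#        ([^;]*)      capture the ID value, e.g. "4"
#        (;[^;]*){2}  skip the next two ";tag=value" fields (Dbxref and Name)
#        [^=]*=       ";gene_biotype" up to and including its "="
# and replace the whole match with the captured ID plus a tab (\1\t)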
Also, here's a general approach for working with the kind of tag=value data you have: first create an array (f[] below) that maps each tag/name to its associated value; then you can access the values by name for comparisons, printing, etc.:
$ cat tst.awk
BEGIN { FS=";"; OFS="\t" }
{
    delete f
    for (i=1; i<=NF; i++) {
        tag = val = $i
        sub(/=.*/,"",tag)
        sub(/[^=]+=/,"",val)
        f[tag] = val
    }
    << do something with "f[tag]"s >>
}
which you could use to solve your current problem as:
$ cat tst.awk
BEGIN { FS=";"; OFS="\t" }
{
    delete f
    for (i=1; i<=NF; i++) {
        tag = val = $i
        sub(/=.*/,"",tag)
        sub(/[^=]+=/,"",val)
        f[tag] = val
    }
    sub(/.*;gene_biotype=/,"")
    print f["ID"], $0
}
$ awk -f tst.awk file
4 protein_coding
5 lncRNA
6 protein_coding;partial=true;start_range=.,338076
7 pseudogene;pseudo=true
but you can also do far more with it, including printing lines based on compound conditions over different values, printing columns in a different order than they were input, etc. For example:
$ cat tst.awk
BEGIN { FS=";"; OFS="\t" }
{
    delete f
    for (i=1; i<=NF; i++) {
        tag = val = $i
        sub(/=.*/,"",tag)
        sub(/[^=]+=/,"",val)
        f[tag] = val
    }
}
( (f["Dbxref"] > 800) && (f["partial"] == "true") ) || (f["gene_biotype"] == "protein_coding") {
    print f["Name"], f["ID"]
}
$ awk -f tst.awk file
LOC2 4
LOC4 6

Write a shell script to calculate the number of employees

I need to find the count of employees whose salary is less than the average salary of all employees.
The file with the employee details will be given as a command line argument when the script is run.
Example:
Input file:
empid;empname;salary
100;A;30000
102;B;45000
103;C;15000
104;D;40000
Output:
2
My solution:
f=`awk -v s=0 'BEGIN{FS=":"}{if(NR>1){s+=$3;row++}}END{print s/row}' $file`;
awk -v a="$f" 'BEGIN{FS=":"}{if(NR!=1 && $3<a)c++}END{print c}' $file;
This is what I have tried so far, but the output comes out to be:
0
This one-liner should solve the problem:
awk -F';' 'NR>1{e[$1]=$3;s+=$3}
END{avg=s/(NR-1);for(x in e)if(e[x]<avg)c++;print c}' file
If you run it with your example file, it will print:
2
Explanation:
NR>1 : skip the header.
e[$1]=$3; s+=$3 : build a hashtable and sum the salaries.
END{avg=s/(NR-1); : calculate the average.
for(x in e)if(e[x]<avg)c++;print c : go through the hashtable, count the elements whose value is < avg, and print the count.
Could you please try the following.
awk '
BEGIN{
    FS=";"
}
FNR==NR{
    if(FNR>1)
    {
        total+=$NF
        count++
    }
    next
}
FNR==1{
    avg=total/count
    next
}
$NF<avg{
    c++
}
END{
    print c
}
' Input_file Input_file
Your script is fine except it's setting FS=":"; it should be setting FS=";" since that is what is separating your fields in the input.
avg=$(awk -F";" 'NR>1 { s+=$3; i++ } END { print s/i }' f)
awk -v avg="$avg" -F";" 'NR>1 && $3<avg { c++ } END { print c }' f
1) Ignore the header and compute the average, avg.
2) Ignore the header and count the rows where the salary is less than avg.
file=$1
salary=`sed "s/;/ /" $file | sed "s/;/ /" | awk '{print $3}' | tail -n+2`
sum=0
n=0
for line in $salary
do
((sum+=line))
((n++))
done
avg=$((sum / n))
count=0
for line in $salary
do
if [ $line -lt $avg ]
then
((count++))
fi
done
echo "No. of Emp : $count"

Find a number from one file in a range of numbers from another file

I have these two input files:
file1
1 982444
1 46658343
3 15498261
2 238295146
21 47423507
X 110961739
17 7490379
13 31850803
13 31850989
file2
1 982400 982480
1 46658345 46658350
2 14 109
2 5000 9000
2 238295000 238295560
X 110961739 120000000
17 7490200 8900005
And this is my desired output:
Desired output:
1 982444
2 238295146
X 110961739
17 7490379
This is what I want: find the column 1 element of file1 in column 1 of file2. If they match, take the number in column 2 of file1 and check whether it falls in the range given by columns 2 and 3 of file2. If it does, print that line of file1 to the output.
Maybe this is a little confusing, but I'm doing my best. I have tried some things but I'm far from a solution, and any help will be really appreciated. In bash, awk or perl please.
Thanks in advance,
Just using awk. The solution doesn't loop through file1 repeatedly.
#!/usr/bin/awk -f
NR == FNR {
    # Processing file2, since NR still matches FNR.
    # Store the ranges from it in a[] and b[];
    # x[] counts the number of range pairs stored for each $1.
    i = ++x[$1]
    a[$1, i] = $2
    b[$1, i] = $3
    # Skip to the next record so the block below never processes a file2 record.
    next
}
{
    # Processing file1, since NR is now greater than FNR.
    # Start from the index of the last range and go down until we reach 0.
    # If $1 has no ranges, x[$1] is empty and the loop simply never runs.
    for (i = x[$1]; i; --i) {
        if ($2 >= a[$1, i] && $2 <= b[$1, i]) {
            # $2 is within range, so print the line.
            print
            # We're done with this record; skip to the next one.
            next
        }
    }
}
Usage:
awk -f script.awk file2 file1
Output:
1 982444
2 238295146
X 110961739
17 7490379
A similar approach using Bash (version 4.0 or newer):
#!/bin/bash
FILE1=$1 FILE2=$2
declare -A A B X
while read F1 F2 F3; do
    (( I = ++X[$F1] ))
    A["$F1|$I"]=$F2
    B["$F1|$I"]=$F3
done < "$FILE2"
while read -r LINE; do
    read F1 F2 <<< "$LINE"
    for (( I = X[$F1]; I; --I )); do
        if (( F2 >= A["$F1|$I"] && F2 <= B["$F1|$I"] )); then
            echo "$LINE"
            break    # already printed; no need to check the remaining ranges
        fi
    done
done < "$FILE1"
Usage:
bash script.sh file1 file2
Let's mix bash and awk:
while read col min max
do
awk -v col=$col -v min=$min -v max=$max '$1==col && min<=$2 && $2<=max' f1
done < f2
Explanation
For each line of file2, read the min and the max, together with the value of the first column.
Given these values, check file1 for lines whose first column matches and whose second column lies in the range specified by file2. Note that file1 is re-read once per line of file2, which is fine for small inputs.
Test
$ while read col min max; do awk -v col=$col -v min=$min -v max=$max '$1==col && min<=$2 && $2<=max' f1; done < f2
1 982444
2 238295146
X 110961739
17 7490379
Pure bash, based on fedorqui's solution:
#!/bin/bash
while read col_2 min max
do
    while read col_1 val
    do
        (( col_1 == col_2 && ( min <= val && val <= max ) )) && echo $col_1 $val
    done < file1
done < file2
cut -d' ' -f1 input2 | sed 's/^/^/;s/$/\\s/' | \
grep -f - <(cat input2 input1) | sort -n -k1 -k3 | \
awk 'NF==3 {
split(a,b,",");
for (v in b)
if ($2 <= b[v] && $3 >= b[v])
print $1, b[v];
if ($1 != p) a=""}
NF==2 {p=$1;a=a","$2}'
Produces:
X 110961739
1 982444
2 238295146
17 7490379
Here's a Perl solution. It could be much faster but less concise if I built a hash out of file2, but this should be fine.
use strict;
use warnings;
use autodie;
my @bounds = do {
open my $fh, '<', 'file2';
map [ split ], <$fh>;
};
open my $fh, '<', 'file1';
while (my $line = <$fh>) {
my ($key, $val) = split ' ', $line;
for my $bound (@bounds) {
next unless $key eq $bound->[0] and $val >= $bound->[1] and $val <= $bound->[2];
print $line;
last;
}
}
Output:
1 982444
2 238295146
X 110961739
17 7490379

bash, find nearest next value, forward and backward

I have a data.txt file, with columns numbered as follows:
1 2 3 4 5 6 7
cat data.txt
13 245 1323 10.1111 10.2222 60.1111 60.22222
13 133 2325 11.2222 11.333 61.2222 61.3333
13 245 1323 12.3333 12.4444 62.3333 62.44444444
13 245 1323 13.4444 13.5555 63.4444 63.5555
Find next nearest: my target value is 11.6667, and it should find the nearest next value in column 4, which is 12.3333.
Find previous nearest: my target value is 62.9997, and it should find the nearest previous value in column 6, which is 62.3333.
I am able to find the next nearest (case 1) with:
awk -v c=4 -v t=11.6667 '{a[NR]=$c}END{
    asort(a); d=a[NR]-t; d=d<0?-d:d; v=a[NR]
    for(i=NR-1;i>=1;i--){
        m=a[i]-t; m=m<0?-m:m
        if(m<d){
            d=m; v=a[i]
        }
    }
    print v
}' f
12.3333
Is there any bash solution for finding the previous nearest (case 2)?
Try this:
$ cat tst.awk
{
    if ($fld > tgt) {
        del = $fld - tgt
        if ( (del < minGtDel) || (++gtHit == 1) ) {
            minGtDel = del
            minGtVal = $fld
        }
    }
    else if ($fld < tgt) {
        del = tgt - $fld
        if ( (del < minLtDel) || (++ltHit == 1) ) {
            minLtDel = del
            minLtVal = $fld
        }
    }
    else {
        minEqVal = $fld
    }
}
END {
    print (minGtVal == "" ? "NaN" : minGtVal)
    print (minLtVal == "" ? "NaN" : minLtVal)
    print (minEqVal == "" ? "NaN" : minEqVal)
}
$ awk -v fld=4 -v tgt=11.6667 -f tst.awk file
12.3333
11.2222
NaN
$ awk -v fld=6 -v tgt=62.9997 -f tst.awk file
63.4444
62.3333
NaN
$ awk -v fld=6 -v tgt=62.3333 -f tst.awk file
63.4444
61.2222
62.3333
For the first part:
awk -v v1="11.6667" '$4>v1 {print $4;exit}' file
12.3333
And second part:
awk -v v2="62.9997" '$6>v2 {print p;exit} {p=$6}' file
62.3333
Both in one go (like the two one-liners above, this assumes the column values are sorted in ascending order):
awk -v v1="11.6667" -v v2="62.9997" '$4>v1 && !p1 {p1=$4} $6>v2 && !p2 {p2=p} {p=$6} END {print p1,p2}' file
12.3333 62.3333
I don't know if this is what you're looking for, but this is what I came up with, not knowing awk:
#!/bin/bash
IFSBAK=$IFS
IFS=$'\n'
best=
for line in `cat $1`; do
    IFS=$' \t'
    arr=($line)
    num=${arr[5]}
    if [ $(bc <<< "$num < 62.9997") -eq 1 ]; then
        # initialise best here, so a first value above the target is never kept
        if [ -z "$best" ] || [ $(bc <<< "$best < $num") -eq 1 ]; then
            best=$num
        fi
    fi
    IFS=$'\n'
done
IFS=$IFSBAK
echo $best
If you want, you can pass the column and the target value 62.9997 as parameters; I hardcoded them here to demonstrate that it finds specifically what you asked for.
Edit: removed the assumption that the file is sorted.
Your solution looks unnecessarily complicated (storing a whole array and sorting it), and I think you would see the bash solution if you re-thought your awk.
In awk you can detect the first line with
FNR==1 {do something}
so on the first line, set a variable BestYet to the value in the column you are searching.
On subsequent lines, simply test if the value in the column you are checking is
a) less than your target AND
b) greater than `BestYet`
if it is, update BestYet. At the end, print BestYet (see the sketches below).
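A minimal sketch of that logic in awk, reusing column 6 and the target 62.9997 from case 2 (my sketch, not the answerer's code; it initialises BestYet lazily rather than from line 1, so a first value above the target can't poison it):
awk -v c=6 -v t=62.9997 '
$c < t && (best == "" || $c > best) { best = $c }  # below target and closer than BestYet
END { print best }
' data.txt
which prints 62.3333 for the sample data.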
In bash, apply the same logic, but read each line into a bash array and use ${a[n]} to get the n'th element.
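And a rough bash rendering of the same idea (again a sketch of mine, hardcoding column 6 and target 62.9997, with bc doing the float comparisons as in the earlier bc-based answer):
#!/bin/bash
best=
while read -r -a a; do
    num=${a[5]}    # ${a[5]} is the 6th field (bash arrays are 0-indexed)
    if [ "$(bc <<< "$num < 62.9997")" -eq 1 ]; then
        if [ -z "$best" ] || [ "$(bc <<< "$best < $num")" -eq 1 ]; then
            best=$num    # below the target and larger than BestYet so far
        fi
    fi
done < data.txt
echo $best    # 62.3333 for the sample data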
