awk script for decimal values - bash

I am using this script to extract lines where column 7 is < 1.0E-08 AND
column eight (a comma-separated list) has one or more values > a cutoff such as 0.2 or 0.3.
Is this the right approach?
InputFile: head -2 test.txt
A2 DR28 P3379 72 7 5.008 8.252e-14 0.05132,0.04248,0.002704,0.116,0.04439,0.2,0.3
A2 DR28 P3379 72 7 5.008 0.05 0.05132,0.04248,0.002704,0.116,0.04439,0.006,0.004
Script: first I did
awk '{if($7 < 1.0E-08 || $8 > 0.2) print}' test.txt
This gives the first line as output, but I want to use && (AND) instead of || (OR).
When I use AND (&&)
awk '{if($7 < 1.0E-08 && $8 > 0.2) print}' test.txt
there is no result, though line one fits the criteria.
I also tried this, but here only column eight is considered as a cut-off point:
awk -F',' '$8 > 0.2' test.txt
This script works fine, but I need to consider column 7 too; I get only a few lines in the output, so I just want to make sure that I am not missing anything.

Not tested, but something like this should work. (Your && attempt fails because, with the default field separator, $8 is the whole comma-separated list, and comparing that string against 0.2 is a string comparison rather than a numeric one; the list has to be split into its values first.)
$ awk 'function anyGreater(x,v) {
    n=split(x,f8,",");
    for(i=1;i<=n;i++) if(f8[i]>v) return 1;
    return 0}
$7<1.0E-08 && anyGreater($8,0.2)' file
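For the record, a quick sanity check (hedged, since the snippet above is untested): against the two sample records, only the first should print, because 8.252e-14 < 1.0E-08 and its list contains 0.3 > 0.2.
$ awk 'function anyGreater(x,v){n=split(x,f8,",");for(i=1;i<=n;i++)if(f8[i]>v)return 1;return 0} $7<1.0E-08 && anyGreater($8,0.2)' test.txt
A2 DR28 P3379 72 7 5.008 8.252e-14 0.05132,0.04248,0.002704,0.116,0.04439,0.2,0.3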

Related

awk to get first column if a specific number in the line is greater than a digit

I have a data file (file.txt) containing the lines below:
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=22:00,dom=sss.co.uk,user2=lis
I'm expecting to get the first column ($1) only if the ETA= number is greater than 15; here, only the first column of the 2nd and 3rd lines is expected:
345
456
I tried cat file.txt | awk -F [,TPF=]' '{print $1}' but it prints the whole line, which has ETA at the end.
Using awk
$ awk -F"[=, ]" '{for (i=1;i<NF;i++) if ($i=="ETA") if ($(i+1) > 15) print $1}' input_file
345
456
With your shown samples, please try the following GNU awk code. It uses GNU awk's match function with the regex (^[0-9]+).*\<ETA=([0-9]+):[0-9]+, which creates 2 capturing groups and saves their values into the array arr. Then, if the 2nd element of arr is greater than 15, the 1st element of arr is printed, as per the requirement.
awk '
match($0,/(^[0-9]+).*\<ETA=([0-9]+):[0-9]+/,arr) && arr[2]+0>15{
print arr[1]
}
' Input_file
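Run against the sample file, this should print the following (untested, but it follows from the sample data, where only the 2nd and 3rd lines have an ETA hour above 15):
345
456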
I would harness GNU AWK for this task in the following way. Let the file.txt content be
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=02:00,dom=sss.co.uk,user2=lis
then
awk 'substr($0,index($0,"ETA=")+4,2)+0>15{print $1}' file.txt
gives output
345
Explanation: I use the string functions index, to find where ETA= starts, and substr, to take the 2 characters after ETA= (4 is used because ETA= is 4 characters long and index gives its start position). I use +0 to convert the result to a number and then compare it with 15. Disclaimer: this solution assumes every row has ETA= followed by exactly 2 digits.
(tested in GNU Awk 5.0.1)
Whenever input contains tag=value pairs as yours does, it's best to first create an array of those mappings (v[] below); then you can just access the values by their tags (names):
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
v["ETA"]+0 > 15 {
    print $1
}
$ awk -f tst.awk file
345
456
With that approach you can trivially enhance the script in future to access whatever values you like by their names, test them in whatever combinations you like, output them in whatever order you like, etc. For example:
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
(v["pro"] ~ /b/) && (v["ETA"]+0 > 15) {
    print $1, v["team"], v["dom"]
}
$ awk -f tst.awk file
345,abc,sbc.int
456,efg,sss.co.uk
Think about how you'd enhance any other solution to do the above or anything remotely similar.
It's unclear why you think your attempt would do anything of the sort. Your attempt uses a completely different field separator and does not compare anything against the number 15.
You'll also want to get rid of the useless use of cat.
When you specify a field separator with -F, that changes what the first column $1 actually means; it is then everything before the first occurrence of the separator. So you probably want to separately split the line to obtain the first space-separated column.
awk -F 'ETA=' '$2+0 > 15 { split($0, n, /[ \t]+/); print n[1] }' file.txt
The value in $2 will be the data after the first separator (and up until the next one); on the first line, that's 12:00, team=xyz,user1=tom,dom=dby.com. The +0 forces a numeric conversion, which takes the leading digits and ignores the non-numeric text after them, so we are effectively checking whether 12 is larger than 15 (which is obviously false). Without the +0, awk would compare the field as a string against "15"; that happens to give the same result for zero-padded two-digit hours, but would misclassify a value like 9:00.
When the condition is true, we split the original line $0 into the array n on sequences of whitespace, and then print the first element of this array.
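If the conversion rule is unfamiliar, here is a quick illustration (a sketch, assuming any POSIX-compatible awk): adding 0 converts a string to a number using its leading digits and discards the rest.
$ echo '23:00, team=abc' | awk '{ print $1 + 0 }'
23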
Using awk, you could match ETA= followed by 1 or more digits, then take the match without the ETA= part, check if the number is greater than 15, and print the first field.
awk 'match($0, /ETA=[0-9]+/) {
if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
Output
345
456
If the first field should start with a number:
awk '/^[0-9]/ && match($0, /ETA=[0-9]+/) {
if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file

Loop to create a DF from values in bash

I'm creating various text files from a file like this:
Chrom_x,Pos,Ref,Alt,RawScore,PHRED,ID,Chrom_y
10,113934,A,C,0.18943,5.682,rs10904494,10
10,126070,C,T,0.030435000000000007,3.102,rs11591988,10
10,135656,T,G,0.128584,4.732,rs10904561,10
10,135853,A,G,0.264891,6.755,rs7906287,10
10,148325,A,G,0.175257,5.4670000000000005,rs9419557,10
10,151997,T,C,-0.21169,0.664,rs9286070,10
10,158202,C,T,-0.30357,0.35700000000000004,rs9419478,10
10,158946,C,T,2.03221,19.99,rs11253562,10
10,159076,G,A,1.403107,15.73,rs4881551,10
What I am trying to do is extract, in bash, all values between two values:
gawk '$6>=0 && $NF<=5 {print $0}' file.csv > 0_5.txt
And create files from 6 to 10, from 11 to 15, ... from 95 to 100. I was thinking of creating a loop for this, with something like
#!/usr/bin/env bash
n=( 0,5,6,10...)
if i in n:
gawk '$6>=n && $NF<=n+1 {print $0}' file.csv > n_n+1.txt
and so on.
How can I convert this into a loop and create files with these specific values?
While you could use a shell loop to provide inputs to an awk script, you could also just use awk to natively split the values into buckets and write the lines to those "bucket" files itself:
awk -F, 'NR > 1 {
    i = int(($6 - 1) / 5)
    fname = (i*5) "_" (i+1)*5 ".txt"
    print $0 > fname
}' < input
The code skips the header line (NR > 1) and then computes a "bucket index" from the value in column six: one is subtracted before dividing by five, so that a boundary value such as 5 still lands in the 0_5 bucket. The filename is then constructed by multiplying that index (and its increment) by five, and the whole line is printed to that file.
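With the nine sample rows above, the PHRED values span four buckets, so a run would leave behind four files (a hedged sketch; the names follow the scheme in the script):
$ ls *_*.txt
0_5.txt  10_15.txt  15_20.txt  5_10.txt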
To use a shell loop (and call awk 20 times on the input), you could use something like this:
for((i=0; i <= 19; i++))
do
floor=$((i * 5))
ceiling=$(( (i+1) * 5))
awk -F, -v floor="$floor" -v ceiling="$ceiling" \
'NR > 1 && $6 >= floor && $6 < ceiling { print }' < input \
> "${floor}_${ceiling}.txt"
done
The basic idea is the same; here, we're creating the bucket index with the outer loop and then passing the range into awk as the floor and ceiling variables. We're only asking awk to print the matching lines; the output from awk is captured by the shell as a redirection into the appropriate file.

Average of first ten numbers of text file using bash

I have a file of two columns. The first column contains dates and the second a corresponding number; the two columns are separated by a comma. I want to take the average of the first three numbers and print it to a new file, then do the same for the 2nd-4th numbers, then the 3rd-5th, and so on. For example:
File1
date1,1
date2,1
date3,4
date4,1
date5,7
Output file
2
2
4
Is there any way to do this using awk or some other tool?
Input
akshay@db-3325:/tmp$ cat file.txt
date1,1
date2,1
date3,4
date4,1
date5,7
akshay@db-3325:/tmp$ awk -v n=3 -v FS=, '{
    x = $2
    i = NR % n               # cyclic index into the last-n window
    ma += (x - q[i]) / n     # add the new value's share, drop the oldest one's
    q[i] = x                 # overwrite the oldest value with the new one
    if (NR >= n) print ma    # once the window is full, print the moving average
}' file.txt
2
2
4
Or the one below, which is useful for plotting: it keeps the reference axis (in your case, the date) at the center of each averaging window.
Script
akshay@db-3325:/tmp$ cat avg.awk
BEGIN {
    m = int((n+1)/2)          # offset of the row at the center of the window
}
{ L[NR]=$2; sum+=$2 }         # remember the value and add it to the running sum
NR >= m { d[++i] = $1 }       # collect the date at the center of each window
NR > n { sum -= L[NR-n] }     # drop the value that just left the window
NR >= n {
    a[++k] = sum/n            # window full: store the average
}
END {
    for (j=1; j<=k; j++)
        print d[j],a[j] # remove d[j], if you just want values only
}
Output
akshay@db-3325:/tmp$ awk -v n=3 -v FS=, -v OFS=, -f avg.awk file.txt
date2,2
date3,2
date4,4
$ awk -F, '{a[NR%3]=$2} (NR>=3){print (a[0]+a[1]+a[2])/3}' file
2
2
4
A little math trick here: $2 is stored in a[NR%3] for each record, so the value in each element is updated cyclically, and the sum of a[0], a[1] and a[2] is always the sum of the past 3 numbers.
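To see the cyclic indexing at work, here is a quick illustration (not part of the original answer) of how NR%3 keeps revisiting the same three slots:
$ awk -F, '{ print NR, NR%3, $2 }' file
1 1 1
2 2 1
3 0 4
4 1 1
5 2 7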
Updated based on helpful feedback from Ed Morton.
Here's a quick-and-dirty script to do what you've asked for. It doesn't have much flexibility, but you can easily figure out how to extend it.
To run it, save it into a file and execute it as an awk script, either with a shebang line or by calling awk -f.
// {
Numbers[NR]=$2;
if ( NR >= 3 ) {
printf("%i\n", (Numbers[NR] + Numbers[NR-1] + Numbers[NR-2])/3)
}
}
BEGIN {
FS=","
}
Explanation:
Line 1: match all lines; // is the match operator, and here it is an empty match, which means "do this thing on every line".
Line 2: use the Record Number (NR) as the key and store the value from column 2.
Line 3: check whether we have read 3 or more values from the file.
Line 4: do the maths and print as an integer.
BEGIN block: change the Field Separator to a comma ",".
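For example, assuming the script above is saved as moving_avg.awk (a hypothetical name):
$ awk -f moving_avg.awk File1
2
2
4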

Filter a file using shell script tools

I have a file whose contents are
E006:Jane:HR:9800:Asst
E005:Bob:HR:5600:Exe
E002:Barney:Purc:2300:PSE
E009:Miffy:Purc:3600:Mngr
E001:Franny:Accts:7670:Mngr
E003:Ostwald:Mrktg:4800:Trainee
E004:Pearl:Accts:1800:SSE
E009:Lala:Mrktg:6566:SE
E018:Popoye:Sales:6400:QAE
E007:Olan:Sales:5800:Asst
I want to list all employees whose emp codes are between E001 and E018, using a command including pipes. Is it possible?
Use sed:
sed -n -e '/^E001:/,/^E018:/p' data.txt
That is, print the lines that are literally between those lines that start with E001 and E018.
If you want to get the employees that are numerically between those, one way to do that would be to do the comparisons inline using something like awk (as suggested by hochl). Or, you could take this approach preceded by a sort, if the lines are not already sorted (a plain lexical sort works here because the codes are zero-padded to the same width):
sort data.txt | sed -n -e '/^E001:/,/^E018:/p'
You can use awk for such cases:
$ gawk 'BEGIN { FS=":" } /^E([0-9]+)/ { n=substr($1, 2)+0; if (n >= 6 && n <= 18) { print } }' < data.txt
E006:Jane:HR:9800:Asst
E009:Miffy:Purc:3600:Mngr
E009:Lala:Mrktg:6566:SE
E018:Popoye:Sales:6400:QAE
E007:Olan:Sales:5800:Asst
Is that the result you want? This example intentionally prints only the employees between 6 and 18, to show that it filters out records. You may print only some fields, using $1 or $2, as in print $1 " " $2.
You can try something like this: cut -b2- | awk '{ if ($1+0 < 18) print "E" $0 }' (the +0 forces the comparison to be numeric on the leading digits).
Just do a string comparison. Since all your sample data matches, I changed the boundaries for illustration:
awk -F: '"E004" <= $1 && $1 <= "E009" {print}'
Output
E006:Jane:HR:9800:Asst
E005:Bob:HR:5600:Exe
E009:Miffy:Purc:3600:Mngr
E004:Pearl:Accts:1800:SSE
E009:Lala:Mrktg:6566:SE
E007:Olan:Sales:5800:Asst
You can pass the strings as variables if you don't want to hard-code them in the awk script
awk -F: -v start=E004 -v stop=E009 'start <= $1 && $1 <= stop {print}'
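One caveat worth adding (not from the original answer): the string comparison relies on the codes being zero-padded to a fixed width; without the padding, lexical and numeric order disagree. A quick illustration:
$ awk 'BEGIN { print ("E4" <= "E09"), ("E04" <= "E09") }'
0 1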

Delete lines containing a range pattern in 4th column

In a file, the 4th column contains floating point numbers:
dsfsd sdfsd sdfds 4.5 dfsdfsd
I want to delete the entire line if the number is between -0.1 and 0.1 (or some other range).
Can sed or awk do that for me?
thanks
I recommend using the "pattern { action }" syntax:
awk '($4 < -0.1) || ($4 > 0.1) {print}' test.txt
Or, even more concisely:
awk '($4 < -0.1) || ($4 > 0.1)' test.txt
since {print} is the default action. I've assumed that you have a file "test.txt" containing your data.
awk:
{ if ($4 > 0.1 || $4 < -0.1) print $0 }
