Min and max coordinates for the same value in a different column - bash

I have a question: I am trying to write a script to process my data and I am totally lost.
INPUT:
1 BR.100.200
2 BR.100.200
3 BR.100.200
4 BR.100.200
1 BAL.11.235
2 BAL.11.235
3 BAL.11.235
1 JOJ.21.354
2 JOJ.21.354
OUTPUT :
BR.100.200 1 4
BAL.11.235 1 3
JOJ.21.354 1 2
What I want: for rows where $2 holds the same value, print that value together with the minimal and maximal values of $1. I would prefer awk, but bash or sed is fine too.
Thank you,
Filip

This could probably be improved, but it works:
awk '!x[$2]{x[$2]=$1}y[$2]<$1{y[$2]=$1}x[$2]>$1{x[$2]=$1}END{for(i in y)print i,x[i],y[i]}' file
More readable
awk '!min[$2]{min[$2]=$1} max[$2]<$1{max[$2]=$1} min[$2]>$1{min[$2]=$1} END{for(i in max)print i, min[i], max[i]}' file
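One caveat worth noting: the pattern !min[$2] also fires when the stored minimum happens to be 0, so if column 1 can legitimately contain 0 (not the case in the sample data), a membership test is safer. A minimal variant using the in operator (note that for (i in min) visits keys in an unspecified order):
awk '!($2 in min) { min[$2] = max[$2] = $1 }
     $1 < min[$2] { min[$2] = $1 }
     $1 > max[$2] { max[$2] = $1 }
     END { for (i in min) print i, min[i], max[i] }' file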

#!/usr/bin/awk -f
# Skip blank lines.
NF == 0 { next }
# Key already seen: update its running min/max.
$2 in min {
    if ($1 < min[$2]) {
        min[$2] = $1
    } else if ($1 > max[$2]) {
        max[$2] = $1
    }
    next
}
# First occurrence of a key: initialise min and max, remember input order.
{
    min[$2] = max[$2] = $1
    keys[i++] = $2
}
# Print the keys in the order they were first seen.
END {
    for (i = 0; i in keys; ++i) {
        key = keys[i]
        if (i) {
            print ""    # blank line between groups
        }
        printf "%s\t%s\t%s\n", key, min[key], max[key]
    }
}
Run with:
awk -f script.awk your_file.txt
Output:
BR.100.200 1 4
BAL.11.235 1 3
JOJ.21.354 1 2

This variant streams the file and assumes the input is already grouped by the second column, as in the example:
awk '$2 != prev { if (NR > 1) print prev, min, max; prev = $2; min = max = $1; next }
     { if ($1 < min) min = $1; if ($1 > max) max = $1 }
     END { if (NR) print prev, min, max }' inputfile
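If the input is not already grouped, sorting by the second column (and numerically by the first) first makes the same streaming approach safe, at the cost of the groups appearing in sorted rather than original order:
sort -k2,2 -k1,1n file | awk '
    $2 != prev { if (NR > 1) print prev, min, max; prev = $2; min = $1 }
    { max = $1 }
    END { if (NR) print prev, min, max }'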

Awk separate column output

What I want to do is create a table (maximum 4 rows) from a one-column file using awk.
I have a file:
1 a,b
2 r,i
3 w
4 r,t
5 o,s
6 y
The desired output:
1 a,b 5 o,s
2 r,i 6 y
3 w
4 r,t
So far I have just been separating the rows into different files and using paste to combine them into one. I would appreciate any more sophisticated method.
$ cat tst.awk
BEGIN {
numRows = 4
OFS = "\t"
}
{
rowNr = (NR - 1 ) % numRows + 1
if ( rowNr == 1 ) {
numCols++
}
val[rowNr,numCols] = $0
}
END {
for (rowNr=1; rowNr<=numRows; rowNr++) {
for (colNr=1; colNr<=numCols; colNr++) {
printf "%s%s", val[rowNr,colNr], (colNr<numCols ? OFS : ORS)
}
}
}
$
$ awk -f tst.awk file
1 a,b 5 o,s
2 r,i 6 y
3 w
4 r,t
Combination of awk to join lines and column to pretty-print them:
awk -v max=4 '
{ i = (NR-1) % max + 1; line[i] = line[i] "\t" $0 }
END { for(i=1; i<=max && i<=length(line); i++) print line[i] }' file | column -t -s $'\t'
Output:
1 a,b 5 o,s
2 r,i 6 y
3 w
4 r,t
Another:
$ awk ' {
i=(NR%4) # using modulo indexed array
a[i]=a[i] (a[i]==""?"":" ") $0 # append to it
}
END { # in the END
for(i=1;i<=4;i++) # loop all indexes in order
print a[i%4] # don't forget the modulo
}' file
1 a,b 5 o,s
2 r,i 6 y
3 w
4 r,t
Naturally it will be ugly if there are missing columns.
Here is another awk approach:
awk '
{
A[++c] = $0
}
END {
m = sprintf ( "%.0f", ( c / 4 ) )
for ( i = 1; i <= 4; i++ )
{
printf "%s\t", A[i]
for ( j = 1; j <= m; j++ )
printf "%s\t", A[i+(j*4)]
printf "\n"
}
}
' file
you can combine split and paste
split -l 4 file part- && paste part-*
-l <number> tells split to break the file into smaller files of <number> lines each.
part- is a prefix of our choice for the new files. Note that they are named in alphabetical order, e.g. part-aa, part-ab, etc., so paste will join them in the expected order.
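If you don't need the pieces afterwards, the temporary files can be cleaned up in the same line (assuming no other part-* files exist in the directory):
split -l 4 file part- && paste part-* && rm -f part-*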

find difference and similarities between two text files using awk

I have two files:
file 1
1
2
34:rt
4
file 2
1
2
34:rt
7
I want to display rows that are in file 2 but not in file 1, vice versa, and the values common to both files. So the expected result should look like:
1 in both
2 in both
34:rt in both
4 in file 1
7 in file 2
This is what I have so far but I am not sure if this is the right structure:
awk '
FNR == NR {
    a[$0]++
    next
}
!($0 in a) {
    # print not in file 1
}
($0 in a) {
    for (i = 0; i <= NR; i++) {
        if (a[i] == $0) {
            # print same in both
        }
    }
    delete a[$0]    # delete entries that have been processed
}
END {
    for (rest in a) {
        # print not in file 2
    }
}' "$PWD/file1" "$PWD/file2"
Any suggestions?
If the order is not relevant then you can do:
awk '
NR==FNR { a[$0]++; next }
{
print $0, ($0 in a ? "in both" : "in file2");
delete a[$0]
}
END {
for(x in a) print x, "in file1"
}' file1 file2
1 in both
2 in both
34:rt in both
7 in file2
4 in file1
Or using comm as suggested by choroba in comments:
comm --output-delimiter="|" file1 file2 |
awk -F'|' '{print (NF==3 ? $NF " in both" : NF==2 ? $NF " in file2" : $NF " in file1")}'
1 in both
2 in both
34:rt in both
4 in file1
7 in file2
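Note that comm requires both inputs to be sorted; if yours are not, bash process substitution handles that inline:
comm --output-delimiter='|' <(sort file1) <(sort file2) |
awk -F'|' '{print (NF==3 ? $NF " in both" : NF==2 ? $NF " in file2" : $NF " in file1")}'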

Separating and counting number of elements in a list with conditions

I would like to separate and count the number of elements within my input list.
The input.txt contains two columns: $1 is the element ID and $2 is its ratio (a number).
ENSG001 12.3107448237
ENSG007 4.3602275
ENSG008 2.9918420285
ENSG009 1.035588
ENSG010 0.999864
ENSG012 0.569833
ENSG013 0.495325
ENSG014 0.253893
ENSG015 0.125389
ENSG017 0.012568
ENSG018 -0.135689
ENSG020 -0.4938497942
ENSG022 -0.6429221854
ENSG024 -1.1759339381
ENSG029 -4.2722999766
ENSG030 -11.8447513281
I want to separate the ratios into the following categories:
Greater than or equal to 2
Between 1 and 2
Between 0.5 and 1
Between -0.5 and 0.5
Between -1 and -0.5
Between -2 and -1
Less than or equal to -2
and then print the count from each category into a single separate output file results.txt:
Total 16
> 2 3
1 to 2 1
0.5 to 1 2
-0.5 to 0.5 6
-0.5 to -1 1
-1 to -2 1
< -2 2
I can do this on the command line using the following:
awk '$2 > 2 {print $1,$2}' input.txt | wc -l
awk '$2 > 1 && $2 < 2 {print $1,$2}' input.txt | wc -l
awk '$2 > 0.5 && $2 < 1 {print $1,$2}' input.txt | wc -l
awk '$2 > -0.5 && $2 < 0.5 {print $1,$2}' input.txt | wc -l
awk '$2 > -1 && $2 < -0.5 {print $1,$2}' input.txt | wc -l
awk '$2 > -2 && $2 < -1 {print $1,$2}' input.txt | wc -l
awk '$2 < -2 {print $1,$2}' input.txt | wc -l
I think there is a quicker way of doing this with a shell script using a while or for loop, but I don't know how. Any suggestions would be brilliant.
You can just process the file once; the straightforward way would be:
awk '$2>=2{a++;next}
$2>0.5 && $2 <1 {b++;next}
$2>-0.5 && $2 <0.5 {c++;next}
...
$2<=-2{x++;next}
END{print "total:",NR;
print ">2:",a;
print "1-2:",b;
...
print "<-2:",x
}' file
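For reference, one way to fill in the elided ranges (values exactly on a boundary go to the higher bucket here, matching the >= tests; the +0 forces empty buckets to print as 0):
awk '$2 >= 2    {a++; next}
     $2 >= 1    {b++; next}
     $2 >= 0.5  {c++; next}
     $2 >= -0.5 {d++; next}
     $2 >= -1   {e++; next}
     $2 > -2    {f++; next}
                {g++}
     END {
         print "Total", NR
         print "> 2", a+0
         print "1 to 2", b+0
         print "0.5 to 1", c+0
         print "-0.5 to 0.5", d+0
         print "-0.5 to -1", e+0
         print "-1 to -2", f+0
         print "< -2", g+0
     }' input.txt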
You could simply sort the entries numerically, using sort, and later count the number of entries in each interval. For example, considering your input:
cut -f 2 -d ' ' input.txt | sort -nr | awk '
BEGIN { split("2 1 0.5 -0.5 -1 -2", inter); i = 1; }
{
if (i > 6) { ++c; next; }
if ($1 >= inter[i]) ++c;
else if (i == 1) { print c, "greater than", inter[i++]; c = 1; }
else { print c, "between", inter[i - 1], "and", inter[i++]; c = 1; }
}
END { print c, "lower than", inter[i - 1]; }'
If your input is already sorted, you may even shorten your command line, using:
awk 'BEGIN { split("2 1 0.5 -0.5 -1 -2", inter); i = 1; }
{
if (i > 6) { ++c; next; }
if ($2 >= inter[i]) ++c;
else if (i == 1) { print c, "greater than", inter[i++]; c = 1; }
else { print c, "between", inter[i - 1], "and", inter[i++]; c = 1; }
}
END { print c, "lower than", inter[i - 1]; }' input.txt
And the resulting output, which you may format as you wish:
3 greater than 2
1 between 2 and 1
2 between 1 and 0.5
6 between 0.5 and -0.5
1 between -0.5 and -1
1 between -1 and -2
2 lower than -2
One approach would be to implement this with a single awk command by maintaining a running count for each category you are interested in.
#!/bin/bash
if [ $# -ne 1 ]
then
echo "Usage: $0 INPUT"
exit 1
fi
awk ' {
if ($2 > 2) count[0]++
else if ($2 > 1) count[1]++
else if ($2 > 0.5) count[2]++
else if ($2 > -0.5) count[3]++
else if ($2 > -1) count[4]++
else if ($2 > -2) count[5]++
else count[6]++
} END {
print " > 2\t", count[0]
print " 1 to 2\t", count[1]
print " 0.5 to 1\t", count[2]
print "-0.5 to 0.5\t", count[3]
print "-1 to -0.5\t", count[4]
print "-2 to -1\t", count[5]
print " < -2\t", count[6]
}' "$1"
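Saved under a name of your choosing, say count_ratios.sh (an illustrative name), and made executable, the wrapper runs as:
chmod +x count_ratios.sh
./count_ratios.sh input.txt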
awk -f script.awk input.txt
with script.awk:
{
if ($2>=2) counter1++
else if ($2>=1) counter2++
else if ($2>=0.5) counter3++
else if ($2>=-0.5) counter4++
else if ($2>=-1) counter5++
else if ($2>=-2) counter6++
else counter7++
}
END{
print "Greater than or equal to 2: "counter1
print "Between 1 and 2: "counter2
print "Between 0.5 and 1: "counter3
print "Between -0.5 and 0.5: "counter4
print "Between -1 and -0.5: "counter5
print "Between -2 and -1: "counter6
print "Less than -2: "counter7
}
The script, saved as /tmp/toto:
awk '
$2>=2 { count[1]++; label[1]="Greater than or equal to 2"; }
($2>1 && $2<2) { count[2]++; label[2]="Between 1 and 2"; }
($2>0.5 && $2<=1) { count[3]++; label[3]="Between 0.5 and 1"; }
($2>-0.5 && $2<=0.5) { count[4]++; label[4]="Between -0.5 and 0.5"; }
($2>-1 && $2<=-0.5) { count[5]++; label[5]="Between -1 and -0.5"; }
($2>-2 && $2<=-1) { count[6]++; label[6]="Between -2 and -1"; }
$2<=-2 { count[7]++; label[7]="Less than or equal to -2"; }
END { for (i=1;i<=7;i++)
{ printf "%-30s %s\n" ,label[i], count[i];
}
}
' /tmp/input.txt
and the result:
. /tmp/toto
Greater than or equal to 2 3
Between 1 and 2 1
Between 0.5 and 1 2
Between -0.5 and 0.5 6
Between -1 and -0.5 1
Between -2 and -1 1
Less than or equal to -2 2

Using AWK, find the smallest number in the second column bigger than x

I have a file with two columns,
sdfsd 1.3
sdfds 3
sdfsdf 2.1
dsfsdf -1
if x is 2, I want to print:
sdfsdf 2.1
How can I express this in awk? (bash or sed is fine too)
It's awfully tempting to do this:
sort -k 2 -g | awk '$2 >= 2 { print; exit }'
Tested and works on your example. If no second column is at least 2, it prints nothing.
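(sort -g does a general numeric sort and is a GNU extension; for plain decimals like these, the POSIX -n flag gives the same order:
sort -k 2 -n file | awk '$2 >= 2 { print; exit }')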
Or as an awk script (note that the opening brace of the END block must be on the same line as END):
BEGIN {
    min = 0          # sentinel: no candidate found yet
    mint = ""
    threshold = 2
}
{
    # Keep the smallest second-column value above the threshold.
    if ($2 > threshold && ($2 < min || min == 0)) {
        min = $2
        mint = $1
    }
}
END {
    if (mint != "") print mint, min    # print nothing if no match
}
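The same idea works as a one-liner, with the threshold passed in via awk's -v option (the variable names best and line are chosen here for illustration):
awk -v x=2 '$2 > x && (best == "" || $2 < best) { best = $2; line = $0 } END { if (line != "") print line }' file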

How can I remove selected lines with an awk script?

I'm piping a program's output through some awk commands, and I'm almost where I need to be. The command thus far is:
myprogram | awk '/chk/ { if ( $12 > $13) printf("%s %d\n", $1, $12 - $13); else printf("%s %d\n", $1, $13 - $12) } ' | awk '!x[$0]++'
The last bit is a poor man's uniq, which isn't available on my target. The command above produces output such as this:
GR_CB20-chk_2, 0
GR_CB20-chk_2, 3
GR_CB200-chk_2, 0
GR_CB200-chk_2, 1
GR_HB20-chk_2, 0
GR_HB20-chk_2, 6
GR_HB20-chk_2, 0
GR_HB200-chk_2, 0
GR_MID20-chk_2, 0
GR_MID20-chk_2, 3
GR_MID200-chk_2, 0
GR_MID200-chk_2, 2
What I'd like to have is this:
GR_CB20-chk_2, 3
GR_CB200-chk_2, 1
GR_HB20-chk_2, 6
GR_HB200-chk_2, 0
GR_MID20-chk_2, 3
GR_MID200-chk_2, 2
That is, I'd like to print only the line that has the maximum value for a given tag (the first 'field'). The above example is representative of the actual data in that the output will be sorted (as though it had been piped through a sort command).
Based on my answer to a similar need, this script keeps things in order and doesn't accumulate a big array. It prints the line with the highest value from each group.
#!/usr/bin/awk -f
{
    # The tag is everything up to (but not including) the last comma.
    s = substr($0, 1, match($0, /,[^,]*$/) - 1)
    if (s != prevs) {
        # New group: flush the best line of the previous group.
        if (FNR > 1) print prevline
        prevval = $2
        prevline = $0
    }
    else if ($2 > prevval) {
        # Same group, higher value: remember this line instead.
        prevval = $2
        prevline = $0
    }
    prevs = s
}
END {
    print prevline
}
If you don't need the items to be in the same order they were output from myprogram, the following works:
... | awk '{ if (!($1 in x) || $2 > x[$1]) x[$1] = $2 } END { for (k in x) printf "%s %s\n", k, x[k] }'
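Since for (k in x) visits keys in an unspecified order, appending a sort recovers the grouped ordering of the example:
... | awk '{ if (!($1 in x) || $2 > x[$1]) x[$1] = $2 } END { for (k in x) printf "%s %s\n", k, x[k] }' | sort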
