Find the probability in 2nd column for a selection in 1st column - shell

I have two columns as follows
ifile.dat
1 10
3 34
1 4
3 32
5 3
2 2
4 20
3 13
4 50
1 40
2 20
5 2
I would like to calculate the probability of the values in the 2nd column for some selection in the 1st column.
ofile.dat
1-2 0.417 #Here 1-2 means all values in 1st column ranging from 1 to 2;
#0.417 is the probability of corresponding values in 2nd column
# i.e. count(10,4,2,40,20)/total = 5/12
3-4 0.417 #count(34,32,20,13,50)/total = 5/12
5-6 0.167 #count(3,2)/total = 2/12
Similarly, if I choose a selection range of 3 numbers, then the desired output will be
ofile.dat
1-3 0.667
4-6 0.333
RavinderSingh13 and James Brown have given nice scripts (see the answers), but these do not work for values larger than 10 in the 1st column.
ifile2.txt
10 10
30 34
10 4
30 32
50 3
20 2
40 20
30 13
40 50
10 40
20 20
50 2
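For reference, the underlying bucketing can be sketched in a single awk program (a minimal sketch, not either of the answers below; it assumes the 1st column holds positive integers and that range is the width of each selection, e.g. 10 for ifile2.txt):
awk -v range=10 '
{
    b = int(($1 - 1) / range)    # bucket index derived from the 1st column
    cnt[b]++                     # how many rows fall in this bucket
    tot++                        # total number of rows
    if (b > max) max = b         # remember the highest bucket seen
}
END {
    for (b = 0; b <= max; b++)
        if (b in cnt)
            printf "%d-%d %.3f\n", b*range + 1, (b + 1)*range, cnt[b] / tot
}' ifile2.txt
With range=2 on ifile.dat this reproduces the 1-2 / 3-4 / 5-6 split shown above, and with range=10 on ifile2.txt it yields 1-10 0.250, 11-20 0.167, and so on.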

EDIT2: Considering OP's edited samples, could you please try the following. I have tested it with OP's 1st and latest edited samples and it works fine with both of them.
One more thing: this solution handles the corner case where the range would otherwise leave trailing elements unprinted when the last lines do not complete a full range. For example, in OP's 1st sample range=2 but the maximum value is 5, so 5 will NOT be left out here.
sort -n Input_file |
awk -v range="2" '
!b[$1]++{
c[++count]=$1
}
{
d[$1]=(d[$1]?d[$1] OFS:"")$2
tot_element++
till=$1
}
END{
for(i=1;i<=till;i++){
num+=split(d[i],array," ")
if(++j==range){
start=start?start:1
printf("%s-%s %.02f\n",start,i,num/tot_element)
start=i+1
j=num=""
delete array
}
if(j!="" && i==till){
printf("%s-%s %.02f\n",start,i,num/tot_element)
}
}
}
'
Output will be as follows.
1-10 0.25
11-20 0.17
21-30 0.25
31-40 0.17
41-50 0.17
EDIT: In case your Input_file doesn't have a 2nd column, then try the following.
sort -k1 Input_file |
awk -v range="1" '
!b[$1]++{
c[++count]=$1
}
{
d[$1]=(d[$1]?d[$1] OFS:"")$0
tot_element++
till=$1
}
END{
for(i=1;i<=till;i+=(range+1)){
for(j=i;j<=i+range;j++){
num=split(d[c[j]],array," ")
total+=num
}
print i"-"i+range,tot_element?total/tot_element:0
total=num=""
}
}
'
Could you please try the following, written and tested with the shown samples.
sort -k1 Input_file |
awk -v range="1" '
!b[$1]++{
c[++count]=$1
}
{
d[$1]=(d[$1]?d[$1] OFS:"")$2
tot_element++
till=$1
}
END{
for(i=1;i<=till;i+=(range+1)){
for(j=i;j<=i+range;j++){
num=split(d[c[j]],array," ")
total+=num
}
print i"-"i+range,tot_element?total/tot_element:0
total=num=""
}
}
'
In case you don't want to include any 0 values, then try the following.
sort -k1 Input_file |
awk -v range="1" '
!b[$1]++{
c[++count]=$1
}
{
d[$1]=(d[$1]!=0?d[$1] OFS:"")$2
tot_element++
till=$1
}
END{
for(i=1;i<=till;i+=(range+1)){
for(j=i;j<=i+range;j++){
num=split(d[c[j]],array," ")
total+=num
}
print i"-"i+range,tot_element?total/tot_element:0
total=num=""
}
}
'

Another:
$ awk '
BEGIN {
a[1]=a[2]=1 # define the groups here
a[3]=a[4]=2 # others will go to an overflow group 3
}
{
b[(($1 in a)?a[$1]:3)]++ # group 3 defined here
}
END { # in the end
for(i in b) # loop all groups in no particular order
print i,b[i]/NR # and output
}' file
Output
1 0.416667
2 0.416667
3 0.166667
Update: yet another awk, with a range configuration file. $1 is the start of the range, $2 the end, and $3 is the group name:
1 3 1-3
4 9 4-9
10 30 10-30
40 100 40-100
Awk program:
$ awk '
BEGIN {
OFS="\t"
}
NR==FNR {
for(i=$1;i<=$2;i++)
a[i]=$3
next
}
{
b[(($1 in a)?a[$1]:"others")]++ # the overflow group is now called "others"
}
END {
for(i in b)
print i,b[i]/NR
}' rangefile datafile
Output with both of your datasets concatenated together (and the awk output piped to sort -n):
1-3 0.285714
4-9 0.142857
10-30 0.285714
40-100 0.142857
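If the ranges were regular (they need not be, as the rangefile above shows), the configuration file itself could be generated instead of typed by hand; a small sketch assuming bins of width 10 covering 1..50:
# hypothetical helper: emit "start end label" rows for bins of width 10
awk 'BEGIN { for (lo = 1; lo <= 41; lo += 10) print lo, lo + 9, lo "-" (lo + 9) }' > rangefile
This writes lines such as 1 10 1-10 in the format the program above expects.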

Related

Find the durations and their maximum between the dataset in an interval in shell script

This is related to my older question Find the durations and their maximum between the dataset in shell script
I have a dataset as:
ifile.txt
2
3
2
3
2
20
2
0
2
0
0
2
1
2
5
6
7
0
3
0
3
4
5
I would like to find the different durations and their maxima between the 0 values in 6-value intervals.
My desired output is:
ofile.txt
6 20
1 2
1 2
1 2
5 7
1 3
3 5
Where
6 is the number of counts until the next 0 within 6 values (i.e. 2,3,2,3,2,20) and 20 is the maximum value among them;
1 is the number of counts until the next 0 within the next 6 values (i.e. 2,0,2,0,0,2) and 2 is the maximum;
The next 1 and 2 are within the same 6 values;
5 is the number of counts until the next 0 within the next 6 values (i.e. 1,2,5,6,7,0) and 7 is the maximum among them;
And so on.
As per the answer in my previous question, I was trying with this:
awk '(NR%6)==0
$0!=0{
count++
max=max>$0?max:$0
}
$0==0{
if(count){
print count,max
}
count=max=""
}
END{
if(count){
print count,max
}
}
' ifile.txt
A formatting step added to the EDIT2 solution given by RavinderSingh13, which will print the exact desired output (the two trailing awk filters drop lines with an empty or zero count):
awk '
$0!=0{
count++
max=max>$0?max:$0
found=""
}
$0==0{
print count,max
count=max=0
next
}
FNR%6==0{
print count,max
count=max=0
found=1
}
END{
if(!found){
print count,max
}
}
' Input_file | awk '!/^ /' | awk '$1 != 0'
Output will be as follows.
6 20
1 2
1 2
1 2
5 7
1 3
3 5
EDIT2: Adding another solution which will print values for every 6 elements, along with the zero counts coming in between.
awk '
$0!=0{
count++
max=max>$0?max:$0
found=""
}
$0==0{
print count,max
count=max=0
next
}
FNR%6==0{
print count,max
count=max=0
found=1
}
END{
if(!found){
print count,max
}
}
' Input_file
Output will be as follows.
6 20
1 2
1 2
0 0
1 2
5 7
1 3
3 5
EDIT: As per OP's comment, OP doesn't want to reset the count of non-zeros when a zero value comes; in that case try the following.
awk '
$0!=0{
count++
max=max>$0?max:$0
found=""
}
FNR%6==0{
print count,max
count=max=0
found=1
}
END{
if(!found){
print count,max
}
}
' Input_file
Output will be as follows.
6 20
3 2
5 7
.......
Could you please try the following (written and tested with the posted samples only).
awk '
$0!=0{
count++
max=max>$0?max:$0
found=""
}
$0==0{
count=FNR%6==0?count:0
found=""
}
FNR%6==0{
print count,max
count=max=0
found=1
}
END{
if(!found){
print count,max
}
}
' Input_file

Detect increment made in any column

I have the following data as input. I am trying to find the increment per group.
col1 col2 col3 group
1 2 100 alpha
1 2 100 alpha
1 2 100 alpha
3 4 200 beta
3 4 200 beta
3 4 200 beta
3 4 300 beta
5 6 700 charlie
7 8 400 tango
7 8 300 tango
7 8 700 tango
Example output:
tango: 300
charlie:0
beta:100
alpha:0
I am trying this approach but the answers are incorrect, as sometimes values increase in between the samples:
awk 'NR>1{print $NF}' foo |while read line;do grep -w $line foo|sort -k3n ;done |awk '!a[$4]++' |sort -k4
1 2 100 alpha
3 4 200 beta
5 6 700 charlie
7 8 300 tango
awk 'NR>1{print $NF}' foo |while read line;do grep -w $line foo|sort -k3n ;done |tac|awk '!a[$4]++' |sort -k4
1 2 100 alpha
3 4 300 beta
5 6 700 charlie
7 8 700 tango
Awk solution:
awk 'NR==1{ next }
g && $4 != g{ print g":"(v - gr[g]) }
!($4 in gr){ gr[$4]=$3 }{ g=$4; v=$3 }
END{ print g":"(v - gr[g]) }' file
NR==1{ next } - skip the 1st record
g - variable aimed to hold group name
v - variable aimed to hold group value
!($4 in gr){ gr[$4]=$3 } - on the 1st occurrence of a distinct group name $4 - save its first value $3 into array gr
g && $4 != g{ print g":"(v - gr[g]) } - if the current group name $4 differs from the previous one g - print the delta between the last and 1st values of the previous group
The output:
alpha:0
beta:100
charlie:0
tango:300
The following should do the trick; this solution does not require the file to be sorted by group name.
awk '(NR==1){next}
{groupc[$4]++}
(groupc[$4]==1){groupv[$4]=$3}
{groupl[$4]=$3}
END{for(i in groupc) { print i":",groupl[i]-groupv[i]} }
' foo
The following things happen:
skip the first line: (NR==1){next}
count how many times each group occurs: {groupc[$4]++}
if the group count equals 1, store its first value in groupv
store the last seen value in groupl
at the END, run over all array keys (which are the groups), and print the last value minus the first value.
Output:
tango: 300
alpha: 0
beta: 100
charlie: 0
The following awk may help you with this too. It will provide the output in the same sequence as your Input_file's last-column values.
awk '
FNR==1{
next}
prev!=$NF && prev{
val=prev_val!=a[prev]?prev_val-a[prev]:0;
printf("%s %d\n",prev,val>0?val:0)}
!a[$NF]{
a[$NF]=$(NF-1)}
{
prev=$NF;
prev_val=$(NF-1)}
END{
val=prev_val!=a[prev]?prev_val-a[prev]:0;
printf("%s %d\n",prev,val>0?val:0)}
' Input_file
Output will be as follows. An explanation is added below.
alpha 0
beta 100
charlie 0
tango 300
Explanation: adding an explanation of the code now for learning purposes.
awk '
FNR==1{ ##To skip the first line of Input_file, which is the heading: if FNR==1 then do next, where next skips all further statements of awk.
next}
prev!=$NF && prev{ ##Checking conditions here: if variable prev is NOT equal to the current line $NF and variable prev is NOT NULL, then do the following:
val=prev_val!=a[prev]?prev_val-a[prev]:0;##create a variable val: if prev_val is not equal to a[prev] then subtract a[prev] from prev_val, else it will be zero.
printf("%s %d\n",prev,val>0?val:0)} ##printing the value of variable prev (which is nothing but the value of the last column) and then the value of val if greater than 0, or 0 otherwise.
!a[$NF]{ ##Checking if the array a entry whose index is $NF is NULL; if so, fill it with the current $(NF-1) value. This captures the very first value of each group so that later we can subtract it from the last value, as per OP request.
a[$NF]=$(NF-1)}
{
prev=$NF; ##creating a variable named prev and assigning it the value of the last column of the current line.
prev_val=$(NF-1)} ##creating a variable named prev_val whose value will be the second-to-last column value of the current line.
END{ ##starting the END block of the awk code; it runs once the Input_file has been read completely.
val=prev_val!=a[prev]?prev_val-a[prev]:0;##getting the value of variable val: if prev_val is not equal to a[prev] then subtract a[prev] from prev_val, else it will be zero.
printf("%s %d\n",prev,val>0?val:0)} ##printing the value of variable prev (which is nothing but the value of the last column) and then the value of val if greater than 0, or 0 otherwise.
' Input_file ##Mentioning the Input_file name here.
$ cat tst.awk
NR==1 { next }
!($4 in beg) { beg[$4] = $3 }
{ end[$4] = $3 }
END {
for (grp in beg) {
print grp, end[grp] - beg[grp]
}
}
$ awk -f tst.awk file
tango 300
alpha 0
beta 100
charlie 0

How to edit a few lines in a column using awk?

I have an ASCII data file, e.g.:
ifile.txt
2
3
2
3
4
5
6
4
I would like to multiply all the numbers from the 6th line onward by 3. So the outfile will be:
ofile.txt
2
3
2
3
4
15
18
12
My algorithm/script is:
awk '{if ($1<line 6); printf "%10.5f\n", $1}' ifile.txt > ofile.txt
awk '{if ($1>=line 6); printf "%10.5f\n", $1*3}' ifile.txt >> ofile.txt
The simplest way to do this is:
awk 'NR >= 6 { $1 *= 3 } 1' ifile.txt
Multiply the first field by 3 when the record (line) number NR is 6 or greater.
The structure of an awk program is condition { action }, where the default condition is true and the default action is { print }, so the 1 at the end is the shortest way of always printing every line.
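If you also want the fixed-width float formatting that the attempt in the question used (an assumption about the desired output format), the same condition can feed a printf:
# multiply from the 6th line onward, then print every value in the 10.5f format from the question
awk 'NR >= 6 { $1 *= 3 } { printf "%10.5f\n", $1 }' ifile.txt > ofile.txt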

Sum of all rows of all columns - Bash

I have a file like this
1 4 7 ...
2 5 8
3 6 9
And I would like to have as output
6 15 24 ...
That is, the sum of all the lines for each column. I know that to sum all the lines of a certain column (say column 1) you can do this:
awk '{sum+=$1;}END{print $1}' infile > outfile
But I can't do it automatically for all the columns.
One more awk
awk '{for(i=1;i<=NF;i++)$i=(a[i]+=$i)}END{print}' file
Output
6 15 24
Explanation
{for (i=1;i<=NF;i++) Loop over every field, from 1 through NF
$i=(a[i]+=$i) Add the field value to its running column sum and set the field to that sum
END{print} Print the last line, whose fields now hold the column sums
As with the other answers this will retain the order of the fields regardless of the number of them.
You want to sum every column separately. Hence, you need an array, not a scalar:
$ awk '{for (i=1;i<=NF;i++) sum[i]+=$i} END{for (i in sum) print sum[i]}' file
6
15
24
This stores sum[column] and finally prints it.
To have the output in the same line, use:
$ awk '{for (i=1;i<=NF;i++) sum[i]+=$i} END{for (i in sum) printf "%d%s", sum[i], (i==NF?"\n":" ")}' file
6 15 24
This uses the trick printf "%d%s", sum[i], (i==NF?"\n":" "): print the number plus a character. If we are at the last field, that character is a newline; otherwise, it is just a space.
There is a very simple command called numsum to do this:
numsum -c FileName
-c --- Print out the sum of each column.
For example:
cat FileName
1 4 7
2 5 8
3 6 9
Output :
numsum -c FileName
6 15 24
Note:
If the command is not installed on your system, you can install it with this command:
apt-get install num-utils
echo "1 4 7
2 5 8
3 6 9 " \
| awk '{for (i=1;i<=NF;i++){
sums[i]+=$i;maxi=i}
}
END{
for(i=1;i<=maxi;i++){
printf("%s ", sums[i])
}
print}'
output
6 15 24
My recollection is that you can't rely on for (i in sums) to produce the keys in any particular order, though newer versions of gawk let you control the traversal order.
In case you're using an old-style Unix awk, this solution will keep your output in the same column order, regardless of how "wide" your file is.
IHTH
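For what it is worth, GNU awk (4.0 and later) does let you force the traversal order of for (i in sums) via PROCINFO["sorted_in"]; a sketch assuming gawk is available:
gawk '
{ for (i = 1; i <= NF; i++) sums[i] += $i }
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"   # gawk extension: traverse indices in ascending numeric order
    for (i in sums) printf "%s ", sums[i]
    print ""
}' file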
AWK Program
#!/usr/bin/awk -f
{
print($0);
len=split($0,a);
if (maxlen < len) {
maxlen=len;
}
for (i=1;i<=len;i++) {
b[i]+=a[i];
}
}
END {
for (i=1;i<=maxlen;i++) {
printf("%s ", b[i]);
}
print ""
}
Output
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
3 6 9 12 15
Your attempt is almost correct; it just prints $1 instead of sum at the END. Try this:
awk '{sum+=$1;} END{print sum;}' infile > outfile

Aggregate rows with specified granularity

Input:
11 1
12 2
13 3
21 1
24 2
33 1
50 1
Let's say the 1st column specifies an index. I'd like to reduce the size of my data as follows:
I sum the values from the second column with a granularity of 10 according to the indices. An example:
First I consider the index range 0-9. There aren't any indices in that range, so the sum equals 0. Next I go to the range 10-19. There are 3 indices (11,12,13) that fall in that range; I sum the values from the 2nd column for them, which equals 1+2+3=6. And so on...
Desired output:
0 0
10 6
20 3
30 1
40 0
50 1
That's what I made up:
M=0;
awk 'FNR==NR
{
if ($1 < 10)
{ A[$1]+=$2;next }
else if($1 < $M+10)
{
A[$M]+=$2;
next
}
else
{ $M=$M+10;
A[$M]+=2;
next
}
}END{for(i in A){print i" "A[i]}}' input_file
Sorry but I'm not quite good at AWK.
After some changes:
awk 'FNR==NR {
M=10;
if ($1 < 10){
A[$1]+=$2;next
} else if($1 < M+10) {
A[M]+=$2;
next
} else {
M=sprintf("%d",$1/10);
M=M*10;
A[M]+=$2;
next
}
}END{for(i in A){print i" "A[i]}}' input
This is GNU awk
{
    ind = int($1/10)*10          # bucket start: 0, 10, 20, ...
    if (mxi < ind) mxi = ind     # remember the highest bucket seen
    a[ind] += $2                 # sum the 2nd-column values for this bucket
}
END {
    for (i=0; i<=mxi; i+=10) {
        printf "%d %d\n", i, a[i]   # unset (empty) buckets print as 0
    }
}
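Assuming the program above is saved as, say, sumbins.awk (a hypothetical file name), it can be run like this, reproducing the desired output from the question:
$ awk -f sumbins.awk input
0 0
10 6
20 3
30 1
40 0
50 1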
