Aggregate rows with specified granularity - shell

Input:
11 1
12 2
13 3
21 1
24 2
33 1
50 1
Let's say the 1st column specifies an index. I'd like to reduce the size of my data as follows:
I sum the values from the second column with a granularity of 10, according to the indices. An example:
First I consider the index range 0-9. There aren't any indices in that range, so the sum is 0. Next I move to the range 10-19. Three indices (11, 12, 13) fall in that range; summing their 2nd-column values gives 1+2+3=6. And so on...
Desirable output:
0 0
10 6
20 3
30 1
40 0
50 1
This is what I came up with:
M=0;
awk 'FNR==NR
{
if ($1 < 10)
{ A[$1]+=$2;next }
else if($1 < $M+10)
{
A[$M]+=$2;
next
}
else
{ $M=$M+10;
A[$M]+=2;
next
}
}END{for(i in A){print i" "A[i]}}' input_file
Sorry but I'm not quite good at AWK.
After some changes:
awk 'FNR==NR {
M=10;
if ($1 < 10){
A[$1]+=$2;next
} else if($1 < M+10) {
A[M]+=$2;
next
} else {
M=sprintf("%d",$1/10);
M=M*10;
A[M]+=$2;
next
}
}END{for(i in A){print i" "A[i]}}' input

This is GNU awk (standard awk works here too):
{
    ind = int($1/10)*10        # start of the 10-wide bucket for this index
    if (mxi < ind) mxi = ind   # remember the highest bucket seen
    a[ind] += $2               # sum 2nd-column values per bucket
}
END {
    for (i=0; i<=mxi; i+=10) {
        s = a[i] + 0           # empty buckets become 0
        print i " " s
    }
}
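For completeness, the bucket-and-sum idea can be checked end to end with a self-contained run (file names here are arbitrary):

```shell
# Recreate the sample input from the question
cat > input.txt <<'EOF'
11 1
12 2
13 3
21 1
24 2
33 1
50 1
EOF

# Map each index to the start of its 10-wide bucket, sum the
# 2nd-column values per bucket, and print every bucket from 0
# up to the highest one seen (empty buckets as 0)
awk '{
    ind = int($1/10)*10
    if (mxi < ind) mxi = ind
    a[ind] += $2
}
END {
    for (i = 0; i <= mxi; i += 10)
        print i, a[i] + 0
}' input.txt > output.txt

cat output.txt
```

This reproduces the desired output shown in the question.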

Related

Bash iterate through fields of a TSV file and divide it by the sum of the column

I have a tsv file with several columns, and I would like to iterate through each field, and divide it by the sum of that column:
Input:
A 1 2 1
B 1 0 3
Output:
A 0.5 1 0.25
B 0.5 0 0.75
I have the following to iterate through the fields, but I am not sure how I can find the sum of the column that the field is located in:
awk -v FS='\t' -v OFS='\t' '{for(i=2;i<=NF;i++){$i=$i/SUM_OF_COLUMN}} 1' input.tsv
You may use this 2-pass awk:
awk '
BEGIN {FS=OFS="\t"}
NR == FNR {
for (i=2; i<=NF; ++i)
sum[i] += $i
next
}
{
for (i=2; i<=NF; ++i)
$i = (sum[i] ? $i/sum[i] : 0)
}
1' file file
A 0.5 1 0.25
B 0.5 0 0.75
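A note on the trailing `file file`: the same file name is passed twice, so awk reads the input in two passes. During the first pass `NR == FNR` holds (both counters advance together) and only the sum-collecting block runs; on the second pass `FNR` resets to 1 while `NR` keeps growing, so the dividing block runs instead. A self-contained run (the file name is arbitrary):

```shell
# Recreate the sample TSV
printf 'A\t1\t2\t1\nB\t1\t0\t3\n' > data.tsv

awk '
BEGIN {FS=OFS="\t"}
NR == FNR {               # first pass: accumulate per-column sums
    for (i=2; i<=NF; ++i)
        sum[i] += $i
    next
}
{                         # second pass: divide each field by its column sum
    for (i=2; i<=NF; ++i)
        $i = (sum[i] ? $i/sum[i] : 0)
}
1' data.tsv data.tsv > normalized.tsv

cat normalized.tsv
```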
With your shown samples, please try the following awk code, which works in a single pass over the Input_file. It builds two arrays: one holding the sum of each column (keyed by column number), and one holding every field value (keyed by line and field number). The END block then traverses all FNR lines and prints each stored value divided by the sum of its column.
awk '
BEGIN{FS=OFS="\t"}
{
arr[FNR,1]=$1
for(i=2;i<=NF;i++){
sum[i]+=$i
arr[FNR,i]=$i
}
}
END{
for(i=1;i<=FNR;i++){
printf("%s\t",arr[i,1])
for(j=2;j<=NF;j++){
printf("%s%s",sum[j]?(arr[i,j]/sum[j]):"N/A",j==NF?ORS:OFS)
}
}
}
' Input_file

Find the probability in 2nd column for a selection in 1st column

I have two columns as follows
ifile.dat
1 10
3 34
1 4
3 32
5 3
2 2
4 20
3 13
4 50
1 40
2 20
5 2
I would like to calculate the probability in the 2nd column for some selection in the 1st column.
ofile.dat
1-2 0.417 #Here 1-2 means all values in 1st column ranging from 1 to 2;
#0.417 is the probability of corresponding values in 2nd column
# i.e. count(10,4,2,40,20)/total = 5/12
3-4 0.417 #count(34,32,20,13,50)/total = 5/12
5-6 0.167 #count(3,2)/total = 2/12
Similarly, if I choose a selection range of width 3, then the desired output will be
ofile.dat
1-3 0.667
4-6 0.333
RavinderSingh13 and James Brown have given nice scripts (see answers), but these do not work for values larger than 10 in the 1st column.
ifile2.txt
10 10
30 34
10 4
30 32
50 3
20 2
40 20
30 13
40 50
10 40
20 20
50 2
EDIT2: Considering OP's edited samples, please try the following. I have tested it with both OP's 1st and latest samples and it works with both of them.
One more note: this solution handles a corner case where the last range is incomplete and could otherwise go unprinted. For example, in OP's 1st sample range=2 but the maximum value is 5, so the 5 must not be left out.
sort -n Input_file |
awk -v range="2" '
!b[$1]++{
c[++count]=$1
}
{
d[$1]=(d[$1]?d[$1] OFS:"")$2
tot_element++
till=$1
}
END{
for(i=1;i<=till;i++){
num+=split(d[i],array," ")
if(++j==range){
start=start?start:1
printf("%s-%s %.02f\n",start,i,num/tot_element)
start=i+1
j=num=""
delete array
}
if(j!="" && i==till){
printf("%s-%s %.02f\n",start,i,num/tot_element)
}
}
}
'
Output for ifile2.txt with range=10 will be as follows.
1-10 0.25
11-20 0.17
21-30 0.25
31-40 0.17
41-50 0.17
EDIT: In case your Input_file doesn't have a 2nd column, then try the following.
sort -k1 Input_file |
awk -v range="1" '
!b[$1]++{
c[++count]=$1
}
{
d[$1]=(d[$1]?d[$1] OFS:"")$0
tot_element++
till=$1
}
END{
for(i=1;i<=till;i+=(range+1)){
for(j=i;j<=i+range;j++){
num=split(d[c[j]],array," ")
total+=num
}
print i"-"i+range,tot_element?total/tot_element:0
total=num=""
}
}
'
Please try the following, written and tested with the shown samples.
sort -k1 Input_file |
awk -v range="1" '
!b[$1]++{
c[++count]=$1
}
{
d[$1]=(d[$1]?d[$1] OFS:"")$2
tot_element++
till=$1
}
END{
for(i=1;i<=till;i+=(range+1)){
for(j=i;j<=i+range;j++){
num=split(d[c[j]],array," ")
total+=num
}
print i"-"i+range,tot_element?total/tot_element:0
total=num=""
}
}
'
In case you don't want to include any 0 values, then try the following.
sort -k1 Input_file |
awk -v range="1" '
!b[$1]++{
c[++count]=$1
}
{
d[$1]=(d[$1]!=0?d[$1] OFS:"")$2
tot_element++
till=$1
}
END{
for(i=1;i<=till;i+=(range+1)){
for(j=i;j<=i+range;j++){
num=split(d[c[j]],array," ")
total+=num
}
print i"-"i+range,tot_element?total/tot_element:0
total=num=""
}
}
'
Another:
$ awk '
BEGIN {
a[1]=a[2]=1 # define the groups here
a[3]=a[4]=2 # others will go to an overflow group 3
}
{
b[(($1 in a)?a[$1]:3)]++ # group 3 defined here
}
END { # in the end
for(i in b) # loop all groups in no particular order
print i,b[i]/NR # and output
}' file
Output
1 0.416667
2 0.416667
3 0.166667
Update. Yet another awk, with a range configuration file: $1 is the start of the range, $2 the end, and $3 is the group name:
1 3 1-3
4 9 4-9
10 30 10-30
40 100 40-100
Awk program:
$ awk '
BEGIN {
OFS="\t"
}
NR==FNR {
for(i=$1;i<=$2;i++)
a[i]=$3
next
}
{
b[(($1 in a)?a[$1]:"others")]++ # the overflow group is now called "others"
}
END {
for(i in b)
print i,b[i]/NR
}' rangefile datafile
Output with both your datasets concatenated together (and the awk output piped to sort -n):
1-3 0.285714
4-9 0.142857
10-30 0.285714
40-100 0.142857
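The pattern shared by the fixed-width answers above — map each 1st-column value to the start of its bucket, count rows per bucket, and divide by the total row count — can also be sketched directly. Here `w` is an assumed name for the bucket-width parameter (not taken from the answers), and empty buckets print as 0.000:

```shell
# Sample data from the question
cat > ifile.dat <<'EOF'
1 10
3 34
1 4
3 32
5 3
2 2
4 20
3 13
4 50
1 40
2 20
5 2
EOF

# Bucket values 1..w, w+1..2w, ... and print each bucket share of rows
awk -v w=2 '
{
    lo = int(($1 - 1) / w) * w + 1   # bucket start: 1, 1+w, 1+2w, ...
    cnt[lo]++
    if (lo > max) max = lo
}
END {
    for (i = 1; i <= max; i += w)
        printf "%d-%d %.3f\n", i, i + w - 1, cnt[i] / NR
}' ifile.dat > ofile.dat

cat ofile.dat
```

With w=2 this reproduces the first desired output (1-2 0.417, 3-4 0.417, 5-6 0.167); w=3 gives the 1-3/4-6 split.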

Find the durations and their maximum between the dataset in an interval in shell script

This is related to my older question Find the durations and their maximum between the dataset in shell script
I have a dataset as:
ifile.txt
2
3
2
3
2
20
2
0
2
0
0
2
1
2
5
6
7
0
3
0
3
4
5
I would like to find the durations (counts of non-zero values) until the next 0, and their maxima, within intervals of 6 values.
My desire output is:
ofile.txt
6 20
1 2
1 2
1 2
5 7
1 3
3 5
Where
6 is the count of values until the next 0 within the first 6 values (i.e. 2,3,2,3,2,20) and 20 is the maximum value among them;
1 is the count of values until the next 0 within the next 6 values (i.e. 2,0,2,0,0,2) and 2 is the maximum;
the next 1 and 2 are within the same 6 values;
5 is the count of values until the next 0 within the next 6 values (i.e. 1,2,5,6,7,0) and 7 is the maximum among them;
and so on.
As per the answer in my previous question, I was trying with this:
awk '(NR%6)==0
$0!=0{
count++
max=max>$0?max:$0
}
$0==0{
if(count){
print count,max
}
count=max=""
}
END{
if(count){
print count,max
}
}
' ifile.txt
A format command added to the EDIT2 solution given by RavinderSingh13, which will print the exact desired output:
awk '
$0!=0{
count++
max=max>$0?max:$0
found=""
}
$0==0{
print count,max
count=max=0
next
}
FNR%6==0{
print count,max
count=max=0
found=1
}
END{
if(!found){
print count,max
}
}
' Input_file | awk '!/^ /' | awk '$1 != 0'
Output will be as follows.
6 20
1 2
1 2
1 2
5 7
1 3
3 5
EDIT2: Adding another solution, which prints values for every 6 elements along with the zeros coming in between.
awk '
$0!=0{
count++
max=max>$0?max:$0
found=""
}
$0==0{
print count,max
count=max=0
next
}
FNR%6==0{
print count,max
count=max=0
found=1
}
END{
if(!found){
print count,max
}
}
' Input_file
Output will be as follows.
6 20
1 2
1 2
0 0
1 2
5 7
1 3
3 5
EDIT: As per OP's comment, OP doesn't want the count of non-zeros to reset when a zero value comes; in that case, try the following.
awk '
$0!=0{
count++
max=max>$0?max:$0
found=""
}
FNR%6==0{
print count,max
count=max=0
found=1
}
END{
if(!found){
print count,max
}
}
' Input_file
Output will be as follows.
6 20
3 2
5 7
.......
Please try the following (written and tested with the posted samples only).
awk '
$0!=0{
count++
max=max>$0?max:$0
found=""
}
$0==0{
count=FNR%6==0?count:0
found=""
}
FNR%6==0{
print count,max
count=max=0
found=1
}
END{
if(!found){
print count,max
}
}
' Input_file
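To tie the answers above together, the first pipeline (the EDIT2 solution plus the two format commands) can be verified end to end against the sample data:

```shell
# Sample data from the question
cat > ifile.txt <<'EOF'
2
3
2
3
2
20
2
0
2
0
0
2
1
2
5
6
7
0
3
0
3
4
5
EOF

# Count non-zeros and track the max; flush on each 0 and at every
# 6th line; then drop malformed and all-zero result lines
awk '
$0!=0{ count++; max=max>$0?max:$0; found="" }
$0==0{ print count,max; count=max=0; next }
FNR%6==0{ print count,max; count=max=0; found=1 }
END{ if(!found){ print count,max } }
' ifile.txt | awk '!/^ /' | awk '$1 != 0' > ofile.txt

cat ofile.txt
```

This reproduces the seven-line desired output from the question.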

awk command to sum pairs of lines and filter out under particular condition

I have a file with numbers, and I want to sum each pair of lines column by column; then, as a last step, I want to filter out pairs of lines whose column sums contain three or more 0s. A small example to make it clear:
This is my file (without the comments, of course); it contains 2 pairs of lines (= 4 lines) with 5 columns.
2 6 0 8 9 # pair 1.A
0 1 0 5 1 # pair 1.B
0 2 0 3 0 # pair 2.A
0 0 0 0 0 # pair 2.B
And I need to sum up pairs of lines so I get something like this (intermediate step)
2 7 0 13 10 # sum pair 1, it has one 0
0 2 0 3 0 # sum pair 2, it has three 0
Then I want to print the original lines, but only those for which the sum of the two lines contains fewer than three 0s; therefore I should get this:
2 6 0 8 9 # pair 1.A
0 1 0 5 1 # pair 1.B
Because the sum of the second pair of lines has three 0s, it should be excluded.
So from the first file I need to get the last output.
So far I have been able to sum pairs of lines, count zeros, and identify pairs with fewer than three 0s, but I don't know how to print the two lines that contributed to the sum; I am only able to print one of them (the last one). This is the awk I am using:
awk '
NR%2 { split($0, a); next }
{ for (i=1; i<=NF; i++) if (a[i]+$i == 0) SUM +=1;
if (SUM < 3) print $0; SUM=0 }' myfile
(That's what I get now)
0 1 0 5 1 # pair 1.B
Thanks!
Another variation, which skips to the next pair as soon as three zeros are found, avoiding unneeded loop iterations for some inputs:
awk '!(NR%2){ zeros=0; for(i=1;i<=NF;i++) { if(a[i]+$i==0) zeros++; if(zeros>=3) next }
print prev ORS $0 }{ split($0,a); prev=$0 }' file
The output:
2 6 0 8 9
0 1 0 5 1
Well, after digging a little bit more, I found that printing the previous line was rather simple (I was overcomplicating it):
awk '
NR%2 { split($0, a) ; b=$0; next }
{ for (i=1; i<=NF; i++) if (a[i]+$i == 0) SUM +=1;
if (SUM < 3) print b"\n"$0; SUM=0}' myfile
So I just have to save the first line in a variable b and print it when the condition holds.
Hope it can help other people too.
$ cat tst.awk
!(NR%2) {
split(prev,p)
zeroCnt = 0
for (i=1; i<=NF; i++) {
zeroCnt += (($i + p[i]) == 0 ? 1 : 0)
}
if (zeroCnt < 3) {
print prev ORS $0
}
}
{ prev = $0 }
$ awk -f tst.awk file
2 6 0 8 9
0 1 0 5 1
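The pair-filtering logic can be exercised end to end with the sample file (file names are arbitrary):

```shell
cat > myfile <<'EOF'
2 6 0 8 9
0 1 0 5 1
0 2 0 3 0
0 0 0 0 0
EOF

# Keep a pair only if its column sums contain fewer than three zeros
awk '
NR%2 { prev = $0; next }              # odd lines: remember and move on
{
    split(prev, p)
    zeroCnt = 0
    for (i = 1; i <= NF; i++)
        zeroCnt += (($i + p[i]) == 0 ? 1 : 0)
    if (zeroCnt < 3)
        print prev ORS $0             # print both lines of the pair
}' myfile > kept.txt

cat kept.txt
```

Only the first pair survives, as in the expected output.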

Bash First Element in List Recognition

I'm very new to Bash, so I'm sorry if this question is actually very simple. I am dealing with a text file that contains many vertical lists of the numbers 2-32, counting up by 2, where each number has other text following it on its line. The problem is that some of the lists are missing numbers. Any pointers for code that could go through, check whether each number is there, and if not, add a line with that number?
One list might look like:
2 djhfbadsljfhdsalkfjads;lfkjs
4 dfhadslkfjhasdlkfjhdsalfkjsahf
6 dsa;fghds;lfhsdalfkjhds;fjdsklj
8 daflgkdsakfjhasdlkjhfasdjkhf
12 dlsagflakdjshgflksdhflksdahfl
All the way down to 32. In this case, how would I make it so the 10 is recognized as missing and then added in above the 12? Thanks!
Here's one awk-based solution (formatted for readability, not necessarily how you would type it):
awk ' { value[0 + $1] = $2 }
END { for (i = 2; i < 34; i+=2)
print i, value[i]
}' input.txt
It basically just records the existing lines in a key/value pair (associative array), then at the end, prints all the records you care about, along with the (possibly empty) value saved earlier.
Note: if the first column needs to be seen as a string instead of an integer, this variant should work:
awk ' { value[$1] = $2 }
END { for (i = 2; i < 34; i+=2)
print i, value[i ""]
}' input.txt
You can use awk to figure out the missing lines and add them back:
awk '$1==NR*2{i=NR*2+2} i < $1 { while (i<$1) {print i; i+=2} i+=2}
END{for (; i<=32; i+=2) print i} 1' file
Testing:
cat file
2 djhfbadsljfhdsalkfjads;lfkjs
4 dfhadslkfjhasdlkfjhdsalfkjsahf
6 dsa;fghds;lfhsdalfkjhds;fjdsklj
20 daflgkdsakfjhasdlkjhfasdjkhf
24 dlsagflakdjshgflksdhflksdahfl
awk '$1==NR*2{i=NR*2+2} i < $1 { while (i<$1) {print i; i+=2} i+=2}
END{for (; i<=32; i+=2) print i} 1' file
2 djhfbadsljfhdsalkfjads;lfkjs
4 dfhadslkfjhasdlkfjhdsalfkjsahf
6 dsa;fghds;lfhsdalfkjhds;fjdsklj
8
10
12
14
16
18
20 daflgkdsakfjhasdlkjhfasdjkhf
22
24 dlsagflakdjshgflksdhflksdahfl
26
28
30
32
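The first answer can likewise be checked end to end. One detail worth noting: `print i, value[i]` emits a trailing OFS space on lines whose number was missing, because the saved value is empty:

```shell
cat > list.txt <<'EOF'
2 djhfbadsljfhdsalkfjads;lfkjs
4 dfhadslkfjhasdlkfjhdsalfkjsahf
6 dsa;fghds;lfhsdalfkjhds;fjdsklj
8 daflgkdsakfjhasdlkjhfasdjkhf
12 dlsagflakdjshgflksdhflksdahfl
EOF

# Record each existing line by number, then print 2..32 in order,
# filling gaps with an empty value
awk '{ value[0 + $1] = $2 }
END { for (i = 2; i < 34; i += 2)
          print i, value[i]
}' list.txt > filled.txt

cat filled.txt
```

This prints 16 lines; line 5 is the reconstructed (empty) entry for 10.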
