How to calculate gradient with AWK - bash

I have a file which includes two columns such as:
A B
1 2
10 20
100 200
.
.
.
I want to calculate the gradient (or slope) dB/dA with awk. That is, the third column should be the difference between adjacent rows in column B divided by the difference between the corresponding adjacent rows in column A. The result for the above data should be:
A B dB/dA
1 2 (20-2)/(10-1)=2
10 20 (200-20)/(100-10)=2
100 200
.
.
.
How can I do that?

Given your file, you can do this:
$ cat file
A B
1 2
10 20
100 200
awk 'BEGIN{OFS="\t"}NR==1{print $1,$2,"dB/dA"}NR>2{print a,b,($2-b)/($1-a)}{a=$1;b=$2}END{print a,b}' file
A B dB/dA
1 2 2
10 20 2
100 200
With:
BEGIN{OFS="\t"} to set the Output Field Separator to a tab
NR==1{print $1,$2,"dB/dA"} to copy the header and add the gradient column
NR>2{print a,b,($2-b)/($1-a)} to print the previous row together with the gradient between the current row and the previous one; the NR>2 condition skips the header and the first data row, since computing a gradient needs two data rows
{a=$1;b=$2} to save the current values in a and b for the next line; this runs on every line
END{print a,b} to print the last row, which has no following row and therefore no gradient value
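Note that if two consecutive A values were ever equal, ($1-a) would be zero and the division would fail. A minimal guarded sketch (the "NA" placeholder for that case is my own choice, not something from the question):
awk 'BEGIN{OFS="\t"}
     NR==1 {print $1, $2, "dB/dA"; next}
     NR>2  {print a, b, ($1 == a ? "NA" : ($2 - b) / ($1 - a))}  # avoid dividing by zero when A repeats
           {a = $1; b = $2}
     END   {print a, b}' file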
Hope this helps

Related

Fast way to modify a field for rows having the same values in multiple columns (in bash)?

(it doesn't seem complicated, so I'm sorry if it's a duplicate question; I didn't manage to formulate it well enough for a good search!)
I have a file with five columns:
dog 5 red car 10
dog 9 blue car 10
cat 1 blue car 18
owl 5 red car 15
bla 8 blue train 100
bla 2 red train 100
...
I want to find rows matching the same values in columns 1, 4 and 5 (rows (1,2) and (5,6) in the example above), and modify their value in column 2 to "0", as in:
dog 0 red car 10
dog 0 blue car 10
cat 1 blue car 18
owl 5 red car 15
bla 0 blue train 100
bla 0 red train 100
...
I am doing that by looping through the unique values of column 1 and using grep + awk to find the matches and modify their values. However, I actually have a large file (> 10000 rows, 15 columns) and I would like to find a way of finding the matches without having to read the file multiple times in the loop. Any ideas for bash or a perl one-liner to speed that up?
Thanks a lot!
Using awk (and assuming columns are delimited by tabs):
$ awk 'BEGIN { FS=OFS="\t"}
NR == FNR { dups[$1,$4,$5]++; next }
dups[$1,$4,$5] > 1 { $2 = 0 }
1' input.txt input.txt
dog 0 red car 10
dog 0 blue car 10
cat 1 blue car 18
owl 5 red car 15
bla 0 blue train 100
bla 0 red train 100
This does read the file twice in order to find all lines that share the same three fields, but I don't think there's a good way to avoid making two passes over the data, one way or another, while preserving order. For example, here is a perl version that reads the file only once but stores all of its contents in memory to iterate over at the end, so it is still effectively two passes:
perl -F"\t" -lane '$dups->{$F[0]}{$F[3]}{$F[4]}++;
                   push @lines, [@F];
                   END {
                     for my $l (@lines) {
                       $l->[1] = 0 if $dups->{$l->[0]}{$l->[3]}{$l->[4]} > 1;
                       print join("\t", @$l);
                     }
                   }' input.txt
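For what it's worth, the same buffer-everything idea also works in awk; here is a rough sketch under the same tab-delimited assumption (SUBSEP is awk's built-in separator for multi-part array keys):
awk 'BEGIN { FS = OFS = "\t" }
     {
         lines[NR] = $0                   # remember every line, in input order
         dups[$1 SUBSEP $4 SUBSEP $5]++   # count rows sharing fields 1, 4 and 5
     }
     END {
         for (i = 1; i <= NR; i++) {
             $0 = lines[i]                # re-split the stored line into fields
             if (dups[$1 SUBSEP $4 SUBSEP $5] > 1) $2 = 0
             print
         }
     }' input.txt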

Find mean and maximum in 2nd column for a selection in 1st column

I have two columns as follows
ifile.dat
1 10
3 34
1 4
3 32
5 3
2 2
4 20
3 13
4 50
1 40
2 20
5 2
I would like to calculate the mean and maximum values in 2nd column for some selection in 1st column.
ofile.dat
1-2 40 15.2 #Here 1-2 means all values in 1st column ranging from 1 to 2;
#40 is the maximum of corresponding values in 2nd column and 15.2 is their mean i.e. (10+4+2+40+20)/5
3-4 50 29.8 #Here 3-4 means all values in 1st column ranging from 3 to 4;
#50 is their maximum and 29.8 is their mean i.e. (34+32+20+13+50)/5
5-6 3 2.5 #Here 5-6 means all values in 1st column ranging from 5 to 6;
#3 is their maximum and 2.5 is their mean i.e. (3+2)/2
Similarly, if I choose a selection range of 3 numbers, then the desired output will be
ofile.dat
1-3 40 19.37
4-6 50 18.7
I have the following script which calculates for single values in the 1st column. But I am looking for multiple selections from 1st column.
awk '{ if (a[$1] < $2) { a[$1] = $2 } }
     END { for (i in a) { } }
     { b[$1] += $2; c[$1]++ }
     END { for (i in b)
               printf "%d %2s %5s %5.2f\n", i, OFS, a[i], b[i]/c[i] }' ifile.dat
The original data has the values in the 1st column varying from 1 to 100000. So I need to stratify with an interval of 1000. i.e. 1-1000, 1001-2000, 2001-3000,...
The following awk script will provide basic descriptive statistics with grouping.
I'd suggest looking into a more robust solution (Python, Perl, R, ...) that supports additional measures and more flexibility - no point reinventing the wheel.
The grouping logic is 1-1000, 1001-2000, ..., as per the comment above. The code is verbose for clarity.
awk '
{
    # Total counter
    nn++
    # Group id
    gsize = 1000
    gid = int(($1 - 1) / gsize)
    v = $2
    # Set up a new group, if needed
    if (!n[gid]) {
        n[gid] = 0
        sum[gid] = 0
        max[gid] = min[gid] = v
        name[gid] = (gid * gsize + 1) "-" ((gid + 1) * gsize)
    }
    if (v > max[gid]) max[gid] = v
    sum[gid] += v
    n[gid]++
}
END {
    # Print all groups
    for (gid in name) {
        printf "%-20s %4d %6.1f %5.1f\n", name[gid], max[gid], sum[gid]/n[gid], n[gid]/nn
    }
}
'
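As a rough illustration of the same grouping idea, here is a more compact sketch with the group width passed in from the shell (gsize=2 matches the sample data; use gsize=1000 for the real data; the trailing sort is there only because for-in order in awk is undefined):
awk -v gsize=2 '
{
    gid = int(($1 - 1) / gsize)                      # group index for this row
    if (!(gid in cnt) || $2 > max[gid]) max[gid] = $2
    sum[gid] += $2
    cnt[gid]++
}
END {
    for (gid in cnt)
        printf "%d-%d %d %.2f\n", gid * gsize + 1, (gid + 1) * gsize, max[gid], sum[gid] / cnt[gid]
}' ifile.dat | sort -n
With the sample ifile.dat this should print 1-2 40 15.20, 3-4 50 29.80 and 5-6 3 2.50.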
Could you please try the following, tested and written with the shown samples only.
sort -n -k1,1 Input_file |
awk -v range="1" '
!b[$1]++{
    c[++count] = $1
}
{
    a[$1] = (a[$1] > $2 ? a[$1] : $2)
    d[$1] += $2
    e[$1]++
    till = $1
}
END{
    for (i = 1; i <= till; i += (range + 1)) {
        for (j = i; j <= i + range; j++) {
            max = (max > a[c[j]] ? max : a[c[j]])
            total += d[c[j]]
            occr += e[c[j]]
        }
        print i "-" i + range, max, occr ? total / occr : 0
        occr = total = max = ""
    }
}
'
For the shown samples the output will be as follows.
1-2 40 15.2
3-4 50 29.8
5-6 3 2.5
I have kept the range variable as 1 here since each group spans 2 consecutive values of the 1st column (1-2, 3-4, 5-6); in your case, with groups like 1-1000, 1001-2000 and so on, set the range variable to 999 in the same way.

Sum up custom grand total on crosstab in BIRT

I have a crosstab and create a custom grand total at the row level in each column dimension by using a data element expression.
Crosstab Example:
        Cat 1                      Cat 2                     GT
ITEM    C     F    %     VALUE     C     F    %     VALUE
A       101   0    0.9   10        112   105  93.8  10       20
B       294   8    2.7   6         69    66   95.7  10       16
C       211   7    3.3   4         212   161  75.9  6        10
------------------------------------------------------------------
GT      606   15   2.47  6         393   332  84.5  8        **14**
Explanation for the GT row:
1. The C and F columns are sums of the values above, but the % column is the division result F/C.
2. I create a data element to fill the VALUE column, which comes from a range-of-values definition that varies for each Cat (category). For instance, in Cat 1, if the value is between 0 - 1 the VALUE will be 10, between 1 - 2 it will be 8, etc. The condition for Cat 2 is: between 85 - 100 = 10, between 80 - 85 = 8, etc.
The GT row (with the value of 14) is obtained by adding the VALUE of Cat 1 + Cat 2.
I am able to work out points 1 and 2 above, but I can't seem to make it work for the GT row. I don't know the code/expression to sum up the VALUE data element across these 2 categories, because the VALUE fields come from a single data element in design mode.
I have found the solution to my problem. I can show the result by using report variables. I assign 2 report variables in the % field expression, based on the category in the data cube dimension (using an if statement). Then, in the data element expression, I read both variables and add them.

using awk to average specified rows

I have a data file set up like
a 1
b 2
c 3
d 4
a 5
b 6
c 7
d 6
etc
and I would like to output to a new file
a average of 2nd column from all "a" rows
b average of 2nd column from all "b" rows
etc
where a, b, c... are also numbers.
I have been able to do this for specific values (1.4 in the example below) of the 1st column using awk:
awk '{ if ( $1 == 1.4) total += $2; count++ }
END {print total/10 }' data
though count is not giving me the correct number of rows (i.e. count should be 10, as I have manually put 10 into the last line to do the average).
I assume a for loop will be required but I have not been able to implement that correctly.
Please help. Thanks.
awk '{a[$1]+=$2;c[$1]++}END{for(x in a)printf "average of %s is %.2f\n",x,a[x]/c[x]}'
The output of the above line (with your example input) is:
average of a is 3.00
average of b is 4.00
average of c is 5.00
average of d is 5.00
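Note that for (x in a) does not guarantee any particular key order; if you want the keys sorted, one possible sketch (using the file name data from the question) is:
awk '{ sum[$1] += $2; cnt[$1]++ }
     END { for (k in sum) printf "average of %s is %.2f\n", k, sum[k] / cnt[k] }' data | sort -k3,3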

How to reduce a set of lines to take the average?

I have a file with lines like these (columns are tab separated)
2 1.414455 3.70898
2 2.414455 3.80898
2 3.414455 3.90898
2 1.414455 3.90898
4 4.414455 7.23898
4 3.414455 6.23898
4 5.414455 8.23898
i.e. there are consecutive lines where the first column is an integer and the remaining two columns are floats.
I want to reduce them as below
2 2.164455 3.83398
4 4.414455 7.23898
where I keep the first column and take the averages of the second and third columns over all rows with the same first column. The number of consecutive lines with the same first element may differ, but they will always be consecutive.
I can do this in perl, but was wondering if there is a simpler bash / sed / awk mix that can do the same for me?
Using awk:
awk '{a[$1]+=$2;b[$1]+=$3;c[$1]++;}END{for(i in c)print i, a[i]/c[i],b[i]/c[i];}' file
2 2.16445 3.83398
4 4.41446 7.23898
Using 3 different arrays: a and b keep the sums of the 2nd and 3rd columns, and c keeps the count of elements per key. At the end, the averages are calculated and printed.
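Since the question guarantees that lines with the same first column are always consecutive, a streaming sketch that flushes each group as soon as the key changes would also work, and it preserves the input order (assuming tab-separated columns, as stated):
awk -F'\t' -v OFS='\t' '
    $1 != key { if (n) print key, s2 / n, s3 / n     # key changed: flush the previous group
                key = $1; s2 = s3 = n = 0 }
    { s2 += $2; s3 += $3; n++ }
    END { if (n) print key, s2 / n, s3 / n }' file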
