Create bins with awk histogram-like - bash

Here's my input file :
1.37987
1.21448
0.624999
1.28966
1.77084
1.088
1.41667
I would like to create bins of a size of my choice to get histogram-like output, e.g. something like this for 0.1 bins, starting from 0 :
0 0.1 0
...
0.5 0.6 0
0.6 0.7 1
...
1.0 1.1 1
1.1 1.2 0
1.2 1.3 2
1.3 1.4 1
...
My file is too big for R, so I'm looking for an awk solution (also open to anything else that I can understand, as I'm still a Linux beginner).
This was sort of already answered in this post : awk histogram in buckets but the solution is not working for me.

This should be very close if not exactly right. Consider it a starting point at least and verify/figure out the math yourself (in particular decide/verify which bucket(s) an exact boundary match like 0.2 should go into - 0.1 to 0.2 and/or 0.2 to 0.3?):
$ cat tst.awk
BEGIN { delta = (delta == "" ? 0.1 : delta) }
{
    bucketNr = int(($0+delta) / delta)
    cnt[bucketNr]++
    numBuckets = (numBuckets > bucketNr ? numBuckets : bucketNr)
}
END {
    for (bucketNr=1; bucketNr<=numBuckets; bucketNr++) {
        end = beg + delta
        printf "%0.1f %0.1f %d\n", beg, end, cnt[bucketNr]
        beg = end
    }
}
$ awk -f tst.awk file
0.0 0.1 0
0.1 0.2 0
0.2 0.3 0
0.3 0.4 0
0.4 0.5 0
0.5 0.6 0
0.6 0.7 1
0.7 0.8 0
0.8 0.9 0
0.9 1.0 0
1.0 1.1 1
1.1 1.2 0
1.2 1.3 2
1.3 1.4 1
1.4 1.5 1
1.5 1.6 0
1.6 1.7 0
1.7 1.8 1
Note that you can assign the bucket delta size on the command line, 0.1 is just the default value:
$ awk -v delta='0.3' -f tst.awk file
0.0 0.3 0
0.3 0.6 0
0.6 0.9 1
0.9 1.2 1
1.2 1.5 4
1.5 1.8 1
$ awk -v delta='0.5' -f tst.awk file
0.0 0.5 0
0.5 1.0 1
1.0 1.5 5
1.5 2.0 1
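For comparison, here is a minimal Python sketch of the same fixed-width binning. The function names bin_counts and histogram_rows are made up for this illustration. It assumes one number per input value and uses math.floor rather than truncation, so a value v goes into the bin [k*width, (k+1)*width) with k = floor(v / width).

```python
import math
from collections import Counter

def bin_counts(values, width=0.1):
    """Count values per fixed-width bin; bin k covers [k*width, (k+1)*width)."""
    counts = Counter()
    for v in values:
        # floor (not int truncation) keeps negative values in the right bin
        counts[math.floor(v / width)] += 1
    return counts

def histogram_rows(counts, width=0.1, start=0):
    """Return (low, high, count) rows from bin `start` up to the highest used bin."""
    if not counts:
        return []
    first = min(start, min(counts))
    return [(k * width, (k + 1) * width, counts.get(k, 0))
            for k in range(first, max(counts) + 1)]

if __name__ == "__main__":
    data = [1.37987, 1.21448, 0.624999, 1.28966, 1.77084, 1.088, 1.41667]
    for low, high, n in histogram_rows(bin_counts(data)):
        print("%.1f %.1f %d" % (low, high, n))
```

Because bin_counts only iterates once over its input, it can be fed a generator over the lines of a large file instead of a list. Values that sit exactly on a bin boundary can still land one bin low because of floating-point division (0.3 / 0.1 is slightly less than 3), so the boundary question raised above applies here too.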

This is also possible :
awk -v size=0.1 '
{ b=int($1/size); a[b]++; bmax=b>bmax?b:bmax; bmin=b<bmin?b:bmin }
END { for(i=bmin;i<=bmax;++i) print i*size,(i+1)*size,a[i]+0 }' file
It does essentially the same as Ed Morton's solution, but starts printing buckets from the minimum bucket index, which defaults to 0, so buckets below zero are printed as well. Note that int() truncates toward zero, so values between -size and size all land in bucket 0.

Here is my stab at solving this with Awk.
To run: awk -f belowscript.awk inputfile
BEGIN {
    PROCINFO["sorted_in"]="@ind_num_asc";
    delta = (delta == "") ? 0.1 : delta;
};
/^-?([0-9][0-9]*|[0-9]*(\.[0-9][0-9]*))/ {
    # Special case the [-delta - 0] case so it doesn't bin in the [0-delta] bin
    fractBin=$1/delta
    if (fractBin < 0 && int(fractBin) == fractBin)
        fractBin = fractBin+1
    prefix = (fractBin <= 0 && int(fractBin) == 0) ? "-" : ""
    bins[prefix int(fractBin)]++
}
END {
    for (var in bins)
    {
        srange = sprintf("%0.2f",delta * ((var >= 0) ? var : var-1))
        erange = sprintf("%0.2f",delta * ((var >= 0) ? var+1 : var))
        print srange " " erange " " bins[var]
    }
}
Some notes:
I added support for providing the bin size on the command line, like Ed Morton did.
It only prints the bins that contain something.
Which bin an exact boundary match goes in (the smaller or the larger one) naturally flips with this approach when going negative, and required tweaking to make it consistent.
The 0 boundary needed special casing for numbers in the first negative bin, since there is no such number as -0. Awk's associative arrays use strings for keys, so "-0" was possible, and with the @ind_num_asc sort order for the for loop it seems to sort -0 properly, though this may not be portable.

Another solution with Python
# draw histogram in command line with Python
#
# usage: $ cat datafile.txt | python this_script.py [nbins] [nscale]
# The input should be one column of numbers to be piped in.
#
# forked from https://gist.github.com/bgbg
from __future__ import print_function
import sys
import numpy as np

def asciihist(it, bins=10, minmax=None, str_tag='',
              scale_output=30, generate_only=False, print_function=print):
    """Create an ASCII histogram from an iterable of numbers.
    Author: Boris Gorelik boris#gorelik.net. based on http://econpy.googlecode.com/svn/trunk/pytrix/pytrix.py
    License: MIT
    """
    ret = []
    itarray = np.asanyarray(it)
    if minmax == 'auto':
        minmax = np.percentile(it, [5, 95])
        if minmax[0] == minmax[1]:
            # for very ugly distributions
            minmax = None
    if minmax is not None:
        # discard values that are outside minmax range
        mn = minmax[0]
        mx = minmax[1]
        itarray = itarray[itarray >= mn]
        itarray = itarray[itarray <= mx]
    if itarray.size:
        total = len(itarray)
        counts, cutoffs = np.histogram(itarray, bins=bins)
        cutoffs = cutoffs[1:]
        if str_tag:
            str_tag = '%s ' % str_tag
        else:
            str_tag = ''
        if scale_output is not None:
            scaled_counts = counts.astype(float) / counts.sum() * scale_output
        else:
            scaled_counts = counts
        if minmax is not None:
            ret.append('Trimmed to range (%s - %s)' % (str(minmax[0]), str(minmax[1])))
        for cutoff, original_count, scaled_count in zip(cutoffs, counts, scaled_counts):
            ret.append("{:s}{:>8.2f} |{:<7,d} | {:s}".format(
                str_tag,
                cutoff,
                original_count,
                "*" * int(scaled_count))
            )
        ret.append(
            "{:s}{:s} |{:s} | {:s}".format(
                str_tag,
                '-' * 8,
                '-' * 7,
                '-' * 7
            )
        )
        ret.append(
            "{:s}{:>8s} |{:<7,d}".format(
                str_tag,
                'N=',
                total
            )
        )
    else:
        ret = []
    if not generate_only:
        for line in ret:
            print_function(line)
    ret = '\n'.join(ret)
    return ret

if __name__ == '__main__':
    nbins = 30
    if len(sys.argv) >= 2:
        nbins = int(sys.argv[1])
    nscale = 400
    if len(sys.argv) == 3:
        nscale = int(sys.argv[2])
    dataIn = []
    for line in sys.stdin:
        if line.strip() != '':
            dataIn.append(float(line))
    asciihist(dataIn, bins=nbins, scale_output=nscale, minmax=None, str_tag='BIN')

Related

Find linear trend up to the maximum value using awk

I have a datafile as below:
ifile.txt
-10 /
-9 /
-8 /
-7 3
-6 4
-5 13
-4 16
-3 17
-2 23
-1 26
0 29
1 32
2 35
3 38
4 41
5 40
6 35
7 30
8 25
9 /
10 /
Here "/" are the missing values. I would like to compute the linear trend up to the maximum value in the y-axis (i.e. up to the value "41" in 2nd column). So it should calculate the trend from the following data:
-7 3
-6 4
-5 13
-4 16
-3 17
-2 23
-1 26
0 29
1 32
2 35
3 38
4 41
The other (x, y) pairs won't be considered, because they come after the maximum at (4, 41).
The following script works fine when all values are used:
awk '!/\//{sx+=$1; sy+=$2; c++;
sxx+=$1*$1; sxy+=$1*$2}
END {det=c*sxx-sx*sx;
print (det?(c*sxy-sx*sy)/det:"DIV0")}' ifile.txt
But I am not able to restrict it to the values up to the maximum.
For the given example the result should be 3.486.
Updated based on your comments. I assumed your trend calculations were good and used them:
$ awk '
$2!="/" {
    b1[++j]=$1                       # buffer them up until or if used
    b2[j]=$2
    if(max=="" || $2>max) {          # once a bigger than current max found
        max=$2                       # new champion
        for(i=1;i<=j;i++) {          # use all so far buffered values
            # print b1[i], b2[i]     # debug to see values used
            sx+=b1[i]                # Your code from here on
            sy+=b2[i]
            c++
            sxx+=b1[i]*b1[i]
            sxy+=b1[i]*b2[i]
        }
        j=0                          # buffer reset
        delete b1
        delete b2
    }
}
END {
    det=c*sxx-sx*sx
    print (det?(c*sxy-sx*sy)/det:"DIV0")
}' file
For data:
0 /
1 1
2 2
3 4
4 3
5 5
6 10
7 7
8 8
With the debug print uncommented, the program would output:
1 1
2 2
3 4
4 3
5 5
6 10
1.51429
You can update the sums with the rows concerned only when $2 > max, and keep the intermediate rows in variables until then, for example using associative arrays:
awk '
    $2 == "/" {next}
    $2 > max {
        # update max if $2 > max
        max = $2;
        # add all elements of a1 to a and b1 to b
        for (k in a1) {
            a[k] = a1[k]; b[k] = b1[k]
        }
        # add the current row to a, b
        a[NR] = $1; b[NR] = $2;
        # reset a1, b1
        delete a1; delete b1;
        next;
    }
    # if $2 <= max, then set a1, b1
    { a1[NR] = $1; b1[NR] = $2 }
    END{
        for (k in a) {
            #print k, a[k], b[k]
            sx += a[k]; sy += b[k]; sxx += a[k]*a[k]; sxy += a[k]*b[k]; c++
        }
        det=c*sxx-sx*sx;
        print (det?(c*sxy-sx*sy)/det:"DIV0")
    }
' ifile.txt
#3.48601
Or calculate sx, sy etc directly instead of using arrays:
awk '
    $2 == "/" {next}
    $2 > max {
        # update max if $2 > max
        max = $2;
        # add the current Row plus the cached values
        sx += $1+sx1; sy += $2+sy1; sxx += $1*$1+sxx1; sxy += $1*$2+sxy1; c += 1+c1
        # reset the cached variables
        sx1 = 0; sy1 = 0; sxx1 = 0; sxy1 = 0; c1 = 0;
        next;
    }
    # if $2 <= max, then calculate and cache the values
    { sx1 += $1; sy1 += $2; sxx1 += $1*$1; sxy1 += $1*$2; c1++ }
    END{
        det=c*sxx-sx*sx;
        print (det?(c*sxy-sx*sy)/det:"DIV0")
    }
' ifile.txt
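For readers who would rather check the numbers outside awk, the same idea can be sketched in Python. The function name slope_up_to_max is hypothetical. It assumes the rows are already split into (x, y) string pairs with "/" marking a missing y, and it cuts the data at the first occurrence of the maximum y, which selects the same rows as the running-maximum approach in the awk answers.

```python
def slope_up_to_max(rows):
    """Least-squares slope using rows up to (and including) the first maximum y."""
    points = [(float(x), float(y)) for x, y in rows if y != "/"]
    if not points:
        return None
    ys = [y for _, y in points]
    cut = ys.index(max(ys)) + 1          # keep everything through the first maximum
    points = points[:cut]
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    det = n * sxx - sx * sx
    # same formula as the awk scripts; None plays the role of "DIV0"
    return (n * sxy - sx * sy) / det if det else None
```

Unlike the streaming awk versions, this sketch holds all points in memory, which is fine for small files like the example.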

Is there a way to solve Ternary first-order equation with Maxima?

I want to know the syntax for solving a system of linear equations in three unknowns (a "ternary first-order" system) in Maxima.
For example:
F_A + F_C + F_E - 15 = 0;
-F_A *0.4 + 15*0.2 m + F_E*0.4 = 0;
F_C = 0.3*F_A + 0.3*F_E;
How do I get the solution for F_A, F_C, and F_E?
Since this is a system of linear equations, one can call linsolve to solve it.
(%i10) eq1: F_E + F_C + F_A - 15 = 0 $
(%i11) eq2: 3.0*m + 0.4*F_E - 0.4*F_A = 0 $
(%i12) eq3: F_C = 0.3*F_E + 0.3*F_A $
(%i13) linsolve ([eq1, eq2, eq3], [F_A, F_C, F_E]);
rat: replaced -0.4 by -2/5 = -0.4
rat: replaced 0.4 by 2/5 = 0.4
rat: replaced 3.0 by 3/1 = 3.0
rat: replaced -0.3 by -3/10 = -0.3
rat: replaced -0.3 by -3/10 = -0.3
              195*m + 300        45          195*m - 300
(%o13) [F_A = -----------, F_C = --, F_E = - -----------]
                  52             13              52
Note that it's not necessary for all terms to have a numerical value -- in the solution above, m is a free variable.
Note also that Maxima prefers exact numbers (i.e., integers and rational numbers) to inexact numbers (i.e., floating point). linsolve converts floats to rationals and then works with the result.
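Substitute x, y, z for the unknowns (taking the 15*0.2 term as the constant 3 and moving it to the right-hand side):
F_A = x; F_C = y; F_E = z;
x + y + z = 15
-0.4*x + 0.4*z = -3
0.3*x - y + 0.3*z = 0
In MATLAB, refer to the pic.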
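As a cross-check outside Maxima and MATLAB, here is a small Python sketch that solves the numeric version of the system exactly with the standard-library Fraction type. The helper solve_linear is written for this illustration and is not part of any library. It assumes the 15*0.2 term is the plain constant 3, so the second equation reads -0.4*x + 0.4*z = -3.

```python
from fractions import Fraction

def solve_linear(a, b):
    """Solve a*x = b by Gauss-Jordan elimination over exact fractions."""
    n = len(b)
    # augmented matrix with exact rational entries
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # first row at or below the diagonal with a non-zero entry
        # (raises StopIteration if the matrix is singular)
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[n] for row in m]

# unknowns ordered as F_A, F_C, F_E; coefficients are given as strings so that
# Fraction("0.4") is exactly 2/5 rather than a binary float approximation
A = [["1", "1", "1"],
     ["-0.4", "0", "0.4"],
     ["0.3", "-1", "0.3"]]
b = ["15", "-3", "0"]
F_A, F_C, F_E = solve_linear(A, b)
print(F_A, F_C, F_E)
```

This gives F_A = 495/52, F_C = 45/13 and F_E = 105/52, which agrees with the linsolve result above when m = 1.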

z-score to probability and vice versa in Ruby

How can I convert a z-score to a probability using Ruby?
Example:
z_score = 0
probability should be 0.5
z_score = 1.76
probability should be 0.039204
According to this post, https://stackoverflow.com/a/16197404/1062711, here is a function that gives you the probability p from the z-score:
def getPercent(z)
  # the result is the upper-tail probability, so it is
  # practically 1 far to the left and 0 far to the right
  return 1 if z < -6.5
  return 0 if z > 6.5
  factk = 1
  sum = 0
  term = 1
  k = 0
  loopStop = Math.exp(-23)
  while term.abs > loopStop do
    term = 0.3989422804 * ((-1)**k) * (z**k) / (2*k+1) / (2**k) * (z**(k+1)) /factk
    sum += term
    k += 1
    factk *= k
  end
  sum += 0.5
  1-sum
end
puts getPercent(1.76)
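The answer above covers only the z-score to probability direction. As a point of comparison, here is a short Python sketch of both directions using statistics.NormalDist from the standard library (Python 3.8 or later). The function names upper_tail and z_from_upper_tail are made up for this example, and "probability" is taken to mean the upper-tail probability P(Z > z), which matches the two examples in the question.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 1.0 - _STD_NORMAL.cdf(z)

def z_from_upper_tail(p):
    """Inverse of upper_tail: the z-score whose upper-tail probability is p (0 < p < 1)."""
    return _STD_NORMAL.inv_cdf(1.0 - p)

if __name__ == "__main__":
    print(upper_tail(0))
    print(upper_tail(1.76))
    print(z_from_upper_tail(upper_tail(1.76)))
```

The inverse direction in Ruby would need either a library or a numerical approximation of the inverse normal CDF; this sketch does not attempt to port one.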

Finding a range of numbers of a file in another file using awk

I have lots of files like this:
3
10
23
.
.
.
720
810
980
And a much bigger file like this:
2 0.004
4 0.003
6 0.034
.
.
.
996 0.01
998 0.02
1000 0.23
What I want to do is find in which range of the second file my first file falls and then estimate the mean of the values in the 2nd column of that range.
Thanks in advance.
NOTE
The numbers in the files do not necessarily follow an easy pattern like 2,4,6...
Since your smaller files are sorted, you can pull out the first and last rows to get the min and max. Then you just need to go through bigfile with an awk script to compute the mean.
So for each small file small you would run the script
awk -v start="$(head -n 1 small)" -v end="$(tail -n 1 small)" -f script bigfile
where script can be something simple like
BEGIN {
    sum = 0;
    count = 0;
    range_start = -1;
    range_end = -1;
}
{
    irow = int($1)
    ival = $2 + 0.0
    if (irow >= start && end >= irow) {
        if (range_start == -1) {
            range_start = NR;
        }
        sum = sum + ival;
        count++;
    }
    else if (irow > end) {
        if (range_end == -1) {
            range_end = NR - 1;
        }
    }
}
END {
    # guard against an empty range to avoid a division by zero
    print "start =", range_start, "end =", range_end, "mean =", (count ? sum / count : "NaN")
}
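The same min/max approach can be sketched in Python to check the awk output on a small sample. The function name mean_in_range is hypothetical. It assumes the small file has been read into a list of numbers and the big file into (key, value) pairs.

```python
def mean_in_range(small_values, big_rows):
    """Mean of the values in big_rows whose key lies within [min, max] of small_values."""
    lo, hi = min(small_values), max(small_values)
    selected = [value for key, value in big_rows if lo <= key <= hi]
    if not selected:
        return None   # no row of the big file falls inside the range
    return sum(selected) / len(selected)

if __name__ == "__main__":
    big = [(2, 0.004), (4, 0.003), (6, 0.034), (996, 0.01), (998, 0.02), (1000, 0.23)]
    print(mean_in_range([3, 10, 23], big))
```

Using min() and max() instead of the first and last rows means the small file does not have to be sorted, at the cost of reading it fully into memory.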
You can try below:
for r in *; do
awk -v r=$r -F' ' \
'NR==1{b=$2;v=$4;next}{if(r >= b && r <= $2){m=(v+$4)/2; print m; exit}; b=$2;v=$4}' bigfile.txt
done
Explanation:
On the first line it saves columns 2 and 4 into temporary variables. On every other line it checks whether the filename r lies between the start of the range (previous column 2) and the end of the range (current column 2).
If it does, it works out the mean of the two saved values and prints the result.

Round a ruby float up or down to the nearest 0.05

I'm getting numbers like
2.36363636363636
4.567563
1.234566465448465
10.5857447736
How would I get Ruby to round these numbers up (or down) to the nearest 0.05?
[2.36363636363636, 4.567563, 1.23456646544846, 10.5857447736].map do |x|
  (x*20).round / 20.0
end
#=> [2.35, 4.55, 1.25, 10.6]
Check this link out, I think it's what you need.
Ruby rounding
class Float
  def round_to(x)
    (self * 10**x).round.to_f / 10**x
  end
  def ceil_to(x)
    (self * 10**x).ceil.to_f / 10**x
  end
  def floor_to(x)
    (self * 10**x).floor.to_f / 10**x
  end
end
In general, the algorithm for “rounding to the nearest x” is:
round(x / precision) * precision
Sometimes it is better to multiply by 1 / precision, because it is an integer (and thus it works a bit faster):
round(x * (1 / precision)) / (1 / precision)
In your case that would be:
round(x * (1 / 0.05)) / (1 / 0.05)
which would evaluate to:
round(x * 20) / 20;
I don’t know any Python, though, so the syntax might not be correct but I’m sure you can figure it out.
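Written out as runnable Python, the multiply-by-1/precision variant could look like the sketch below. The name round_to_step is made up for this example, and it assumes that 1 / step is a whole number (as it is for 0.05, 0.1, 0.25 or 0.5). Note that Python's built-in round sends exact ties to the nearest even integer, which can differ from Ruby's default for values that fall exactly halfway between two steps.

```python
def round_to_step(x, step=0.05):
    """Round x to the nearest multiple of step, assuming 1/step is a whole number."""
    per_unit = round(1 / step)          # e.g. 20 steps per unit for step = 0.05
    # dividing by an integer keeps results such as 2.35 free of float noise
    return round(x * per_unit) / per_unit

if __name__ == "__main__":
    for value in [2.36363636363636, 4.567563, 1.234566465448465, 10.5857447736]:
        print(round_to_step(value))
```

Dividing by the integer number of steps per unit, rather than multiplying by the float step, is what avoids results like 2.3500000000000001.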
Less precise, but this method is what most people are googling this page for:
(5.65235534).round(2)
#=> 5.65
Here's a general function that rounds by any given step value:
place in lib:
lib/rounding.rb
class Numeric
  # round a given number to the nearest step
  def round_by(increment)
    (self / increment).round * increment
  end
end
and the spec:
require 'rounding'
describe 'nearest increment by 0.5' do
  {0=>0.0,0.5=>0.5,0.60=>0.5,0.75=>1.0, 1.0=>1.0, 1.25=>1.5, 1.5=>1.5}.each_pair do |val, rounded_val|
    it "#{val}.round_by(0.5) ==#{rounded_val}" do val.round_by(0.5).should == rounded_val end
  end
end
and usage:
require 'rounding'
2.36363636363636.round_by(0.05)
hth.
It’s possible to round numbers with String class’s % method.
For example
"%.2f" % 5.555555555
would give "5.56" as result (a string).
Since Ruby 2.4, round accepts a half: keyword that controls how ties are rounded; the default still rounds halves away from zero:
(2.5).round
#=> 3
(2.5).round(half: :even)
#=> 2
The available options are :even, :up and :down, e.g.:
(4.5).round(half: :up)
#=> 5
To get a rounding result without decimals, use Float's .round
5.44.round
=> 5
5.54.round
=> 6
I know the question is old, but I'd like to share my invention in case it helps others: this is a method for rounding a float with a step, i.e. rounding to the closest multiple of a given number. It's useful for rounding product prices, for example:
def round_with_step(value, rounding)
  decimals = rounding.to_i
  rounded_value = value.round(decimals)
  step_number = (rounding - rounding.to_i) * 10
  if step_number != 0
    step = step_number * 10**(0-decimals)
    rounded_value = ((value / step).round * step)
  end
  return (decimals > 0 ? "%.2f" : "%g") % rounded_value
end
# For example, the value is 234.567
#
# | ROUNDING | RETURN | STEP
# | 1 | 234.60 | 0.1
# | -1 | 230 | 10
# | 1.5 | 234.50 | 5 * 0.1 = 0.5
# | -1.5 | 250 | 5 * 10 = 50
# | 1.3 | 234.60 | 3 * 0.1 = 0.3
# | -1.3 | 240 | 3 * 10 = 30
