Algorithm to smooth numbers with variable input time

I have an app that accepts integers at a variable rate every .25 to 2 seconds.
I'd like to output the data in a smoothed format for 3, 5 or 7 seconds depending on user input.
If the data always came in at the same rate, let's say every .25 seconds, then this would be easy. The variable rate is what confuses me.
Data might come in like this:
Time - Data
0.25 - 100
0.50 - 102
1.00 - 110
1.25 - 108
2.25 - 107
2.50 - 102
etc...
I'd like to display a 3 second rolling average every .25 seconds on my display.
The simplest form of doing this is to put each item into an array with a time stamp.
array.push([0.25, 100])
array.push([0.50, 102])
array.push([1.00, 110])
array.push([1.25, 108])
etc...
Then every .25 seconds I would read through the array, back to front, until I got to a time that was less than now() - rollingAverageTime. I would average those values and display the result, then .shift() the expired entries off the front of the array.
That seems not very efficient though. I was wondering if someone had a better way to do this.

Why don't you save the timestamp of the starting value, accumulate the values and the number of samples until you get a timestamp that is >= startingTime + rollingAverageTime, and then divide the accumulator by the number of samples taken?
EDIT:
If you want to preserve the number of samples, you can do this way:
Take the accumulator, and for each input value add it to the accumulator and store the value and its timestamp in a shift register. At every cycle, compare the latest sample's timestamp with the oldest timestamp in the shift register plus the smoothing time; if it's equal or greater, subtract the oldest saved value from the accumulator, delete that entry from the shift register, and output the accumulator divided by the smoothing time. If you iterate, you obtain a rolling average with (I think) the least amount of computation for each cycle:
a sum (to increment the accumulator)
a sum and a subtraction (to compare the timestamp)
a subtraction (from the accumulator)
a division (to calculate the average, done in a smart way can be a shift right)
For a total of about four algebraic sums and a division (or shift).
EDIT:
To take the time since the last sample into account as a weighting factor, you can weight each value by the ratio between this time and the averaging time, and you obtain an already weighted average, without having to divide the accumulator.
I added this part because it doesn't add computational load, so you can implement it quite easily if you want to.
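
A minimal Python sketch of this accumulator-plus-shift-register idea (class and variable names are mine, and here the accumulator is divided by the number of samples, as in the first suggestion, rather than by the smoothing time):

from collections import deque

class RollingAverage:
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.samples = deque()   # the "shift register": (timestamp, value) pairs
        self.total = 0.0         # the accumulator

    def add(self, timestamp, value):
        self.samples.append((timestamp, value))
        self.total += value
        # drop samples older than the smoothing window and take them out of the accumulator
        while self.samples and timestamp - self.samples[0][0] >= self.window:
            _, old_value = self.samples.popleft()
            self.total -= old_value

    def average(self):
        return self.total / len(self.samples) if self.samples else 0.0

avg = RollingAverage(3.0)
for t, v in [(0.25, 100), (0.50, 102), (1.00, 110), (1.25, 108), (2.25, 107), (2.50, 102)]:
    avg.add(t, v)
print(avg.average())

Each update is one addition plus, at most, a few subtractions when old samples expire, which stays close to the operation count listed above.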

The answer from clabacchio has the basics right, but perhaps you need a somewhat more sophisticated answer.
Calculating the average:
0.25 - 100
0.50 - 102
1.00 - 110
In the above subset of the data what is the answer you want? You could use the mean of these numbers or you could do it in a weighted fashion. You could convert the data into:
0.50 - 0.25 = 0.25 ---- (100+102)/2 = 101
1.00 - 0.50 = 0.50 ---- (102+110)/2 = 106
Then you can take the weighted average of these values, weight being the time difference, and value being the average value.
The final answer = (0.25*101 + 0.5*106)/(0.25+0.5) = 78.25/0.75 ≈ 104.33.
Now coming to "moving" averages:
You can either use previous k values or previous k seconds worth of data. In both cases you can keep two sums: weighted sum and sum of weights.
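
A small Python sketch of this weighted scheme (identifiers are mine; it keeps the interval/mean pairs of the window together with the two running sums):

from collections import deque

def weighted_rolling_average(samples, window_seconds):
    # samples: list of (timestamp, value) pairs, oldest first
    intervals = deque()    # (end_time, weight, mean_value) for each interval in the window
    weighted_sum = 0.0     # sum of weight * value
    weight_sum = 0.0       # sum of weights
    out = []
    prev_t, prev_v = samples[0]
    for t, v in samples[1:]:
        w = t - prev_t                # weight = time difference
        mid = (prev_v + v) / 2.0      # value = average over the interval
        intervals.append((t, w, mid))
        weighted_sum += w * mid
        weight_sum += w
        # expire intervals that ended more than window_seconds before the newest sample
        while intervals and t - intervals[0][0] > window_seconds:
            _, old_w, old_mid = intervals.popleft()
            weighted_sum -= old_w * old_mid
            weight_sum -= old_w
        out.append(weighted_sum / weight_sum)
        prev_t, prev_v = t, v
    return out

print(weighted_rolling_average([(0.25, 100), (0.50, 102), (1.00, 110)], 3.0))
# [101.0, 104.333...] -- the second value matches the worked example above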

So... the worst case scenario is 4 readings per second over 7 seconds = 28 values in your array to process. That will be done in nanoseconds anyway, so not worth optimizing IMHO.

Reduced chi-square too low (close to 0) after weighted fit - convolution integral - Python lmfit

I'm fitting the following data where t: time (s), G: counts per second, f: impulse function (mm/s):
t G f
0 4.58 0
900 11.73 (11/900)
1800 18.23 (8.25/900)
2700 19.33 (3/900)
3600 19.04 (0.5/900)
4500 17.21 0
5400 12.98 0
6300 11.59 0
7200 9.26 0
8100 7.66 0
9000 6.59 0
9900 5.68 0
10800 5.1 0
Using the following convolution integral:
G(t) = A * ∫ f(τ) * exp(-lambda_1*(t-τ)) dτ + B * ∫ f(τ) * exp(-lambda_2*(t-τ)) dτ + C
where lambda_1 = 0.000431062 and lambda_2 = 0.000580525.
The code used to perform that fitting is:
import numpy as np
from lmfit import Parameters, minimize, report_fit

# decay constants given above
lambda_1 = 0.000431062
lambda_2 = 0.000580525

# Extract data into numpy arrays (df is the pandas DataFrame holding the table above)
t = df['t'].as_matrix()
g = df['G'].as_matrix()
f = df['f'].as_matrix()

# add parameters
params = Parameters()
params.add('a', value=1)
params.add('b', value=0.7)
params.add('c', value=1)

# define functions
def exp(x, k):
    return np.exp(-x * k)

def residuals(params, x, y):
    A = params['a'].value
    B = params['b'].value
    C = params['c'].value
    dt = x[2] - x[1]
    model = (A * np.convolve(exp(x, lambda_1), f)[:len(x)] * dt
             + B * np.convolve(exp(x, lambda_2), f)[:len(x)] * dt
             + C)
    weights = 1 / np.sqrt(y)
    return (model - y) * weights

# perform fit using leastsq
result = minimize(residuals, params, args=(t, g))
final = g + result.residual
report_fit(result)
It works; however, I obtain a very low reduced chi-square (around 0) when I multiply the residual to be minimized by the weight 1/np.sqrt(g) (weighted fit). If I do not take the weight into account (non-weighted fit), I obtain a reduced chi-square of 0.254. I would like to obtain a reduced chi-square around 1.
A reduced chi-square far below 1 would imply that your estimate of the uncertainty in the data is far too large. If I read your example correctly, you are using the square-root of G as the uncertainty in G. Using the square root is a standard approach for estimating uncertainties in values dominated by counting statistics.
But... your G is a floating point number that you describe as counts per second. I might assume counts per second over 900 seconds.
If that is right (and we assume for simplicity no significant uncertainty in that time duration), then the uncertainties should be 30x smaller than you have them. That is, you are using
g_values = [4.58 , 11.73, 18.23]
g_uncertainties = sqrt(g_values) = [2.1401, 3.4249, 4.2697]
but the uncertainties in the counts would be sqrt(g_values*900), and so the uncertainties in counts per second would be sqrt(g_values*900)/900 = sqrt(g_values)/30.
More formally, the uncertainties in a value representing "counts per time" would add the uncertainties in counts and the uncertainties in time in quadrature. But again, the uncertainties in your time are probably very small (or, at least your time data implies that it is below 1 second).
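
In code terms the correction is tiny; a sketch (assuming a 900-second counting interval per point, as above):

import numpy as np

g = np.array([4.58, 11.73, 18.23])   # counts per second
t_count = 900.0                      # assumed counting time per point, in seconds

counts = g * t_count                 # total counts per point
sigma_counts = np.sqrt(counts)       # Poisson uncertainty in the counts
sigma_g = sigma_counts / t_count     # uncertainty in counts per second, i.e. sqrt(g)/30

weights = 1.0 / sigma_g              # use this in residuals() instead of 1/np.sqrt(y)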

Simple weighted rating value

I have a database of clubs and the ratings people have given them.
Currently, I average the ratings for each club and then sort these averages in descending order to get a list of the highest-rated clubs.
The problem is that there should be some weighting based on how many ratings a club has. A club might get five 5.0 ratings and end up at the top of the list, ahead of a club that has 16K ratings and also averages 5.0.
What I'm looking for is an algorithm that factors in the number of ratings, so the ranking is weighted by how many ratings a club has.
Currently my algorithm is:
(sum of club ratings)/(total number of ratings) to give me the average
This does not incorporate the weight algorithm
Let's suppose your ratings can go from 0k to 100k (as you said some club has a 16k rating). Now you want that to be normalized to a range of 0k to 5k.
Let's say 0k to 100k is the actual range (A_lower to A_higher),
and 0k to 5k is the normalized range (N_lower to N_higher).
You want to change 16k, which is A_rating (the actual rating), to a normalized value N_rating (in between 0 and 5k).
The formula that you can use for this is
N_rating = A_rating * ( (N_higher - N_lower) / (A_higher - A_lower) )
Let's take an example. If the actual rating is 25k, the actual range is 0 to 100k, and you want it normalized to between 0 and 5k, then
N_rating = 25 * ( (5 - 0) / (100 - 0) )
=> N_rating = 1.25
EDIT
A little more explanation
We do normalization when values are spread over a big range and we want to represent them in a smaller range.
Q) What is a normalized value?
It is the value that would represent the exact place of the actual value (25k) if the actual range (0 to 100) were a little smaller (0 to 5).
Q) Why am I dividing the normalized range by the actual range and then multiplying by the actual rating?
To understand this, let's use a bit of unitary-method logic.
You have the value 25 when the range is 0 to 100, and you want to know what the value would be normalized to if the range were 0 to 5. So,
//We will take already known values, the highest ones in both the ranges
100 is similar to 5 //the higher value of both the ranges
//In unitary method this would go like
If 100 is 5
//then
1 is (5 / 100)
//and
x is x * (5 / 100) //we put 25 in place of x here
Q) Why did you choose 0 to 5k as the normalized range?
I chose it because you mentioned your rating should be below 5k. You can choose any range you wish.
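
In code, the formula is simply the following (a sketch; the function name is mine, and like the formula above it assumes both ranges start at zero):

def normalize(a_rating, a_lower, a_higher, n_lower, n_higher):
    # N_rating = A_rating * ((N_higher - N_lower) / (A_higher - A_lower))
    return a_rating * (n_higher - n_lower) / (a_higher - a_lower)

print(normalize(25, 0, 100, 0, 5))   # 1.25, matching the worked example above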
What about simply adding the number of ratings, weighted with a very small value?
This is just a very basic idea:
(sum of club ratings)/(total number of ratings)+0.00000001*(number of club ratings)
This way, clubs with the same average get ranked by their number of ratings.
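
A sketch of that scoring in Python (the function name and the epsilon value are mine):

def club_score(sum_of_ratings, number_of_ratings, epsilon=1e-8):
    # average rating plus a tiny bonus per rating, so ties are broken by rating count
    return sum_of_ratings / number_of_ratings + epsilon * number_of_ratings

# two clubs with a 5.0 average: the one with 16000 ratings now sorts higher
print(club_score(5.0 * 5, 5) < club_score(5.0 * 16000, 16000))   # True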

Microsoft Excel cumulative calculation performance

I know 2 ways to calculate cumulative values in Excel.
1st method:
A B
Value Cumulative total
9 =A1
8 =B1+A2
7 =B2+A3
6 =B3+A4
2nd method:
A B
Value Cumulative total
9 =SUM($A$1:A1)
8 =SUM($A$1:A2)
7 =SUM($A$1:A3)
6 =SUM($A$1:A4)
2 questions:
Which method has better performance when the data set gets really big (say 100k rows)? The 1st method seems to have less overhead, because when a new value is added in column A (Value), the new cell in column B only needs to compute B(n-1)+A(n). In the 2nd method, is Excel smart enough to do something similar, or will it add up all the rows from A1:A(n)?
What's the best way to calculate the cumulative values? I found the 2nd method is more popular, though I doubt its performance. The only upside I can see for the 2nd method is that the formula in the column B cells is more consistent; in the 1st method, the 1st cell in column B has to be determined in advance.
Test setup: number sequence 9, 8, 7, 6, -9, -8, -7, -6; workbook set to manual calculation, triggered by the following code:
Sub ManualCalc()
    Dim R As Range
    Set R = Selection
    [F1] = Now()                    ' start time
    R.Worksheet.Calculate           ' recalculate the selected sheet
    [F2] = Now()                    ' end time
    [F3] = ([F2] - [F1]) * 86400    ' elapsed time in seconds
End Sub
At 4096 rows the calculation time is not measurable for either variant (0 seconds). At 65536 rows your 1st method is still not measurable, while your 2nd method takes a bit less than 8 seconds on my laptop (Dell Latitude E6420, Win7, Office2010 - average of 3 measurements each). For a high number of rows I would therefore prefer method 1.
Regarding your Q1 ... yes it would add 100k sums of ever growing ranges ... Excel is not supposed to be smart, it's supposed to calculate whatever you ask it to calculate. If it did, it would interpret the intention of a set of formulas at runtime which I'd regard as very dangerous!

Generating strongly biased random numbers for tests

I want to run tests with randomized inputs and need to generate 'sensible' random
numbers, that is, numbers that match well enough to pass the tested function's
preconditions, but hopefully wreak havoc deeper inside its code.
math.random() (I'm using Lua) produces uniformly distributed random
numbers. Scaling these up will give far more big numbers than small numbers,
and there will be very few integers.
I would like to skew the random numbers (or generate new ones using the old
function as a randomness source) in a way that strongly favors 'simple' numbers,
but will still cover the whole range, i.e., extending up to positive/negative infinity
(or ±1e309 for double). This means:
numbers up to, say, ten should be most common,
integers should be more common than fractions,
numbers ending in 0.5 should be the most common fractions,
followed by 0.25 and 0.75; then 0.125,
and so on.
A different description: Fix a base probability x such that the probabilities
will sum to one, and define the probability of a number n as x^k,
where k is the generation in which n is constructed as a surreal
number [1]. That assigns x to 0, x^2 to -1 and +1,
x^3 to -2, -1/2, +1/2 and +2, and so on. This
gives a nice description of something close to what I want (it skews a bit too
much), but is near-unusable for computing random numbers. The resulting
distribution is nowhere continuous (it's fractal!), I'm not sure how to
determine the base probability x (I think for infinite precision it would be
zero), and computing numbers based on this by iteration is awfully
slow (spending near-infinite time to construct large numbers).
Does anyone know of a simple approximation that, given a uniformly distributed
randomness source, produces random numbers very roughly distributed as
described above?
I would like to run thousands of randomized tests, quantity/speed is more
important than quality. Still, better numbers mean fewer inputs get rejected.
Lua has a JIT, so performance is usually not much of an issue. However, jumps based
on randomness will break every prediction, and many calls to math.random()
will be slow, too. This means a closed formula will be better than an
iterative or recursive one.
[1] Wikipedia has an article on surreal numbers, with
a nice picture. A surreal number is a pair of two surreal
numbers, i.e. x := {n|m}, and its value is the number in the middle of the
pair, i.e. (for finite numbers) {n|m} = (n+m)/2 (as rational). If one side
of the pair is empty, that's interpreted as increment by one (or decrement by one, if the
left side is empty). If both sides are empty, that's zero. Initially, there are
no numbers, so the only number one can build is 0 := { | }. In generation
two one can build numbers {0| } =: 1 and { |0} =: -1, in three we get
{1| } =: 2, {|-1} =: -2, {0|1} =: 1/2 and {-1|0} =: -1/2 (plus some
more complex representations of known numbers, e.g. {-1|1} = 0). Note that
e.g. 1/3 is never generated by finite numbers because it is an infinite
fraction – the same goes for floats, 1/3 is never represented exactly.
How's this for an algorithm?
Generate a random float in (0, 1) with a library function
Generate a random integral roundoff point according to a desired probability density function (e.g. 0 with probability 0.5, 1 with probability 0.25, 2 with probability 0.125, ...).
'Round' the float by that roundoff point (e.g. floor((float_val << roundoff)+0.5))
Generate a random integral exponent according to another PDF (e.g. 0, 1, 2, 3 with probability 0.1 each, and decreasing thereafter)
Multiply the rounded float by 2^exponent.
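
A rough Python sketch of these steps; the particular probability densities and the reading of the 'rounding' step (keeping roundoff binary places) are my own interpretation of the outline above:

import random

def geometric(p):
    # returns 0 with probability p, 1 with probability p*(1-p), 2 with p*(1-p)^2, ...
    k = 0
    while random.random() >= p:
        k += 1
    return k

def biased_random():
    x = random.random()                         # step 1: uniform float in (0, 1)
    roundoff = geometric(0.5)                   # step 2: 0 half the time, 1 a quarter, ...
    x = round(x * 2**roundoff) / 2**roundoff    # step 3: keep only `roundoff` binary places
    exponent = geometric(0.25)                  # step 4: small exponents favoured
    return x * 2**exponent                      # step 5: scale by a power of two

print([biased_random() for _ in range(10)])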
For a surreal-like decimal expansion, you need a random binary number.
Even bits tell you whether to stop or continue, odd bits tell you whether to go right or left on the tree:
> 0... => 0.0 [50%] Stop
> 100... => -0.5 [<12.5%] Go, Left, Stop
> 110... => 0.5 [<12.5%] Go, Right, Stop
> 11100... => 0.25 [<3.125%] Go, Right, Go, Left, Stop
> 11110... => 0.75 [<3.125%] Go, Right, Go, Right, Stop
> 1110100... => 0.125
> 1110110... => 0.375
> 1111100... => 0.625
> 1111110... => 0.875
One way to quickly generate a random binary number is by looking at the decimal digits in math.random() and replacing 0-4 with '0' and 5-9 with '1':
0.8430419054348022
becomes
1000001010001011
which becomes -0.5
0.5513009827118367
becomes
1100001101001011
which becomes 0.5
etc
Haven't done much lua programming, but in Javascript you can do:
Math.random().toString().substring(2).split("").map(
function(digit) { return digit >= "5" ? 1 : 0 }
);
or true binary expansion:
Math.random().toString(2).substring(2)
Not sure which is more genuinely "random" -- you'll need to test it.
You could generate surreal numbers in this way, but most of the results will be decimals in the form a/2^b, with relatively few integers. On Day 3, only 2 integers are produced (-3 and 3) vs. 6 decimals, on Day 4 it is 2 vs. 14, and on Day n it is 2 vs (2^n-2).
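
A rough Python version of the bit-walk decoding described above, drawing the bits directly instead of remapping decimal digits (the function name is mine):

import random

def random_dyadic():
    # even bits = stop/continue, odd bits = left/right, as in the table above
    value = 0.0
    step = 0.5
    while random.random() < 0.5:       # "go" bit: continue with probability 1/2
        if random.random() < 0.5:      # "direction" bit: left subtracts, right adds
            value -= step
        else:
            value += step
        step /= 2
    return value

print([random_dyadic() for _ in range(10)])
# 0.0 about half the time, -0.5 and 0.5 about 12.5% each, then 0.25/0.75/-0.25/-0.75, ...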
If you add two uniform random numbers from math.random(), you get a new distribution which has a "triangle" like distribution (linearly decreasing from the center). Adding 3 or more will get a more 'bell curve' like distribution centered around 0:
math.random() + math.random() + math.random() - 1.5
Dividing by a random number will get a truly wild number:
A/(math.random()+1e-300)
This will return results between A and (theoretically) A*1e+300,
though my tests show that 50% of the time the results are between A and 2*A
and about 75% of the time between A and 4*A.
Putting them together, we get:
round(6*(math.random()+math.random()+math.random() - 1.5)/(math.random()+1e-300))
This has over 70% of the numbers returned between -9 and 9, with a few big numbers popping up rarely.
Note that the average and sum of this distribution will tend to diverge towards a large negative or positive number, because the more times you run it, the more likely it is for a small number in the denominator to cause the number to "blow up" to a large number such as 147,967 or -194,137.
See gist for sample code.
Josh
You can immediately calculate the nth born surreal number.
Example, the 1000th Surreal number is:
convert to binary:
1000 dec = 1111101000 bin
1's become pluses and 0's minuses:
1111101000
+++++-+---
The first '1' bit has value 0; the next run of identical bits contributes +1 (for 1's) or -1 (for 0's) each; after that, each subsequent bit contributes 1/2, 1/4, 1/8, etc., with the sign given by the bit.
1 1 1 1 1 0 1 0 0 0
+ + + + + - + - - -
0 1 1 1 1 h h h h h
+0+1+1+1+1-1/2+1/4-1/8-1/16-1/32
= 3+17/32
= 113/32
= 3.53125
The binary length in bits of this representation is equal to the day on which that number was born.
Left and right numbers of a surreal number are the binary representation with its tail stripped back to the last 0 or 1 respectively.
Surreal numbers have an even distribution between -1 and 1, where half of the numbers created up to a particular day will lie. A quarter of the numbers are evenly distributed between -2 and -1 and between 1 and 2, and so on. The maximum range will be the negative to positive integers matching the number of days you provide. The numbers go to infinity slowly, because each day only adds one to the negative and positive ranges, while each day contains twice as many numbers as the last.
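
A Python sketch of this construction, using exact fractions (the function name is mine):

from fractions import Fraction

def nth_surreal(n):
    # n >= 1; read the binary expansion of n as a path through the birth tree
    bits = bin(n)[2:]            # e.g. 1000 -> '1111101000'
    value = Fraction(0)          # the leading '1' bit is worth 0
    rest = bits[1:]
    i = 0
    # integer phase: the run of identical bits right after the leading bit, each worth +1 or -1
    if rest:
        run_bit = rest[0]
        while i < len(rest) and rest[i] == run_bit:
            value += 1 if run_bit == '1' else -1
            i += 1
    # fraction phase: each further bit adds or subtracts 1/2, 1/4, 1/8, ...
    step = Fraction(1, 2)
    while i < len(rest):
        value += step if rest[i] == '1' else -step
        step /= 2
        i += 1
    return value

print(nth_surreal(1000))   # 113/32, i.e. 3.53125, matching the worked example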
Edit:
A good name for this bit representation is "sinary"
Negative numbers are transpositions. Example:
100010101001101s -> negative number (always starts 10...)
111101010110010s -> positive number (always starts 11...)
and we notice that all bits flip except the first one, which is the transposition.
NaN is => 0s (since all other numbers start with 1), which makes it ideal for representation in bit registers in a computer, since leading zeros are required (we don't make ternary computers anymore... too bad).
All Conway surreal algebra can be done on these numbers without needing to convert to binary or decimal.
The sinary format can be seen as a one plus a simple ones counter with a 2's complement decimal representation attached.
Here is an incomplete report on finary (similar to sinary): https://github.com/peawormsworth/tools/blob/master/finary/Fine%20binary.ipynb

Simulating "Wheel of fortune" (Monte Carlo Simulation Hit or Miss Method)

I'm trying to make a randomizer that will use the Monte Carlo Hit or Miss Simulation.
I have a Key-Value pair that represents the ID and the probability value:
ID - Value
2 - 0.37
1 - 0.35
4 - 0.14
3 - 0.12
When you add all of those values, you will get a total of 1.0.
You can imagine those values as the total area of a "slice" on the "wheel" (EG: ID 2 occupies 37% of the wheel, while ID 3 only occupies 12% of the wheel). When converted to "range" it will look like this:
ID - Value - Range
2 - 0.37 - 0 to 37
1 - 0.35 - 37 to 72
4 - 0.14 - 72 to 86
3 - 0.12 - 86 to 100
Now, I am using Random.NextDouble() to generate a random value that is between 0.0 and 1.0. That random value will be considered as the "spin" on the wheel. Say, the randomizer returns 0.35, then ID 2 will be selected.
What is the best way to implement this given that I have an array of doubles?
The simplest solutions are often the best. If your range is 0 - 100 by design (or another manageably small number), you can allocate an int[] and use the table of ranges you created to fill in the ID at the corresponding index. Your "throw" will then look like:
int randomID = rangesToIDs[random.nextInt(rangesToIDs.length)];
Btw, it is not necessary to sort the IDs by range size: since the random numbers are assumed to be uniformly distributed, it does not matter where in the lookup table a range is placed. It only matters that the number of entries for an ID is proportional to the chance of throwing that ID.
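
A small Python sketch of that lookup table, built from the range table in the question (names are illustrative):

import random

# ranges from the question, as ID -> (low, high) in percent
ranges = {2: (0, 37), 1: (37, 72), 4: (72, 86), 3: (86, 100)}

# one table slot per percent; each ID fills the slots covered by its range
ranges_to_ids = [None] * 100
for id_, (low, high) in ranges.items():
    for slot in range(low, high):
        ranges_to_ids[slot] = id_

def spin():
    # the "throw": a uniformly random slot, returning the ID stored there
    return ranges_to_ids[random.randrange(len(ranges_to_ids))]

print(spin())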
Let's assume your initial data is represented as array D[n], where D[i] = (id, p) and sum(D[i].p for i=0..n-1) == 1.
Build a second array P[n] such that P[i] = (q, id), with P[i] = (sum(D[j].p for j in 0..i), D[i].id) -- i.e., convert the individual probability of each slice i into the cumulative probability of all slices preceding i (inclusive). Note that, by definition, this array P is ordered by field q (i.e. by cumulative probability).
Now you can use binary search to find the slice chosen by the random number r (0 <= r <= 1):
find the lowest i such that r <= P[i].q; then P[i].id is your slice.
It is possible to speed up the lookup further by hashing the probability range with a fixed grid. I can write more details on this if anybody is interested.
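
A Python sketch of the cumulative-probability approach, using bisect for the binary search (list names are mine):

import random
from bisect import bisect_left

ids   = [2, 1, 4, 3]
probs = [0.37, 0.35, 0.14, 0.12]

# cumulative probability of all slices up to and including i: [0.37, 0.72, 0.86, ...]
cumulative = []
total = 0.0
for p in probs:
    total += p
    cumulative.append(total)

def spin():
    r = random.uniform(0.0, cumulative[-1])   # guards against the probabilities not summing to exactly 1
    i = bisect_left(cumulative, r)            # first slice whose cumulative probability >= r
    return ids[i]

print(spin())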
As jk wrote, a sorted dictionary should be fine.
let's say you got dictionary like this:
0.37 2
0.72 1
0.86 4
1.00 3
You roll xx = 0.66.
Iterate through the dictionary starting from the lowest key (that's 0.37):
if xx < dict[i].key
    return dict[i].value
Another solution that comes to mind is a list of custom objects containing a lower bound, an upper bound and a value. You then iterate through the list and check whether the rolled number lies between the lower and upper bounds.
A sorted map/dictionary with the 'Value' as the key and the 'ID' as the value would allow you to quickly find the upper bound of the range you are in and then look up the ID for that range.
Assuming your dictionary allows it, a binary search would be better for finding the upper bound than iterating through the entire dictionary.
boundaries = [37, 72, 86, 100]
ids = [2, 1, 4, 3]
num = 100 * random()
for i, b in enumerate(boundaries):
    if num < b: return ids[i]
