Suitable formula/algorithm for detecting temperature fluctuations

I'm creating an app to monitor water quality. The temperature data is pushed to a Firebase Realtime Database every 2 minutes. The app has two requirements:
1) It should alert the user when the temperature exceeds 33 degrees or drops below 23 degrees - this part is done.
2) It should alert the user when there is a big temperature fluctuation, analysing the data every 30 minutes - this is the part I'm confused about.
I don't know what algorithm to use to detect a big temperature fluctuation over a period of time and alert the user. Can someone help me with this?

For a period of 30 minutes, your app would collect 15 values.
If you want to detect a big change in this data, one way is to implement the following method:
Calculate the mean and the standard deviation of the values.
Subtract the mean from each value and take the absolute value of the result.
If any absolute deviation is greater than one standard deviation, you have a big fluctuation.
See this example for a better understanding:
Suppose you have these values covering 10 minutes:
25, 27, 24, 35, 28
First step:
Mean = 27.8
One (population) standard deviation ≈ 3.9
Second step: abs(value - mean)
abs(25 - 27.8) = 2.8
abs(27 - 27.8) = 0.8
abs(24 - 27.8) = 3.8
abs(35 - 27.8) = 7.2
abs(28 - 27.8) = 0.2
Third step:
Check whether any of the deviations is greater than the standard deviation.
abs(35 - 27.8) gives 7.2, which is greater than 3.9.
So there is a big fluctuation. If all the deviations are less than the standard deviation, there is no fluctuation.
You can refine the result by using two or three standard deviations as the threshold instead of one.
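The method above can be sketched in Python (a minimal sketch; `statistics` is in the standard library, and the one-standard-deviation threshold is the one from the answer):

```python
import statistics

def find_fluctuations(readings, num_devs=1.0):
    """Return readings that deviate from the window mean by more
    than num_devs population standard deviations."""
    mean = statistics.fmean(readings)
    # Population standard deviation, as in the worked example above.
    sigma = statistics.pstdev(readings)
    return [x for x in readings if abs(x - mean) > num_devs * sigma]

# Five samples = 10 minutes of data at one reading every 2 minutes.
window = [25, 27, 24, 35, 28]
print(find_fluctuations(window))  # [35] - only the 35 stands out
```

For a 30-minute window you would simply pass the 15 most recent readings instead.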

Start by defining what you mean by fluctuation.
You don't say what temperature scale you're using. Fahrenheit, Celsius, Rankine, or Kelvin?
Your sampling rate is a new data value every two minutes. Do you define fluctuation as the absolute value of the difference between the last point and current value? That's defensible.
If the maximum allowable absolute difference is some fraction or multiple of your 33 - 23 = 10 degree alert band, you're in business.
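A sketch of that last-point-vs-current definition (the 0.5 factor on the 10-degree band is a made-up example threshold):

```python
def big_jump(prev, curr, band=10.0, factor=0.5):
    """Flag a fluctuation when the change between two successive
    2-minute readings exceeds factor * the 23-33 degree alert band."""
    return abs(curr - prev) > factor * band

print(big_jump(26.0, 32.5))  # True: a 6.5-degree jump in 2 minutes
print(big_jump(26.0, 27.0))  # False: a 1-degree drift is normal
```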

Related

How to model normal distributed sensor measurements by day time?

I have a sensor that gives me some value every x minutes. In my database I store the values grouped by hour. So the input domain is an integer between 0 and 23. I want to learn the parameters for a Gaussian distribution (mean and variance). At the end I want to get the probability of a certain measurement level at any given time (hour of day).
The problem is that the peak of the distribution could be at midnight and in the afternoon there could be the minimum. In this case I have no idea how to model it with a Gaussian as the normal distribution does not seem to be suited for "circular" domains (after 23h comes 0h and not 24h, so it is not really continuous).
How could this be modeled? Can I use a normal Gaussian here or should I go for another kind of probability model?
Example:
Here I would expect the model to have mean = 23 or 0. But I guess when I learn parameters here, the mean would be exactly in the middle, because a Gaussian only has one peak.

How to detect the precise sampling interval from samples stored in a database?

A hardware sensor is sampled precisely (precise period of sampling) using a real-time unit. However, the time value is not sent to the database together with the sampled value. Instead, time of insertion of the record to the database is stored for the sample in the database. The DATETIME type is used, and the GETDATE() function is used to get current time (Microsoft SQL Server).
How can I reconstruct the precise sampling times?
As the sampling interval is (should be) 60 seconds exactly, there was no need earlier for more precise solution. (This is an old solution, third party, with a lot of historical samples. This way it is not possible to fix the design.)
For processing of the samples, I need to reconstruct the correct time instances for the samples. There is no problem with shifting the time of the whole sequence (that is, it does not matter whether the start time is rather off, not absolute). On the other hand, the sampling interval should be detected as precisely as possible. I also cannot be sure, that the sampling interval was exactly 60 seconds (as mentioned above). I also cannot be sure, that the sampling interval was really constant (say, slight differences based on temperature of the device).
When processing the samples, I want to get:
start time
the sampling interval
the sequence of the sample values
When reconstructing the samples, I need to convert it back to tuples:
time of the sample
value of the sample
Because of that, for the sequence with n samples, the time of the last sample should be equal to start_time + sampling_interval * (n - 1), and it should be reasonably close to the original end time stored in the database.
Think of the stored sample times as slightly oscillating around the real sample times (a constant delay between the sampling and the insertion into the database is not a problem here).
I was thinking about calculating the mean value and the corrected standard deviation for the interval calculated from the previous and current sample times.
Discontinuity detection: if a calculated interval is more than 3 sigma off the mean value, I would consider it a discontinuity in the sampled curve (say, the machine was switched off, or some external event led to missing samples). In that case, I want to start processing a new sequence. (The sampling frequency could also have been changed.)
Is there any well-known approach to the problem? If yes, can you point me to the article(s), or give me the name or acronym of the algorithm?
+1 to looking at the difference sequence. We can model the difference sequence as the sum of a low frequency truth (the true rate of the samples, slowly varying over time) and high frequency noise (the random delay to get the sample into the database). You want a low-pass filter to remove the latter.
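A minimal sketch of that pipeline (function names and the 3x threshold are illustrative; the median interval is used as the robust center instead of the question's mean/3-sigma test, because a single long outage would inflate the plain standard deviation so much that the gap itself might pass the test):

```python
import statistics

def split_on_gaps(times, factor=3.0):
    """Split insertion timestamps (seconds) into contiguous runs,
    cutting wherever an interval exceeds factor * the median interval."""
    diffs = [b - a for a, b in zip(times, times[1:])]
    typical = statistics.median(diffs)
    runs, current = [], [times[0]]
    for t, d in zip(times[1:], diffs):
        if d > factor * typical:
            runs.append(current)   # discontinuity: close this run
            current = []
        current.append(t)
    runs.append(current)
    return runs

def fit_run(run):
    """Estimate (start_time, sampling_interval) for one run; averaging
    the observed intervals is the simplest low-pass filter over them."""
    interval = (run[-1] - run[0]) / (len(run) - 1)
    return run[0], interval

# Roughly 60 s sampling with jitter, and one outage after the 4th sample.
ts = [0.0, 60.2, 119.8, 180.1, 900.0, 960.3, 1020.1]
runs = split_on_gaps(ts)
print([fit_run(r) for r in runs])  # two runs, each with interval ~60 s
```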

Why Jmeter Average Time not matching properly ?

When I execute my test plan in JMeter for 10, 50, 100... virtual users with a ramp-up period of 30 seconds and a loop count of 1, the Average response time I get does not match what I calculate with Average Time = (Min Time + Max Time) / 2.
Please check my attached image for the differences in Average time.
Can anyone suggest how I should understand this?
Thanks in advance.
Average: This is the average elapsed time of a set of results. It is the arithmetic mean of all the samples' response times.
The following equation show how the Average value (μ) is calculated:
μ = 1/n * Σi=1…n xi
An important thing to understand is that the mean value can be very misleading, as it does not show you how close (or far) your values are from the average. The main thing you should focus on is the standard deviation.
The standard deviation (σ) measures the mean distance of the values to their average (μ). In other words, it gives us a good idea of the dispersion or variability of the measures to their mean value.
The following equation show how the standard deviation (σ) is calculated:
σ = √( (1/n) · Σi=1…n (xi - μ)² )
So interpreting the standard deviation is wise, as the mean value could be the same for very different response-time distributions! If the deviation is low compared to the mean, your measurements are not dispersed (they are mostly close to the mean value) and the mean value is significant.
Min - The lowest elapsed time(response time) for the samples with the same label.
Max - The longest elapsed time (response time) for the samples with the same label.
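The mismatch in the question comes from comparing the arithmetic mean against the midrange, (Min + Max) / 2. These are different statistics and generally disagree, as a quick example shows (the sample times below are made-up values):

```python
samples = [120, 130, 135, 140, 900]  # elapsed times in ms; one slow outlier

average = sum(samples) / len(samples)          # what JMeter's Average reports
midrange = (min(samples) + max(samples)) / 2   # what the question computed

print(average)   # 285.0
print(midrange)  # 510.0 - a single outlier drags the midrange far off
```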
For further detail you could go through JMeter documentation and this blog. It will really help you to understand the concept.

Density of time events

I am working on an assignment where I am supposed to compute the density of an event. Let's say that a certain event happens 5 times within seconds, it would mean that it would have a higher density than if it were to happen 5 times within hours.
I have in my possession, the time at which the event happens.
I was first thinking about computing the elapsed time between each two successive events and then play with the average and mean of these values.
My problem is that I do not know how to accurately represent this notion of density through mathematics. Let's say that I have 5 events happening really close to each other, and then a long break, and then again 5 events happening really close to each other. I would like to be able to represent this as high density. How should I go about it?
In the last example, I understand that my mean won't be truly representative but that my standard deviation will show that. However, how could I have a single density value (let's say between 0 and 1) with which I could rank different events?
Thank you for your help!
I would try the harmonic mean, which represents the rate at which your events happen while still giving you an averaged time value. It is defined by H = n / (1/t1 + 1/t2 + ... + 1/tn), where the ti are the n intervals between successive events.
I think its behaviour is close to what you expect as it measures what you want, but not between 0 and 1 and with inverse tendencies (small values mean dense, large values mean sparse). Let us go through a few of your examples :
~5 events in an hour. For simplicity, suppose there are 10 minutes between events, giving 6 intervals. Then we have H = 6 / (6 * 1/10) = 10
~5 events in 10 minutes, then nothing until the end of the hour (50 minutes). Let us suppose all short intervals are 2.5 minutes, then H = 6 / (5/2.5 + 1/50) = 6 * 50 / 101 = 2.97
~5 events in 10 minutes, but this cycle restarts every half hour, so the last interval is 20 minutes instead of 50. Then we get H = 6 / (5/2.5 + 1/20) = 6 * 20 / 41 ≈ 2.93
As you can see the effect of the longer and rarer values in a set is diminished by the fact that we use inverses, thus less weight to the "in between bursts" behaviour. Also you can compare behaviours with the same "burst density" but that do not happen at the same frequency, and you will get numbers that are close but whose ordering still reflects this difference.
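The three worked examples can be reproduced with Python's `statistics.harmonic_mean` over the inter-event intervals (intervals in minutes, as above):

```python
import statistics

def interval_harmonic_mean(intervals):
    """Harmonic mean of inter-event intervals: small = dense bursts."""
    return statistics.harmonic_mean(intervals)

print(round(interval_harmonic_mean([10] * 6), 2))          # 10.0
print(round(interval_harmonic_mean([2.5] * 5 + [50]), 2))  # 2.97
print(round(interval_harmonic_mean([2.5] * 5 + [20]), 2))  # 2.93
```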
For density to make sense you need to define 2 things:
the range where you look at it,
and the unit of time
After that you can say for example, that from 12:00 to 12:10 the density of the event was an average of 10/minute.
What makes sense in your case obviously depends on what your input data is. If your measurement lasts for 1 hour and you have millions of entries then probably seconds or milliseconds are better choice for unit. If you measure for a week and have a few entries then day is a better unit.
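With the range and the time unit fixed, the density is simply a count per unit of time. A minimal sketch (the window boundaries and timestamps are made-up values):

```python
def density(event_times, start, end, unit=60.0):
    """Events per `unit` seconds of the window [start, end)."""
    count = sum(start <= t < end for t in event_times)
    return count * unit / (end - start)

# Ten events, one per minute, over the 10 minutes from t=0 to t=600 s.
events = [i * 60 for i in range(10)]
print(density(events, 0, 600))  # 1.0 event per minute
```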

How do I measure variability of a benchmark comprised of many sub-benchmarks?

(Not strictly programming, but a question that programmers need answered.)
I have a benchmark, X, which is made up of a lot of sub-benchmarks x1..xn. It's quite a noisy test, with the results being quite variable. To benchmark accurately, I must reduce that "variability", which requires that I first measure the variability.
I can easily calculate the variability of each sub-benchmark, using perhaps standard deviation or variance. However, I'd like to get a single number which represents the overall variability as a single number.
My own attempt at the problem is:
sum = 0
foreach i in 1..n
    calculate mean across the 60 runs of x_i
    foreach j in 1..60
        sum += abs(mean[i] - x_i[j])
variability = sum / (n * 60)
Best idea: ask at the statistics Stack Exchange once it hits public beta (in a week).
In the meantime: you might actually be more interested in the extremes of variability, rather than the central tendency (mean, etc.). For many applications, I imagine that there's relatively little to be gained by incrementing the typical user experience, but much to be gained by improving the worst user experiences. Try the 95th percentile of the standard deviations and work on reducing that. Alternatively, if the typical variability is what you want to reduce, plot the standard deviations all together. If they're approximately normally distributed, I don't know of any reason why you couldn't just take the mean.
I think you're misunderstanding the standard deviation -- if you run your test 50 times and have 50 different runtimes the standard deviation will be a single number that describes how tight or loose those 50 numbers are distributed around your average. In conjunction with your average run time, the standard deviation will help you see how much spread there is in your results.
Consider the following run times:
12 15 16 18 19 21 12 14
The mean of these run times is 15.875. The sample standard deviation of this set is 3.27. There's a good explanation of what 3.27 actually means (in a normally distributed population, roughly 68% of the samples will fall within one standard deviation of the mean: e.g., between 15.875-3.27 and 15.875+3.27) but I think you're just looking for a way to quantify how 'tight' or 'spread out' the results are around your mean.
Now consider a different set of run times (say, after you compiled all your tests with -O2):
14 16 14 17 19 21 12 14
The mean of these run times is also 15.875. The sample standard deviation of this set is 3.0. (So, roughly 68% of the samples will fall within 15.875-3.0 and 15.875+3.0.) This set is more closely grouped than the first set.
And you have a single number that summarizes how compact or loose a group of numbers is around the mean.
Caveats
The percentile interpretation of standard deviation (the 68% rule above) assumes a normal distribution -- but your application may not be normally distributed, so please be aware that standard deviation may be a rough guideline at best. Plot your run-times in a histogram to see if your data looks roughly normal or uniform or multimodal or...
Also, I'm using the sample standard deviation because these runs are only a sample out of the population space of benchmark runs. I'm not a professional statistician, so even this basic assumption may be wrong. Either population standard deviation or sample standard deviation will give you good enough results in your application, as long as you stick consistently to one. Don't mix the two.
I mentioned that the standard deviation in conjunction with the mean will help you understand your data: if the standard deviation is almost as large as your mean, or worse, larger, then your data is very dispersed, and perhaps your process is not very repeatable. Interpreting a 3% speedup in the face of a large standard deviation is nearly useless, as you've recognized. And the best judge (in my experience) of the magnitude of the standard deviation is the magnitude of the average.
Last note: yes, you can calculate standard deviation by hand, but it is tedious after the first ten or so. Best to use a spreadsheet or wolfram alpha or your handy high-school calculator.
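Or Python's `statistics` module -- a quick check of the two sets above (sample standard deviation, matching the answer):

```python
import statistics

first = [12, 15, 16, 18, 19, 21, 12, 14]
second = [14, 16, 14, 17, 19, 21, 12, 14]

print(statistics.mean(first))              # 15.875
print(round(statistics.stdev(first), 2))   # 3.27
print(statistics.mean(second))             # 15.875
print(round(statistics.stdev(second), 2))  # 3.0 - same mean, tighter spread
```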
From Variance:
"the variance of the total group is equal to the mean of the variances of the subgroups, plus the variance of the means of the subgroups."
I had to read that several times, then run it: 464 from this formula == 464, the standard deviation of all the data -- the single number you want.
#!/usr/bin/env python3
import sys
import numpy as np

N = 10
exec("\n".join(sys.argv[1:]))  # this.py "N=..."
np.set_printoptions(1, threshold=100, suppress=True)  # .1f
np.random.seed(1)
data = np.random.exponential(size=(N, 60)) ** 5  # N rows, 60 cols

row_avs = np.mean(data, axis=-1)  # av of each row
row_devs = np.std(data, axis=-1)  # spread, stddev, of each row about its av
print("row averages:", row_avs)
print("row spreads:", row_devs)
print("average row spread: %.3g" % np.mean(row_devs))

# https://en.wikipedia.org/wiki/Variance :
# variance of the total group
#   = mean of the variances of the subgroups + variance of the means of the subgroups
avvar = np.mean(row_devs ** 2)
varavs = np.var(row_avs)
print("sqrt total variance: %.3g = sqrt( av var %.3g + var avs %.3g )" % (
    np.sqrt(avvar + varavs), avvar, varavs))

var_all = np.var(data)  # std^2 of all N x 60 values about the grand mean
print("sqrt variance all: %.3g" % np.sqrt(var_all))
row averages: [ 49.6 151.4 58.1 35.7 59.7 48. 115.6 69.4 148.1 25. ]
row spreads: [ 244.7 932.1 251.5 76.9 201.1 280. 513.7 295.9 798.9 159.3]
average row spread: 375
sqrt total variance: 464 = sqrt( av var 2.13e+05 + var avs 1.88e+03 )
sqrt variance all: 464
To see how group variance increases, run the example in Wikipedia Variance.
Say we have
60 men of heights 180 +- 10, exactly 30: 170 and 30: 190
60 women of heights 160 +- 7, 30: 153 and 30: 167.
The average standard dev is (10 + 7) / 2 = 8.5 .
Together though, the heights
-------|||----------|||-|||-----------------|||---
153 167 170 190
spread like 170 +- 13.2, much greater than 170 +- 8.5.
Why ? Because we have not only the spreads men +- 10 and women +- 7,
but also the spreads from 160 / 180 about the common mean 170.
Exercise: compute the spread 13.2 in two ways,
from the formula above, and directly.
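The exercise can be checked both ways with NumPy (same data as above; population variances, since every value of the 30/30 split is known exactly):

```python
import numpy as np

men = np.array([170] * 30 + [190] * 30)    # mean 180, std 10
women = np.array([153] * 30 + [167] * 30)  # mean 160, std 7
everyone = np.concatenate([men, women])    # common mean 170

# Directly: standard deviation of all 120 heights together.
direct = everyone.std()

# From the formula: mean of group variances + variance of group means.
avvar = (men.var() + women.var()) / 2
varavs = np.var([men.mean(), women.mean()])
combined = np.sqrt(avvar + varavs)

print(round(direct, 1), round(combined, 1))  # 13.2 13.2
```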
This is a tricky problem because benchmarks can be of different natural lengths anyway. So, the first thing you need to do is to convert each of the individual sub-benchmark figures into scale-invariant values (e.g., “speed up factor” relative to some believed-good baseline) so that you at least have a chance to compare different benchmarks.
Then you need to pick a way to combine the figures. Some sort of average. There are, however, many types of average. We can reject the use of the mode and the median here; they throw away too much relevant information. But the different kinds of mean are useful because of the different ways they give weight to outliers. I used to know (but have forgotten) whether it was the geometric mean or the harmonic mean that was most useful in practice (the arithmetic mean is less good here). The geometric mean is basically an arithmetic mean in the log-domain, and a harmonic mean is similarly an arithmetic mean in the reciprocal-domain. (Spreadsheets make this trivial.)
Now that you have a means to combine the values for a run of the benchmark suite into something suitably informative, you've then got to do lots of runs. You might want to have the computer do that while you get on with some other task. :-) Then try combining the values in various ways. In particular, look at the variance of the individual sub-benchmarks and the variance of the combined benchmark number. Also consider doing some of the analyses in the log and reciprocal domains.
Be aware that this is a slow business that is difficult to get right and it's usually uninformative to boot. A benchmark only does performance testing of exactly what's in the benchmark, and that's mostly not how people use the code. It's probably best to consider strictly time-boxing your benchmarking work and instead focus on whether users think the software is perceived as fast enough or whether required transaction rates are actually attained in deployment (there are many non-programming ways to screw things up).
Good luck!
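As a sketch of the normalize-then-average idea: the geometric mean is the arithmetic mean in the log domain, exponentiated back (the speed-up factors below are made-up numbers):

```python
import math

def geometric_mean(speedups):
    """Arithmetic mean in the log domain, exponentiated back."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Per-sub-benchmark speed-up factors relative to a believed-good baseline.
speedups = [1.10, 0.95, 2.0, 1.0]
print(round(geometric_mean(speedups), 3))  # 1.202
```

Unlike the arithmetic mean, this treats a 2x speed-up and a 2x slow-down symmetrically, which is why it is commonly preferred for combining ratios.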
You are trying to solve the wrong problem; better to try to minimize the variability instead. The differences can be caused by caching.
Try running the code on a single (same) core with SetThreadAffinityMask() function on Windows.
Drop the first measurement.
Increase the thread priority.
Stop hyperthreading.
If you have many conditional jumps it can introduce visible differences between calls with different input. (this could be solved by giving exactly the same input for i-th iteration, and then comparing the measured times between these iterations).
You can find here some useful hints: http://www.agner.org/optimize/optimizing_cpp.pdf
