Determine a threshold value of one continuous variable at which the difference between two groups of another variable is statistically significant

I have two continuous variables (Time and Y). I need to find a threshold value of Time at which the difference in Y between the two resulting groups is statistically significant.
To do that, I think I need to convert the Time variable into many new grouping variables, one for each candidate threshold; each new variable takes two values: 1 (< threshold) and 2 (>= threshold).
For example, I convert Time into 3 new variables, Time5, Time10 and Time15, corresponding to the three threshold values 5, 10 and 15.
After that, I need to perform a t-test for each grouping variable: Y ~ Time5, Y ~ Time10 and Y ~ Time15.
However, the real data will have more variables and more thresholds, so could someone help me automate this?
I would really appreciate any better ideas for running it. Thank you very much.
Y <- rnorm(100, 30, 3)                 # outcome variable
Time <- as.integer(rnorm(100, 5, 1))   # continuous variable to be thresholded
df <- data.frame(Y, Time)
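If it helps, here is a minimal sketch of that scan written in Python with numpy/scipy (the same loop translates directly to R's t.test); the simulated data and the threshold grid mirror the snippet above and are assumptions, not part of the original question:
import numpy as np
from scipy import stats

# Simulated data mirroring the R snippet above (assumed distributions)
rng = np.random.default_rng(0)
y = rng.normal(30, 3, 100)
time = rng.normal(5, 1, 100).astype(int)

# Candidate thresholds: every integer value that could split the data
thresholds = range(int(time.min()) + 1, int(time.max()) + 1)

for t in thresholds:
    below = y[time < t]           # group 1: Time < threshold
    at_or_above = y[time >= t]    # group 2: Time >= threshold
    if len(below) < 2 or len(at_or_above) < 2:
        continue                  # skip thresholds that leave a near-empty group
    stat, p = stats.ttest_ind(below, at_or_above, equal_var=False)   # Welch t-test per threshold
    print(f"threshold={t}  p-value={p:.4f}  {'significant at 0.05' if p < 0.05 else ''}")
Keep in mind that scanning many thresholds is a multiple-comparisons problem, so the raw p-values should be adjusted (or an approach such as change-point detection considered) before treating any single threshold as significant.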

Related

How can I normalize trending data?

Say I want to calculate the velocity of two datapoints (A and A'), each having a score, and a time published (A' is a future version of A, and has a higher score). This would be
[A'(score) - A(score)] / [A'(time published) - A(time published)]
What I want to capture are trends with high velocities. This means I want a score going from 20 to 200 having higher weight than 8500 to 9000. So I thought I'd normalize this data by dividing the scores by a baseline.
Ex. if A(score) is 2, and A'(score) is 3, the baseline is 2, so in the formula above,
A'(score) - A(score) would be (3/2 - 2/2)
However, this means that when the numbers are this low, the velocities will be very high, since on the other hand
9000/8500 - 8500/8500
produces very low velocities. (The time difference is constant in this example only; normally, time differences are variable.)
Is there any way to reduce the impact of low starting scores WHILE at the same time allowing jumps from, say, 20 to 200 being significant? Thank you.
There are two ways to look at this. Either could give you what you want.
My first thought was that your question came very close to providing your answer. You gave yourself an important hint by calling your first calculation your velocity - your rate of change of a score over time. You could then look at its acceleration - your rate of change of the velocity over time. That's:
(A''(score) - A'(score)) - (A'(score) - A(score))
Note that I'm not dividing by time, because you say the time difference is constant for each measurement; dividing every value by the same constant adds work without giving you any further clarity.
More likely, though, it seems you want how significant the change is from one score to the next. I suspect what you want is:
(A'(score) - A(score)) / A(score)
This is (a - b) / b, which reduces down to (a/b) - 1. If you don't care about the -1, the simplest way you can see the relevant change in your score is:
A'(score)/A(score)
This shows the rate of growth of the score from one step to the next.
Edit, after clarification:
Given your comment, a variable rate of time makes the logic more complicated, but still do-able.
In that case, you do want to calculate velocity, as you were doing:
V = (A'(score) - A(score)) / (A'(time) - A(time))
But you want to normalize it based on the previous velocity:
result = V'/V
This then becomes similar to the "acceleration" example - it requires 3 samples to have a good idea of the rate of change of the rate of change. If you spell it out all the way, you get something like:
result = [(A''(score) - A'(score)) / (A''(time) - A'(time))] / [(A'(score) - A(score)) / (A'(time) - A(time))]
You can do some math to shove these numbers around, but there's really no prettier result than that.
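For illustration, a small sketch of those quantities in Python, using three made-up (score, time) samples; nothing here comes from the original question beyond the formulas themselves:
# Three consecutive samples as (score, time) pairs - values are illustrative only
a  = (20.0, 0.0)     # A
a1 = (60.0, 5.0)     # A'
a2 = (200.0, 8.0)    # A''

def velocity(prev, curr):
    # rate of change of score over (possibly variable) time
    return (curr[0] - prev[0]) / (curr[1] - prev[1])

growth = a1[0] / a[0]       # A'(score)/A(score): relative change from one step to the next
v1 = velocity(a, a1)        # V, velocity over the first interval
v2 = velocity(a1, a2)       # V', velocity over the second interval
result = v2 / v1            # V'/V: velocity normalised by the previous velocity

print(f"growth ratio = {growth:.2f}, V = {v1:.2f}, V' = {v2:.2f}, V'/V = {result:.2f}")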

suitable formula/algorithm for detecting temperature fluctuations

I'm creating an app to monitor water quality. The temperature data is updated every 2 minutes to a Firebase real-time database. The app has two requirements:
1) It should alert the user when the temperature exceeds 33 degrees or drops below 23 degrees - this part is done.
2) It should alert the user when there is a big temperature fluctuation, analysing the data every 30 minutes - this is the part I'm confused about.
I don't know what algorithm to use to detect a big temperature fluctuation over a period of time and alert the user. Can someone help me with this?
For a period of 30 minutes, your app would give you 15 values.
If you want to figure out a big change in this data, then there is one way to do so.
You can implement the following method:
Calculate the mean and the standard deviation of the values.
Subtract the mean from each value and take the absolute value of the result.
Check whether any absolute value is greater than one standard deviation; if it is, you have a big fluctuation.
See this example for better understanding:
Let's suppose you have these values for 10 minutes:
25,27,24,35,28
First Step:
Mean = 27.8 (approximately 28)
One standard deviation = 3.9 (approx.)
Second Step: Absolute(Data - Mean)
abs(25 - 28) = 3
abs(27 - 28) = 1
abs(24 - 28) = 4
abs(35 - 28) = 7
abs(28 - 28) = 0
Third Step
Check whether any of the subtracted results is greater than the standard deviation.
abs(35 - 28) gives 7, which is greater than 3.9.
So there is a big fluctuation. If all the subtracted results are less than the standard deviation, then there is no big fluctuation.
You can still improve the result by using two or three standard deviations as the threshold instead of one.
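A short sketch of those three steps in Python, assuming the 30-minute window arrives as a plain list of readings, with the one/two/three standard deviation choice exposed as a parameter:
import statistics

def has_big_fluctuation(readings, k=1.0):
    # flag a big fluctuation if any reading is more than k standard deviations from the mean
    mean = statistics.mean(readings)
    sd = statistics.pstdev(readings)    # population standard deviation, as in the worked example
    if sd == 0:
        return False                    # all readings identical: nothing to flag
    return any(abs(x - mean) > k * sd for x in readings)

print(has_big_fluctuation([25, 27, 24, 35, 28]))         # True: 35 is far from the mean
print(has_big_fluctuation([25, 27, 24, 35, 28], k=2))    # stricter two-standard-deviation check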
Start by defining what you mean by fluctuation.
You don't say what temperature scale you're using. Fahrenheit, Celsius, Rankine, or Kelvin?
Your sampling rate is a new data value every two minutes. Do you define fluctuation as the absolute value of the difference between the last point and the current value? That's defensible.
If the maximum allowable absolute difference is some multiple of your 33 - 23 = 10 degree range, you're in business.
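If you do define fluctuation as the jump between consecutive samples, the check is tiny; a sketch in Python with the allowed jump as a tunable parameter (the 5-degree default is an assumption):
def jumped_too_much(previous, current, max_delta=5.0):
    # flag when the absolute change between two consecutive readings exceeds max_delta degrees
    return abs(current - previous) > max_delta

print(jumped_too_much(26.0, 32.5))    # True: a 6.5-degree jump in one two-minute step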

Python Probability Algorithm

I am looking for help with a Python algorithm that will take a percentage or fraction (such as 45% or 4500/10000), test it multiple times, and report how many times it comes out true and how many times it comes out false.
Basically, I am looking for an algorithm that will take a probability, test it repeatedly, and give results on how many times you, say, survived or died.
Is this possible, and can anyone help me?
Loop over the following for the number of trials you want:
Generate a random integer between 0 and the denominator (if it's a fraction) or a real number between 0 and 1 (if it's a percentage)
If the value is less than the numerator/percentage, record a success; otherwise record a failure
You can find information on generating random values in the Python documentation; how you determine whether you're working with a percentage or a fraction will depend on how you accept and parse user input.
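A minimal sketch of that loop in Python, assuming the probability has already been parsed into a numerator/denominator pair (input parsing is left out):
import random

def simulate(numerator, denominator, trials):
    # run repeated trials of an event with probability numerator/denominator and tally outcomes
    successes = 0
    for _ in range(trials):
        if random.randrange(denominator) < numerator:   # integer draw in [0, denominator)
            successes += 1
    return successes, trials - successes

survived, died = simulate(4500, 10000, trials=100_000)   # e.g. a 45% chance of surviving
print(f"survived: {survived}, died: {died}")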

Generating seed values for pseudorandom number generators

I have 4 integers which I want to convert into a seed in order to generate a random number. I understand this is arbitrary for the most part; I do however want to make sure that what I am currently doing is not overkill (or, conversely, that it generates enough spread in seed values).
I have roughly 1000 objects which I want to give random properties based on some of their variables.
Two of the variables are constant, lie in the 0-1000 range, and are random for each object; duplicates can occur but are not likely at all (constant1 and constant2). The other two variables start at 0 and change with deltas of 1 over long time periods as the program runs; they can be anywhere within the signed int32 range but will tend to be between -100 and 100 (variable1 and variable2).
How do you suitably generate a seed from these 4 values?
You should probably initialize the Random generator only once, when the class instance is initialized, so you should use only two of the properties (the other two are still 0 at that point, aren't they?) to get a seed.
Because of the point above, and assuming that constant1 and constant2 are random within 0-1000, you can use constant1 * 1000 + constant2 to get a number between 0 and roughly 1,000,000. I'm not sure about the randomness distribution, but it should be enough for a seed.
Update
If you really need to get the seed depend on other two variables, you can follow the pattern and do it as follows:
var seed = ((variable1 * 200 + variable2) * 1000 + constant1) * 1000 + constant2;
but because the result can exceed the Int32 range, you have to compute it in an unchecked context so that an OverflowException is not thrown.
And one last thing: I'm not 100% sure it will give you an even distribution of generated seed values.
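For illustration, here is the same combination sketched in Python, with a 32-bit mask standing in for the unchecked Int32 wrap-around; the example input values are made up:
import random

def make_seed(constant1, constant2, variable1, variable2):
    # pack the four integers into one seed, wrapping like unchecked 32-bit arithmetic
    seed = ((variable1 * 200 + variable2) * 1000 + constant1) * 1000 + constant2
    return seed & 0xFFFFFFFF    # keep the low 32 bits, mimicking Int32 overflow wrap-around

rng = random.Random(make_seed(constant1=417, constant2=93, variable1=-7, variable2=42))
print(rng.random())    # reproducible for the same four inputs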

Weighted, load-balancing resource scheduling algorithm

A software application that I'm working on needs to be able to assign tasks to a group of users based on how many tasks they presently have, where the users with the fewest tasks are the most likely to get the next task. However, the current task load should be treated as a weighting, rather than an absolute order definition. IOW, I need to implement a weighted, load-balancing algorithm.
Let's say there are five users, with the following number of tasks:
A: 4
B: 5
C: 0
D: 7
E: 9
I want to prioritize the users for the next task in the order CABDE, where C is most likely to get the assignment and E, the least likely. There are two important things to note here:
The number of users can vary from 2 to dozens.
The number of tasks assigned to each user can vary from 1 to hundreds.
For now, we can treat all tasks as equal, though I wouldn't mind including task difficulty as a variable that I can use in the future - but this is purely icing on the cake.
The ideas I've come up with so far aren't very good in some situations. They might weight users too closely together if there are a large number of users, or they might fall flat if a user has no current tasks, or....
I've tried poking around the web, but haven't had much luck. Can anyone give me a quick summary of an algorithm that would work well? I don't need an actual implementation--I'll do that part--just a good description. Alternatively, is there a good web site that's freely accessible?
Also, while I certainly appreciate quality, this need not be statistically perfect. So if you can think of a good but not great technique, I'm interested!
As you point out, this is a load-balancing problem. It's not really a scheduling problem, since you're not trying to minimise anything (total time, number of concurrent workers, etc.). There are no special constraints (job duration, time clashes, skill sets to match etc.) So really your problem boils down to selecting an appropriate weighting function.
You say there are some situations you want to avoid, like user weightings that are too close together. Can you provide more details? For example, what's wrong with making the chance of assignment decrease in proportion to the current workload, normalised by the workload of the other workers? You can visualise this as a sequence of blocks of different lengths (the tasks) being packed into a set of bins (the workers), where you're trying to keep the total height of the bins as even as possible.
With more information, we could make specific recommendations of functions that could work for you.
Edit: example load-balancing functions
Based on your comments, here are some example of simple functions that can give you different balancing behaviour. A basic question is whether you want deterministic or probabilistic behaviour. I'll give a couple of examples of each.
To use the example in the question - there are 4 + 5 + 0 + 7 + 9 = 25 jobs currently assigned. You want to pick who gets job 26.
1) Simple task farm. For each job, always pick the worker with the least jobs currently pending. Fast workers get more to do, but everyone finishes at about the same time.
2) Guarantee fair workload. If workers work at different speeds, and you don't want some doing more than others, then track the number of completed + pending jobs for each worker. Assign the next job to keep this number evenly spread (fast workers get free breaks).
3) Basic linear normalisation. Pick a maximum number of jobs each worker can have. Each worker's workload is normalised to that number. For example, if the maximum number of jobs/worker is 15, then 50 more jobs can be added before you reach capacity. So for each worker the probability of being assigned the next job is
P(A) = (15 - 4)/50 = 0.22
P(B) = (15 - 5)/50 = 0.2
P(C) = (15 - 0)/50 = 0.3
P(D) = (15 - 7)/50 = 0.16
P(E) = (15 - 9)/50 = 0.12
If you don't want to use a specific maximum threshold, you could use the worker with the highest current number of pending jobs as the limit. In this case, that's worker E, so the probabilities would be
P(A) = (9 - 4)/20 = 0.25
P(B) = (9 - 5)/20 = 0.2
P(C) = (9 - 0)/20 = 0.45
P(D) = (9 - 7)/20 = 0.1
P(E) = (9 - 9)/20 = 0
Note that in this case, the normalisation ensures worker E can't be assigned any jobs - he's already at the limit. Also, just because C doesn't have anything to do doesn't mean he is guaranteed to be given a new job (it's just more likely).
You can easily implement the choice function by generating a random number r between 0 and 1 and comparing it to these boundaries. So if r < 0.25, A gets the job; if 0.25 < r < 0.45, B gets the job; and so on.
4) Non-linear normalisation. Using a log function (instead of the linear subtraction) to weight your numbers is an easy way to get a non-linear normalisation. You can use this to skew the probabilities, e.g. to make it much more likely that workers without many jobs are given more.
The point is, the number of ways of doing this is practically unlimited. What weighting function you use depends on the specific behaviour you're trying to enable. Hopefully that's given you some ideas which you can use as a starting point.
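As a concrete starting point, here is a sketch of option 3 plus the choice function in Python, using the busiest worker as the cap (as in the second table above); the dict-of-workloads representation is an assumption:
import random

def pick_next_user(workloads):
    # pick a user with probability proportional to (cap - own_load), as in option 3
    cap = max(workloads.values())
    weights = {user: cap - load for user, load in workloads.items()}
    total = sum(weights.values())
    if total == 0:
        return random.choice(list(workloads))    # everyone equally loaded: pick uniformly
    r = random.uniform(0, total)                 # compare r against the cumulative boundaries
    cumulative = 0.0
    for user, weight in weights.items():
        cumulative += weight
        if r <= cumulative:
            return user
    return user    # guard against floating-point rounding at the top end

workloads = {"A": 4, "B": 5, "C": 0, "D": 7, "E": 9}
print(pick_next_user(workloads))    # C is most likely; E (already at the cap) is never chosen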
