I have a dataset in the form (timestamp, latitude, longitude). For a single user I'm given n entries, each of the form (timestamp, latitude, longitude).
User1:(timestamp1,latitude1,longitude1)....(timestamp_n,latitude_n,longitude_n)
Now let's say we have 100 users, each with their own set of (timestamp, latitude, longitude) points.
I want to know which sets of users might have matching trajectories.
A matching trajectory means the same route was taken: for a given set of timestamps, the latitudes and longitudes should be the same or close enough, and the timestamps should also be the same or close enough. "Close enough" is about 30 seconds for the timestamp and about 200 metres for space. I can do this via a brute-force approach, and I'm looking for better solutions.
You can use a k-d tree or a range tree to index your data. These will let you efficiently perform a range query over all three dimensions of your data.
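As a rough sketch of that indexing idea (assuming SciPy is available, that timestamps are in seconds, and that a crude degrees-to-metres conversion is acceptable; the function name and the way the three axes are scaled are my own choices, not part of the question):

import numpy as np
from scipy.spatial import cKDTree

def candidate_pairs(points, user_ids, dt=30.0, dist_m=200.0):
    # points[i] = (timestamp_sec, lat_deg, lon_deg); user_ids[i] = owner of point i.
    pts = np.asarray(points, dtype=float)
    # Rough equirectangular conversion of degrees to metres, then rescale each
    # axis by its tolerance so that a radius-1 query roughly matches both thresholds.
    lat0 = np.radians(pts[:, 1].mean())
    scaled = np.column_stack([
        pts[:, 0] / dt,
        pts[:, 1] * 111_320.0 / dist_m,
        pts[:, 2] * 111_320.0 * np.cos(lat0) / dist_m,
    ])
    tree = cKDTree(scaled)
    pairs = set()
    for i, j in tree.query_pairs(r=1.0):  # point pairs close in time AND space (approximately)
        if user_ids[i] != user_ids[j]:
            pairs.add(tuple(sorted((user_ids[i], user_ids[j]))))
    return pairs

This only produces candidate user pairs that share at least one nearby point; you would still compare the full trajectories of those candidates with whatever path-difference measure you settle on.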
This is somewhat unrelated to whether the algorithm will still be brute force or not.
What I want to present here is how to measure the difference between 2 paths.
It's just that I think precisely defining how to quantify the difference is important.
If you want something faster, you can probably approximate this quantity later.
Ok, I think the difference between 2 paths is:
The average distance between 2 users over time.
You should be able to interpolate between 2 given data points to find out where the user is at any given time. Just linear interpolation might suffice.
When I say average over time, you would discretize the time so it is easier to compute.
Let's say:
The average distance between the 2 users, sampled every 10 seconds.
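A minimal sketch of that definition, assuming each trajectory comes as a sorted array of timestamps plus an array of positions already projected to metres (the names and the 10-second step are just the ones from this discussion):

import numpy as np

def path_difference(t1, xy1, t2, xy2, step=10.0):
    # Average distance between two users, sampled every `step` seconds over the
    # time span covered by both trajectories.
    t1, t2 = np.asarray(t1, float), np.asarray(t2, float)
    xy1, xy2 = np.asarray(xy1, float), np.asarray(xy2, float)
    start, end = max(t1[0], t2[0]), min(t1[-1], t2[-1])
    if end <= start:
        return np.inf  # the trajectories do not overlap in time
    samples = np.arange(start, end, step)
    # Linear interpolation of each coordinate at the common sample times.
    p1 = np.column_stack([np.interp(samples, t1, xy1[:, 0]),
                          np.interp(samples, t1, xy1[:, 1])])
    p2 = np.column_stack([np.interp(samples, t2, xy2[:, 0]),
                          np.interp(samples, t2, xy2[:, 1])])
    return float(np.linalg.norm(p1 - p2, axis=1).mean())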
Edit: The above suggestion assumed that you care about "timing".
Since you mention the timestamp and all.
If you didn't care about it, you shouldn't have put it into the question in the first place.
Anyway, I kind of imagine that it is possible you want to just look at the path itself.
In that case, you could still use the above definition of path difference
simply by ignoring the actual timestamps and imagining that the users start at the same time at the beginning of their paths.
The travel speed can be set in various ways, such as making both users complete the path at the same time regardless of whether one path is longer than the other, or just letting both travel at the same speed.
Anyway, it all comes down to defining how you want to measure the path difference.
You need to give more details in the question.
I've got a statistical/mathematical problem I'm stumped on and I was really hoping to get some help. I'm working on a research project where I need to compare a weekly graph with its own history to see when in the past it was almost the same. Think of this as "finding the closest match". The information is displayed as a line graph, but it's readily available as raw data:
Date          Result
08/10/18      52.5
08/07/18      60.2
08/06/18      58.5
08/05/18      55.4
08/04/18      55.2
and so on...
What I really want is the output to be a form of correlation between the current data points and every other set of 5 consecutive data points in history. So, something like:
Date range             Correlation
07/10/18 - 07/15/18    0.98
We'll be getting code written in Python for the software to do this automatically (so that as new data is added, it automatically runs and finds the closest set of numbers matching the current one).
Here’s where the difficulty sets in: Since numbers are on a general upward trend over time, we don’t want it to compare the absolute value (since the numbers might never really match). One suggestion has been to compare the delta (rate of change as a percentage over the previous day), or using a log scale.
I'm wondering: how do I go about this? What kind of calculation can I use to get the desired results? I've looked at different kinds of correlation equations, but they don't account for the "shape" of the data; they generally just average it out. The shape of the line chart is the important thing.
Thanks very much in advance!
I would simply divide the data of each week by their average (i.e., normalize them to an average of 1), then sum the squares of the differences of each day of each pair of weeks. This sum is what you want to minimize.
If you don't care about how much a graph oscillates relative to its mean, you can also normalize the variance. For each week, calculate the mean and variance, then subtract the mean and divide by the square root of the variance (the standard deviation). Each week will then have mean 0 and variance 1. Then minimize the sum of squares of differences as before.
If normalizing the data is all you can change in your workflow, just do the normalization and leave out the sum-of-squares minimization part.
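A minimal sketch of the whole procedure, assuming the series is a plain list of daily values and the window is the 5 points from the question (the z-score variant; function names are my own):

import numpy as np

def best_match(series, window=5):
    # Index and score of the historical window whose normalized shape is
    # closest to the most recent `window` values.
    series = np.asarray(series, dtype=float)
    def normalize(w):
        return (w - w.mean()) / w.std()  # assumes the window is not constant
    current = normalize(series[-window:])
    best_idx, best_score = None, np.inf
    for i in range(len(series) - 2 * window + 1):  # historical windows only
        candidate = normalize(series[i:i + window])
        score = np.sum((current - candidate) ** 2)
        if score < best_score:
            best_idx, best_score = i, score
    return best_idx, best_score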
I'm solving an interesting problem wherein, for each user, I would like to keep his last N days of activity. This can be applied to many use cases, and one simple one is:
For each user (a user can come to the gym on any random day), I want to get the total number of times he hit the gym over the last 90 days.
This is a tricky one for me.
My thoughts: I thought of storing a vector where each entry represents a day and a boolean value indicates whether he visited. To count, a linear scan of that section of the array would suffice.
What is the best way?
Depending on how complex you need it to be, a simple array that stores each of a client's visits should suffice.
Upon each visit, add a new entry containing the date/time. Each day, run a check to see if any clients have visit records older than 90 days. Since entries are added in chronological order, the first record that is not old enough means there are no more old records to check, so you can safely move on to the next client.
Hope this helps you!
For every client, make a Queue data structure whose elements contain the visit date.
When a client visits the gym, just add the current date:
Q[ClientIdx].Add(Today)
When you need to get his visit count:
while (not Q[ClientIdx].Empty) and (Today - Q[ClientIdx].Peek > 90)
Q[ClientIdx].Remove //dequeue too old records
VisitCount = Q[ClientIdx].Count
You can use a standard Queue implementation in many languages, or a simple implementation of your own based on an array/list if a standard one is not available.
Note that every record is added and removed at most once, so the amortized complexity is O(1) per add/count operation.
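A minimal sketch of the same idea in Python, assuming dates are datetime.date objects and a dict of deques keyed by client id (names are hypothetical):

from collections import defaultdict, deque
from datetime import date, timedelta

visits = defaultdict(deque)  # client id -> queue of visit dates, oldest first

def record_visit(client_id, day=None):
    visits[client_id].append(day or date.today())

def visit_count(client_id, days=90):
    q = visits[client_id]
    cutoff = date.today() - timedelta(days=days)
    while q and q[0] < cutoff:
        q.popleft()  # dequeue records that are too old
    return len(q)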
Your idea will work, but is it really space-wise efficient?
Your data-structure would be something like this: A boolean 2D vector (you can imagine it as a matrix), where every row is a user and every column is a day (sorted), so that would consist of a:
matrix of size U x N
where U is the number of users.
To answer the question I initially asked, you need to think about how dense this matrix is going to be. If it's going to be dense, then you made the right choice; if not, then you wasted (much) space. You can see the trade-off here.
Of course, you have to think about your use case. In the gym example, I do not think this would be space efficient, since most people do not go to the gym every day (I think), which will result in a sparse matrix, meaning that we wasted space.
Another idea would be to have a single vector of size N, where the days are sorted. Every entry would be a singly linked list, where every node is a user.
If a user is found in the list of a day, then it means that he went to gym at that day.
With this approach we allocate exactly as much space as needed, so it's space optimal, regardless of the density I mentioned in the matrix's case.
However, is this it? No, of course not! I discussed space, but what about time efficiency? For example, search is usually a frequent operation we want our data structure to support, and we would like it to be fast!
In the matrix's case, search would be an O(1) operation, which is sweet, since accessing the matrix is a constant operation.
In the vector+list's case however, the search would take O(L), where L is the average size of the lists our vector has in total.
So which one? It depends on your application!
I would try a hashtable as well, which would not require sorting and is space efficient (What is the space complexity of a hash table?).
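A minimal sketch of that hashtable idea, assuming a dict that maps each day to the set of users who visited on that day (names are hypothetical):

from datetime import date, timedelta

visits_by_day = {}  # day -> set of user ids who visited that day

def log_visit(user_id, day):
    visits_by_day.setdefault(day, set()).add(user_id)

def visits_last_n_days(user_id, days=90):
    today = date.today()
    return sum(user_id in visits_by_day.get(today - timedelta(days=d), ())
               for d in range(days))

Checking one user/day pair is O(1) on average; counting a user's last 90 days is 90 such lookups, and only days that actually had visits take up space.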
How about having a Queue with a fixed size (90) of visit details for every user? You can generalise it to multiple users, and the key advantage is that you don't have to explicitly maintain the last 90 days of data.
You can dump a queue to a list or array and persist it if needed in O(n). And, as you mentioned, the check for the number of visits will be O(n) as well.
I'm trying to come up with a weighted algorithm for an application. In the application, there is a limited amount of space available for different elements. Once all the space is occupied, the algorithm should choose the best element(s) to remove in order to make space for new elements.
There are different attributes which should affect this decision. For example:
T: Time since last accessed. (It's best to replace something that hasn't been accessed in a while.)
N: Number of times accessed. (It's best to replace something which hasn't been accessed many times.)
R: Number of elements which need to be removed in order to make space for the new element. (It's best to replace the fewest elements. Ideally this should also take into consideration the T and N attributes of each element being replaced.)
I have 2 problems:
Figuring out how much weight to give each of these attributes.
Figuring out how to calculate the weight for an element.
(1) I realize that coming up with the weight for something like this is very subjective, but I was hoping that there's a standard method or something that can help me in deciding how much weight to give each attribute. For example, I was thinking that one method might be to come up with a set of two sample elements and then manually compare the two and decide which one should ultimately be chosen. Here's an example:
Element A: N = 5, T = 2 hours ago.
Element B: N = 4, T = 10 minutes ago.
In this example, I would probably want A to be the element chosen to be replaced since, although it was accessed one more time, it hasn't been accessed for a long time compared with B. This method seems like it would take a lot of time, and would involve making a lot of tough, subjective decisions. Additionally, it may not be trivial to come up with the resulting weights at the end.
Another method I came up with was to just arbitrarily choose weights for the different attributes and then use the application for a while. If I notice anything obviously wrong with the algorithm, I could then go in and slightly modify the weights. This is basically a "guess and check" method.
Both of these methods don't seem that great and I'm hoping there's a better solution.
(2) Once I do figure out the weight, I'm not sure which way is best to calculate the weight. Should I just add everything? (In these examples, I'm assuming that whichever element has the highest replacementWeight should be the one that's going to be replaced.)
replacementWeight = .4*T - .1*N - 2*R
or multiply everything?
replacementWeight = (T) * (.5*N) * (.1*R)
What about not using constants for the weights? For example, sure, "Time" (T) may be important, but once a certain amount of time has passed it stops making much of a difference; essentially everything beyond that point gets lumped into an "a lot of time has passed" bin. (E.g., even though 8 hours and 7 hours differ by an hour, that difference might not be as significant as the difference between 1 minute and 5 minutes, since those two are much more recent.) (Or another example: replacing (R) 1 or 2 elements is fine, but when I start needing to replace 5 or 6, that should be heavily weighted down; therefore it shouldn't be linear.)
replacementWeight = 1/T + sqrt(N) - R*R
Obviously (1) and (2) are closely related, which is why I'm hoping that there's a better way to come up with this sort of algorithm.
What you are describing is the classic problem of choosing a cache replacement policy. Which policy is best for you, depends on your data, but the following usually works well:
First, always store a new object in the cache, evicting the R worst one(s). There is no way to know a priori if an object should be stored or not. If the object is not useful, it will fall out of the cache again soon.
The popular squid cache implements the following cache replacement algorithms:
Least Recently Used (LRU):
replacementKey = -T
Least Frequently Used with Dynamic Aging (LFUDA):
replacementKey = N + C
Greedy-Dual-Size-Frequency (GDSF):
replacementKey = (N/R) + C
C refers to a cache age factor here. C is basically the replacementKey of the item that was evicted last (or zero).
NOTE: The replacementKey is calculated when an object is inserted or accessed, and stored alongside the object. The object with the smallest replacementKey is evicted.
LRU is simple and often good enough. The bigger your cache, the better it performs.
LFUDA and GDSF are both tradeoffs. LFUDA prefers to keep large objects even if they are less popular, under the assumption that one hit to a large object makes up for many hits to smaller objects. GDSF basically makes the opposite tradeoff, keeping many smaller objects over fewer large objects. From what you write, the latter might be a good fit.
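A minimal sketch of the eviction mechanics described above, using the GDSF-style key; the class and attribute names are hypothetical, and R is taken to be the number of slots an object occupies:

class Cache:
    def __init__(self, capacity):
        self.capacity = capacity     # total slots available
        self.used = 0
        self.age = 0.0               # C: key of the most recently evicted object
        self.items = {}              # object id -> (size, hits, key)

    def _key(self, hits, size):
        return hits / size + self.age   # GDSF-style: (N / R) + C

    def access(self, obj_id):
        size, hits, _ = self.items[obj_id]
        self.items[obj_id] = (size, hits + 1, self._key(hits + 1, size))

    def insert(self, obj_id, size):
        while self.used + size > self.capacity and self.items:
            victim = min(self.items, key=lambda k: self.items[k][2])  # smallest key
            self.age = self.items[victim][2]   # advance the cache age factor
            self.used -= self.items[victim][0]
            del self.items[victim]
        self.items[obj_id] = (size, 1, self._key(1, size))
        self.used += size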
If none of these meet your needs, you can calculate optimal coefficients for T, N and R (and compare different formulas for combining them) by minimizing regret, i.e. the difference in performance between your formula and the optimal algorithm, using, for example, linear regression.
This is a completely subjective issue, as you yourself point out. And a distinct possibility is that if your test cases consist of pairs (A, B) where you prefer A to B, then you might find that you prefer A to B and B to C, but also C to A, i.e. it's not a consistent ordering.
If you are not careful, your function might not exist!
If you can define a scalar function of your input variables, with various parameters for coefficients and exponents, you might be able to estimate said parameters by using regression, but you will need an awful lot of data if you have many parameters.
This is the classical statistician's approach of first reviewing the data to IDENTIFY a model, and then using that model to ESTIMATE a particular realisation of the model. There are large books on this subject.
tl;dr: I want to predict file copy completion. What are good methods given the start time and the current progress?
Firstly, I am aware that this is not at all a simple problem, and that predicting the future is difficult to do well. For context, I'm trying to predict the completion of a long file copy.
Current Approach:
At the moment, I'm using a fairly naive formula that I came up with myself: (ETC stands for Estimated Time of Completion)
ETC = currTime + elapsedTime * (totalSize - sizeDone) / sizeDone
This works on the assumption that the remaining files to be copied will do so at the average copy speed thus far, which may or may not be a realistic assumption (dealing with tape archives here).
PRO: The ETC will change gradually, and becomes more and more accurate as the process nears completion.
CON: It doesn't react well to unexpected events, like the file copy becoming stuck or speeding up quickly.
Another idea:
The next idea I had was to keep a record of the progress for the last n seconds (or minutes, given that these archives are supposed to take hours), and just do something like:
ETC = currTime + currAvg * (totalSize - sizeDone)
where currAvg is the average time taken per unit of data over that recent window (i.e. the reciprocal of the recent copy speed).
This is kind of the opposite of the first method in that:
PRO: If the speed changes quickly, the ETC will update quickly to reflect the current state of affairs.
CON: The ETC may jump around a lot if the speed is inconsistent.
Finally
I'm reminded of the control engineering subjects I did at uni, where the objective is essentially to try to get a system that reacts quickly to sudden changes, but isn't unstable and crazy.
With that said, the other option I could think of would be to calculate the average of both of the above, perhaps with some kind of weighting:
Weight the first method more if the copy has a fairly consistent long-term average speed, even if it jumps around a bit locally.
Weight the second method more if the copy speed is unpredictable, and is likely to do things like speed up/slow down for long periods, or stop altogether for long periods.
What I am really asking for is:
Any alternative approaches to the two I have given.
If and how you would combine several different methods to get a final prediction.
If you feel that the accuracy of prediction is important, the way to go about building a predictive model is as follows:
collect some real-world measurements;
split them into three disjoint sets: training, validation and test;
come up with some predictive models (you already have two plus a mix) and fit them using the training set;
check predictive performance of the models on the validation set and pick the one that performs best;
use the test set to assess the out-of-sample prediction error of the chosen model.
I'd hazard a guess that a linear combination of your current model and the "average over the last n seconds" would perform pretty well for the problem at hand. The optimal weights for the linear combination can be fitted using linear regression (a one-liner in R).
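A minimal sketch of that fitting step, assuming you have logged, for a number of progress samples, the prediction of each formula together with the remaining time that was actually observed (the numbers below are made up purely for illustration):

import numpy as np

pred_overall = np.array([120.0, 90.0, 60.0, 30.0])  # overall-average formula (hypothetical log)
pred_recent  = np.array([150.0, 70.0, 55.0, 25.0])  # last-n-seconds formula (hypothetical log)
actual       = np.array([135.0, 80.0, 58.0, 28.0])  # remaining time actually observed

X = np.column_stack([pred_overall, pred_recent])
weights, *_ = np.linalg.lstsq(X, actual, rcond=None)  # least-squares fit of the two weights

def combined_remaining(overall, recent):
    return weights[0] * overall + weights[1] * recent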
An excellent resource for studying statistical learning methods is The Elements of Statistical Learning by Hastie, Tibshirani and Friedman. I can't recommend that book highly enough.
Lastly, your second idea (average over the last n seconds) attempts to measure the instantaneous speed. A more robust technique for this might be to use the Kalman filter, whose purpose is exactly this:
Its purpose is to use measurements observed over time, containing noise (random variations) and other inaccuracies, and produce values that tend to be closer to the true values of the measurements and their associated calculated values.
The principal advantage of using the Kalman filter rather than a fixed n-second sliding window is that it's adaptive: it will automatically use a longer averaging window when measurements jump around a lot than when they're stable.
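A minimal sketch of a one-dimensional Kalman-style filter applied to the measured copy speed; the process and measurement noise values are made-up tuning knobs, not recommendations:

def kalman_speed(measurements, process_var=1.0, measurement_var=25.0):
    # Smooth noisy per-second speed measurements with a 1-D Kalman filter
    # (random-walk model: the true speed is assumed to drift slowly).
    estimate, error = measurements[0], 1.0
    smoothed = []
    for z in measurements:
        error += process_var                      # predict: uncertainty grows
        gain = error / (error + measurement_var)  # how much to trust the new measurement
        estimate += gain * (z - estimate)         # update the estimate
        error *= (1 - gain)
        smoothed.append(estimate)
    return smoothed

The remaining-time estimate is then the remaining bytes divided by the latest filtered speed.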
Imho, bad implementations of ETC are wildly overused, which allows us to have a good laugh. Sometimes, it might be better to display facts instead of estimations, like:
5 of 10 files have been copied
10 of 200 MB have been copied
Or display facts and an estimation, and make clear that it is only an estimation. But I would not display only an estimation.
Every user knows that ETCs are often completely meaningless, and it is hard to distinguish between meaningful ETCs and meaningless ETCs, especially for inexperienced users.
I have implemented two different solutions to address this problem:
The ETC for the current transfer at start time is based on a historic speed value. This value is refined after each transfer. During the transfer I compute a weighted average between the historic data and data from the current transfer, so that the closer to the end you are the more weight is given to actual data from the transfer.
Instead of showing a single ETC, show a range of time. The idea is to compute the ETC from the last 'n' seconds or minutes (like your second idea). I keep track of the best and worst case averages and compute a range of possible ETCs. This is kind of confusing to show in a GUI, but okay to show in a command line app.
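A minimal sketch of the first approach (blending historic and current speed; the linear blend by completed fraction is my own simplification of the weighting described above):

def remaining_seconds(historic_speed, current_speed, fraction_done, remaining_bytes):
    # Trust the current transfer more the further along it is.
    blended = (1 - fraction_done) * historic_speed + fraction_done * current_speed
    return remaining_bytes / blended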
There are two things to consider here:
the exact estimation
how to present it to the user
1. On estimation
Apart from the statistical approaches, one simple way to get a good estimate of the current speed, while smoothing out noise and spikes, is to take a weighted approach.
You already experimented with the sliding window; the idea here is to take a fairly large sliding window, but instead of a plain average, give more weight to the more recent measurements, since they are more indicative of the evolution (a bit like a derivative).
Example: suppose you have 10 previous windows (most recent x0, least recent x9), where each xi is the amount copied during that window. Then you could compute the speed:
Speed = (10*x0 + 9*x1 + 8*x2 + ... + 1*x9) / (55 * window-time)
where 55 = 10 + 9 + ... + 1 is the sum of the weights.
When you have a good assessment of the likely speed, you are close to getting a good estimated time.
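A minimal sketch of that weighting, assuming x[0] is the amount copied in the most recent window and x[-1] in the oldest, with each window lasting window_time seconds:

def weighted_speed(x, window_time):
    # Newer windows count more than older ones.
    weights = range(len(x), 0, -1)   # 10, 9, ..., 1 for ten windows
    weighted_amount = sum(w * xi for w, xi in zip(weights, x))
    return weighted_amount / (sum(weights) * window_time)  # sum(weights) is 55 for ten windows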
2. On presentation
The main thing to remember here is that you want a nice user experience, and not a scientific front.
Studies have demonstrated that users react very badly to slow-downs and very positively to speed-ups. Therefore, a good progress bar / estimated time should at first be conservative in the estimates it presents (reserving time for a potential slow-down).
A simple way to get that is to have a factor that is a percentage of the completion, that you use to tweak the estimated remaining time. For example:
real-completion = 0.4
presented-completion = real-completion * factor(real-completion)
Where factor is such that factor([0..1]) = [0..1], factor(x) <= x and factor(1) = 1. For example, factor(x) = x^2 (giving an overall cubic curve) produces a nice speed-up toward the completion time. Other functions could use a normalized exponential form such as (e^x - 1)/(e - 1), etc.
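A minimal sketch of such a conservative presentation, using the squared factor (hence an overall cubic curve) mentioned above:

def presented_completion(real_completion):
    # Under-report progress early on, converge to the real value at the end.
    factor = real_completion ** 2        # factor(x) <= x on [0, 1] and factor(1) = 1
    return real_completion * factor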
What is an algorithm to compare multiple sets of numbers against a target set to determine which ones are the most "similar"?
One use of this algorithm would be to compare today's hourly weather forecast against historical weather recordings to find a day that had similar weather.
The similarity of two sets is a bit subjective, so the algorithm really just needs to differentiate between good matches and bad matches. We have a lot of historical data, so I would like to narrow down the number of days the users need to look through by automatically throwing out sets that aren't close and trying to put the "best" matches at the top of the list.
Edit:
Ideally the result of the algorithm would be comparable across different data sets. For example, using the mean square error as suggested by Niles produces pretty good results, but the numbers generated when comparing temperature cannot be compared to numbers generated with other data such as wind speed or precipitation, because the scale of the data is different. Some of the non-weather data is very large, so the mean square error algorithm generates numbers in the hundreds of thousands, compared to the tens or hundreds generated using temperature.
I think the mean square error metric might work for applications such as weather comparisons. It's easy to calculate and gives numbers that do make sense.
Since you want to compare measurements over time, you can just leave out missing values from the calculation.
For values that are not time-bound, or even for unsorted, multi-dimensional scatter data, it's a bit more difficult. Choosing a good distance metric becomes part of the art of analysing such data.
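A minimal sketch of the mean square error idea, skipping positions where either measurement is missing (missing values represented here as None, purely an assumption about how the data is encoded):

def mean_square_error(a, b):
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return sum((x - y) ** 2 for x, y in pairs) / len(pairs)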
Use the Pearson correlation coefficient. I figured out how to calculate it in an SQL query, which can be found here: http://vanheusden.com/misc/pearson.php
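If you are working in Python rather than SQL, SciPy computes the same coefficient; the numbers below are made up purely for illustration:

from scipy.stats import pearsonr

r, p_value = pearsonr([52.5, 60.2, 58.5, 55.4], [51.0, 61.5, 57.9, 54.8])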
In finance, Beta is used to measure the correlation of 2 series of numbers. E.g., Beta could answer the question "Over the last year, how much would the price of IBM go up on a day that the price of the S&P 500 index went up 5%?" It deals with the percentage of the move, so the 2 series can have different scales.
In my example, the Beta is Covariance(IBM, S&P 500) / Variance(S&P 500).
Wikipedia has pages explaining Covariance, Variance, and Beta: http://en.wikipedia.org/wiki/Beta_(finance)
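A minimal sketch of that calculation on two price series, converting to daily percentage moves first as described (names are my own):

import numpy as np

def beta(series, benchmark):
    s, b = np.asarray(series, float), np.asarray(benchmark, float)
    r_s = np.diff(s) / s[:-1]   # daily percentage moves of the series
    r_b = np.diff(b) / b[:-1]   # daily percentage moves of the benchmark
    return np.cov(r_s, r_b)[0, 1] / np.var(r_b, ddof=1)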
Look at statistical sites. I think you are looking for correlation.
As an example, I'll assume you're measuring temp, wind, and precip. We'll call these items "features". So valid values might be:
Temp: -50 to 100F (I'm in Minnesota, USA)
Wind: 0 to 120 Miles/hr (not sure if this is realistic but bear with me)
Precip: 0 to 100
Start by normalizing your data. Temp has a range of 150 units, Wind 120 units, and Precip 100 units. Multiply your wind units by 1.25 and Precip by 1.5 to make them roughly the same "scale" as your temp. You can get fancy here and make rules that weigh one feature as more valuable than others. In this example, wind might have a huge range but usually stays in a smaller range so you want to weigh it less to prevent it from skewing your results.
Now, imagine each measurement as a point in multi-dimensional space. This example measures 3d space (temp, wind, precip). The nice thing is, if we add more features, we simply increase the dimensionality of our space but the math stays the same. Anyway, we want to find the historical points that are closest to our current point. The easiest way to do that is Euclidean distance. So measure the distance from our current point to each historical point and keep the closest matches:
import math
import heapq

def closest_matches(current, history, k):
    matches = []  # max-heap of (negated distance, index) for the k best matches so far
    for i, hp in enumerate(history):
        distance = math.sqrt(
            (current.temp - hp.temp) ** 2 +
            (current.wind - hp.wind) ** 2 +
            (current.precip - hp.precip) ** 2)
        if len(matches) < k:
            heapq.heappush(matches, (-distance, i))
        elif distance < -matches[0][0]:
            heapq.heapreplace(matches, (-distance, i))  # replace the worst match kept so far
    return [history[i] for _, i in sorted(matches, reverse=True)]
This is a brute-force approach. If you have the time, you could get a lot fancier. Multi-dimensional data can be represented as trees like kd-trees or r-trees. If you have a lot of data, comparing your current observation with every historical observation would be too slow. Trees speed up your search. You might want to take a look at Data Clustering and Nearest Neighbor Search.
Cheers.
Talk to a statistician.
Seriously.
They do this type of thing for a living.
You write that the "similarity of two sets is a bit subjective", but it's not subjective at all; it's a matter of determining the appropriate criteria for similarity for your problem domain.
This is one of those situation where you are much better off speaking to a professional than asking a bunch of programmers.
First of all, ask yourself if these are sets, or ordered collections.
I assume that these are ordered collections with duplicates. The most obvious algorithm is to select a tolerance within which numbers are considered the same, and count the number of slots where the numbers are the same under that measure.
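A minimal sketch of that counting approach, assuming the two collections are aligned position by position:

def similarity(a, b, tolerance=1.0):
    # Number of slots where the two ordered collections agree within the tolerance.
    return sum(abs(x - y) <= tolerance for x, y in zip(a, b))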
I do have a solution implemented for this in my application, but I'm looking to see if there is something that is better or more "correct". For each historical day I do the following:
function calculate_score(historical_set, forecast_set)
{
double c = correlation(historical_set, forecast_set);
double avg_history = average(historical_set);
double avg_forecast = average(forecast_set);
double penalty = abs(avg_history - avg_forecast) / avg_forecast;
return c - penalty;
}
I then sort all the results from high to low.
Since the correlation is a value from -1 to 1 that says whether the numbers fall or rise together, I then "penalize" that with the percentage difference the averages of the two sets of numbers.
A couple of times, you've mentioned that you don't know the distribution of the data, which is of course true. I mean, tomorrow there could be a day that is 150 degrees F with 2000 km/hr winds, but it seems pretty unlikely.
I would argue that you have a very good idea of the distribution, since you have a long historical record. Given that, you can put everything in terms of quantiles of the historical distribution, and do something with absolute or squared difference of the quantiles on all measures. This is another normalization method, but one that accounts for the non-linearities in the data.
Normalization in any style should make all variables comparable.
As an example, let's say a given day is a windy, hot day: it might have a temp quantile of .75 and a wind quantile of .75. The .76 quantile for heat might be 1 degree away, while the one for wind might be 3 km/h away.
This focus on the empirical distribution is easy to understand as well, and could be more robust than estimators that work on the raw values (like mean square error).
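A minimal sketch of that quantile mapping, assuming SciPy and a history dict that maps each feature name to its full historical record (names are hypothetical):

from scipy.stats import percentileofscore

def to_quantiles(observation, history):
    # Map each feature value to its quantile (0..1) in the historical record.
    return {feature: percentileofscore(history[feature], value) / 100.0
            for feature, value in observation.items()}

def quantile_distance(obs_a, obs_b, history):
    qa, qb = to_quantiles(obs_a, history), to_quantiles(obs_b, history)
    return sum((qa[f] - qb[f]) ** 2 for f in qa)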
Are the two data sets ordered, or not?
If ordered, are the indices the same? equally spaced?
If the indices are common (temperatures measured on the same days but at different locations, for example), you can regress the first data set against the second, and then test that the slope is equal to 1 and that the intercept is 0.
http://stattrek.com/AP-Statistics-4/Test-Slope.aspx?Tutorial=AP
Otherwise, you can do two regressions of the y-values against their indices: http://en.wikipedia.org/wiki/Correlation. You'd still want to compare slopes and intercepts.
====
If unordered, I think you want to look at the cumulative distribution functions
http://en.wikipedia.org/wiki/Cumulative_distribution_function
One relevant test is Kolmogorov-Smirnov:
http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test
You could also look at
Student's t-test,
http://en.wikipedia.org/wiki/Student%27s_t-test
or a Wilcoxon signed-rank test http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test
to test equality of means between the two samples.
And you could test for equality of variances with a Levene test http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm
Note: it is possible for dissimilar sets of data to have the same mean and variance -- depending on how rigorous you want to be (and how much data you have), you could consider testing for equality of higher moments, as well.
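A minimal sketch of running those tests with SciPy on two unordered samples a and b (the signed-rank test requires paired samples of equal length):

from scipy import stats

def compare_samples(a, b):
    return {
        "ks": stats.ks_2samp(a, b),        # Kolmogorov-Smirnov: drawn from the same distribution?
        "t": stats.ttest_ind(a, b),        # Student's t-test: equal means?
        "wilcoxon": stats.wilcoxon(a, b),  # Wilcoxon signed-rank: paired, equal-length samples
        "levene": stats.levene(a, b),      # Levene: equal variances?
    }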
Maybe you can see your set of numbers as a vector (each number of the set being a component of the vector).
Then you can simply use the dot product to compute the similarity of 2 given vectors (i.e. sets of numbers).
You might need to normalize your vectors.
More: Cosine similarity
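A minimal sketch of that, with plain NumPy:

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))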