How do I calculate how "true" something STILL is? Weighted votes with the time they happened? - algorithm

Say I have a thing, and you can upvote/downvote it whether it's true. As time goes by, this thing changes, so I want votes that happened longer ago to matter less (at least, that's what I think should happen). Is there a formula for something like this? Am I missing something?
an idea:
vote# is +1 if upvote, -1 if downvote
score = (vote1/time since vote1) + (vote2/time since vote2) + (vote3/time since vote3)
That way score will increase with more votes and each individual vote starts decaying. High score would mean a lot of people upvoted that, and recently.

I believe the most common approach is to use an exponentially-weighted moving average: https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average.
The nice thing with exponential weighting is that you only need to keep track of your current average, and then you can easily update it for each new incoming value, rather than needing to keep a full history of all past values in order to compute an updated average.
(Or at least, that's the case for weighting based on "number of data-points since this one". For weighting based on "amount of time since this data-point", where the data-points aren't at fixed intervals, it's slightly more complicated, because you'll also need to keep the timestamp of the last data-point; but you still don't need to keep the full history, so it's still very simple and efficient.)

Related

Statistics/Algorithm: How do I compare a weekly graph with its own history to see when in the past it was almost the same?

I’ve got a statistical/mathematical problem I’m stumped on and I was really hoping to get some help. I’m working on a research where I need to compare a weekly graph with its own history to see when in the past it was almost the same. Think of this as “finding the closest match”. The information is displayed as a line graph, but it’s readily available as raw data:
Date...................Result
08/10/18......52.5
08/07/18......60.2
08/06/18......58.5
08/05/18......55.4
08/04/18......55.2
and so on...
What I really want is the output to be a form of correlation between the current data points with the other set of 5 concurrent data points in history. So, something like:
Date range.....................Correlation
07/10/18-07/15/18....0.98
We’ll be getting a code written in Python for the software to do this automatically (so that as new data is added, it automatically runs and finds the closest set of numbers to match the current one).
Here’s where the difficulty sets in: Since numbers are on a general upward trend over time, we don’t want it to compare the absolute value (since the numbers might never really match). One suggestion has been to compare the delta (rate of change as a percentage over the previous day), or using a log scale.
I’m wondering: how do I go about this? What kind of calculation I can use to get the desired results? I’ve looked at the different kind of correlation equations, but they don’t account for the “shape” of the data, and they generally just average it out. The shape of the line chart is the important thing.
Thanks very much in advance!
I would simply divide the data of each week by their average (i.e., normalize them to an average of 1), then sum the squares of the differences of each day of each pair of weeks. This sum is what you want to minimize.
If you don't care about how much a graph oscillates relative to its mean, you can normalize also the variance. For each week, calculate mean and variance, then subtract the mean and divide by the root of the variance. Each week will have mean 0 and variance 1. Then minimize the sum of squares of differences like before.
If the normalization of data is all you can change in your workflow, just leave out the sum of squares of differences minimization part.

Coding: Keep track of last N days of records for each user.

Im solving an interesting problem wherein for each user, I would like to keep his last N days of activity. This can be applied to many a use-cases and one such simple one is:
For each user - user can come to gym some random day - I want to get the total number of times he hit the gym over the last 90 days.
This is a tricky one for me.
My thoughts: I thought of storing a vector where each entry would determine a day and then a boolean value might represent his visit. To count, just linear processing of that section in the array would suffice.
What is the best way?
Depending on how complex you need it to be, a simple array that stores each of a clients visits should suffice.
Upon each visit, add a new entry containing the date/time. Each day, run a check to see if any clients contain visit records that are older than 90 days. The first record that is not old enough means there are no more records to check, so you can safely move to the next client.
Hope this helps you!
Make for every client a Queue data structure containing elements with visit date.
When client visits gym, just add current date
Q[ClientIdx].Add(Today)
When you need to get a track for him:
while (not Q[ClientIdx].Empty) and (Today - Q[ClientIdx].Peek > 90)
Q[ClientIdx].Remove //dequeue too old records
VisitCount = Q.Count
You can use standard Queue implementations in many languages or simple own implementation based on array/list if standard one is not available.
Note that every record is added and removed once, so amortized complexity is O(1) per add/count operation
Your idea will work, but is it really space-wise efficient?
Your data-structure would be something like this: A boolean 2D vector (you can imagine it as a matrix), where every row is a user and every column is a day (sorted), so that would consist of a:
matrix of size U x N
where U is the number of users.
To answer the question I initially asked, you need to think how dense this matrix is going to be. If it's going to be much, then you made the right choice, if not, then you wasted (much) space. You can see the trade-off here.
Of course, you have to think about your use case. In the gym example, I do not think this would be space efficient, since most people do not go to the gym every day (I think), which will result in a sparse matrix, meaning that we wasted space.
Another idea would be to have a single vector os size N, where the days are sorted. Every entry would be a single linked list, where every node would be a user.
If a user is found in the list of a day, then it means that he went to gym at that day.
With this approach we allocate exactly as much space as needed, so it's space optimal, regardless of the density I mentioned in the matrix's case.
However, is this it? No, of course not! I discussed about space, but what about time efficiency? For example, search is a usually frequent method we want our data structure to support, and if we would like that to be fast!
In the matrix's case, search would be an O(1) operation, which is sweet, since accessing the matrix is a constant operation.
In the vector+list's case however, the search would take O(L), where L is the average size of the lists our vector has in total.
So which one? It depends on your application!
I would try a hashtable as well, which would not require sorting and is space efficient (What is the space complexity of a hash table?).
How about having an Queue with fixed size (90) of visit details for every user? You can generalise it for multiple users and key advantage is you don't have to worry about maintaining last 90 days of data.
You can dump a queue to list or array and persist if needed in O(n). And as you have mentioned the check for no.of presence will be O(n) as well.

Design an algorithm to match trajectory?

I have a dataset in the form of (timestamp,latitude,longitude). I'm going to be given n entries where each entry is of the form of (timestamp,latitude,longitude). This is for one user.
User1:(timestamp1,latitude1,longitude1)....(timestamp_n,latitude_n,longitude_n)
Now let's say we have 100 users each has a set of points of (timestamp,latitude,longitude)
I want to know which set of users might have matching trajectory.
Matching trajectory would be the same route taken, hence in a given set of timestamps the latitude and longitude should be same or close enough as well as the timestamp should be same or close enough. Close enough can be about 30 seconds for timestamp while for space let it be like 200 metres. I can do this via a brute force approach and I'm looking for better solutions.
You can use a k-dtree or a range tree to index your data. These will let you efficient perform a range query over all three dimensions to your data.
This is quite unrelated to whether the algorithm will still be bruteforce or not.
What I want to present here is how to measure the difference between 2 paths.
It just that I think defining precisely how to quantify the difference will be important.
If you want something faster, then you can probably approximate this quantity later.
Ok, I think the difference between 2 paths is:
The average distance between 2 users over time.
You should be able to interpolate between 2 given data points to find out where the user is at any given time. Just linear interpolation might suffice.
When I say average over time, one would discretize the time so it is easier to compute.
Let's say:
The average distance between 2 users every 10 seconds period.
Edit: The above suggestion assumed that you care about "timing".
Since you mention the timestamp and all.
If you didn't care about it, you shouldn't have put it into the question in the first place.
Anyway, I kind of imagine that it is possible you want to just look at the path itself.
In that case, you could still use the above definition of path difference
simply by ignoring the actual timestamp and imagine that the users start at the same time at the begining of the paths.
The travel speed can be set in various ways... such as making both users complete the path at the same time no matter if one path is longer than another, or maybe just let both travel at the same speed.
Anyway, it all comes down to defining how do you want to measure the path difference.
You need to give more details in the question.

Feedback on ranking algorithm options for my website

I am currently working on writing an algorithm for my new site I plan to launch soon. The index page will display the "hottest" posts at the moment.
Variables to consider are:
Number of votes
How controversial the post is (# between 0-1)
Time since post
I have come up with two possible algorithms, the first and most simple is:
controversial * (numVotesThisHour / (numVotesTotal - numVotesThisHour)
Denom = numVotesTuisHour if numVotesTotal - numVotesThisHour == 0
Highest number is hottest
My other option is to use an algorithm similar to Reddit's (except that the score decreases as time goes by):
[controversial * log(x)] - (TimePassed / interval)
x = { numVotesTotal if numVotesTotal >= 10, 10 if numVotesTotal < 10
Highest number is hottest
The first algorithm would allow older posts to become "hot" again in the future while the second one wouldn't.
So my question is, which one of these two algorithms do you think is more effective? Which one do you think will display the truly "hot" topics at the moment? Can you think of any advantages or disadvantages to using one over the other? I just want to make sure I don't overlook anything so that I can ensure the content is as relevant as possible. Any feedback would be great! Thanks!
Am I missing something. In the first formula you have numVotesTotal in the denominator. So higher number of votes all time will mean it will never be so hot even if it is not so old.
For example if I have two posts - P1 and P2 (both equally controversial). Say P1 has numVotesTotal = 20, and P2 has numVotesTotal = 1000. Now in the last one hour P1 gets numVotesThisHour = 10 and P2 gets numVotesThisHour = 200.
According to the algorithm, P1 is more famous than P2. It doesn't make sense to me.
I think the first algorithm relies too heavily on instantaneous trend. Think of NASCAR, the current leader could be going 0 m.p.h. because he's at a pit stop. The second one uses the notion of average trend. I think both have their uses.
So for two posts with the same total votes and controversial rating, but where posts one receives 20 votes in the first hour and zero in the second, while the other receives 10 in each hour. The first post will be buried by the first algorithm but the second algorithm will rank them equally.
YMMV, but I think the 'hotness' is entirely dependent on the time frame, and not at all on the total votes unless your time frame is 'all time'. Also, it seems to me that the proportion of all votes in the relevant time frame, rather than the absolute number of them, is the important figure.
You might have several categories of hot:
Hottest this hour
Hottest this week
Hottest since your last visit
Hottest all time
So, 'Hottest in the last [whatever]' could be calculated like this:
votes_for_topic_in_timeframe / all_votes_in_timeframe
if you especially want a number between 0 and 1, (useful for comparing across categories) or, if you only want the ones in a specific timeframe, just take the votes_for_topic_in_timeframe values and sort into descending order.
If you don't want the user explicitly choosing the time frame, you may want to calculate all (say) four versions (or perhaps just the top 3), assign a multiplier to each category to give each category a relative importance, and calculate total values for each topic to take the top n. This has the advantage of potentially hiding from the user that no-one at all has voted in the last hour ;)

Algorithm for most recently/often contacts for auto-complete?

We have an auto-complete list that's populated when an you send an email to someone, which is all well and good until the list gets really big you need to type more and more of an address to get to the one you want, which goes against the purpose of auto-complete
I was thinking that some logic should be added so that the auto-complete results should be sorted by some function of most recently contacted or most often contacted rather than just alphabetical order.
What I want to know is if there's any known good algorithms for this kind of search, or if anyone has any suggestions.
I was thinking just a point system thing, with something like same day is 5 points, last three days is 4 points, last week is 3 points, last month is 2 points and last 6 months is 1 point. Then for most often, 25+ is 5 points, 15+ is 4, 10+ is 3, 5+ is 2, 2+ is 1. No real logic other than those numbers "feel" about right.
Other than just arbitrarily picked numbers does anyone have any input? Other numbers also welcome if you can give a reason why you think they're better than mine
Edit: This would be primarily in a business environment where recentness (yay for making up words) is often just as important as frequency. Also, past a certain point there really isn't much difference between say someone you talked to 80 times vs say 30 times.
Take a look at Self organizing lists.
A quick and dirty look:
Move to Front Heuristic:
A linked list, Such that whenever a node is selected, it is moved to the front of the list.
Frequency Heuristic:
A linked list, such that whenever a node is selected, its frequency count is incremented, and then the node is bubbled towards the front of the list, so that the most frequently accessed is at the head of the list.
It looks like the move to front implementation would best suit your needs.
EDIT: When an address is selected, add one to its frequency, and move to the front of the group of nodes with the same weight (or (weight div x) for courser groupings). I see aging as a real problem with your proposed implementation, in that it requires calculating a weight on each and every item. A self organizing list is a good way to go, but the algorithm needs a bit of tweaking to do what you want.
Further Edit:
Aging refers to the fact that weights decrease over time, which means you need to know each and every time an address was used. Which means, that you have to have the entire email history available to you when you construct your list.
The issue is that we want to perform calculations (other than search) on a node only when it is actually accessed -- This gives us our statistical good performance.
This kind of thing seems similar to what is done by firefox when hinting what is the site you are typing for.
Unfortunately I don't know exactly how firefox does it, point system seems good as well, maybe you'll need to balance your points :)
I'd go for something similar to:
NoM = Number of Mail
(NoM sent to X today) + 1/2 * (NoM sent to X during the last week)/7 + 1/3 * (NoM sent to X during the last month)/30
Contacts you did not write during the last month (it could be changed) will have 0 points. You could start sorting them for NoM sent in total (since it is on the contact list :). These will be showed after contacts with points > 0
It's just an idea, anyway it is to give different importance to the most and just mailed contacts.
If you want to get crazy, mark the most 'active' emails in one of several ways:
Last access
Frequency of use
Contacts with pending sales
Direct bosses
Etc
Then, present the active emails at the top of the list. Pay attention to which "group" your user uses most. Switch to that sorting strategy exclusively after enough data is collected.
It's a lot of work but kind of fun...
Maybe count the number of emails sent to each address. Then:
ORDER BY EmailCount DESC, LastName, FirstName
That way, your most-often-used addresses come first, even if they haven't been used in a few days.
I like the idea of a point-based system, with points for recent use, frequency of use, and potentially other factors (prefer contacts in the local domain?).
I've worked on a few systems like this, and neither "most recently used" nor "most commonly used" work very well. The "most recent" can be a real pain if you accidentally mis-type something once. Alternatively, "most used" doesn't evolve much over time, if you had a lot of contact with somebody last year, but now your job has changed, for example.
Once you have the set of measurements you want to use, you could create an interactive apoplication to test out different weights, and see which ones give you the best results for some sample data.
This paper describes a single-parameter family of cache eviction policies that includes least recently used and least frequently used policies as special cases.
The parameter, lambda, ranges from 0 to 1. When lambda is 0 it performs exactly like an LFU cache, when lambda is 1 it performs exactly like an LRU cache. In between 0 and 1 it combines both recency and frequency information in a natural way.
In spite of an answer having been chosen, I want to submit my approach for consideration, and feedback.
I would account for frequency by incrementing a counter each use, but by some larger-than-one value, like 10 (To add precision to the second point).
I would account for recency by multiplying all counters at regular intervals (say, 24 hours) by some diminisher (say, 0.9).
Each use:
UPDATE `addresslist` SET `favor` = `favor` + 10 WHERE `address` = 'foo#bar.com'
Each interval:
UPDATE `addresslist` SET `favor` = FLOOR(`favor` * 0.9)
In this way I collapse both frequency and recency to one field, avoid the need for keeping a detailed history to derive {last day, last week, last month} and keep the math (mostly) integer.
The increment and diminisher would have to be adjusted to preference, of course.

Resources