Algorithm to do efficient weighted ranking?

I need an algorithm to do fast weighted ranking of Twitter posts.
Each post has a number of ranking scores (like age, author follower count, keyword mentions, etc.). I'm looking for an algorithm that can quickly find the top N Tweets, given the weights of each ranking score.
Now, the use case is that these weights will change, and recalculating the ranking scores for every tweet every time the weights change is prohibitively expensive.
I will have access to sorted lists of Tweets, one for each ranking score. So I'm looking for an algorithm to efficiently search through these lists to find my top N.

NOTICE: This answer is provided due to the belief that knowledge is always good (even if it might be used for evil purposes). If you are able to obtain and store/track information like age, author follower count, keyword mentions, etc without ensuring participants fully understand how their data will be used and without obtaining every participant's explicit consent (and without "opt-in, with the ability to opt out at any time"); then you are violating people's privacy and deserve bankruptcy for your grossly unethical malware. It's bad enough that multiple large companies are evil without making it worse.
Assume there's a formula like score = a_rank * a_weight + b_rank * b_weight + c_rank * c_weight.
This can be split into pieces, like:
a_score = a_rank * a_weight
b_score = b_rank * b_weight
c_score = c_rank * c_weight
score = a_score + b_score + c_score
If you know the range of a_rank you can sort the entries into "a_rank buckets". For example, if you have 100 buckets and "a_rank" can be a value from "a_rank_min" to "a_rank_max"; then "bucket_number = (a_rank - a_rank_min) * 100 / (a_rank_max - a_rank_min)" (clamping a_rank == a_rank_max into the last bucket, so the index stays in range).
From here you can say that all entries in a specific "a_rank bucket" must have an "a_score" in a specific range; and you can calculate the minimum and maximum possible "a_score" for all entries in a bucket from "bucket_number" alone; using formulas like "min_a_score_for_bucket = (bucketNumber * (a_rank_max - a_rank_min) / 100 + a_rank_min) * a_weight" and "max_a_score_for_bucket = ( (bucketNumber+1) * (a_rank_max - a_rank_min) / 100 + a_rank_min) * a_weight - 1".
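A minimal sketch of that bucket math in Python, assuming integer ranks and weights as in the formulas above (the function names, the bucket count constant, and the clamp of the top value into the last bucket are my own choices):

    NUM_BUCKETS = 100

    def bucket_number(a_rank, a_rank_min, a_rank_max):
        # Map a_rank into one of NUM_BUCKETS buckets; clamp a_rank_max
        # into the last bucket so the index stays in [0, NUM_BUCKETS - 1].
        b = (a_rank - a_rank_min) * NUM_BUCKETS // (a_rank_max - a_rank_min)
        return min(b, NUM_BUCKETS - 1)

    def a_score_bounds(bucket, a_rank_min, a_rank_max, a_weight):
        # Minimum and maximum possible a_score for any entry in this bucket,
        # derived from the bucket number alone (same formulas as above).
        lo = (bucket * (a_rank_max - a_rank_min) // NUM_BUCKETS + a_rank_min) * a_weight
        hi = ((bucket + 1) * (a_rank_max - a_rank_min) // NUM_BUCKETS + a_rank_min) * a_weight - 1
        return lo, hi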
The next step is to establish a "current 10 entries with the highest score so far". Do this by selecting the first 10 entries from the highest "a_rank bucket/s" and calculating their scores fully.
Once this is done (and you know the "10th highest score so far") you can calculate a filter for each bucket. If you assume all entries in a bucket have the maximum possible a_rank (determined from the bucket number alone) and the maximum possible c_rank (determined from the possible range of all c_rank values), then you can calculate the minimum value of b_rank that an entry would need for its score to beat the "10th highest score so far". In the same way, if you assume all entries in a bucket have the maximum possible a_rank and the maximum possible b_rank, you can calculate the minimum value of c_rank that would be needed. The "minimum needed b_rank" and "minimum needed c_rank" can then be used to skip over entries that couldn't possibly beat the "10th highest score so far", without calculating the score for any of those entries.
Of course, every time you find an entry with a higher score than the "10th highest score so far" you get a new "10th highest score so far" and have to recalculate the "minimum needed b_rank" and "minimum needed c_rank" for the buckets. Ideally you'd look at buckets in "highest a_rank bucket first" order, and therefore only need to calculate the "minimum needed b_rank" and "minimum needed c_rank" for the current bucket.
Near the start (while you're looking at the bucket with the highest a_rank values) it probably won't filter out many entries and might even make performance worse (due to the cost of recalculating "minimum needed b_rank" and "minimum needed c_rank" values). Near the end (while you're looking at the buckets with the lowest a_rank values) you may be able to skip entire buckets without looking at any entry in them.
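To make the scan concrete, here is a minimal Python sketch. It simplifies the per-entry "minimum needed b_rank"/"minimum needed c_rank" filters into a single upper bound per bucket (so it only skips whole buckets, not individual entries); the entry attributes, the heap usage, and the function signature are my own assumptions:

    import heapq
    from itertools import count

    def top_n(buckets, n, a_weight, b_weight, c_weight,
              a_score_bounds, b_rank_max, c_rank_max):
        # buckets[i] is the list of entries in "a_rank bucket" i;
        # a_score_bounds(i) gives (min, max) possible a_score for bucket i
        # (e.g. the helper above with the range and weight bound in).
        best = []        # min-heap of the n best (score, tiebreak, entry)
        tie = count()
        for i in range(len(buckets) - 1, -1, -1):  # highest a_rank bucket first
            _, max_a_score = a_score_bounds(i)
            ceiling = max_a_score + b_rank_max * b_weight + c_rank_max * c_weight
            if len(best) == n and ceiling <= best[0][0]:
                break    # this bucket (and every lower one) can't qualify
            for e in buckets[i]:
                score = e.a_rank * a_weight + e.b_rank * b_weight + e.c_rank * c_weight
                if len(best) < n:
                    heapq.heappush(best, (score, next(tie), e))
                elif score > best[0][0]:
                    heapq.heapreplace(best, (score, next(tie), e))
        return [e for _, _, e in sorted(best, reverse=True)]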
Note that:
all the weights can change without changing any of the buckets; but it's nicer for performance if "a_rank" has the strongest influence on the score.
the range of values for "a_rank" shouldn't change (you'd have to rebuild the buckets if it does); but the range of values for "b_rank" and "c_rank" can be variable (updated every time a new entry is created)
sorting each bucket in "highest a_rank first" order (and then using "highest b_rank first" as a tie-breaker, etc) will help performance when finding the 10 entries with the highest score; but it will also add overhead when an entry is added. For this reason, for most cases, I probably wouldn't bother sorting the contents of buckets at all.
it would be nice if you can have a bucket for each possible value of "a_rank"; as this gives almost all of the benefits of sorting without any of the overhead of sorting. If you can't have a bucket for each possible value of "a_rank", then increasing the number of buckets can help performance.
in theory; it would be possible to have multiple layers of "bucketing" (e.g. "a_rank buckets" that contain "b_rank buckets"). This would significantly increase complexity and memory consumption; but (especially if no sorting is done) it might significantly improve performance (or might make it worse).

Related

Search ids in a space of 10^30

I have distributed 50 million ids within a numeric space of size 10^30. The ids are distributed randomly; no series or inverse function can be found. For example, the minimum and maximum are:
25083112306903763728975529743
29353757632236106718171971627
Two consecutive ids have a distance on the order of at least 10^19. For example:
28249462572807242052513352500
28249462537043093417625790615
This distribution is resistant to brute-force attack, since finding the id consecutive to a given one would take at least 10^19 searches (to get an idea of the timing: at 1000 searches per second, that is 10^16 seconds...).
Are there other search algorithms for this space that could take less time and make my id distribution less solid?
If your 50 millions are really randomly distributed in a space of 10^30, you can't do anything better than brute force.
This means you can only iterate over your 10^30 values in a random order, and on average you have to test 10^30 / (5 * 10^7) = 2 * 10^22 of them to find one.
Of course, there exists an algorithm to find all of them at first try, but it's extremely unlikely that you stumble on it without knowing the ids first.

Open-ended tournament pairing algorithm

I'm developing a tournament model for a virtual city commerce game (Urbien.com) and would love to get some algorithm suggestions. Here's the scenario and current "basic" implementation:
Scenario
Entries are paired up duel-style, like on the original Facemash or Pixoto.com.
The "player" is a judge, who gets a stream of dueling pairs and must choose a winner for each pair.
Tournaments never end, people can submit new entries at any time, and winners of the day/week/month/millennium are chosen based on the data at that date.
Problems to be solved
Rating algorithm - how to rate tournament entries and how to adjust their ratings after each match?
Pairing algorithm - how to choose the next pair to feed the player?
Current solution
Rating algorithm - the Elo rating system currently used in chess and other tournaments.
Pairing algorithm - our current algorithm recognizes two imperatives:
Give more duels to entries that have had fewer duels so far
Match people with similar ratings with higher probability
Given:
N = total number of entries in the tournament
D = total number of duels played in the tournament so far by all players
Dx = how many duels player x has had so far
To choose players x and y to duel, we first choose player x with probability:
p(x) = (1 - (Dx / D)) / N
Then choose player y the following way:
Sort the players by rating
Let the probability of choosing player j at index jIdx in the sorted list be:
p(j) = 0, if j == x
p(j) = n * r^abs(jIdx - xIdx), otherwise
where 0 < r < 1 is a coefficient to be chosen, and n is a normalization factor.
Basically the probabilities in either direction from x form a geometric series, normalized so they sum to 1.
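As an illustrative sketch (not the site's actual code), the two selection steps might look like this in Python. The player objects, the .rating attribute, and the choice r = 0.5 are assumptions of mine; random.choices normalizes the weights, which plays the role of the factor n above:

    import random

    def pick_pair(players, duel_counts, total_duels, r=0.5):
        # Pick x with probability proportional to 1 - Dx / D
        # (uniform until any duels have been played at all).
        if total_duels:
            wx = [1 - duel_counts[p] / total_duels for p in players]
        else:
            wx = None
        x = random.choices(players, weights=wx)[0]

        # Pick y with probability proportional to r^|jIdx - xIdx|
        # over the rating-sorted list, excluding x itself.
        by_rating = sorted(players, key=lambda p: p.rating)
        x_idx = by_rating.index(x)
        wy = [0 if p is x else r ** abs(j - x_idx)
              for j, p in enumerate(by_rating)]
        y = random.choices(by_rating, weights=wy)[0]
        return x, y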
Concerns
Maximize informational value of a duel - pairing the lowest rated entry against the highest rated entry is very unlikely to give you any useful information.
Speed - we don't want to do massive amounts of calculations just to choose one pair. One alternative is to use something like the Swiss pairing system and pair up all entries at once, instead of choosing new duels one at a time. This has the drawback (?) that all entries submitted in a given timeframe will experience roughly the same amount of duels, which may or may not be desirable.
Equilibrium - Pixoto's ImageDuel algorithm detects when entries are unlikely to further improve their rating and gives them less duels from then on. The benefits of such detection are debatable. On the one hand, you can save on computation if you "pause" half the entries. On the other hand, entries with established ratings may be the perfect matches for new entries, to establish the newbies' ratings.
Number of entries - if there are just a few entries, say 10, perhaps a simpler algorithm should be used.
Wins/Losses - how does the player's win/loss ratio affect the next pairing, if at all?
Storage - what to store about each entry and about the tournament itself? Currently stored:
Tournament Entry: # duels so far, # wins, # losses, rating
Tournament: # duels so far, # entries
Instead of throwing in Elo and ad-hoc probability formulae, you could use a standard approach based on the maximum likelihood method.
The maximum likelihood method is a method for parameter estimation and it works like this (example). Every contestant (player) is assigned a parameter s[i] (1 <= i <= N where N is total number of contestants) that measures the strength or skill of that player. You pick a formula that maps the strengths of two players into a probability that the first player wins. For example,
P(i, j) = 1/(1 + exp(s[j] - s[i]))
which is the logistic curve (see http://en.wikipedia.org/wiki/Sigmoid_function). When you then have a table that shows the actual results between the users, you use global optimization (e.g. gradient descent) to find the strength parameters s[1] .. s[N] that maximize the probability of the actually observed match results. E.g. if you have three contestants and have observed two results:
Player 1 won over Player 2
Player 2 won over Player 3
then you find parameters s[1], s[2], s[3] that maximize the value of the product
P(1, 2) * P(2, 3)
Incidentally, it can be easier to maximize
log P(1, 2) + log P(2, 3)
Note that if you use something like the logistic curve, only the difference of the strength parameters matters, so you need to anchor the values somewhere, e.g. choose arbitrarily
s[1] = 0
In order to have more recent matches "weigh" more, you can adjust the importance of the match results based on their age. If t measures the time since a match took place (in some time units), you can maximize the value of the sum (using the example)
e^-t log P(1, 2) + e^-t' log P(2, 3)
where t and t' are the ages of the matches 1-2 and 2-3, so that those games that occurred more recently weigh more.
The interesting thing in this approach is that when the strength parameters have values, the P(...) formula can be used immediately to calculate the win/lose probability for any future match. To pair contestants, you can pair those where the P(...) value is close to 0.5, and then prefer those contestants whose time-adjusted number of matches (sum of e^-t1 + e^-t2 + ...) for match ages t1, t2, ... is low. The best thing would be to calculate the total impact of a win or loss between two players globally and then prefer those matches that have the largest expected impact on the ratings, but that could require lots of calculations.
You don't need to run the maximum likelihood estimation / global optimization algorithm all the time; you can run it e.g. once a day as a batch job and use the results for the next day's matchmaking. The time-adjusted match counts can be updated in real time anyway.
On the algorithmic side, you can sort the players after the maximum likelihood run based on their s parameter, so it's very easy to find equal-strength players quickly.
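To illustrate, here is a minimal gradient-ascent fit of the strength parameters in Python. The data format, learning rate, and iteration count are my own assumptions, and the anchoring trick (s[1] = 0 above) appears as re-centring on s[0]:

    import math

    def fit_strengths(results, n_players, lr=0.1, iters=1000):
        # results: list of (winner, loser, age) triples, age in time units.
        # Maximizes sum of e^-age * log P(winner beats loser), where
        # P(i, j) = 1 / (1 + exp(s[j] - s[i])) as defined above.
        s = [0.0] * n_players
        for _ in range(iters):
            grad = [0.0] * n_players
            for w, l, age in results:
                p = 1.0 / (1.0 + math.exp(s[l] - s[w]))
                g = math.exp(-age) * (1.0 - p)  # d/ds[w] of the weighted log P
                grad[w] += g
                grad[l] -= g
            for i in range(n_players):
                s[i] += lr * grad[i]
            s = [v - s[0] for v in s]  # anchor the scale: keep s[0] == 0
        return s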

Data structure/algorithm to efficiently save weighted moving average

I'd like to sum up moving averages for a number of different categories when storing log records. Imagine a service that saves web server logs one entry at a time. Let's further imagine we don't have access to the logged records; we see each one once but can't access it later on.
For different pages, I'd like to know
the total number of hits (easy)
a "recent" average (like one month or so)
a "long term" average (over a year)
Is there any clever algorithm/data model that allows saving such moving averages without having to recalculate them by summing up huge quantities of data?
I don't need an exact average (exactly 30 days or so) but just trend indicators. So some fuzziness is not a problem at all. It should just make sure that newer entries are weighted higher than older ones.
One solution probably would be to auto-create statistics records for each month. However, I don't even need past month statistics, so this seems like overkill. And it wouldn't give me a moving average but rather swap to new values from month to month.
An easy solution would be to keep an exponentially decaying total.
It can be calculated using the following formula:
newX = oldX * (p ^ (newT - oldT)) + delta
where oldX is the old value of your total (at time oldT), newX is the new value of your total (at time newT); delta is the contribution of new events to the total (for example, the number of hits today); p is less than or equal to 1 and is the decay factor. If we take p = 1, then we have the total number of hits. By decreasing p, we effectively decrease the interval our total describes.
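A minimal Python sketch of such a decaying total (the class shape and integer time units are assumptions of mine):

    class DecayingTotal:
        def __init__(self, p):
            self.p = p      # decay factor per time unit, 0 < p <= 1
            self.x = 0.0    # current decayed total (oldX/newX above)
            self.t = 0      # time of the last update (oldT above)

        def add(self, delta, now):
            # Decay the old total for the elapsed time, then add new events.
            self.x = self.x * (self.p ** (now - self.t)) + delta
            self.t = now

With p = 1 this degenerates into a plain hit counter; decreasing p shortens the interval the total describes, exactly as stated above.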
If all you really want is a smoothed value with a given time constant then the easiest thing is to use a single pole recursive IIR filter (aka AR or auto-regressive filter in time series analysis). This takes the form:
X_new = (1 - k) * X_old + k * x
where X_old is the previous smoothed value, X_new is the new smoothed value, x is the current data point, and k is a factor which determines the time constant (usually a small value, < 0.1; the smaller k is, the longer the time constant). You may need to determine the two k values (one value for "recent" and a smaller value for "long term") empirically, based on your sample rate, which ideally should be reasonably constant, e.g. one update per day.
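For illustration, a hedged Python sketch with two time constants (daily_hits and the specific k values are assumptions of mine, chosen for roughly month-scale and year-scale trends at one update per day):

    def smooth(x_old, x, k):
        # One-pole recursive (IIR) smoothing step; smaller k = longer memory.
        return (1 - k) * x_old + k * x

    recent = long_term = 0.0
    for hits in daily_hits:                        # one update per day
        recent = smooth(recent, hits, k=1/30)      # ~one-month time constant
        long_term = smooth(long_term, hits, k=1/365)  # ~one-year time constant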
This may be a solution for you.
You can aggregate the data into intermediate storage grouped by hour or day. Then the grouping function will work very fast, because you only need to group a small number of records, and inserts will be fast as well. The precision trade-offs are up to you.
It can be better than the exponential/auto-regressive algorithms above because it's easier to understand what you're calculating, and it doesn't require math at each step.
For recent data you can use capped collections with a limited number of records. These are supported natively by some databases, for example MongoDB.
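A tiny Python sketch of the aggregation idea, keeping per-day counters and pruning old ones in the spirit of a capped collection (the window size and data layout are my own assumptions):

    from collections import defaultdict

    MAX_DAYS = 400                 # assumed retention window
    hits_by_day = defaultdict(int)

    def record_hit(page, day):
        # Aggregate one log entry into its (page, day) bucket.
        hits_by_day[(page, day)] += 1

    def average(page, day, window):
        # Average daily hits for `page` over the last `window` days.
        return sum(hits_by_day.get((page, d), 0)
                   for d in range(day - window, day)) / window

    def prune(today):
        # Drop buckets older than the retention window (the "capped" part).
        for key in [k for k in hits_by_day if k[1] < today - MAX_DAYS]:
            del hits_by_day[key]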

Reducing the Average Number of Comparisons in Selection

The problem here is to reduce the average number of comparisons needed in a selection sort.
I am reading an article on this, and here is a text snippet:
More generally, a sample S' of s elements is chosen from the n elements. Let "delta" be some number, which we will choose later so as to minimize the average number of comparisons used by the procedure. We find the (v1 = k*s/n - delta)th and (v2 = k*s/n + delta)th smallest elements in S'. Almost certainly, the kth smallest element in S will fall between v1 and v2, so we are left with a selection problem on (2 * delta) elements. With low probability, the kth smallest element does not fall in this range, and we have considerable work to do. However, with a good choice of s and delta, we can ensure, by the laws of probability, that the second case does not adversely affect the total work.
I do not follow the above text. Can anyone please explain it to me with an example? How did the author reduce the problem to 2 * delta elements? And how does he know that there is a low probability that the element does not fall into this range?
Thanks!
The basis for the idea is that the normal selection algorithm has linear runtime complexity, but is slow in practical terms: we need to sort all the elements in groups of five and recursively do even more work, so it is O(n) but with too large a constant. The idea, then, is to reduce the number of comparisons in the selection algorithm (not a selection sort, necessarily). Intuitively it is the same as in basic statistics: if I take a sample subspace of large enough proportion, it is likely that the distribution of data in the subspace adequately reflects the data in the whole space.
So if I'm looking for the kth number in a set of size one million, I could instead take say 10 000 (already one hundredth the size), which is still large enough to be a good representation of the global distribution, and look for the k/100th number. That's simple scaling. So if the space was 10 and I was looking for the 3rd, that's like looking for the 30th in 100, or the 300th in 1000, etc. Essentially k/S = k'/S' (where we're looking for the kth number in S, and we translate that to the k'th number in S' our subspace) and therefore k' = k*S'/S which should look familiar, since in the text you quoted S' is denoted by s, and S by n, and that's the same fraction quoted.
Now, in order to take statistical fluctuations into account, we don't assume that the subspace will be a perfect representation of the data's distribution, so we allow for some fluctuation, namely delta. We say: let's find the (k'-delta)th and (k'+delta)th elements in S', and then we can say with great certainty (i.e. high mathematical probability) that the kth value from S lies between those two elements.
To wrap it all up we perform these two selections on S', then partition S accordingly, and now do [normal] selection on the much smaller interval in the partition. This ends up being almost optimal for the elements outside the interval, because we don't do selection on those, only partition them. So the selection process is faster, because we have reduced the problem size from S to S'.
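A hedged Python sketch of the whole scheme (the sample size, the delta, and the use of plain sorting for the small sub-problems are simplifications of mine, not the article's exact procedure; k is 1-based):

    import random

    def approx_select(data, k, sample_size=10_000, delta=50):
        # Find the kth smallest element of data via a random sample.
        n = len(data)
        sample = sorted(random.sample(data, min(sample_size, n)))
        s = len(sample)
        # Scale k into the sample, with a +/- delta safety margin.
        lo = max(0, k * s // n - delta)
        hi = min(s - 1, k * s // n + delta)
        v1, v2 = sample[lo], sample[hi]
        # Partition: with high probability the kth smallest lies in [v1, v2].
        below = sum(1 for x in data if x < v1)
        middle = sorted(x for x in data if v1 <= x <= v2)
        if below <= k - 1 < below + len(middle):
            return middle[k - 1 - below]
        return sorted(data)[k - 1]  # rare fallback: solve the full problem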

How do I pick the most beneficial combination of items from a set of items?

I'm designing a piece of a game where the AI needs to determine which combination of armor will give the best overall stat bonus to the character. Each character will have about 10 stats, of which only 3-4 are important, and of those important ones, a few will be more important than the others.
Armor will also give a boost to 1 or all stats. For example, a shirt might give +4 to the character's int and +2 stamina while at the same time, a pair of pants may have +7 strength and nothing else.
So let's say that a character has a healthy choice of armor to use (5 pairs of pants, 5 pairs of gloves, etc.) We've designated that Int and Perception are the most important stats for this character. How could I write an algorithm that would determine which combination of armor and items would result in the highest of any given stat (say in this example Int and Perception)?
Targeting one statistic
This is pretty straightforward. First, a few assumptions:
You didn't mention this, but presumably one can only wear at most one kind of armor for a particular slot. That is, you can't wear two pairs of pants, or two shirts.
Presumably, also, the choice of one piece of gear does not affect or conflict with others (other than the constraint of not having more than one piece of clothing in the same slot). That is, if you wear pants, this in no way precludes you from wearing a shirt. But notice, more subtly, that we're assuming you don't get some sort of synergy effect from wearing two related items.
Suppose that you want to target statistic X. Then the algorithm is as follows:
Group all the items by slot.
Within each group, sort the potential items in that group by how much they boost X, in descending order.
Pick the first item in each group and wear it.
The set of items chosen is the optimal loadout.
Proof: The only way to get a higher X stat would be if there were an item A which provided more X than the item chosen from its group. But we already sorted all the items in each group in descending order, so there can be no such A.
What happens if the assumptions are violated?
If assumption one isn't true -- that is, you can wear multiple items in each slot -- then instead of picking the first item from each group, pick the first Q(s) items from each group, where Q(s) is the number of items that can go in slot s.
If assumption two isn't true -- that is, items do affect each other -- then we don't have enough information to solve the problem. We'd need to know specifically how items can affect each other, or else be forced to try every possible combination of items through brute force and see which ones have the best overall results.
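For that brute-force fallback, a minimal sketch (total_fitness is a hypothetical game-specific function that scores a whole combination, including any synergy effects):

    from itertools import product

    def best_by_brute_force(items_by_slot, total_fitness):
        # Try every combination of one item per slot; only viable for
        # small inventories, since the combination count multiplies.
        return max(product(*items_by_slot.values()), key=total_fitness)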
Targeting N statistics
If you want to target multiple stats at once, you need a way to tell "how good" something is. This is called a fitness function. You'll need to decide how important the N statistics are, relative to each other. For example, you might decide that every +1 to Perception is worth 10 points, while every +1 to Intelligence is only worth 6 points. You now have a way to evaluate the "goodness" of items relative to each other.
Once you have that, instead of optimizing for X, you instead optimize for F, the fitness function. The process is then the same as the above for one statistic.
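A compact Python sketch covering both cases; targeting a single statistic X is just the fitness function with weight 1 on X and 0 elsewhere (the item representation and the example weights are assumptions of mine):

    def best_loadout(items_by_slot, weights):
        # weights, e.g. {"perception": 10, "intelligence": 6}.
        def fitness(item):
            # Weighted sum of the item's stat boosts.
            return sum(weights.get(stat, 0) * boost
                       for stat, boost in item.stats.items())
        # Independently pick the highest-fitness item for each slot
        # (valid under assumptions one and two above).
        return {slot: max(group, key=fitness)
                for slot, group in items_by_slot.items()}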
If there is no restriction on the number of items per category, the following will work for multiple statistics and multiple items.
Data preparation:
Give each statistic (Int, Perception) a weight, according to how important you determine it is
Store this as a 1-D array statImportance
Give each item-statistic combination a value, according to how much said item boosts said statistic for the player
Store this as a 2-D array itemStatBoost
Algorithm:
In Python. Here itemScore is a dict with each item as a key and its numeric score as the value, initialised to 0; items are then ranked by sorting on the scores (values), not the keys.
    # Score each item and rank them
    itemScore = {I: 0 for I in items}
    for S in statistics:
        for I in items:
            itemScore[I] += statImportance[S] * itemStatBoost[I][S]
    ranked = sorted(itemScore, key=itemScore.get, reverse=True)  # highest score first

    # Decide which items to use
    maxEquippableItems = 10  # use the appropriate value
    selectedItems = ranked[:maxEquippableItems]
