Optimize returning top results from intersecting Sorted Set and Set in Redis - ruby

I am trying to optimize my Redis code and I am currently having some difficulties scaling my solution. Redis is hosted on Redis Cloud, the app runs on Heroku, and the code is written in Ruby.
My Redis setup:
I have a series of sorted sets, each containing approx. 1,000 scored members, and a corresponding per-user set (which could be changed to a hash, string, list, sorted set or any other structure) for each user in the system.
For example, the news:sports key has the following structure. Other example keys are news:entertainment and news:business.
news:sports
    Member      Score
    StoryOID1   1000
    StoryOID2    999
    StoryOID3    998
    StoryOID4    997
    StoryOID5    996
    ...
I also have a user-specific key (a set) for each of the master sorted sets (user1:news:sports) that contains a list of the stories the user has already seen, i.e.
seen:user1:sports StoryOID2
My challenge:
On each user request, I need to return the top 3 members (with the highest scores, although I don't need to know the scores) from a specific sorted set that the user hasn't already seen. I prefer not to save the result in Redis, as it has no long-term use and I would just have to delete the key.
Given the example above, if User1 requests news:sports, I would return:
StoryOID1
StoryOID3
StoryOID4
As part of my code, I loop through 10 sorted sets (10 genres), returning the top 3 unseen stories from each, for a total of 30 OIDs returned per request.
For each user request:
Do this 10 times:
ZRANGEBYSCORE top 24 members
loop user genres key using SISMEMBER until I return 3 unseen members
end
Benchmarking at 60 dynos (Heroku), I can only get to 500 simultaneous connections (at a 1000 ms response time), with the Redis loop below being the bottleneck. I am aiming to scale a few multiples over my current setup. Anything and everything can be changed to scale this process.
My Current Process (in Ruby):
def newslist(userOID, genres)
  # pull top 24 stories for the given news:genre -- 24 could be replaced by 1, 3, 6, 12 etc.
  newsscores = $redis.zrevrangebyscore("news:#{genres}", "+inf", "-inf", :limit => [0, 24], :with_scores => true)
  newsstories = Array.new(3)
  i = 0       # accepted-story counter
  loopcnt = 0 # loop counter
  while i < 3
    break if loopcnt >= newsscores.count # ran out of candidate stories
    seen = $redis.sismember("seen:#{userOID}:#{genres}", newsscores[loopcnt][0])
    unless seen
      newsstories[i] = newsscores[loopcnt][0]
      i += 1
    end
    loopcnt += 1
  end
  if i == 3
    return newsstories # return 3 unseen stories
  else
    return 0           # return 0 -- this should cause a repick
  end
end
I know I am paying a large cost for the large number of Redis requests. My current thought is to translate the above into a Lua script I can run server-side, but I can't help feeling there is a more elegant solution that will scale far better.
Is there a better way?

First off, yes: you should 100% be using Lua for this. Check the CPU on your Redis box; I'm willing to bet it's not burning up. Your bottleneck at this point is almost certainly network throughput, because you're requiring a round trip each time you call SISMEMBER (up to 24 times per genre, per user). That's a lot of unnecessary network activity, especially when you consider that the logic you're running on top of SISMEMBER could very easily be done server-side, and there's really no reason to send anything back to your client until the loop finishes completely. The same applies to the initial ZRANGEBYSCORE for the top 24 members. You can literally just translate the entire:
Do this 10 times:
ZRANGEBYSCORE top 24 members
loop user genres key using SISMEMBER until I return 3 unseen members
end
into Lua and go from 250 network hits per user to just 1 per user. That will be a huge, huge win. On top of that, you'll be sending a lot less information to Redis and back the one time you do initiate a Redis call. Here's some Lua pseudo-code that should give you the idea of what you want to be doing:
local genres = {KEYS[1], KEYS[2], KEYS[3], KEYS[4], KEYS[5], KEYS[6], KEYS[7], KEYS[8], KEYS[9], KEYS[10]}
local user_seen_genre_sets = {KEYS[11], KEYS[12], KEYS[13], KEYS[14], KEYS[15], KEYS[16], KEYS[17], KEYS[18], KEYS[19], KEYS[20]}
local user_id = ARGV[1] -- not strictly needed below, since the seen sets are passed in KEYS
local to_return = {{}, {}, {}, {}, {}, {}, {}, {}, {}, {}}
for i = 1, #genres do
  local possible_stories = redis.call('ZREVRANGEBYSCORE', genres[i], '+inf', '-inf', 'LIMIT', 0, 24)
  -- call SISMEMBER on each candidate against the matching user_seen_genre_sets key,
  -- keep the first 3 unseen stories, then stop
  for _, story in ipairs(possible_stories) do
    if redis.call('SISMEMBER', user_seen_genre_sets[i], story) == 0 then
      table.insert(to_return[i], story)
      if #to_return[i] == 3 then break end
    end
  end
end
return to_return
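For completeness, invoking a script like this is a single round trip from the client. Here is a sketch with redis-py (the same pattern applies with the Ruby client); the genre list, key layout and script file name are illustrative, not from the question:

import redis

r = redis.Redis()

# Assumed layout from the question: news:<genre> sorted sets, seen:<user>:<genre> sets.
genres = ["sports", "entertainment", "business", "politics", "tech",
          "science", "health", "travel", "food", "world"]
user_id = "user1"

with open("top_unseen.lua") as f:             # the Lua script above, saved to a file
    top_unseen = r.register_script(f.read())

keys = ["news:%s" % g for g in genres] + ["seen:%s:%s" % (user_id, g) for g in genres]
result = top_unseen(keys=keys, args=[user_id])  # 10 lists of up to 3 unseen story IDs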
Why use Lua instead of a pipeline?
Itamar Haber brought up a good point that you might want to use broken up pipelines for this instead of a single Lua script because the Lua script might block your Redis server for too long. Here's a few reasons for why you want to use a Lua script instead of broken up pipelines:
I've never seen a Lua script on Redis take longer than tens of milliseconds unless it does something pathological like KEYS *. None of the operations involved here costs more than O(log n) either, so you're pretty future-proofed in case you're expecting large data growth. If your Redis server is being blocked for too long, that's more an indication that you need a bigger server, because none of the operations you're running are very intensive (O(log n) at most, as mentioned).
One of the major benefits of the Lua script is that you send your logic to be run server-side instead of shipping a bunch of data back and forth to run that logic client-side (i.e. get all the possible stories and send them to the client, then send each one back to Redis one by one to run a SISMEMBER). All of that data going over the network will be a much bigger bottleneck than running more operations within Redis and Lua, both of which are very, very fast.
So, all in all, despite the valid question, I firmly stand by the Lua approach. If you care to run benchmarks and share them with us, that would be really awesome, because I'm guessing switching this over will improve things by around two orders of magnitude.

There is no need to use Lua scripting here, though depending on your data size, Lua versions of this calculation may be faster (depending on SISMEMBER vs. ZUNIONSTORE vs. SDIFFSTORE + ZINTERSTORE performance). Generally, you can calculate everything you want without multiple round trips and without Lua scripting, provided the following 3 assumptions hold:
Your ZSETs all use positive non-zero scores (it is easier if all scores are >1, I will assume this)
Your SETs all contain the same members as your sorted sets
The maximum score in your ZSETs is fixed, or can at least be bounded (something like 1000000 is perfectly reasonable)
The important operation here is ZUNIONSTORE, which can take SETs as input, and whose members behave as though all scores are 1.
To get the top 3 stories from your news ZSETs, excluding those stories already read by a given user, you can make the following calls:
ZUNIONSTORE temp 2 news:sports seen:user1:sports WEIGHTS 1 -1000000
ZREVRANGEBYSCORE temp inf 0 LIMIT 0 3
DEL temp
You can wrap that with a MULTI/EXEC transaction so you don't have any extra data lying around after the query, pipeline it all, etc. One limitation: as the number of stories read and the number of stories in each category increases, this approach gets slower.
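As a sketch, the MULTI/EXEC version with redis-py (assuming a client version that accepts a dict of key: weight pairs) is a single round trip:

import redis

r = redis.Redis()

# ZUNIONSTORE approach from above, wrapped in a transaction so 'temp' never lingers.
pipe = r.pipeline(transaction=True)
pipe.zunionstore('temp', {'news:sports': 1, 'seen:user1:sports': -1000000})
pipe.zrevrangebyscore('temp', '+inf', 0, start=0, num=3)
pipe.delete('temp')
top_three = pipe.execute()[1]   # result of the ZREVRANGEBYSCORE call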
As an alternative, if you had a secondary SET that had the same members as (for example) your news:sports ZSET, you could instead perform:
SDIFFSTORE temp news:sports:set seen:user1:sports
ZINTERSTORE temp 2 temp news:sports WEIGHTS 0 1
ZREVRANGEBYSCORE temp inf -inf LIMIT 0 3
DEL temp
That would eliminate the score requirements, though it would increase the data size for each of your stories by an entry in a SET. This one also gets slower with every entry in the news SET, the ZSET, and the user's seen SET, but the constants are different, so it may be faster depending on your data sizes.
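And a hypothetical redis-py equivalent of this variant, again as one transaction:

import redis

r = redis.Redis()

# SDIFFSTORE + ZINTERSTORE approach, pipelined as a single MULTI/EXEC.
pipe = r.pipeline(transaction=True)
pipe.sdiffstore('temp', ['news:sports:set', 'seen:user1:sports'])
pipe.zinterstore('temp', {'temp': 0, 'news:sports': 1})
pipe.zrevrangebyscore('temp', '+inf', '-inf', start=0, num=3)
pipe.delete('temp')
top_three = pipe.execute()[2]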

Related

Count unique users if user visit n times

I want to implement FreqCapping in an ad network: serve a campaign to each unique user at most n times per day. If n=1, I could implement this with a Bloom filter in Redis, but usually n is greater than 1. Is there any data structure (even a probabilistic one) that addresses this issue? And is it implemented in Redis?
It sounds like you're describing the Count-min sketch, and while Redis core doesn't have it, RedisBloom does :)
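A rough sketch of what that could look like with RedisBloom's CMS.* commands (the width/depth values, key naming and raw execute_command calls are illustrative assumptions):

import time
import redis

r = redis.Redis()
CAP = 3  # serve each user at most n = 3 times per day

key = 'cms:campaign42:' + time.strftime('%Y-%m-%d')      # one sketch per campaign per day
if not r.exists(key):
    r.execute_command('CMS.INITBYDIM', key, 100000, 5)   # width, depth control the error

def should_serve(user_id):
    # A count-min sketch can overestimate but never underestimate,
    # so the only failure mode is occasionally under-serving a user.
    count = int(r.execute_command('CMS.QUERY', key, user_id)[0])
    if count >= CAP:
        return False
    r.execute_command('CMS.INCRBY', key, user_id, 1)
    return True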
If n is small, just use a bloom filter on '1x' + user, '2x' + user, ..., str(n) + 'x' + user. As a detail, check them in random order. This means that when the user has been seen only a small number of times, you will need fewer lookups.
If n is large, consider only doing a fixed number of random lookups. That trades better performance for sometimes choosing not to serve when you're close to your limit. For example, with a maximum of 4 lookups, at 50% of the way to the limit you make the right choice over 90% of the time, and at 80% of the way to the limit you still make the right choice around 60% of the time. And if n=20, you're saving a lot of lookups when you've hit the limit.
I am sure that there is some kind of special bloom filter that achieves similar limits where you check/set a random subset of the hash functions every time (checking more than you would set). But you won't find that special structure already pre-built in Redis.
The probabilistic version that I am suggesting is this:
import random

def is_available(user, k=4, n=20):
    # bloomfilter is assumed to be any object exposing lookup(id) and add(id).
    tried = []
    for _ in range(k):
        i = random.randrange(n)
        while i in tried:
            i = random.randrange(n)
        slot_id = user + ":" + str(i)
        if bloomfilter.lookup(slot_id):
            tried.append(i)              # slot already taken, try another
        else:
            bloomfilter.add(slot_id)     # claim a free slot
            return True
    return False                         # k taken slots seen; treat the user as capped
The point of randomization is to reduce how many lookups you need. If you go in the same order each time, then on the 10th visit you'll have 9 lookups before you find that they are not over quota. But if n is 20 and you are proceeding in random order, half the time the first lookup will be enough. This reduces round trips, which improves performance, which matters a lot in adtech.

Garbage collection with a very large dictionary

I have a very large immutable set of keys that doesn't fit in memory, and an even larger list of references, which must be scanned just once. How can the mark phase be done in RAM? I do have a possible solution, which I will write as an answer later (don't want to spoil it), but maybe there are other solutions I didn't think about.
I will try to restate the problem to make it more "real":
You work at Facebook, and your task is to find which users didn't ever create a post with an emoji. All you have is the list of active user names (around 2 billion), and the list of posts (user name / text), which you have to scan, but just once. It contains only active users (you don't need to validate them).
Also, you have one computer, with 2 GB of RAM (bonus points for 1 GB). So it has to be done all in RAM (without external sort or reading in sorted order). Within two days.
Can you do it? How? Tips: You might want to use a hash table, with the user name as the key, and one bit as the value. But the list of user names doesn't fit in memory, so that doesn't work. With user ids it might work, but you just have the names. You can scan the list of user names a few times (maybe 40 times, but not more).
Sounds like a problem I tackled 10 years ago.
The first stage: ditch GC. The overhead of GC for small objects (a few bytes) can be in excess of 100%.
The second stage: design a decent compression scheme for user names. English has about 3 bits per character. Even if you allowed more characters, the average number of bits won't rise fast.
Third stage: create a dictionary of usernames in memory. Use a 16-bit prefix of each username to choose the right sub-dictionary. Read in all usernames, initially sorting them just by this prefix. Then sort each sub-dictionary in turn.
As noted in the question, allocate one extra bit per username for the "used emoji" result.
The problem is now I/O bound, as the computation is embarrassingly parallel. The longest phase will be reading in all the posts (which is going to be many TB).
Note that in this setup, you're not using fancy data types like String. The dictionaries are contiguous memory blocks.
Given a deadline of two days, I would however dump some of this fanciness. The I/O bound for reading the text is severe enough that the creation of the user database may exceed 16 GB. Yes, that will swap to disk. Big deal for a one-off.
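A toy Python sketch of that layout (the two-byte prefix choice, plain Python lists and the helper names are illustrative; a real implementation would pack the sorted, compressed names into contiguous byte arrays):

from bisect import bisect_left

NUM_BUCKETS = 1 << 16   # a 16-bit prefix of the username selects the sub-dictionary

def bucket_of(name):
    # Illustrative prefix: the first two bytes of the UTF-8 name, zero-padded.
    b = name.encode('utf-8')[:2].ljust(2, b'\x00')
    return int.from_bytes(b, 'big')

def build_dictionary(usernames):
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for name in usernames:                      # first pass: partition by prefix
        buckets[bucket_of(name)].append(name)
    for b in buckets:                           # then sort each sub-dictionary
        b.sort()
    # One "used emoji" bit per username, kept alongside each sorted bucket.
    flags = [bytearray((len(b) + 7) // 8) for b in buckets]
    return buckets, flags

def mark_used_emoji(buckets, flags, name):
    i = bucket_of(name)
    pos = bisect_left(buckets[i], name)
    if pos < len(buckets[i]) and buckets[i][pos] == name:
        flags[i][pos // 8] |= 1 << (pos % 8)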
TL;DR: Hash the keys, sort the hashes, and store the sorted hashes in compressed form.
The algorithm I propose may be considered an extension of the solution to a similar (simpler) problem.
1. To each key, apply a hash function that maps keys to integers in the range [0..h]. It seems reasonable to start with h = 2 * number_of_keys.
2. Fill all available memory with these hashes.
3. Sort the hashes.
4. If a hash value is unique, write it to the list of unique hashes; otherwise remove all copies of it and write it to the list of duplicates. Keep both lists in compressed form: as differences between adjacent values, compressed with an optimal entropy coder (such as an arithmetic coder, range coder, or ANS coder). If the list of unique hashes was not empty, merge it with the sorted hashes; additional duplicates may be found while merging. If the list of duplicates was not empty, merge the new duplicates into it.
5. Repeat steps 1..4 while there are any unprocessed keys.
6. Read the keys several more times while performing steps 1..5, but ignore all keys that are not in the list of duplicates from the previous pass. Use a different hash function for each pass (for everything except matching against the list of duplicates from the previous pass, which means we need to sort the hashes twice, for 2 different hash functions).
7. Read the keys again to convert the remaining list of duplicate hashes into a list of plain keys. Sort it.
8. Allocate an array of 2 billion bits.
9. Use all unoccupied memory to construct an index for each compressed list of hashes. This could be a trie or a sorted list. Each entry of the index should contain a "state" of the entropy decoder, which allows decoding to resume without starting from the very beginning of the compressed stream.
10. Process the list of posts and update the array of 2 billion bits.
11. Read the keys once more to convert the hashes back to keys.
While the value h = 2 * number_of_keys seems to be reasonably good, we could try varying it to optimize space requirements (setting it too high decreases the compression ratio; setting it too low results in too many duplicates).
This approach does not guarantee a result: it is possible to invent 10 bad hash functions so that every key is duplicated on every pass. But with high probability it will succeed, and it will most likely need about 1 GB of RAM (because most compressed integer values are in the range [1..8], so each key contributes about 2-3 bits to the compressed stream).
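For illustration, a plain Python sketch of steps 1-4 (without the delta encoding and entropy coding that the memory savings depend on; the per-pass salted hash is an assumption):

import hashlib

def hash_key(key, pass_no, h):
    # A different hash per pass, mapping each key into [0..h].
    d = hashlib.blake2b(key.encode(), digest_size=8, salt=str(pass_no).encode()).digest()
    return int.from_bytes(d, 'big') % (h + 1)

def split_unique_and_duplicates(keys, pass_no, h):
    # Steps 1-4: hash all keys, sort the hashes, then separate the values that
    # occur exactly once from those that occur more than once.
    hashes = sorted(hash_key(k, pass_no, h) for k in keys)
    unique, duplicates = [], []
    i = 0
    while i < len(hashes):
        j = i
        while j < len(hashes) and hashes[j] == hashes[i]:
            j += 1
        (unique if j - i == 1 else duplicates).append(hashes[i])
        i = j
    return unique, duplicates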
To estimate the space requirements precisely we might use either a (complicated?) mathematical proof or a complete implementation of the algorithm (also pretty complicated). But to obtain a rough estimate we could use a partial implementation of steps 1..4; see it on Ideone. It uses a variant of ANS coder named FSE (taken from https://github.com/Cyan4973/FiniteStateEntropy) and a simple hash function implementation (taken from https://gist.github.com/badboy/6267743). Here are the results:
Key list loads allowed:     10           20
Optimal h/n:                2.1          1.2
Bits per key:               2.98         2.62
Compressed MB:              710.851      625.096
Uncompressed MB:            40.474       3.325
Bitmap MB:                  238.419      238.419
MB used:                    989.744      866.839
Index entries:              1'122'520    5'149'840
Indexed fragment size:      1781.71      388.361
With the OP's original limitation of 10 key scans, the optimal value for the hash range is only slightly higher (2.1) than my guess (2.0), and this parameter is very convenient because it allows 32-bit hashes to be used (instead of 64-bit ones). Required memory is slightly less than 1 GB, which allows pretty large indexes (so step 10 would not be very slow). There is one small problem: these results show how much memory is consumed at the end, but in this particular case (10 key scans) we temporarily need more than 1 GB while performing the second pass. This can be fixed if we drop the results (unique hashes) of the first pass and recompute them later, together with step 7.
With the less tight limitation of 20 key scans, the optimal value for the hash range is 1.2, which means the algorithm needs much less memory and allows more space for indexes (so step 10 would be almost 5 times faster).
Loosening the limitation to 40 key scans does not result in any further improvement.
Minimal perfect hashing
Create a minimal perfect hash function (MPHF). At around 1.8 bits per key (using the RecSplit algorithm), this uses about 429 MB. (Here, 1 MB is 2^20 bytes, 1 GB is 2^30 bytes.) For each user, allocate one bit as a marker, about 238 MB. So memory usage is around 667 MB.
Then read the posts, calculate the hash for each user, and set the related bit if needed. Read the user table again, calculate the hash, and check if the bit is set.
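A sketch of how the two lookup passes fit together; the mphf object with a lookup(name) -> index method stands in for whatever RecSplit/CHD implementation is used, and has_emoji is a placeholder predicate:

def users_without_emoji(mphf, num_users, posts, users, has_emoji):
    # One marker bit per user, addressed through the minimal perfect hash.
    bits = bytearray((num_users + 7) // 8)

    # Pass over the posts: mark every user who has posted an emoji at least once.
    for user, text in posts:
        if has_emoji(text):
            i = mphf.lookup(user)               # assumed MPHF interface
            bits[i // 8] |= 1 << (i % 8)

    # Pass over the user table: users whose bit is still unset never used one.
    for user in users:
        i = mphf.lookup(user)
        if not bits[i // 8] & (1 << (i % 8)):
            yield user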
Generation
Generating the MPHF is a bit tricky, not because it is slow (it may take around 30 minutes of CPU time), but due to memory usage. With 1 GB of RAM, it needs to be done in segments. Let's say we use 32 segments of about the same size, as follows:
Loop segmentId from 0 to 31. For each user, calculate the hash code modulo 32 (or bitwise AND with 31). If this doesn't match the current segmentId, ignore the user. Otherwise calculate a 64-bit hash code (using a second hash function) and add it to the list. Do this until all users are read.
A segment will contain about 62.5 million keys (2 billion divided by 32), that is 238 MB. Sort this list by key (in place) to detect duplicates. With 64-bit entries, the probability of duplicates is very low, but if there are any, use a different hash function and try again (you need to store which hash function was used).
Now calculate the MPHF for this segment. The RecSplit algorithm is the fastest I know. The CHD algorithm can be used as well, but needs more space / is slower to generate. Repeat until all segments are processed.
The above algorithm reads the user list 32 times. This could be reduced to about 10 passes if more segments are used (for example one million), and as many segments are read per pass as fit in memory. With smaller segments, fewer bits per key are needed, due to the reduced probability of duplicates within one segment.
The simplest solution I can think of is an old-fashioned batch update program. It takes a few steps, but in concept it's no more complicated than merging two lists that are in memory. This is the kind of thing we did decades ago in bank data processing.
1. Sort the file of user names by name. You can do this easily enough with the GNU sort utility, or any other program that will sort files larger than what will fit in memory.
2. Write a query to return the posts, in order by user name. I would hope that there's a way to get these as a stream.
Now you have two streams, both in alphabetical order by user name. All you have to do is a simple merge. Here's the general idea:
currentUser = get first user name from users file
currentPost = get first post from database stream
usedEmoji = false
while (not at end of users file and not at end of database stream)
{
    if currentUser == currentPostUser
    {
        if currentPost has emoji
        {
            usedEmoji = true
        }
        currentPost = get next post from database
    }
    else if currentUser > currentPostUser
    {
        // No user for this post. Get next post.
        currentPost = get next post from database
    }
    else
    {
        // Current user is less than post user name.
        // So we have to switch users.
        if (usedEmoji == false)
        {
            // No post by this user contained an emoji
            output currentUser name
        }
        currentUser = get next user name from file
        usedEmoji = false
    }
}
// at the end of one of the files.
// Clean up.
// if we reached the end of the posts, but there are still users left,
// then output each user name.
// The usedEmoji test is in there strictly for the first time through,
// because the current user when the above loop ended might have had
// a post with an emoji.
while not at end of user file
{
    if (usedEmoji == false)
    {
        output currentUser name
    }
    currentUser = get next user name from file
    usedEmoji = false
}
// at this point, names of all the users who haven't
// used an emoji in a post have been written to the output.
An alternative implementation, if obtaining the list of posts as described in #2 is overly burdensome, would be to scan the list of posts in their natural order and output the user name from any post that contains an emoji. Then sort the resulting file and remove duplicates. You can then proceed with a merge similar to the one described above, but you don't have to explicitly check whether a post has an emoji. Basically, if a name appears in both files, you don't output it.
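A short Python sketch of that alternative merge (both files assumed sorted ascending, one user name per line, duplicates already removed; the file names are placeholders):

def users_without_emoji(users_path, emoji_users_path):
    with open(users_path) as users, open(emoji_users_path) as emoji_users:
        emoji_iter = (line.rstrip('\n') for line in emoji_users)
        emoji_name = next(emoji_iter, None)
        for line in users:
            name = line.rstrip('\n')
            # Advance the emoji-user stream until it catches up with this name.
            while emoji_name is not None and emoji_name < name:
                emoji_name = next(emoji_iter, None)
            if name != emoji_name:
                yield name      # the name appears only in the users file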

Redis: Find keys that exist

We have a list of numbers stored in redis as keys (300 million keys which are 10 digit numeric keys).
Our users give us a list of about 1 million numbers and expect us to return the subset of those numbers that do not exist in Redis as keys. The expectation is to get the result in under a second, and we have been trying to use Redis for this.
Initially it looked like the right way to go (using EXISTS), but now we are questioning whether there are better ways to get the result without looping through these numbers and building the subset ourselves.
Could someone please let us know how we could do this efficiently?
Old question I know, but I thought it deserved a fuller answer.
The problem with getting all the keys from Redis and then doing the containment test is that you have to pull the 300m keys from Redis for every check, or keep a "local" copy of those keys, which defeats the point of Redis.
Instead of taking the data to the processing, much better to take the processing to the data.
You can use redis sets and let redis do the set diffing.
Using python-redis here, but obviously the execution of redis could be done from any language.
import os, base64, time, redis

r = redis.Redis()

def create_keys(n, size=10):
    data = base64.b64encode(os.urandom(n * size))
    return [data[i:i+size] for i in range(0, n * size, size)]

if not r.exists('ref_keys'):
    for _ in range(3):
        r.sadd('ref_keys', *create_keys(1*10**6))
print('{} keys in reference key set'.format(r.scard('ref_keys')))

existing_keys = r.srandmember('ref_keys', number=50*10**3)
keys_to_check = existing_keys + create_keys(50*10**3)

start = time.time()
try:
    r.sadd('check_ref', *keys_to_check)
    missing = r.sdiff('check_ref', 'ref_keys')
finally:
    r.delete('check_ref')
print('number of missing keys: {}, time taken {:0.3f}s'.format(len(missing), time.time() - start))
(Most of the code here (and time) is spent creating the test case.)
Only the checked 1m keys have to be transferred, rather than all 300m.
Note: due to memory constraints my ref_keys set only has 30m keys, and the containment test took 3s. SDIFF has "Time complexity: O(N) where N is the total number of elements in all given sets", so I suspect you would struggle to get the time below a second on commodity hardware.
Yes, you should avoid looping over the user list and calling EXISTS for each key: Redis commands are relatively slow (because of the client/server round trips) compared with variable manipulation in an ordinary language.
One solution I would suggest would require some coding: I would get all the existing keys with KEYS (http://redis.io/commands/keys), then sort the result and the user list.
Then you can implement a fast search to check whether the user's keys are in the Redis keys.
Actually, you might use a set in Python, with the difference operation already implemented: http://docs.python.org/2/library/sets.html (this is unsorted, but the implementation is a dict, which is a hash table).
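A sketch of that variant (note that KEYS blocks the server while it scans the whole keyspace, so the SDIFF approach above is much kinder to a 300-million-key instance, and the built-in set type has long since replaced the old sets module):

import redis

r = redis.Redis()

existing = set(r.keys('*'))                     # expensive: full keyspace scan
user_numbers = {b'0123456789', b'9876543210'}   # illustrative user-supplied list
missing = user_numbers - existing               # numbers with no matching key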

Weighted, load-balancing resource scheduling algorithm

A software application that I'm working on needs to be able to assign tasks to a group of users based on how many tasks they presently have, where the users with the fewest tasks are the most likely to get the next task. However, the current task load should be treated as a weighting, rather than an absolute order definition. IOW, I need to implement a weighted, load-balancing algorithm.
Let's say there are five users, with the following number of tasks:
A: 4
B: 5
C: 0
D: 7
E: 9
I want to prioritize the users for the next task in the order CABDE, where C is most likely to get the assignment and E, the least likely. There are two important things to note here:
The number of users can vary from 2 to dozens.
The number of tasks assigned to each user can vary from 1 to hundreds.
For now, we can treat all tasks as equal, though I wouldn't mind including task difficulty as a variable that I can use in the future - but this is purely icing on the cake.
The ideas I've come up with so far aren't very good in some situations. They might weight users too closely together if there are a large number of users, or they might fall flat if a user has no current tasks, or....
I've tried poking around the web, but haven't had much luck. Can anyone give me a quick summary of an algorithm that would work well? I don't need an actual implementation--I'll do that part--just a good description. Alternatively, is there a good, freely accessible web site that covers this?
Also, while I certainly appreciate quality, this need not be statistically perfect. So if you can think of a good but not great technique, I'm interested!
As you point out, this is a load-balancing problem. It's not really a scheduling problem, since you're not trying to minimise anything (total time, number of concurrent workers, etc.). There are no special constraints (job duration, time clashes, skill sets to match etc.) So really your problem boils down to selecting an appropriate weighting function.
You say there are some situations you want to avoid, like user weightings that are too close together. Can you provide more details? For example, what's wrong with making the chance of assignment inversely related to the current workload, normalised by the workloads of the other workers? You can visualise this as a sequence of blocks of different lengths (the tasks) being packed into a set of bins (the workers), where you're trying to keep the total heights of the bins as even as possible.
With more information, we could make specific recommendations of functions that could work for you.
Edit: example load-balancing functions
Based on your comments, here are some example of simple functions that can give you different balancing behaviour. A basic question is whether you want deterministic or probabilistic behaviour. I'll give a couple of examples of each.
To use the example in the question - there are 4 + 5 + 0 + 7 + 9 = 25 jobs currently assigned. You want to pick who gets job 26.
1) Simple task farm. For each job, always pick the worker with the least jobs currently pending. Fast workers get more to do, but everyone finishes at about the same time.
2) Guarantee fair workload. If workers work at different speeds, and you don't want some doing more than others, then track the number of completed + pending jobs for each worker. Assign the next job to keep this number evenly spread (fast workers get free breaks).
3) Basic linear normalisation. Pick a maximum number of jobs each worker can have. Each worker's workload is normalised to that number. For example, if the maximum number of jobs/worker is 15, then 50 more jobs can be added before you reach capacity. So for each worker the probability of being assigned the next job is
P(A) = (15 - 4)/50 = 0.22
P(B) = (15 - 5)/50 = 0.2
P(C) = (15 - 0)/50 = 0.3
P(D) = (15 - 7)/50 = 0.16
P(E) = (15 - 9)/50 = 0.12
If you don't want to use a specific maximum threshold, you could use the worker with the highest current number of pending jobs as the limit. In this case, that's worker E, so the probabilities would be
P(A) = (9 - 4)/20 = 0.25
P(B) = (9 - 5)/20 = 0.2
P(C) = (9 - 0)/20 = 0.45
P(D) = (9 - 7)/20 = 0.1
P(E) = (9 - 9)/20 = 0
Note that in this case, the normalisation ensures worker E can't be assigned any jobs - he's already at the limit. Also, just because C doesn't have anything to do doesn't mean he is guaranteed to be given a new job (it's just more likely).
You can easily implement the choice function by generating a random number r between 0 and 1 and comparing it to these cumulative boundaries. So if r < 0.25, A gets the job; if 0.25 <= r < 0.45, B gets the job; and so on.
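A minimal Python version of that choice function, using the question's numbers and the max-pending variant above (the trailing return is only a safety net for floating-point edge cases):

import random

def pick_worker(pending):
    # pending: dict of worker -> number of pending jobs
    limit = max(pending.values())
    weights = {w: limit - n for w, n in pending.items()}
    total = sum(weights.values())
    if total == 0:
        return random.choice(list(pending))     # everyone equally loaded
    r = random.random() * total
    cumulative = 0
    for worker, w in weights.items():
        cumulative += w
        if r < cumulative:
            return worker
    return max(weights, key=weights.get)        # practically unreachable

pending = {'A': 4, 'B': 5, 'C': 0, 'D': 7, 'E': 9}
print(pick_worker(pending))   # C is most likely; E can never be picked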
4) Non-linear normalisation. Using a log function (instead of the linear subtraction) to weight your numbers is an easy way to get a non-linear normalisation. You can use this to skew the probabilities, e.g. to make it much more likely that workers without many jobs are given more.
The point is, the number of ways of doing this are practically unlimited. What weighting function you use depends on the specific behaviour you're trying to enable. Hopefully that's given you some ideas which you can use as a starting point.

Algorithm for most recently/often contacts for auto-complete?

We have an auto-complete list that's populated when you send an email to someone, which is all well and good until the list gets really big and you need to type more and more of an address to get to the one you want, which goes against the purpose of auto-complete.
I was thinking that some logic should be added so that the auto-complete results should be sorted by some function of most recently contacted or most often contacted rather than just alphabetical order.
What I want to know is if there's any known good algorithms for this kind of search, or if anyone has any suggestions.
I was thinking just a point system thing, with something like same day is 5 points, last three days is 4 points, last week is 3 points, last month is 2 points and last 6 months is 1 point. Then for most often, 25+ is 5 points, 15+ is 4, 10+ is 3, 5+ is 2, 2+ is 1. No real logic other than those numbers "feel" about right.
Other than just arbitrarily picked numbers, does anyone have any input? Other numbers are also welcome if you can give a reason why you think they're better than mine.
Edit: This would be primarily in a business environment where recentness (yay for making up words) is often just as important as frequency. Also, past a certain point there really isn't much difference between say someone you talked to 80 times vs say 30 times.
Take a look at Self organizing lists.
A quick and dirty look:
Move to Front Heuristic:
A linked list, such that whenever a node is selected, it is moved to the front of the list.
Frequency Heuristic:
A linked list, such that whenever a node is selected, its frequency count is incremented, and then the node is bubbled towards the front of the list, so that the most frequently accessed is at the head of the list.
It looks like the move to front implementation would best suit your needs.
EDIT: When an address is selected, add one to its frequency, and move it to the front of the group of nodes with the same weight (or (weight div x) for coarser groupings). I see aging as a real problem with your proposed implementation, in that it requires calculating a weight on each and every item. A self-organizing list is a good way to go, but the algorithm needs a bit of tweaking to do what you want.
Further Edit:
Aging refers to the fact that weights decrease over time, which means you need to know each and every time an address was used, which in turn means you have to have the entire email history available when you construct your list.
The issue is that we want to perform calculations (other than search) on a node only when it is actually accessed -- this is what gives us the good statistical performance.
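A small Python sketch of that tweaked heuristic (a list kept in descending count order, where a selected address moves to the front of its own count group; plain lists are fine at address-book sizes):

def touch(addresses, counts, selected):
    # Bump the count and move the address to the front of its weight group.
    if selected in counts:
        addresses.remove(selected)
    counts[selected] = counts.get(selected, 0) + 1
    new_count = counts[selected]
    pos = 0
    while pos < len(addresses) and counts[addresses[pos]] > new_count:
        pos += 1
    addresses.insert(pos, selected)

addresses, counts = [], {}
for a in ['bob@x.com', 'ann@x.com', 'bob@x.com']:
    touch(addresses, counts, a)
print(addresses)   # ['bob@x.com', 'ann@x.com']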
This seems similar to what Firefox does when suggesting the site you are typing. Unfortunately I don't know exactly how Firefox does it; a point system seems good as well, though you'll probably need to balance your points :)
I'd go for something similar to:
NoM = Number of Mail
(NoM sent to X today) + 1/2 * (NoM sent to X during the last week)/7 + 1/3 * (NoM sent to X during the last month)/30
Contacts you did not write to during the last month (the window could be changed) will have 0 points. You could sort those by total NoM sent (since they are on the contact list :). They will be shown after the contacts with points > 0.
It's just an idea; the aim is to give different importance to the most-mailed and the most recently mailed contacts.
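For example, as a hedged sketch (sent_times is a hypothetical per-contact list of datetimes; the windows mirror the formula above):

from datetime import datetime, timedelta

def frecency(sent_times, now=None):
    now = now or datetime.now()
    today = sum(t > now - timedelta(days=1) for t in sent_times)
    week = sum(t > now - timedelta(days=7) for t in sent_times)
    month = sum(t > now - timedelta(days=30) for t in sent_times)
    return today + (week / 7) / 2 + (month / 30) / 3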
If you want to get crazy, mark the most 'active' emails in one of several ways:
Last access
Frequency of use
Contacts with pending sales
Direct bosses
Etc
Then, present the active emails at the top of the list. Pay attention to which "group" your user uses most. Switch to that sorting strategy exclusively after enough data is collected.
It's a lot of work but kind of fun...
Maybe count the number of emails sent to each address. Then:
ORDER BY EmailCount DESC, LastName, FirstName
That way, your most-often-used addresses come first, even if they haven't been used in a few days.
I like the idea of a point-based system, with points for recent use, frequency of use, and potentially other factors (prefer contacts in the local domain?).
I've worked on a few systems like this, and neither "most recently used" nor "most commonly used" works very well on its own. "Most recent" can be a real pain if you accidentally mis-type something once, while "most used" doesn't evolve much over time if, for example, you had a lot of contact with somebody last year but your job has now changed.
Once you have the set of measurements you want to use, you could create an interactive application to test out different weights and see which ones give you the best results for some sample data.
This paper describes a single-parameter family of cache eviction policies that includes least recently used and least frequently used policies as special cases.
The parameter, lambda, ranges from 0 to 1. When lambda is 0 it performs exactly like an LFU cache; when lambda is 1 it performs exactly like an LRU cache. In between, it combines recency and frequency information in a natural way.
In spite of an answer having been chosen, I want to submit my approach for consideration, and feedback.
I would account for frequency by incrementing a counter on each use, but by some larger-than-one value, like 10 (to add precision to the second point below).
I would account for recency by multiplying all counters at regular intervals (say, 24 hours) by some diminisher (say, 0.9).
Each use:
UPDATE `addresslist` SET `favor` = `favor` + 10 WHERE `address` = 'foo@bar.com'
Each interval:
UPDATE `addresslist` SET `favor` = FLOOR(`favor` * 0.9)
In this way I collapse both frequency and recency into one field, avoid the need to keep a detailed history to derive {last day, last week, last month}, and keep the math (mostly) integer.
The increment and diminisher would have to be adjusted to preference, of course.
