I use reddit's ranking algorithm on my site in order to rank what's "hot" and what's not. Here is a simple explanation of how reddit's ranking algorithm works:
https://medium.com/hacking-and-gonzo/how-reddit-ranking-algorithms-work-ef111e33d0d9#.igk62pe0x
Specifically, this part:
# Rewritten code from /r2/r2/lib/db/_sorts.pyx
from datetime import datetime, timedelta
from math import log

epoch = datetime(1970, 1, 1)

def epoch_seconds(date):
    td = date - epoch
    return td.days * 86400 + td.seconds + (float(td.microseconds) / 1000000)

def score(ups, downs):
    return ups - downs

def hot(ups, downs, date):
    s = score(ups, downs)
    order = log(max(abs(s), 1), 10)
    sign = 1 if s > 0 else -1 if s < 0 else 0
    seconds = epoch_seconds(date) - 1134028003
    return round(sign * order + seconds / 45000, 7)
However, I found that, although this algorithm works very well, it doesn't change up content as often as I would like it to.
How can I modify the algorithm to do this? What numbers or parts of the algorithm would I need to modify in order for content on my site to change more frequently?
Instead of dividing seconds by 45000, you can divide by a smaller number so that recency has a bigger impact on the score and the feed changes more frequently. You can also tweak the base of the logarithm, or replace it with a function that has a different growth rate, such as a square root.
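For example, here is a minimal sketch of that first tweak, reusing the helper functions from the snippet above; the 9000 divisor is purely illustrative, not a value reddit uses:

def hot_faster(ups, downs, date, time_divisor=9000):
    # identical to hot(), except the time term is divided by a smaller,
    # configurable number so recency outweighs votes sooner
    s = score(ups, downs)
    order = log(max(abs(s), 1), 10)
    sign = 1 if s > 0 else -1 if s < 0 else 0
    seconds = epoch_seconds(date) - 1134028003
    return round(sign * order + seconds / time_divisor, 7)

With 45000, a tenfold vote difference is offset by about 12.5 hours of age difference; with 9000 it takes only about 2.5 hours, so the front page turns over roughly five times faster.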
Related
This is more an abstract "How would I approach this" question than me struggling with coding. I want to make a character creation screen where you start with 0 spare points but can take points away from one stat and put them into another. How, under this system, would you randomise stats? I have the base stats and the max deviation, but I don't know how to go about randomising the stats so that it produces specialized characters. They're not all going to be 150% in one stat and 75% in the other two, but I think gentle specialization, probably with some form of weighted randomizer, would be nice. Feel free to respond with pseudo code or just an explanation of how you would do it. :D
Here's my solution in Python:
import random
from operator import add, sub

baseStats = {
    "baseHealth": 10.00,
    "baseSpeed": 10.00,
    "baseAccuracy": 10.00,
}
baseDeviation = 3
ops = (add, sub)
charStats = {}

# Make spread. Eg: if the deviation is 3 it'll be [0, 0, 0, 0, 1, 1, 1, 2, 2, 3],
# with the highest deviations being the rarest.
spread = []
for i in range(baseDeviation + 1):
    for j in range(baseDeviation + 1 - i):
        spread.append(i)
print(spread)

# Copy the base values into the character's stats and keep a list of the stat names.
remainingStats = []
for key, value in baseStats.items():
    charStats[key] = value
    remainingStats.append(key)

# Choose a stat and add or subtract a random choice from our weighted spread.
op = random.choice(ops)
chosenOne = random.choice(remainingStats)
remainingStats.remove(chosenOne)
chosenNumber = random.choice(spread)
charStats[chosenOne] = op(charStats[chosenOne], chosenNumber)
spread.remove(chosenNumber)

# Work out the difference between the randomised stat and the standard, then give
# it to one of the remaining stats and leave the other be.
difference = baseStats[chosenOne] - charStats[chosenOne]
receiver = random.choice(remainingStats)
charStats[receiver] = charStats[receiver] + difference
print(charStats)
How do you generate pseudo-random numbers and row counts in Tableau? I didn't find any built-in functions (like 'RAND' or 'RCOUNT').
Edit:
Just learned that there is a Random() function in Tableau. It is not listed in the function library, but if you use it anyway, the formula is reported as valid and returns a value between 0 and 1.
Original and still valid answer in case you want to use officially supported functions:
Since Tableau is used to create graphs based on your data, there is usually little use for random numbers (would you explain what you need them for?)
However you could use an approach like this to work around this limitation: http://community.tableau.com/docs/DOC-1474
Basically, you get a semi-random seed out of the time, combine it with other values via table calculations, and multiply it by other semi-random values:
Seed
(DATEPART('second', NOW()) + 1) * (DATEPART('minute', NOW()) + 1) * (DATEPART('hour', NOW()) + 1) * (DATEPART('day', NOW()) + 1)
Random Number
((PREVIOUS_VALUE(MIN([Seed])) * 1140671485 + 12820163) % (2^24))
Random Int
INT([Random Number] / (2^24) * [Random Upper Limit]) + 1
Where [Random Upper Limit] is a user defined value to limit the range of the result.
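To see what those calculated fields are doing, here is a rough Python sketch of the same idea; the multiplier, increment, and modulus are copied from the formulas above, but the seed/PREVIOUS_VALUE wiring is only approximated, so treat it as an illustration rather than Tableau's behaviour:

from datetime import datetime

def seed():
    # time-based seed, like the [Seed] field above
    now = datetime.now()
    return (now.second + 1) * (now.minute + 1) * (now.hour + 1) * (now.day + 1)

def random_number(prev):
    # one step of the linear congruential generator, like [Random Number]
    return (prev * 1140671485 + 12820163) % 2**24

def random_int(value, upper_limit):
    # scale into 1..upper_limit, like [Random Int]
    return int(value / 2**24 * upper_limit) + 1

state = seed()
for _ in range(5):
    state = random_number(state)
    print(random_int(state, 10))   # pseudo-random integer between 1 and 10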
Let's say we have a news website with 100 pages, each displaying several articles, and we want to parse the website regularly to keep statistics on the number of comments per article.
The number of comments on an article changes rapidly for new articles (so on the first pages) and very slowly for very old articles (on the last pages).
So I want to parse the first pages much more often than the last pages.
A solution I imagined would be to generate, each time, an interval of pages to parse, with the additional requirement that page n has a probability 1/n of appearing in that interval.
For example, we would parse the page 1 every time.
The page 2 would appear in the interval half of the time.
The page 3, 1/3 of the time...
Our algorithm would then generate the 'interval' [1,1] most of the time. The interval [1,2] would be less likely, [1,3] even less ... and [1,100] would be really rare.
Do you see a way to implement this algorithm with the usual random function of most languages?
Is there another way to solve the problem (parsing the recent content of a website more often) that makes more sense?
Thanks for your help.
edit:
Here is an implementation in Python based on the answer provided by @david-eisenstat.
I tried to implement the version with random() generating integers, but I get strange results.
from math import floor
from random import random

# return a page number between 1 and n
def randPage(n):
    while True:
        r = floor(1 / (1 - random()))
        if r <= n:
            return r
If you have a function random() that returns doubles in the interval [0, 1), then you look at pages 1 to floor(1 / (1 - random())). Page n is examined if and only if the output of random() is in the interval [1 - 1/n, 1), which has length 1/n.
If you're using an integer random() function in the interval [0, RAND_MAX], then let k = random() and look at RAND_MAX / k pages if k != 0 or all of them if k == 0.
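In Python terms, a rough sketch of that integer variant might look like this; RAND_MAX and the randrange() call just emulate a C-style integer generator and are not part of the answer above:

import random

RAND_MAX = 2**31 - 1  # assumed range of the integer generator

def pages_to_parse(total_pages):
    k = random.randrange(0, RAND_MAX + 1)   # integer in [0, RAND_MAX]
    if k == 0:
        return total_pages                  # k == 0: look at all pages
    return min(total_pages, RAND_MAX // k)  # otherwise pages 1 to RAND_MAX / k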
For an application I'm working on, I need to sample a small set of values from a very large data set, on the order of a few hundred taken from about 60 trillion (and growing).
Usually I use the technique of seeing if a uniform random number r (0..1) is less than S/T, where S is the number of sample items I still need, and T is the number of items in the set that I haven't considered yet.
However, with this new data, I don't have time to roll the die for each value; there are too many. Instead, I want to generate a random number of entries to "skip", pick the value at the next position, and repeat. That way I can just roll the die and access the list S times. (S is the size of the sample I want.)
I'm hoping there's a straightforward way to do that and create an unbiased sample, along the lines of the S/T test.
To be honest, approximately unbiased would be OK.
This is related (more or less a follow-on) to this person's question:
https://math.stackexchange.com/questions/350041/simple-random-sample-without-replacement
One more side question... the person who first showed this to me called it the "mailman's algorithm", but I'm not sure if he was pulling my leg. Is that right?
How about this:
precompute S random numbers from 0 to the size of your dataset.
order your numbers, low to high
store the difference between consecutive numbers as the skip size
iterate through the large dataset using the skip sizes above.
...The assumption being that the order in which you collect the samples doesn't matter.
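A quick sketch of those steps; the function and variable names are mine, and random.sample() is assumed to cope with the dataset size:

import random

def sample_by_skipping(stream, total, s):
    # precompute s distinct positions, sort them, then walk the stream once,
    # skipping the gap between consecutive positions
    positions = sorted(random.sample(range(total), s))
    it = iter(stream)
    picked, current = [], -1
    for pos in positions:
        for _ in range(pos - current - 1):   # skip over the gap
            next(it)
        picked.append(next(it))              # take the item at position pos
        current = pos
    return picked

print(sample_by_skipping(range(1000000), 1000000, 5))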
So I thought about it, and got some help from http://math.stackexchange.com
It boils down to this:
If I picked n items randomly all at once, where would the first one land? That is, min({r_1 ... r_n}). A helpful fellow at math.stackexchange boiled it down to this equation:
x = 1 - (1 - r) ** (1 / n)
that is, the CDF of the minimum is 1 minus (1 - x) to the nth power; set a uniform r equal to it and solve for x to get the formula above. Pretty easy.
If I generate a uniform random number and plug it in for r, this is distributed the same as min({r_1 ... r_n}) -- the same way that the lowest item would fall. Voila! I've just simulated picking the first item as if I had randomly selected all n.
So I skip over that many items in the list, pick that one, and then....
Repeat until n is 0
That way, if I have a big database (like Mongo), I can skip, find_one, skip, find_one, etc. Until I have all the items I need.
The only problem I'm having is that my implementation favors the first and last element in the list. But I can live with that.
In Python 2.7, my implementation looks like:
import numpy
import pprint

def skip(n):
    """
    Produce a random number with the same distribution as
    min({r_0, ... r_n}) to see where the next smallest one is
    """
    r = numpy.random.uniform()
    return 1.0 - (1.0 - r) ** (1.0 / n)

def sample(T, n):
    """
    Take n items from a list of size T
    """
    t = T
    i = 0
    while t > 0 and n > 0:
        s = skip(n) * (t - n + 1)
        i += s
        yield int(i) % T
        i += 1
        t -= s + 1
        n -= 1

if __name__ == '__main__':
    t = [0] * 100
    for c in xrange(10000):
        for i in sample(len(t), 10):
            t[i] += 1  # this is where we would read value i
    pprint.pprint(t)
For my work, I need some kind of algorithm with the following input and output:
Input: a set of dates (from the past). Output: a set of weights - one weight per given date (the sum of all weights = 1).
The basic idea is that the closest date to today's date should receive the highest weight, the second closest date will get the second highest weight, and so on...
Any ideas?
Thanks in advance!
First, for each date in your input set assign the amount of time between the date and today.
For example: the following date set {today, tomorrow, yesterday, a week from today} becomes {0, 1, 1, 7}. Formally: val[i] = abs(today - date[i]).
Second, invert the values so that their relative ordering is reversed (closer dates end up with larger values). The simplest way of doing so would be: val[i] = 1/val[i].
Other suggestions:
val[i] = 1/val[i]^2
val[i] = 1/sqrt(val[i])
val[i] = 1/log(val[i])
The hardest and most important part is deciding how to invert the values. Think about what the nature of the weights should be: do you want noticeable differences between two far-away dates, or should two far-away dates have roughly equal weights? Should a date very close to today have a much larger weight, or only a moderately larger one?
Note that you should come up with an inverting procedure that cannot divide by zero. In the example above, today's value is 0, so dividing by val[i] fails. One method to avoid division by zero is called smoothing. The most trivial way to "smooth" your data is add-one smoothing, where you just add one to each value (so today becomes 1, tomorrow becomes 2, a week from today becomes 8, etc.).
Now the easiest part is to normalize the values so that they'll sum up to one.
sum = val[1] + val[2] + ... + val[n]
weight[i] = val[i]/sum for each i
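As a concrete illustration of those steps, here is a small sketch using add-one smoothing and the simple 1/val inversion; the dates are made up:

from datetime import date

def date_weights(dates, today=None):
    today = today or date.today()
    vals = [abs((today - d).days) + 1 for d in dates]  # distance in days, add-one smoothed
    inverted = [1.0 / v for v in vals]                 # closer dates get larger values
    total = sum(inverted)
    return [v / total for v in inverted]               # normalize so the weights sum to 1

print(date_weights([date(2015, 1, 1), date(2016, 3, 1), date(2016, 6, 20)]))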
Sort dates and remove dups
Assign values (maybe starting from the farthest date in steps of 10 or whatever you need - these values can be arbitrary, they just reflect order and distance)
Normalize weights to add up to 1
Executable pseudocode (tweakable):
#!/usr/bin/env python
import random, pprint
from operator import itemgetter
# for simplicity's sake dates are integers here ...
pivot_date = 1000
past_dates = set(random.sample(range(1, pivot_date), 5))
weights, stepping = [], 10
for date in sorted(past_dates):
    weights.append( (date, stepping) )
    stepping += 10
sum_of_steppings = sum([ itemgetter(1)(x) for x in weights ])
normalized = [ (d, (w / float(sum_of_steppings)) ) for d, w in weights ]
pprint.pprint(normalized)
# Example output
# The 'date' closest to 1000 (here: 889) has the highest weight,
# 703 the second highest, and so forth ...
# [(151, 0.06666666666666667),
# (425, 0.13333333333333333),
# (571, 0.2),
# (703, 0.26666666666666666),
# (889, 0.3333333333333333)]
How to weight: just compute the difference between each date and the current date
x(i) = abs(date(i) - current_date)
You can then use different expressions to assign the weights:
w(i) = 1/x(i)
w(i) = exp(-x(i))
w(i) = exp(-x(i)^2)
or use a Gaussian distribution - more complicated, not recommended.
Then use the normalized weights w(i)/sum(w(i)) so that the sum is 1.
(Note that the exponential function is widely used by statisticians in survival analysis.)
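A small sketch of the exp(-x) option, assuming distances are measured in whole days and adding an arbitrary decay rate of 0.1 per day (the rate is my addition, not part of the formulas above):

import math
from datetime import date

def exp_weights(dates, rate=0.1, today=None):
    today = today or date.today()
    x = [abs((today - d).days) for d in dates]  # x(i) = |date(i) - current_date|
    w = [math.exp(-rate * xi) for xi in x]      # w(i) = exp(-rate * x(i))
    total = sum(w)
    return [wi / total for wi in w]             # normalized so the sum is 1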
The first thing that comes to my mind is to use a geometric series:
http://en.wikipedia.org/wiki/Geometric_series
(1/2)+(1/4)+(1/8)+(1/16)+(1/32)+(1/64)+(1/128)+(1/256)..... sums to one.
Yesterday would be 1/2
2 days ago would be 1/4
and so on
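For a finite list the halving weights don't sum exactly to one, so a small renormalization step is needed at the end; a sketch of the idea (ranking by recency is my reading of "yesterday", "2 days ago", and so on):

def geometric_weights(dates):
    ranked = sorted(dates, reverse=True)                # most recent first
    raw = [0.5 ** (k + 1) for k in range(len(ranked))]  # 1/2, 1/4, 1/8, ...
    total = sum(raw)                                    # slightly less than 1 for a finite list
    return {d: w / total for d, w in zip(ranked, raw)}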
Let i be the index of the i-th date, and let D0 be the earliest date.
Let Ni be the difference in days between the i-th date and the first date D0.
Let D be the normalization factor (the sum of all the Ni, so the weights add up to 1).
Assign each date a weight equal to Ni / D.
Convert the dates to yyyymmddhhmiss format (24-hour clock), add all of these values to get a total, divide each value by that total, and sort by the result.
declare @Data table
(
    Date bigint,
    Weight float
)

declare @sumTotal decimal(18,2)

insert into @Data (Date)
select top 100
    replace(replace(replace(convert(varchar, Datetime, 20), '-', ''), ':', ''), ' ', '')
from Dates

select @sumTotal = sum(Date)
from @Data

update @Data set
    Weight = Date / @sumTotal

select * from @Data order by 2 desc