Modeling a job-scheduling problem with variable resource mapping

I am new to Google OR-Tools and I am trying to solve the following problem:
a number of tasks with precedence constraints
a pool of N resources
resources are grouped by names (a: [0, 1], b: [2, 3, 4], ...)
each job may acquire resources, either directly (job j needs resources 13 and 14) or indirectly (job j needs 1 resource of group a and 2 of group b)
jobs may run in parallel if their precedence and resource constraints allow it; the number of machines is unlimited
all jobs are assumed to have the same execution time
I want to find a minimal makespan and I want to know:
which job is supposed to run at any time t
which resource is used by which job at time t
I implemented precedence constraints the following way:
from ortools.sat.python import cp_model
njobs = 5
precedence_constraints = [
(0, 3),
(0, 2),
(1, 2),
(2, 3),
(2, 4)
]
model = cp_model.CpModel()
job_time = [ model.NewIntVar(0, njobs-1, 'j{}'.format(i)) for i in range(njobs) ]
for p, n in precedence_constraints:
    model.Add(job_time[p] < job_time[n])

model.Minimize(sum(job_time))

solver = cp_model.CpSolver()
status = solver.Solve(model)

for i in range(0, njobs):
    print('j{} = {}'.format(i, solver.Value(job_time[i])))
I do not understand how to implement the resource mapping.

You can try to model it as a nearly-classical flexible jobshop:
https://github.com/google/or-tools/blob/stable/examples/python/flexible_job_shop_sat.py
with two additions: a cumulative resource to help propagation (see this ongoing discussion on the or-tools mailing list: https://groups.google.com/forum/?hl=kn#!topic/or-tools-discuss/0syUImixcFI), and the fact that the sum of active copies may be greater than 1.
This is not the most efficient approach, but it is easier to start that way, and it will remain correct if the durations turn out not to all be equal.
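To make the resource-mapping part concrete, here is a rough CP-SAT sketch (my own illustration, not the linked example): one Boolean per (job, resource) pair decides which concrete resources a job takes, a NoOverlap over optional intervals keeps each resource on one job at a time, and a redundant AddCumulative per group helps propagation. The group layout, demands, duration and horizon below are made up, and the precedence constraints from the question are left out for brevity.
from ortools.sat.python import cp_model

model = cp_model.CpModel()
horizon = 20
duration = 1                                     # all jobs take the same time
groups = {'a': [0, 1], 'b': [2, 3, 4]}           # resource groups
demand = {0: {'a': 1, 'b': 2}, 1: {'a': 1}}      # job -> required units per group

start, end, interval = {}, {}, {}
for j in demand:
    start[j] = model.NewIntVar(0, horizon, 'start{}'.format(j))
    end[j] = model.NewIntVar(0, horizon, 'end{}'.format(j))
    interval[j] = model.NewIntervalVar(start[j], duration, end[j], 'iv{}'.format(j))

uses = {}
for j in demand:
    for g, resources in groups.items():
        for r in resources:
            uses[j, r] = model.NewBoolVar('uses_j{}_r{}'.format(j, r))
        # take exactly the required number of resources from this group
        model.Add(sum(uses[j, r] for r in resources) == demand[j].get(g, 0))

for g, resources in groups.items():
    # a concrete resource serves at most one job at a time
    for r in resources:
        model.AddNoOverlap(
            [model.NewOptionalIntervalVar(start[j], duration, end[j],
                                          uses[j, r], 'opt_j{}_r{}'.format(j, r))
             for j in demand])
    # redundant cumulative over the whole group to help propagation
    jobs_using_g = [j for j in demand if demand[j].get(g, 0) > 0]
    model.AddCumulative([interval[j] for j in jobs_using_g],
                        [demand[j][g] for j in jobs_using_g], len(resources))

makespan = model.NewIntVar(0, horizon, 'makespan')
model.AddMaxEquality(makespan, [end[j] for j in demand])
model.Minimize(makespan)

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    for j in demand:
        used = [r for rs in groups.values() for r in rs if solver.Value(uses[j, r])]
        print('job {} starts at {} using resources {}'.format(j, solver.Value(start[j]), used))
At a solution, solver.Value(start[j]) answers "which job runs at time t", and the uses booleans answer "which resource is used by which job".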

Related

Maximize weekly $ given changing price for parts & wholes

I have a fixed dollar amount per week that I want to spend on products for a given list of projects, and I want to maximize what I can get done with it.
For example, $60. I cannot spend more than $60 a week, but I can hold over any leftover amount. So if I spend $58, I get to spend $62 the next week. If I purchase a product in week 1, I can use what is left of it in week 2, thereby not needing to re-purchase the same item.
The solution needs to be generic so that I can maximize across a lot of products and a lot of projects given that fixed dollar amount per week.
I need to generate a report with the list of products to purchase and the list of projects to do for that week. The prices get updated weekly, so I will need to recalculate the maximum spend weekly (meaning forecasting is not really part of the solution), and I need to reuse the amounts from the products/inventory already purchased.
I have all the data and there aren't any unknown variables. I just need to figure out how to maximize what to purchase, given parts and wholes, under a fixed dollar amount with history.
To make it more concrete through example (although abstracted from my real data):
I have a database of products (12,500 different ones) and their corresponding prices. I also have a fixed list of projects (let's say 2,500) I wish to do with those products. For each project I have the corresponding products needed. Each project takes a different amount of product. Projects can have overlapping or unique products.
So for example:
project 1 is building my model airplanes;
project 2 is fixing my picture frames;
project 3 is building a bird house;
etc.
Project 1 may need:
glue (1oz)
paint (3 qt)
balsam wood (2 lbs)
Project 2 may need:
glue (2 oz)
nails (10 count)
project 3 may need:
glue (10 oz)
paint (5 qts)
nails (40 count)
wood balsam (3 lbs)
wood pine (50 lbs)
Products:
Glue 4oz - $10
Paint 3qts - $30
Nails 15 count - $7
Wood Balsam 8 pounds - $12
Wood Pine 12 pounds - $8
For example, if I buy a bottle of glue (4 oz) at $10 I can use it for my airplanes and my picture frames but not my bird house. I need to do an exhaustive analysis of all products and all projects weekly given my dollar amount to spend since the prices change (sales/demand/etc.).
How can I best spend the $60 to do as many projects as possible in a given week? In week 2 I get a new $60 to spend, most likely have leftover money, and still have product (such as glue) left over from the week before.
Is there any Python code or project that does something similar or exactly this already, that I may be able to import and modify for my needs?
Any help in terms of algorithms, sample code, full solutions, ideas, etc. would be appreciated...
Thanks in advance!!! (FYI: this is for a personal project.)
This is a problem which is very well suited to be tackled with mathematical programming. With mathematical optimization you can optimize variables (e.g. a variable that says whether a project is conducted at some point) against an objective such as the number of projects conducted, while also respecting a set of constraints. For Python there are several free libraries for optimizing mathematical programs; I will show how to get started with your problem using PuLP. Please note that free software for this kind of problem usually performs far worse than commercial software, which can be very expensive. For small or easy problems the free software suffices, though.
To get started:
easy_install pulp
Now, import pulp and, as a little helper, itertools.product. There are many ways to represent your data; I chose to declare some ranges which serve as index sets. So r = 0 would be glue and p = 0 building a model airplane. You have to choose the number of time periods; with 4 time periods, all projects can eventually be conducted.
from pulp import *
from itertools import product
R = range(5) # Resources
P = range(3) # Projects
T = range(4) # Periods
Your parameters could be represented as follows. project_consumption[0, 0] expresses that project 0 needs 1/4 of material 0 (glue) to be conducted.
resource_prices = (10, 30, 7, 12, 8) # Price per unit of resource
# Needed percentage of resource r for project p
project_consumption = {(0, 0): 1/4, (0, 1): 3/3, (0, 2): 0/15, (0, 3): 2/8, (0, 4): 0/12,
(1, 0): 2/4, (1, 1): 0/3, (1, 2): 10/15, (1, 3): 0/8, (1, 4): 0/12,
(2, 0): 10/4, (2, 1): 5/3, (2, 2): 40/15, (2, 3): 3/8, (2, 4): 50/12,}
weekly_budget = 60  # money added to the budget each period
Next, we declare our problem formulation. We want to maximize the number of projects, so we declare the sense LpMaximize. The decision variables are declared next:
planned_project[p, t]: 1 if project p is conducted in period t, else 0
stocked_material[r, t]: Amount of material r which is on stock in t
consumption_material[r, t]: Amount of r that is consumed in period t
purchase_material[r, t]: Amount of r purchased in t
budget[t]: Money balance in t
Declare our problem:
m = LpProblem("Project planning", LpMaximize)
planned_project = LpVariable.dicts('pp', product(P, T), lowBound = 0, upBound = 1, cat = LpInteger)
stocked_material = LpVariable.dicts('ms', product(R, T), lowBound = 0)
consumption_material = LpVariable.dicts('cm', product(R, T), lowBound = 0)
purchase_material = LpVariable.dicts('pm', product(R, T), lowBound = 0, cat = LpInteger)
budget = LpVariable.dicts('b', T, lowBound = 0)
Our objective is added to the problem as follows. I multiply every variable by (len(T) - t), which means a project is worth more when it is conducted early rather than late.
m += lpSum((len(T) - t) * planned_project[p, t] for p in P for t in T)
Now we can restrict the values of our decision variables by adding the necessary constraints. The first constraint restricts our material stock to the difference of purchased and consumed materials.
for r in R:
    for t in T:
        if t != 0:
            m += stocked_material[r, t] == stocked_material[r, t-1] + purchase_material[r, t] - consumption_material[r, t]
        else:
            m += stocked_material[r, t] == 0 + purchase_material[r, 0] - consumption_material[r, 0]
The second constraint makes sure that the correct amount of materials is consumed for the projects conducted in each period.
for r in R:
    for t in T:
        m += lpSum([project_consumption[p, r] * planned_project[p, t] for p in P]) <= consumption_material[r, t]
The third constraint ensures that we do not spend more than our budget; additionally, the leftover amount can be used in future periods.
for t in T:
    if t > 0:
        m += budget[t] == budget[t-1] + weekly_budget - lpSum([resource_prices[r] * purchase_material[r, t] for r in R])
    else:
        m += budget[0] == weekly_budget - lpSum([resource_prices[r] * purchase_material[r, 0] for r in R])
Finally, each project shall only be carried out once.
for p in P:
    m += lpSum([planned_project[p, t] for t in T]) <= 1
We can optimize our problem by calling:
m.solve()
After optimization we can access each optimal decision variable value with its .value() method. To print some useful information about our optimal plan of action:
for (p, t), var in planned_project.items():
    if var.value() == 1:
        print("Project {} is conducted in period {}".format(p, t))

for t, var in budget.items():
    print("At time {} we have a balance of {} $".format(t, var.value()))

for (r, t), var in purchase_material.items():
    if var.value() > 0:
        print("At time {}, we purchase {} of material {}.".format(t, var.value(), r))
Output:
Project 0 is conducted in period 0
Project 2 is conducted in period 3
Project 1 is conducted in period 0
At time 0 we have a balance of 1.0 $
At time 1 we have a balance of 1.0 $
At time 2 we have a balance of 61.0 $
At time 3 we have a balance of 0.0 $
At time 0, we purchase 1.0 of material 0.
At time 3, we purchase 1.0 of material 3.
At time 0, we purchase 1.0 of material 3.
At time 1, we purchase 2.0 of material 1.
At time 0, we purchase 1.0 of material 2.
At time 3, we purchase 3.0 of material 2.
At time 3, we purchase 6.0 of material 4.
At time 0, we purchase 1.0 of material 1.
At time 3, we purchase 4.0 of material 0.
Note that in the solution we purchase 6 units of material 4 (6 x 12 lbs of pine wood) at time 3. We never really use that much, but the solution is still considered optimal, since the budget is not part of our objective and buying more or less does not change the number of projects we can do. So there are multiple optimal solutions. You could treat this as a multi-criteria optimization problem and use a Big-M weighting to also minimize budget utilization in the objective.
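One way to implement that tie-breaking (a sketch, assuming the model above is otherwise unchanged): re-state the objective as a weighted sum in which a large constant M keeps the project count dominant and total spending only breaks ties. PuLP overwrites the previously set objective when a new expression is added, so you would define this instead of the earlier objective.
M = 10000  # large enough that one extra project always outweighs any possible saving
m += (M * lpSum((len(T) - t) * planned_project[p, t] for p in P for t in T)
      - lpSum(resource_prices[r] * purchase_material[r, t] for r in R for t in T))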
I hope this gets you started for your problem. You can find countless resources and examples for mathematical programming on the internet.
Something like this could work: create a list of materials for each project, a list of projects, and a dictionary of prices. Each call of compareAll() returns the cheapest project in the list. You could also add a loop which removes the cheapest project from the list and adds it to a to-do list each time it runs, so that the next run finds the next cheapest (see the sketch after the code below).
p1 = ["glue","wood","nails"]
p2 = ["screws","wood"]
p3 = ["screws","wood","glue","nails"]
projects = [p1,p2,p3]
prices = {"glue":1,"wood":4,"nails":2,"screws":1}
def check(project, prices):
    # Sum the price of every material the project needs.
    total = 0
    for i in project:
        iPrice = 0
        for n in prices:
            if i == n:
                iPrice = prices[n]
        total = total + iPrice
    print("Total: " + str(total))
    return total

def compareAll(projectList):
    # Return the cheapest project and its cost.
    best = 100  # or some other number which exceeds your budget
    cheapest = None
    for i in projectList:
        cost = check(i, prices)
        if cost < best:
            best = cost
            cheapest = i
    return cheapest, best
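A possible driver loop for that to-do-list idea (a sketch assuming the check()/compareAll() versions above and a $60 weekly budget):
budget = 60
todo = []
remaining = list(projects)
while remaining:
    cheapest, cost = compareAll(remaining)
    if cheapest is None or cost > budget:
        break                      # nothing affordable left this week
    budget -= cost
    todo.append(cheapest)
    remaining.remove(cheapest)
print("To do this week:", todo, "- money left:", budget)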

Finding cheapest combination of items with conditions on the selection

Let's say that I have 3 sellers of a particular item. Each seller has a different amount of this item in storage. They also have different prices for the item.
Name         Price  Units in storage
Supplier #1  $17    1 unit
Supplier #2  $18    3 units
Supplier #3  $23    5 units
If I do not order enough items from the same supplier, I have to pay extra costs per unit. Let's say, for example, that if I do not order at least 4 units from a supplier, I have to pay an extra $5 for each unit ordered from them.
Some examples:
If I wanted to buy 4 units, the best price would come from getting them from Supplier #1 and Supplier #2, rather than getting it all from Supplier #3
(17+5)*1 + (18+5)*3 = 91 <--- Cheaper
23 *4 = 92
But if I were to buy 5 units, getting them all from Supplier #3 gives me a better price than first getting the cheaper ones and then the rest from more expensive suppliers:
(17+5)*1 + (18+5)*3 + (23+5)*1 = 119
23 *5 = 115$ <--- Cheaper
The question
Keeping all this in mind... if I knew beforehand how many items I want to order, what would be an algorithm to find the best combination I can choose?
As noted in comments, you can use a graph search algorithm for this, like Dijkstra's algorithm. It might also be possible to use A*, but in order to do so, you need a good heuristic function. Using the minimum price might work, but for now, let's stick with Dijkstra's.
One node in the graph is represented as a tuple of (cost, num, counts), where cost is the cost, obviously, num the total number of items purchased, and counts a breakdown of the number of items per seller. With cost being the first element in the tuple, the item with the lowest cost will always be at the front of the heap. We can handle the "extra fee" by adding the fee if the current count for that seller is lower than the minimum, and subtracting it again once we reach that minimum.
Here's a simple implementation in Python.
import heapq

def find_best(goal, num_cheap, pay_extra, price, items):
    # state is a tuple (cost, num, counts)
    heap = [(0, 0, tuple((seller, 0) for seller in price))]
    visited = set()
    while heap:
        cost, num, counts = heapq.heappop(heap)
        if (cost, num, counts) in visited:
            continue  # already seen this combination
        visited.add((cost, num, counts))
        if num == goal:  # found one!
            yield (cost, num, counts)
        for seller, count in counts:
            if count < items[seller]:
                new_cost = cost + price[seller]  # increase cost
                if count + 1 < num_cheap: new_cost += pay_extra  # pay extra :(
                if count + 1 == num_cheap: new_cost -= (num_cheap - 1) * pay_extra  # discount! :)
                new_counts = tuple((s, c + 1 if s == seller else c) for s, c in counts)
                heapq.heappush(heap, (new_cost, num + 1, new_counts))  # push to heap
The above is a generator function, i.e. you can either use next(find_best(...)) to find just the best combination, or iterate over all the combinations:
price = {1: 17, 2: 18, 3: 23}
items = {1: 1, 2: 3, 3: 5}

for best in find_best(5, 4, 5, price, items):
    print(best)
And as we can see, there's an even cheaper solution for buying five items:
(114, 5, ((1, 1), (2, 0), (3, 4)))
(115, 5, ((1, 0), (2, 0), (3, 5)))
(115, 5, ((1, 0), (2, 1), (3, 4)))
(119, 5, ((1, 1), (2, 3), (3, 1)))
(124, 5, ((1, 1), (2, 2), (3, 2)))
(125, 5, ((1, 0), (2, 3), (3, 2)))
(129, 5, ((1, 1), (2, 1), (3, 3)))
(130, 5, ((1, 0), (2, 2), (3, 3)))
Update 1: While the above works fine for the example, there can be cases where it fails, since subtracting the extra cost once we reach the minimum number means that we can have edges with negative cost, which is a problem for Dijkstra's. Alternatively, we can buy all four items at once in a single "action". For this, replace the inner part of the algorithm with this:
if count < items[seller]:
    def buy(n, extra):  # inner function to avoid code duplication
        new_cost = cost + (price[seller] + extra) * n
        new_counts = tuple((s, c + n if s == seller else c) for s, c in counts)
        heapq.heappush(heap, (new_cost, num + n, new_counts))
    if count == 0 and items[seller] >= num_cheap:
        buy(num_cheap, 0)    # buy num_cheap in bulk
    if count < num_cheap - 1:    # do not buy a single item
        buy(1, pay_extra)        # when just 1 below num_cheap!
    if count >= num_cheap:
        buy(1, 0)  # buy with no extra cost
Update 2: Also, since the order in which the items are added to the "path" does not matter, we can restrict the sellers to those not before the current seller. We can change the for seller, count in counts: loop to this:
used_sellers = [i for i, (_, c) in enumerate(counts) if c > 0]
min_sellers = used_sellers[0] if used_sellers else 0
for i in range(min_sellers, len(counts)):
    seller, count = counts[i]
With those two improvements, far fewer states are explored for next(find_best(5, 4, 5, price, items)) (the original answer shows the explored graph as a figure).
Note that there are many states "below" the goal state with much worse costs. These are all the states that were added to the queue: for each of them, the predecessor state was still better than our best state, so they were expanded and added to the queue, but never actually popped from it. Many of those could probably be trimmed away by using A* with a heuristic function like items_left * min_price.
This is a bounded knapsack problem, where you want to optimize (minimize) the cost under constraints on price and quantity.
Read about the 0-1 knapsack problem here; that is the case where each supplier offers only 1 unit.
Read how to extend the 0-1 knapsack problem to a given quantity per item (the bounded knapsack) here.
A more detailed discussion of the bounded knapsack is here.
All of this should be sufficient to come up with an algorithm that then only needs a bit of tweaking (e.g. adding the $5 surcharge when the quantity from a supplier is below the given threshold).
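As an illustration of that tweak (my own sketch with made-up names, not a reference implementation): a dynamic program over suppliers where best[k] is the cheapest cost of buying exactly k units so far, and the surcharge is added whenever fewer than the bulk threshold is taken from a supplier.
import math

def cheapest(goal, suppliers, min_bulk, extra_fee):
    # suppliers: list of (unit_price, stock) tuples
    best = [0] + [math.inf] * goal      # best[k]: cheapest cost for exactly k units
    for price, stock in suppliers:
        new_best = [math.inf] * (goal + 1)
        for have in range(goal + 1):
            if best[have] == math.inf:
                continue
            for q in range(min(stock, goal - have) + 1):
                surcharge = q * extra_fee if 0 < q < min_bulk else 0
                new_best[have + q] = min(new_best[have + q],
                                         best[have] + q * price + surcharge)
        best = new_best
    return best[goal]

print(cheapest(4, [(17, 1), (18, 3), (23, 5)], 4, 5))   # 91, as in the question
print(cheapest(5, [(17, 1), (18, 3), (23, 5)], 4, 5))   # 114, matching the search above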

"Assignment problems" which does not require to find min/max but just a valid assignment?

I was reading a problem which seemed to be an assignment problem to me. Here is the abstract:
A company has N jobs. N candidates come to apply for them, but at different times.
Given an NxN matrix in which cell (i, j) denotes the time when job-seeker i approaches the company for the jth job, you have to find a valid one-to-one assignment. If a job is assigned to a candidate, that candidate does not look for more jobs. No two candidates may be given the same job. Also, at any given moment no two candidates may be at the same job office. The output should be any one permutation which satisfies the above constraints.
eg:
Input:
1 2 3
4 5 6
7 8 9
Output:
3 2 1
Explanation: At time = 1 sec the 1st candidate goes to the first job, then at time = 2 sec to the second job, but he is finally assigned job 3 at time 3. Then at the 5th sec job 2 is assigned to the 2nd candidate, so he does not go to job 3 at time = 6. Finally the 1st job is assigned to the 3rd candidate at t = 7.
Note that any other permutation is incorrect. For example, output (1 2 3) would be wrong because the 1st candidate would be assigned the first job, so he would not look at jobs 2 and 3. But at the 4th sec the 2nd candidate would also apply for the 1st job, which already has the 1st person in its office.
My question is: how do you deal with such assignment problems?
Order the (i, j) pairs by time. Whichever person applies for a job last, give that person that job. There will still be someone available for every other job at an earlier time (because otherwise that would not have been the maximum time).
Keep repeating this and you will get an assignment fairly quickly:
matrix = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]

dictionary = {}
for person in range(3):
    for job in range(3):
        time = matrix[person][job]
        dictionary[time] = (person, job)

ordered_time = sorted(dictionary.keys(), reverse=True)
taken_job = set()
taken_person = set()
assignment = []
for time in ordered_time:
    person, job = dictionary[time]
    if person not in taken_person and job not in taken_job:
        assignment.append("t=%s, i=%s, j=%s" % (time, person, job))
        taken_job.add(job)
        taken_person.add(person)

print(assignment)
# ['t=9, i=2, j=2', 't=5, i=1, j=1', 't=1, i=0, j=0']
This is the BLOCKING problem from the CodeChef August Challenge programming competition, which is currently running. It is against the rules to ask for this sort of hint while the competition is running.
http://www.codechef.com/AUG12/problems/BLOCKING
Once the competition has completed this weekend, you will be able to get your answer by looking at other competitors' answers.

How to determine best combinations from 2 lists

I'm looking for a way to make the best possible combination of people in groups. Let me sketch the situation.
Say we have persons A, B, C and D. Furthermore we have groups 1, 2, 3, 4 and 5. Both are examples and can be fewer or more. Each person gives a rating to each other person. So for example A rates B a 3, C a 2, and so on. Each person also rates each group. (Say ratings are 0-5.) Now I need some sort of algorithm to distribute these people evenly over the groups while keeping them as happy as possible (as in: they should be in a high-rated group, with high-rated people). Now I know it's not possible for everyone to be in their best group (the one they rated a 5), but I need the best possible solution for the group as a whole.
I think this is a difficult question, and I would be happy if someone could direct me to some more information about this types of problems, or help me with the algo I'm looking for.
Thanks!
EDIT:
I see a lot of great answers, but this problem is too great for me to solve correctly. However, the answers posted so far give me a great starting point to look further into the subject. Thanks a lot already!
After establishing that this is an NP-hard problem, I would suggest a heuristic approach using artificial intelligence tools.
A possible approach is steepest ascent hill climbing [SAHC].
First, we define our utility function (let it be u). It can be the sum of total happiness over all groups.
Next, we define our 'world': S is the set of all possible partitions.
For each legal partition s of S, we define:
next(s) = {all partitions reachable by moving one person to a different group}
all we have to do now is run SAHC with random restarts:
1. best<- -INFINITY
2. while there is more time
3. choose a random partition as starting point, denote it as s.
4. NEXT <- next(s)
5. if max{ u(NEXT) } < u(s): // s is the top of the hill
5.1. if u(s) > best: best <- u(s) // if s is better than the previous result - store it.
5.2. go to 2. //restart the hill climbing from a different random point.
6. else:
6.1. s <- max{ NEXT }
6.2. goto 4.
7. return best //when out of time, return the best solution found so far.
It is an anytime algorithm, meaning it will get a better result the more time you give it to run, and eventually [at time infinity] it will find the optimal result.
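A minimal Python sketch of that pseudocode (random_partition, u and next_states are placeholders you would supply; this is not a tested solver):
import math

def sahc(random_partition, u, next_states, restarts=50):
    best_value, best_partition = -math.inf, None
    for _ in range(restarts):                    # random restarts
        s = random_partition()                   # random starting point
        while True:
            neighbours = list(next_states(s))    # move one person to another group
            if not neighbours:
                break
            candidate = max(neighbours, key=u)   # steepest ascent
            if u(candidate) <= u(s):             # s is the top of the hill (or a plateau)
                break
            s = candidate
        if u(s) > best_value:                    # keep the best local optimum found
            best_value, best_partition = u(s), s
    return best_value, best_partition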
The problem is NP-hard: you can reduce from Maximum Triangle Packing, that is, finding at least k vertex-disjoint triangles in a graph, to the version where there are k groups of size 3, no one cares about which group he is in, and likes everyone for 0 or for 1. So even this very special case is hard.
To solve it, I would try using an ILP: have binary variables g_ik indicating that person i is in group k, with constraints to ensure a person is only in one group and a group has an appropriate size. Further, binary variables t_ijk that indicate that persons i and j are together in group k (ensured by t_ijk <= 0.5 g_ik + 0.5 g_jk) and binary variables t_ij that indicate that i and j are together in any group (ensured by t_ij <= sum_k t_ijk). You can then maximize the happiness function under these constraints.
This ILP has very many variables, but modern solvers are pretty good and this approach is very easy to implement.
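A small PuLP sketch of that ILP (the number of persons, the group size and the ratings below are invented; the variable names mirror the answer's g_ik, t_ijk and t_ij):
from itertools import combinations
import random

from pulp import LpProblem, LpVariable, LpMaximize, LpBinary, lpSum

random.seed(0)
persons = range(4)
groups = range(2)
group_size = 2
person_rating = {(i, j): random.randint(0, 5) for i in persons for j in persons if i != j}
group_rating = {(i, k): random.randint(0, 5) for i in persons for k in groups}

m = LpProblem("grouping", LpMaximize)
g = LpVariable.dicts("g", [(i, k) for i in persons for k in groups], cat=LpBinary)
tk = LpVariable.dicts("tk", [(i, j, k) for i, j in combinations(persons, 2) for k in groups], cat=LpBinary)
t = LpVariable.dicts("t", list(combinations(persons, 2)), cat=LpBinary)

# happiness: my rating of my group plus mutual ratings of the people I share it with
m += (lpSum(group_rating[i, k] * g[i, k] for i in persons for k in groups)
      + lpSum((person_rating[i, j] + person_rating[j, i]) * t[i, j]
              for i, j in combinations(persons, 2)))

for i in persons:
    m += lpSum(g[i, k] for k in groups) == 1            # each person in exactly one group
for k in groups:
    m += lpSum(g[i, k] for i in persons) == group_size  # groups have a fixed size
for i, j in combinations(persons, 2):
    for k in groups:
        m += tk[i, j, k] <= 0.5 * g[i, k] + 0.5 * g[j, k]   # together in k only if both are in k
    m += t[i, j] <= lpSum(tk[i, j, k] for k in groups)      # together in some group

m.solve()
for i in persons:
    for k in groups:
        if g[i, k].value() == 1:
            print("person {} -> group {}".format(i, k))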
This is an example of an optimization problem. It is a very well studied type of problem with very good methods to solve it. Read Programming Collective Intelligence, which explains it much better than I can.
Basically, there are three parts to any kind of optimization problem.
The input to the problem solving function.
The solution outputted by the problem solving function.
A scoring function that evaluates how optimal the solution is by
scoring it.
Now the problem can be stated as finding the solution that produces
the highest score. To do that, you first need to come up with a format
to represent a possible solution that the scoring function can then
score. Assuming 6 persons (0-5) and 3 groups (0-2), this python data structure
would work and would be a possible solution:
output = [
[0, 1],
[2, 3],
[4, 5]
]
Persons 0 and 1 are put in group 0, persons 2 and 3 in group 1, and so on. To score this solution, we need to know the input and the rules for calculating the output. The input could be represented by this data structure:
input = [
[0, 4, 1, 3, 4, 1, 3, 1, 3],
[5, 0, 1, 2, 1, 5, 5, 2, 4],
[4, 1, 0, 1, 3, 2, 1, 1, 1],
[2, 4, 1, 0, 5, 4, 2, 3, 4],
[5, 5, 5, 5, 0, 5, 5, 5, 5],
[1, 2, 1, 4, 3, 0, 4, 5, 1]
]
Each list in the list represents the ratings that person gave. For example, in the first row, person 0 gave rating 0 to person 0 (you can't rate yourself), 4 to person 1, 1 to person 2, 3 to person 3, 4 to person 4 and 1 to person 5. Then he or she rated groups 0-2 with 3, 1 and 3 respectively.
So the above is an example of a valid solution to the given input. How do we score it? That's not specified in the question, only that the "best" combination is desired, so I'll arbitrarily decide that the score for a solution is the sum of each person's happiness. Each person's happiness is determined by adding his or her rating of the group to the average of his or her ratings of the other people in the group.
Here is the scoring function:
N_GROUPS = 3
N_PERSONS = 6

def score_solution(input, output):
    tot_score = 0
    for person, ratings in enumerate(input):
        # Check what group the person is a member of.
        for group, members in enumerate(output):
            if person in members:
                # Check what rating person gave the group.
                group_rating = ratings[N_PERSONS + group]
                # Check what rating the person gave the others.
                others = list(members)
                others.remove(person)
                if not others:
                    # protect against zero division
                    person_rating = 0
                else:
                    person_ratings = [ratings[o] for o in others]
                    person_rating = sum(person_ratings) / float(len(person_ratings))
                tot_score += group_rating + person_rating
    return tot_score
It should return a score of 37.0 for the given solution. Now what
we'll do is to generate valid outputs while keeping track of which one
is best until we are satisfied:
from random import choice

def gen_solution():
    groups = [[] for x in range(N_GROUPS)]
    for person in range(N_PERSONS):
        choice(groups).append(person)
    return groups

# Generate 10000 solutions
solutions = [gen_solution() for x in range(10000)]
# Score them
solutions = [(score_solution(input, sol), sol) for sol in solutions]
# Sort by score, take the best.
best_score, best_solution = sorted(solutions)[-1]
print('The best solution is %s with score %.2f' % (best_solution, best_score))
Running this on my computer produces:
The best solution is [[0, 1], [3, 5], [2, 4]] with score 47.00
Obviously, you may think it is a really stupid idea to randomly just
generate solutions to throw at the problem, and it is. There are much
more sophisticated methods to generate solutions such as simulated
annealing or genetic optimization. But they all build upon the same
framework as given above.

What is the best way to compute trending topics or tags?

Many sites offer some statistics like "The hottest topics in the last 24h". For example, Topix.com shows this in its section "News Trends". There, you can see the topics which have the fastest growing number of mentions.
I want to compute such a "buzz" for a topic, too. How could I do this? The algorithm should give less weight to topics which are always hot. The topics which normally (almost) no one mentions should be the hottest ones.
Google offers "Hot Trends", topix.com shows "Hot Topics", fav.or.it shows "Keyword Trends" - all these services have one thing in common: They only show you upcoming trends which are abnormally hot at the moment.
Terms like "Britney Spears", "weather" or "Paris Hilton" won't appear in these lists because they're always hot and frequent. This article calls this "The Britney Spears Problem".
My question: How can you code an algorithm or use an existing one to solve this problem? Having a list with the keywords searched in the last 24h, the algorithm should show you the 10 (for example) hottest ones.
I know, in the article above, there is some kind of algorithm mentioned. I've tried to code it in PHP but I don't think that it'll work. It just finds the majority, doesn't it?
I hope you can help me (coding examples would be great).
This problem calls for a z-score or standard score, which will take into account the historical average, as other people have mentioned, but also the standard deviation of this historical data, making it more robust than just using the average.
In your case a z-score is calculated by the following formula, where the trend would be a rate such as views / day.
z-score = ([current trend] - [average historic trends]) / [standard deviation of historic trends]
When a z-score is used, the higher or lower the z-score the more abnormal the trend; for example, if the z-score is highly positive then the trend is abnormally rising, while if it is highly negative it is abnormally falling. So once you calculate the z-score for all the candidate trends, the highest 10 z-scores will correspond to the most abnormally increasing trends.
Please see Wikipedia for more information about z-scores.
Code
from math import sqrt

def zscore(obs, pop):
    # Size of population.
    number = float(len(pop))
    # Average population value.
    avg = sum(pop) / number
    # Standard deviation of population.
    std = sqrt(sum(((c - avg) ** 2) for c in pop) / number)
    # Zscore Calculation.
    return (obs - avg) / std
Sample Output
>>> zscore(12, [2, 4, 4, 4, 5, 5, 7, 9])
3.5
>>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20])
0.0739221270955
>>> zscore(20, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1])
1.00303599234
>>> zscore(2, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1])
-0.922793112954
>>> zscore(9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0])
1.65291949506
Notes
You can use this method with a sliding window (i.e. the last 30 days) if you wish not to take too much history into account, which will make short term trends more pronounced and can cut down on the processing time.
You could also use a z-score for values such as change in views from one day to next day to locate the abnormal values for increasing/decreasing views per day. This is like using the slope or derivative of the views per day graph.
If you keep track of the current size of the population, the current total of the population, and the current total of x^2 of the population, you don't need to recalculate these values, only update them and hence you only need to keep these values for the history, not each data value. The following code demonstrates this.
from math import sqrt

class zscore:
    def __init__(self, pop = []):
        self.number = float(len(pop))
        self.total = sum(pop)
        self.sqrTotal = sum(x ** 2 for x in pop)
    def update(self, value):
        self.number += 1.0
        self.total += value
        self.sqrTotal += value ** 2
    def avg(self):
        return self.total / self.number
    def std(self):
        return sqrt((self.sqrTotal / self.number) - self.avg() ** 2)
    def score(self, obs):
        return (obs - self.avg()) / self.std()
Using this method your workflow would be as follows. For each topic, tag, or page, create a floating point field for the total number of days, the sum of views, and the sum of views squared in your database. If you have historic data, initialize these fields using that data; otherwise initialize them to zero. At the end of each day, calculate the z-score using the day's number of views against the historic data stored in the three database fields. The topics, tags, or pages with the highest X z-scores are your X "hottest trends" of the day. Finally, update each of the 3 fields with the day's value and repeat the process the next day.
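A tiny usage sketch of that workflow, using the zscore class above (the topic name and view counts are made up; in practice the running totals would live in your database):
topic_history = {'some-topic': zscore([120, 90, 110, 95])}   # seeded from historic daily views

def end_of_day(topic, todays_views):
    stats = topic_history[topic]
    score = stats.score(todays_views)   # how abnormal was today compared to history?
    stats.update(todays_views)          # fold today into the running totals
    return score

print(end_of_day('some-topic', 400))    # large positive score => candidate "hot trend"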
New Addition
Normal z-scores as discussed above do not take into account the order of the data and hence the z-score for an observation of '1' or '9' would have the same magnitude against the sequence [1, 1, 1, 1, 9, 9, 9, 9]. Obviously for trend finding, the most current data should have more weight than older data and hence we want the '1' observation to have a larger magnitude score than the '9' observation. In order to achieve this I propose a floating average z-score. It should be clear that this method is NOT guaranteed to be statistically sound but should be useful for trend finding or similar. The main difference between the standard z-score and the floating average z-score is the use of a floating average to calculate the average population value and the average population value squared. See code for details:
Code
class fazscore:
    def __init__(self, decay, pop = []):
        self.sqrAvg = self.avg = 0
        # The rate at which the historic data's effect will diminish.
        self.decay = decay
        for x in pop: self.update(x)
    def update(self, value):
        # Set initial averages to the first value in the sequence.
        if self.avg == 0 and self.sqrAvg == 0:
            self.avg = float(value)
            self.sqrAvg = float((value ** 2))
        # Calculate the average of the rest of the values using a
        # floating average.
        else:
            self.avg = self.avg * self.decay + value * (1 - self.decay)
            self.sqrAvg = self.sqrAvg * self.decay + (value ** 2) * (1 - self.decay)
        return self
    def std(self):
        # Somewhat ad-hoc standard deviation calculation.
        return sqrt(self.sqrAvg - self.avg ** 2)
    def score(self, obs):
        if self.std() == 0: return (obs - self.avg) * float("infinity")
        else: return (obs - self.avg) / self.std()
Sample IO
>>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(1)
-1.67770595327
>>> fazscore(0.8, [1, 1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 9]).score(9)
0.596052006642
>>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(12)
3.46442230724
>>> fazscore(0.9, [2, 4, 4, 4, 5, 5, 7, 9]).score(22)
7.7773245459
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20]).score(20)
-0.24633160155
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(20)
1.1069362749
>>> fazscore(0.9, [21, 22, 19, 18, 17, 22, 20, 20, 1, 2, 3, 1, 2, 1, 0, 1]).score(2)
-0.786764452966
>>> fazscore(0.9, [1, 2, 0, 3, 1, 3, 1, 2, 9, 8, 7, 10, 9, 5, 2, 4, 1, 1, 0]).score(9)
1.82262469243
>>> fazscore(0.8, [40] * 200).score(1)
-inf
Update
As David Kemp correctly pointed out, if given a series of constant values and then a zscore for an observed value which differs from the other values is requested the result should probably be non-zero. In fact the value returned should be infinity. So I changed this line,
if self.std() == 0: return 0
to:
if self.std() == 0: return (obs - self.avg) * float("infinity")
This change is reflected in the fazscore solution code. If one does not want to deal with infinite values an acceptable solution could be to instead change the line to:
if self.std() == 0: return obs - self.avg
You need an algorithm that measures the velocity of a topic - or in other words, if you graph it you want to show those that are going up at an incredible rate.
This is the first derivative of the trend line, and it is not difficult to incorporate as a weighted factor of your overall calculation.
Normalize
One technique you'll need is to normalize all your data. For each topic you are following, keep a very low-pass filter that defines that topic's baseline. Now every data point that comes in about that topic should be normalized - subtract its baseline and you'll get ALL of your topics near 0, with spikes above and below the line. You may instead want to divide the signal by its baseline magnitude, which will bring the signal to around 1.0 - this not only brings all signals in line with each other (normalizes the baseline), but also normalizes the spikes. A Britney spike is going to be magnitudes larger than someone else's spike, but that doesn't mean you should pay attention to it - the spike may be very small relative to her baseline.
Derive
Once you've normalized everything, figure out the slope of each topic. Take two consecutive points and measure the difference. A positive difference is trending up, a negative difference is trending down. Then you can compare the normalized differences and find out which topics are shooting upward in popularity compared to other topics - with each topic scaled appropriately to its own 'normal', which may be orders of magnitude different from other topics.
This is really a first pass at the problem. There are more advanced techniques which you'll need to use (mostly a combination of the above with other algorithms, weighted to suit your needs), but it should be enough to get you started.
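As a rough sketch of that normalize-then-derive idea (the smoothing factor and the sample series are invented; a real system would maintain the baseline per topic incrementally):
def trend_velocity(series, smoothing=0.95):
    baseline = float(series[0])
    normalized = []
    for x in series:
        baseline = smoothing * baseline + (1 - smoothing) * x   # slow low-pass filter
        normalized.append(x / baseline)                          # scale by the topic's own baseline
    # first derivative: difference between consecutive normalized points
    return [b - a for a, b in zip(normalized[:-1], normalized[1:])]

print(trend_velocity([100, 102, 98, 101, 250, 400]))   # the spike shows up as large positive slopes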
Regarding the article
The article is about topic trending, but it's not about how to calculate what's hot and what's not; it's about how to process the huge amount of information that such an algorithm must process at places like Lycos and Google. The space and time required to give each topic a counter, and to find each topic's counter when a search on it goes through, is huge. This article is about the challenges one faces when attempting such a task. It does mention the Britney effect, but it doesn't talk about how to overcome it.
As Nixuz points out this is also referred to as a Z or Standard Score.
Chad Birch and Adam Davis are correct in that you will have to look backward to establish a baseline. Your question, as phrased, suggests that you only want to view data from the past 24 hours, and that won't quite fly.
One way to give your data some memory without having to query for a large body of historical data is to use an exponential moving average. The advantage of this is that you can update this once per period and then flush all old data, so you only need to remember a single value. So if your period is a day, you have to maintain a "daily average" attribute for each topic, which you can do by:
a_n = a_(n-1)*b + c_n*(1-b)
Where a_n is the moving average as of day n, b is some constant between 0 and 1 (the closer to 1, the longer the memory) and c_n is the number of hits on day n. The beauty is if you perform this update at the end of day n, you can flush c_n and a_(n-1).
The one caveat is that it will be initially sensitive to whatever you pick for your initial value of a.
EDIT
If it helps to visualize this approach, take n = 5, a_0 = 1, and b = .9.
Let's say the new values are 5,0,0,1,4:
a_0 = 1
c_1 = 5 : a_1 = .9*1 + .1*5 = 1.4
c_2 = 0 : a_2 = .9*1.4 + .1*0 = 1.26
c_3 = 0 : a_3 = .9*1.26 + .1*0 = 1.134
c_4 = 1 : a_4 = .9*1.134 + .1*1 = 1.1206
c_5 = 4 : a_5 = .9*1.1206 + .1*4 = 1.40854
Doesn't look very much like an average, does it? Note how the value stayed close to 1, even though our next input was 5. What's going on? If you expand out the math, what you get is:
a_n = (1-b)*c_n + (1-b)*b*c_(n-1) + (1-b)*b^2*c_(n-2) + ... + (leftover weight)*a_0
What do I mean by leftover weight? Well, in any average, all weights must add to 1. If n were infinity and the ... could go on forever, then all weights would sum to 1. But if n is relatively small, you get a good amount of weight left on the original input.
If you study the above formula, you should realize a few things about this usage:
All data contributes something to the average forever. Practically speaking, there is a point where the contribution is really, really small.
Recent values contribute more than older values.
The higher b is, the less important new values are and the longer old values matter. However, the higher b is, the more data you need to water down the initial value of a.
I think the first two characteristics are exactly what you are looking for. To give you an idea of how simple this can be to implement, here is a Python implementation (minus all the database interaction):
>>> class EMA(object):
...     def __init__(self, base, decay):
...         self.val = base
...         self.decay = decay
...         print self.val
...     def update(self, value):
...         self.val = self.val*self.decay + (1-self.decay)*value
...         print self.val
...
>>> a = EMA(1, .9)
1
>>> a.update(10)
1.9
>>> a.update(10)
2.71
>>> a.update(10)
3.439
>>> a.update(10)
4.0951
>>> a.update(10)
4.68559
>>> a.update(10)
5.217031
>>> a.update(10)
5.6953279
>>> a.update(10)
6.12579511
>>> a.update(10)
6.513215599
>>> a.update(10)
6.8618940391
>>> a.update(10)
7.17570463519
Typically "buzz" is figured out using some form of exponential/log decay mechanism. For an overview of how Hacker News, Reddit, and others handle this in a simple way, see this post.
This doesn't fully address the things that are always popular. What you're looking for seems to be something like Google's "Hot Trends" feature. For that, you could divide the current value by a historical value and then subtract out ones that are below some noise threshold.
I think the key word you need to notice is "abnormally". In order to determine when something is "abnormal", you have to know what is normal. That is, you're going to need historical data, which you can average to find out the normal rate of a particular query. You may want to exclude abnormal days from the averaging calculation, but again that'll require having enough data already, so that you know which days to exclude.
From there, you'll have to set a threshold (which would require experimentation, I'm sure), and if something goes outside the threshold, say 50% more searches than normal, you can consider it a "trend". Or, if you want to be able to find the "Top X Trendiest" like you mentioned, you just need to order things by how far (percentage-wise) they are away from their normal rate.
For example, let's say that your historical data has told you that Britney Spears usually gets 100,000 searches, and Paris Hilton usually gets 50,000. If you have a day where they both get 10,000 more searches than normal, you should be considering Paris "hotter" than Britney, because her searches increased 20% more than normal, while Britney's were only 10%.
God, I can't believe I just wrote a paragraph comparing "hotness" of Britney Spears and Paris Hilton. What have you done to me?
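A minimal sketch of that percentage-deviation ranking, with made-up daily counts:
normal = {'britney spears': 100000, 'paris hilton': 50000}   # historical average searches per day
today = {'britney spears': 110000, 'paris hilton': 60000}    # today's counts

deviation = {t: (today[t] - normal[t]) / float(normal[t]) for t in normal}
for topic in sorted(deviation, key=deviation.get, reverse=True):
    print("{} is {:.0%} above normal".format(topic, deviation[topic]))
# paris hilton is 20% above normal
# britney spears is 10% above normal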
I was wondering if it is at all possible to use the regular physics acceleration formula in such a case?
(v2 - v1) / t, or dv/dt
We can consider v1 to be the initial likes/votes/count-of-comments per hour and v2 to be the current "velocity" per hour over the last 24 hours.
This is more like a question than an answer, but it seems it may just work. Any content with the highest acceleration will be the trending topic...
I am sure this may not solve Britney Spears problem :-)
Probably a simple gradient of topic frequency would work -- a large positive gradient = growing quickly in popularity.
The easiest way would be to bin the number of searches each day, so you have something like
searches = [ 10, 7, 14, 8, 9, 12, 55, 104, 100 ]
and then find out how much it changed from day to day:
hot_factor = [ b-a for a, b in zip(searches[:-1], searches[1:]) ]
# hot_factor is [ -3, 7, -6, 1, 3, 43, 49, -4 ]
and just apply some sort of threshold so that days where the increase was > 50 are considered 'hot'. You could make this far more complicated if you'd like, too. Rather than the absolute difference you can take the relative difference, so that going from 100 to 150 is considered hot, but 1000 to 1050 isn't (see the one-liner below). Or use a more complicated gradient that takes into account trends over more than just one day to the next.
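The relative-difference variant mentioned above is the same zip trick with a division instead of a subtraction:
relative_factor = [ (b - a) / float(a) for a, b in zip(searches[:-1], searches[1:]) ]
# going from 12 to 55 gives (55 - 12) / 12, about 3.6, while 1000 -> 1050 would only give 0.05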
I worked on a project where my aim was finding trending topics from a live Twitter stream and also doing sentiment analysis on them (finding whether a trending topic was talked about positively or negatively). I've used Storm for handling the Twitter stream.
I've published my report as a blog: http://sayrohan.blogspot.com/2013/06/finding-trending-topics-and-trending.html
I've used Total Count and Z-Score for the ranking.
The approach that I've used is a bit generic, and in the discussion section I've mentioned how the system can be extended for non-Twitter applications.
Hope the information helps.
You could use log-likelihood-ratios to compare the current date with the last month or year. This is statistically sound (given that your events are not normally distributed, which is to be assumed from your question).
Just sort all your terms by logLR and pick the top ten.
public static void main(String... args) {
    TermBag today = ...
    TermBag lastYear = ...
    for (String each : today.allTerms()) {
        System.out.println(logLikelihoodRatio(today, lastYear, each) + "\t" + each);
    }
}

public static double logLikelihoodRatio(TermBag t1, TermBag t2, String term) {
    double k1 = t1.occurrences(term);
    double k2 = t2.occurrences(term);
    double n1 = t1.size();
    double n2 = t2.size();
    double p1 = k1 / n1;
    double p2 = k2 / n2;
    double p = (k1 + k2) / (n1 + n2);
    double logLR = 2 * (logL(p1, k1, n1) + logL(p2, k2, n2) - logL(p, k1, n1) - logL(p, k2, n2));
    if (p1 < p2) logLR *= -1;
    return logLR;
}

private static double logL(double p, double k, double n) {
    return (k == 0 ? 0 : k * Math.log(p)) + ((n - k) == 0 ? 0 : (n - k) * Math.log(1 - p));
}
PS, a TermBag is an unordered collection of words. For each document you create one bag of terms. Just count the occurrences of words. Then the method occurrences returns the number of occurrences of a given word, and the method size returns the total number of words. It is best to normalize the words somehow, typically toLowerCase is good enough. Of course, in the above examples you would create one document with all queries of today, and one with all queries of the last year.
If you simply look at tweets or status messages to get your topics, you're going to encounter a lot of noise, even if you remove all stop words. One way to get a better subset of topic candidates is to focus only on tweets/messages that share a URL, and get the keywords from the titles of those web pages. And make sure you apply POS tagging to get nouns and noun phrases as well.
Titles of web pages usually are more descriptive and contain words that describe what the page is about. In addition, sharing a web page usually is correlated with sharing news that is breaking (ie if a celebrity like Michael Jackson died, you're going to get a lot of people sharing an article about his death).
I've run experiments where I only take popular keywords from titles, AND then get the total counts of those keywords across all status messages, and they definitely remove a lot of noise. If you do it this way, you don't need a complex algorithm; just do a simple ordering of the keyword frequencies, and you're halfway there.
The idea is to keep track of such things and notice when they jump significantly as compared to their own baseline.
So, for queries that occur more than a certain threshold, track each one, and when a query jumps to some multiple (say almost double) of its historical value, it is a new hot trend.
