I have a table in Oracle that records events for a user. This user may have many events. From these events I am calculating a reputation with a formula. My question is, what is this best approach to do this in calculating and returning the data. Using a view and using SQL, doing it in code by grabbing all the events and calculating it (problem with this is when you have a list of users and need to calculate the reputation for all), or something else. Like to hear your thoughts.
Comments * (.1) +
Blog Posts * (.3) +
Blog Posts Ratings * (.1) +
Followers * (.1) +
Following * (.1) +
Badges * (.2) +
Connections * (.1)
= 100%
One Example
Comments:
This parameter is based on the average comments per post.
• Max: 20
• Formula: AVE(#) / max * 100 = 100%
• Example: 5 /10 * 100 = 50%
Max is that maximum number to get all that percentage. Hope that makes some sense.
We are calculating visitation, so all unique visits / date of membership is another. The table contains an event name, some meta data, and it is tied to that user. Reputation just uses those events to formulate a reputation based on 100% as the highest.
85% reputation - Joe AuthorUser been a member for 3 years. He has:
• written 18 blog posts
o 2 in the past month
• commented an average of 115 times per month
• 3,000 followers
• following 2,000 people
• received an average like rating of 325 per post
• he's earned, over the past 3 years:
o 100 level 1 badges
o 50 level 2 badges
• he's connected his:
o FB account
o Twitter account
As a general approach I would be using PL/SQL. One package with several get_rep functions.
function calc_rep (i_comments in number, i_posts in number, i_ratings in number,
i_followers in number, i_following in number, i_badges in number,
i_connections in number) return number deterministic is
...
end calc_rep;
function get_rep_for_user (i_user_id in number) is
v_comments ....
begin
select .....
calc_rep (v_comments...)
end get_rep_for_user;
If you've got to recalculate rep for a lot of users a lot of the time, I'd look into parallel pipelined functions (which should be a separate question). The CALC_REP is deterministic as anyone with the same set of numbers will get the same result.
If the number of comments etc is stored in a single record, then it will be simple to call. If the details need to be summarised up, then use materialized views for the summaries. If they need to be gathered from multiple places, then a view can be used to encapsulate the joins.
Whether you can calculate on the fly fast enough to meet requirements is a factor of data volumes, database design, final calculation complexity..... to imagine that we can give you a cut-and-dry approach is unreasonable.
It may wind up being something that would be helped by storing summaries used for some calculated values. For example, look at the things that cause DML. If you had a user_reputation table, then a trigger on your blog_post table could increment/decrement a counter on user_reputation on insert or delete of a post. Same for comments, likes, follows, etc.
If you keep all of your summaries up to date in that manner, then the incremental costs to DML will be minor and the calculations will become simple.
Not saying that this is THE solution. Just saying that it might be worth exploring.
Related
I'm trying to develop a rating system for an application I'm working on. Basically app allows you to rate an object from 1 to 5(represented by stars). But I of course know that keeping a rating count and adding the rating the number itself is not feasible.
So the first thing that came up in my mind was dividing the received rating by the total ratings given. Like if the object has received the rating 2 from a user and if the number of times that object has been rated is 100 maybe adding the 2/100. However I believe this method is not good enough since 1)A naive approach 2) In order for me to get the number of times that object has been rated I have to do a look up on db which might end up having time complexity O(n)
So I was wondering what alternative and possibly better ways to approach this problem?
You can keep in DB 2 additional values - number of times it was rated and total sum of all ratings. This way to update object's rating you need only to:
Add new rating to total sum.
Divide total sum by total times it was rated.
There are many approaches to this but before that check
If all feedback givers treated at equal or some have more weight than others (like panel review, etc)
If the objective is to provide only an average or any score band or such. Consider scenario like this website - showing total reputation score
And yes - if average is to be omputed, you need to have total and count of feedback and then have to compute it - that's plain maths. But if you need any other method, be prepared for more compute cycles. balance between database hits and compute cycle but that's next stage of design. First get your requirement and approach to solution in place.
I think you should keep separate counters for 1 stars, 2 stars, ... to calcuate the rating, you'd have to compute rating = (1*numOneStars+2*numTwoStars+3*numThreeStars+4*numFourStars+5*numFiveStars)/numOneStars+numTwoStars+numThreeStars+numFourStars+numFiveStars)
This way you can, like amazon also show how many ppl voted 1 stars and how many voted 5 stars...
Have you considered a vote up/down mechanism over numbers of stars? It doesn't directly solve your problem but it's worth noting that other sites such as YouTube, Facebook, StackOverflow etc all use +/- voting as it is often much more effective than star based ratings.
Summary
As Ted Jaspers wisely pointed out, the methodology I described in the original proposal back in 2012 is actually a special case of an exponential moving average. The beauty of this approach is that it can be calculated recursively, meaning you only need to store a single popularity value with each object and then you can recursively adjust this value when an event occurs. There's no need to record every event.
This single popularity value represents all past events (within the limits of the data type being used), but older events begin to matter exponentially less as new events are factored in. This algorithm will adapt to different time scales and will respond to varying traffic volumes. Each time an event occurs, the new popularity value can be calculated using the following formula:
(a * t) + ((1 - a) * p)
a — coefficient between 0 and 1 (higher values discount older events faster)
t — current timestamp
p — current popularity value (e.g. stored in a database)
Reasonable values for a will depend on your application. A good starting place is a=2/(N+1), where N is the number of events that should significantly affect the outcome. For example, on a low-traffic website where the event is a page view, you might expect hundreds of page views over a period of a few days. Choosing N=100 (a≈0.02) would be a reasonable choice. For a high-traffic website, you might expect millions of page views over a period of a few days, in which case N=1000000 (a≈0.000002) would be more reasonable. The value for a will likely need to be gradually adjusted over time.
To illustrate how simple this popularity algorithm is, here's an example of how it can be implemented in Craft CMS in 2 lines of Twig markup:
{% set popularity = (0.02 * date().timestamp) + (0.98 * entry.popularity) %}
{% do entry.setFieldValue("popularity", popularity) %}
Notice that there's no need to create new database tables or store endless event records in order to calculate popularity.
One caveat to keep in mind is that exponential moving averages have a spin-up interval, so it takes a few recursions before the value can be considered accurate. This means the initial condition is important. For example, if the popularity of a new item is initialized using the current timestamp, the item immediately becomes the most popular item in the entire set before eventually settling down into a more accurate position. This might be desirable if you want to promote new content. Alternatively, you may want content to work its way up from the bottom, in which case you could initialize it with the timestamp of when the application was first launched. You could also find a happy medium by initializing the value with an average of all popularity values in the database, so it starts out right in the middle.
Original Proposal
There are plenty of suggested algorithms for calculating popularity based on an item's age and the number of votes, clicks, or purchases an item receives. However, the more robust methods I've seen often require overly complex calculations and multiple stored values which clutter the database. I've been contemplating an extremely simple algorithm that doesn't require storing any variables (other than the popularity value itself) and requires only one simple calculation. It's ridiculously simple:
p = (p + t) / 2
Here, p is the popularity value stored in the database and t is the current timestamp. When an item is first created, p must be initialized. There are two possible initialization methods:
Initialize p with the current timestamp t
Initialize p with the average of all p values in the database
Note that initialization method (1) gives recently added items a clear advantage over historical items, thus adding an element of relevance. On the other hand, initialization method (2) treats new items as equals when compared to historical items.
Let's say you use initialization method (1) and initialize p with the current timestamp. When the item receives its first vote, p becomes the average of the creation time and the vote time. Thus, the popularity value p still represents a valid timestamp (assuming you round to the nearest integer), but the actual time it represents is abstracted.
With this method, only one simple calculation is required and only one value needs to be stored in the database (p). This method also prevents runaway values, since a given item's popularity can never exceed the current time.
An example of the algorithm at work over a period of 1 day: http://jsfiddle.net/q2UCn/
An example of the algorithm at work over a period of 1 year: http://jsfiddle.net/tWU9y/
If you expect votes to steadily stream in at sub-second intervals, then you will need to use a microsecond timestamp, such as the PHP microtime() function. Otherwise, a standard UNIX timestamp will work, such as the PHP time() function.
Now for my question: do you see any major flaws with this approach?
I think this is a very good approach, given its simplicity. A very interesting result.
I made a quick set of calculations and found that this algorithm does seem to understand what "popularity" means. Its problem is that it has a clear tendency to favor recent votes like this:
Imagine we take the time and break it into discrete timestamp values ranging from 100 to 1000. Assume that at t=100 both A and B (two items) have the same P = 100.
A gets voted 7 times on 200, 300, 400, 500, 600, 700 and 800
resulting on a final Pa(800) = 700 (aprox).
B gets voted 4 times on 300, 500, 700 and 900
resulting on a final Pb(900) = 712 (aprox).
When t=1000 comes, both A and B receive votes, so:
Pa(1000) = 850 with 8 votes
Pb(1000) = 856 with 5 votes
Why? because the algorithm allows an item to quickly beat historical leaders if it receives more recent votes (even if the item has fewer votes in total).
EDIT INCLUDING SIMULATION
The OP created a nice fiddle that I changed to get the following results:
http://jsfiddle.net/wBV2c/6/
Item A receives one vote each day from 1970 till 2012 (15339 votes)
Item B receives one vote each month from Jan to Jul 2012 (7 votes)
The result: B is more popular than A.
The proposed algorithm is a good approach, and is a special case of an Exponential Moving Average where alpha=0.5:
p = alpha*p + (1-alpha)*t = 0.5*p + 0.5*t = (p+t)/2 //(for alpha = 0.5)
A way to tweak the fact that the proposed solution for alpha=0.5 tends to favor recent votes (as noted by daniloquio) is to choose higher values for alpha (e.g. 0.9 or 0.99). Note that applying this to the testcase proposed by daniloquio is not working however, because when alpha increases the algorithm needs more 'time' to settle (so the arrays should be longer, which is often true in real applications).
Thus:
for alpha=0.9 the algorithm averages approximately the last 10 values
for alpha=0.99 the algorithm averages approximately the last 100 values
for alpha=0.999 the algorithm averages approximately the last 1000 values
etc.
I see one problem, only the last ~24 votes count.
p_i+1 = (p + t) / 2
For two votes we have
p2 = (p1 + t2) / 2 = ((p0 + t1) /2 + t2 ) / 2 = p0/4 + t1/4 + t2/2
Expanding that for 32 votes gives:
p32 = t*2^-32 + t0*2^-32 + t1*2^-31 + t2*2^-30 + ... + t31*2^-1
So for signed 32 bit values, t0 has no effect on the result. Because t0 gets divided by 2^32, it will contribute nothing to p32.
If we have two items A and B (no matter how big the differences are) if they both get the same 32 votes, they will have the same popularity. So you're history goes back for only 32 votes. There is no difference in 2032 and 32 votes, if the last 32 votes are the same.
If the difference is less than a day, they will be equal after 17 votes.
The flaw is that something with 100 votes is usually more meaningful than something with only one recent vote. However it isn't hard to come up with variants of your scheme that work reasonably well.
I don't think that the above-discussed logic is going to work.
p_i+1= (p_i + t) /2
Article A gets viewed on timestamps: 70, 80, 90 popularity(Article A): 82.5
Article B gets viewed on timestamps: 50, 60, 70, 80, 90 popularity(Article B): 80.625
In this case, the popularity of Article B should have been more. Firstly Article B was viewed as recently as Article A and secondly, it was also viewed more times than Article A.
I was recently wondering how http://500px.com calculates their "Pulse" rating.
The "Pulse" is a score from 1..100 based on the popularity of the photo.
I think it might use some of the following criteria:
Number of likes
Number of "favorites"
Number of comments
Total views
maybe the time since the photo has been uploaded
maybe some other non-obvious criteria like the users follower count, user rank, camera model or similar
How would I achieve some sort of algorithm like this?
Any advice on how to implement an algorithm with this criteria (and maybe some code) would be appreciated too.
I don't know too much about the site but systems like this generally work the same way. Normalize a set of weighted values to produce a single comparable value.
Define your list of rules, weight them based on importance, then run them all together to get your final value.
In this case it would be something like.
Total number of visits = 10%
Total number of Likes = 10%
Number of vists / number of likes = 40% (popularity = percentage of visitors that liked it)
number of Likes in last 30 days = 20% (current popularity)
author rating = 20%
Now we need to normalize the values for those rules. Depending on what your data is, scale of numbers etc this will be different for each rule so we need a workable value, say between 1 and 100.
Example normalizations for the above:
= percentage of vistors out of 50,000 vists (good number of vists)
(vists / 50000 ) * 100
= percentage of likes out of 10,000 likes (good number of likes)
(likes / 10000) * 100
= percentage of vistors that liked it
(likes / vists) * 100
= percentage of likes in last 30 days out of 1,000 likes (good number of likes for a 30 day period)
(likesIn30Days / 1000) * 100
= arbitrary rating of the author
Make sure all of these have a maximum value of 100 (if it's over bring it back down). Then we need to combine all these up depending on their weighting:
Popularity = (1 * 0.1) + (2 * 0.1) + (3 * 0.4) + (4 * 0.2) + (5 * 0.2)
This is all off the top of my head and rough. There are obviously also much more efficient ways to this as you don't need to normalize to a percentage at every stage but I hope it helps you get the gist.
Update
I've not really got any references or extra reading. I've never really worked with it as a larger concept only in small implementations.
I think most of what you read though is going to be methodological ranking systems in general and theories. Because depending on your rules and data format, your implementation will be very different. It seems such a huge concept when actually it will probably come down to arround 10 lines of code, not counting aggregating your data.
You may want to also refer to the following
How Reddit ranking algorithms work
How Hacker News ranking algorithm works
How to Build a Popularity Algorithm You can be Proud of
500px explains their (in the meantime outdated) Pulse ranking algorithm to some extend in their blog:
https://500px.com/blog/52/how-rating-works-and-why-there-s-a-lot-more-to-a-rating-than-just-a-number
Pretty interesting and different than what the solutions here on SO suggest so far.
Here is my scenario,
I run a Massage Place which offers various type of massages. Say 30 min Massage, 45 min massage, 1 hour massage, etc. I have 50 rooms, 100 employees and 30 pieces of equipment.When a customer books a massage appointment, the appointment requires 1 room, 1 employee and 1 piece of equipment to be available.
What is a good algorithm to find available resources for 10 guests for a given day
Resources:
Room – 50
Staff – 100
Equipment – 30
Business Hours : 9AM - 6PM
Staff Hours: 9AM- 6PM
No of guests: 10
Services
5 Guests- (1 hour massages)
3 Guests - (45mins massages)
2 Guests - (1 hour massage).
They are coming around the same time. Assume there are no other appointment on that day
What is the best way to get ::
Top 10 result - Fastest search which meets all conditions gets the top 10 result set. Top ten is defined by earliest available time. 9 – 11AM is best result set. 9 – 5pm is not that good.
Exhaustive search (Find all combinations) - all sets – Every possible combination
First available met (Only return the first match) – stop after one of the conditions have been met
I would appreciate your help.
Thanks
Nick
First, it seems the number of employees, rooms, and equipment are irrelevant. It seems like you only care about which of those is the lowest number. That is your inventory. So in your case, inventory = 30.
Next, it sounds like you can service all 10 people at the same time within the first hour of business. In fact, you can service 30 people at the same time.
So, no algorithm is necessary to figure that out, it's a static solution. If you take #Mario The Spoon's advice and weight the different duration massages with their corresponding profits, then you can start optimizing when you have more than 30 customers at a time.
Looks like you are trying to solve a problem for which there are quite specialized software applications. If your problem is small enough, you could try to do a brute force approach using some looping and backtracking, but as soon as the problem becomes too big, it will take too much time to iterate through all possibilities.
If the problem starts to get big, look for more specialized software. Things to look for are "constraint based optimization" and "constraint programming".
E.g. the ECLIPSe tool is an open-source constraint programming environment. You can find some examples on http://eclipseclp.org/examples/index.html. One nice example you can find there is the SEND+MORE=MONEY problem. In this problem you have the following equation:
S E N D
+ M O R E
-----------
= M O N E Y
Replace every letter by a digit so that the sum is correct.
This also illustrates that although you can solve this brute-force, there are more intelligent ways to solve this (see http://eclipseclp.org/examples/sendmore.pl.txt).
Just an idea to find a solution:
You might want to try to solve it with a constraint satisfaction problem (CSP) algorithm. That's what some people do if they have to solve timetable problems in general (e.g. room reservation at the University).
There are several tricks to improve CSP performance like forward checking, building a DAG and then do a topological sort and so on...
Just let me know, if you need more information about CSP :)
I'm working on a project where I need to sort a list of user-submitted articles by their popularity (last week, last month and last year).
I've been mulling on this for a while, but I'm not a great statitician so I figured I could maybe get some input here.
Here are the variables available:
Time [date] the article was originally published
Time [date] the article was recommended by editors (if it has been)
Amount of votes the article has received from users (total, in the last week, in the last month, in the last year)
Number of times the article has been viewed (total, in the last week, in the last month, in the last year)
Number of times the article has been downloaded by users (total, in the last week, in the last month, in the last year)
Comments on the article (total, in the last week, in the last month, in the last year)
Number of times a user has saved the article to their reading-list (Total, in the last week, in the last month, in the last year)
Number of times the article has been featured on a kind of "best we've got to offer" (editorial) list (Total, in the last week, in the last month, in the last year)
Time [date] the article was dubbed 'article of the week' (if it has been)
Right now I'm doing some weighting on each variable, and dividing by the times it has been read. That's pretty much all I could come up with after reading up on Weighted Means. My biggest problem is that there are some user-articles that are always on the top of the popular-list. Probably because the author is "cheating".
I'm thinking of emphasizing the importance of the article being relatively new, but I don't want to "punish" articles that are genuinely popular just because they're a bit old.
Anyone with a more statistically adept mind than mine willing to help me out?
Thanks!
I think the weighted means approach is a good one. But I think there are two things you need to work out.
How to weigh the criteria.
How to prevent "gaming" of the system
How to weigh the criteria
This question falls under the domain of Multi-Criteria Decision Analysis. Your approach is the Weighted Sum Model. In any computational decision making process, ranking the criteria is often the most difficult part of the process. I suggest you take the route of pairwise comparisons: how important do you think each criterion is compared to the others? Build yourself a table like this:
c1 c2 c3 ...
c1 1 4 2
c2 1/4 1 1/2
c3 1/2 2 1
...
This shows that C1 is 4 times as important as C2 which is half as important as C3. Use a finite pool of weightings, say 1.0 since that's easy. Distributing it over the criteria we have 4 * C1 + 2 * C3 + C2 = 1 or roughly C1 = 4/7, C3 = 2/7, C2 = 1/7. Where discrepencies arise (for instance if you think C1 = 2*C2 = 3*C3, but C3 = 2*C2), that's a good error indication: it means that you're inconsistent with your relative rankings so go back and reexamine them. I forget the name of this procedure, comments would be helpful here. This is all well documented.
Now, this all probably seems a bit arbitrary to you at this point. They're for the most part numbers you pulled out of your own head. So I'd suggest taking a sample of maybe 30 articles and ranking them in the way "your gut" says they should be ordered (often you're more intuitive than you can express in numbers). Finagle the numbers until they produce something close to that ordering.
Preventing gaming
This is the second important aspect. No matter what system you use, if you can't prevent "cheating" it will ultimately fail. You need to be able to limit voting (should an IP be able to recommend a story twice?). You need to be able to prevent spam comments. The more important the criterion, the more you need to prevent it from being gamed.
You can use outlier theory for detecting anomalies. A very naive way of looking for outliers is using the mahalanobis distance. This is a measure that takes into account the spread of your data, and calculates the relative distance from the center. It can be interpreted as how many standard deviations the article is from the center. This will however include also genuinely very popular articles, but it gives you a first indication that something is odd.
A second, more general approach is building a model. You could regress the variables that can be manipulated by users against those related to editors. One would expect that users and editors would agree to some extent. If they don't, then it's again an indication something is odd.
In both cases, you'll need to define some treshold and try to find some weighting based on that. A possible approach is to use the square rooted mahalanobis distance as an inverse weight. If you're far away from the center, your score will be pulled down. Same can be done using the residuals from the model. Here you could even take the sign into account. If the editor score is lower than what would be expected based on the user score, the residual will be negative. if the editor score is higher than what would be expected based on the user score, the residual is positive and it's very unlikely that the article is gamed. This allows you to define some rules to reweigh the given scores.
An example in R:
Code :
#Test data frame generated at random
test <- data.frame(
quoted = rpois(100,12),
seen = rbinom(100,60,0.3),
download = rbinom(100,30,0.3)
)
#Create some link between user-vars and editorial
test <- within(test,{
editorial = round((quoted+seen+download)/10+rpois(100,1))
})
#add two test cases
test[101,]<-c(20,18,13,0) #bad article, hyped by few spammers
test[102,]<-c(20,18,13,8) # genuinely good article
# mahalanobis distances
mah <- mahalanobis(test,colMeans(test),cov(test))
# simple linear modelling
mod <- lm(editorial~quoted*seen*download,data=test)
# the plots
op <- par(mfrow=c(1,2))
hist(mah,breaks=20,col="grey",main="Mahalanobis distance")
points(mah[101],0,col="red",pch=19)
points(mah[102],0,,col="darkgreen",pch=19)
legend("topright",legend=c("high rated by editors","gamed"),
pch=19,col=c("darkgreen","red"))
hist(resid(mod),breaks=20,col="grey",main="Residuals model",xlim=c(-6,4))
points(resid(mod)[101],0,col="red",pch=19)
points(resid(mod)[102],0,,col="darkgreen",pch=19)
par(op)
There are any number of ways to do this, and what works for you will depend on your actual dataset and what outcomes you desire for specific articles. As a rough reworking though, I would suggest moving the times it has been read to the weighted numbers and dividing by age of the article, since the older an article is, the more likely it is to have higher numbers in each category.
For example
// x[i] = any given variable above
// w[i] = weighting for that variable
// age = days since published OR
// days since editor recommendation OR
// average of both OR
// ...
score = (x[1]w[1] + ... + x[n]w[n])/age
Your problem of wanting to promote new articles more but not wanting to punish genuinely popular old articles requires consideration of how you can tell whether or not an article is genuinely popular. Then just use the "genuine-ness" algorithm to weight the votes or views rather than a static weighting. You can also change any of the other weightings to be functions rather than constants, and then have non-linear weightings for any variables you wish.
// Fw = some non-linear function
// (possibly multi-variable) that calculates
// a sub-score for the given variable(s)
score = (Fw1(x[1]) + ... + FwN(x[n]))/FwAge(age)