Simulating amazon.com best seller for books

Simulating amazon.com best seller for books - algorithm

I was just going through amazon.com and an interesting thing that caught my eye is how they calculate best sells in books.
I was thinking of writing a sample program to calculate this. I was thinking that suppose i am calculating best sellers for the month than just sum the sales count of the individual books and show the top 10. Is it ok or am I missing something?
EDIT
One more interesting thing can happen: suppose one book having id1 was sold 10 pieces on first day but after that it has not been sold but book having id2 is getting sold for 1 or 2 pieces regularly. So how it would affect the best seller calculation. Thanks.

Sounds about right. Depends on how exactly you want to define it.
"best sellers" is the number of units sold.
Another way to do it, if you don't want to fix it to one month is to have some distribution function (like square decay, t^2) and add the counts weighted by the distribution function.
This way, even though you don't have a fixed timed window you look at both new comers and old books. Your function should look like this:
for a_book in books:
score = 0
for a_sale in sales[a_book]:
score += 1 / (days(now() - a_sale.time()) ** 2) # pow 2
I think you get the idea. You can try different functions like exp(days) or different powers. Experiment and see what makes sense for you.

Related

Rating Algorithm

I'm trying to develop a rating system for an application I'm working on. Basically app allows you to rate an object from 1 to 5(represented by stars). But I of course know that keeping a rating count and adding the rating the number itself is not feasible.
So the first thing that came up in my mind was dividing the received rating by the total ratings given. Like if the object has received the rating 2 from a user and if the number of times that object has been rated is 100 maybe adding the 2/100. However I believe this method is not good enough since 1)A naive approach 2) In order for me to get the number of times that object has been rated I have to do a look up on db which might end up having time complexity O(n)
So I was wondering what alternative and possibly better ways to approach this problem?

You can keep in DB 2 additional values - number of times it was rated and total sum of all ratings. This way to update object's rating you need only to:
Add new rating to total sum.
Divide total sum by total times it was rated.

There are many approaches to this but before that check
If all feedback givers treated at equal or some have more weight than others (like panel review, etc)
If the objective is to provide only an average or any score band or such. Consider scenario like this website - showing total reputation score
And yes - if average is to be omputed, you need to have total and count of feedback and then have to compute it - that's plain maths. But if you need any other method, be prepared for more compute cycles. balance between database hits and compute cycle but that's next stage of design. First get your requirement and approach to solution in place.

I think you should keep separate counters for 1 stars, 2 stars, ... to calcuate the rating, you'd have to compute rating = (1*numOneStars+2*numTwoStars+3*numThreeStars+4*numFourStars+5*numFiveStars)/numOneStars+numTwoStars+numThreeStars+numFourStars+numFiveStars)
This way you can, like amazon also show how many ppl voted 1 stars and how many voted 5 stars...

Have you considered a vote up/down mechanism over numbers of stars? It doesn't directly solve your problem but it's worth noting that other sites such as YouTube, Facebook, StackOverflow etc all use +/- voting as it is often much more effective than star based ratings.

algorithm for equal groups according to Parameters

I have some people's data .each people has grades for few parameters
I want to divide the peoples for N groups that will be as equals as possible in all the parameters.
the parameters are rating. for example - it is most important that parameter 1 will be
equals in the groups,the second parameter is in second priority and the last parameter is The least priority
for example :
there are 100 peoples with data like this:
people1 = ["param1"=12,"param2"=70,"param3"=6]
people2 = ["param1"=9,"param2"=79,"param3"=2]
and I want to divide the peoples to 3 groups (more or less in a same size)
that will have as most as possible equals grades
can someone help me? give idea?
thanks in advance

This post makes me think of me being kid and playing soccer games in the yard with other kids.
There were 2 captains selected, and each one chosen turn by turn one player from the pool for the team. This way teams were balanced at the end.
You sure can make an algorithm from this story, and it's super easy (even for kids :) and brings good results on large amount of data.
Only thing you need - to sort the data by "Strength" of players and divide them.

An algorithm for economic simulation?

I would like to create a game where the players create differents products with different prices (call it offers), and I give them a certain number of customers (call it demands).
Now, I want an algorithm to decid what's the market share of each players. Of course, I could just make mine right now, using random. But before doing this, I prefer to ask, because I'm sure that's a lot of people already tried to do this before me!
My question is not really precise, it's because your answer doesn't need to be precise too ;)
Thank you in advance!

It really depends on the variables you have set up, and the kind of "market" you want to create. You could start with the following simple formula below (which fundamentally reduces market share to a question of profits) as a start, and I'll go through what I mean by "kind of market" after.
marketShare = totalCompanyProfit/allMoneyInProductCategory;
marketShare = ( (productSalePrice * demand)-(productManufactureCost * supply) ) / allMoneyInProductCategory;
It gets interesting here because the "kind of market" is determined by your definition of demand. For example, say the product was ferraris, and the market you were trying to simulate was the Republic of Congo, which had a GDP of $189/capita.
targetMarketSize = (percentOfFittingDemographic * totalPopulation)
percentWhoHateYourProduct = AVERAGE( ( ABS(productVariable1 - variableIdeal1) / variableIdeal1 ), ( ABS(productVariable2 - variableIdeal2) / variableIdeal2 ), etc )
demand = (targetMarketSize) * ( 1- percentWhoHateYourProduct )
percentOfFittingDemographic is the percent of the population which fits into the demographic which would buy such a product (i.e people with enough disposable income to afford $100,000 car), which in the Congo example above could be something like .001 .
The average of the absolute value of the difference of certain product attributes (productVariable) from their ideals (variableIdeals) over their ideals give the % of the population which are going to be turned off by the product not being what they want. Subtracting that from 1 gives the percent of people who DO want to buy your product, and multiplying that by targetMarketSize gives you the people who want to buy your product- ie demand. If the product is perfect, it becomes the average of 0's, and the whole target market becomes a user of the product.
One could also add weighting to the average to say, for example, that the market prefers a lower price over a bigger screen size. To imply that more of one attribute increases the desire for the product in a population (i.e instead of "one month of free service", you give away "6 months of free service", and people want it more), you could add it into the average with
percentLikesProductNow = 1 - e^(-1 * infinitelyLikedAttribute)
This goes from 0% at infinitelyLikedAttribute=0, to about 0.005% at infintelyLikedAttribute=10, so you could play around and find a way to "scale" that attribute to roughly be between 1 and 10. This does sort of make sense with real life, because there are products I would never have bought if they didnt have a free trial. For example: 3 free months of verizon internet. I would have probably gone with comcast otherwise, as I only was living there for 6 months, but saving 100 bucks was pretty big at the time. At the other extreme however, if verizon were to offer me 100 free years of internet, another 50 extra years on top of that (assuming it's not transferable, etc) really doesn't add much more to the attractiveness of the offer.
You can always multiply all of these things with a random number generator as well, to maybe give it a +/- 15% variance, and keep everyone guessing :)
I hope this was even remotely useful :)

Converting preferences to ratings

Suppose I have a list of (e.g.) restaurants. A lot of users get a list of pairs of restaurants, and select the one of the two they prefer (a la hotornot).
I would like to convert these results into absolute ratings: For each restaurant, 1-5 stars (rating can be non-integer, if necessary).
What are the general ways to go with this problem?
Thanks

I would consider each pairwise decision as a vote in favor of one of the restaurants, and each non-preferred partner as a downvote. Count the votes across all users and restaurants, and then sort cluster them equally (so that that each star "weighs" for a number of votes).

Elo ratings come to mind. It's how the chess world computes a rating from your win/loss/draw record. Losing a matchup against an already-high-scoring restaurant gets penalized less than against a low-scoring one, a little like how PageRank cares more about a link from a website it also ranks highly. There's no upper bound to your possible score; you'd have to renormalize somehow for a 1-5 star system.

Algorithm to determine most popular article last week, month and year?

I'm working on a project where I need to sort a list of user-submitted articles by their popularity (last week, last month and last year).
I've been mulling on this for a while, but I'm not a great statitician so I figured I could maybe get some input here.
Here are the variables available:
Time [date] the article was originally published
Time [date] the article was recommended by editors (if it has been)
Amount of votes the article has received from users (total, in the last week, in the last month, in the last year)
Number of times the article has been viewed (total, in the last week, in the last month, in the last year)
Number of times the article has been downloaded by users (total, in the last week, in the last month, in the last year)
Comments on the article (total, in the last week, in the last month, in the last year)
Number of times a user has saved the article to their reading-list (Total, in the last week, in the last month, in the last year)
Number of times the article has been featured on a kind of "best we've got to offer" (editorial) list (Total, in the last week, in the last month, in the last year)
Time [date] the article was dubbed 'article of the week' (if it has been)
Right now I'm doing some weighting on each variable, and dividing by the times it has been read. That's pretty much all I could come up with after reading up on Weighted Means. My biggest problem is that there are some user-articles that are always on the top of the popular-list. Probably because the author is "cheating".
I'm thinking of emphasizing the importance of the article being relatively new, but I don't want to "punish" articles that are genuinely popular just because they're a bit old.
Anyone with a more statistically adept mind than mine willing to help me out?
Thanks!

I think the weighted means approach is a good one. But I think there are two things you need to work out.
How to weigh the criteria.
How to prevent "gaming" of the system
How to weigh the criteria
This question falls under the domain of Multi-Criteria Decision Analysis. Your approach is the Weighted Sum Model. In any computational decision making process, ranking the criteria is often the most difficult part of the process. I suggest you take the route of pairwise comparisons: how important do you think each criterion is compared to the others? Build yourself a table like this:
c1 c2 c3 ...
c1 1 4 2
c2 1/4 1 1/2
c3 1/2 2 1
...
This shows that C1 is 4 times as important as C2 which is half as important as C3. Use a finite pool of weightings, say 1.0 since that's easy. Distributing it over the criteria we have 4 * C1 + 2 * C3 + C2 = 1 or roughly C1 = 4/7, C3 = 2/7, C2 = 1/7. Where discrepencies arise (for instance if you think C1 = 2*C2 = 3*C3, but C3 = 2*C2), that's a good error indication: it means that you're inconsistent with your relative rankings so go back and reexamine them. I forget the name of this procedure, comments would be helpful here. This is all well documented.
Now, this all probably seems a bit arbitrary to you at this point. They're for the most part numbers you pulled out of your own head. So I'd suggest taking a sample of maybe 30 articles and ranking them in the way "your gut" says they should be ordered (often you're more intuitive than you can express in numbers). Finagle the numbers until they produce something close to that ordering.
Preventing gaming
This is the second important aspect. No matter what system you use, if you can't prevent "cheating" it will ultimately fail. You need to be able to limit voting (should an IP be able to recommend a story twice?). You need to be able to prevent spam comments. The more important the criterion, the more you need to prevent it from being gamed.

You can use outlier theory for detecting anomalies. A very naive way of looking for outliers is using the mahalanobis distance. This is a measure that takes into account the spread of your data, and calculates the relative distance from the center. It can be interpreted as how many standard deviations the article is from the center. This will however include also genuinely very popular articles, but it gives you a first indication that something is odd.
A second, more general approach is building a model. You could regress the variables that can be manipulated by users against those related to editors. One would expect that users and editors would agree to some extent. If they don't, then it's again an indication something is odd.
In both cases, you'll need to define some treshold and try to find some weighting based on that. A possible approach is to use the square rooted mahalanobis distance as an inverse weight. If you're far away from the center, your score will be pulled down. Same can be done using the residuals from the model. Here you could even take the sign into account. If the editor score is lower than what would be expected based on the user score, the residual will be negative. if the editor score is higher than what would be expected based on the user score, the residual is positive and it's very unlikely that the article is gamed. This allows you to define some rules to reweigh the given scores.
An example in R:
Code :
#Test data frame generated at random
test <- data.frame(
quoted = rpois(100,12),
seen = rbinom(100,60,0.3),
download = rbinom(100,30,0.3)
)
#Create some link between user-vars and editorial
test <- within(test,{
editorial = round((quoted+seen+download)/10+rpois(100,1))
})
#add two test cases
test[101,]<-c(20,18,13,0) #bad article, hyped by few spammers
test[102,]<-c(20,18,13,8) # genuinely good article
# mahalanobis distances
mah <- mahalanobis(test,colMeans(test),cov(test))
# simple linear modelling
mod <- lm(editorial~quoted*seen*download,data=test)
# the plots
op <- par(mfrow=c(1,2))
hist(mah,breaks=20,col="grey",main="Mahalanobis distance")
points(mah[101],0,col="red",pch=19)
points(mah[102],0,,col="darkgreen",pch=19)
legend("topright",legend=c("high rated by editors","gamed"),
pch=19,col=c("darkgreen","red"))
hist(resid(mod),breaks=20,col="grey",main="Residuals model",xlim=c(-6,4))
points(resid(mod)[101],0,col="red",pch=19)
points(resid(mod)[102],0,,col="darkgreen",pch=19)
par(op)

There are any number of ways to do this, and what works for you will depend on your actual dataset and what outcomes you desire for specific articles. As a rough reworking though, I would suggest moving the times it has been read to the weighted numbers and dividing by age of the article, since the older an article is, the more likely it is to have higher numbers in each category.
For example
// x[i] = any given variable above
// w[i] = weighting for that variable
// age = days since published OR
// days since editor recommendation OR
// average of both OR
// ...
score = (x[1]w[1] + ... + x[n]w[n])/age
Your problem of wanting to promote new articles more but not wanting to punish genuinely popular old articles requires consideration of how you can tell whether or not an article is genuinely popular. Then just use the "genuine-ness" algorithm to weight the votes or views rather than a static weighting. You can also change any of the other weightings to be functions rather than constants, and then have non-linear weightings for any variables you wish.
// Fw = some non-linear function
// (possibly multi-variable) that calculates
// a sub-score for the given variable(s)
score = (Fw1(x[1]) + ... + FwN(x[n]))/FwAge(age)

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio