I'm going to be starting a banner-rotation script soon and I'm getting a bit perplexed over how exactly to develop it. Suppose a client asks for
"10,000 impressions in the next 10 days for $10,000 dollars."
Another client asks for
"1,000 impressions for $100 dollars."
And a third asks for
"1,000 clicks or 10,000 impressions for $5,000."
How exactly do I determine which banner to show upon a page-request? How do I weigh one against another? Clearly the first request is rather important, as I'm expected to serve a set number of impressions within a time-window.
The second client is not nearly as important, as they don't care about a time-window, they just want some face-time.
And the last client wants to place an n-or-m constraint on the impressions/clicks (whichever is reached first), making matters slightly more difficult.
I'm already pretty confident that I'll need to abstract some weight from these scenarios to determine who gets the most attention. My question is what type of algorithm could handle this, and secondly how could I serve up banners by weight without always serving up the most important banner with each request?
The difficulty comes from the time constraint more than anything else. I would divide anyone's priority who did not specify a time constraint by 365 (a year), and then use time as part of the weight factor. So:
Client 1 priority: 10000/10 = 1000
Client 2 priority: 1000/365 ~ 3
Client 3 priority: 10000/365 ~30
That should give you a fairly decent indicator of priority. Now, you can't mix and match impressions and clicks, can you? They either go the impression route or the click route. Seeing as you cannot control clicks, but you can control impressions (at least, more so than clicks), I would weigh it according to impressions.
Use a random-number generator to pick which ad to show, and weight it with a priority for each ad. Set the weighting factor higher for clients that want more impressions or have a deadline. You can increase weighting factor if the time is almost up.
Once a client hits their requested impressions, drop weighting to 0 to prevent their ad from showing.
Default weighting could be 1 or so, with clients being allowed to pay extra to increase priority (without telling them the mechanics -- bill it as "premium" placement, etc).
Edit: weighting details
You can make this as simple or complex as you like, but a basic version would include the following terms:
weight is 0 if ad has reached purchased impressions/clicks
base weighting (1.0 probably)
multiply weight by impressions_remaining / TOTAL impressions remaining for all clients
add a small constant if remaining impressions/clicks is small -- ensures they get the last few ones needed to finish the account
for deadline clients: add term for (remaining impressions/purchased impressions)/(time left/total time)
The deadline clients should be capped at 90% of all page displays or something to ensure they don't outcompete others. The last term gives the "urgency" for deadline clients -- it goes to infinity as deadline hits, so you should put a condition on the remaining time piece to prevent problems with this.
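A minimal sketch of that weighting scheme (the field names, the small-remainder bonus and the clamping constants are illustrative assumptions, not a spec):

import random

def banner_weight(ad, total_remaining, now):
    # Terms from the list above: zero when filled, base 1.0, share of remaining inventory,
    # a small bonus near the end of the account, and an urgency term for deadline clients.
    remaining = ad["purchased"] - ad["served"]
    if remaining <= 0:
        return 0.0
    weight = 1.0
    weight *= remaining / max(total_remaining, 1)
    if remaining < 50:
        weight += 0.05
    if ad.get("deadline") is not None:
        time_left_fraction = max((ad["deadline"] - now) / ad["total_time"], 0.01)  # clamp near deadline
        weight += (remaining / ad["purchased"]) / time_left_fraction
    return weight

def pick_banner(ads, now):
    total_remaining = sum(max(a["purchased"] - a["served"], 0) for a in ads)
    weights = [banner_weight(a, total_remaining, now) for a in ads]
    if sum(weights) == 0:
        return None
    return random.choices(ads, weights=weights, k=1)[0]     # weighted random pick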
Microsoft Commerce Server contains a NOD algorithm
(see http://msdn.microsoft.com/en-us/library/ms960081%28v=cs.70%29.aspx
and http://msdn.microsoft.com/en-us/library/ee825423%28v=cs.10%29.aspx )
I've used derived versions of this formula in 3 different ad servers, and it turned out to work nicely under my conditions.
The basic formula regarding your situation uses a variable called NOD, short for "Need of Delivery". At any given time, the "basic" NOD formula of a banner is:
NOD = (Remaining Events / Total Events Requested) * (Total Runtime / Remaining Runtime)
Note that "Events" is a general term, which may represent impressions, clicks, conversions, etc. depending on your system.
The equation means that every banner starts its life with an initial NOD value of 1.0, because (e / e) * (t / t) = 1.0
A higher-than-1 NOD value means you are behind your schedule, while a NOD between 0 and 1 generally means that you have displayed the banner "too fast". Values between 0.9 and 1.2 are generally in acceptable range (this is not a technical range, rather a business experience).
As long as the serving ratios match duration ratios, values stay around 1.0.
For a specific ad slot, the algorithm checks the NODs of all banners targetable on the slot. Suppose you have 3 banners available on a slot, with NOD values 0.6, 1.35 and 1.05, which add up to 3.0. Then the relative probabilities of each banner being displayed become 20%, 45% and 35% respectively (e.g. 0.6 / (0.6 + 1.35 + 1.05) = 20%).
The algorithm uses a weighted probability distribution, which means that even the banner with the lowest NOD value has a chance to be displayed. While the basic formula uses this approach, business decisions generally forced me to implement algorithms favoring the urgent NOD values more than the original formula does. So I took the base NODs and multiplied them by themselves. In the same example, the probabilities become 11%, 55.5% and 33.5% respectively.
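A minimal sketch of that selection step (the banner records and the squaring tweak are illustrative; random.choices does the weighted pick):

import random

def nod(remaining_events, total_events, total_runtime, remaining_runtime):
    # NOD = (remaining events / total events) * (total runtime / remaining runtime)
    return (remaining_events / total_events) * (total_runtime / max(remaining_runtime, 1e-9))

def pick_by_nod(banners, favor_urgent=True):
    # Squaring the NODs skews traffic toward campaigns that are behind schedule.
    weights = [b["nod"] ** 2 if favor_urgent else b["nod"] for b in banners]
    return random.choices(banners, weights=weights, k=1)[0]

banners = [{"name": "A", "nod": 0.6}, {"name": "B", "nod": 1.35}, {"name": "C", "nod": 1.05}]
chosen = pick_by_nod(banners)
# With plain NODs the probabilities are ~20% / 45% / 35%; squared, ~11% / 55.5% / 33.5%.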
For your situation, you might consider changing the formula a little to serve your needs. First, to be able to compare the income you will earn by displaying a banner, you should convert all display types (impression, click, action, etc.) to a common eCPM value. Then you can use this eCPM as a multiplier in the original equation.
Calculating eCPM (effective CPM) might be tricky for not-yet-published campaigns, in this case you should use historical data.
Let me explain this part a little more: when trying to compare the probable income you will earn by "displaying" a single banner, you don't need to convert impression-based budgets. For click-based budgets, you should use the historical CTR value to estimate "how many impressions does my system need to serve to get X clicks". A more advanced algorithm might utilize "how many impressions does my system need to serve to get a click for a campaign in category X, on inventory Y".
Then your final equation becomes:
NOD = eCPM * (Remaining Events / Total Events Requested) * (Total Runtime / Remaining Runtime)
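As a sketch, the eCPM term just multiplies into the same calculation (the eCPM value itself would come from your pricing or historical data):

def nod_with_ecpm(ecpm, remaining_events, total_events, total_runtime, remaining_runtime):
    # NOD = eCPM * (remaining events / total events) * (total runtime / remaining runtime)
    return ecpm * (remaining_events / total_events) * (total_runtime / max(remaining_runtime, 1e-9))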
You can always consider using powers of eCPM to compare the results. Like my way of changing the original formula to favor more urgent campaigns, you might favor "more paying" campaigns.
I really like AlbertoPL's time-based approach, but he doesn't factor in the clicks. It's easy to demonstrate pathological cases where clicks are relevant:
Client A offers $1000 for 1 click or 10,000 impressions
Client B offers $1000 for 5000 clicks or 10,000 impressions.
Any reasonable person would give the 1-click guy higher priority. The calculation is actually pretty trivial: assume your click-through is 100 impressions per click.
Client A wants 10,000 impressions or 1 click, so we require a bare minimum of 100 impressions to get paid. At a cost of $1000 per 100 impressions, you can figure that your client is willing to pay $10/impression.
Client B wants 10,000 impressions or 5000 clicks. 5000 clicks requires 500,000 impressions, we'll clearly meet the 10,000 impression mark before then, so we assume the client is really offering to pay $1000 for 10,000 impressions, or $0.10/impression.
We maximize revenue by maximizing our $$$$$/impression, so client A takes priority. Let's use the figures provided in the OP:
Client 1:
10,000 impressions in the next 10 days for $10,000 dollars
= minimum of 10,000 impressions * $1/impression / 10 days
= $1000/day
Client 2:
1,000 impressions for $100 dollars
= minimum of 1,000 impressions * $0.10/impression / 365 days
= $0.27/day.
Client 3:
1,000 clicks or 10,000 impressions for $5000
= min(100,000 impressions to get 1,000 clicks, 10,000 impressions) = 10,000 impressions for $5000
= minimum of 10,000 impressions * $0.50/impression / 365 days
= $13.7/day.
Clients take priority based on how much they pay per day.
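A sketch of that calculation (the function name and the 100-impressions-per-click figure are my assumptions, matching the CTR guess above):

def priority(price, impressions_wanted, days=365, clicks_wanted=None, impressions_per_click=100):
    # min_needed = the cheaper goal to satisfy (impressions vs. clicks converted to impressions)
    min_needed = impressions_wanted
    if clicks_wanted is not None:
        min_needed = min(impressions_wanted, clicks_wanted * impressions_per_click)
    per_impression = price / min_needed          # $/impression, as in the client A/B example
    per_day = price / days                       # full payout earned once the cheaper goal is met
    return per_impression, per_day

print(priority(10_000, 10_000, days=10))             # (1.0, 1000.0)  -> $1000/day
print(priority(100, 1_000))                          # (0.1, ~0.27)   -> $0.27/day
print(priority(5_000, 10_000, clicks_wanted=1_000))  # (0.5, ~13.7)   -> $13.70/day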
My question is not about programming languages, but it is definitely about programming.
I have a model portfolio with shares:
Part     Code   Price, $   Number of shares in portfolio
23.80%   CSIQ   24.91      ?
18.90%   TSL    10.52      ?
11.20%   JKS    24.40      ?
10.70%   YGE     2.90      ?
35.40%   DQ     26.05      ?
I need to calculate the minimum number of shares of each stock that should be in the portfolio so that each stock's part of the portfolio equals its part in the model portfolio.
Just imagine that you want to purchase such a portfolio in the real world. How many of each stock should you buy to get the desired parts (shown in the model portfolio)? I can't buy a non-integer number of shares, and each part of the recalculated (after purchase) portfolio should equal the part in the model portfolio.
Example: I need to get portfolio with 50.0% in Google ($500 per share) and 50.0% in Apple ($700 per share). Solution is 5 shares of Apple (total value $3500) and 7 shares of Google (total value $3500).
Let us expand on the approach devised in the comments.
The first step is to choose a share to be a reference point; this can be any, so we'll go with the first one, CSIQ. Let us say then that we will purchase one share of this, so we now know that 23.8% of the portfolio is worth $24.91.
For the second share, this is now the problem we have:
Part     Code   Price, $   Number of shares in portfolio
23.80%   CSIQ   24.91      1
18.90%   TSL    10.52      ?
Since we know the value of a fraction of the portfolio, let us work out what the whole portfolio would be:
total_value = (100 / 23.8) * 24.91
= $104.663865546
That means the amount we can spend on TSL is:
tsl_value = 104.663865546 * (18.9/100)
= $19.781470588
We know how much a TSL share costs, so we must buy a non-integer amount of this share:
share_amount = 19.781470588/10.52
= 1.880367927
You can then go through each share in the same way, and end up with a portfolio in the desired ratios.
If you already own a number of shares in one stock, you can modify the algorithm but instead of starting with 1 share, you start with X shares - multiply everything by X and it will still work.
After you added the constraint that shares can only be purchased in integer amounts, I would suggest that you use the X multiplier approach above, coupled with rounding share amounts to the closest integer. As you increase X exponentially (10, 100, etc) your level of inaccuracy due to rounding will get progressively smaller.
As I suggested in the comments, you could build this in a spreadsheet first and determine the level of inaccuracy for inputs of X. Of course, if you plan to actually buy these shares, X is constrained by the amount of money you have; conversely if it is theoretical you can make it 6 or 7 figures and achieve good levels of accuracy.
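A rough sketch of that scale-and-round idea (ticker data from the question; scale plays the role of the X multiplier):

def share_counts(model, scale):
    # model: list of (ticker, target_fraction, price); buy `scale` shares of the first ticker,
    # infer the implied total portfolio value, then size and round every position from it.
    _, frac0, price0 = model[0]
    total_value = scale * price0 / frac0
    return {ticker: round(total_value * frac / price) for ticker, frac, price in model}

model = [("CSIQ", 0.238, 24.91), ("TSL", 0.189, 10.52),
         ("JKS", 0.112, 24.40), ("YGE", 0.107, 2.90), ("DQ", 0.354, 26.05)]
for scale in (1, 10, 100, 1000):
    print(scale, share_counts(model, scale))
# Larger scales shrink the relative rounding error but raise the cash needed to buy the portfolio.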
I am currently working on writing an algorithm for my new site I plan to launch soon. The index page will display the "hottest" posts at the moment.
Variables to consider are:
Number of votes
How controversial the post is (# between 0-1)
Time since post
I have come up with two possible algorithms, the first and most simple is:
controversial * (numVotesThisHour / (numVotesTotal - numVotesThisHour))
Denom = numVotesThisHour if numVotesTotal - numVotesThisHour == 0
Highest number is hottest
My other option is to use an algorithm similar to Reddit's (except that the score decreases as time goes by):
[controversial * log(x)] - (TimePassed / interval)
x = numVotesTotal if numVotesTotal >= 10, otherwise x = 10
Highest number is hottest
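In code, the two candidates look roughly like this (field names are placeholders; time is assumed to be measured in the same units as interval):

from math import log

def hotness_v1(controversial, votes_this_hour, votes_total):
    denom = votes_total - votes_this_hour
    if denom == 0:
        denom = votes_this_hour            # fallback described above
    if denom == 0:
        return 0.0                         # guard: no votes at all yet
    return controversial * (votes_this_hour / denom)

def hotness_v2(controversial, votes_total, time_passed, interval=1.0):
    x = max(votes_total, 10)
    return controversial * log(x) - time_passed / interval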
The first algorithm would allow older posts to become "hot" again in the future while the second one wouldn't.
So my question is, which one of these two algorithms do you think is more effective? Which one do you think will display the truly "hot" topics at the moment? Can you think of any advantages or disadvantages to using one over the other? I just want to make sure I don't overlook anything so that I can ensure the content is as relevant as possible. Any feedback would be great! Thanks!
Am I missing something? In the first formula you have numVotesTotal in the denominator, so a higher all-time vote count means a post will never be that hot, even if it is not that old.
For example if I have two posts - P1 and P2 (both equally controversial). Say P1 has numVotesTotal = 20, and P2 has numVotesTotal = 1000. Now in the last one hour P1 gets numVotesThisHour = 10 and P2 gets numVotesThisHour = 200.
According to the algorithm, P1 comes out hotter than P2. That doesn't make sense to me.
I think the first algorithm relies too heavily on instantaneous trend. Think of NASCAR, the current leader could be going 0 m.p.h. because he's at a pit stop. The second one uses the notion of average trend. I think both have their uses.
So consider two posts with the same total votes and controversy rating, where one post receives 20 votes in the first hour and zero in the second, while the other receives 10 in each hour. The first post will be buried by the first algorithm, but the second algorithm will rank them equally.
YMMV, but I think the 'hotness' is entirely dependent on the time frame, and not at all on the total votes unless your time frame is 'all time'. Also, it seems to me that the proportion of all votes in the relevant time frame, rather than the absolute number of them, is the important figure.
You might have several categories of hot:
Hottest this hour
Hottest this week
Hottest since your last visit
Hottest all time
So, 'Hottest in the last [whatever]' could be calculated like this:
votes_for_topic_in_timeframe / all_votes_in_timeframe
if you especially want a number between 0 and 1, (useful for comparing across categories) or, if you only want the ones in a specific timeframe, just take the votes_for_topic_in_timeframe values and sort into descending order.
If you don't want the user explicitly choosing the time frame, you may want to calculate all (say) four versions (or perhaps just the top 3), assign a multiplier to each category to give each category a relative importance, and calculate total values for each topic to take the top n. This has the advantage of potentially hiding from the user that no-one at all has voted in the last hour ;)
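A sketch of that blended score (timeframe names and multiplier values are whatever you decide to assign):

def hotness(topic_votes_by_frame, all_votes_by_frame, multipliers):
    # topic_votes_by_frame: {topic: {frame: votes}}; all_votes_by_frame: {frame: total votes}
    scores = {}
    for topic, per_frame in topic_votes_by_frame.items():
        score = 0.0
        for frame, weight in multipliers.items():
            total = all_votes_by_frame.get(frame, 0)
            if total:
                score += weight * per_frame.get(frame, 0) / total   # proportion of votes in frame
        scores[topic] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

multipliers = {"hour": 4.0, "week": 2.0, "all_time": 1.0}   # relative importance of each category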
I'm trying to develop a rating system for an application I'm working on. Basically the app allows you to rate an object from 1 to 5 (represented by stars). But of course I know that just keeping a rating count and adding each new rating to a running number is not feasible by itself.
So the first thing that came to mind was dividing the received rating by the total number of ratings given. For example, if the object receives a rating of 2 from a user and has been rated 100 times, maybe add 2/100. However, I believe this method is not good enough since 1) it is a naive approach, and 2) in order to get the number of times the object has been rated I have to do a lookup on the DB, which might end up having O(n) time complexity.
So I was wondering what alternative and possibly better ways to approach this problem?
You can keep 2 additional values in the DB - the number of times it was rated and the total sum of all ratings. This way, to update the object's rating you only need to:
Add new rating to total sum.
Divide total sum by total times it was rated.
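A minimal sketch of that update (the two stored fields, rating_sum and rating_count, are assumed names):

def add_rating(obj, new_rating):
    # Update the two stored counters and return the new average; no per-rating lookup needed.
    obj["rating_sum"] += new_rating
    obj["rating_count"] += 1
    return obj["rating_sum"] / obj["rating_count"]

item = {"rating_sum": 0, "rating_count": 0}
add_rating(item, 4)
add_rating(item, 2)   # average is now 3.0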
There are many approaches to this, but before choosing one, check:
whether all feedback givers are treated as equal, or some carry more weight than others (like a panel review, etc.)
whether the objective is to provide only an average, or some score band or the like. Consider a scenario like this website - showing a total reputation score.
And yes - if an average is to be computed, you need the total and the count of feedback and then have to compute it - that's plain maths. If you need any other method, be prepared for more compute cycles, and balance database hits against compute cycles - but that's the next stage of design. First get your requirements and approach to the solution in place.
I think you should keep separate counters for 1 star, 2 stars, and so on. To calculate the rating, you'd compute rating = (1*numOneStars + 2*numTwoStars + 3*numThreeStars + 4*numFourStars + 5*numFiveStars) / (numOneStars + numTwoStars + numThreeStars + numFourStars + numFiveStars)
This way you can, like Amazon, also show how many people voted 1 star and how many voted 5 stars...
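A sketch with per-star counters (a five-element list, index 0 = one star):

def average_from_histogram(star_counts):
    # star_counts[i] = number of (i+1)-star ratings; returns (average, total number of ratings)
    total = sum(star_counts)
    if total == 0:
        return 0.0, 0
    weighted = sum((stars + 1) * count for stars, count in enumerate(star_counts))
    return weighted / total, total

avg, n = average_from_histogram([3, 0, 2, 5, 10])   # 3 one-star ratings, ..., 10 five-star ratings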
Have you considered a vote up/down mechanism over numbers of stars? It doesn't directly solve your problem but it's worth noting that other sites such as YouTube, Facebook, StackOverflow etc all use +/- voting as it is often much more effective than star based ratings.
There's this question but it has nothing close to help me out here.
I tried to find information about it on the internet, yet this subject is so swamped with articles on "how to win" and other unrelated stuff that I could barely find anything - none of it worth posting here.
My question is how would I assure a payout of 95% over a year?
Theoretically, of course.
So far I can think of three obvious variables to consider within the calculation: Machine payout term (year in my case), total paid and total received in that term.
Now I could simply pick a random number within the paid/received gap and fix the slot results shown to the player, but I'm not sure this is how it's done.
That method sounds reasonable, although it involves building the slot results backwards...
I could also make a huge list of all possibilities, save them in a database in randomized order, and simply poll one of them each time.
This has many flaws - the biggest one is the huge list I'd end up with (millions/billions/etc. of records).
I certainly hope this question will be marked with an "Answer" (:
You have to make reel strips instead of a huge database. Here is a brief example for a very basic 3-reel game containing 3 symbols:
Paytable:
3xA = 5
3xB = 10
3xC = 20
A reel strip is a sequence of symbols on a reel. For the calculations you only need the quantity of each symbol on each reel:
A = 3, 1, 1 (3 symbols on 1st reel, 1 symbol on 2nd, 1 symbol on 3rd reel)
B = 1, 1, 2
C = 1, 1, 1
Full cycle (total number of all possible combinations) is 5 * 3 * 4 = 60
Now you can calculate probability of each combination:
3xA = 3 * 1 * 1 / full cycle = 0.05
3xB = 1 * 1 * 2 / full cycle = 0.0333
3xC = 1 * 1 * 1 / full cycle = 0.0166
Then you can calculate the return for each combination:
3xA = 5 * 0.05 = 0.25 (25% from AAA)
3xB = 10 * 0.0333 = 0.333 (33.3% from BBB)
3xC = 20 * 0.0166 = 0.333 (33.3% from CCC)
Total return = 91.66%
Finally, you can shuffle the symbols on each reel to get the reels-strips, e.g. "ABACA" for the 1st reel. Then pick a random number between 1 and the length of the strip, e.g. 1 to 5 for the 1st reel. This number is the middle symbol. The upper and lower ones are from the strip. If you picked from the edge of the strip, use the first or last one to loop the strip (it's a virtual reel). Then score the result.
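A sketch of the spin itself, using reel strips consistent with the counts above (the particular shuffles and names are just one possibility):

import random

REELS = ["ABACA", "ABC", "ABCB"]          # 3/1/1, 1/1/1 and 1/2/1 of A/B/C per reel
PAYTABLE = {"AAA": 5, "BBB": 10, "CCC": 20}

def spin(bet=1):
    middle = []
    for strip in REELS:
        i = random.randrange(len(strip))  # stop position = middle symbol on this reel
        # symbols above/below would be strip[i - 1] and strip[(i + 1) % len(strip)] (looping the strip)
        middle.append(strip[i])
    line = "".join(middle)
    return line, PAYTABLE.get(line, 0) * bet

# Averaging the payout of many spin() calls should approach the 91.66% return computed above.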
In real life you might want to have Wild-symbols, free spins and bonuses. They all are pretty complicated to describe in this answer.
In this sample the hit frequency is 10% (total combinations = 60 and prize combinations = 6). Most people use Excel to calculate this stuff; however, you may find some good tools for doing slot math.
Proper keywords for Google: PAR-sheet, "slot math can be fun" book.
For sweepstakes or Class-2 machines you can't use this stuff. You have to display a combination by the given prize instead. This is a pretty different task, so you may try to prepare a database storing the combinations sorted by the prize amount.
Well, the first problem is with the keyword assure: if you are dealing with randomness, you cannot assure anything unless you change the logic of the slot machine.
Consider the following algorithm though. I think this style of thinking is more reliable than plotting graphs of averages to achieve 95%:
if( customer_able_to_win() )
{
calculate_how_to_win();
}
else
no_win();
customer_able_to_win() is your data log that says how much intake you have gotten vs. how much you have paid out; if you are under 95% payout, then customer_able_to_win() returns true. In that case, calculate_how_to_win() calculates how much the customer would be able to win based on your %. So, let's choose a sampling period of 24 hours. If over the last 24 hours I've paid out 90% of the money I've taken in, then I can pay out up to 5% more... let's give that 5% a number, such as $100. So calculate_how_to_win says I can pay out up to $100, so I would find a set of reels that would pay out $100 or less, and that user could win. You could add a little randomness to it, but to ensure your 95% you'll have to have some other rules, such as a forced max payout if you get below say 80%, and so on.
If you change the algorithm a little by adding random to the mix you will have to have more of these caveats..... So to make it APPEAR random to the user, you could do...
if( customer_able_to_win() && payout_percent() < 90% )
{
calculate_how_to_win(); // up to 5% payout
}
else
no_win();
With something like that, it will go on a losing streak after you hit 95% until you reach 90%, then it will go on a winning streak of random increments until you reach 95%.
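A rough sketch of the bookkeeping behind customer_able_to_win() / calculate_how_to_win() (the function name, the 95% target and the running totals are the same assumptions as above):

def max_payout_this_spin(taken_in, paid_out, target=0.95):
    # taken_in / paid_out are running totals for the sampling period (e.g. the last 24 hours).
    # Returns the most the player may win on this spin; 0 forces a losing spin.
    if taken_in == 0:
        return 0.0
    headroom = taken_in * target - paid_out   # amount we can still pay and stay at/below the target
    return max(headroom, 0.0)

limit = max_payout_this_spin(10_000, 9_000)   # -> 500.0: pick a reel result paying $500 or less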
This isn't a full algorithm answer, but more of a direction on how to think about how the slot machine works.
I've always envisioned this as the way slot machines work, especially with video poker, because the no_win() function would calculate how to lose but make it appear you were 1 card off, to tease you into thinking you were going to win - instead of dealing a 'fair' game where the randomness just happens to come out that way.
Think of the entire process as: first decide whether the player is going to win; if so, decide how they are going to win; if not, decide how they are going to lose - instead of random number generators determining whether they win or not.
I worked many years ago for an internet casino in Australia, this one being the only one in the world that was regulated completely by a government body. The algorithms you speak of that produce "structured randomness" are obviously extremely complex especially when you are talking multiple lines in all directions, double up, pick the suit, multiple progressive jackpots and the like.
Our state's poker machine laws demand a payout of 97% of what goes in. For the regulator to be satisfied that our machine did this, they made us run 10 million mock plays of the machine and then wanted to see that our game paid out at what the law states, within the tiniest margin of error (we had many, many machines running a script to auto-play, simulating the clicks, for about a week before we hit the 10 million).
Anyhow, the algorithms you speak of are EXPENSIVE! They range from maybe $500k to several million per machine, so as you can understand, no one is going to hand them over for free, that's for sure. If you wanted a single-line machine it would be easy enough to do. Just work out your symbols/cards and what pay structure you want for each. Then you could just distribute those payouts amongst non-payouts until you got to your respective figure. Obviously, the more options there are, the longer it will take to pay out at that respective rate; it may even pay out more early in the piece. Hit frequency and prize size are also factors you may want to consider.
A simple way to do it, if you assume that people win a constant number of times a time period:
Create a collection of all possible tumbler combinations with how much each one pays out.
The first time someone plays, in that time period, you can offer all combinations at equal probability.
If they win, take that amount off the total left for the time period, and remove from the available options any combination that would payout more than you have left.
Repeat with the reduced combinations until all the money is gone for that time period.
Reset and start again for the next time period.
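A sketch of that shrinking-pool idea (combos is a list of (combination, payout) pairs; the generator plays one time period):

import random

def play_period(combos, budget):
    # Remove any combination that would pay more than what is left in the period's budget.
    available = list(combos)
    while available:
        combo, payout = random.choice(available)   # every remaining combination is equally likely
        yield combo, payout
        budget -= payout
        available = [(c, p) for c, p in available if p <= budget]

spins = play_period([("AAA", 50), ("BBB", 20), ("CCC", 0)], budget=60)
print(next(spins))   # losing combos (payout 0) stay available; winning ones vanish once unaffordable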
What's the rationale behind the formula used in the hive_trend_mapper.py program of this Hadoop tutorial on calculating Wikipedia trends?
There are actually two components: a monthly trend and a daily trend. I'm going to focus on the daily trend, but similar questions apply to the monthly one.
In the daily trend, pageviews is an array of number of page views per day for this topic, one element per day, and total_pageviews is the sum of this array:
# pageviews for most recent day
y2 = pageviews[-1]
# pageviews for previous day
y1 = pageviews[-2]
# Simple baseline trend algorithm
slope = y2 - y1
trend = slope * log(1.0 +int(total_pageviews))
error = 1.0/sqrt(int(total_pageviews))
return trend, error
I know what it's doing superficially: it just looks at the change over the past day (slope), and scales this by the log of 1 + total_pageviews (log(1) == 0, so this scaling factor is non-negative). It can be seen as treating the month's total pageviews as a weight, but tempered as it grows - this way, the total pageviews stop making a difference for things that are "popular enough", but at the same time big changes on insignificant topics don't get weighed as much.
But why do this? Why do we want to discount things that were initially unpopular? Shouldn't big deltas matter more for items that have a low constant popularity, and less for items that are already popular (for which the big deltas might fall well within a fraction of a standard deviation)? As a strawman, why not simply take y2-y1 and be done with it?
And what would the error be useful for? The tutorial doesn't really use it meaningfully again. Then again, it doesn't tell us how trend is used either - this is what's plotted in the end product, correct?
Where can I read up for a (preferably introductory) background on the theory here? Is there a name for this madness? Is this a textbook formula somewhere?
Thanks in advance for any answers (or discussion!).
As the in-line comment goes, this is a simple "baseline trend algorithm", which basically means that before you compare the trends of two different pages, you have to establish a baseline. In many cases the mean value is used; it's straightforward if you plot the pageviews against the time axis. This method is widely used in monitoring water quality, air pollutants, etc. to detect any significant changes w.r.t. the baseline.
In the OP's case, the slope of pageviews is weighted by the log of total_pageviews. This sort of uses total_pageviews as a baseline correction for the slope. As Simon put it, this strikes a balance between two pages with very different total_pageviews. For example, A has a slope of 500 over 1,000,000 total pageviews, while B has 1,000 over 1,000. The log basically means 1,000,000 is only twice as important as 1,000 (rather than 1,000 times). If you only consider the slope, A is less popular than B, but with the weight, the measure of popularity of A is about the same as B. I think it is quite intuitive: though A's gain is only 500 pageviews, that's because it is saturating, so you still have to give it enough credit.
As for the error, I believe it comes from the (relative) standard error, which has a factor of 1/sqrt(n), where n is the number of data points. In the code, the error is equal to (1/sqrt(n)) * (1/sqrt(mean)). It roughly translates to: the more data points, the more accurate the trend. I don't see it as an exact math formula, just a brute-force trend-analysis heuristic; anyway, the relative value is more important in this context.
In summary, I believe it's just an empirical formula. More advanced topics can be found in some biostatistics textbooks (this is very similar to monitoring the outbreak of a flu or the like).
The code implements statistics (in this case the "baseline trend"); you should educate yourself on that and everything becomes clearer. Wikibooks has a good introduction.
The algorithm takes into account that new pages are by definition more unpopular than existing ones (because - for example - they are linked from relatively few other places) and suggests that those new pages will grow in popularity over time.
error is the error margin the system expects for its prognoses. The higher error is, the more unlikely the trend will continue as expected.
The reason for moderating the measure by the volume of views is not to penalise popular pages but to make sure that you can compare large and small changes with a single measure. If you just use y2 - y1 you will only ever see the changes on large-volume pages. What this is trying to express is "significant" change: a 1,000-view change when you attract 100 views is really significant; a 1,000-view change when you attract 100,000 is less so. What this formula is trying to do is make both of these visible.
Try it out at a few different scales in Excel, you'll get a good view of how it operates.
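Or, as a quick sketch in Python (made-up numbers, roughly mirroring the example in the earlier answer):

from math import log, sqrt

def trend(slope, total_pageviews):
    # same formula as the snippet in the question
    return slope * log(1.0 + total_pageviews), 1.0 / sqrt(total_pageviews)

print(trend(500, 1_000_000))   # established page, +500 views/day  -> trend ~6908, error ~0.001
print(trend(1_000, 1_000))     # small page, +1,000 views/day      -> trend ~6909, error ~0.032
# The raw slopes differ 2x, but the log weighting puts both changes on a comparable scale,
# while the error term flags the small page's trend as much less certain.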
Hope that helps.
Another way to look at it is this:
Suppose your page and my page are created on the same day, and your page gets about ten million total views while mine gets about one million up to some point. Then suppose that at that point the slope is a million for me and 0.5 million for you. If you just use the slope, then I win - but your page already had more views per day at that point: yours was getting 5 million a day and mine 1 million, so the extra million on mine still only brings it to 2 million, while yours is at 5.5 million for that day. So maybe this scaling concept tries to adjust the results to show that your page is also good as a trend setter - its slope is smaller, but it was already more popular - and since the scaling is only a log factor, it doesn't seem too problematic to me.