Finding similarities in a multidimensional array - algorithm

Consider a sales department that sets a sales goal for each day. The total goal isn't important, but the overage or underage is. For example, if Monday of week 1 has a goal of 50 and we sell 60, that day gets a score of +10. On Tuesday, our goal is 48 and we sell 46 for a score of -2. At the end of the week, we score the week like this:
[0,0]=10,[0,1]=-2,[0,2]=1,[0,3]=7,[0,4]=6
In this example, Monday (0,0), Thursday (0,3), and Friday (0,4) are "hot".
If we look at the results from week 2, we see:
[1,0]=-4,[1,1]=2,[1,2]=-1,[1,3]=4,[1,4]=5
For week 2, the end of the week is hot, and Tuesday is warm.
Next, if we compare weeks one and two, we see that the end of the week tends to be better than the first part of the week. So, now let's add weeks 3 and 4:
[0,0]=10,[0,1]=-2,[0,2]=1,[0,3]=7,[0,4]=6
[1,0]=-4,[1,1]=2,[1,2]=-1,[1,3]=4,[1,4]=5
[2,0]=-8,[2,1]=-2,[2,2]=-1,[2,3]=2,[2,4]=3
[3,0]=2,[3,1]=3,[3,2]=4,[3,3]=7,[3,4]=9
From this, we see that the "end of the week is better" theory holds true. But we also see that the end of the month is better than the start. Of course, we would next want to compare this month with the next, or compare a group of months for quarterly or annual results.
I'm not a math or stats guy, but I'm pretty sure there are algorithms designed for this type of problem. Since I don't have a math background (and don't remember any algebra from my earlier days), where would I look for help? Does this type of "hotspot" logic have a name? Are there formulas or algorithms that can slice and dice and compare multidimensional arrays?
Any help, pointers or advice is appreciated!

This data isn't really multidimensional; it's just a simple time series, and there are many ways to analyse it. I'd suggest you start with the Fourier transform: it detects "rhythms" in a series, so this data would show a spike at 7 days and another around thirty, and if you extended the data set to a few years it would show a one-year spike for seasons and holidays. That should keep you busy for a while, until you're ready for real multidimensional data, say by adding in weather information, stock market data, results of recent sports events and so on.
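As a minimal sketch of the idea (numpy assumed; the series is the question's four weeks of scores, treated as consecutive days):

# Minimal sketch: detect dominant periods in a score series with an FFT.
# The data below is the 20 scores from the question, one per day.
import numpy as np

daily_scores = np.array([10, -2, 1, 7, 6, -4, 2, -1, 4, 5,
                         -8, -2, -1, 2, 3, 2, 3, 4, 7, 9], dtype=float)

detrended = daily_scores - daily_scores.mean()   # remove the DC component
spectrum = np.abs(np.fft.rfft(detrended))        # magnitude per frequency
freqs = np.fft.rfftfreq(len(detrended), d=1.0)   # cycles per day

# Skip the zero-frequency bin, then report the strongest period in days.
peak = np.argmax(spectrum[1:]) + 1
print("strongest period: %.1f days" % (1.0 / freqs[peak]))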

The following might be relevant to you: Stochastic oscillators in technical analysis, which are used to determine whether a stock has been overbought or oversold.
I'm oversimplifying here, but essentially you have two moving calculations:
14-day stochastic: 100 * (today's closing price - low of last 14 days) / (high of last 14 days - low of last 14 days)
3-day stochastic: same calculation, but relative to 3 days.
The 14-day and 3-day stochastics will tend to follow the same curve. Because of the factor of 100, each stochastic falls between 0 and 100; readings above 80 are considered overbought (bearish), while readings below 20 indicate oversold (bullish). More specifically, when the 3-day stochastic "crosses" the 14-day stochastic in one of those regions, you have a predictor of price momentum.
Although some people consider technical analysis to be voodoo, empirical evidence indicates that it has some predictive power. For what it's worth, a stochastic is a very easy and efficient way to visualize the momentum of prices over time.
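As a rough illustration, both oscillators are a few lines with pandas; the closes series here is made-up sample data, not real prices:

# Rough sketch of the stochastic calculation (pandas assumed).
# `closes` is illustrative data; 14 and 3 are the windows from above.
import pandas as pd

closes = pd.Series([48, 46, 51, 50, 53, 49, 47, 52,
                    55, 54, 56, 53, 57, 58, 60, 59])

def stochastic(prices, window):
    low = prices.rolling(window).min()    # low of the last `window` days
    high = prices.rolling(window).max()   # high of the last `window` days
    return 100 * (prices - low) / (high - low)

k14 = stochastic(closes, 14)  # NaN until 14 days of history exist
k3 = stochastic(closes, 3)
print(pd.DataFrame({"14-day": k14, "3-day": k3}).tail())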

It seems to me that an OLAP approach (like pivot tables in MS Excel) fits the problem perfectly.

What you want to do is quite simple - you just have to calculate the autocorrelation of your data and look at the correlogram. From the correlogram you can see the 'hidden' periods in your data, and then you can use this information to analyze them.
Here is the result - your numbers and their normalized autocorrelation.
10  1.000
-2  0.097
1  -0.121
7  0.084
6  0.098
-4  0.154
2  -0.082
-1  -0.550
4  -0.341
5  -0.027
-8  -0.165
-2  -0.212
-1  -0.555
2  -0.426
3  -0.279
2  0.195
3  0.000
4  -0.795
7  -1.000
9
I used Excel to get the values. Put the sequence in column A, add the formula =CORREL($A$1:$A$20;$A1:$A20) to cell B1, and copy it down to B19. If you then add a line diagram, you can see the structure of the data nicely.
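The same computation is a few lines of Python with numpy. Note this sketch uses the standard overlapping-window normalization, so the exact values won't match the CORREL column above:

# Sketch: autocorrelation of the score series at increasing lags.
import numpy as np

scores = np.array([10, -2, 1, 7, 6, -4, 2, -1, 4, 5,
                   -8, -2, -1, 2, 3, 2, 3, 4, 7, 9], dtype=float)

def autocorr(x, lag):
    # Pearson correlation of the series with itself shifted by `lag`.
    if lag == 0:
        return 1.0
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

for lag in range(8):
    print("lag %d: %+.3f" % (lag, autocorr(scores, lag)))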

You can already make reasonable guesses about the periods of patterns -- you're looking at things like weekly and monthly cycles. To look for weekly patterns, for example, just average all the Mondays together, and so on. The same goes for days of the month and for months of the year.
Sure, you could use a complex algorithm to find out that there's a weekly pattern, but you already know to expect that. If you think there really may be patterns buried there that you'd never suspect (there's a strange community of people who use a 5-day week and frequent your business), by all means, use a strong tool -- but if you know what kinds of things to look for, there's really no need.
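For instance, a sketch of that averaging over the four sample weeks (plain Python, 5-day weeks as in the question):

# Average scores by position in the week to expose a weekly pattern.
weeks = [
    [10, -2, 1, 7, 6],
    [-4, 2, -1, 4, 5],
    [-8, -2, -1, 2, 3],
    [2, 3, 4, 7, 9],
]

day_names = ["Mon", "Tue", "Wed", "Thu", "Fri"]
for day, name in enumerate(day_names):
    avg = sum(week[day] for week in weeks) / len(weeks)
    print("%s: %+.2f" % (name, avg))

On the sample data this puts Thursday (+5.00) and Friday (+5.75) well above the earlier days.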

Daniel has the right idea in suggesting correlation, but I don't think autocorrelation is what you want. Instead, I would suggest correlating each week with each other week. Peaks in your correlation -- that is, values close to 1 -- suggest that the values of the weeks resemble each other (i.e., are periodic) for that particular shift.
For example, when you cross-correlate
0 0 1 2 0 0
with
0 0 0 1 1 0
the result (one value per circular right shift of 0 through 5) would be
2 0 0 0 1 3
The highest value is 3, which corresponds to shifting the second array right by 5 (with wraparound)
0 0 0 1 1 0 --> 0 0 1 1 0 0
and then multiplying component-wise
0 0 1 2 0 0
0 0 1 1 0 0
----------------------
0 + 0 + 1 + 2 + 0 + 0 = 3
Note that you can also create your own "fake" week and cross-correlate it against all your real weeks; the idea is that you are looking for weekly "shapes" that match the shape of your fake week, by looking for peaks in the correlation result.
So if you are interested in finding weeks that are close near the end of the week you could use the "fake" week
-1 -1 -1 -1 1 1
and if you get a high response in the first value of the correlation, it means the real week you correlated against has roughly this shape.
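Here's a small numpy sketch of the circular cross-correlation described above (np.roll handles the wraparound shifting):

# Circular cross-correlation of the two example arrays.
import numpy as np

a = np.array([0, 0, 1, 2, 0, 0])
b = np.array([0, 0, 0, 1, 1, 0])

# c[k] = dot(a, b shifted right by k with wraparound)
n = len(a)
c = np.array([np.dot(a, np.roll(b, k)) for k in range(n)])
print(c)  # [2 0 0 0 1 3] -- the peak marks where the two "shapes" line up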

This is probably beyond the scope of what you're looking for, but one technical approach that would give you the ability to do forecasting, look at things like statistical significance, etc., would be ARIMA or similar Box-Jenkins models.

Related

Does Cognos Framework Manager have the "Last" function like Dynamic Cubes in Cognos?

I was wondering whether Cognos Framework Manager has a built-in "Last" function like Dynamic Cubes do.
Or does someone know how to model the following case?
We have two dimensions -- a time dimension with year, half-year, quarter and month, and another dimension that categorises people by how long they have been attending a project (1-30 days, 31-60 days, 60-180, 180-365, 1-2 years, +2 years). However, the choice of the time dimension level (year, half-year etc.) influences the categorization in the other dimension.
An example:
A person attends a project starting 15.11.2018 and ending 30.06.2020. The Cognos user picks the year level of the time dimension, so 2018, 2019 & 2020 will be displayed.
For 2018 the person will be in the category 31-60 days, since 46 days have passed by 31.12.2018. For 2019 the person will be listed in the category 1-2 years, as 46 + 365 days will have passed by 31.12.2019. For 2020 the person will also be in that category, as 46 + 365 + 180 days will have gone by.
The categories will change if the user selects another time dimension level e.g. half-years:
2nd HY 2018: 31-60 (46 days passed)
1st HY 2019: 180-365 days (46 + 180 --> End of HY2019)
2nd HY 2019: 1-2 years (46 + 180 + 180)
1st HY 2020: 1-2 years (46 + 180 + 180 + 180)
Does someone know how to model dynamic dimension categories based on selection of another dimension (here time dimension)?
The fact table contains monthly data, so for the person mentioned above there will be 20 separate records (one for each month between November 2018 and June 2020).
For any period, a person may or may not be working on a project.
Without knowing exactly what your data and metadata are, it is difficult to prescribe an exact solution, but the approach would probably be similar to a degenerate dimension scenario.
You would want to model the project dimension as a fact as well as a dimension, with relationships between it and time and whatever other dimensions you need.
Depending on the data and the metadata you might need to do some gymnastics to get there.
If the data were in a form similar to the following, it would not be too difficult. This example should give you an idea of some ways of approaching the problem.
Date_Key  Person_Key  Project_Key  commitment_status (the last column is the measure)
20200101 1 1 1
20200101 1 2 0
20200101 1 3 0
20200102 1 1 1
20200102 1 2 0
20200102 1 3 0
20200103 1 1 0
20200103 1 2 1
20200103 1 3 0
In the above, person 1 was working on project 1 for 2 days and was then put onto project 2 for a day. By aggregating commitment_status (done by setting its aggregate rule property), you can determine the number of days a person has been working on a project no matter what time period is set in the query.
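Cognos handles this through the aggregate rule, but just to make the arithmetic concrete, here is a toy Python sketch over the sample rows (illustrative only):

# Toy sketch: summing commitment_status per (person, project) counts
# the days committed, which is what the aggregate rule computes.
from collections import defaultdict

rows = [
    # (date_key, person_key, project_key, commitment_status)
    (20200101, 1, 1, 1), (20200101, 1, 2, 0), (20200101, 1, 3, 0),
    (20200102, 1, 1, 1), (20200102, 1, 2, 0), (20200102, 1, 3, 0),
    (20200103, 1, 1, 0), (20200103, 1, 2, 1), (20200103, 1, 3, 0),
]

days = defaultdict(int)
for date_key, person, project, status in rows:
    days[(person, project)] += status

print(dict(days))  # {(1, 1): 2, (1, 2): 1, (1, 3): 0}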

Is there an algorithm to update a current rating with a new one?

I have an old rating from the DB and a new rating from the user.
I tried searching for "rating algorithm", but the results all assume you save the ratings per user. In my case I don't save the previous ratings. My rating bar goes up to a max of 5.
Currently my solution is (oldR + newR) / 2. Does this make sense?
Not really. When you think about it, such a formula would mean that newer votes are weighted much more than old ones. Imagine a sequence of votes like this:
Vote: 1 1 1 1 1 1 1 1 1 1 1 5
Rating: 1 1 1 1 1 1 1 1 1 1 1 3
Clearly the rating should not be 3 in this case, it should still be 1 (or at worst 2), but with your formula it will be.
At the very least you should store the number of votes as well as the average rating, allowing you to calculate newR = ((oldR*votesCast)+newVote)/(votesCast+1). This also requires storing the rating with higher precision, not just as an integer. (You can round it off for display, but internally you should keep track of the fractions too.)
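A minimal sketch of that update rule in Python, using the vote sequence from the example above:

# Running-average rating update: each vote counts equally.
def update_rating(old_rating, votes_cast, new_vote):
    # Keep the average as a float; round only for display.
    return (old_rating * votes_cast + new_vote) / (votes_cast + 1)

rating, votes = 0.0, 0
for vote in [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5]:
    rating = update_rating(rating, votes, vote)
    votes += 1

print(round(rating, 2))  # 1.33 -- one 5 barely moves eleven 1s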
A slightly better solution is to separately store how many votes have been cast for the 5 different ratings so far, allowing you to calculate different kinds of means (geometric for example).
And obviously the most flexible (but most storage and computation intensive) is to store each individual vote with user id and timestamp, allowing you to use any algorithm you can think of.

Recommendation system and baseline predictors

I have a bunch of data where the first column represents users, the second column represents movies, and the third is a ten-point rating.
0 0 9
0 1 8
1 1 4
1 2 6
2 2 7
And I have to predict the third number for another set of data (user, movie, ?):
0 2
1 0
2 0
2 1
I used this video for finding the bias values, https://youtube.com/watch?v=dGM4bNQcVKI, and this one for predicting, https://www.youtube.com/watch?v=4RSigTais8o.
Bias value for user number 0: (9 + 8) / 2 = 8.5, and 8.5 - 1.5 = 7.
Bias value for movie number 2: (6 + 7) / 2 = 6.5, and 6.5 - 1.5 = 5.
And baseline predictors:
1.5 + 7 + 5 = 13.5, but the contest result is 7.052009.
But the problem description says the result of my Recommendation system should be:
0 2 7.052009
1 0 6.687943
2 0 6.995272
2 1 6.687943
Where is my mistake?
The raw average is the average of ALL the present scores: (9+8+4+6+7) / 5 = 6.8. I don't see that number anywhere in your work, so I guess that's your error.
In the video, the professor used a raw average of 3.5 in all the calculations, including the bias calculations. He skipped over how to reach that number, but if you add up all the numbers in the video's table and divide, you get 3.5.
Using your videos as a guide, 0 2 8.2 is the answer for the first one (see the calculation below). The videos claim to have avoided calculus; the different final answers of the contest probably come from using the "full" method.
0 2 ?, user 0 (row 0: 9 8 x), movie 2 (column 2: x 6 7)
raw average = 6.8
bias user 0: (9+8) / 2 - 6.8 = 1.7
bias movie 2: (6+7) / 2 - 6.8 = -0.3
prediction: 6.8+1.7-0.3 = 8.2
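In code, the corrected arithmetic looks like this (a plain-Python sketch of the baseline predictor over the question's known ratings):

# Baseline predictor: raw average plus user bias plus movie bias.
ratings = {(0, 0): 9, (0, 1): 8, (1, 1): 4, (1, 2): 6, (2, 2): 7}

raw_avg = sum(ratings.values()) / len(ratings)  # 6.8

def user_bias(u):
    vals = [r for (uu, m), r in ratings.items() if uu == u]
    return sum(vals) / len(vals) - raw_avg

def movie_bias(m):
    vals = [r for (u, mm), r in ratings.items() if mm == m]
    return sum(vals) / len(vals) - raw_avg

# Prediction for user 0, movie 2: 6.8 + 1.7 - 0.3 = 8.2
print("%.1f" % (raw_avg + user_bias(0) + movie_bias(2)))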
The problem looks like a variation of the Netflix contest. The contest's host knows the actual answers (the ratings) but doesn't give them to you; you are expected to guess/predict them, and the winner is whoever gets closest to the actual answers.
The winner of your contest got the closest, but he got there using an unknown method, or his own variation of a known method. If your goal is to match his answer exactly, you are better off asking him what method he used and how he modified it, and then trying to replicate his results.
If this were homework and not a contest, the teacher would expect you to use the "correct" method he taught you (there's no single set method, just many methods that work with different accuracy), exactly as taught. But it is a contest, so your goal is to find a base method that approximates well (the one you used is very low on accuracy) and tinker with it to get even better results.
If you want to go deeper, I suggest you research the topic and later ask a statistics question, because this is plain statistics; matrix factorization is a good thing to research on your own. Remember that to get contest-winning results (or close), you won't be able to use a simple method like the one in the YouTube videos; you'll need a method with a lot more math.

slot machine payout calculation

There's this question, but it has nothing close to helping me out here.
I tried to find information about this on the internet, yet the subject is so swarmed with articles on "how to win" and other unrelated stuff that I could barely find anything, and none of it worth posting here.
My question is how would I assure a payout of 95% over a year?
Theoretically, of course.
So far I can think of three obvious variables to consider within the calculation: Machine payout term (year in my case), total paid and total received in that term.
Now, I could simply pick a random number in the paid/received gap and fix the slot results shown to the player, but I'm not sure this is how it's done.
That method sounds reasonable, though it involves building the slot results backwards.
I could also make a huge list of all possibilities, save them in a database in randomized order, and simply poll one of them each time.
This has many flaws -- the biggest being the sheer size of the list (millions/billions of records).
I certainly hope this question will be marked with an "Answer" (:
You have to make reel strips instead of a huge database. Here is a brief example for a very basic 3-reel game containing 3 symbols:
Paytable:
3xA = 5
3xB = 10
3xC = 20
A reel strip is the sequence of symbols on a reel. For the calculations, you only need the quantity of each symbol on each reel:
A = 3, 1, 1 (3 symbols on 1st reel, 1 symbol on 2nd, 1 symbol on 3rd reel)
B = 1, 1, 2
C = 1, 1, 1
Full cycle (total number of all possible combinations) is 5 * 3 * 4 = 60
Now you can calculate probability of each combination:
3xA = 3 * 1 * 1 / full cycle = 0.05
3xB = 1 * 1 * 2 / full cycle = 0.0333
3xC = 1 * 1 * 1 / full cycle = 0.0166
Then you can calculate the return for each combination:
3xA = 5 * 0.05 = 0.25 (25% from AAA)
3xB = 10 * 0.0333 = 0.333 (33.3% from BBB)
3xC = 20 * 0.0166 = 0.333 (33.3% from CCC)
Total return = 91.66%
Finally, you can shuffle the symbols on each reel to get the reel strips, e.g. "ABACA" for the 1st reel. Then pick a random number between 1 and the length of the strip, e.g. 1 to 5 for the 1st reel; this number gives the middle symbol, and the ones above and below it come from the strip. If you pick from the edge of the strip, wrap around to the first or last symbol (it's a virtual reel). Then score the result.
In real life you might want to have Wild symbols, free spins and bonuses. They are all too complicated to describe in this answer.
In this sample the hit frequency is 10% (total combinations = 60 and prize combinations = 6). Most people use Excel to calculate this stuff; however, you may find some good dedicated tools for slot math.
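A short script does the same job as the spreadsheet; here is a Python sketch using the paytable and reel counts above:

# Return-to-player (RTP) from the reel counts and paytable above.
paytable = {"A": 5, "B": 10, "C": 20}
# symbol -> count on each of the three reels
counts = {"A": (3, 1, 1), "B": (1, 1, 2), "C": (1, 1, 1)}

reel_sizes = [sum(counts[s][r] for s in counts) for r in range(3)]
full_cycle = reel_sizes[0] * reel_sizes[1] * reel_sizes[2]  # 5 * 3 * 4 = 60

rtp = 0.0
for symbol, prize in paytable.items():
    ways = counts[symbol][0] * counts[symbol][1] * counts[symbol][2]
    rtp += prize * ways / full_cycle

print("total return: %.2f%%" % (rtp * 100))  # the ~91.66% figure above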
Proper keywords for Google: PAR-sheet, "slot math can be fun" book.
For sweepstakes or Class-2 machines you can't use this stuff. You have to display a combination for a given prize instead. This is a pretty different task, so you may want to prepare a database storing the combinations sorted by prize amount.
Well, the first problem is with the keyword assure: if you are dealing with randomness, you cannot assure anything unless you change the logic of the slot machine.
Consider the following algorithm, though. I think this style of thinking is more reliable than plotting graphs of averages to achieve 95%:
if (customer_able_to_win())
{
    calculate_how_to_win();
}
else
{
    no_win();
}
customer_able_to_win() consults your data log of how much you have taken in vs. how much you have paid out; if you are under the 95% payout target, it returns true. In that case, calculate_how_to_win() works out how much the customer can win based on your percentage. Say we choose a sampling period of 24 hours: if over the last 24 hours I've paid out 90% of the money I've taken in, then I can pay out up to 5% more -- suppose that 5% comes to $100. calculate_how_to_win() says I can pay out up to $100, so I would find a set of reels that pays out $100 or less, and that user could win. You could add a little randomness to it, but to ensure your 95% you'll need some other rules, such as a forced max payout if you drop below, say, 80%, and so on.
If you add randomness to the mix, you will need more of these caveats. So, to make it APPEAR random to the user, you could do:
if (customer_able_to_win() && payout_percent() < 90%)
{
    calculate_how_to_win(); // up to 5% payout
}
else
{
    no_win();
}
With something like that, it will go on a losing streak after you hit 95% until you reach 90%, then it will go on a winning streak of random increments until you reach 95%.
This isn't a full algorithm answer, but more of a direction on how to think about how the slot machine works.
I've always envisioned this as the way slot machines work, especially video poker: the no_win() function would calculate how to lose, but make the result appear to be one card off to tease you into thinking you were about to win, instead of dealing a 'fair' game where the randomness just happens to fall that way.
Think of the entire process as: first decide whether the player is going to win; if they are, decide how they win; if not, decide how they lose -- instead of a random number generator determining whether they win.
I worked many years ago for an internet casino in Australia, at the time the only one in the world regulated completely by a government body. The algorithms you speak of that produce "structured randomness" are extremely complex, especially once you are talking about multiple lines in all directions, double-ups, pick-the-suit, multiple progressive jackpots and the like.
Our state's poker machine laws demand a payout of 97% of what goes in. For the regulator to be satisfied that our machine did this, they made us run 10 million mock turns of the machine and then wanted to see that our game paid out what the law states within the tiniest margin of error (we had many, many machines auto-playing via a script that simulated the clicks for about a week before we hit the 10 million).
Anyhow, the algorithms you speak of are EXPENSIVE! They range from maybe $500k to several million per machine, so as you can understand, no one is going to hand them over for free. If you wanted a single-line machine, it would be easy enough to do: work out your symbols/cards and the pay structure you want for each, then distribute those payouts amongst non-payouts until you reach your target figure. Obviously, the more options there are, the longer it will take to pay out at that respective rate; it may even pay out more early on. Hit frequency and prize size are also factors you may want to consider.
A simple way to do it, if you assume that people win a constant number of times per time period (a rough sketch in code follows these steps):
Create a collection of all possible tumbler combinations with how much each one pays out.
The first time someone plays, in that time period, you can offer all combinations at equal probability.
If they win, subtract that amount from the total left for the time period, and remove from the available options any combination that would pay out more than you have left.
Repeat with the reduced combinations until all the money is gone for that time period.
Reset and start again for the next time period.
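Here's a toy Python sketch of that budget-deplete idea (the combination list and budget figure are illustrative only):

# Draw uniformly from the combinations that still fit the remaining budget.
import random

# (combination label, payout) -- illustrative values only
combos = [("AAA", 5), ("BBB", 10), ("CCC", 20), ("miss", 0), ("miss", 0)]

def spin(budget):
    affordable = [c for c in combos if c[1] <= budget]
    label, payout = random.choice(affordable)
    return label, payout, budget - payout

budget = 25  # whatever 95% of the period's intake works out to
for _ in range(5):
    label, payout, budget = spin(budget)
    print(label, payout, "remaining:", budget)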

How to rank stories based on "controversy"?

I'd like to rank my stories by a "controversy" quotient. For example, reddit.com currently has a "controversial" section: http://www.reddit.com/controversial/
When a story has a lot of up votes and a lot of down votes, it's controversial even though its total score may be 0, for example. How should I calculate a quotient score that captures when many people are voting both up and down?
Thanks!!!
Nick
I would recommend using the standard deviation of the votes.
A controversial vote that's 100% polarised would have equal numbers of -1 and +1 votes, so the mean would be 0 and the stddev would be 1.0.
Conversely a completely consistent set of votes (with no votes in the opposite direction) would have a mean of 1 or -1 and a stddev of 0.0.
Votes that aren't either completely consistent or completely polarised will produce a standard deviation figure between 0 and ~1.0 where that value will indicate the degree of controversy in the vote.
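A small sketch of that score in Python (statistics.pstdev is the population standard deviation):

# Standard deviation of +1/-1 votes as a controversy score.
import statistics

def controversy(up_votes, down_votes):
    votes = [1] * up_votes + [-1] * down_votes
    if len(votes) < 2:
        return 0.0
    return statistics.pstdev(votes)  # 1.0 = fully polarised

print(controversy(50, 50))  # 1.0 -- perfectly split
print(controversy(100, 0))  # 0.0 -- unanimous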
The easiest method is to count the number of upvote/downvote pairings for a given comment within a timeframe (e.g. 1 week, 48 hours, etc.), and have the comments with the most pairings appear first. Anything more complex requires trial and error or experimentation to find the best algorithm -- as always, it varies with the content of the site and how you want it weighted.
Overall, it's not much different than a hotness algorithm, which works by detecting the most upvotes or views within a timeframe.
What about simply taking the smaller of the two values (up or down) at a point in time? If a story goes up a lot and down a little, or the other way around, it is not controversial.
If, for example, an item has 10 ups and 5 downs, the "controversiality level" is 5, since 5 people disagree about liking it. On the other hand, if it has either 10 ups or 10 downs, the "controversiality level" is 0, since no one is disagreeing.
So in the end the smaller of the two values defines the "hotness" or "controversiality". Does this make sense?
// figure out if up or down is winning - doesn't matter which
if (up_votes > down_votes)
{
    win_votes = up_votes;
    lose_votes = down_votes;
}
else
{
    win_votes = down_votes;
    lose_votes = up_votes;
}

// guard: a story with no votes at all is not controversial
if (win_votes == 0)
    return 0;

// losewin_ratio is always <= 1, near 0 if win_votes >> lose_votes
// (divide as floating point; integer division would truncate to 0 or 1)
losewin_ratio = lose_votes / (double) win_votes;
total_votes = up_votes + down_votes;
controversy_score = total_votes * losewin_ratio; // large means controversial
This formula will produce high scores for stories that have a lot of votes and a near 50/50 voting split, and low scores for stories that have either few votes or many votes for one choice.
