What is the spaced repetition algorithm to generate the day intervals?

I am implementing a flashcard game and I want to implement spaced repetition. I don't need something complex like in SuperMemo, but simply space the learning based on the score for each card.
What I am looking for at the moment is how to calculate the number of days until a card is shown again, based on its score. I found that ZDT uses the list in the screenshot below (1, 2, 3, 5, etc.). Does anybody know how to dynamically generate this list (so that I can calculate beyond a score of 12)?
Or could someone perhaps guess what math function I could use to generate the numbers in the ZDT list? They increase exponentially.

It looks very similar to a logistic curve. I'll run a logistic regression on it and see what comes out.
Here is the data (plotted using WolframAlpha)
Here is the equation I got:
f(x) = 115/(1+2192*EXP(-0.79*x))
Here is the plot with the curve:
Unfortunately the curve isn't very accurate for small numbers.
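For what it's worth, here is a minimal Python sketch that turns the fitted curve above into an interval for an arbitrary score. The rounding and the clamp to at least one day are my own additions, since (as noted) the fit undershoots for small scores.

```python
import math

def interval_days(score):
    """Days until the card is shown again, using the logistic fit above.
    Rounding and the one-day floor are assumptions, not part of the fit."""
    return max(1, round(115 / (1 + 2192 * math.exp(-0.79 * score))))

print([interval_days(s) for s in range(1, 16)])  # scores 1..15
```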

Related

Algorithm to determine the global minima of a blackbox function

I recently got this question in an interview, and it's kind of making me mad thinking about it.
Suppose you have a family of functions that each take a fixed number of parameters (different functions can take different numbers of parameters), each with the following properties:
Each input is between 0-1
Each output is between 0-1
The function is continuous
The function is a blackbox (i.e. you cannot look at its equation)
He then asked me to create an algorithm to find the global minimum of such a function.
To me, this question felt like being asked to solve a foundational problem of machine learning. Obviously, if there were some way to guarantee finding the global minimum of an arbitrary function, we'd have perfect machine learning algorithms. Obviously we don't, so this question seems kind of impossible.
Anyway, the answer I gave was a mixture of divide and conquer and stochastic gradient descent. Since all the functions are continuous, you can always estimate the partial gradient with respect to any dimension. You split each dimension in half, and once you've reached a certain granularity, you apply stochastic gradient descent: you initialize a start point, evaluate the function a small delta to the left and right of that point in every dimension to estimate the slope there, then update the point using a learning rate and recalculate the partial derivatives, repeating until the distance between the old and new point falls below a threshold. Then you re-merge the sections and return the minimum of each pair, until you return the minimum over all your divisions. My hope was to get around the fact that SGD can get stuck in local minima, so I thought dividing the dimension space would reduce the chance of that happening.
He seemed pretty unimpressed with my algorithm in the end. Does anybody have a faster/more accurate way of solving this problem?
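For concreteness, here is a rough sketch of the finite-difference gradient descent part of the approach described above; instead of the divide-and-conquer split it uses plain random restarts, and every hyperparameter (step size, iteration counts) is an arbitrary assumption. It still cannot guarantee the global minimum.

```python
import random

def finite_diff_grad(f, x, eps=1e-4):
    """Central-difference estimate of the gradient of a blackbox f at point x."""
    grad = []
    for i in range(len(x)):
        hi, lo = x[:], x[:]
        hi[i] = min(1.0, x[i] + eps)
        lo[i] = max(0.0, x[i] - eps)
        grad.append((f(hi) - f(lo)) / (hi[i] - lo[i]))
    return grad

def multistart_descent(f, dim, starts=50, lr=0.05, steps=200):
    """Gradient descent from many random starts; keep the best point found.
    This does not guarantee the global minimum, it only reduces the odds of
    getting stuck in a single local minimum."""
    best_x, best_val = None, float("inf")
    for _ in range(starts):
        x = [random.random() for _ in range(dim)]
        for _ in range(steps):
            g = finite_diff_grad(f, x)
            x = [min(1.0, max(0.0, xi - lr * gi)) for xi, gi in zip(x, g)]
        val = f(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# Example with a known function standing in for the blackbox
print(multistart_descent(lambda x: (x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2, dim=2))
```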
The range is [0, 1], so f(x) = 0 (for some x in R^n) would be the global minimum if it is attained. Moreover, knowing only the domain, the range, and that continuity holds does not guarantee that the function is convex.
For example, f(x) = sqrt(x) is a concave function (i.e. it has no interior minimum), and [0, 1] lies within its domain.

Theory on how to find the equation of a curve given a variable number of data points

I have recently started working on a project. One of the problems I ran into was converting changing accelerations into velocity. Accelerations at different points in time are provided through sensors. If you get the equation of these data points, then its derivative at a certain time (x) will be the velocity.
I know how to do this on the computer, but how would I get the equation to start with? I have searched around but I have not found any existing programs that can form an equation given a set of points. In the past, I have created a neural net algorithm to form an equation, but it takes an incredibly long time to run.
If someone can link me a program or explain the process of doing this, that would be fantastic.
Sorry if this is in the wrong forum. I would post it in math, but a programming background is needed to judge what a computer can realistically do quickly.
This started out as a comment but ended up being too big.
Just to make sure you're familiar with the terminology...
Differentiation takes a function f(t) and spits out a new function f'(t) that tells you how f(t) changes with time (i.e. f'(t) gives the slope of f(t) at time t). This takes you from displacement to velocity or from velocity to acceleration.
Integration takes a function f(t) and spits out a new function F(t) which measures the area under f(t) from the beginning of time up until a given point t. What's not obvious at first is that integration is actually the reverse of differentiation, a fact known as the Fundamental Theorem of Calculus. So integration takes you from acceleration to velocity or from velocity to displacement.
You don't need to understand the rules of calculus to do numerical integration. The simplest (and most naive) method for integrating a function numerically is to approximate the area by dividing it into small slices between time points and summing the areas of rectangles. This approximating sum is called a Riemann sum.
This tends to really overshoot and undershoot certain parts of the function. A more accurate but still very simple method is the trapezoid rule, which also approximates the function with a series of slices, except the tops of the slices are straight lines between the function values rather than constant values.
More complicated still, but a better approximation yet, is Simpson's rule, which approximates the function with parabolas between time points.
You can think of each of these methods as getting a better approximation of the integral because they each use more information about the function. The first method uses just one data point per area (a constant flat line), the second method uses two data points per area (a straight line), and the third method uses three data points per area (a parabola).
You could read up on the math behind these methods here or in the first page of this pdf.
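If it helps, here is a minimal sketch of the trapezoid rule applied directly to timestamped sensor samples; the function and variable names are mine.

```python
def cumulative_trapezoid(times, accels, v0=0.0):
    """Integrate sampled acceleration to velocity with the trapezoid rule."""
    velocities = [v0]
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        velocities.append(velocities[-1] + 0.5 * (accels[i - 1] + accels[i]) * dt)
    return velocities

# Example: a constant 2 m/s^2 acceleration sampled every 0.1 s for 1 s
t = [i * 0.1 for i in range(11)]
a = [2.0] * 11
print(cumulative_trapezoid(t, a))  # final value is ~2.0 m/s
```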
I agree with the comments that numerical integration is probably what you want. In case you still want a function going through your data, let me further argue against doing that.
It's usually a bad idea to find a curve that goes exactly through some given points. In almost any applied-math context you have to accept that there is a little noise in the inputs, and a curve that passes exactly through the points can be very sensitive to that noise, producing garbage outputs. Demanding an exact fit is asking for overfitting: you get a function that memorizes the data rather than capturing its trend, and it does not generalize.
For example, take the points (0,0), (1,1), (2,4), (3,9), (4,16), (5,25), (6,36). These are seven points on y=x^2, which is fine. The value of x^2 at x=-1 is 1. Now what happens if you replace (3,9) with (2.9,9.1)? There is a sixth order polynomial passing through all 7 points,
4.66329x - 8.87063x^2 + 7.2281x^3 - 2.35108x^4 + 0.349747x^5 - 0.0194304x^6.
The value of this at x=-1 is -23.4823, very far from 1. While the curve looks ok between 0 and 2, in other examples you can see large oscillations between the data points.
Once you accept that you want an approximation, not a curve going exactly through the points, you have what is known as a regression problem. There are many types of regression. Typically, you choose a set of functions and a way to measure how well a function approximates the data. If you use a simple set of functions like lines (linear regression), you just find the best fit. If you use a more complicated family of functions, you should use regularization to penalize overly complicated functions such as high degree polynomials with large coefficients that memorize the data. If you either use a simple family or regularization, the function tends not to change much when you add or withhold a few data points, which indicates that it is a meaningful trend in the data.
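A quick NumPy sketch of the contrast described above: an exact degree-6 interpolation of the perturbed points versus a simple quadratic regression.

```python
import numpy as np

x = np.array([0, 1, 2, 2.9, 4, 5, 6], dtype=float)
y = np.array([0, 1, 4, 9.1, 16, 25, 36], dtype=float)

interp = np.polyfit(x, y, deg=6)  # degree-6 polynomial through all 7 points
quad = np.polyfit(x, y, deg=2)    # simple least-squares quadratic

print(np.polyval(interp, -1.0))   # swings far away from 1, as described above
print(np.polyval(quad, -1.0))     # stays close to 1
```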
Unfortunately, integrating accelerometer data to get velocity is a numerically unstable problem. For most applications, your error will diverge far too soon to get results of any practical value.
Recall that velocity is the integral of acceleration:
v(t) = v(0) + ∫[0, t] a(τ) dτ
So any error e(τ) in the measured acceleration gets integrated right along with the signal:
v_est(t) = v(t) + ∫[0, t] e(τ) dτ
However well you fit a function to your accelerometer data, you will still essentially be doing a piecewise interpolation of the underlying acceleration function, and the error terms from each piece's integration will add!
Typically you will see wildly inaccurate results after just a few seconds.
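To see the drift for yourself, here is a tiny simulation under an assumed noise level; the device is stationary, so the true velocity is always zero.

```python
import random

def velocity_drift(seconds=60, rate_hz=100, noise_std=0.05):
    """Integrate noisy acceleration readings from a stationary device.
    The true velocity stays at 0, but the estimate random-walks away."""
    dt = 1.0 / rate_hz
    v = 0.0
    for _ in range(int(seconds * rate_hz)):
        v += random.gauss(0.0, noise_std) * dt  # only noise gets integrated
    return v

print([round(velocity_drift(), 3) for _ in range(5)])  # each run wanders differently
```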

"Covering" the space of all possible histogram shapes

There is a very expensive computation I must make frequently.
The computation takes a small array of numbers (with about 20 entries) that sums to 1 (i.e. the histogram) and outputs something that I can store pretty easily.
I have 2 things going for me:
I can accept approximate answers
The "answers" change slowly. For example: [.1 .1 .8 0] and [.1
.1 .75 .05] will yield similar results.
Consequently, I want to build a look-up table of answers off-line. Then, when the system is running, I can look-up an approximate answer based on the "shape" of the input histogram.
To be precise, I plan to look-up the precomputed answer that corresponds to the histogram with the minimum Earth-Mover-Distance to the actual input histogram.
I can only afford to store about 80 to 100 precomputed (histogram , computation result) pairs in my look up table.
So, how do I "spread out" my precomputed histograms so that, no matter what the input histogram is, I'll always have a precomputed result that is "close"?
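(For what it's worth, the lookup step itself is cheap once the table exists. A minimal sketch, using the fact that for 1-D histograms the earth mover's distance reduces to the L1 distance between cumulative sums; the names are mine.)

```python
def emd_1d(h1, h2):
    """Earth mover's distance between two 1-D histograms that each sum to 1:
    the L1 distance between their running (cumulative) sums."""
    dist = c1 = c2 = 0.0
    for a, b in zip(h1, h2):
        c1 += a
        c2 += b
        dist += abs(c1 - c2)
    return dist

def nearest_precomputed(hist, table):
    """table: list of (precomputed_histogram, precomputed_result) pairs."""
    return min(table, key=lambda entry: emd_1d(hist, entry[0]))[1]
```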
Finding N points in M-space that are optimally spread out is more or less equivalent to hypersphere packing (1, 2), and in general answers are not known for M > 10. While a fair amount of research has been done to develop faster methods for hypersphere packings or approximations, it is still regarded as a hard problem.
It probably would be better to apply a technique like principal component analysis or factor analysis to as large a set of histograms as you can conveniently generate. The results of either analysis will be a set of M numbers such that linear combinations of histogram data elements weighted by those numbers will predict some objective function. That function could be the “something that you can store pretty easily” numbers, or could be case numbers. Also consider developing and training a neural net or using other predictive modeling techniques to predict the objective function.
Building on jwpat7's answer, I would apply k-means clustering to a huge set of randomly generated (and hopefully representative) histograms. This would ensure that your space was spanned with whatever number of exemplars (precomputed results) you can support, with roughly equal weighting for each cluster.
The trick, of course, will be generating representative data to cluster in the first place. If you can recompute from time to time, you can recluster based on the actual data in the system so that your clusters might get better over time.
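A sketch of that clustering step, assuming scikit-learn is available; the re-normalisation of the cluster centres back into valid histograms is my own addition.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_lookup_table(sample_histograms, expensive_fn, n_entries=90):
    """Cluster representative histograms, then run the expensive computation
    once per cluster centre to build the (histogram, result) lookup table."""
    km = KMeans(n_clusters=n_entries, n_init=10).fit(sample_histograms)
    centres = np.clip(km.cluster_centers_, 0.0, None)
    centres /= centres.sum(axis=1, keepdims=True)  # keep each centre a valid histogram
    return [(c, expensive_fn(c)) for c in centres]
```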
I second jwpat7's answer, but my very naive approach was to consider the count of items in each histogram bin as a y value, to consider the x values as just 0..1 in 20 steps, and then to obtain parameters a,b,c that describe x vs y as a cubic function.
To get a "covering" of the histograms I just iterated through "possible" values for each parameter.
e.g. to get 27 histograms to cover the "shape space" of my cubic histogram model I iterated the parameters through -1 .. 1, choosing 3 values linearly spaced.
Now, you could change the histogram model to be quartic if you think your data will often be represented that way, or whatever model you think is most descriptive, as well as generate however many histograms to cover. I used 27 because three partitions per parameter for three parameters is 3*3*3=27.
For a more comprehensive covering, like 100, you would have to choose your ranges for each parameter more carefully. 100**(1/3) isn't an integer, so the simple num_covers**(1/num_params) split wouldn't work, but for 3 parameters 4*5*5 would.
Since the actual values of the parameters could vary greatly and still achieve the same shape, it would probably be best to store ratios of them for comparison instead, e.g. for my 3 parameters, b/a and b/c.
Here is an 81 histogram "covering" using a quartic model, again with parameters chosen from linspace(-1,1,3):
edit: Since you said your histograms were described by arrays that were ~20 elements, I figured fitting parameters would be very fast.
edit2: On second thought, I think using a constant term in the model is pointless; all that matters is the shape.
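A rough sketch of the parameter sweep for the cubic model; the shift-and-normalise step at the end is my own assumption, needed to turn each curve into a valid histogram.

```python
import itertools
import numpy as np

x = np.linspace(0, 1, 20)                    # 20 bins, x running from 0 to 1
covering = []
for a, b, c in itertools.product(np.linspace(-1, 1, 3), repeat=3):
    y = a * x + b * x**2 + c * x**3          # cubic shape model, no constant term
    y = y - y.min() + 1e-9                   # shift so every bin is non-negative
    covering.append(y / y.sum())             # normalise so each histogram sums to 1
print(len(covering))                         # 3*3*3 = 27 covering histograms
```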

Singular Value Decomposition to predict a missing value from an otherwise FULLY POPULATED matrix

this is my first question, and I hope it's not misdirected/in the wrong place.
Let's say I have a matrix of data that is fully populated except for one value. For example, Column 1 is Height, Column 2 is Weight, and Column 3 is Bench Press. So I surveyed 20 people and got their height, weight, and bench press weight. Now I have a 5'11 individual weighing 170 pounds, and would like to predict his/her bench press weight. You could look at this as the matrix having a missing value, or you could look at it as wanting to predict a dependent variable given a vector of independent variables. There are curve fitting approaches to this kind of problem, but I would like to know how to use the Singular Value Decomposition to answer this question.
I am aware of the Singular Value Decomposition as a means of predicting missing values, but virtually all the information I have found has been in relation to huge, highly sparse matrices, with respect to the Netflix Prize and related problems. I cannot figure out how to use SVD or a similar approach to predict a missing value from a small or medium sized, fully populated (except for one missing value) matrix.
A step-by-step algorithm for solving the example above using SVD would be very helpful to me. Thank you!
I was planning this as a comment, but it's too long by a fair bit, so I've submitted it as an answer.
My reading of SVD suggests to me that it is not very applicable to your example. In particular, it seems that you would need to somehow assign a difficulty ranking to the bench-press column of your matrix, or an ability ranking to the individuals, or perhaps both. Since the amount a person can bench-press depends solely on his own height and weight, I don't think SVD would provide any improvement over just calculating the statistical average of what others in the list have accomplished and using that to predict the outcome for your 5'11", 170 lb lifter. Perhaps if there were a BMI (body mass index) column and BMI could be ranked... and probably a larger data set. I think the problem is that there is no noise in your matrix to be reduced by SVD. Here's a tutorial that appears to use a similar problem: http://www.puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html
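One way the SVD does show up naturally for a small, fully populated matrix like this is through least squares: the pseudoinverse used below is computed from the SVD. This is not the Netflix-style matrix completion the question mentions, just a sketch with made-up numbers.

```python
import numpy as np

# Hypothetical survey data: height (in), weight (lb), bench press (lb)
data = np.array([
    [70, 160, 150], [72, 180, 185], [68, 150, 135],
    [74, 200, 210], [69, 165, 155], [71, 175, 170],
], dtype=float)

X = np.column_stack([np.ones(len(data)), data[:, 0], data[:, 1]])  # intercept, height, weight
y = data[:, 2]

coeffs = np.linalg.pinv(X) @ y          # least-squares fit via the SVD-based pseudoinverse

query = np.array([1.0, 71.0, 170.0])    # 5'11" = 71 in, 170 lb
print(query @ coeffs)                   # predicted bench press
```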

Translating a score into a probability

People visit my website, and I have an algorithm that produces a score between 0 and 1. The higher the score, the greater the probability that the person will buy something, but the score isn't itself a probability, and its relationship with the purchase probability may not be linear.
I have a bunch of data about what scores I gave people in the past, and whether or not those people actually made a purchase.
Using this data about what happened with scores in the past, I want to be able to take a score and translate it into the corresponding probability based on this past data.
Any ideas?
edit: A few people are suggesting bucketing, and I should have mentioned that I had considered this approach, but I'm sure there must be a way to do it "smoothly". A while ago I asked a question about a different but possibly related problem here, I have a feeling that something similar may be applicable but I'm not sure.
edit2: Let's say I told you that of the 100 customers with a score above 0.5, 12 of them purchased, and of the 25 customers with a score below 0.5, 2 of them purchased. What can I conclude, if anything, about the estimated purchase probability of someone with a score of 0.5?
Draw a chart - plot the ratio of buyers to non-buyers on the Y axis and the score on the X axis - fit a curve - then for a given score you can read off the probability from the height of the curve.
(You don't need to physically create a chart - but the algorithm should be evident from the exercise.)
Simples.
That is what logistic regression, probit regression, and company were invented for. Nowadays most people would use logistic regression, but fitting it involves iterative algorithms - there are, of course, lots of implementations, but you might not want to write one yourself. Probit regression has an approximate explicit solution described at the link that might be good enough for your purposes.
A possible way to assess whether logistic regression would work for your data would be to plot each score against the logit of the probability of purchase, log(p/(1-p)), and see whether these form a straight line.
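A minimal scikit-learn sketch of that fit, with made-up example data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# made-up (score, purchased) history
scores = np.array([[0.1], [0.2], [0.35], [0.4], [0.55], [0.6], [0.8], [0.9]])
bought = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression().fit(scores, bought)
print(model.predict_proba([[0.5]])[0, 1])  # estimated purchase probability at score 0.5
```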
I eventually found exactly what I was looking for: an algorithm called "pair-adjacent violators". I initially found it in this paper; be warned, however, that there is a flaw in their description of the implementation.
I describe the algorithm, this flaw, and the solution to it on my blog.
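For reference, here is a standard formulation of pair-adjacent violators (isotonic regression); this is my own sketch, not the corrected implementation from the blog post.

```python
def pair_adjacent_violators(scores, labels):
    """Fit a monotone non-decreasing step function of purchase probability
    against score. Returns (sorted scores, calibrated probabilities)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = [[labels[i], 1] for i in order]   # each block: [label sum, count]
    merged = []
    for block in blocks:
        merged.append(list(block))
        # merge backwards while an earlier block's mean exceeds a later one's
        while len(merged) > 1 and merged[-2][0] * merged[-1][1] > merged[-1][0] * merged[-2][1]:
            s2, n2 = merged.pop()
            s1, n1 = merged.pop()
            merged.append([s1 + s2, n1 + n2])
    probs = []
    for s, n in merged:
        probs.extend([s / n] * n)
    return [scores[i] for i in order], probs

xs, ps = pair_adjacent_violators([0.1, 0.35, 0.4, 0.8, 0.9], [0, 1, 0, 1, 1])
print(list(zip(xs, ps)))  # monotone probabilities: 0, 0.5, 0.5, 1, 1
```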
Well, the straightforward way to do this would be to calculate what percentage of people in a score interval purchased something, and to do this for all intervals (say, every 0.05 points).
Have you noticed an actual correlation between a higher score and an increased likelihood of purchases in your data?
I'm not an expert in statistics and there might be a better answer though.
You could divide the scores into a number of buckets, e.g. 0.0-0.1, 0.1-0.2,... and count the number of customers who purchased and did not purchase something for each bucket.
Alternatively, you may want to plot each score against the amount spent (as a scattergram) and see if there is any obvious relationship.
You could use exponential decay to produce a weighted average.
Take your users, arrange them in order of scores (break ties randomly).
Working from left to right, start with a running average of 0. For each user, update the average to average = (1-p) * average + p * (sale ? 1 : 0). Do the same thing from right to left, except starting with 1.
The smaller you make p, the smoother your curve will become. Play around with your data until you have a value of p that gives you results that you like.
Incidentally, this is the key idea behind how Unix systems calculate load averages.
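A sketch of that two-pass smoothing; the names and the choice to return both directional averages side by side are my own.

```python
def smoothed_purchase_rates(scored_sales, p=0.05):
    """scored_sales: list of (score, bought) pairs, bought being 0 or 1.
    Returns (score, left_avg, right_avg) per user, sorted by score."""
    data = sorted(scored_sales)               # ties broken arbitrarily here
    left, avg = [], 0.0
    for _, sale in data:                      # left-to-right pass, starting at 0
        avg = (1 - p) * avg + p * sale
        left.append(avg)
    right, avg = [], 1.0
    for _, sale in reversed(data):            # right-to-left pass, starting at 1
        avg = (1 - p) * avg + p * sale
        right.append(avg)
    right.reverse()
    return [(score, l, r) for (score, _), l, r in zip(data, left, right)]
```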
Based upon your edit2 comment, you would not have enough data to make a statement. Your overall purchase rate is 11.2%. That is not statistically different from your two purchase rates, which are for scores above/below 0.5. Additionally, to validate your score you would have to ensure that the purchase percentages were monotonically increasing as the score increased. You could bucket, but you would need to check your results against a probability calculator to make sure they did not occur by chance.
http://stattrek.com/Tables/Binomial.aspx
