Let's say we have a list of people and would like to find people similar to person X.
The feature vector has 3 items [weight, height, age] and there are 3 people in our list. Note that we don't know the height of person C.
A: [70kg, 170cm, 60y]
B: [60kg, 169cm, 50y]
C: [60kg, ?, 50y]
What would be the best way to find people closest to person A?
My guess
Let's calculate the average value for height, and use it instead of unknown value.
So, let's say we calculated that 170cm is the average height, and we redefine person C as [60kg, ~170cm, 50y].
Now we can find the people closest to A; the ordering will be A, C, B.
Problem
Now, the problem is that we put C, with a guessed ~170cm, ahead of B, with a known 169cm.
It kinda feels wrong. We humans are smarter than machines and know that there's little chance that C is exactly 170cm. So it would be better to put B, with a known 169cm, ahead of C.
But how can we calculate that penalty (preferably with a simple empirical algorithm)? Should we somehow penalise vectors with unknown values? And by how much (maybe calculate the average difference in height between every two people in the set)?
And what would that penalisation look like in the general case, where the feature vector has dimension N with K known items and U unknown (K + U = N)?
In this particular example, would it be better to use linear regression to fill in the missing values instead of taking the average? This way you may have more confidence in the guessed value and may not need a penalty.
But if you want penalty, I have an idea of taking the ratio of non-missing features. In the example, there are 3 features in total. C has values in 2 of the features. So the ratio of non-missing features for C is 2/3. Adjust the similarity score by multiplying it with the ratio of non-missing features. For example, if the similarity between A and C is 0.9, the adjusted similarity is 0.9 * 2 / 3 = 0.6. Whereas the similarity between A and B will not be impacted since B has values for all the features and the ratio will be 1.
You can also weight the features when computing the ratio. For example, (weight, height, age) get the weights (0.3, 0.4, 0.3) respectively. Then missing the height feature will have the weighted ratio of (0.3 + 0.3) = 0.6. You can see C is penalized even more since we think height is more important than weight and age.
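A minimal sketch of that weighted-ratio penalty, assuming NaN marks a missing value and using an inverse-distance similarity purely for illustration (the answer does not prescribe a particular similarity function):

    import numpy as np

    def penalized_similarity(a, b, feature_weights):
        # base similarity computed on the features known in both vectors
        known = ~(np.isnan(a) | np.isnan(b))
        sim = 1.0 / (1.0 + np.linalg.norm(a[known] - b[known]))
        # weighted ratio of non-missing features, as described above
        ratio = feature_weights[known].sum() / feature_weights.sum()
        return sim * ratio

    A = np.array([70.0, 170.0, 60.0])
    C = np.array([60.0, np.nan, 50.0])
    weights = np.array([0.3, 0.4, 0.3])
    print(penalized_similarity(A, C, weights))  # penalized by the weighted ratio 0.6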
I would suggest, using the data points for which we have all the known attributes, training a learning model (linear regression or a multi-layer perceptron) to learn the unknown attribute, and then using this model to fill in the unknown attributes. The average is a special case of the linear model.
You are interested in the problem of Data Imputation.
There are several approaches to solving this problem, and I am just going to list some:
Mean/Mode/Median Imputation: Imputation is a method to fill in the missing values with estimated ones. The objective is to employ known relationships that can be identified in the valid values of the data set to assist in estimating the missing values. Mean/mode/median imputation is one of the most frequently used methods. It consists of replacing the missing data for a given attribute with the mean or median (for a quantitative attribute) or the mode (for a qualitative attribute) of all known values of that variable. This can be further classified into generalized and similar-case imputation.
Prediction Model: A prediction model is one of the more sophisticated methods for handling missing data. Here, we create a predictive model to estimate the values that will substitute the missing data. In this case, we divide our data set into two sets: one with no missing values for the variable and another with missing values. The first data set becomes the training set of the model, while the second data set, with missing values, is the test set, and the variable with missing values is treated as the target variable. Next, we create a model to predict the target variable from the other attributes of the training set and use it to populate the missing values of the test set.
KNN (k-nearest neighbor) Imputation: In this method of imputation, the missing values of an attribute are imputed using a given number of records that are most similar to the record whose values are missing. The similarity of two records is determined using a distance function.
Linear Regression: A linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory (independent) variables denoted X. In prediction, linear regression can be used to fit a predictive model to an observed data set of y and X values. After developing such a model, if an additional value of X is then given without its accompanying value of y, the fitted model can be used to make a prediction of the value of y. Check this example if you want.
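For the mean and KNN variants above, scikit-learn's imputers are a quick way to experiment; a small sketch on the toy data from the question (NaN marks the unknown height):

    import numpy as np
    from sklearn.impute import SimpleImputer, KNNImputer

    # rows are people A, B, C; columns are [weight, height, age]
    X = np.array([[70.0, 170.0, 60.0],
                  [60.0, 169.0, 50.0],
                  [60.0, np.nan, 50.0]])

    mean_filled = SimpleImputer(strategy='mean').fit_transform(X)  # mean imputation
    knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)        # KNN imputation
    print(mean_filled[2], knn_filled[2])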
Related
I have a table of 21 students (A1…A21) and their 25 characteristics (table 1), and I have another matrix (table 2) which shows whether a student likes another student or not (0 means like and 100 means dislike).
How can I find the least number of characteristics that give distances in space similar to the likeability matrix?
For Example:
If we get 5 dimensions with characteristics C1, C3, C4, C5, C10, then the points A1...A21, when plotted for these characteristics, will have distances proportional to the likeability matrix.
For example, if A3 and A2 have a small distance between them in that 5D characteristics space, then they will have a corresponding smaller distance/value in the likeability matrix.
Table 1 and Table 2 (data not reproduced here).
You can make this look like a well-known statistical problem, but you have made assumptions (that similar students like each other), I will make further assumptions, and most of the solutions to the statistical problem are not very respectable, so you should take the results with a pinch of salt.
With 21 students, you have 21*20/2 = 210 pairs of students. Treat each pair as a separate observation. You have a likeability value for that pair. For each pair compute, for each characteristic, the absolute value of the difference between the values for each of the two students. This gives you a vector of 25 elements for each observation. You will now try and predict the 210 likeabilities given the 210 25-long vectors of absolute differences.
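A short sketch of that pair construction, assuming chars is the 21x25 characteristics array and likeability the 21x21 matrix (names are illustrative):

    import numpy as np
    from itertools import combinations

    def make_pairs(chars, likeability):
        X, y = [], []
        for i, j in combinations(range(chars.shape[0]), 2):
            X.append(np.abs(chars[i] - chars[j]))   # 25 absolute differences per pair
            y.append(likeability[i, j])             # the pair's likeability value
        return np.array(X), np.array(y)             # shapes (210, 25) and (210,)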
Procedures for this go under the names of all-subsets regression and stepwise regression; see https://www.r-bloggers.com/variable-selection-using-automatic-methods/. One way to compute these is to use the free open-source statistical package R from https://www.r-project.org/.
For each possible selection of variables you can use linear regression to predict likeability from the vector of absolute differences. From that linear regression you can get a measure of how good the prediction is, and so whether that particular selection of variables was any good or not. All-subsets regression uses a variation on branch and bound to work out, for each N, the set of variables of size N which predicts best. Stepwise regression starts off with a possibly incomplete selection of variables and performs a sort of hill climb, adding or subtracting one variable from the set at each stage, trying all of the variables and choosing the one that gives the best prediction. Typically you start with no variables and add one variable at a time, or start with all variables and remove one variable at a time. Stepwise selection isn't guaranteed to find the absolute best selection of variables that all-subsets regression will find, but all-subsets regression can be very expensive.
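A crude forward-stepwise sketch on the (X, y) pairs built above, using cross-validated linear regression to score each candidate variable; R's step() or dedicated all-subsets tools do this properly, this is only to show the loop:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    def forward_select(X, y, max_vars=5):
        selected, remaining = [], list(range(X.shape[1]))
        for _ in range(max_vars):
            # score each remaining variable when added to the current selection
            scores = {c: cross_val_score(LinearRegression(),
                                         X[:, selected + [c]], y, cv=5).mean()
                      for c in remaining}
            best = max(scores, key=scores.get)
            selected.append(best)
            remaining.remove(best)
        return selected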
From this you will get a best selection of variables (probably one best selection for each number of variables) and you may get some indication of statistical significance. You have broken so many rules about multiple testing and independence (inflating 21 observations to 210) that you shouldn't take any statistical significance seriously. If you want some idea of whether you are looking at real information or prettied-up random noise, automate the procedure and see what it looks like on fake data where there is no underlying effect at all, and perhaps on fake data where you have constructed data from which there is an underlying effect that you know about because you have constructed it. See also https://en.wikipedia.org/wiki/Bootstrapping_(statistics) and https://en.wikipedia.org/wiki/Resampling_(statistics)#Permutation_tests
Suppose I have a set of entities (for example people with their physical characteristics) and I want to find, for a given entity X, all entities related (or similar) to it, for some definition of similarity.
I can easily find such entities for one dimension (all people with height Y ~= X's height within a certain threshold) but is there some approach that I can use to find similar entities considering more than one attribute?
It is going to depend on how you define similarity, but you can apply the same approach you take for 1D to any dimension, with a small generalization. Assuming each element is represented as a vector, you can measure the distance between two vectors x, y as d = |x - y|, and accept/reject depending on this d and some threshold.
Here, the minus operator is vector subtraction:
(a1,a2,...,an)-(b1,b2,...,bn)=(a1-b1,a2-b2,...,an-bn)
and the absolute value is the Euclidean norm of the vector:
|(a1,a2,...,an)| = sqrt(a1^2 + a2^2 + ... + an^2).
It is easy to see that this is a generalization of your 1D example; invoking the same approach for vectors with a single element does the same thing.
The downside of this approach is that (0,0,0,...,0,10^20) and (0,0,0,...,0) will be very far away from each other, which might or might not be what you are after; in that case you might need a different distance metric, but that really depends on what exactly you are after.
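A minimal sketch of that accept/reject test in Python (the threshold value is arbitrary here):

    import numpy as np

    def is_similar(x, y, threshold):
        d = np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
        return d <= threshold   # |x - y| = sqrt(sum_i (x_i - y_i)^2)

    print(is_similar([70, 170, 60], [60, 169, 50], threshold=15.0))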
There is a very expensive computation I must make frequently.
The computation takes a small array of numbers (with about 20 entries) that sums to 1 (i.e. the histogram) and outputs something that I can store pretty easily.
I have 2 things going for me:
I can accept approximate answers
The "answers" change slowly. For example: [.1 .1 .8 0] and [.1
.1 .75 .05] will yield similar results.
Consequently, I want to build a look-up table of answers off-line. Then, when the system is running, I can look-up an approximate answer based on the "shape" of the input histogram.
To be precise, I plan to look-up the precomputed answer that corresponds to the histogram with the minimum Earth-Mover-Distance to the actual input histogram.
I can only afford to store about 80 to 100 precomputed (histogram, computation result) pairs in my look-up table.
So, how do I "spread out" my precomputed histograms so that, no matter what the input histogram is, I'll always have a precomputed result that is "close"?
Finding N points in M-space that are a best spread-out set is more-or-less equivalent to hypersphere packing (1,2) and in general answers are not known for M>10. While a fair amount of research has been done to develop faster methods for hypersphere packings or approximations, it is still regarded as a hard problem.
It probably would be better to apply a technique like principal component analysis or factor analysis to as large a set of histograms as you can conveniently generate. The results of either analysis will be a set of M numbers such that linear combinations of histogram data elements weighted by those numbers will predict some objective function. That function could be the “something that you can store pretty easily” numbers, or could be case numbers. Also consider developing and training a neural net or using other predictive modeling techniques to predict the objective function.
Building on jwpat7's answer, I would apply k-means clustering to a huge set of randomly generated (and hopefully representative) histograms. This would ensure that your space was spanned with whatever number of exemplars (precomputed results) you can support, with roughly equal weighting for each cluster.
The trick, of course, will be generating representative data to cluster in the first place. If you can recompute from time to time, you can recluster based on the actual data in the system so that your clusters might get better over time.
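A rough sketch of that clustering step; the Dirichlet samples below only stand in for "representative" histograms and should be replaced with data drawn from your actual system:

    import numpy as np
    from sklearn.cluster import KMeans

    samples = np.random.dirichlet(np.ones(20), size=50000)     # placeholder histograms (each sums to 1)
    centers = KMeans(n_clusters=100, n_init=10).fit(samples).cluster_centers_
    exemplars = centers / centers.sum(axis=1, keepdims=True)   # renormalize so each exemplar sums to 1
    # precompute the expensive result once per exemplar and store the pairs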
I second jwpat7's answer, but my very naive approach was to consider the count of items in each histogram bin as a y value, to consider the x values as just 0..1 in 20 steps, and then to obtain parameters a,b,c that describe x vs y as a cubic function.
To get a "covering" of the histograms I just iterated through "possible" values for each parameter.
e.g. to get 27 histograms to cover the "shape space" of my cubic histogram model I iterated the parameters through -1 .. 1, choosing 3 values linearly spaced.
Now, you could change the histogram model to be quartic if you think your data will often be represented that way, or whatever model you think is most descriptive, as well as generate however many histograms to cover. I used 27 because three partitions per parameter for three parameters is 3*3*3=27.
For a more comprehensive covering, like 100, you would have to choose your ranges for each parameter more carefully. 100**(1/3) isn't an integer, so the simple num_covers**(1/num_params) solution wouldn't work, but for 3 parameters 4*5*5 would.
Since the actual values of the parameters could vary greatly and still achieve the same shape, it would probably be best to store ratios of them for comparison instead, e.g. for my 3 parameters b/a and b/c.
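A sketch of generating that 27-histogram covering from the cubic model; the shift/normalize step is my own addition so each candidate behaves like a histogram that sums to 1:

    import numpy as np
    from itertools import product

    x = np.linspace(0, 1, 20)
    covering = []
    for a, b, c in product(np.linspace(-1, 1, 3), repeat=3):   # 3 values per parameter -> 27 shapes
        y = a * x**3 + b * x**2 + c * x
        y = y - y.min() + 1e-9          # shift so every bin is non-negative
        covering.append(y / y.sum())    # normalize so the candidate sums to 1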
Here is an 81 histogram "covering" using a quartic model, again with parameters chosen from linspace(-1,1,3):
edit: Since you said your histograms were described by arrays that were ~20 elements, I figured fitting parameters would be very fast.
edit2: On second thought, I think using a constant term in the model is pointless; all that matters is the shape.
I've seen some machine learning questions on here so I figured I would post a related question:
Suppose I have a dataset where athletes participate at running competitions of 10 km and 20 km with hilly courses i.e. every competition has its own difficulty.
The finishing times from users are almost inverse normally distributed for every competition.
One can write this problem as a matrix:
       Comp1  Comp2  Comp3
User1  20min  ??     10min
User2  25min  20min  12min
User3  30min  25min  ??
User4  30min  ??     ??
I would like to complete the matrix above, which has size 1000x20 and a sparseness of 8% (!).
There should be a very easy way to complete this matrix, since I can calculate parameters for every user (ability) and parameters for every competition (mu, lambda of the distributions). Moreover, the correlations between the competitions are very high.
I can take advantage of the rankings User1 < User2 < User3 and Item3 << Item2 < Item1
Could you maybe give me a hint which methods I could use?
Your astute observation that this is a matrix completion problem gets you most of the way to the solution. I'll codify your intuition that the combination of the ability of a user and the difficulty of the course yields the time of a race, then present various algorithms.
Model
Let the vector u denote the speed of the users, so that u_i is user i's speed. Let the vector v denote the difficulty of the courses, so that v_j is course j's difficulty. Also, when available, let t_ij be user i's time on course j, and define y_ij = 1/t_ij, user i's speed on course j.
Since you say the times are inverse Gaussian distributed, a sensible model for the observations is
y_ij = u_i * v_j + e_ij,
where e_ij is a zero-mean Gaussian random variable.
To fit this model, we search for vectors u and v that minimize the prediction error over the observed speeds:
f(u,v) = sum_ij (u_i * v_j - y_ij)^2
Algorithm 1: missing value Singular Value Decomposition
This is the classical Hebbian algorithm. It minimizes the above cost function by gradient descent. The gradients of f with respect to u and v are
df/du_i = sum_j (u_i * v_j - y_ij) v_j
df/dv_j = sum_i (u_i * v_j - y_ij) u_i
Plug these gradients into a conjugate gradient solver or a BFGS optimizer, like MATLAB's fminunc or scipy's optimize.fmin_ncg or optimize.fmin_bfgs. Don't roll your own gradient descent unless you're willing to implement a very good line search algorithm.
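A compact sketch of this fit using scipy's L-BFGS-B (instead of fmin_ncg/fmin_bfgs), with missing times marked as NaN; the factor of 2 in the gradients is the exact derivative of the squared error:

    import numpy as np
    from scipy.optimize import minimize

    def fit_speeds(y):                      # y: speed matrix, NaN where the time is unknown
        mask = ~np.isnan(y)
        n_users, n_courses = y.shape
        y0 = np.nan_to_num(y)

        def cost_and_grad(params):
            u, v = params[:n_users], params[n_users:]
            r = np.where(mask, np.outer(u, v) - y0, 0.0)   # residuals on observed entries only
            return np.sum(r**2), np.concatenate([2 * r @ v, 2 * r.T @ u])

        x0 = np.random.rand(n_users + n_courses)
        res = minimize(cost_and_grad, x0, jac=True, method='L-BFGS-B')
        u, v = res.x[:n_users], res.x[n_users:]
        return np.outer(u, v)               # predicted speeds; predicted times are 1 / speed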
Algorithm 2: matrix factorization with a trace norm penalty
Recently, simple convex relaxations to this problem have been proposed. The resulting algorithms are just as simple to code up and seem to work very well. Check out, for example, Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm. These methods minimize
f(m) = sum_ij (m_ij - y_ij)^2 + ||m||_*,
where ||.||_* is the so-called nuclear norm of the matrix m. Implementations will again end up computing gradients with respect to u and v and relying on a nonlinear optimizer.
There are several ways to do this; perhaps the best architecture to try first is the following:
(As usual, as a preprocessing step, normalize your data to zero mean and unit standard deviation as best you can. You can do this by fitting a function to the distribution of all race results, applying its inverse, and then subtracting the mean and dividing by the standard deviation.)
Select a hyperparameter N (you can tune this as usual with a cross validation set).
For each participant and each race create an N-dimensional feature vector, initially random. So if there are R races and P participants then there are R+P feature vectors with a total of N(R+P) parameters.
The prediction for a given participant and a given race is a function of the two corresponding feature vectors (as a first try use the scalar product of these two vectors).
Alternate between incrementally improving the participant feature vectors and the race feature vectors.
To improve a feature vector use gradient descent (or some more complex optimization method) on the known data elements (the participant/race pairs for which you have a result).
That is, your loss function is:
total_error = 0
forall i,j
    if (Participant i participated in Race j)
        actual = ActualRaceResult(i,j)
        predicted = ScalarProduct(ParticipantFeatures_i, RaceFeatures_j)
        total_error += (actual - predicted)^2
So calculate the partial derivative of this function wrt the feature vectors and adjust them incrementally as per a usual ML algorithm.
(You should also include a regularization term in the loss function, for example the squared lengths of the feature vectors.)
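A rough sketch of this training loop, with illustrative names; results maps (participant, race) pairs to the normalized race result for the known entries only:

    import numpy as np

    def train(results, n_participants, n_races, n_dims=5, lr=0.01, reg=0.1, epochs=200):
        P = 0.1 * np.random.randn(n_participants, n_dims)    # participant feature vectors
        R = 0.1 * np.random.randn(n_races, n_dims)            # race feature vectors
        for _ in range(epochs):
            for (i, j), actual in results.items():
                err = actual - P[i] @ R[j]                     # scalar-product prediction error
                P[i] += lr * (err * R[j] - reg * P[i])         # gradient step + L2 regularization
                R[j] += lr * (err * P[i] - reg * R[j])
        return P, R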
Let me know if this architecture is clear to you or you need further elaboration.
I think this is a classical missing-data recovery task. Several different methods exist; one I can suggest is based on the Self-Organizing Feature Map (Kohonen's map).
Below it is assumed that every athlete's record is a pattern, and every competition is a feature.
Basically, you should divide your data into 2 sets: the first with fully defined patterns, and the second with patterns that have partially missing features. I assume this is feasible because the sparsity is 8%, that is, you have enough data (92%) to train the net on undamaged records.
Then you feed the first set to the SOM and train it on this data. During this process all features are used. I won't copy the algorithm here, because it can be found in many public sources, and even some implementations are available.
After the net is trained, you can feed patterns from the second set to the net. For each pattern the net should calculate the best matching unit (BMU), based only on those features that exist in the current pattern. Then you can take from the BMU its weights corresponding to the missing features.
Alternatively, you could skip dividing the data into 2 sets and train the net on all patterns, including the ones with missing features. But for such patterns the learning process should be altered in a similar way, that is, the BMU should be calculated only on the features that exist in each pattern.
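A minimal sketch of the "BMU on existing features only" step, assuming codebook holds the trained SOM unit weights (one row per unit) from whichever SOM implementation you use, and NaN marks the missing features:

    import numpy as np

    def impute_with_bmu(pattern, codebook):
        observed = ~np.isnan(pattern)
        # best matching unit, using only the features present in this pattern
        d = np.sum((codebook[:, observed] - pattern[observed])**2, axis=1)
        bmu = codebook[np.argmin(d)]
        filled = pattern.copy()
        filled[~observed] = bmu[~observed]   # take the missing features from the BMU's weights
        return filled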
I think you can have a look at the recent low rank matrix completion methods.
The assumption is that your matrix has a low rank compared to the matrix dimension.
min rank(M)
s.t. ||P(M-M')||_F=0
M is the final result, and M' is the uncompleted matrix you currently have.
This algorithm minimizes the rank of your matrix M. P in the constraint is an operator that takes the known terms of your matrix M' and constrains those terms in M to be the same as in M'.
The optimization of this problem has a relaxed version, which is:
min ||M||_* + \lambda*||P(M-M')||_F
rank(M) is relaxed to its convex envelope, ||M||_*. You then trade off the two terms by controlling the parameter lambda.
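A rough soft-impute-style sketch of that relaxed problem: repeatedly replace the missing entries with the current low-rank estimate and soft-threshold the singular values (lambda and the iteration count are untuned placeholders):

    import numpy as np

    def soft_impute(M_obs, mask, lam=1.0, n_iter=100):
        M = np.where(mask, M_obs, 0.0)                    # start with zeros in the missing positions
        for _ in range(n_iter):
            U, s, Vt = np.linalg.svd(M, full_matrices=False)
            M_low = (U * np.maximum(s - lam, 0.0)) @ Vt   # shrink singular values toward low rank
            M = np.where(mask, M_obs, M_low)              # keep the known entries fixed
        return M_low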
I'm trying to develop a level surface visualizer using this method (don't know if this is the standard method or if there's something better):
1. Take any function f(x,y,z)=k (where k is constant), and bounds for x, y, and z. Also take in two grid parameters stepX and stepZ.
2. To reduce to a level curve problem, iterate from zMin to zMax in stepZ intervals. So f(x,y,z)=k => f(x,y,fixedZ)=k
3. Do the same procedure with stepX, reducing the problem to f(fixedX, y, fixedZ)=k
4. Solve f(fixedX, y, fixedZ) - k = 0 for all values of y which will satisfy that equation (using some kind of a root finding algorithm).
5. For all points generated, plot those as a level curve (the inner loop generates level curves at a given z, then for different z values there are just stacks of level curves)
6. (Optional) Generate a mesh from these level curves/points which belong to the level set.
The problem I'm running into is with step 4. I have no way of knowing before-hand how many possible values of y will satisfy that equation (more specifically, how many unique and real values of y).
Also, I'm trying to keep the program as general as possible so I'm trying to not limit the original function f(x,y,z)=k to any constraints such as smoothness or polynomial other than k must be constant as required for a level surface.
Is there an algorithm (without using a CAS/symbolic solving) which can identify the root(s) of a function even if it has multiple roots? I know that bisection methods have a hard time with this because of the possibility of no sign changes over the region, but how does the secant/newtons method fare? What set of functions can the secant/newtons method be used on, and can it detect and find all unique real roots within two given bounds? Or is there a better method for generating/visualizing level surfaces?
I think I've found the solution to my problem. I did a little bit more research and discovered that level surface is synonymous with isosurface. So in theory something like a marching cubes method should work.
In case you're in need of an example of the Marching Cubes algorithm, check out
http://stemkoski.github.com/Three.js/Marching-Cubes.html
(uses JavaScript/Three.js for the graphics).
For more details on the theory, you should check out the article at
http://paulbourke.net/geometry/polygonise/
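If you want to try it in Python instead, here is a rough sketch using scikit-image's marching cubes (the exact function name varies slightly between scikit-image versions), sampling an example f on a grid and extracting the k-level surface:

    import numpy as np
    from skimage import measure

    def f(x, y, z):
        return x**2 + y**2 + z**2          # example function; level k = 1 gives a sphere

    ax = np.linspace(-2, 2, 64)
    X, Y, Z = np.meshgrid(ax, ax, ax, indexing='ij')
    volume = f(X, Y, Z)

    verts, faces, normals, values = measure.marching_cubes(volume, level=1.0)
    verts = verts * (ax[1] - ax[0]) + ax[0]   # convert voxel coordinates back to x/y/z bounds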
A simple way:
2D: plot (x,y) with color = floor(q*f(x,y)) in grayscale where q is some arbitrary factor.
3D: plot (x, y, floor(q*f(x,y)))
Effectively, heights of the function that are equivalent will be represented on the same level surface.
If you want to get the level curves, you can use the 2D method and edge detection/region categorization to get the points (x,y) on the same level.
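A quick matplotlib sketch of the 2D idea above, with an illustrative f(x, y) and an arbitrary factor q:

    import numpy as np
    import matplotlib.pyplot as plt

    q = 5
    x = np.linspace(-2, 2, 400)
    X, Y = np.meshgrid(x, x)
    F = X**2 - Y**2                         # example f(x, y)

    plt.imshow(np.floor(q * F), extent=(-2, 2, -2, 2), origin='lower', cmap='gray')
    plt.contour(X, Y, F, colors='red')      # matplotlib's own level curves for comparison
    plt.show()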