Is there a standard approach to find related/similar objects?

Suppose I have a set of entities (for example people with their physical characteristics) and I want to find, for a given entity X, all entities related (or similar) to it, for some definition of similarity.
I can easily find such entities for one dimension (all people with height Y ~= X's height within a certain threshold) but is there some approach that I can use to find similar entities considering more than one attribute?

It is going to depend on what you define as similarity, but you can use the same approach you take in 1D in any number of dimensions, with a small generalization. Assuming each element is represented as a vector, you can measure the distance between two vectors x, y as d = |x - y|, and accept/reject depending on this d and some threshold.
Here, the minus operator is vector subtraction:
(a1,a2,...,an)-(b1,b2,...,bn)=(a1-b1,a2-b2,...,an-bn)
and the absolute value is again for vectors:
|(a1,a2,...,an)| = sqrt(a1^2 + a2^2 + ... + an^2).
It is easy to see that this is a generalization of your 1D example: applying the same approach to vectors with a single element gives exactly the 1D case.
The downside of this approach is that (0,0,0,...,0,10^20) and (0,0,0,...,0) will be very far away from each other, which might or might not be what you are after; if not, you might need a different distance metric, but that really depends on what exactly you are after.
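To make this concrete, here is a minimal C++ sketch of the generalized check; the attribute layout and the threshold value are made up for illustration.

#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Euclidean distance between two feature vectors of equal length.
double distance(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}

int main() {
    // Hypothetical feature vectors: [height_cm, weight_kg, age_years].
    std::vector<double> x = {170.0, 70.0, 30.0};
    std::vector<double> y = {172.0, 68.0, 33.0};

    const double threshold = 5.0;  // arbitrary cutoff for "similar"
    bool similar = distance(x, y) <= threshold;
    std::cout << (similar ? "similar" : "not similar") << "\n";
}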


Nearest Neighbor for partially unknown vector

Let's say we have a list of people and would like to find people similar to person X.
The feature vector has 3 items [weight, height, age] and there are 3 people in our list. Note that we don't know the height of person C.
A: [70kg, 170cm, 60y]
B: [60kg, 169cm, 50y]
C: [60kg, ?, 50y]
What would be the best way to find people closest to person A?
My guess
Let's calculate the average value for height and use it in place of the unknown value.
So, let's say we calculated that 170cm is the average height, and redefine person C as [60kg, ~170cm, 50y].
Now we can find the people closest to A; the order will be A, C, B.
Problem
Now, the problem is that we put C, with a guessed ~170cm, ahead of B, with a known 169cm.
It kinda feels wrong. We humans are smarter than machines and know that there's little chance that C is exactly 170cm. So it would be better to put B with 169cm ahead of C.
But how can we calculate that penalty (preferably with a simple, empirical algorithm)? Should we somehow penalise vectors with unknown values? And by how much (maybe calculate the average difference between every two persons' heights in the set)?
And what would that penalisation look like in the general case, where the feature vector has dimension N with K known items and U unknown ones (K + U = N)?
In this particular example, would it be better to use linear regression to fill in the missing values instead of taking the average? That way you may have more confidence in the guessed value and may not need a penalty.
But if you do want a penalty, one idea is to take the ratio of non-missing features. In the example, there are 3 features in total and C has values for 2 of them, so the ratio of non-missing features for C is 2/3. Adjust the similarity score by multiplying it by this ratio. For example, if the similarity between A and C is 0.9, the adjusted similarity is 0.9 * 2/3 = 0.6. The similarity between A and B is not affected, since B has values for all the features and its ratio is 1.
You can also weight the features when computing the ratio. For example, suppose (weight, height, age) get the weights (0.3, 0.4, 0.3) respectively. Then missing the height feature gives a weighted ratio of (0.3 + 0.3) = 0.6. You can see C is penalized even more, since we consider height more important than weight and age.
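A minimal C++ sketch of that adjustment follows; the use of NaN to mark missing values, the particular weights, and the precomputed similarity score are assumptions for illustration, not part of the answer above.

#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Fraction of (weighted) features that are present (not NaN) in the vector.
double known_ratio(const std::vector<double>& v, const std::vector<double>& w) {
    double present = 0.0, total = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        total += w[i];
        if (!std::isnan(v[i])) present += w[i];
    }
    return present / total;
}

int main() {
    const double missing = std::nan("");
    // [weight_kg, height_cm, age_years]; C's height is unknown.
    std::vector<double> C = {60.0, missing, 50.0};
    std::vector<double> weights = {0.3, 0.4, 0.3};  // assumed feature importances

    double similarity_AC = 0.9;                 // similarity computed however you like
    double adjusted = similarity_AC * known_ratio(C, weights);
    std::cout << adjusted << "\n";              // 0.9 * 0.6 = 0.54
}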
I would suggest using the data points for which we have all the known attributes to train a learning model, such as linear regression or a multi-layer perceptron, to predict the unknown attribute, and then use that model to fill in the unknown attributes. Taking the average is a special case of a linear model.
You are interested in the problem of Data Imputation.
There are several approaches to solving this problem, and I am just going to list some:
Mean/Mode/Median Imputation: Imputation is a method to fill in the missing values with estimated ones. The objective is to employ known relationships that can be identified in the valid values of the data set to help estimate the missing values. Mean/mode/median imputation is one of the most frequently used methods. It consists of replacing the missing data for a given attribute with the mean or median (quantitative attribute) or mode (qualitative attribute) of all known values of that variable. This can further be classified into generalized and similar-case imputation. (A small sketch of this approach appears after this list.)
Prediction Model: A prediction model is one of the more sophisticated methods for handling missing data. Here, we create a predictive model to estimate values that will substitute for the missing data. We divide the data set into two parts: one with no missing values for the variable and one with missing values. The first part becomes the model's training set, the second (with missing values) is the test set, and the variable with missing values is treated as the target variable. We then build a model to predict the target variable from the other attributes of the training set and use it to populate the missing values in the test set.
KNN (k-nearest neighbour) Imputation: In this method, a missing value is imputed using a given number of records that are most similar to the record whose value is missing. The similarity of two records is determined using a distance function.
Linear Regression: A linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory (independent) variables denoted X. For prediction, linear regression can be used to fit a predictive model to an observed data set of y and X values. After developing such a model, if an additional value of X is given without its accompanying value of y, the fitted model can be used to predict y. Check this example if you want.
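As a concrete illustration of the simplest of these options, here is a small C++ sketch of mean imputation over a single attribute; marking missing values with NaN is an assumption made for the example.

#include <cmath>
#include <iostream>
#include <vector>

// Replace NaN entries of one attribute with the mean of the known values.
void impute_mean(std::vector<double>& column) {
    double sum = 0.0;
    int known = 0;
    for (double v : column) {
        if (!std::isnan(v)) { sum += v; ++known; }
    }
    if (known == 0) return;              // nothing to base the estimate on
    double mean = sum / known;
    for (double& v : column) {
        if (std::isnan(v)) v = mean;
    }
}

int main() {
    // Heights of A, B, C; C's height is unknown.
    std::vector<double> heights = {170.0, 169.0, std::nan("")};
    impute_mean(heights);
    std::cout << heights[2] << "\n";     // 169.5, the mean of the known heights
}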

When calculating Z Order, how does one implement BIGMIN and LITMAX for more than 2 dimensions?

I'm writing a UB Tree for fun using a Z Order Curve. It is currently capable of storing points in any number of dimensions and when queried it performs a naive search between two Z Indexes, filtering and discarding any false positives. I would like to implement BIGMIN and LITMAX to minimize the number of false positives that it traverses, but I can't seem to find any information on how to implement those in a way that does not limit my tree to storing two dimensional data. For example, both this whitepaper and this blog post describe their implementation in terms that are heavily tied to working with 2D values.
Is there a dimensionality-agnostic way to implement this functionality?
For 2 dimensions you can treat the Z curve value as a base-4 number (a quadkey). IMO, comparing quadkeys digit by digit from left to right is essentially what LITMAX and BIGMIN exploit. For n dimensions, treat it as a base-2^n number, where each digit is one interleaved bit from each of the n coordinates.
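To show what those base-2^n digits look like, here is a small, dimension-agnostic Morton-encoding sketch in C++; it covers only the encoding step (bit interleaving), not the BIGMIN/LITMAX computation itself, and the bit width and bit-order convention are arbitrary choices.

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Interleave the low `bits` bits of each coordinate into one Morton (Z-order) key.
// With n coordinates, each group of n output bits is one base-2^n "digit".
uint64_t morton_encode(const std::vector<uint32_t>& coords, int bits) {
    uint64_t key = 0;
    int out = 0;
    for (int b = 0; b < bits; ++b) {                     // least significant bit upward
        for (std::size_t d = 0; d < coords.size(); ++d) {
            uint64_t bit = (coords[d] >> b) & 1u;
            key |= bit << out++;
        }
    }
    return key;
}

int main() {
    // A 3-D point with small coordinates, using 4 bits per dimension.
    std::vector<uint32_t> p = {5, 3, 7};
    std::cout << morton_encode(p, 4) << "\n";
}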

Spatial sorting Million points in 3d space

I have a collection of a million points in 3D space.
Each point is an object:
struct Point
{
double x;
double y;
double z;
};
The million points are stored in a C++ vector MyPoints in some random order.
I want to sort these million points according to their spatial distribution, such that points which are physically close to each other are also close to each other in my array after sorting.
My first guess on how to do this is as follows: first sort the points along the Z-axis, then along the Y-axis, and then along the X-axis:
MyPointsSortedAlongZ = Sort(MyPoints, AlongZAxis )
MyPointsSortedAlongY = Sort(MyPointsSortedAlongZ , AlongYAxis )
MyPointsSortedAlongX = Sort(MyPointsSortedAlongY , AlongXAxis )
Now firstly, I don't know if this method is correct. Will my final array of points MyPointsSortedAlongX be sorted perfectly spatially (or nearly spatially sorted)?
Secondly, if this method is correct, is it the fastest way to do it? What is a better method?
The CGAL library provides an implementation of a space-filling curve algorithm (its spatial sorting functions) that can be useful for this task.
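If the points are first copied into a CGAL point type, the call looks roughly like the sketch below. This assumes the hilbert_sort function from CGAL's Spatial Sorting package; check the current CGAL documentation for the exact headers and signatures.

#include <vector>
#include <CGAL/Simple_cartesian.h>
#include <CGAL/hilbert_sort.h>

typedef CGAL::Simple_cartesian<double> Kernel;
typedef Kernel::Point_3 Point_3;

int main() {
    // Copy your own Point structs into CGAL points first.
    std::vector<Point_3> pts;
    pts.push_back(Point_3(0.0, 0.0, 0.0));
    pts.push_back(Point_3(10.0, 5.0, 1.0));
    pts.push_back(Point_3(0.1, 0.1, 0.0));

    // Reorder along a Hilbert space-filling curve: nearby points in space
    // tend to end up near each other in the vector.
    CGAL::hilbert_sort(pts.begin(), pts.end());
    return 0;
}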
Well, it really depends on what metric you are going to use to compare two orderings, but consider, for example, the metric given by the sum of distances between adjacent points:
metric(arr) = sum[ d(arr[i],arr[i-1]) | i from 1 to n ]
where d(x,y) is the distance between point x and point y
Note that an optimal (smallest) solution to this metric is basically an optimal (shortest) path that goes through all points. This is the Traveling Salesman Problem (TSP), which is NP-Hard, so there is no known polynomial solution to it.
I'd suggest first defining exactly what metric you will use to compare two orderings.
Then use heuristics or approximations for that metric, such as genetic algorithms or hill climbing, or reduce the problem to TSP and use a known heuristic/approximation for it.
Regarding your method:
It is easy to see that it is not optimal, using this simple (2D) example:
[(1,100),(1,-100),(2,0)]
Let's assume the main sort is by x and the secondary sort by y.
It will give us the 'sorted' vector:
[(1,-100),(1,100),(2,0)]
According to the above metric, we get metric(arr) ~= 300.
However, the order [(1,-100),(2,0),(1,100)] gives metric(arr) ~= 200.
So, the suggested heuristic is not optimal (as expected).
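For reference, here is a small C++ sketch of that adjacency-sum metric, reusing the Point struct from the question and plain Euclidean distance.

#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

struct Point { double x, y, z; };

double dist(const Point& a, const Point& b) {
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// metric(arr) = sum of distances between consecutive points in the ordering.
// Smaller is better; minimizing it exactly is the TSP path problem.
double ordering_metric(const std::vector<Point>& arr) {
    double total = 0.0;
    for (std::size_t i = 1; i < arr.size(); ++i)
        total += dist(arr[i], arr[i - 1]);
    return total;
}

int main() {
    std::vector<Point> sorted = {{1, -100, 0}, {1, 100, 0}, {2, 0, 0}};
    std::vector<Point> better = {{1, -100, 0}, {2, 0, 0}, {1, 100, 0}};
    std::cout << ordering_metric(sorted) << " vs " << ordering_metric(better) << "\n";
    // prints roughly 300 vs 200, matching the example above
}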
Maybe this helps:
A Template for the Nearest Neighbor Problem (DDJ 2001)
Sorting three times on the three axes is a waste. The third sort dominates: the result is essentially just ordered along that last axis, with the earlier sorts surviving only as tie-breakers (assuming stable sorts).

Visualizing Level surfaces

I'm trying to develop a level-surface visualizer using this method (I don't know if this is the standard method or if there's something better):
1. Take any function f(x,y,z)=k (where k is constant), and bounds for x, y, and z. Also take in two grid parameters stepX and stepZ.
2. To reduce this to a level-curve problem, iterate from zMin to zMax in stepZ intervals. So f(x,y,z)=k => f(x,y,fixedZ)=k
3. Do the same procedure with stepX, reducing the problem to f(fixedX, y, fixedZ)=k
4. Solve f(fixedX, y, fixedZ) - k = 0 for all values of y which will satisfy that equation (using some kind of a root finding algorithm).
5. For all points generated, plot those as a level curve (the inner loop generates level curves at a given z, then for different z values there are just stacks of level curves)
6. (Optional) Generate a mesh from these level curves/points which belong to the level set.
The problem I'm running into is with step 4. I have no way of knowing beforehand how many values of y will satisfy that equation (more specifically, how many unique, real values of y).
Also, I'm trying to keep the program as general as possible, so I'm trying not to limit the original function f(x,y,z)=k with constraints such as smoothness or being a polynomial, other than that k must be constant, as required for a level surface.
Is there an algorithm (without using a CAS/symbolic solving) that can identify the roots of a function even when it has multiple roots? I know that bisection methods have a hard time with this because there may be no sign change over the region, but how do the secant/Newton's methods fare? What class of functions can the secant/Newton method be used on, and can it detect and find all unique real roots within two given bounds? Or is there a better method for generating/visualizing level surfaces?
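One common way to handle step 4 without symbolic solving is to sample y on a grid and refine every bracket where the function changes sign; the following C++ sketch illustrates that idea (the sample count and the example function are arbitrary, and roots of even multiplicity, where there is no sign change, can still be missed).

#include <cmath>
#include <functional>
#include <iostream>
#include <vector>

// Scan [yMin, yMax] in small steps; wherever g changes sign, refine the
// bracket with bisection.  Only sign-changing roots are caught, which is
// exactly the limitation mentioned in the question.
std::vector<double> find_roots(const std::function<double(double)>& g,
                               double yMin, double yMax, int samples = 1000) {
    std::vector<double> roots;
    double step = (yMax - yMin) / samples;
    double prevY = yMin, prevG = g(yMin);
    for (int i = 1; i <= samples; ++i) {
        double y = yMin + i * step, gy = g(y);
        if (prevG == 0.0) {
            roots.push_back(prevY);                  // hit a root exactly on the grid
        } else if (prevG * gy < 0.0) {               // sign change: bisect the bracket
            double lo = prevY, hi = y;
            for (int it = 0; it < 60; ++it) {
                double mid = 0.5 * (lo + hi);
                if (g(lo) * g(mid) <= 0.0) hi = mid; else lo = mid;
            }
            roots.push_back(0.5 * (lo + hi));
        }
        prevY = y; prevG = gy;
    }
    return roots;
}

int main() {
    // Example: the unit sphere x^2 + y^2 + z^2 = 1 at fixedX = 0.5, fixedZ = 0.
    auto g = [](double y) { return 0.25 + y * y - 1.0; };
    for (double r : find_roots(g, -2.0, 2.0))
        std::cout << r << "\n";                      // roughly -0.866 and 0.866
}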
I think I've found the solution to my problem. I did a little bit more research and discovered that level surface is synonymous with isosurface. So in theory something like a marching cubes method should work.
In case you're in need of an example of the Marching Cubes algorithm, check out
http://stemkoski.github.com/Three.js/Marching-Cubes.html
(uses JavaScript/Three.js for the graphics).
For more details on the theory, you should check out the article at
http://paulbourke.net/geometry/polygonise/
A simple way:
2D: plot (x,y) with color = floor(q*f(x,y)) in grayscale, where q is some arbitrary factor.
3D: plot (x, y, floor(q*f(x,y))).
Effectively, heights of the function that are equal will be represented on the same level surface.
If you want to get the level curves, you can use the 2D method plus edge detection/region categorization to find the points (x,y) on the same level.
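Here is a minimal C++ sketch of the 2D version, writing a PGM image so that equal quantized values show up as bands of the same shade; the example function and the factor q are arbitrary choices.

#include <cmath>
#include <cstdio>

// Write a grayscale PGM where each pixel's shade is floor(q * f(x, y)),
// so pixels on the same level curve share a shade.
int main() {
    const int W = 256, H = 256;
    const double q = 8.0;                       // arbitrary quantization factor
    std::FILE* out = std::fopen("levels.pgm", "wb");
    std::fprintf(out, "P2\n%d %d\n255\n", W, H);
    for (int j = 0; j < H; ++j) {
        for (int i = 0; i < W; ++i) {
            double x = -2.0 + 4.0 * i / (W - 1);
            double y = -2.0 + 4.0 * j / (H - 1);
            double f = x * x + y * y;           // example function: concentric circles
            int shade = static_cast<int>(std::floor(q * f));
            if (shade < 0) shade = 0;
            if (shade > 255) shade = 255;
            std::fprintf(out, "%d ", shade);
        }
        std::fprintf(out, "\n");
    }
    std::fclose(out);
}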

What's a good weighting function?

I'm trying to perform some calculations on an undirected, cyclic, weighted graph, and I'm looking for a good function to calculate an aggregate weight.
Each edge has a distance value in the range [1,∞). The weighting function should give greater importance to lower distances (it should be monotonically decreasing), and it should assign the value 0 to the distance ∞.
My first instinct was simply 1/d, which meets both of those requirements. (Well, technically 1/∞ is undefined, but programmers tend to let that one slide more easily than do mathematicians.) The problem with 1/d is that the function cares a lot more about the difference between 1/1 and 1/2 than the difference between 1/34 and 1/35. I'd like to even that out a bit more. I could use √(1/d) or ∛(1/d) or even ∜(1/d), but I feel like I'm missing out on a whole class of possibilities. Any suggestions?
(I thought of ln(1/d), but that goes to -∞ as d goes to ∞, and I can't think of a good way to push that up to 0.)
Later:
I forgot a requirement: w(1) must be 1. (This doesn't invalidate the existing answers; a multiplicative constant is fine.)
perhaps:
exp(-d)
edit: something along the lines of
exp(k(1-d)), for real k > 0,
will fit your extra requirement (I'm sure you knew that, but what the hey).
How about 1/ln (d + k)?
Some of the above answers are versions of a Gaussian distribution, which I agree is a good choice. The Gaussian or normal distribution is found often in nature. It is a B-spline basis function of infinite order.
One drawback to using it as a blending function is that its infinite support requires more calculations than a blending function with finite support. A blend is computed as a summation of a product series; in practice the summation may stop when the next term is less than some tolerance.
If possible, build a static table of discrete Gaussian function values, since calculating the values is computationally expensive. Interpolate between table values if needed.
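A minimal sketch of that table idea in C++, assuming an unnormalized Gaussian exp(-d^2/(2*sigma^2)) and simple linear interpolation; the table size, sigma, and the cutoff beyond which the weight is treated as 0 are arbitrary choices.

#include <cmath>
#include <iostream>
#include <vector>

// Precomputed table of exp(-d^2 / (2*sigma^2)) sampled on [0, dMax],
// with linear interpolation between samples.
class GaussianTable {
public:
    GaussianTable(double sigma, double dMax, int samples)
        : dMax_(dMax), table_(samples + 1) {
        for (int i = 0; i <= samples; ++i) {
            double d = dMax * i / samples;
            table_[i] = std::exp(-d * d / (2.0 * sigma * sigma));
        }
    }
    double operator()(double d) const {
        if (d <= 0.0) return table_.front();
        if (d >= dMax_) return 0.0;                 // treat the far tail as zero
        double t = d / dMax_ * (table_.size() - 1);
        int i = static_cast<int>(t);
        double frac = t - i;
        return table_[i] * (1.0 - frac) + table_[i + 1] * frac;
    }
private:
    double dMax_;
    std::vector<double> table_;
};

int main() {
    GaussianTable w(2.0, 10.0, 1024);
    std::cout << w(0.0) << " " << w(2.0) << " " << w(9.9) << "\n";
}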
How about this?
w(d) = (1 + k)/(d + k) for some large k
d = 2 + k would be the place where w(d) = 1/2
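For a quick feel of how these candidates compare, here is a throwaway C++ snippet that prints several of the proposed functions side by side (the k values are arbitrary); each satisfies w(1) = 1 and tends to 0 as d grows.

#include <cmath>
#include <cstdio>
#include <initializer_list>

int main() {
    const double k = 5.0;                    // arbitrary; larger k flattens the curve
    const double ke = std::exp(1.0) - 1.0;   // makes 1/ln(d + ke) equal 1 at d = 1
    std::printf("%8s %10s %10s %12s %14s %12s\n",
                "d", "1/d", "sqrt(1/d)", "exp(k(1-d))", "(1+k)/(d+k)", "1/ln(d+ke)");
    for (double d : {1.0, 2.0, 5.0, 10.0, 100.0, 1000.0}) {
        std::printf("%8.0f %10.4f %10.4f %12.4g %14.4f %12.4f\n",
                    d, 1.0 / d, std::sqrt(1.0 / d),
                    std::exp(k * (1.0 - d)), (1.0 + k) / (d + k),
                    1.0 / std::log(d + ke));
    }
}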
It seems you are in effect looking for a linear decrease, something along the lines of infinity - d. Obviously this solution is garbage, but since you are probably not using an arbitrary-precision data type for the distance, you could use yourDatatype.MaxValue - d to get a linearly decreasing function.
In fact, you might consider using (yourDatatype.MaxValue - d) + 1 if you are using doubles, because you could then assign the weight 0 if your distance is "infinity" (since doubles actually have a value for that).
Of course, you still have to consider implementation details like w(d) = double.infinity or w(d) = integer.MaxValue, but these should be easy to spot if you know the actual data types you are using ;)

Resources