Algorithm: understand when two line charts are similar

I am trying to develop a script capable of understanding when two line charts are similar (they have a similar direction or similar values).
For instance suppose I have two arrays:
array1 = [0,1,2,3,4,5,6,7,8,9,10];
array2 = [2,3,4,5,6,7,8,8,10,11,12];
As you can see, they both grow and their values are quite similar.
At the moment I have found a perfectly working solution using a DTW (dynamic time warping) algorithm.
The problem is that DTW has a very fast "training" part (I just have to store a lot of line charts), but a heavy prediction part, because it compares the last line chart with all the others in memory.
So my question is: is it possible to move the computational cost to the training part in order to get a faster prediction?
For example creating a search tree or something like that?
And if it is possible, according to which specific value can I cluster the information?
Do you have any advice or useful links?

It is often possible by mapping the objects from your domain to a linear space. For example, you can see how that works for word embeddings in natural languages (word2vec tutorial, skip to "Visualizing the Learned Embeddings"). In this setting, similarity between objects is defined by a distance in the linear space, which is very fast to compute.
How complex the mapping should be in your case greatly depends on your data: how diverse the charts are and what kind of similarity you wish to capture.
In your example with two vectors, it's possible to compute a single value: the slope of the regression line. This will probably work if your charts are "somewhat linear" in nature. If you'd like to capture sinusoidal patterns as well, you can try to normalize the time series by subtracting the first value. Again, in your particular example it'll show a perfect fit.
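As a minimal sketch of the single-value mapping, assuming NumPy and equal-length series stored as plain lists (the helper name is mine, not from the question):

    import numpy as np

    def slope_feature(series):
        # Map a line chart to one number: the slope of its least-squares fit.
        x = np.arange(len(series))
        slope, _intercept = np.polyfit(x, series, deg=1)
        return slope

    array1 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    array2 = [2, 3, 4, 5, 6, 7, 8, 8, 10, 11, 12]

    # Similar slopes suggest a similar direction, and comparing slopes is O(1) per pair.
    print(slope_feature(array1), slope_feature(array2))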
Bottom line: complexity of the mapping is determined by the complexity of the data.

If they always have the same length, Pearson correlation should be much more appropriate, and much faster.
If you standardize your vectors, Pearson correlation is equivalent to Euclidean distance, and you can use any multidimensional search tree for further acceleration.
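A minimal sketch of that idea, assuming equal-length charts stored as rows of a NumPy array and using scipy's cKDTree (the library choice and the random data are my assumptions):

    import numpy as np
    from scipy.spatial import cKDTree

    def standardize(rows):
        # Z-score each series, so Euclidean distance on the rows mirrors Pearson correlation.
        rows = np.asarray(rows, dtype=float)
        return (rows - rows.mean(axis=1, keepdims=True)) / rows.std(axis=1, keepdims=True)

    # "Training": standardize the stored charts once and index them in a k-d tree.
    stored = standardize(np.random.rand(1000, 11))      # e.g. 1000 charts of length 11
    tree = cKDTree(stored)

    # "Prediction": one tree query instead of a DTW comparison against every stored chart.
    query = standardize([[2, 3, 4, 5, 6, 7, 8, 8, 10, 11, 12]])[0]
    distance, index = tree.query(query, k=1)
    print(index, distance)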

Related

Select relevant features with PCA and K-Means

I am trying to understand PCA and K-Means algorithms in order to extract some relevant features from a set of features.
I don't know which branch of computer science studies these topics; it seems there aren't good resources on the internet, just some papers that I don't understand well. An example of such a paper: http://www.ifp.illinois.edu/~qitian/e_paper/icip02/icip02.pdf
I have CSV files of people's walks composed as follows:
TIME, X, Y, Z; these values are recorded by the accelerometer.
What I did
I transformed the dataset into a table in Python.
I used tsfresh, a Python library, to extract a vector of features from each walk; there are a lot of them, 2000+ features per walk.
I have to use PFA (Principal Feature Analysis) to select the relevant features from the set of feature vectors.
In order to do the last step, I have to reduce the dimensionality of the walks' feature set with PCA (PCA will make the data different from the original, because it transforms the data using the eigenvectors and eigenvalues of the covariance matrix of the original data). Here I have the first question:
How should the input of PCA look? Are the rows the walks and the columns the features, or vice versa (the rows are the features and the columns are the walks)?
After I reduce this data, I should use the K-Means algorithm on the reduced 'features' data. How should the input to K-Means look? And what is the purpose of using this algorithm? All I know is that this algorithm is used to 'cluster' data, so in each cluster there are some 'points' grouped by some rule. What I did and what I think is:
If the input to PCA has the walks as rows and the features as columns, then for K-Means I should swap rows and columns, because that way each point is a feature (but this is not the original feature data, it's the reduced one, so I don't know). Then for each cluster I check, using Euclidean distance, which point has the lowest distance from the centroid and select that feature. So how many clusters should I declare? If I declare as many clusters as there are features, I will always extract the same number of features. And how can I say that a point in the reduced data corresponds to a particular feature in the original feature set?
I know that what I am saying may not be correct, but I am trying to understand it. Can some of you help me? Am I on the right track? Thanks!
For PCA, make sure you separate your understanding of the method the algorithm uses (eigenvectors and such) from its result. The result is a linear mapping from the original space A to a space A', where the dimension (the number of features, in your case) is possibly smaller than in the original space A.
So the first feature/element in space A' is a linear combination of the features of A.
Whether the rows or the columns hold the features depends on the implementation, but if you use scikit-learn's PCA, the rows are the samples (your walks) and the columns are the features.
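A quick sanity check of that orientation, as a sketch (scikit-learn assumed, with a made-up feature matrix standing in for the tsfresh output):

    import numpy as np
    from sklearn.decomposition import PCA

    # Rows are walks (samples), columns are the 2000+ tsfresh features.
    X = np.random.rand(50, 2000)          # hypothetical: 50 walks, 2000 features each

    pca = PCA(n_components=10)
    X_reduced = pca.fit_transform(X)      # shape (50, 10): same walks, fewer columns
    print(X.shape, '->', X_reduced.shape)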
You can feed the PCA output, the A' space, to K-Means, and it will cluster the points in this space of (usually) reduced dimension.
Each point will be part of a cluster, and the idea is that if you computed K-Means on A, you would probably end up with the same or similar clusters as with A'. Computationally, A' is a lot cheaper. You now have a clustering on both A' and A, since we agree that points that are similar in A' are also similar in A.
The number of clusters is difficult to answer; if you don't know anything else, look up the elbow method. But say you want to get a sense of the different types of things you have: I'd argue for 3-8 clusters and not too many more; compare the 2-3 points closest to each center, and you have something consumable. The number of clusters is not tied to the number of dimensions and can even be larger. For example, if we want to know the densest areas in some 2D region, you can easily have 50 clusters, to get a sense of where 50 cities could be. Here the number of clusters is way higher than the space dimension, and it makes sense.
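As a sketch of the clustering step plus the elbow heuristic mentioned above (scikit-learn assumed; the random matrix stands in for the PCA output):

    import numpy as np
    from sklearn.cluster import KMeans

    X_reduced = np.random.rand(50, 10)    # stands in for the PCA output: 50 walks, 10 components

    # Elbow heuristic: try several cluster counts and watch where inertia stops dropping sharply.
    for k in range(2, 9):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_reduced)
        print(k, km.inertia_)

    # Once k is chosen, every walk gets a cluster label in A'.
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_reduced)
    print(labels)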

How to simplify a spline?

I have an interesting algorithmic challenge in a project I am working on. I have a sorted list of coordinate points pointing at buildings on either side of a street that, sufficiently zoomed in, looks like this:
I would like to take this zigzag and smooth it out to linearize the underlying street.
I can think of a couple of solutions:
Calculate centroids using rolling averages of six or so points, and use those.
Spline regression.
Is there a better or best way to approach this problem? (I am using Python 3.5)
Based on your description and your comments, you are looking for a line simplification algorithm.
The Ramer-Douglas-Peucker algorithm (suggested in the comments) is probably the best-known algorithm in this family, but there are many more.
For example, Visvalingam's algorithm works by repeatedly removing the point whose removal causes the smallest change, measured as the area of the triangle formed by the point and its two neighbours. This makes it super easy to code and intuitively understandable. If the research paper is hard to read, you can read this easy article.
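A naive sketch of the Visvalingam idea (quadratic in the number of points, for readability; the point list is a made-up example):

    def triangle_area(a, b, c):
        # Area of the triangle formed by a point and its two neighbours (cross product / 2).
        return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

    def visvalingam(points, keep):
        # Repeatedly drop the interior point whose triangle has the smallest area.
        pts = list(points)
        while len(pts) > max(keep, 2):
            areas = [triangle_area(pts[i - 1], pts[i], pts[i + 1])
                     for i in range(1, len(pts) - 1)]
            del pts[areas.index(min(areas)) + 1]   # +1: areas[0] belongs to pts[1]
        return pts

    zigzag = [(0, 0), (1, 1), (2, 0), (3, 1), (4, 0), (5, 1), (6, 0)]
    print(visvalingam(zigzag, keep=3))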
Other algorithms in this family are:
Opheim
Lang
Zhao
Read about them, understand what they are trying to minimize, and select the one most suitable for you.
Dali's post correctly surmises that a line simplification algorithm is useful for this task. Before posting this question I actually examined a few such algorithms but wasn't quite comfortable with them because even though they resulted in the simplified geometry that I liked, they didn't directly address the issue I had of points being on either side of the feature and never in the middle.
Thus I used a two-step process:
I computed the centroids of the polyline by using a rolling average of the coordinates of the five surrounding points (a sketch follows below). This didn't help much with smoothing the function, but it did mostly succeed in remapping the points to the middle of the street.
I applied Visvalingam’s algorithm to the new polyline, with n=20 points specified (using this wonderful implementation).
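A sketch of that first step, assuming the polyline is a NumPy array of (x, y) coordinates and the five-point window described above:

    import numpy as np

    def rolling_centroids(coords, window=5):
        # Replace each point by the mean of the `window` surrounding points.
        coords = np.asarray(coords, dtype=float)
        kernel = np.ones(window) / window
        return np.column_stack([
            np.convolve(coords[:, 0], kernel, mode='valid'),
            np.convolve(coords[:, 1], kernel, mode='valid'),
        ])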
The result wasn't quite perfect, but it was good enough.
Thanks for the help everyone!

How to find the nearest neighbor of a sparse vector

I have about 500 vectors; each vector has 1500 dimensions,
and almost every vector is very sparse -- I mean only about 30-70 dimensions of each vector are non-zero.
Now, the problem is that given a vector, also with 1500 dimensions, I need to compare it to the 500 vectors to find which of the 500 is the nearest one (in Euclidean distance).
There is no doubt that the brute-force method is a solution, but I need to calculate the distance 500 times, which takes a long time.
Yesterday I read the article "Object retrieval with large vocabularies and fast spatial matching"; it says that using an inverted index will help.
But after my test, it made almost no difference: imagine a 1500-dimensional vector in which 50 of the dimensions are non-zero; compared with another such vector, they will almost always share at least one non-zero dimension. In other words, this approach can only rule out a few vectors, and I still need to compare against most of the vectors that are left.
Thank you for reading this far. My questions are:
1. Does this algorithm make sense?
2. Is there any other way to do what I want to do, such as FLANN or a k-d tree?
But I want the exact nearest neighbor; an approximate one is not enough.
This kind of index is called inverted lists, and is commonly used for text.
For example, Apache Lucene uses this kind of indexing for text similarity search.
Essentially, you use a columnar layout, and you only store the non-zero values. For on-disk efficiency, various compression techniques can be employed.
You can then compute many similarities using set operations on these lists.
k-d-trees cannot be used here. They will be extremely inefficient if you have many duplicate (zero) values.
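A rough sketch of inverted lists over the 500 sparse vectors, assuming they are stored as dictionaries mapping dimension -> value (that storage format and the helper names are my assumptions). It returns the exact Euclidean nearest neighbour, because dist^2 = |q|^2 + |v|^2 - 2*(q.v), and the dot product q.v is non-zero only for vectors that share a dimension with the query:

    from collections import defaultdict
    import math

    def build_index(vectors):
        # Inverted lists: for every dimension, the vectors with a non-zero value there,
        # plus each vector's squared norm (computed once, at "training" time).
        index = defaultdict(list)
        norms_sq = []
        for vec_id, vec in enumerate(vectors):
            norms_sq.append(sum(v * v for v in vec.values()))
            for dim, value in vec.items():
                index[dim].append((vec_id, value))
        return index, norms_sq

    def nearest(query, index, norms_sq):
        # dist^2 = |q|^2 + |v|^2 - 2*(q.v); only shared dimensions contribute to q.v.
        q_norm_sq = sum(v * v for v in query.values())
        dots = defaultdict(float)
        for dim, q_val in query.items():
            for vec_id, v_val in index.get(dim, []):
                dots[vec_id] += q_val * v_val
        best_id, best_dist_sq = None, float('inf')
        for vec_id, norm_sq in enumerate(norms_sq):
            dist_sq = q_norm_sq + norm_sq - 2.0 * dots[vec_id]
            if dist_sq < best_dist_sq:
                best_id, best_dist_sq = vec_id, dist_sq
        return best_id, math.sqrt(max(best_dist_sq, 0.0))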
I don't know your context, but if you don't mind a long preprocessing step and you have to make this check often and fast, you can build a neighborhood graph and sort the neighbors by distance.
To build this graph efficiently, you can use the taxicab distance or the squared distance to sort the points by distance (this avoids heavier calculations).
Then, if you want the nearest neighbor, you just have to pick the first neighbor :p.

Comparison of the runtime of Nearest Neighbor queries on different data structures

Given n points in d-dimensional space, there are several data structures, such as k-d trees, quadtrees, etc., to index the points. On these data structures it is possible to implement straightforward algorithms for nearest neighbor queries around a given input point.
Is there a book, paper, survey, ... that compares the theoretical (mostly expected) runtime of the nearest neighbor query on different data structures?
The data I am looking at is composed of fairly small point clouds, so it can all be processed in main memory. For the sake of simplicity, I assume the data to be uniformly distributed. That is, I am not interested in real-world performance, but rather in theoretical results.
You leave the dimension of the points undefined and only give an approximation of the number of points. What does "small" mean? What one person means by "small" is relative.
What you are searching for, of course, doesn't exist. Your question is pretty much this:
Question:
For a small (whatever "small" means to you) dataset, of any dimension, with data that follows a uniform distribution, what's the optimal data structure to use?
Answer:
There is no such data structure.
Wouldn't it be too strange to have an answer to that? A (false) analogy would be to treat this question as a synonym of the "Which is the optimal programming language?" question that most first-year undergrads ask. Your question is not that naive, but it's walking down the same track.
Why is there no such data structure?
Because the dimension of the dataset is variable. You might have a dataset in 2 dimensions, but you might also have a dataset in 1000 dimensions, or, even better, a dataset in 1000 dimensions with an intrinsic dimension that is much less than 1000. Think about it: could one propose a data structure that would behave equally well for the three datasets I mentioned? I doubt it.
In fact, there are some data structures that behave really nicely in low dimensions (quadtrees and k-d trees, for example), while others do much better in higher dimensions (an RKD-tree forest, for instance).
Moreover, the algorithms and the expectations used for nearest neighbour search depend heavily on the dimension of the dataset, as well as on its size and the nature of the queries (for example, a query that is far from the dataset or equidistant from many of its points will probably result in slow search performance).
In lower dimensions, one would perform an exact k-nearest neighbour (k-NN) search. In higher dimensions, it would be wiser to perform an approximate k-NN search. In that case, we accept the following trade-off:
Speed vs. accuracy
What happens is that we achieve faster execution of the program by sacrificing the correctness of our result. In other words, our search routine will be relatively fast, but it may not return the true NN (how likely that is depends on many parameters, such as the nature of your problem and the library you are using); instead it returns an approximation of the exact NN. For example, it might not find the exact NN but the third NN to the query point. You could also check the approximate-nn-searching wiki tag.
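For a concrete feel of that trade-off, scipy's k-d tree exposes an eps parameter: the returned neighbour is guaranteed to be within a factor of (1 + eps) of the true nearest distance (a sketch with random low-dimensional data; the setup is my own):

    import numpy as np
    from scipy.spatial import cKDTree

    points = np.random.rand(100000, 3)          # low-dimensional, uniformly distributed data
    tree = cKDTree(points)
    query = np.random.rand(3)

    exact_dist, exact_idx = tree.query(query, k=1)              # exact NN
    approx_dist, approx_idx = tree.query(query, k=1, eps=0.5)   # within (1 + eps) of the true distance
    print(exact_dist, approx_dist)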
Why not always search for the exact NN? Because of the curse of dimensionality, which causes the exact solutions that work in lower dimensions to behave no better than brute force (searching all the points in the dataset for every query).
You see, my answer has already gotten long, so I should stop here. Your question is too broad, but interesting, I must admit. :)
In conclusion, which data structure (and algorithm) is optimal depends on your problem. The size of the dataset you are handling, the dimension and the intrinsic dimension of the points play a key role. The number and the nature of the queries also play an important role.
For nearest neighbor searches of potentially non-uniform point data, I think a k-d tree will give you the best performance in general. As far as broad overviews and theoretical cost analyses go, I think Wikipedia is an OK place to start, but keep in mind it does not cover much real-world optimization:
http://en.wikipedia.org/wiki/Nearest_neighbor_search
http://en.wikipedia.org/wiki/Space_partitioning
Theoretical performance is one thing but real world performance is something else entirely. Real world performance depends as much on the details of the data structure implementation as it does on the type of data structure. For example, a pointer-less (compact array) implementation can be many times faster than a pointer-based implementation because of improved cache coherence and faster data allocation. Wider branching may be slower in theory but faster in practice if you leverage SIMD to test several branches simultaneously.
Also the exact nature of your point data can have a big impact on performance. Uniform distributions are less demanding and can be handled quickly with simpler data structures. Non-uniform distributions require more care. (Kd-trees work well for both uniform and non-uniform data.) Also, if your data is too large to process in-core then you will need to take an entirely different approach compared to smaller data sets.
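A small sketch of how such a comparison can be run empirically with scipy on uniform, in-memory data (the sizes and library choice are my assumptions):

    import time
    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.spatial.distance import cdist

    rng = np.random.default_rng(0)
    points = rng.random((100000, 3))          # uniform point cloud, fits in main memory
    queries = rng.random((200, 3))

    tree = cKDTree(points)                    # build once; query many times
    t0 = time.perf_counter()
    tree.query(queries, k=1)
    t1 = time.perf_counter()

    t2 = time.perf_counter()
    cdist(queries, points).argmin(axis=1)     # brute force: all pairwise distances
    t3 = time.perf_counter()

    print('k-d tree queries: %.3fs  brute force: %.3fs' % (t1 - t0, t3 - t2))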

How do I extend a support vector machine algorithm to a high dimensional data set?

I'm trying to implement an SVM algorithm, but I'm having a hard time understanding how d-dimensional data sets are actually handled. In my particular case, each 'point' has nearly 400 identifying features.
In two-dimensional space, it basically tries to find a line between the two groups that maximizes the margin to the nearest points on either side.
There is a similar question here, but I'm not getting it. I sort of get how the separation would occur after you have the classifier, but I'm lost on how to actually get the classifier.
If you can imagine how the line of the 2D case becomes a d-dimensional hyperplane in higher dimensions, then you are pretty much done. The actual classification occurs when you test a point against the hyperplane: the decision function gives you a positive number if the point belongs to class 1 and a negative one if it belongs to class 2.
Notice that in the decision function there is no restriction on the dimension of each point: f(x) = sign(w · x - b), where the weight vector w and the point x can have any number of dimensions. (Formula courtesy of Wikipedia.)
And in case you are curious about what happens in the non-linear case when you use the kernel trick, I would like to share a video that illustrates the idea very well:
http://www.youtube.com/watch?v=3liCbRZPrZA
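A small sketch of these mechanics with scikit-learn on made-up 400-feature data (the library choice, data, and labels are assumptions, not from the question):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 400))             # 200 points, each with 400 features
    y = (X[:, 0] + X[:, 1] > 0).astype(int)     # hypothetical binary labels

    clf = SVC(kernel='linear').fit(X, y)        # the 2D "line" becomes a 400-dimensional hyperplane

    new_point = rng.normal(size=(1, 400))
    print(clf.decision_function(new_point))     # signed value: its sign decides the class
    print(clf.predict(new_point))

    # The kernel trick only changes how the separating surface is represented:
    clf_rbf = SVC(kernel='rbf').fit(X, y)
    print(clf_rbf.predict(new_point))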
