Using ELKI MiniGUI to create spatial KNN for spatial outlier detection of attribute values

I'm having difficulty using the ELKI MiniGUI to run spatial outlier detection algorithms. Many of the algorithms require a list of the k nearest neighbors (KNN) for each object in the database. It appears that a KNN label list first needs to be created from the spatial-coordinate database only, not including the attributes. Then, I suppose, the spatial outlier detection algorithms are run on the attribute database along with the external file of spatial KNN.
My Java experience is limited, so I would like to use ELKI from the command line and use the MiniGUI to assemble the command for each task. However, with the MiniGUI I have only been able to create, or materialize, external files for (1) the triangular distance matrix and (2) the KNN distance order, which seems to include the object itself as one of its KNN. What I really seem to need is an external file, or cached data, listing each object and its spatial neighbors. Maybe a KNN query, a KNN join, precomputed distances, or a preprocessed database filter would be helpful, but I really don't know.
What steps are needed to create and use the files, or cached data, that supply the spatial KNN relation used by the spatial outlier detection algorithms to relate each object's attributes to those of its neighbors? I am unclear how to do this with the MiniGUI, especially since it looks like the spatial neighborhood relation needs to be created first, before it can be used with the spatial outlier detection algorithm and the attribute database.
Any advice is greatly appreciated.
Thanks!

Thank you for contributing a how-to to the ELKI wiki!
How to perform geo-spatial outlier detection with an external neighborhood specification
It is a nice step-by-step introduction to using ELKI, and I hope others will find it useful.
Posting it as an "answer" here so that others can find it easily.
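For anyone who wants to see the general shape of the precomputation step, here is a rough, self-contained sketch (plain Java, deliberately not using the ELKI API) of building such a spatial-neighborhood file by brute force. The input layout (`label,x,y` in `coordinates.csv`, no header row), the value of `k`, and the output format (one object label per line followed by its k neighbor labels) are assumptions for illustration only; the exact file format that ELKI's external neighborhood loader expects is described in the wiki how-to linked above.

```java
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Brute-force k-nearest-neighbor computation over 2D coordinates,
 *  written out as one line per object: label followed by its k neighbor labels. */
public class PrecomputeNeighborhood {
    public static void main(String[] args) throws Exception {
        int k = 10;                                   // assumed neighborhood size
        List<String> lines = Files.readAllLines(Paths.get("coordinates.csv"));
        int n = lines.size();
        String[] label = new String[n];
        double[] x = new double[n], y = new double[n];
        for (int i = 0; i < n; i++) {                 // assumed layout: label,x,y
            String[] f = lines.get(i).split(",");
            label[i] = f[0];
            x[i] = Double.parseDouble(f[1]);
            y[i] = Double.parseDouble(f[2]);
        }
        try (PrintWriter out = new PrintWriter("neighborhood.txt")) {
            for (int i = 0; i < n; i++) {
                List<Integer> others = new ArrayList<>();
                for (int j = 0; j < n; j++) {
                    if (j != i) others.add(j);        // exclude the object itself
                }
                final int q = i;
                others.sort(Comparator.comparingDouble(
                        j -> (x[j] - x[q]) * (x[j] - x[q]) + (y[j] - y[q]) * (y[j] - y[q])));
                StringBuilder sb = new StringBuilder(label[i]);
                for (int m = 0; m < Math.min(k, others.size()); m++) {
                    sb.append(' ').append(label[others.get(m)]);
                }
                out.println(sb);
            }
        }
    }
}
```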

Related

Algorithm for 2D nearest-neighbour queries with dynamic points

I am trying to find a fast algorithm for finding the (approximate, if need be) nearest neighbours of a given point in a two-dimensional space where points are frequently removed from the dataset and new points are added.
(Relatedly, there are two variants of this problem that interest me: one in which points can be thought of as being added and removed randomly and another in which all the points are in constant motion.)
Some thoughts:
kd-trees offer good performance, but are only suitable for static point sets
R*-trees seem to offer good performance for a variety of dimensions, but the generality of their design (arbitrary dimensions, general content geometries) suggests the possibility that a more specific algorithm might offer performance advantages
Algorithms with existing implementations are preferable (though this is not necessary)
What's a good choice here?
I agree with (almost) everything that @gsamaras said, just to add a few things:
In my experience (using large datasets with >= 500,000 points), the kNN performance of KD-trees is worse than that of pretty much any other spatial index by a factor of 10 to 100. I tested them (2 KD-trees and various other indexes) on a large OpenStreetMap dataset. In the referenced diagram, the KD-trees are called KDL and KDS, and the 2D dataset is called OSM-P (left diagram); the diagram is taken from this document, see the bullet points below for more information.
This research describes an indexing method for moving objects, in case you keep (re-)inserting the same points in slightly different positions.
Quadtrees are not too bad either, they can be very fast in 2D, with excellent kNN performance for datasets < 1,000,000 entries.
If you are looking for Java implementations, have a look at my index library. It has implementations of quadtrees, the R*-tree, the PH-tree, and others, all with a common API that also supports kNN. The library was written for TinSpin, which is a framework for testing multidimensional indexes. Some results can be found here (the page doesn't really describe the test data, but the 'OSM-P' results are based on OpenStreetMap data with up to 50,000,000 2D points).
Depending on your scenario, you may also want to consider PH-trees. They appear to be slower for kNN queries than R-trees in low dimensionality (though still faster than KD-trees), but they are faster than R-trees for removals and updates. If you have a lot of removals/insertions, this may be a better choice (see the TinSpin results, Figures 2 and 46). C++ versions are available here and here.
Check the Bkd-Tree, which is:
an I/O-efficient dynamic data structure based on the kd-tree. [..] the Bkd-tree maintains its high space utilization and excellent query and update performance regardless of the number of updates performed on it.
However, this data structure is multidimensional and not specialized to lower dimensions (just like the kd-tree).
Play with it in bkdtree.
Dynamic quadtrees can also be a candidate, with O(log n) query time and O(Q(n)) insertion/deletion time, where Q(n) is the time to perform a query in the data structure used. Note that this data structure is specialized for 2D. For 3D, however, we have octrees, and in a similar way the structure can be generalized to higher dimensions.
An implementation is QuadTree.
The R*-tree is another choice, but I agree with you on the generality. An R*-tree implementation exists too.
A cover tree could be considered as well, but I am not sure whether it fits your description. Read more here, and check the implementation in CoverTree.
The kd-tree should still be considered, since its performance is remarkable in 2 dimensions, and its insertion complexity is logarithmic in the size of the data.
nanoflann and CGAL are just two implementations of it; the first requires no installation, while the second does, but may be more performant.
In any case, I would try more than one approach and benchmark them (since all of them have implementations, and these data structures are usually affected by the nature of your data). A trivially correct brute-force baseline to benchmark and verify against is sketched below.
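The sketch below is a minimal baseline of that kind (plain Java; the point count and query are made up for illustration). It keeps points in a flat list, supports insertion and removal, and answers kNN by a full scan, so it is O(n) per query but exact, which also makes it useful for checking the results of the real spatial indexes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

/** Flat-list 2D point set: O(1) insert, O(n) remove and kNN query.
 *  Useful as an exact baseline when benchmarking real spatial indexes. */
class BruteForceIndex {
    private final List<double[]> points = new ArrayList<>();   // each entry is {x, y}

    void insert(double x, double y) { points.add(new double[]{x, y}); }

    void remove(double x, double y) {
        points.removeIf(p -> p[0] == x && p[1] == y);
    }

    List<double[]> kNearest(double qx, double qy, int k) {
        List<double[]> copy = new ArrayList<>(points);
        copy.sort(Comparator.comparingDouble(
                p -> (p[0] - qx) * (p[0] - qx) + (p[1] - qy) * (p[1] - qy)));
        return copy.subList(0, Math.min(k, copy.size()));
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        BruteForceIndex index = new BruteForceIndex();
        for (int i = 0; i < 100_000; i++) {
            index.insert(rnd.nextDouble(), rnd.nextDouble());
        }
        long t0 = System.nanoTime();
        List<double[]> nn = index.kNearest(0.5, 0.5, 10);
        long t1 = System.nanoTime();
        System.out.printf("10-NN of (0.5, 0.5): %d results in %.2f ms%n",
                nn.size(), (t1 - t0) / 1e6);
    }
}
```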

Which classification algorithm can be used for document categorization?

Hey, here is my problem:
Given a set of documents I need to assign each document to a predefined category.
I was going to use the n-gram approach to represent the text-content of each document and then train an SVM classifier on the training data that I have.
Please correct me if I have misunderstood something.
The problem now is that the categories should be dynamic, meaning my classifier should handle new training data with a new category.
So for example, if I trained a classifier to classify a given document as category A, category B, or category C, and then I was given new training data with category D, I should be able to incrementally train my classifier by providing it with the new training data for "category D".
To summarize, I do NOT want to combine the old training data (with 3 categories) and the new training data (with the new/unseen category) and train my classifier again. I want to train my classifier on the fly.
Is this possible to implement with an SVM? If not, could you recommend some classification algorithms, or any book/paper that can help me?
Thanks in advance.
Naive Bayes is a relatively fast incremental classification algorithm.
KNN is also incremental by nature, and even simpler to implement and understand.
Both algorithms are implemented in the open-source project Weka, as NaiveBayes and IBk for KNN.
However, from personal experience, they are both vulnerable to a large number of non-informative features (which is usually the case with text classification), so some kind of feature selection is usually used to squeeze better performance out of these algorithms, and that can be problematic to implement incrementally.
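As a concrete illustration, here is a minimal sketch using Weka's updateable variant NaiveBayesUpdateable (IBk can be used the same way, since both implement Weka's UpdateableClassifier interface). The two numeric features and the class labels are made up for illustration. One caveat that matters for the question: the set of class values is fixed when the dataset header is created, so a genuinely new category such as "D" must already be declared in the header (or the header rebuilt) before incremental updates can use it.

```java
import java.util.ArrayList;
import java.util.Arrays;

import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class IncrementalTextDemo {
    public static void main(String[] args) throws Exception {
        // Two made-up numeric features (e.g. n-gram counts) plus a nominal class.
        ArrayList<Attribute> atts = new ArrayList<>();
        atts.add(new Attribute("feat1"));
        atts.add(new Attribute("feat2"));
        atts.add(new Attribute("category", Arrays.asList("A", "B", "C", "D")));
        Instances data = new Instances("docs", atts, 0);
        data.setClassIndex(data.numAttributes() - 1);

        NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
        nb.buildClassifier(data);                       // start from the header only

        // Initial training examples for categories A, B, C (last value = class index).
        double[][] train = { {1, 0, 0}, {0, 1, 1}, {5, 5, 2} };
        for (double[] row : train) {
            DenseInstance inst = new DenseInstance(1.0, row);
            inst.setDataset(data);
            nb.updateClassifier(inst);                  // incremental update
        }

        // Later: new training data for category D arrives; just keep updating.
        DenseInstance newD = new DenseInstance(1.0, new double[]{9, 9, 3});
        newD.setDataset(data);
        nb.updateClassifier(newD);

        DenseInstance query = new DenseInstance(1.0, new double[]{8, 8, 0});
        query.setDataset(data);
        int predicted = (int) nb.classifyInstance(query);
        System.out.println("Predicted: " + data.classAttribute().value(predicted));
    }
}
```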
This blog post by Edwin Chen describes infinite mixture models to do clustering. I think this method supports automatically determining the number of clusters, but I am still trying to wrap my head all the way around it.
The class of algorithms that matches your criteria is called "incremental algorithms". There are incremental versions of almost any method. The easiest to implement is naive Bayes.

Appropriate clustering method for 1 or 2 dimensional data

I have a set of data I have generated that consists of extracted mass values (well, m/z, but that is not so important) and a time. I extract the data from a file; however, it is possible to get repeat measurements, and this results in a large amount of redundancy within the dataset. I am looking for a method to cluster these in order to group those that are related, based on either similarity in mass alone or similarity in mass and time.
An example of data that should be group together is:
m/z time
337.65 1524.6
337.65 1524.6
337.65 1604.3
However, I have no way to determine how many clusters I will have. Does anyone know of an efficient way to accomplish this, possibly using a simple distance metric? I am not familiar with clustering algorithms sadly.
http://en.wikipedia.org/wiki/Cluster_analysis
http://en.wikipedia.org/wiki/DBSCAN
Read the section about hierarchical clustering, and also look into DBSCAN if you really don't want to specify the number of clusters in advance. You will need to define a distance metric, and that step is where you determine which feature, or combination of features, you will be clustering on; a small sketch of such a metric follows.
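For example, a distance over (m/z, time) pairs might scale each dimension by its own tolerance so that a value of 1.0 means "one tolerance apart" in either dimension. A minimal sketch; the tolerances 0.1 m/z and 60 time units are made-up values to be replaced by domain knowledge:

```java
/** Distance between two (m/z, time) measurements, with each dimension
 *  scaled by its own tolerance. The tolerances below are illustrative only. */
public class MzTimeDistance {
    static final double MZ_TOL = 0.1;     // assumed m/z tolerance
    static final double TIME_TOL = 60.0;  // assumed time tolerance

    static double distance(double mz1, double t1, double mz2, double t2) {
        double dMz = (mz1 - mz2) / MZ_TOL;
        double dT  = (t1 - t2) / TIME_TOL;
        return Math.sqrt(dMz * dMz + dT * dT);
    }

    public static void main(String[] args) {
        // The two example rows from the question that differ only in time.
        System.out.println(distance(337.65, 1524.6, 337.65, 1604.3)); // ~1.33
    }
}
```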
Why don't you just set a threshold?
If successive values (by time) differ by less than ±0.1 (in m/z), they are grouped together. Alternatively, use a relative threshold: group them if they differ by less than ±0.1%. Set these thresholds according to your domain knowledge.
That sounds like the straightforward way of preprocessing this data to me.
Using a "clustering" algorithm here seems total overkill to me. Clustering algorithms will try to discover much more complex structures than what you are trying to find here. The result will likely be surprising and hard to control. The straightforward change-threshold approach (which I would not call clustering!) is very simple to explain, understand and control.
For the simple one-dimensional case, k-means clustering (http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm) is appropriate and can be used directly. The only issue is selecting an appropriate K. One way to select a good K is to plot K against the residual variance and choose the K at which the variance stops dropping dramatically. Another strategy is to use an information criterion (e.g. the Bayesian Information Criterion).
You can extend k-means to multi-dimensional data easily, but you should beware of the scaling of the individual dimensions. E.g. among the items (1 kg, 1 km) and (2 kg, 2 km), the nearest point to (1.7 kg, 1.4 km) is (2 kg, 2 km) at these scales. But once you start expressing the second coordinate in metres, the opposite is probably true.
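For reference, here is a minimal sketch of standard (Lloyd's) k-means in one dimension. The value of K, the seeding, and the toy data are made up; in practice you would run it for several K and apply the elbow or information-criterion selection described above.

```java
import java.util.Arrays;

/** Standard (Lloyd's) k-means for one-dimensional data. */
public class KMeans1D {
    public static void main(String[] args) {
        double[] data = {337.65, 337.65, 337.68, 338.90, 338.91, 421.10};  // toy data
        int k = 3;
        double[] centers = {337.65, 338.90, 421.10};   // naive seeding with three data values
        int[] assign = new int[data.length];

        for (int iter = 0; iter < 100; iter++) {
            // Assignment step: each point goes to its nearest center.
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (Math.abs(data[i] - centers[c]) < Math.abs(data[i] - centers[best])) {
                        best = c;
                    }
                }
                assign[i] = best;
            }
            // Update step: each center moves to the mean of its assigned points.
            double[] sum = new double[k];
            int[] count = new int[k];
            for (int i = 0; i < data.length; i++) {
                sum[assign[i]] += data[i];
                count[assign[i]]++;
            }
            for (int c = 0; c < k; c++) {
                if (count[c] > 0) centers[c] = sum[c] / count[c];
            }
        }
        System.out.println("Centers: " + Arrays.toString(centers));
        System.out.println("Assignments: " + Arrays.toString(assign));
    }
}
```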

What's a good algorithm for nearest neighbour problem in two dimensions?

I would like to build an app that gives you the closest restaurant depending on your location. We'll have a database with all the POIs corresponding to restaurants, and we'll get your location from your phone's GPS...
What algorithm would be appropriate? Where can I find good documentation about it?
Thanks
Here's an informative presentation: http://dimacs.rutgers.edu/Workshops/MiningTutorial/pindyk-slides.ppt
I would either use a Quadtree or a Kd-tree.
See some benchmarks here: http://www.flegg.net/brett/pubs/spatial/index.html. It really all depends on your data size and range.
The main problem is how you store and search the data. If you are using a SQL database that doesn't support spatial indexes (say, SQLite on Android), consider converting the spatial data to a linear Z-order curve. The algorithm is simple; I know of (well, wrote) this implementation.
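For illustration, here is a minimal sketch of the Z-order (Morton) encoding idea: quantize latitude and longitude to integers, interleave their bits into a single long, and store that long in an ordinary indexed integer column. Points that are close in space then tend to be close in key order, so a range scan around the query key gives a good candidate set to refine with exact distances. The 32-bit quantization resolution is an arbitrary choice.

```java
/** Z-order (Morton) key for a latitude/longitude pair, suitable for storage
 *  in a plain indexed integer column of a database without spatial indexes. */
public class ZOrder {
    // Spread the lower 32 bits of v so that they occupy every other bit position.
    private static long spreadBits(long v) {
        v &= 0xFFFFFFFFL;
        v = (v | (v << 16)) & 0x0000FFFF0000FFFFL;
        v = (v | (v << 8))  & 0x00FF00FF00FF00FFL;
        v = (v | (v << 4))  & 0x0F0F0F0F0F0F0F0FL;
        v = (v | (v << 2))  & 0x3333333333333333L;
        v = (v | (v << 1))  & 0x5555555555555555L;
        return v;
    }

    // Quantize a coordinate from [min, max] onto 32 bits.
    private static long quantize(double value, double min, double max) {
        double scaled = (value - min) / (max - min) * 4294967295.0;   // 2^32 - 1
        return (long) Math.max(0, Math.min(4294967295.0, scaled));
    }

    /** Interleave the bits of quantized longitude (even bits) and latitude (odd bits). */
    public static long mortonKey(double lat, double lon) {
        long x = quantize(lon, -180.0, 180.0);
        long y = quantize(lat, -90.0, 90.0);
        return spreadBits(x) | (spreadBits(y) << 1);
    }

    public static void main(String[] args) {
        System.out.printf("%016x%n", mortonKey(48.8584, 2.2945));   // an example coordinate
    }
}
```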

Good data structure for euclidean 3d data queries?

What's a good way to store point cloud data so that it's optimal for an application that will do one of these two queries?
Nearest (i.e. lowest euclidean distance) data point to (x,y,z)
Get all the points inside a sphere with radius R around a point (x,y,z)
The structure will only be filled once, but read many times. A lowish memory footprint would be nice as I may be dealing with datasets of > 7 million points, but speed should be of primary concern. A library would be nice, but I wouldn't mind implementing it myself if it's something doable with limited expertise in the area.
Thanks in advance!
With a kd-tree you get O(log n) nearest-neighbour lookups, and range queries will usually be fast as well.
There are a number of kd-tree libraries available; I have not used any of them myself.
You might also look at CGAL. I have used CGAL for other things, it is tolerably fast, extremely comprehensive, but the documentation will drive you to drink.
A huge portion of the decision about data structures will depend on the spatial organization of the data. For example, highly clustered data tends to have different performance characteristics in kd-trees than evenly distributed data.
KD-Trees are very good for both of these queries.
Octree can be a good option in many cases, as well, and is potentially easier to implement.
There are many libraries out there that do this, using various algorithms. Searching for "k-nearest neighbor search" will reveal many useful libraries. I've had pretty good luck with ANN in the past, for example.
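If you would rather avoid a library, a uniform voxel grid is probably the simplest structure that handles the sphere query well for a build-once/read-many point cloud: hash each point into a cell whose edge is at least the typical query radius, and for a sphere query examine only the 27 cells around the query point (nearest-neighbour can then be done with an expanding search over cells). A rough sketch; the cell size is an assumed tuning parameter and the query assumes r is no larger than the cell size:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Uniform voxel grid over 3D points; built once, then queried many times. */
public class VoxelGrid {
    private final double cell;                          // cell edge length (tuning parameter)
    private final Map<Long, List<double[]>> cells = new HashMap<>();

    VoxelGrid(double cellSize) { this.cell = cellSize; }

    private long key(double x, double y, double z) {
        long ix = (long) Math.floor(x / cell);
        long iy = (long) Math.floor(y / cell);
        long iz = (long) Math.floor(z / cell);
        return (ix * 73856093L) ^ (iy * 19349663L) ^ (iz * 83492791L);   // spatial hash
    }

    void insert(double x, double y, double z) {
        cells.computeIfAbsent(key(x, y, z), k -> new ArrayList<>()).add(new double[]{x, y, z});
    }

    /** All points within radius r of (qx, qy, qz); assumes r <= cell size. */
    List<double[]> withinRadius(double qx, double qy, double qz, double r) {
        List<double[]> result = new ArrayList<>();
        for (int dx = -1; dx <= 1; dx++)
            for (int dy = -1; dy <= 1; dy++)
                for (int dz = -1; dz <= 1; dz++) {
                    List<double[]> bucket =
                            cells.get(key(qx + dx * cell, qy + dy * cell, qz + dz * cell));
                    if (bucket == null) continue;
                    for (double[] p : bucket) {
                        double d2 = (p[0] - qx) * (p[0] - qx)
                                  + (p[1] - qy) * (p[1] - qy)
                                  + (p[2] - qz) * (p[2] - qz);
                        if (d2 <= r * r) result.add(p);   // exact distance check filters hash collisions
                    }
                }
        return result;
    }

    public static void main(String[] args) {
        VoxelGrid grid = new VoxelGrid(1.0);
        grid.insert(0.2, 0.3, 0.1);
        grid.insert(3.0, 3.0, 3.0);
        System.out.println(grid.withinRadius(0, 0, 0, 0.5).size());   // 1
    }
}
```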
