ELKI KNNDistancesSampler - knn

Does anybody know what the KNNDistancesSampler in ELKI calculates? I can see the Java code for the class: https://github.com/elki-project/elki/blob/master/elki/src/main/java/de/lmu/ifi/dbs/elki/algorithm/KNNDistancesSampler.java, but I am really bad at Java. I can see it should get the distances of the neighbors via getKNNDistance()...
Is it returning the average distance (Euclidean by default) of the k nearest neighbors of each point? I know it should be used for estimating epsilon for DBSCAN etc., but I'd also like to know exactly what it is doing...
Thank you

References for this are given in the class documentation:
Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD '96).
Erich Schubert, Jörg Sander, Martin Ester, Hans-Peter Kriegel, Xiaowei Xu: DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN. ACM Trans. Database Systems (TODS).
The class returns a sample of the kNN distances (for each sampled point, the distance to its k-th nearest neighbor), not just their average, to help you choose the epsilon parameter using the "elbow" method on that plot. It does not automate the choice of epsilon - it only produces the plot.
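For concreteness, here is a minimal sketch of the same computation in Python with scikit-learn (not ELKI's Java implementation), assuming "kNN distance" means the distance to the k-th nearest neighbor, as in the DBSCAN papers cited above; the toy data and k value are made up for illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

X = np.random.rand(1000, 2)           # toy data standing in for the real dataset

k = 4                                 # in DBSCAN terms, often chosen as minPts - 1
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point finds itself first
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, k])[::-1]           # distance to the k-th neighbor, sorted descending

plt.plot(k_dist)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {k}-th nearest neighbor")
plt.show()                            # look for the "elbow"; its y-value is a candidate epsilon
```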

Related

How does one arrive at "fair" priors for spatial and non-spatial effects

A basic BYM model may be written as
$$\log \mu_i = \alpha + s_i + u_i$$
(sometimes with covariates, but that doesn't matter much here), where $s$ are the spatially structured effects and $u$ the unstructured effects over units.
In Congdon (2019) they refer to the fair prior on these as one in which
$$\operatorname{sd}(u_i) \approx \operatorname{sd}(s_i) \approx \frac{\sigma_s}{0.7\sqrt{\bar{m}}},$$
where $\sigma_s$ is the conditional standard deviation of the spatial effect and $\bar{m}$ is the average number of neighbors in the adjacency matrix.
It is defined similarly (in terms of precision, I think) in Bernardinelli et al. (1995).
However, for the gamma hyperpriors on the precisions, scaling appears to affect only the scale parameter.
I haven't been able to find a worked example of this, and I don't understand how the priors are arrived at, for example, in the well-known lip cancer data.
I am hoping someone could help me understand how these are reached in this setting, even in the simple case of two gamma hyperpriors.
References
Congdon, P. D. (2019). Bayesian Hierarchical Models: With Applications Using R (2nd ed.). Chapman and Hall/CRC.
Bernardinelli, L., Clayton, D., and Montomoli, C. (1995). Bayesian estimates of disease maps: How important are priors? Statistics in Medicine, 14, 2411–2431.
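To make the scaling argument concrete, here is a minimal numerical sketch (my own illustration, not from the thread): it computes the average number of neighbors $\bar{m}$ from a toy adjacency matrix and the implied "fair" conditional standard deviation for the spatial term, using the Bernardinelli et al. (1995) rule of thumb that the marginal sd of the ICAR effect is roughly the conditional sd divided by $0.7\sqrt{\bar{m}}$.

```python
import numpy as np

# Toy 5-unit adjacency matrix, made up for illustration; a real application
# would use e.g. the Scottish lip cancer map.
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 0, 1, 1, 0],
])

m_bar = A.sum(axis=1).mean()               # average number of neighbors

# Rule of thumb (Bernardinelli et al., 1995): the marginal sd of the ICAR term
# is roughly sigma_s / (0.7 * sqrt(m_bar)), with sigma_s the conditional sd.
# A "fair" prior matches this to the sd of the unstructured term.
sigma_u = 1.0                               # assumed unstructured sd (illustrative)
sigma_s = 0.7 * np.sqrt(m_bar) * sigma_u    # conditional sd giving equal weight
print(f"average neighbors = {m_bar:.2f}, fair conditional sd = {sigma_s:.2f}")
```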

How to understand F-test based lmfit confidence intervals

The excellent lmfit package lets one run nonlinear regression. It can report two different kinds of confidence intervals: one based on the covariance matrix, the other using a more sophisticated technique based on an F-test. Details can be found in the docs. I would like to understand the reasoning behind this technique in depth. Which topics should I read about? Note: I have sufficient stats knowledge.
F-tests and the associated methods for obtaining confidence intervals are far superior to a simple estimate from the covariance matrix for non-linear models (and others).
The primary reason for this is the lack of assumptions about the Gaussian nature of the error when using these methods. For non-linear systems, confidence intervals can be (but don't have to be) asymmetric. This means that a parameter can affect the error surface differently in each direction, and therefore the one-, two-, or three-sigma limits can have different magnitudes on either side of the best-fit value.
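For orientation, here is a minimal sketch of how lmfit exposes both kinds of intervals (the model and data below are invented for illustration): fit_report() gives the covariance-based standard errors, while conf_interval() profiles each parameter and uses an F-test to compare the resulting chi-square against the best fit.

```python
import numpy as np
import lmfit

def residual(params, x, data):
    # Simple exponential-decay model, invented for illustration.
    return data - params['amp'].value * np.exp(-params['decay'].value * x)

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
data = 3.0 * np.exp(-0.7 * x) + rng.normal(0, 0.1, x.size)

params = lmfit.Parameters()
params.add('amp', value=1.0)
params.add('decay', value=0.3)

mini = lmfit.Minimizer(residual, params, fcn_args=(x, data))
result = mini.minimize()

print(lmfit.fit_report(result))         # covariance-based standard errors
ci = lmfit.conf_interval(mini, result)  # F-test based (profiled) intervals
lmfit.printfuncs.report_ci(ci)
```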
The analytical ultracentrifugation community has excellent articles involving error analysis (Tom Laue, John J. Correia, Jim Cole, Peter Schuck are some good names for article searches). If you want a good general read about proper error analysis, check out this article by Michael Johnson:
http://www.researchgate.net/profile/Michael_Johnson53/publication/5881059_Nonlinear_least-squares_fitting_methods/links/0deec534d0d97a13a8000000.pdf
Cheers!

How to cluster large datasets

I have a very large dataset (500 million documents) and want to cluster all of the documents according to their content.
What would be the best way to approach this?
I tried using k-means, but it does not seem suitable because it needs all documents at once in order to do the calculations.
Are there any clustering algorithms suitable for larger datasets?
For reference: I am using Elasticsearch to store my data.
According to Prof. J. Han, who is currently teaching the Cluster Analysis in Data Mining class at Coursera, the most common methods for clustering text data are:
Combination of k-means and agglomerative clustering (bottom-up)
topic modeling
co-clustering.
But I can't tell how to apply these to your dataset. It's big - good luck.
For k-means clustering, I recommend reading the dissertation of Ingo Feinerer (2008). He is the developer of the tm package for text mining in R, which works with document-term matrices.
The thesis contains case studies (Ch. 8.1.4 and 9) on applying k-means and then a Support Vector Machine classifier to some documents (mailing lists and law texts). The case studies are written in tutorial style, but the datasets are not available.
The process contains lots of intermediate steps of manual inspection.
There are k-means variants that process documents one by one,
MacQueen, J. B. (1967). Some Methods for classification and Analysis of Multivariate Observations. Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability 1.
and k-means variants that repeatedly draw a random sample.
D. Sculley (2010). Web-Scale K-Means Clustering. Proceedings of the 19th International Conference on World Wide Web.
Bahmani, B., Moseley, B., Vattani, A., Kumar, R., & Vassilvitskii, S. (2012). Scalable k-means++. Proceedings of the VLDB Endowment, 5(7), 622-633.
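Sculley's mini-batch approach is implemented in scikit-learn as MiniBatchKMeans; here is a minimal sketch (my own, not from the answer) of streaming document batches through it so the whole corpus never has to be in memory, e.g. batches scrolled out of Elasticsearch. The placeholder batch generator and the tiny cluster count are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans

# Hashing avoids keeping a vocabulary for 500M documents in memory.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False, norm="l2")
kmeans = MiniBatchKMeans(n_clusters=2, random_state=0)   # pick a realistic k for the real corpus

def document_batches():
    """Stand-in for batches of raw text, e.g. scrolled out of Elasticsearch."""
    yield ["first document text ...", "second document text ..."]

for batch in document_batches():
    X = vectorizer.transform(batch)   # sparse bag-of-words for this batch only
    kmeans.partial_fit(X)             # mini-batch update in the spirit of Sculley (2010)

# Afterwards, kmeans.predict(vectorizer.transform(new_docs)) assigns clusters.
```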
But in the end, it's still useless old k-means: a good quantization approach, but not very robust to noise, and not capable of handling clusters of different sizes, non-convex shapes, or hierarchy (e.g. "baseball" nested inside "sports"). It's a signal-processing technique, not a data-organization technique.
So the practical impact of all of these is 0. Yes, they can run k-means on insanely large data - but if you can't make sense of the result, why would you do so?

Algorithm, find local/global minima, function of 2 variables

Let us have a function of 2 variables:
z=f(x,y) = ....
Can you advise me of any suitable method (simple to implement, with fast convergence) to calculate the local extrema on some intervals, or the global extremum?
Thanks for your help.
Gradient descent is a wise choice for finding local minima of a function, assuming you can calculate the gradient.
Depending on the specific domain, there are sometimes other solutions as well.
For example, for linear least squares (which is used for regression in machine learning), the objective is convex, so the local minimum is also the global one, and you can find it directly via the normal equations.
EDIT: As suggested in the comments: if you don't have any information about the function, you might be able to use a hill-climbing algorithm where you sample candidate directions to advance in (you need to sample because there are infinitely many directions when the function is over the real numbers) and choose the most promising one.
You can also estimate the derivatives numerically using numerical differentiation and then use gradient descent.
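As a minimal sketch of that last suggestion (the objective f below is made up for illustration, and only a local minimum is found): central differences approximate the partial derivatives, and plain gradient descent walks downhill.

```python
import numpy as np

def f(x, y):
    # Example objective; replace with your own z = f(x, y).
    return (x - 1.0) ** 2 + (y + 2.0) ** 2 + np.sin(3 * x)

def numerical_grad(f, x, y, h=1e-6):
    # Central differences approximate the partial derivatives.
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy

def gradient_descent(f, x0, y0, lr=0.05, steps=2000):
    x, y = x0, y0
    for _ in range(steps):
        gx, gy = numerical_grad(f, x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y, f(x, y)

print(gradient_descent(f, x0=0.0, y0=0.0))   # (x*, y*, f(x*, y*)) of a local minimum
```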
You might also look into simulated annealing if you like the idea of algorithms driven by ideas from thermodynamics and metallurgy.
Or perhaps you'd rather look at genetic algorithms, because you like the current explosion of knowledge in biology.

Distributed hierarchical clustering

Are there any algorithms that can help with hierarchical clustering?
Google's MapReduce has only an example of k-means clustering. In the case of hierarchical clustering, I'm not sure how it's possible to divide the work between nodes.
Another resource I found is: http://issues.apache.org/jira/browse/MAHOUT-19
But it's not apparent which algorithms are used.
First, you have to decide if you're going to build your hierarchy bottom-up or top-down.
Bottom-up construction is called hierarchical agglomerative clustering. Here's a simple, well-documented algorithm: http://nlp.stanford.edu/IR-book/html/htmledition/hierarchical-agglomerative-clustering-1.html.
Distributing a bottom-up algorithm is tricky because each distributed process needs the entire dataset to make choices about appropriate clusters. It also needs a list of clusters at its current level so it doesn't add a data point to more than one cluster at the same level.
Top-down hierarchy construction is called Divisive clustering. K-means is one option to decide how to split your hierarchy's nodes. This paper looks at K-means and Principal Direction Divisive Partitioning (PDDP) for node splitting: http://scgroup.hpclab.ceid.upatras.gr/faculty/stratis/Papers/tm07book.pdf. In the end, you just need to split each parent node into relatively well-balanced child nodes.
A top-down approach is easier to distribute. After your first node split, each node created can be shipped to a distributed process to be split again and so on... Each distributed process needs only to be aware of the subset of the dataset it is splitting. Only the parent process is aware of the full dataset.
In addition, each split could be performed in parallel. Two examples for k-means:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.1882&rep=rep1&type=pdf
http://www.ece.northwestern.edu/~wkliao/Kmeans/index.html.
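To make the top-down scheme concrete, here is a minimal single-machine sketch of divisive (bisecting) 2-means, my own illustration using scikit-learn rather than anything from the linked papers; in a distributed setting each recursive call needs only its own subset of the data, so it could be shipped to a separate worker.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, min_size=10, depth=0, max_depth=5):
    """Recursively split X with 2-means, building a top-down hierarchy.

    Each recursive call only needs its own subset of the data, which is
    what makes the top-down scheme easy to hand off to separate workers.
    """
    if len(X) <= min_size or depth >= max_depth:
        return {"points": X, "children": []}          # leaf node
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    children = [bisecting_kmeans(X[labels == k], min_size, depth + 1, max_depth)
                for k in (0, 1)]
    return {"points": X, "children": children}

tree = bisecting_kmeans(np.random.rand(1000, 8))      # toy data for illustration
```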
Clark Olson reviews several distributed algorithms for hierarchical clustering:
C. F. Olson. "Parallel Algorithms for Hierarchical Clustering." Parallel Computing, 21:1313-1325, 1995. doi:10.1016/0167-8191(95)00017-I
Parunak et al. describe an algorithm inspired by how ants sort their nests:
H. Van Dyke Parunak, Richard Rohwer, Theodore C. Belding, and Sven Brueckner. "Dynamic Decentralized Any-Time Hierarchical Clustering." Proc. 4th International Workshop on Engineering Self-Organising Systems (ESOA), 2006. doi:10.1007/978-3-540-69868-5
Check out this very readable, if a bit dated, review by Olson (1995). Most papers since then require a fee to access. :-)
If you use R, I recommend trying the pvclust package, which achieves parallelism using snow, another R package.
See also "Finding and evaluating community structure in networks" by Newman and Girvan, where they propose an approach for evaluating communities in networks (and a set of algorithms based on this approach), along with a measure of the quality of a network's division into communities (graph modularity).
You could look at some of the work being done with Self-Organizing Maps (Kohonen's neural network method)... the group at Vienna University of Technology has done some work on distributed calculation of their growing hierarchical self-organizing map algorithm.
This is a little on the edge of your clustering question, so it may not help, but I can't think of anything closer ;)
