I've been studying Support Vector Machines (SVM) for a while, and recently started reading articles on clustering. When using SVM, we did not need to worry about the dimensionality of the data; however, I learned that in clustering, due to the "Curse of Dimensionality", the dimensionality is a big issue. Furthermore, the sparsity and the size of the data greatly affect which clustering algorithm you choose. So I understand that there is no single "best algorithm" for clustering, and that it all depends on the nature of the data.
Having said that, I want to ask some really basic questions on Clustering.
When people say "high-dimensional", what do they mean specifically? Is 100d a high dimension? Or does this depend on the type of data you have?
I've seen answers on this website saying something like "using k-means on data with hundreds of dimensions is very usual"; if this is true, does it also hold for other clustering algorithms that use the same distance metric as k-means?
On p. 649 of the paper "Survey of Clustering Algorithms" (http://goo.gl/WQyuxo) by Rui Xu et al., the table shows that CURE has "the capability of tackling high dimensional data", and I was wondering if anybody has an idea of how high a dimensionality they are talking about.
If I wanted to perform clustering on high-dimensional data of adequate size, randomly sampled from the initial big data, what kinds of algorithms would be appropriate to use? I understand that density-based algorithms such as DBSCAN do not perform well under random sampling.
Can anybody tell me how well (or badly) CURE performs on high-dimensional data? Intuitively, I would guess that CURE does not perform well given the "Curse of Dimensionality", but it would be great if there were some detailed results.
Are there any websites/papers/textbooks explaining the pros and cons of clustering algorithms? I've seen some papers on the pros/cons of basic algorithms, e.g. k-means, hierarchical clustering, DBSCAN, etc., but I wanted to know more about other algorithms such as CURE, CLIQUE, CHAMELEON, etc.
Sorry for asking so many questions at once!
It would be great if anybody could answer any one of my questions. Also, if I have stated a question poorly or asked a completely pointless one, don't hesitate to tell me.
And if anybody knows a great textbook/survey paper on clustering that elaborates on these subjects, please tell me!
Thank you in advance.
You may be interested in this survey:
Kriegel, H. P., Kröger, P., & Zimek, A. (2009). Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(1), 1.
One of the authors co-authored DBSCAN, so it will likely help shed some light on your DBSCAN questions.
100-dimensional data can be high-dimensional data - if it isn't sparse. For the NLP people, 100d is laughably little, but their data is special: it is derived from an essentially binary nature (word present, or not present), so it actually carries less than 1 bit of information in each dimension. If you have dense 100-dimensional data, you usually are in trouble.
There are some nice figures in a related / follow up survey by the same authors:
Zimek, A., Schubert, E., & Kriegel, H. P. (2012). A survey on unsupervised outlier detection in high‐dimensional numerical data. Statistical Analysis and Data Mining, 5(5), 363-387.
They have analyzed the behavior of distance functions nicely for such data. The essence is:
High-dimensional data can be hard - or easy; it all depends on the signal-to-noise ratio. If you only have dimensions carrying signal, additional dimensions can actually make your problem easier. If the additional dimensions are distracting, things can break down.
This may also explain why the "kernel trick" with SVMs works - it does not really add information content; the increased dimensionality is only virtual, not intrinsic. You have a larger search and solution space, but your data still lies on a lower-dimensional manifold within this space.
k-means results on high-dimensional data tend to become meaningless. In many cases they still work "well enough", because often quality does not really matter much and any convex partitioning will do (e.g. bag-of-words approaches for image similarity do not seem to improve substantially with "better" k-means clusterings).
CURE, which also seems to use sum-of-squares (like k-means), should suffer from the same problems. For high-dimensional data, all sum-of-squares values become increasingly similar (i.e. any partitioning is about as good as any other).
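As a hedged illustration of this distance-concentration effect (my own toy experiment, not from the survey, assuming NumPy): for i.i.d. Gaussian noise, the ratio between the farthest and nearest neighbour of a query point shrinks toward 1 as the dimensionality grows, which is exactly why sum-of-squares comparisons lose discriminative power.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.normal(size=(500, d))            # 500 points of pure noise in d dimensions
    q = rng.normal(size=d)                   # a query point
    dists = np.linalg.norm(X - q, axis=1)    # Euclidean distances to the query
    print(d, dists.max() / dists.min())      # ratio approaches 1 as d grows
```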
Yes, there are plenty of textbooks, surveys, and studies that tried to compare clustering algorithms. But in the end, there are too many factors involved: what does your data look like, how did you preprocess it, do you have a well-chosen and appropriate distance measure, how good is your implementation, do you have index acceleration to speed up some algorithms, etc. - there is no rule of thumb; you will have to try out things.
Related
I have a very large dataset (500 million documents) and want to cluster all of them according to their content.
What would be the best way to approach this?
I tried using k-means but it does not seem suitable because it needs all documents at once in order to do the calculations.
Are there any clustering algorithms suitable for larger datasets?
For reference: I am using Elasticsearch to store my data.
According to Prof. J. Han, who is currently teaching the Cluster Analysis in Data Mining class at Coursera, the most common methods for clustering text data are:
A combination of k-means and agglomerative (bottom-up) clustering
Topic modeling
Co-clustering
But I can't tell you how to apply these to your dataset. It's big - good luck.
For k-means clustering, I recommend reading the dissertation of Ingo Feinerer (2008). He is the developer of the tm package for R, used for text mining via document-term matrices.
The thesis contains case studies (Ch. 8.1.4 and 9) on applying k-means and then a Support Vector Machine classifier to some documents (mailing lists and law texts). The case studies are written in tutorial style, but the datasets are not available.
The process contains lots of intermediate steps of manual inspection.
There are k-means variants that process documents one by one,
MacQueen, J. B. (1967). Some Methods for classification and Analysis of Multivariate Observations. Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability 1.
and k-means variants that repeatedly draw a random sample.
D. Sculley (2010). Web Scale K-Means clustering. Proceedings of the 19th international conference on World Wide Web
Bahmani, B., Moseley, B., Vattani, A., Kumar, R., & Vassilvitskii, S. (2012). Scalable k-means++. Proceedings of the VLDB Endowment, 5(7), 622-633.
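For instance, here is a rough sketch of the sampling-based variant in the spirit of Sculley (2010), assuming scikit-learn and a placeholder corpus; with 500 million documents you would stream real batches through `partial_fit` instead of calling `fit` on everything at once.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

# Placeholder corpus; in practice you would pull batches from your document store.
docs = ["first document text", "second document text", "third document text"]

X = TfidfVectorizer(max_features=10000).fit_transform(docs)  # sparse TF-IDF matrix

km = MiniBatchKMeans(n_clusters=3, batch_size=1000, random_state=0)
labels = km.fit_predict(X)   # or call km.partial_fit(batch) repeatedly on streamed batches
print(labels)
```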
But in the end, it's still useless old k-means. It's a good quantization approach, but it is not very robust to noise, and not capable of handling clusters of different sizes, non-convex shapes, hierarchies (e.g. baseball nested inside sports), etc. It's a signal-processing technique, not a data-organization technique.
So the practical impact of all these is 0. Yes, they can run k-means on insane data - but if you can't make sense of the result, why would you do so?
How do you find an optimal learning rule for a given problem, say multi-category classification?
I was thinking of using Genetic Algorithms, but I know there are issues surrounding performance. I am looking for real world examples where you have not used the textbook learning rules, and how you found those learning rules.
Nice question BTW.
Classification algorithms can be characterized by many properties, such as:
What the algorithm strongly prefers (i.e. what type of data is most suitable for it).
Training overhead (does it take a lot of time to train?).
When it is effective (large, medium, or small amounts of data).
The complexity of the analysis it can deliver.
Therefore, for your problem of classifying multiple categories, I would use online logistic regression (trained via SGD), because it works well with small to medium data sizes (fewer than tens of millions of training examples) and it is really fast.
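A minimal sketch of that choice, assuming scikit-learn and made-up data (in recent scikit-learn versions the logistic loss is called "log_loss"):

```python
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

# Toy multi-class data standing in for your real features and labels.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, n_classes=3, random_state=0)

clf = SGDClassifier(loss="log_loss", max_iter=1000, tol=1e-3)
clf.fit(X, y)                 # clf.partial_fit(...) can stream mini-batches instead
print(clf.score(X, y))
```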
Another example:
Let's say you have to classify a large amount of text data; then Naive Bayes is your baby, because it is strongly suited to text analysis. SVM and SGD are faster and, in my experience, easier to train, but those approaches work best when the data size is small or medium rather than large.
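For the text case, a hedged sketch of the Naive Bayes baseline, again assuming scikit-learn and a toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny toy corpus just to show the API shape.
texts = ["cheap pills now", "meeting at noon", "win money fast", "lunch tomorrow?"]
labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["free money meeting"]))
```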
In general, any data-mining practitioner will ask themselves about the four aforementioned points before starting any ML or simple mining project.
After that, you have to measure the AUC (or any other relevant metric) to see what you have achieved, because you might use more than one classifier in a single project. Sometimes, when you think you have found your perfect classifier, the results turn out to be poor under some evaluation measures, so you will start going through these questions again to find where you went wrong.
Hope that I helped.
When you input a vector x to the net, the net will give an output that depends on all the weights (vector w). There will be an error between the output and the true answer. The average error (e) is a function of w, let's say e = F(w). Suppose you have a one-layer network with two weights; then F is a surface over the two-dimensional weight plane.
When we talk about training, we are actually talking about finding the w that minimizes e. In other words, we are searching for the minimum of a function. To train is to search.
So your question is really how to choose the search method. My suggestion is: it depends on what the surface of F(w) looks like. The wavier it is, the more randomized the method should be, because a simple method based on gradient descent has a bigger chance of getting trapped in a local minimum, so you lose the chance to find the global minimum. On the other hand, if the surface of F(w) looks like one big pit, then forget the genetic algorithm; simple backpropagation, or anything based on gradient descent, will do very well in this case.
You may ask: how can I know what the surface looks like? That is a skill that comes with experience. Alternatively, you can randomly sample some values of w and calculate F(w) to get an intuitive view of the surface.
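A quick sketch of that sampling idea, with a made-up single-unit network and toy data purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # toy inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)     # toy targets

def F(w):
    """Mean squared error of a single sigmoid unit with weights w."""
    out = 1.0 / (1.0 + np.exp(-X @ w))
    return np.mean((out - y) ** 2)

# Sample random weight vectors and evaluate the error surface at each.
samples = [(w, F(w)) for w in rng.normal(scale=3.0, size=(1000, 2))]
w_best, e_best = min(samples, key=lambda p: p[1])
print("roughly best sampled w:", w_best, "error:", e_best)
```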
Can someone provide me some pointers on population initialization algorithms for genetic programming?
I already know about Grow, Full, and Ramped half-and-half (taken from "A Field Guide to Genetic Programming"), and I have also seen one newer approach, "Two Fast Tree-Creation" (I haven't read the paper yet).
The initial population plays an important role in heuristic algorithms such as GAs, as it helps decrease the time those algorithms need to achieve an acceptable result. Furthermore, it may influence the quality of the final answer given by evolutionary algorithms. (http://arxiv.org/pdf/1406.4518.pdf)
Since you know about Koza's different population-initialization methods, you must also remember that no algorithm used here is 100% random (nor can it be, being an algorithm), so to some extent you can predict what the next value will be.
Another method you could potentially use is uniform initialisation (see the free PDF "A Field Guide to Genetic Programming"). The motivation is that, after a population is created, the initial diversity of syntax trees can be lost within a few generations due to crossover and selection. Langdon (2000) came up with the idea of a ramped uniform distribution, which lets the user specify the range of sizes a tree may have; if a tree generated in the search space does not fall within that size range, it is automatically discarded, regardless of its fitness value. The ramped uniform distribution then creates an equal number of trees for each size in the range you have specified, all of which are random permutations of the functions and terminal values you are using (again, see "A Field Guide to Genetic Programming" for more detail).
This method can be quite useful for sampling when the desired solutions are asymmetric rather than symmetric (which is what ramped half-and-half assumes).
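Here is a rough, simplified sketch of the "equal number of trees per size in a user-chosen range" idea; the function/terminal sets, the nested-tuple tree representation, and the size range are all assumptions for illustration, not Langdon's exact algorithm.

```python
import random

FUNCTIONS = ["+", "-", "*"]        # binary functions (assumed)
TERMINALS = ["x", "1", "2"]        # terminals (assumed)

def grow(max_depth):
    """Generate a random expression tree as nested tuples."""
    if max_depth == 0 or random.random() < 0.3:
        return random.choice(TERMINALS)
    f = random.choice(FUNCTIONS)
    return (f, grow(max_depth - 1), grow(max_depth - 1))

def size(tree):
    """Count nodes; with binary functions only, sizes are always odd."""
    return 1 if isinstance(tree, str) else 1 + size(tree[1]) + size(tree[2])

def ramped_uniform(min_size, max_size, per_size):
    """Keep generating trees, discarding any whose size falls outside the
    requested range, until each (odd) size bucket holds `per_size` trees."""
    buckets = {s: [] for s in range(min_size, max_size + 1, 2)}
    while any(len(b) < per_size for b in buckets.values()):
        t = grow(max_depth=6)
        s = size(t)
        if s in buckets and len(buckets[s]) < per_size:
            buckets[s].append(t)
    return [t for b in buckets.values() for t in b]

population = ramped_uniform(min_size=3, max_size=9, per_size=5)
print(len(population), "trees, sizes", sorted({size(t) for t in population}))
```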
Other recommended reading on population initialisation:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.50.962&rep=rep1&type=pdf
I think that will depend on the problem you want to solve. For example, I'm working on a TSP, and my initial population is generated using a simple greedy technique. Sometimes you need to create only feasible solutions, so you have to build a mechanism for doing that. Usually you will find papers about your problem and about how to create initial solutions for it. Hope this helps.
I am interested in writing a twenty-questions algorithm similar to what Akinator and, to a lesser extent, 20q.net use. The latter seems to focus more on objects, explicitly telling you not to think of persons or places. One could say that Akinator is more general, allowing you to think of literally anything, including abstractions such as "my brother".
The problem is that I don't know what algorithm these sites use, but from what I have read they seem to use a probabilistic approach in which questions are given a certain fitness based on how many times they have led to correct guesses. This SO question presents several techniques, but rather vaguely, and I would be interested in more details.
So, what could be an accurate and efficient algorithm for playing twenty questions?
I am interested in details regarding:
1. What question to ask next.
2. How to make the best guess at the end of the 20 questions.
3. How to insert a new object and a new question into the database.
4. How to query (1, 2) and update (3) the database efficiently.
I realize this may not be easy, and I'm not asking for code or a 2000-word presentation. Just a few sentences about each operation and the underlying data structures should be enough to get me started.
Update, 10+ years later
I'm now hosting a (WIP, but functional) implementation here: https://twentyq.evobyte.org/ with the code here: https://github.com/evobyte-apps/open-20-questions. It's based on the same rough idea listed below.
Well, over three years later, I did it (although I didn't work full time on it). I hosted a crude implementation at http://twentyquestions.azurewebsites.net/ if anyone is interested (please don't teach it too much wrong stuff yet!).
It wasn't that hard, but I would say it's the non-intuitive kind of not hard that you don't immediately think of. My methods include some trivial fitness-based ranking, ideas from reinforcement learning and a round-robin method of scheduling new questions to be asked. All of this is implemented on a normalized relational database.
My basic ideas follow. If anyone is interested, I will share code as well, just contact me. I plan on making it open source eventually, but once I have done a bit more testing and reworking. So, my ideas:
an Entities table that holds the characters and objects played;
a Questions table that holds the questions, which are also submitted by users;
an EntityQuestions table holds entity-question relations. It holds the number of times each answer was given for each question in relation to each entity (well, for those entities the question has actually been asked about, anyway). It also has a Fitness field, used for ranking questions from "more general" down to "more specific";
a GameEntities table is used for ranking the entities according to the answers given so far for each on-going game. An answer of A to a question Q pushes up all the entities for which the majority answer to question Q is A;
The first question asked is picked from those with the highest sum of fitnesses across the EntityQuestions table;
Each next question is picked from those with the highest fitness associated with the currently top-ranked entries in the GameEntities table. Questions for which the expected answer is Yes are favored even before the fitness, because these have more chances of consolidating the currently top-ranked entity (see the sketch after this list);
If the system is quite sure of the answer even before all 20 questions have been asked, it will start asking questions not associated with its answer, so as to learn more about that entity. This is done in a round-robin fashion from the global questions pool right now. Discussion: is round-robin fine, or should it be fully random?
Premature answers are also given under certain conditions and probabilities;
Guesses are given based on the rankings in GameEntities. This allows the system to account for lies as well, because it never eliminates any possibility, it just decreases its likelihood of being the answer;
After each game, the fitness and answers statistics are updated accordingly: fitness values for entity-question associations decrease if the game was lost, and increase otherwise.
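As promised above, a hedged sketch of the question-selection rule (highest fitness among questions tied to the currently top-ranked entities, with Yes-expected questions favored first); the in-memory dictionaries are simplified stand-ins for the EntityQuestions and GameEntities tables, not the actual schema.

```python
def next_question(game_entities, entity_questions, asked):
    """game_entities: {entity_id: rank_score} for the current game.
    entity_questions: {(entity_id, question_id): {"fitness": f, "yes": n, "no": n}}.
    asked: set of question_ids already used in this game."""
    top = sorted(game_entities, key=game_entities.get, reverse=True)[:5]
    candidates = {}
    for (e, q), stats in entity_questions.items():
        if e in top and q not in asked:
            score = stats["fitness"]
            if stats["yes"] >= stats["no"]:      # favor questions whose expected answer is Yes
                score += 1000
            candidates[q] = max(candidates.get(q, 0), score)
    return max(candidates, key=candidates.get) if candidates else None

# Tiny made-up example.
game = {"einstein": 3.2, "tesla": 2.9, "mario": 0.4}
eq = {("einstein", "is_real"): {"fitness": 10, "yes": 40, "no": 2},
      ("tesla", "is_real"): {"fitness": 10, "yes": 35, "no": 1},
      ("mario", "is_real"): {"fitness": 10, "yes": 1, "no": 30},
      ("einstein", "is_scientist"): {"fitness": 7, "yes": 38, "no": 1},
      ("tesla", "is_scientist"): {"fitness": 7, "yes": 30, "no": 2}}
print(next_question(game, eq, asked=set()))
```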
I can provide more details if anyone is interested. I am also open to collaborating on improving the algorithms and implementation.
This is a very interesting question. Unfortunately I don't have a full answer, let me just write down the ideas I could come up with in 10 minutes:
If you are able to halve the set of available answers on each question, you can distinguish between 2^20 ~ 1 million "objects". Your set is probably going to be larger, so it's right to assume that sometimes you have to make a guess.
You want to maximize utility. Some objects are chosen more often than others. If you want to make good guesses, you have to take the weight of each object (= the probability of that object being picked) into account when creating the tree (a small sketch after these points illustrates picking the question that best splits the weighted set).
If you trust your users a little bit, you can gain knowledge based on their answers. This also means that you cannot use a static tree to ask questions, because then you would always get answers to the same questions and learn nothing new when you encounter the same object.
If a simple question is not able to divide the set into two halves, you could combine questions to get better results, e.g.: "Is the object green or blue?", "Is it green or does it have a round shape?"
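As mentioned above, a small sketch of picking the question that best halves the weighted set of remaining objects; the objects, weights, and answer table are made up for illustration.

```python
def best_question(objects, weights, answers, questions):
    """answers[(obj, q)] is True/False; weights[obj] is how often obj gets picked."""
    total = sum(weights[o] for o in objects)
    def imbalance(q):
        yes = sum(weights[o] for o in objects if answers[(o, q)])
        return abs(yes - (total - yes))      # 0 means a perfect weighted 50/50 split
    return min(questions, key=imbalance)

objects = ["cat", "dog", "car"]
weights = {"cat": 5, "dog": 3, "car": 2}
answers = {("cat", "alive?"): True, ("dog", "alive?"): True, ("car", "alive?"): False,
           ("cat", "barks?"): False, ("dog", "barks?"): True, ("car", "barks?"): False}
print(best_question(objects, weights, answers, ["alive?", "barks?"]))   # -> "barks?"
```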
I am trying to write a Python implementation using a naïve Bayesian network for learning, with minimization of the expected entropy after the question has been answered as the criterion for selecting a question (with an epsilon chance of selecting a random question in order to learn more about that question), following the ideas in http://lists.canonical.org/pipermail/kragen-tol/2010-March/000912.html. I have put what I have so far on GitHub.
Preferably choose questions with low remaining entropy expectation. (For putting together something quickly, I stole from ε-greedy multi-armed bandit learning and use: With probability 1–ε: Ask the question with the lowest remaining entropy expectation. With probability ε: Ask any random question. However, this approach seems far from optimal.)
Since my approach is a Bayesian network, I obtain the probabilities of the objects and can ask for the most probable object.
A new object is added as new column to the probabilities matrix, with low a priori probability and the answers to the questions as given if given or as guessed by the Bayes network if not given. (I expect that this second part would work much better if I would add Bayes network structure learning instead of just using naive Bayes.)
Similarly, a new question is a new row in the matrix. If it comes from user input, probably only very few answer probabilities are known, the rest needs to be guessed. (In general, if you can get objects by asking for properties, you can obtain properties by asking if given objects have them or not, and the transformation between these is essentially Bayes' theorem and breaks down to transposition in the easiest case. The guessing quality should improve again once the network has an appropriate structure.)
(This is a problem, since I calculate lots of probabilities. My goal is to do it using database-oriented sparse tensor calculations optimized for working with weighted directed acyclic graphs.)
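For concreteness, here is a compact sketch (my own illustration, not the linked GitHub code) of scoring a question by the expected posterior entropy over objects under a naive independence assumption; the priors and per-question answer probabilities are toy numbers.

```python
import numpy as np

def expected_entropy(p_obj, p_yes_given_obj):
    """p_obj: prior over objects; p_yes_given_obj: P(answer=yes | object) for one question."""
    p_yes = float(np.dot(p_obj, p_yes_given_obj))
    post_yes = p_obj * p_yes_given_obj / max(p_yes, 1e-12)
    post_no = p_obj * (1 - p_yes_given_obj) / max(1 - p_yes, 1e-12)
    H = lambda p: -np.sum(p * np.log2(np.clip(p, 1e-12, 1)))
    return p_yes * H(post_yes) + (1 - p_yes) * H(post_no)

p_obj = np.array([0.5, 0.3, 0.2])                     # toy prior over 3 objects
questions = {"q1": np.array([0.9, 0.1, 0.1]),         # P(yes | object) for each question
             "q2": np.array([0.5, 0.5, 0.5])}
best = min(questions, key=lambda q: expected_entropy(p_obj, questions[q]))
print(best)   # q1 wins: it actually separates the objects
```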
It would be interesting to see how good a decision tree based algorithm would serve you. The trick here is purely in the learning/sorting of the tree. I'd like to note that this is stuff I remember from AI class and student work in the AI working group and should be taken with a semi-large grain (or nugget) of salt.
To answer the questions:
You just walk the tree :)
This is a big downside of decision trees. You'd only have one guess that can be attached to the end nodes of the tree at depth 20 (or earlier, if the tree is still sparse).
There are whole books dedicated to this topic. As far as I remember from AI class, you try to minimize entropy at all times, so you want to ask questions that ideally divide the set of remaining objects into two sets of equal size. I'm afraid you would have to look this up in AI books.
Decision trees are highly efficient during the query phase, as you literally walk the tree and follow the 'yes' or 'no' branch at each node. Update efficiency depends on the learning algorithm applied. You might be able to do this offline as in a nightly batched update or something like that.
If I have a large set of data that describes physical 'things', how could I go about measuring how well that data fits the 'things' that it is supposed to represent?
An example would be if I have a crate holding 12 widgets and I know each widget weighs 1 lb; there should be some data quality "check" making sure the crate weighs about 13 lbs, say.
Another example would be that if I have a lamp and an image representing that lamp, it should look like a lamp. Perhaps the image dimensions should have the same ratio of the lamp dimensions.
With the exception of images, my data is 99% text (which includes height, width, color...).
I've studied AI in school, but have done very little outside of that.
Are standard AI techniques the way to go? If so, how do I map a problem to an algorithm?
Are some languages easier at this than others? Do they have better libraries?
thanks.
Your question is somewhat open-ended, but it sounds like what you want is what is known as a "classifier" in the field of machine learning.
In general, a classifier takes a piece of input and "classifies" it, ie: determines a category for the object. Many classifiers provide a probability with this determination, and some may even return multiple categories with probabilities on each.
Some examples of classifiers are bayes nets, neural nets, decision lists, and decision trees. Bayes nets are often used for spam classification. Emails are classified as either "spam" or "not spam" with a probability.
For your question, you'd want to classify your objects as "high quality" or "not high quality".
The first thing you'll need is a bunch of training data. That is, a set of objects where you already know the correct classification. One way to obtain this could be to get a bunch of objects and classify them by hand. If there are too many objects for one person to classify you could feed them to Mechanical Turk.
Once you have your training data you'd then build your classifier. You'll need to figure out what attributes are important to your classification. You'll probably need to do some experimentation to see what works well. You then have your classifier learn from your training data.
One approach that's often used for testing is to split your training data into two sets. Train your classifier using one of the subsets, and then see how well it classifies the other (usually smaller) subset.
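A short sketch of that train/validate split, assuming a scikit-learn style classifier and stand-in features in place of your hand-labelled objects:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in for your labelled "high quality" / "not high quality" objects.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```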
AI is one path, natural intelligence is another.
Your challenge is a perfect match to Amazon's Mechanical Turk. Divvy your data space up into extremely small verifiable atoms and assign them as HITs on Mechanical Turk. Have some overlap to give yourself a sense of HIT answer consistency.
There was a shop with a boatload of component CAD drawings that needed to be grouped by similarity. They broke the job up and set it loose on Mechanical Turk, with very satisfying results. I could google for hours and not find that link again.
See here for a related forum post.
This is a tough one to answer. For example, what defines a lamp? I could search Google Images for pictures of some crazy-looking lamps, or even look up the definition of a lamp (http://dictionary.reference.com/dic?q=lamp). There are no physical requirements for what a lamp must look like. That's the crux of the AI problem.
As for the data, you could set up unit testing on the project to ensure that, say, 12 Widget() instances weigh no more than 13 lbs in the WidgetBox(). Regardless, you need to have the data at hand to be able to test things like that.
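A toy unittest sketch of that crate-weight sanity check; Widget, WidgetBox, and the tare weight are hypothetical stand-ins for whatever your data model actually looks like.

```python
import unittest

class Widget:
    weight = 1.0                     # lbs

class WidgetBox:
    def __init__(self, widgets, tare=1.0):
        self.widgets = widgets
        self.tare = tare             # weight of the empty crate, lbs
    def total_weight(self):
        return self.tare + sum(w.weight for w in self.widgets)

class TestCrateWeight(unittest.TestCase):
    def test_twelve_widgets_weigh_about_thirteen_pounds(self):
        box = WidgetBox([Widget() for _ in range(12)])
        self.assertAlmostEqual(box.total_weight(), 13.0, delta=0.5)

if __name__ == "__main__":
    unittest.main()
```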
I hope I was able to answer your question somewhat. It's a bit vague, and my answers are broad, but hopefully they'll at least send you in a good direction.