How does data clustering help in image or pattern recognition?

I have been playing around with different data clustering algorithms, working on finding clusters among random data points represented as nodes. I keep reading that data clustering is used for image recognition, but I am failing to make the connection: how does clustering data help in recognizing an image, or in facial recognition? Can someone explain this?

It's no surprise that clustering is used for pattern recognition at large, and image recognition in particular: clustering is a reducing process, and images in this megapixel era need boiling down... It is also a process which produces categories and that is of course useful.
However, there are many approaches to the use of clustering as a technique for image recognition. One of the reasons for this diversity is that clustering can be applied at different levels and for different purposes: from the basic pixel level to the feature level (a feature being a line, a geometric figure...), for classification or for other purposes.
At a very high level, clustering is a statistical tool: it helps discover the relative importance of various dimensions in determining whether a particular item belongs to a particular category.
One (of many) uses of such a tool is with supervised learning, whereby a set of human-selected items (say, images) is fed into the cluster-based logic, each along with a label ("this is an apple", "this is another apple", "this is a lemon"...). The clustering logic then determines how much each dimension of the input matters in helping each group of items (apples, lemons...) fit in a distinct cluster (for example, the color may matter relatively little, but the shape, or the presence of dots, or whatever, may matter a lot). After this training phase, new images can be fed to the logic, and by seeing how close to a particular cluster an image falls, it is "recognized" (as a banana!).
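To make the "how close to a particular cluster" step concrete, here is a minimal sketch of a nearest-centroid recognizer. It is a simplification of what is described above: it does not learn how much each dimension matters, and the two features (roundness, yellowness) and their values are made up for illustration.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def train(labeled):
    """labeled: dict mapping a label to its list of feature vectors."""
    return {label: centroid(vecs) for label, vecs in labeled.items()}

def recognize(centroids, x):
    """Return the label whose cluster center is closest to x."""
    return min(centroids, key=lambda label: math.dist(centroids[label], x))

# Hypothetical 2-D features: (roundness, yellowness), both in [0, 1].
training = {
    "apple": [[0.90, 0.20], [0.95, 0.10], [0.85, 0.30]],
    "lemon": [[0.60, 0.90], [0.55, 0.95], [0.65, 0.85]],
}
centroids = train(training)
label = recognize(centroids, [0.58, 0.88])  # falls closest to the lemon cluster
```

Note that this logic always answers with one of the known labels, which is exactly why an unseen fruit can end up "recognized" as the wrong thing.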
When it comes to image processing, one needs to remember that whatever is "fed" to the clustering logic is not necessarily (in fact, rarely) the raw pixels, but various "objects" characterizing various "elements" of the original data (essentially a collection of relatively high-dimensional vectors, not unlike those one may have encountered in other data clustering examples), produced by previous stages of the process. For example, an important element of facial recognition is probably the exact distance between the centers of the eyes. In previous stages, the image is processed in a way that figures out where the eyes are (possibly relying on another clustering-based logic). Then the distance between the eyes, along with many other elements, is fed to the final clustering logic.
The preceding description is only one example of the use of clustering for image recognition. Indeed, various forms of neural networks have been used, very successfully, in this domain, and it can be argued that in a sense these neural networks are clustering information. One of the reasons for the success of neural nets may lie in their ability to be more respectful of the locality dimension as found in the original input, and also their ability to work in a hierarchical fashion.
A good conclusion to this write up would be a short list of online resources, but I'm pressed for time at the moment... "to be continued" ;-)
Next day edit: (failed attempt to provide an introductory online bibliography on the subject)
My search for literature on the topic of clustering as applied to artificial vision and image processing revealed two distinct... clusters ;-)
Books such as Algorithms for Image Processing and Computer Vision, J. R. Parker, pub. Wiley, or Machine Vision: Theory, Algorithms, Practicalities, M Seul et al., Cambridge UP. Such books generally cover the all-important techniques associated with noise reduction, edge detection, color or intensity conversion, and many other elements of the image processing chain, most of which do not involve clustering or even statistical methods. They reserve only a chapter or two, or even just minor mentions, for clustering as applied to pattern recognition or to other tasks.
Scholarly papers and conference handbooks, which specifically cover clustering techniques applied to artificial vision and such, but in the narrowest and deepest fashion (ex: Variations on the Fukunaga and Narendra algorithm, for applications in character recognition, or Fast methods for selections of Nearest Neighbor candidates in whatever context.)
In short I feel ill equipped to make any specific book or article suggestion.
You may find it informative to browse titles in, say, Google Books, searching for "Artificial vision" or "Image Recognition" or some of the titles mentioned above. With the preview feature and also the tag cloud (btw another application of clustering) found in the "about this book" link, one can get a good idea of the various books' contents and maybe decide to purchase some of them. Unfortunately, the reduced readership and the potentially lucrative applications in the field make these books relatively expensive. At the other end of the spectrum, you may download, sometimes for free, research papers discussing advanced topics in the field. These will also show up on regular (web) Google, or at specialized repositories such as CiteSeer.
Good luck with your exploration in that field!


Can a CNN recognize the difference in size if the images are the same?

Could a CNN tell the difference between different size ranges of the same organism? Like a puppy vs an adult dog, or a child vs an adult? Or more like a large fly vs a small fly, where they look identical but one is just larger than the other?
This is a tricky question to answer, but in theory a CNN is usually able to do this. It depends mainly on the training data. In the case of a child vs an adult, you can gather a dataset that includes a lot of variance in sizes and ages, to make sure that the resulting CNN model is able to find patterns and generalize. In the end, the CNN will learn many other features that make the classification scale- or size-invariant (independent of size), such as shapes, colors, clothes and facial features. Such intra-class classification problems are not easily tackled with traditional supervised learning, and therefore some researchers are applying an approach called "Deep Metric Learning".
Metric learning is the task of learning a distance function over objects. A metric or distance function has to obey four axioms: non-negativity, identity of indiscernibles, symmetry and subadditivity (or the triangle inequality). In practice, metric learning algorithms ignore the condition of identity of indiscernibles and learn a pseudo-metric. (Wikipedia definition)
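As a small illustration of those four axioms, the sketch below checks them by brute force on a handful of sample points. The helper name `metric_axioms` and the sample points are my own, not from any library. Euclidean distance passes all four checks, while a distance that only looks at the first coordinate fails the identity of indiscernibles, which is what makes it a pseudo-metric.

```python
import math
from itertools import product

def metric_axioms(d, points, tol=1e-9):
    """Check the four metric axioms for distance function d on sample points."""
    triples = list(product(points, repeat=3))
    return {
        "non_negativity": all(d(x, y) >= 0 for x, y, _ in triples),
        # d(x, y) == 0 exactly when x and y are the same point
        "identity": all((d(x, y) <= tol) == (x == y) for x, y, _ in triples),
        "symmetry": all(abs(d(x, y) - d(y, x)) <= tol for x, y, _ in triples),
        "triangle": all(d(x, z) <= d(x, y) + d(y, z) + tol for x, y, z in triples),
    }

points = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (3.0, 1.0)]

euclid = metric_axioms(math.dist, points)

# Compares only the first coordinate, so (1, 0) and (1, 2) are at distance 0.
first_coord = metric_axioms(lambda x, y: abs(x[0] - y[0]), points)
```

A check like this can only refute an axiom on the points it is given; it does not prove that a function is a metric in general.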
It would be better to separate the two properties mentioned in the question: recognizing age and recognizing size are different tasks.
About age: yes, it is doable. For a deep-learning-based approach, you will need appropriate data. For a non-training-based approach (old-school image processing), you would need to create some metrics for each object based on age (counting wrinkles, white hairs, etc. for humans).
About size: unfortunately, this is still under research, and it is not clear whether it is properly doable. Whenever we talk about recognizing object size from a single image, there are more things to consider. The first is perspective. If the object appears large with respect to the image coordinates, is it a tiny object close to the camera, or a really huge object far away from it? Such a problem may be overcome by knowing the object geometry in advance and developing an algorithm based on that geometry along with deep learning. However, current deep learning technology is not yet accurate enough to recover dimensions and location, and hence object geometry, precisely.
Another alternative would be to control the environment. For example, if you know that both objects lie on the same plane (e.g. on a table, next to each other) in the real world, the rest is a trivial problem to resolve.
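The perspective ambiguity, and why the same-plane setup removes it, can be sketched with an ideal pinhole camera model. This is an assumption made for illustration (real lenses and the answer above are not tied to it), and the fly sizes and distances are invented.

```python
def projected_size(real_size, distance, focal_length=1.0):
    """Apparent size on the image plane under an ideal pinhole model."""
    return focal_length * real_size / distance

# A 6 mm fly at 0.3 m and a 12 mm fly at 0.6 m look the same size in the image.
small_near = projected_size(real_size=0.006, distance=0.3)
large_far = projected_size(real_size=0.012, distance=0.6)

# On the same plane (same distance), the image ratio equals the real ratio.
ratio_same_plane = projected_size(0.012, 0.3) / projected_size(0.006, 0.3)
```

With equal distances the unknown distance and focal length cancel out of the ratio, so relative size can be read directly from the image.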

Negative Training Image Examples for CNN

I am using the Caffe framework for CNN training. My aim is to perform simple object recognition for a few basic object categories. Since pretrained networks are not an option for my intended usage, I prepared my own training and test sets with about 1000 images for each of 2 classes (say chairs and cars).
The results are quite good. If I present a previously unseen image of a chair, it is likely classified as such; same for a car image. My problem is that the results on miscellaneous images that do not show any of these classes often show a very high confidence (=1) for one random class (which is not surprising given the one-sided training data, but a problem for my application). I thought about different solutions:
1) Adding a third class with about 1000 negative examples showing any objects except a chair or a car.
2) Adding more object categories in general, just to let the network classify other objects as such and no longer as a chair or car (of course this would require much effort). Maybe the broader prediction results would also show a more uniform distribution on negative images, allowing the target object's presence to be evaluated based on a threshold?
Because it was not very time-consuming to grab random images from the internet as negative examples, I already tested my first solution with about 1200 negative examples. It helped, but the problem remains, perhaps because there were just too few? My concern is that if I increase the number of negative examples, the imbalance in the number of examples per class will lead to less accurate detection of the original classes.
After some research I found one person with a similar problem, but there was no solution:
Convolutional Neural Networks with Caffe and NEGATIVE IMAGES
My question is: Has anyone had the same problem and knows how to deal with it? What way would you recommend, adding more negative examples or more object categories or do you have any other recommendation?
The problem is not unique to Caffe or ConvNets. Any Machine Learning technique runs this risk. In the end, all classifiers take a vector in some input space (usually very high-dimensional), which means they partition that input space. You've given examples of two partitions, which helps to estimate the boundary between the two, but only that boundary. Both partitions have very, very large boundaries, precisely because the input space is so high-dimensional.
ConvNets do try to tackle the high dimensionality of image data by having fairly small convolution kernels. Realistic negative data helps in training those, and the label wouldn't really matter. You could even use the input image as the target (i.e. train it as an autoencoder) when training the convolution kernels.
One general reason why you don't want to lump all counterexamples together is that they may be too varied. If you have a class A with some feature value from the range [-1,+1] on some scale, with counterexamples B [-2,-1] and C [+1,+2], lumping B and C together creates a range [-2,+2] for counterexamples which overlaps the real range. Given enough data and powerful enough classifiers, this is not fatal, but for instance an SVM can fail badly on this.
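The A/B/C example can be checked numerically. The sketch below uses a single-threshold rule on one feature as a stand-in for a linear classifier (it is not an SVM, and the sample values are made up to fall inside the ranges given above). A is perfectly separable from B alone and from C alone, but not from B and C lumped into one class.

```python
def best_stump_accuracy(pos, neg):
    """Best accuracy of any single-threshold rule separating pos from neg."""
    points = sorted(pos + neg)
    # Candidate thresholds: midpoints between neighbours, plus both extremes.
    candidates = ([points[0] - 1.0]
                  + [(a + b) / 2 for a, b in zip(points, points[1:])]
                  + [points[-1] + 1.0])
    total = len(pos) + len(neg)
    best = 0.0
    for t in candidates:
        # Rule "x > t means positive"; the flipped rule gets total - correct.
        correct = sum(x > t for x in pos) + sum(x <= t for x in neg)
        best = max(best, correct / total, (total - correct) / total)
    return best

A = [-0.8, -0.3, 0.0, 0.4, 0.9]   # class of interest, within [-1, +1]
B = [-1.9, -1.6, -1.4, -1.1]      # counterexamples, within [-2, -1]
C = [1.2, 1.5, 1.7, 1.95]         # counterexamples, within [+1, +2]

a_vs_b = best_stump_accuracy(A, B)           # separable
a_vs_c = best_stump_accuracy(A, C)           # separable
a_vs_lumped = best_stump_accuracy(A, B + C)  # no single threshold works
```

Keeping B and C as separate classes lets each boundary stay simple; lumping them forces the classifier to carve A out of the middle of the "other" class.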

Comparing two English strings for similarities

So here is my problem. I have two paragraphs of text and I need to see if they are similar. Not in the sense of string metrics but in meaning. The following two paragraphs are related but I need to find out if they cover the 'same' topic. Any help or direction to solving this problem would be greatly appreciated.
Fossil fuels are fuels formed by natural processes such as anaerobic
decomposition of buried dead organisms. The age of the organisms and
their resulting fossil fuels is typically millions of years, and
sometimes exceeds 650 million years. The fossil fuels, which contain
high percentages of carbon, include coal, petroleum, and natural gas.
Fossil fuels range from volatile materials with low carbon:hydrogen
ratios like methane, to liquid petroleum to nonvolatile materials
composed of almost pure carbon, like anthracite coal. Methane can be
found in hydrocarbon fields, alone, associated with oil, or in the
form of methane clathrates. It is generally accepted that they formed
from the fossilized remains of dead plants by exposure to heat and
pressure in the Earth's crust over millions of years. This biogenic
theory was first introduced by Georg Agricola in 1556 and later by
Mikhail Lomonosov in the 18th century.
Fossil fuel reforming is a method of producing hydrogen or other
useful products from fossil fuels such as natural gas. This is
achieved in a processing device called a reformer which reacts steam
at high temperature with the fossil fuel. The steam methane reformer
is widely used in industry to make hydrogen. There is also interest in
the development of much smaller units based on similar technology to
produce hydrogen as a feedstock for fuel cells. Small-scale steam
reforming units to supply fuel cells are currently the subject of
research and development, typically involving the reforming of
methanol or natural gas but other fuels are also being considered such
as propane, gasoline, autogas, diesel fuel, and ethanol.
That's a tall order. If I were you, I'd start reading up on Natural Language Processing. NLP is a fairly large field -- I would recommend looking specifically at the things mentioned in the Wikipedia Text Analytics article's "Processes" section.
I think if you make use of information retrieval, named entity recognition, and sentiment analysis, you should be well on your way.
In general, I believe that this is still an open problem. Natural language processing is still a nascent field and while we can do a few things really well, it's still extremely difficult to do this sort of classification and categorization.
I'm not an expert in NLP, but you might want to check out these lecture slides that discuss sentiment analysis and authorship detection. The techniques you might use to do the sort of text comparison you've suggested are related to the techniques you would use for the aforementioned analyses, and you might find this to be a good starting point.
Hope this helps!
You can also have a look at the Latent Dirichlet Allocation (LDA) model in machine learning. The idea there is to find a low-dimensional representation of each document (or paragraph), simply as a distribution over some 'topics'. The model is trained in an unsupervised fashion using a collection of documents/paragraphs.
If you run LDA on your collection of paragraphs, then by looking at the similarity of the hidden topic vectors, you can find out whether two given paragraphs are related or not.
Of course, the baseline is to not use LDA, and instead use term frequencies (weighted with tf-idf) to measure similarities (vector space model).
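Here is a minimal sketch of that vector space baseline, using only the standard library. The tokenizer and the smoothed idf formula are my own choices for illustration, and the three short documents are invented. It measures word overlap, not meaning, so treat it as a starting point rather than a solution.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vec = {}
        for term, count in tf.items():
            # Smoothed idf, so terms found in every document keep a small weight.
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1
            vec[term] = (count / len(tokens)) * idf
        vectors.append(vec)
    return vectors

def cosine(u, v):
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [
    "Fossil fuels include coal, petroleum, and natural gas.",
    "Fossil fuel reforming produces hydrogen from natural gas.",
    "The puppy chased a small fly around the garden.",
]
vecs = tfidf_vectors(docs)
related = cosine(vecs[0], vecs[1])    # shares "fossil", "natural", "gas"
unrelated = cosine(vecs[0], vecs[2])  # no shared terms
```

Note that "fuels" and "fuel" count as different terms here; stemming or lemmatization would be the usual next step.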

Anomaly Detection Algorithms

I am tasked with detecting anomalies (known or unknown) using machine-learning algorithms from data in various formats - e.g. emails, IMs etc.
What are your favorite and most effective anomaly detection algorithms?
What are their limitations and sweet-spots?
How would you recommend those limitations be addressed?
All suggestions very much appreciated.
Statistical filters like Bayesian filters, or the bastardised versions employed by some spam filters, are easy to implement. Plus there is lots of online documentation about them.
The big downside is that it cannot really detect unknown things. You train it with a large sample of known data so that it can categorize new incoming data. But you can turn the traditional spam filter upside down: train it to recognize legitimate data instead of illegitimate data so that anything it doesn't recognize is an anomaly.
There are various types of anomaly detection algorithms, depending on the type of data and the problem you are trying to solve:
Anomalies in time series signals:
A time series signal is anything you can draw as a line graph over time (e.g., CPU utilization, temperature, number of emails per minute, rate of visitors on a webpage, etc.). Example algorithms are Holt-Winters, ARIMA models, Markov models, and more. I gave a talk on this subject a few months ago - it might give you more ideas about algorithms and their limitations.
The video is at:
Anomalies in tabular data: These are cases where you have feature vectors that describe something (e.g., transforming an email into a feature vector that describes it: number of recipients, number of words, number of capitalized words, counts of keywords, etc.). Given a large set of such feature vectors, you want to detect some that are anomalies compared to the rest (sometimes called "outlier detection"). Almost any clustering algorithm is suitable in these cases, but which one is most suitable depends on the type of features and their behavior -- real-valued, ordinal, nominal or anything else. The type of features determines whether certain distance functions are suitable (the basic requirement for most clustering algorithms), and some algorithms are better with certain types of features than others.
The simplest algorithm to try is k-means clustering, where anomalous samples are either members of very small clusters or vectors that are far from all cluster centers. A one-class SVM can also detect outliers, and has the flexibility of choosing different kernels (and effectively different distance functions). Another popular algorithm is DBSCAN.
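As a rough sketch of the k-means idea, the code below runs plain Lloyd's algorithm and scores each vector by its distance to the nearest cluster center. The email features, the initial centers and the cut-off of 10.0 are all assumptions made for illustration; in practice the threshold would be chosen from the distribution of scores.

```python
import math

def kmeans(points, centers, iterations=20):
    """Plain Lloyd's algorithm starting from the given initial centers."""
    centers = [list(c) for c in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def anomaly_scores(points, centers):
    """Score each point by its distance to the nearest cluster center."""
    return [min(math.dist(p, c) for c in centers) for p in points]

# Hypothetical email features: (number of recipients, capitalized words).
emails = [
    [1, 2], [2, 3], [1, 3], [2, 2],        # short personal mails
    [20, 5], [22, 6], [21, 4], [19, 5],    # mailing-list announcements
    [10, 20],                              # unlike either group
]
centers = kmeans(emails, centers=[emails[0], emails[4]])
scores = anomaly_scores(emails, centers)
threshold = 10.0  # assumed cut-off
anomalies = [e for e, s in zip(emails, scores) if s > threshold]
```

One limitation shows up even in this toy: the outlier is still assigned to a cluster and pulls that cluster's center toward itself, so a very extreme outlier can inflate the scores of its normal neighbours.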
When anomalies are known, the problem becomes a supervised learning problem, so you can use classification algorithms and train them on the known anomaly examples. However, as mentioned, this will only detect those known anomalies, and if the number of training samples for anomalies is very small, the trained classifiers may not be accurate. Also, because the number of anomalies is typically very small compared to "no-anomalies", when training the classifiers you might want to use techniques like boosting/bagging, with oversampling of the anomaly class(es), while optimizing for a very small false positive rate. There are various techniques for doing this in the literature; one idea that I have found to work very well many times is what Viola-Jones used for face detection: a cascade of classifiers. See:
(DISCLAIMER: I am the chief data scientist for Anodot, a commercial company doing real time anomaly detection for time series data).

What are techniques and practices on measuring data quality?

If I have a large set of data that describes physical 'things', how could I go about measuring how well that data fits the 'things' that it is supposed to represent?
An example would be if I have a crate holding 12 widgets, and I know each widget weighs 1 lb, there should be some data quality 'check' making sure the case weighs 13 lbs maybe.
Another example would be that if I have a lamp and an image representing that lamp, it should look like a lamp. Perhaps the image dimensions should have the same ratio as the lamp dimensions.
With the exception of images, my data is 99% text (which includes height, width, color...).
I've studied AI in school, but have done very little outside of that.
Are standard AI techniques the way to go? If so, how do I map a problem to an algorithm?
Are some languages easier at this than others? Do they have better libraries?
Your question is somewhat open-ended, but it sounds like what you want is what is known as a "classifier" in the field of machine learning.
In general, a classifier takes a piece of input and "classifies" it, i.e. determines a category for the object. Many classifiers provide a probability with this determination, and some may even return multiple categories with probabilities on each.
Some examples of classifiers are Bayes nets, neural nets, decision lists, and decision trees. Bayes nets are often used for spam classification. Emails are classified as either "spam" or "not spam" with a probability.
For your question you'd want to classify your objects as "high quality" or "not high quality".
The first thing you'll need is a bunch of training data. That is, a set of objects where you already know the correct classification. One way to obtain this could be to get a bunch of objects and classify them by hand. If there are too many objects for one person to classify you could feed them to Mechanical Turk.
Once you have your training data you'd then build your classifier. You'll need to figure out what attributes are important to your classification. You'll probably need to do some experimentation to see what works well. You then have your classifier learn from your training data.
One approach that's often used for testing is to split your training data into two sets. Train your classifier using one of the subsets, and then see how well it classifies the other (usually smaller) subset.
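A minimal sketch of that hold-out approach follows. The records, the single "weight error" feature and the midpoint-of-class-means "classifier" are placeholders I made up to keep the example short; any real classifier would slot into the same train/evaluate structure.

```python
import random

def holdout_split(examples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the labeled examples and split it into train/test."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(classify, test_set):
    """Fraction of held-out examples whose predicted label matches the known one."""
    return sum(classify(x) == y for x, y in test_set) / len(test_set)

def mean(values):
    return sum(values) / len(values)

# Hypothetical hand-labeled records: (weight error in lbs, label).
labeled = [(e / 10, "high quality" if e < 10 else "not high quality")
           for e in range(20)]
train_set, test_set = holdout_split(labeled)

# Stand-in "learning" step: a cut-off halfway between the two class means,
# computed from the training subset only.
good = [x for x, y in train_set if y == "high quality"]
bad = [x for x, y in train_set if y != "high quality"]
cutoff = (mean(good) + mean(bad)) / 2

def classify(x):
    return "high quality" if x < cutoff else "not high quality"

score = accuracy(classify, test_set)  # measured on data the model never saw
```

The important part is that `cutoff` is computed without looking at `test_set`, so `score` is an honest estimate of how the classifier behaves on new data.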
AI is one path, natural intelligence is another.
Your challenge is a perfect match to Amazon's Mechanical Turk. Divvy your data space up into extremely small verifiable atoms and assign them as HITs on Mechanical Turk. Have some overlap to give yourself a sense of HIT answer consistency.
There was a shop with a boatload of component CAD drawings that needed to be grouped by similarity. They broke it up and set it loose on Mechanical Turk to very satisfying results. I could google for hours and not find that link again.
See here for a related forum post.
This is a tough question to answer. For example, what defines a lamp? I could search Google Images for pictures of some crazy-looking lamps. Or even look up the definition of a lamp: there are no physical requirements for what a lamp must look like. That's the crux of the AI problem.
As for the data, you could set up unit testing on the project to ensure that 12 widget() weigh less than 13 lbs in the widgetBox(). Regardless, you need to have the data at hand to be able to test things like that.
I hope I was able to answer your question somewhat. It's a bit vague, and my answers are broad, but hopefully it'll at least send you in a good direction.
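Rule-based checks like that can be sketched as plain functions that a unit test then calls. The function names, the 1 lb crate tare and the tolerances below are assumptions for illustration; the tare is only inferred from the "12 widgets, 13 lbs" example in the question.

```python
def crate_weight_ok(widget_count, widget_weight_lb, crate_tare_lb,
                    measured_lb, tolerance_lb=0.1):
    """Check a recorded crate weight against the weight implied by its contents."""
    expected = widget_count * widget_weight_lb + crate_tare_lb
    return abs(measured_lb - expected) <= tolerance_lb

def aspect_ratio_ok(image_w, image_h, object_w, object_h, tolerance=0.05):
    """Check that an image's width/height ratio roughly matches the object's."""
    image_ratio = image_w / image_h
    object_ratio = object_w / object_h
    return abs(image_ratio - object_ratio) / object_ratio <= tolerance

# 12 widgets at 1 lb each plus an assumed 1 lb crate should weigh about 13 lbs.
good_crate = crate_weight_ok(12, 1.0, 1.0, measured_lb=13.0)
bad_crate = crate_weight_ok(12, 1.0, 1.0, measured_lb=15.0)

# A 10 x 20 inch lamp photographed as 400 x 800 px keeps its proportions.
good_image = aspect_ratio_ok(400, 800, 10, 20)
bad_image = aspect_ratio_ok(800, 800, 10, 20)
```

Checks like these catch records that are internally inconsistent; they cannot tell you whether an image actually shows a lamp.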
