Comparing two English strings for similarities - algorithm

So here is my problem. I have two paragraphs of text and I need to see if they are similar. Not in the sense of string metrics but in meaning. The following two paragraphs are related but I need to find out if they cover the 'same' topic. Any help or direction to solving this problem would be greatly appreciated.
Fossil fuels are fuels formed by natural processes such as anaerobic
decomposition of buried dead organisms. The age of the organisms and
their resulting fossil fuels is typically millions of years, and
sometimes exceeds 650 million years. The fossil fuels, which contain
high percentages of carbon, include coal, petroleum, and natural gas.
Fossil fuels range from volatile materials with low carbon:hydrogen
ratios like methane, to liquid petroleum to nonvolatile materials
composed of almost pure carbon, like anthracite coal. Methane can be
found in hydrocarbon fields, alone, associated with oil, or in the
form of methane clathrates. It is generally accepted that they formed
from the fossilized remains of dead plants by exposure to heat and
pressure in the Earth's crust over millions of years. This biogenic
theory was first introduced by Georg Agricola in 1556 and later by
Mikhail Lomonosov in the 18th century.
Second:
Fossil fuel reforming is a method of producing hydrogen or other
useful products from fossil fuels such as natural gas. This is
achieved in a processing device called a reformer which reacts steam
at high temperature with the fossil fuel. The steam methane reformer
is widely used in industry to make hydrogen. There is also interest in
the development of much smaller units based on similar technology to
produce hydrogen as a feedstock for fuel cells. Small-scale steam
reforming units to supply fuel cells are currently the subject of
research and development, typically involving the reforming of
methanol or natural gas but other fuels are also being considered such
as propane, gasoline, autogas, diesel fuel, and ethanol.

That's a tall order. If I were you, I'd start reading up on Natural Language Processing. NLP is a fairly large field -- I would recommend looking specifically at the things mentioned in the Wikipedia Text Analytics article's "Processes" section.
I think if you make use of information retrieval, named entity recognition, and sentiment analysis, you should be well on your way.

In general, I believe that this is still an open problem. Natural language processing is still a nascent field and while we can do a few things really well, it's still extremely difficult to do this sort of classification and categorization.
I'm not an expert in NLP, but you might want to check out these lecture slides that discuss sentiment analysis and authorship detection. The techniques you might use to do the sort of text comparison you've suggested are related to the techniques you would use for the aforementioned analyses, and you might find this to be a good starting point.
Hope this helps!

You can also have a look on Latent Dirichlet Allocation (LDA) model in machine learning. The idea there is to find a low-dimensional representation of each document (or paragraph), simply as a distribution over some 'topics'. The model is trained in an unsupervised fashion using a collection of documents/paragraphs.
If you run LDA on your collection of paragraphs, then by looking into the similarity of the hidden topics vector, you can find whether a given two paragraphs are related or not.
Of course, the baseline is to not use the LDA, and instead use the term frequencies (augmented with tf/idf) to measure similarities (vector space model).

Related

Given a list of words, how to develop an algorithmic way to semantically group them?

I am working with the Google Places API, and they contain a list of 97 different locations. I want to reduce the list of locations into a lesser number
of them, as many of them are groupable. For example, atm and bank into financial; temple, church, mosque, synagogue into worship; school, university into education; subway_station, train_station, transit_station, gas_station into transportation.
But also, it should not overgeneralize; for example, pet_store, city_hall, courthouse, restaurant into something like buildings.
I tried quite a few methods to do this. First I downloaded synonyms of each of the 97 words in the list from multiple dictionaries. Then, I found out the similarity between 2 words based on what fraction of unique synonyms they share in common (Jaccard similarity):
But after that, how do I group words into clusters? Using traditional clustering methods (k-means, k-medoid, hierarchical clustering, and FCM), I am not getting any good clustering (I identified several misclassifications by scanning the results manually):
I even tried the word2vec model trained on Google news data (where each word is expressed as a vector of 300 features), and I do not get good clusters based on that as well:
You are probably looking for something related to vector space dimensionality reduction. In these techniques, you'll need a corpus of text that uses the locations as words in the text. Dimensionality reduction will then group the terms together. You can do some reading on Latent Dirichlet Allocation and Latent semantic indexing. A good reference is "Introduction to Information Retrieval" by Manning et al., chapter 18. Note that this book is from 2009, so a lot of advances are not captured. As you noted, there has been a lot of work such as word2vec. Another good reference is "Speech and Language Processing" by Jurafsky and Martin, chapter 16.
You need much more data.
No algorithm ever, without additional data, will relate ATM and bank to financial. Because that requires knowledge of these terms.
Jaccard similarity doesn't have access to such knowledge, it can only work on the words. And then "river bank" and "bank branch" are very similar.
So don't expect magic to happen by the algorithm. You need the magic to be in the data...

how to categorize but don't use Classification or Clustering algorithms?

I have a crawler program that stores sport data from 7 difference news agencies every day. it stores about 1200 sport news every day.
I want to categorize news of last two days into sub-categories. So every two days I have about 2400 news that are exactly for these days and many of their topics are talking exactly about the same event.
for example:
70 news are talking about 500 miles racing of Brad Keselowski.
120 news are talking about US swimmer Nyad that begins swimming.
28 new are talking about the match between Man United and Man City.
. . .
In other words, I want to make something like Google News.
The problem is that this situation is not a classification problem, because I don't have special classes. for example, my classes are not swimming, golf, football, etc. my classes are a special events in every field that happened in these two years. So I cannot use classification algorithms such as Naive Bayes.
On the other hand, my problem is not solving with clustering algorithms too. Because I don't want to force them to put into n clusters. Maybe one of the news doesn't have any similar news or maybe in one pack of two days, there are 12 different stories, but in other two days, there are 30 different issues. So I cannot use clustering algorithms such as "Single Link( Maximum Similarity)", "Complete Link( Minimum Similarity)", "Maximum Weighted Matching" or "Group Average( Average Intra Similarity)".
I have some ideas myself to do this, for example, each two news that have 10 common words, should be in the same class. But if we don't consider some parameters such as length of documents, influence of common and rare words and some other things, this will not work well.
I have read this paper, but it was not my answer.
Is there any known algorithm to solve this problem?
The problem strikes me as a clustering problem with an unknown quality measure for the clusters. That points to an unsupervised method, which is ultimately based on detecting correlations using redundancy in the data. Perhaps something like principal component analysis or latent semantic analysis could be useful. The different dimensions (principal components or singular vectors) would indicate distinct major themes, with the terms corresponding to the vector components hopefully being the words appearing in the description. One drawback is that there's no guarantee that the strongest correlations would lead easily to a sensible description.
Take a look at "topic models" and "Latent Dirichlet Allocation". These are popular and you'll find code in a variety of languages.
You might use hierarchical clustering algorithms to investigate relationships between your items - the closest items (news with almost the same description) would be in the same clusters, and the closest clusters (groups of similar news) would be in the same super-cluster etc.
Also, there is pretty nice and fast algorithm called CLOPE - http://www.google.com.ua/url?sa=t&source=web&cd=11&sqi=2&ved=0CF0QFjAK&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.13.7142%26rep%3Drep1%26type%3Dpdf&rct=j&q=CLOPE&ei=gvo_Tsi4AsKa-gas-uCkAw&usg=AFQjCNGcR9sFqhsEkAJowEjIGbDBwSjeXw&cad=rja
There are many document clustering algorithms out there. Take a look at "Hierarchical document clustering using frequent itemsets", for example, and see if that is similar to what you want. If you're programming in Java, you may get some mileage out of the S-space package, which includes algorithms for latent semantic analysis (LSA) among others.

What are good algorithms for detecting abnormality?

Background
Here is the problem:
A black box outputs a new number each day.
Those numbers have been recorded for a period of time.
Detect when a new number from the black box falls outside the pattern of numbers established over the time period.
The numbers are integers, and the time period is a year.
Question
What algorithm will identify a pattern in the numbers?
The pattern might be simple, like always ascending or always descending, or the numbers might fall within a narrow range, and so forth.
Ideas
I have some ideas, but am uncertain as to the best approach, or what solutions already exist:
Machine learning algorithms?
Neural network?
Classify normal and abnormal numbers?
Statistical analysis?
Cluster your data.
If you don't know how many modes your data will have, use something like a Gaussian Mixture Model (GMM) along with a scoring function (e.g., Bayesian Information Criterion (BIC)) so you can automatically detect the likely number of clusters in your data. I recommend this instead of k-means if you have no idea what value k is likely to be. Once you've constructed a GMM for you data for the past year, given a new datapoint x, you can calculate the probability that it was generated by any one of the clusters (modeled by a Gaussian in the GMM). If your new data point has low probability of being generated by any one of your clusters, it is very likely a true outlier.
If this sounds a little too involved, you will be happy to know that the entire GMM + BIC procedure for automatic cluster identification has been implemented for you in the excellent MCLUST package for R. I have used it several times to great success for such problems.
Not only will it allow you to identify outliers, you will have the ability to put a p-value on a point being an outlier if you need this capability (or want it) at some point.
You could try line fitting prediction using linear regression and see how it goes, it would be fairly easy to implement in your language of choice.
After you fitted a line to your data, you could calculate the mean standard deviation along the line.
If the novel point is on the trend line +- the standard deviation, it should not be regarded as an abnormality.
PCA is an other technique that comes to mind, when dealing with this type of data.
You could also look in to unsuperviced learning. This is a machine learning technique that can be used to detect differences in larger data sets.
Sounds like a fun problem! Good luck
There is little magic in all the techniques you mention. I believe you should first try to narrow the typical abnormalities you may encounter, it helps keeping things simple.
Then, you may want to compute derived quantities relevant to those features. For instance: "I want to detect numbers changing abruptly direction" => compute u_{n+1} - u_n, and expect it to have constant sign, or fall in some range. You may want to keep this flexible, and allow your code design to be extensible (Strategy pattern may be worth looking at if you do OOP)
Then, when you have some derived quantities of interest, you do statistical analysis on them. For instance, for a derived quantity A, you assume it should have some distribution P(a, b) (uniform([a, b]), or Beta(a, b), possibly more complex), you put a priori laws on a, b and you ajust them based on successive information. Then, the posterior likelihood of the info provided by the last point added should give you some insight about it being normal or not. Relative entropy between posterior and prior law at each step is a good thing to monitor too. Consult a book on Bayesian methods for more info.
I see little point in complex traditional machine learning stuff (perceptron layers or SVM to cite only them) if you want to detect outliers. These methods work great when classifying data which is known to be reasonably clean.

how does data clustering help in image or pattern recognition

I have been playing around with different data clustering algorithms working on finding clusters between random data points represented an nodes, I keep reading that data clustering is used for image recognition. I am failing to make the connection, how does clustering data help in recognizing an image or in facial recognition. can someone explain this?
It's no surprise that clustering is used for pattern recognition at large, and image recognition in particular: clustering is a reducing process, and images in this megapixel era need boiling down... It is also a process which produces categories and that is of course useful.
However there are many approaches to the use of clustering as a technique for image recognition. One of the reasons for this diversity is that clustering can be applied at different level, for different purposes: from basic pixel level to feature level (feature be a line, a geometric figure...), for classification or for other purposes.
At a very high level, clustering is a statistical tool, it helps discovering the relative importance of various dimensions in defining the belonging of particular item to a particular category.
One [of many] usage[s] of such a tool, is with supervised learning, whereby a set of human-selected items (say images) are fed into the cluster-based logic, along with a label associated with a particular item ("this is an apple", "this is another apple", "this is a lemon"...), the clustering logic then determines how much each dimension of the input matters for helping each group of items (apples, lemons...) fit in a distinct cluster (for example the color may matter relatively little, but the shape, or the presence of dots, or whatever may matter a lot). After this training phase, new images can be fed to the logic and by seeing how close to a particular cluster this image falls, it is "recognized" (as a banana!).
When it comes to image processing one needs to remember that whatever is "fed" to the clustering logic is not necessarily (in fact, rarely) the raw pixels, but various "objects"
characterizing various "elements" of the original data (essentially a collection of relatively high dimension vectors, not unlike some that one may have encountered in other other data clustering examples), and produced by previous stages of the process. For example a important element of facial recognition is probably the exact distance between the center of the eyes. In previous stages, the image is processed in a way that figures out where the eyes are (possibly relying on another clustering-based logic). Then the distance between the eyes, along with many other elements are fed to the final clustering logic.
The preceding description is only one example of the use of clustering for image recognition. Indeed, various forms of neural networks have been used, very successfully, in this domain, and it can be argued that in a sense these neural networks are clustering information. One of the reasons for the success of neural nets may lie in their ability to be more respectful of the locality dimension as found in the original input, and also their ability to work in a hierarchical fashion.
A good conclusion to this write up would be a short list of online resources, but I'm pressed for time at the moment... "to be continued" ;-)
Next day edit: (failed attempt to provide an introductory online bibliography on the subject)
My search for literature on the topic of clustering as applied to artificial vision and image processing revealed two distinct... clusters ;-)
Books such as Algorithms for image processing and computer vision J Parkey pub Wiley, or Machine Vision : Theory, Algorithms, Practicalities M Seul et. Al Cambridge UP. Such books generally cover the all important techniques associated with noise reduction, Edge detection, Color or intensity conversion, and many other elements of the image processing chain, most of which do not involve clustering or even statistical methods, and they reserve only a chapter or two, or even minor mentions, to clustering, as applied to pattern recognition or to other tasks.
Scholarly papers and conference handbooks, which specifically cover clustering techniques applied to artificial vision and such, but in the narrowest and deepest fashion (ex: Variations on the Fukunaga and Narendra algorithm, for applications in character recognition, or Fast methods for selections of Nearest Neighbor candidates in whatever context.)
In short I feel ill equipped to make any specific book or article suggestion.
You may find it informative to browse titles in say Google books, keying in by "Artificial vision" or "Image Recognition" or some or the titles mentioned above. With the preview feature and also the tag cloud (btw another application of clustering) found in the "about this book" link, one can get a good idea of the various books contents and maybe decide to purchase some of them. Unfortunately the reduced readership and the potentially lucrative applications in the field make these books relatively expensive. At the other end of the spectrum, you may download, sometimes for free, research papers discussing advanced topics in the field. These will also show up on regular (web) Google, or at specialized repositories such as CiteSeer.
Good luck with your exploration in that field!

What are techniques and practices on measuring data quality?

If I have a large set of data that describes physical 'things', how could I go about measuring how well that data fits the 'things' that it is supposed to represent?
An example would be if I have a crate holding 12 widgets, and I know each widget weighs 1 lb, there should be some data quality 'check' making sure the case weighs 13 lbs maybe.
Another example would be that if I have a lamp and an image representing that lamp, it should look like a lamp. Perhaps the image dimensions should have the same ratio of the lamp dimensions.
With the exception of images, my data is 99% text (which includes height, width, color...).
I've studied AI in school, but have done very little outside of that.
Are standard AI techniques the way to go? If so, how do I map a problem to an algorithm?
Are some languages easier at this than others? Do they have better libraries?
thanks.
Your question is somewhat open-ended, but it sounds like you want is what is known as a "classifier" in the field of machine learning.
In general, a classifier takes a piece of input and "classifies" it, ie: determines a category for the object. Many classifiers provide a probability with this determination, and some may even return multiple categories with probabilities on each.
Some examples of classifiers are bayes nets, neural nets, decision lists, and decision trees. Bayes nets are often used for spam classification. Emails are classified as either "spam" or "not spam" with a probability.
For you question you'd want to classify your objects as "high quality" or "not high quality".
The first thing you'll need is a bunch of training data. That is, a set of objects where you already know the correct classification. One way to obtain this could be to get a bunch of objects and classify them by hand. If there are too many objects for one person to classify you could feed them to Mechanical Turk.
Once you have your training data you'd then build your classifier. You'll need to figure out what attributes are important to your classification. You'll probably need to do some experimentation to see what works well. You then have your classifier learn from your training data.
One approach that's often used for testing is to split your training data into two sets. Train your classifier using one of the subsets, and then see how well it classifies the other (usually smaller) subset.
AI is one path, natural intelligence is another.
Your challenge is a perfect match to Amazon's Mechanical Turk. Divvy your data space up into extremely small verifiable atoms and assign them as HITs on Mechanical Turk. Have some overlap to give yourself a sense of HIT answer consistency.
There was a shop with a boatload of component CAD drawings that needed to be grouped by similarity. They broke it up and set it loose on Mechanical Turk to very satisfying results. I could google for hours and not find that link again.
See here for a related forum post.
This is a tough answer. For example, what defines a lamp? I could google images a picture of some crazy looking lamps. Or even, look up the definition of a lamp (http://dictionary.reference.com/dic?q=lamp). Theres no physical requirements of what a lamp must look like. Thats the crux of the AI problem.
As for data, you could setup Unit testing on the project to ensure that 12 widget() weighs less than 13 lbs in the widetBox(). Regardless, you need to have the data at hand to be able to test things like that.
I hope i was able to answer your question somewhat. Its a bit vauge, and my answers are broad, but hopefully it'll at least send you in a good direction.

Resources