Detecting Similarity in Strings - algorithm

If I search for something on Google News, I can click on the "Explore in depth" button and get the same news article from multiple sources. What kind of algorithm is used to compare articles of text and determine that they are about the same story? I have seen the question here:
Is there an algorithm that tells the semantic similarity of two phrases
However, I feel that with the methods mentioned there, articles that were similar in nature but about different stories would still be grouped together. Is there a standard way of detecting strings that are about the same thing and grouping them, while keeping strings that are merely similar separate? E.g. if I search "United States Border" I might get stories about problems at the USA's border, but what would prevent these from all getting grouped together? All I can think of is the date of publication, but what if many stories were published very close to each other?

One standard way to determine the similarity of two articles is to create a language model for each of them, and then compute the similarity between the models.
The language model is usually a probability distribution, built under the assumption that the article was generated by a model that randomly selects tokens (words/bigrams/.../n-grams).
The simplest language model is over unigrams (single words): P(w|d) = #occurrences(w,d)/|d|, i.e. the number of times word w appears in document d, divided by the total length of the document. Smoothing techniques are often used to prevent unseen words from getting zero probability.
After you have a language model for each article, all you have to do is compare the two models. One way to do it is cosine similarity; another is the Jensen-Shannon divergence.
This gives you an absolute score of similarity of two articles. This can be combined with many other methods, like your suggestion to compare dates.
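As a rough illustration (not code from the answer itself), here is a minimal sketch in plain Python that builds add-one-smoothed unigram models over a shared vocabulary for two short texts and compares them with cosine similarity; the tokenizer, the smoothing constant alpha, and the example sentences are arbitrary choices.

    import math
    import re
    from collections import Counter

    def unigram_model(text, vocab, alpha=1.0):
        # Add-one (Laplace) smoothed unigram probabilities over a shared vocabulary.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts = Counter(tokens)
        total = len(tokens) + alpha * len(vocab)
        return {w: (counts[w] + alpha) / total for w in vocab}

    def cosine_similarity(p, q):
        # Cosine similarity between two word -> probability mappings with the same keys.
        dot = sum(p[w] * q[w] for w in p)
        norm_p = math.sqrt(sum(v * v for v in p.values()))
        norm_q = math.sqrt(sum(v * v for v in q.values()))
        return dot / (norm_p * norm_q)

    doc_a = "Border patrol agents reported new problems at the United States border."
    doc_b = "Officials described ongoing problems at the border of the United States."

    vocab = set(re.findall(r"[a-z']+", (doc_a + " " + doc_b).lower()))
    model_a = unigram_model(doc_a, vocab)
    model_b = unigram_model(doc_b, vocab)
    print(cosine_similarity(model_a, model_b))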

How are keyword clouds constructed?
I know there are a lot of nlp methods, but I'm not sure how they solve the following problem:
You can have several items that each have a list of keywords relating to them.
(In my own program, these items are articles where I can use nlp methods to detect proper nouns, people, places, and (?) possibly subjects. This will be a very large list given a sufficiently sized article, but I will assume that I can winnow the list down using some method by comparing articles. How to do this properly is what I am confused about).
Each item can have a list of keywords, but how do they pick keywords such that the keywords aren't overly specific or overly general between each item?
For example, trivially, "the" could be a keyword that is in a lot of items.
While "supercalifragilistic" could only be in one.
I suppose I could create a heuristic where a word counts as a keyword if it exists in n% of the items, where n is sufficiently small but still returns a nice sublist (say 5% of 1000 articles is 50, which seems reasonable), and then I could just use that. However, the issue I take with this approach is that, given two entirely different sets of items, there is most likely some difference in interrelatedness between the items, and I'm throwing away that information.
This is very unsatisfying.
I feel that given the popularity of keyword clouds there must have been a solution created already. I don't want to use a library however as I want to understand and manipulate the assumptions in the math.
If anyone has any ideas please let me know.
Thanks!
EDIT:
freenode/programming/guardianx has suggested https://en.wikipedia.org/wiki/Tf%E2%80%93idf
tf-idf is ok btw, but the issue is that the weighting needs to be determined a priori. Given that two distinct collections of documents will have a different inherent similarity between documents, assuming an a priori weighting does not feel correct.
freenode/programming/anon suggested https://en.wikipedia.org/wiki/Word2vec
I'm not sure I want something that uses a neural net (a little complicated for this problem?), but still considering.
Tf-idf is still a pretty standard method for extracting keywords. You can try a demo of a tf-idf-based keyword extractor (which has the idf vector, determined a priori as you say, estimated from Wikipedia). A popular alternative is the TextRank algorithm based on PageRank, which has an off-the-shelf implementation in Gensim.
If you decide for your own implementation, note that all algorithms typically need plenty of tuning and text preprocessing to work correctly.
The minimum you need to do is remove stopwords that you know can never be a keyword (prepositions, articles, pronouns, etc.). If you want something fancier, you can use, for instance, spaCy to keep only the desired parts of speech (nouns, verbs, adjectives). You can also include frequent multiword expressions (Gensim has a good function for automatic collocation detection) and named entities (spaCy can do that). You can get better results if you run coreference resolution and substitute pronouns with what they refer to... There are endless options for improvement.
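For a concrete picture of the tf-idf route, here is a minimal sketch using scikit-learn's TfidfVectorizer to pull the top-weighted terms per document; the toy corpus, the built-in English stopword list, and the top_k cutoff are placeholder choices, not recommendations from the answer.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy corpus; in practice these would be your articles.
    docs = [
        "Microsoft released a new version of Excel with better bar graphs.",
        "The spreadsheet application Excel is widely used for bar graphs.",
        "Supercalifragilistic words rarely appear in ordinary news articles.",
    ]

    # English stopwords remove words like "the"; idf down-weights terms common to many docs.
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(docs)
    terms = vectorizer.get_feature_names_out()

    top_k = 3
    for i in range(tfidf.shape[0]):
        row = tfidf[i].toarray().ravel()
        keywords = [terms[j] for j in row.argsort()[::-1][:top_k] if row[j] > 0]
        print(f"doc {i}: {keywords}")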

Finding similar images via tags

tl;dr
Fundamentally I'm looking for reasonable ways to implement a similarity rank among tag groups, where a tag group is 2 to 9 tags. Similar to ranking the similarity of 2 to 9 word sentences where the vocabulary is 200,000 words, except word order doesn't matter.
I have a collection of tagged images and I want to implement a couple of search functions:
Similar images
Similar but different images
where the similarity is based solely on the tags.
Finding identically tagged images isn't that hard, but after that I'm at a bit of a loss as to the best way to proceed. We have hundreds of thousands of tags and no metadata on them, so we don't know that "Outlook" is related to "Microsoft" or "Windows" or "Email", and therefore cannot appreciate the difference in relatedness of an image tagged "Microsoft,Excel,Bar Graph" to an image tagged "Excel,Spreadsheet" versus one tagged "Visio,Bar Graph".
For "Similar images" we'd want to match "Microsoft,Excel,Bar Graph" to "Visio,Bar Graph" while for "Similar but different images" we'd want to match "Microsoft,Excel,Bar Graph" to "Excel,Spreadsheet".
My best guess at the moment is to treat the tags like text and throw them into Solr. On the other hand, maybe a different kind of database, like Neo4j, would be the way to go.
Any suggestions on how to take a few steps forward? I'm not expecting a full solution, but suggestions for a general approach would be appreciated.
Extra Credit:
To make things more difficult, when tags are assigned to images, they are designated as "primary" or "secondary" and of course we want to take that into account.
Update
Let's repeat the problem.
The input data consists of sets of tags = set of strings (and pointers to the associated image resources)
The strings are just character sequences, there is no additional semantic information available
However there is a weighting of strings into 'primary' (higher weight) and 'secondary' (lower weight)
This means that the search has to rely solely on some similarity measure of sets (and strings).
Examples for such measures are:
Jaccard Similarity
Dice Similarity
Tversky Similarity
Cosine Similarity
This paper from 2010: A weighted tag similarity measure based on a collaborative weight model applies several of those (and others) on the tag problem and shows how to include weighting. That should be helpful, IMHO.
Another (simpler) application can be seen in this paper from 2013: Using of Jaccard Coefficient for Keywords Similarity.
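As a minimal sketch of the set-based route (not taken from either paper), here is a weighted Jaccard similarity over tag sets, where the primary/secondary distinction is encoded as assumed weights of 1.0 and 0.5; both the weights and the example tags are placeholders.

    def weighted_jaccard(tags_a, tags_b):
        # Weighted Jaccard similarity between two {tag: weight} dicts:
        # sum of minimum weights over sum of maximum weights, across the union of tags.
        union = set(tags_a) | set(tags_b)
        num = sum(min(tags_a.get(t, 0.0), tags_b.get(t, 0.0)) for t in union)
        den = sum(max(tags_a.get(t, 0.0), tags_b.get(t, 0.0)) for t in union)
        return num / den if den else 0.0

    PRIMARY, SECONDARY = 1.0, 0.5  # assumed weights for the two tag classes

    img1 = {"Microsoft": SECONDARY, "Excel": PRIMARY, "Bar Graph": PRIMARY}
    img2 = {"Visio": PRIMARY, "Bar Graph": PRIMARY}
    img3 = {"Excel": PRIMARY, "Spreadsheet": PRIMARY}

    print(weighted_jaccard(img1, img2))  # shares "Bar Graph"
    print(weighted_jaccard(img1, img3))  # shares "Excel"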
About the examples from the question
For "Similar images" we'd want to match "Microsoft, Excel, Bar Graph" to "Visio, Bar Graph"
It would have some similarity, because one tag ("Bar Graph") is common to both sets of tags.
while for "Similar but different images" we'd want to match "Microsoft, Excel, Bar Graph" to "Excel, Spreadsheet".
Again, one tag in common ("Excel"). But how should the system know that "Visio" is more similar to the set "Microsoft, Excel, Bar Graph" than "Spreadsheet" is?
That would require semantic information. I don't see how to solve this otherwise.
Old Part
I did not find much that would help you with your chosen approach (you restricted it quite a bit), except for a discussion of various metrics in the 2009 paper below.
But I would like to keep the steps of my little searching on this topic online here, because it puts your problem into context.
Where others go
The research community seems to go in these directions:
using the additional information about the users who provided the tags (folksonomies, social tagging)
making use of semantic meta data (ontologies, semantic similarity)
making use of visual image content (content-based visual information retrieval)
Folksonomies, Social Tagging
See this paper from 2009: Evaluating Similarity Measures for Emergent Semantics of Social Tagging.
Instead of the traditional approach of defining similarity by comparing the graphical data of the images
I = { (x, y, colour) }
by some measure (content-based image retrieval, query by image content, content-based visual information retrieval), those authors use information harvested from the tags ("emergent semantics"), like you intend.
Their basic model consists of user assigned tags for a resource, comparing tuples of a so-called folksonomy
F = { (user, resource, tag) }
which can be scaled down to your case of (resource, tag) tuples by aggregating over the users in different ways, resulting in different similarity measures.
Semantic Similarity
The use of semantic similarity, e.g. Jiang-Conrath, is interesting, but alas, you have no semantic metadata (e.g. an ontology) for your tags, which leaves you stuck with the similarity of the string representations of the words.
Again this paper The Use of Ontologies for Improving Image Retrieval and Annotation from 2008 favours the use of ontologies, but I think it gives a nice discussion of the various approaches.
Folksonomies are social tagging systems that rely on the idea of the wisdom of the people. One representative example of this is Flickr.com. This approach overcomes the very time-consuming process of manual annotation, but the inconsistency in tag use can make searching through the entire collection of data difficult.
Combination with Content-Based Visual Information Retrieval
Both papers above cite this paper Augmenting Navigation for Collaborative Tagging with Emergent Semantics from 2006.
However, using tags alone for searching and browsing databases clearly has its limitations. First, people make mistakes while tagging, such as spelling mistakes, or accidental tagging with the wrong tag. Second, there is no solution to cope with homonymy, i.e. to distinguish different meanings of a word. Third, synonymy or different languages can only be handled by tagging data explicitly with all terms.
These authors combine social tagging with the initially mentioned content-based image retrieval.
Yet another link: collaborative tagging.

Discover user behind multiple different user accounts according to words he uses

I would like to create an algorithm to distinguish the persons writing on a forum under different nicknames.
The goal is to discover people registering a new account to flame the forum anonymously, not under their main account.
Basically I was thinking about stemming the words they use and comparing users according to the similarities of those words.
As shown in the picture, user3 and user4 use the same words. That means there is probably one person behind the computer.
It's clear that there are a lot of common words which are used by all users. So I should focus on "user-specific" words.
Input is (related to the image above):
<word1, user1>
<word2, user1>
<word2, user2>
<word3, user2>
<word4, user2>
<word5, user3>
<word5, user4>
... etc. The order doesn't matter.
Output should be:
user1
user2
user3 = user4
I am doing this in Java but I want this question to be language independent.
Any ideas how to do it?
1) how to store words/users? What data structures?
2) how to get rid of common words everybody uses? I have to somehow ignore them among the user-specific words. Maybe I could just ignore them because they get lost in the noise, but I am afraid that they will hide the significant differences in the "user-specific" words.
3) how to recognize the same users? Somehow count the words two users have in common?
I am very thankful in advance for any advice.
In general this is the task of author identification (authorship attribution), and there are several good papers like this one that may give you a lot of information. Here are my own suggestions on this topic.
1. User recognition/author identification itself
The simplest kind of text classification is classification by topic, and there you take meaningful words first of all. That is, if you want to distinguish text about Apple the company from text about apple the fruit, you count words like "eat", "oranges", "iPhone", etc., but you commonly ignore things like articles, forms of words, part-of-speech (POS) information and so on. However, many people may talk about the same topics but use different styles of speech, that is, articles, forms of words and all the things you ignore when classifying by topic. So the first and main thing you should consider is collecting the most useful features for your algorithm. An author's style may be expressed by the frequency of words like "a" and "the", POS information (e.g. some people tend to use the present tense, others the future), common phrases ("I would like" vs. "I'd like" vs. "I want") and so on. Note that topic words should not be discarded completely - they still show the themes the user is interested in. However, you should treat them somewhat specially; e.g. you can pre-classify texts by topic and then discount users not interested in that topic.
When you are done with feature collection, you may use one of the machine learning algorithms to find the best guess for the author of a text. As for me, the two best suggestions here are probability estimation and cosine similarity between the text's vector and the user's profile vector.
2. Discriminating common words
Or, in the current context, common features. The best way I can think of to get rid of the words that are used by all people more or less equally is to compute the entropy of each such feature:
entropy(x) = -sum(P(Ui|x) * log(P(Ui|x)))
where x is a feature, Ui is the i-th user, P(Ui|x) is the conditional probability of the i-th user given feature x, and the sum runs over all users.
A high entropy value indicates that the distribution of this feature across users is close to uniform, and thus the feature is almost useless for telling users apart.
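A minimal sketch of that entropy filter in Python, assuming the input is simply a list of (word, user) pairs as in the question; the threshold is an arbitrary illustrative value.

    import math
    from collections import Counter, defaultdict

    pairs = [  # (word, user) observations, as in the question's input format
        ("word1", "user1"), ("word2", "user1"), ("word2", "user2"),
        ("word3", "user2"), ("word4", "user2"), ("word5", "user3"),
        ("word5", "user4"), ("the", "user1"), ("the", "user2"),
        ("the", "user3"), ("the", "user4"),
    ]

    users_per_word = defaultdict(Counter)
    for word, user in pairs:
        users_per_word[word][user] += 1

    def entropy(word):
        # Entropy of the user distribution for a word; high = used by everyone equally.
        counts = users_per_word[word]
        total = sum(counts.values())
        return -sum((c / total) * math.log(c / total) for c in counts.values())

    THRESHOLD = 1.0  # arbitrary cut-off for this toy example
    for word in users_per_word:
        keep = entropy(word) < THRESHOLD
        print(word, round(entropy(word), 3), "keep" if keep else "drop")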
3. Data representation
Common approach here is to have a user-feature matrix. That is, you just build a table where rows are user ids and columns are features. E.g. cell [3][12] shows how many times user #3 used feature #12, in normalized form (don't forget to normalize these frequencies by the total number of features the user ever used!).
Depending on the features you are going to use and the size of the matrix, you may want to use a sparse matrix implementation instead of a dense one. E.g. if you use 1000 features and for every particular user around 90% of the cells are 0, it doesn't make sense to keep all these zeros in memory, and a sparse implementation is the better option.
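To make the representation concrete, here is a small sketch of a sparse user-feature matrix with row-normalized counts and cosine similarity between users, using scipy and scikit-learn; the feature columns and example counts are made up for illustration.

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.preprocessing import normalize
    from sklearn.metrics.pairwise import cosine_similarity

    # rows = users, columns = features; values = raw usage counts
    counts = csr_matrix(np.array([
        [3, 0, 1, 0],   # user1
        [0, 2, 0, 4],   # user2
        [1, 0, 5, 0],   # user3
        [1, 0, 4, 0],   # user4 -- profile close to user3
    ], dtype=float))

    # normalize each user's counts by the total number of features they used (L1 norm)
    profiles = normalize(counts, norm="l1")

    sim = cosine_similarity(profiles)   # users x users similarity matrix
    print(np.round(sim, 2))             # a high sim[2, 3] hints that user3 == user4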
I recommend a language modelling approach. You can train a language model (unigram, bigram, parsimonious, ...) on each of your user accounts' words. That gives you a mapping from words to probabilities, i.e. numbers between 0 and 1 (inclusive) expressing how likely it is that a user uses each of the words you encountered in the complete training set. Language models can be stored as arrays of pairs, hash tables or sparse vectors. There are plenty of libraries on the web for fitting LMs.
Such a mapping can be considered a high-dimensional vector, in the same way documents are considered as vector in the vector space model of information retrieval. You can then compare these vectors by using KL-divergence or any of the popular distance metrics: Euclidean distance, cosine distance, etc. A strong similarity/small distance between two users' vectors might then indicate that they belong to one and the same user.
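A minimal sketch of that comparison under simple assumptions: add-one-smoothed unigram models over a shared vocabulary for two accounts, compared with a symmetrized KL divergence; the tiny word lists and the smoothing constant are placeholders.

    import math
    from collections import Counter

    def smoothed_lm(words, vocab, alpha=1.0):
        # Add-one smoothed unigram language model over a fixed vocabulary.
        counts = Counter(words)
        total = len(words) + alpha * len(vocab)
        return {w: (counts[w] + alpha) / total for w in vocab}

    def kl_divergence(p, q):
        # KL(P || Q); both models cover the same vocabulary and are fully smoothed.
        return sum(p[w] * math.log(p[w] / q[w]) for w in p)

    account_a = "i would like to point out that this thread is nonsense".split()
    account_b = "i'd like to point out this thread is complete nonsense".split()
    vocab = set(account_a) | set(account_b)

    lm_a, lm_b = smoothed_lm(account_a, vocab), smoothed_lm(account_b, vocab)
    sym_kl = 0.5 * (kl_divergence(lm_a, lm_b) + kl_divergence(lm_b, lm_a))
    print(sym_kl)  # a small value suggests the two accounts use words similarly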
how to store words/users? What data structures?
You probably have some kind of representation for the users and the posts that they have made. I think you should have a list of words, and a list corresponding to each word containing the users who use it. Something like:
<word: <user#1, user#4, user#5, ...> >
how to get rid of common words everybody use?
Hopefully, you have a set of stopwords. Why not extend it to include commonly used words from your forum? For example, for stackoverflow, some of the most frequently used tags' names should qualify for it.
how to recognize same users?
In addition to using similarity or word-frequency based measures, you can also try using interactions between users. For example, user3 likes/upvotes/comments on each and every post by user8, or a new user does similar things for some other (older) user.

Automatically linking categories to each other when categorizing text

I've been working on a project to data-mine a large amount of short texts and categorize these based on a pre-existing large list of category names. To do this I had to figure out how to first create a good text corpus from the data in order to have reference documents for the categorization and then to get the quality of the categorization up to an acceptable level. This part I am finished with (luckily categorizing text is something that a lot of people have done a lot of research into).
Now for my next problem: I'm trying to figure out a good way of linking the various categories to each other computationally. That is to say, to figure out how to recognize that "cars" and "chevrolet" are related in some way. So far I've tried utilizing the N-gram categorization methods described by, among others, Cavnar and Trenkle for comparing the various reference documents I've created for each category. Unfortunately it seems the best I've been able to get out of that method is approximately 50-55% correct relations between categories, and those are the best relations; overall it's around 30-35%, which is miserably low.
I've tried a couple of other approaches as well but I've been unable to get much higher than 40% relevant links (an example of a non-relevant relation would be the category "trucks" being strongly related to the category "makeup" or the category "diapers" while weakly (or not at all) related to "chevy").
Now, I've tried looking for better methods for doing this but it just seems like I can't find any (yet I know others have done better than I have). Does anyone have any experience with this? Any tips on usable methods for creating relations between categories? Right now the methods I've tried either don't give enough relations at all or contain way too high a percentage of junk relations.
Obviously, the best way of doing that matching is highly dependent on your taxonomy, the nature of your "reference documents", and the expected relationships you'd like created.
However, based on the information provided, I'd suggest the following:
Start by building a word-based (rather than letter-based) unigram or bigram model for each of your categories, based on the reference documents. If there are only a few of these for each category (it seems you might have only one), you could use a semi-supervised approach and also throw in the automatically categorized documents for each category. A relatively simple tool for building the model might be the CMU SLM toolkit.
Calculate the mutual information (information gain) of each term or phrase in your model with respect to the other categories. If your categories are similar, you might need to use only neighboring categories to get a meaningful result. This step gives the best-separating terms higher scores.
Correlate the categories to each other based on the top-infogain terms or phrases. This could be done either by using Euclidean or cosine distance between the category models, or by using somewhat more elaborate techniques, like graph-based algorithms or hierarchical clustering.
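A minimal sketch of steps 2 and 3 under some simplifying assumptions: scikit-learn's mutual_info_classif scores terms against category labels, the top-scoring terms form each category's vector, and cosine similarity relates the categories; the toy documents, labels, and top_k cutoff are made up for illustration.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy reference documents, one per category.
    docs = [
        "chevrolet trucks engines horsepower dealership",
        "cars engines horsepower road driving dealership",
        "makeup lipstick mascara beauty skincare",
    ]
    labels = np.array([0, 1, 2])  # category ids: chevrolet, cars, makeup

    X = CountVectorizer().fit_transform(docs)

    # Step 2: information gain of each term with respect to the category labels.
    scores = mutual_info_classif(X, labels, discrete_features=True)

    # Keep only the top-k separating terms and build per-category term vectors.
    top_k = 10
    top = np.argsort(scores)[::-1][:top_k]
    category_vectors = X[:, top].toarray().astype(float)

    # Step 3: correlate categories via cosine similarity between their term vectors.
    print(np.round(cosine_similarity(category_vectors), 2))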

how to get the similar texts from a lot of pages?

Get the x most similar texts to one text, from a lot of texts.
Maybe converting each page to plain text first is better.
You should not compare the text to every other text, because that's too slow.
The ability to identify similar documents/pages, whether web pages or more general forms of text or even code, has many practical applications. This topic is well represented in scholarly papers and also in less specialized forums. In spite of this relative wealth of documentation, it can be difficult to find the information and techniques relevant to a particular case.
By describing the specific problem at hand and the associated requirements, it may be possible to provide you with more guidance. In the meantime the following provides a few general ideas.
Many different functions may be used to measure, in some fashion, the similarity of pages. Selecting one (or possibly several) of these functions depends on various factors, including the amount of time and/or space one can allot to the problem and also the level of tolerance desired for noise.
Some of the simpler metrics are:
length of the longest common sequence of words
number of common words
number of common sequences of words of more than n words
number of common words for the top n most frequent words within each document.
length of the document
Some of the metrics above work better when normalized (for example, to avoid favoring long pages which, through their sheer size, have more chances of sharing words with other pages).
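As a rough illustration, here is a small sketch of two of the simpler metrics in normalized form (number of common words, and overlap between the top n most frequent words of each document); the tokenizer and the choice of normalizer (the smaller document's size) are arbitrary.

    import re
    from collections import Counter

    def tokenize(text):
        return re.findall(r"[a-z']+", text.lower())

    def common_word_ratio(a, b):
        # Number of shared distinct words, normalized by the smaller vocabulary.
        wa, wb = set(tokenize(a)), set(tokenize(b))
        return len(wa & wb) / min(len(wa), len(wb))

    def top_n_overlap(a, b, n=20):
        # Overlap between the n most frequent words of each document, normalized.
        top_a = {w for w, _ in Counter(tokenize(a)).most_common(n)}
        top_b = {w for w, _ in Counter(tokenize(b)).most_common(n)}
        return len(top_a & top_b) / min(len(top_a), len(top_b))

    page1 = "the quick brown fox jumps over the lazy dog"
    page2 = "a lazy dog was jumped over by the quick brown fox"
    print(common_word_ratio(page1, page2))
    print(top_n_overlap(page1, page2, n=5))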
More complicated and/or computationally expensive measurements are:
Edit distance (which is in fact a generic term as there are many ways to measure the Edit distance. In general, the idea is to measure how many [editing] operations it would take to convert one text to the other.)
Algorithms derived from the Ratcliff/Obershelp algorithm (but counting words rather than letters)
Linear algebra-based measurements
Statistical methods such as Bayesian filters
In general, we can distinguish measurements/algorithms where most of the calculation can be done once for each document, followed by an extra pass aimed at comparing or combining these measurements (with relatively little extra computation), as opposed to algorithms that require the documents to be compared in pairs.
Before choosing one (or indeed several) of these measures, along with some weighting coefficients, it is important to consider additional factors beyond the similarity measurement per se. For example, it may be beneficial to...
normalize the text in some fashion (in the case of web pages, in particular, similar page contents, or similar paragraphs are made to look less similar because of all the "decorum" associated with the page: headers, footers, advertisement panels, different markup etc.)
exploit markup (e.g. giving more weight to similarities found in the title or in tables than to similarities found in plain text).
identify and eliminate domain-related (or even generally known) expressions. For example, two completely different documents may appear similar if they have in common two "boiler plate" paragraphs pertaining to some legal disclaimer or some general-purpose description not truly associated with the essence of each document's content.
Tokenize texts, remove stop words and arrange in a term vector. Calculate tf-idf. Arrange all vectors in a matrix and calculate distances between them to find similar docs, using for example Jaccard index.
All depends on what you mean by "similar". If you mean "about the same subject", looking for matching N-grams usually works pretty well. For example, just make a map from trigrams to the text that contains them, and put all trigrams from all of your texts into that map. Then when you get your text to be matched, look up all its trigrams in your map and pick the most frequent texts that come back (perhaps with some normalization by length).
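A minimal sketch of that trigram-map idea: an inverted index from word trigrams to the texts containing them, and a vote count over candidate texts at query time; the example texts are made up, and length normalization is left out for brevity.

    from collections import Counter, defaultdict

    def trigrams(text):
        words = text.lower().split()
        return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

    texts = [
        "problems at the united states border worsen",
        "the united states border sees new problems",
        "a recipe for apple pie with cinnamon",
    ]

    # Build the map: trigram -> ids of texts containing it.
    index = defaultdict(set)
    for doc_id, text in enumerate(texts):
        for tri in trigrams(text):
            index[tri].add(doc_id)

    def most_similar(query, top_k=2):
        # Count, per candidate text, how many of the query's trigrams it shares.
        votes = Counter()
        for tri in trigrams(query):
            for doc_id in index.get(tri, ()):
                votes[doc_id] += 1
        return votes.most_common(top_k)

    print(most_similar("new problems at the united states border"))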
I don't know what you mean by similar, but perhaps you ought to load your texts into a search system like Lucene and pose your 'one text' to it as a query. Lucene does pre-index the texts so it can quickly find the most similar ones (by its lights) at query-time, as you asked.
You will have to define a function to measure the "difference" between two pages. I can imagine a variety of such functions, one of which you have to choose for your domain:
Difference of Keyword Sets - You can prune the document of the most common words in the dictionary, and then end up with a list of unique keywords per document. The difference function would then calculate the difference as the difference of the sets of keywords per document.
Difference of Text - Calculate each distance based upon the number of edits it takes to turn one doc into another using a text diffing algorithm (see Text Difference Algorithm).
Once you have a difference function, simply calculate the difference of your current doc with every other doc, then return the other doc that is closest.
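A small sketch of both suggested difference functions, assuming a tiny placeholder stopword list: a keyword-set difference after pruning common words, and a word-level text difference based on Python's difflib (Ratcliff/Obershelp applied to words, as mentioned in another answer above).

    from difflib import SequenceMatcher

    STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it"}  # placeholder list

    def keyword_set_difference(a, b):
        # Difference of stopword-pruned keyword sets, normalized to [0, 1].
        ka = {w for w in a.lower().split() if w not in STOPWORDS}
        kb = {w for w in b.lower().split() if w not in STOPWORDS}
        return len(ka ^ kb) / len(ka | kb)          # symmetric difference over union

    def text_difference(a, b):
        # Word-level diff distance: 1 - Ratcliff/Obershelp similarity ratio.
        return 1.0 - SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

    current = "the border of the united states is in the news"
    others = ["new problems at the united states border",
              "an apple pie recipe with cinnamon and sugar"]

    # Return the other doc that is closest to the current doc.
    closest = min(others, key=lambda doc: keyword_set_difference(current, doc)
                                          + text_difference(current, doc))
    print(closest)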
If you need to do this a lot and you have a lot of documents, then the problem becomes a bit more difficult.
