How to get similar texts from a lot of pages? - algorithm

Get the x most similar texts to one given text from a large collection of texts.
Maybe changing "page" to "text" is better.
You should not compare the text to every other text, because that is too slow.

The ability to identify similar documents/pages, whether web pages or more general forms of text or even code, has many practical applications. This topic is well represented in scholarly papers and also in less specialized forums. In spite of this relative wealth of documentation, it can be difficult to find the information and techniques relevant to a particular case.
By describing the specific problem at hand and the associated requirements, it may be possible to provide you with more guidance. In the meantime, the following provides a few general ideas.
Many different functions may be used to measure, in some fashion, the similarity of pages. Selecting one (or possibly several) of these functions depends on various factors, including the amount of time and/or space one can allot to the problem and also the level of tolerance desired for noise.
Some of the simpler metrics are:
length of the longest common sequence of words
number of common words
number of common sequences of words of more than n words
number of common words for the top n most frequent words within each document.
length of the document
Some of the metrics above work better when normalized (for example, to avoid favoring long pages which, through their sheer size, have more chances of sharing words with other pages).
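As a rough illustration (not from the original answer), a normalized common-words metric might look like this in Python; the whitespace tokenization is a simplifying assumption:

def common_word_score(a: str, b: str) -> float:
    """Number of shared words, normalized by the combined vocabulary size
    so that long pages are not favored merely because of their size."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

# Example: scores are in [0, 1]; higher means more word overlap.
print(common_word_score("the quick brown fox", "the slow brown dog"))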
More complicated and/or computationally expensive measurements are:
Edit distance (which is in fact a generic term, as there are many ways to measure it; in general, the idea is to measure how many [editing] operations it would take to convert one text into the other)
Algorithms derived from the Ratcliff/Obershelp algorithm, but counting words rather than letters (see the sketch after this list)
Linear algebra-based measurements
Statistical methods such as Bayesian filters
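For the Ratcliff/Obershelp item above, Python's difflib offers a ready-made approximation; feeding it word lists (my own simplification, not something the answer specifies) makes it count words rather than letters:

from difflib import SequenceMatcher

def word_level_ratio(a: str, b: str) -> float:
    # SequenceMatcher is based on a Ratcliff/Obershelp-like approach;
    # passing lists of words makes the comparison word-based.
    return SequenceMatcher(None, a.split(), b.split()).ratio()

print(word_level_ratio("the cat sat on the mat", "the cat lay on the mat"))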
In general, we can distinguish measurements/algorithms where most of the calculation can be done once per document, followed by an extra pass aimed at comparing or combining these measurements (with relatively little extra computation), as opposed to algorithms that require dealing with the documents to be compared in pairs.
Before choosing one (or indeed several) such measures, along with some weighting coefficients, it is important to consider additional factors beyond the similarity measurement per se. For example, it may be beneficial to...
normalize the text in some fashion (in the case of web pages in particular, similar page contents or similar paragraphs are made to look less similar because of all the "decorum" associated with the page: headers, footers, advertisement panels, different markup, etc.)
exploit markup (e.g., giving more weight to similarities found in the title or in tables than to similarities found in plain text)
identify and eliminate domain-related (or even generally known) expressions. For example, two completely different documents may appear similar if they have in common two "boilerplate" paragraphs pertaining to some legal disclaimer or some general-purpose description, not truly associated with the essence of each document's content.

Tokenize the texts, remove stop words and arrange each text in a term vector. Calculate tf-idf. Arrange all vectors in a matrix and calculate distances between them to find similar docs, using for example the Jaccard index.
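A minimal sketch of that pipeline, assuming scikit-learn is available (the answer itself does not name a library); cosine similarity is used here instead of the Jaccard index for simplicity:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "a cat lay on a mat",
        "stock markets fell sharply today"]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Similarity of the first document to every document (including itself).
sims = cosine_similarity(tfidf[0:1], tfidf).flatten()
ranked = sims.argsort()[::-1]          # indices, most similar first
print([(int(i), float(sims[i])) for i in ranked])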

All depends on what you mean by "similar". If you mean "about the same subject", looking for matching N-grams usually works pretty well. For example, just make a map from trigrams to the text that contains them, and put all trigrams from all of your texts into that map. Then when you get your text to be matched, look up all its trigrams in your map and pick the most frequent texts that come back (perhaps with some normalization by length).
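A small sketch of such a trigram map in Python (the whitespace tokenization and the optional normalization are simplifying assumptions):

from collections import Counter, defaultdict

def trigrams(text):
    words = text.lower().split()
    return zip(words, words[1:], words[2:])

texts = {0: "the cat sat on the mat", 1: "the dog sat on the rug"}
index = defaultdict(set)               # trigram -> ids of texts containing it
for doc_id, text in texts.items():
    for tri in trigrams(text):
        index[tri].add(doc_id)

def most_similar(query, top_n=3):
    hits = Counter()
    for tri in trigrams(query):
        for doc_id in index.get(tri, ()):
            hits[doc_id] += 1          # optionally normalize by text length
    return hits.most_common(top_n)

print(most_similar("a cat sat on the mat"))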

I don't know what you mean by similar, but perhaps you ought to load your texts into a search system like Lucene and pose your 'one text' to it as a query. Lucene does pre-index the texts so it can quickly find the most similar ones (by its lights) at query-time, as you asked.

You will have to define a function to measure the "difference" between two pages. I can imagine a variety of such functions, one of which you have to choose for your domain:
Difference of Keyword Sets - You can prune the document of the most common words in the dictionary and end up with a list of unique keywords per document. The difference function would then calculate the difference as the difference of the sets of keywords per document (sketched below).
Difference of Text - Calculate each distance based upon the number of edits it takes to turn one doc into another using a text diffing algorithm (see Text Difference Algorithm).
Once you have a difference function, simply calculate the difference of your current doc with every other doc, then return the other doc that is closest.
If you need to do this a lot and you have a lot of documents, then the problem becomes a bit more difficult.
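A hedged sketch of the keyword-set variant; the stop-word list and the normalization are my own simplifications, not part of the answer:

COMMON = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def keywords(text):
    return {w for w in text.lower().split() if w not in COMMON}

def difference(a, b):
    # Symmetric difference of the keyword sets, normalized by the union,
    # so the score does not simply grow with document length.
    ka, kb = keywords(a), keywords(b)
    return len(ka ^ kb) / max(len(ka | kb), 1)

def closest(doc, others):
    return min(others, key=lambda other: difference(doc, other))

print(closest("the cat sat on the mat",
              ["a cat lay on a mat", "stock markets fell sharply"]))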

Related

Finding keywords in a set of small texts

I have a set of almost 2000 texts.
My goal is to find the keywords across these texts to understand what their subject is, or simply the most common words and expressions.
I would like some ideas for algorithms to score the words and identify when they frequently come together.
I have read some other related questions here, but I'm trying to get more and more information about this subject. So any ideas are very welcome. Thank you so much!
--
I have already extracted stopwords. After removing them I have more than 7000 words remaining. My question is how to score these words, and from which point I can consider removing some of them from my list of keywords. Also, how do I get key expressions, i.e. find words that come together?
You may want to refer to a classical text on Information Retrieval. Most of the algorithms use a stop list to remove commonly occurring words such as "for" and "the", and then extract the base or root word (change "seeing", "seen", "see", "sees" to the base word "see"). The remaining words form the keywords of the document and are weighted by things like term frequency (how many times the word occurs in the document) and inverse document frequency (how unique the word is in describing the content). You can use the weighted keywords as the document representation and use them for retrieval.
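To make the weighting concrete, here is a small, self-contained tf-idf computation (the exact formula varies between systems; this is just one common variant):

import math
from collections import Counter

def tfidf(documents):
    """documents: list of token lists. Returns one {term: tf*idf} dict per document."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    n = len(documents)
    scores = []
    for tokens in documents:
        tf = Counter(tokens)
        scores.append({t: tf[t] * math.log(n / doc_freq[t]) for t in tf})
    return scores

docs = [["see", "spot", "run"], ["see", "spot", "jump"], ["stocks", "fell"]]
for per_doc in tfidf(docs):
    print(sorted(per_doc.items(), key=lambda kv: -kv[1]))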
You can use the Lucene MoreLikeThis implementation, which extracts a list of the most important keywords from a given text document. The term scoring function it uses is tf-idf, i.e. it chooses the terms with the topmost tf-idf scores: the terms which are relatively uncommon in the collection and occur frequently in the document.
If efficiency is an issue, it employs some common heuristics as follows.
Since you're trying to maximize a tf*idf score, you're probably most interested in terms with a high tf. Choosing a tf threshold even as low as two or three will radically reduce the number of terms under consideration. Another heuristic is that terms with a high idf (i.e., a low df) tend to be longer. So you could threshold the terms by the number of characters, not selecting anything less than, e.g., six or seven characters. With these sorts of heuristics you can usually find a small set of, e.g., ten or fewer terms that do a pretty good job of characterizing a document.
More details can be found in this javadoc.
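A tiny illustration of those heuristics; the thresholds (tf >= 2, length >= 6) are just the example values from the answer, not fixed rules:

def candidate_terms(term_freqs, min_tf=2, min_len=6):
    """Keep terms that occur at least min_tf times and are at least
    min_len characters long, as a cheap pre-filter before tf-idf scoring."""
    return {term: freq for term, freq in term_freqs.items()
            if freq >= min_tf and len(term) >= min_len}

print(candidate_terms({"the": 40, "retrieval": 5, "of": 30, "keyword": 3}))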

Algorithm to compare similarity of ideas (as strings)

Consider an arbitrary text box that records the answer to the question: what do you want to do before you die?
Using a collection of response strings (max length 240), I'd like to somehow sort and group them and count them by idea (which may be just string similarity as described in this question).
Is there another or better way to do something like this?
Is this any different than string similarity?
Is this the right question to be asking?
The idea here is to have people write in a text box over and over again, and for me to provide a number that describes, generally speaking, that 802 people wrote approximately the same thing.
It is much more difficult than string similarity. This is what you need to do at a minimum:
Perform some text formatting/cleaning tasks like removing punctuation characters and common "stop words"
Construct a corpus (collection of words with their usage statistics) from the terms that occur in the answers.
Calculate a weight for every term.
Construct a document vector from every answer (each term corresponds to a dimension in a very high-dimensional Euclidean space)
Run a clustering algorithm on document vectors.
Read a good statistical natural language processing book, or search Google for good introductions/tutorials (likely terms: statistical NLP, text categorization, clustering). You can probably find some libraries (Weka or NLTK come to mind) depending on the language of your choice, but you need to understand the concepts to use the library anyway.
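As a rough sketch of the pipeline above (assuming scikit-learn, which the answer does not name; NLTK or Weka would work equally well):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

answers = ["travel the whole world",
           "see every continent",
           "learn to play the piano",
           "write and perform my own music"]

# Steps 1-4: clean, weight and vectorize the answers.
vectors = TfidfVectorizer(stop_words="english").fit_transform(answers)

# Step 5: cluster the document vectors; the cluster sizes give counts
# of people who wrote approximately the same thing.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for label, answer in zip(labels, answers):
    print(label, answer)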
The Latent Semantic Analysis (LSA) might interest you. Here is a nice introduction.
Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.
[...]
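A minimal LSA sketch, assuming scikit-learn's TruncatedSVD over a tf-idf matrix (the quoted article describes the idea, not this particular code):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

answers = ["travel the whole world",
           "visit every country on earth",
           "learn to play the piano"]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(answers)

# Project the term vectors into a small number of latent "concepts".
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
print(cosine_similarity(lsa))          # pairwise similarity in concept space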
What you want is very much an open problem in NLP. #Ali's answer describes the idea at a high level, but the part "Construct a document vector from every answer" is the really hard one. There are a few obvious ways of building a document vector from the vectors of the words it contains. Addition, multiplication and averaging are fast, but they effectively ignore the syntax. "Man bites dog" and "Dog bites man" will have the same representation, but clearly not the same meaning. Google compositional distributional semantics; as far as I know, there are people at the Universities of Texas, Trento, Oxford, Sussex and at Google working in the area.

N-gram text categorization category size difference compensation

Lately I've been mucking about with text categorization and language classification based on Cavnar and Trenkle's article "N-Gram-Based Text Categorization" as well as other related sources.
For doing language classification I've found this method to be very reliable and useful. The size of the documents used to generate the N-gram frequency profiles is fairly unimportant as long as they are "long enough" since I'm just using the most common n N-grams from the documents.
On the other hand, well-functioning text categorization eludes me. I've tried both my own implementations of various variations of the algorithms at hand, with and without various tweaks such as idf weighting, and other people's implementations. It works quite well as long as I can generate somewhat similarly sized frequency profiles for the category reference documents, but the moment they start to differ just a bit too much, the whole thing falls apart and the category with the shortest profile ends up getting a disproportionate number of documents assigned to it.
Now, my question is: what is the preferred method of compensating for this effect? It's obviously happening because the algorithm assumes a maximum distance for any given N-gram that equals the length of the category frequency profile, but for some reason I just can't wrap my head around how to fix it. One reason I'm interested in a fix is that I'm trying to automate the generation of category profiles based on documents with a known category, which can vary in length (and even if they are the same length, the profiles may end up being different lengths). Is there a "best practice" solution to this?
If you are still interested, and assuming I understand your question correctly, the answer to your problem would be to normalise your n-gram frequencies.
The simplest way to do this, on a per document basis, is to count the total frequency of all n-grams in your document and divide each individual n-gram frequency by that number. The result is that every n-gram frequency weighting now relates to a percentage of the total document content, regardless of the overall length.
Using these percentages in your distance metrics will discount the size of the documents and instead focus on the actual make up of their content.
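In code, the per-document normalization described above could be as simple as this (a sketch, assuming the profile is a dict of raw n-gram counts):

def normalize_profile(ngram_counts):
    """Convert raw n-gram counts into relative frequencies so that
    profiles built from documents of different lengths are comparable."""
    total = sum(ngram_counts.values())
    if total == 0:
        return {}
    return {ngram: count / total for ngram, count in ngram_counts.items()}

print(normalize_profile({"th": 30, "he": 25, "ng": 5}))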
It might also be worth noting that the n-gram representation only makes up a very small part of an entire categorisation solution. You might also consider using dimensionality reduction, different index weighting metrics and, obviously, different classification algorithms.
See here for an example of n-gram use in text classification
As I understand it, the task is to compute the probability that some text is generated by a language model M.
Recently I was working on measuring the readability of texts using semantic, syntactic and lexical properties. It can also be measured by a language model approach.
To answer properly you should consider these questions:
Are you using a log-likelihood approach?
What level of N-grams are you using: unigrams, digrams or higher?
How big are the language corpuses that you use?
Using only digrams and unigrams I managed to classify some documents with nice results. If your classification is weak, consider creating a bigger language corpus or using n-grams of lower levels.
Also remember that classifying some text into an invalid category may be an error that depends on the length of the text (by chance there may be a few words appearing in another language's model).
Just consider making your language corpuses bigger, and know that analysing short texts has a higher probability of misclassification.
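For illustration, a minimal add-one-smoothed unigram language model with log-likelihood scoring might look like this (a sketch of the general approach, not the specific system discussed):

import math
from collections import Counter

def train_unigram(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab = len(counts)
    # Add-one smoothing so unseen words do not produce log(0).
    return lambda w: math.log((counts[w] + 1) / (total + vocab + 1))

def log_likelihood(model, tokens):
    return sum(model(w) for w in tokens)

corpora = {"english": "the cat sat on the mat the dog ran",
           "german": "die katze sitzt auf der matte der hund"}
models = {lang: train_unigram(text.split()) for lang, text in corpora.items()}

query = "the dog sat".split()
print(max(models, key=lambda lang: log_likelihood(models[lang], query)))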

Finding personal information in documents (hard problem)

I am tasked with trying to create an automated system that removes personal information from text documents.
Emails and phone numbers are relatively easy to remove. Names are not. The problem is hard because there are names in the documents that need to be kept (e.g. references, celebrities, characters, etc.). The author's name needs to be removed from the content (there may also be more than one author).
I have currently thought of the following:
Quite often personal names are located at the beginning of a document
Look at how frequently the name is used in the document (personal names tend to be written just once)
Search for words around the name to find patterns (mentions of university and so on...)
Any ideas? Has anyone solved this problem already?
With current technology, doing what you are describing in a fully automated way with a low error rate is impossible.
It might be possible to come up with an approximate solution, but it would still make a lot of errors: either false positives or false negatives, or some combination of the two.
If you are still really determined to try, I think your best approach would be Bayesian filtering (as used in spam filtering). The reason for this is that it is quite good at assigning probabilities based on the relative positions and frequencies of words, and it could also learn which names are more or less likely to be celebrities, etc.
The area of machine learning that you would need to learn about to make an attempt at this is natural language processing. There are a few different approaches that could be used; Bayesian networks (something better than a naive Bayes classifier), support vector machines, or neural nets would be areas to research. Whatever system you end up building will probably need to use an annotated corpus (a labeled set of data) to learn where names are. Even with a large corpus, whatever you build will not be 100% accurate, so you would probably be better off flagging names for deletion instead of just deleting all the words that might be names.
This is a common problem in basic cryptography courses (my first programming job).
If you generated a word histogram of your entire document corpus (each bin is a word on the x-axis, and its height on the y-axis is its frequency), words like "this", "the", "and" and so forth would be easy to identify because of their large y-values (frequency). Surnames should be at the far right of your histogram, i.e. very infrequent; given names towards the left, but not by much.
Does this technique definitively identify the names in each document? No, but it could be used to substantially constrain your search by eliminating all words whose frequency is larger than X. Likewise, there should be other attributes that constrain your search, such as the fact that author names appear only once in the documents they authored and not in any other documents.
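A rough sketch of that histogram idea (the frequency cut-off is an arbitrary example):

from collections import Counter

def rare_words(corpus_texts, max_corpus_freq=2):
    """Words that occur at most max_corpus_freq times across the whole
    corpus; candidate surnames live in this low-frequency tail."""
    histogram = Counter(word for text in corpus_texts
                        for word in text.lower().split())
    return {word for word, freq in histogram.items() if freq <= max_corpus_freq}

corpus = ["report by alice jones on network security",
          "network security basics", "security checklist"]
print(rare_words(corpus))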

Algorithms to detect phrases and keywords from text

I have around 100 megabytes of text, without any markup, divided into approximately 10,000 entries. I would like to automatically generate a 'tag' list. The problem is that there are word groups (i.e. phrases) that only make sense when they are grouped together.
If I just count the words, I get a large number of really common words (is, the, for, in, am, etc.). I have counted the words and the number of other words that appear before and after them, but now I really cannot figure out what to do next. The information relating to the 2- and 3-word phrases is present, but how do I extract this data?
Before anything, try to preserve the info about "boundaries" that comes with the input text.
(If such info has not already been lost; your question implies that maybe the tokenization has already been done.)
During the tokenization (word parsing, in this case) process, look for patterns that may define expression boundaries (such as punctuation, particularly periods, and also multiple LF/CR separators), and use these. Words like "the" can often be used as boundaries too. Such expression boundaries are typically "negative", in the sense that they separate two token instances which are sure not to be included in the same expression. A few positive boundaries are quotes, particularly double quotes. This type of info may be useful to filter out some of the n-grams (see next paragraph). Word sequences such as "for example" or "in lieu of" or "need to" can be used as expression boundaries as well (but using such info is edging on using "priors", which I discuss later).
Without using external data (other than the input text), you can have relative success with this by running statistics on the text's digrams and trigrams (sequences of 2 and 3 consecutive words). Then [most of] the sequences with a significant number of instances will likely be the type of "expressions/phrases" you are looking for.
This somewhat crude method will yield a few false positives, but on the whole may be workable. Having filtered out the n-grams known to cross "boundaries", as hinted in the first paragraph, may help significantly, because in natural languages sentence endings and sentence starts tend to draw from a limited subset of the message space and hence produce combinations of tokens that may appear to be statistically well represented but which are typically not semantically related.
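A small sketch of such digram/trigram statistics with a crude boundary filter (the boundary set and the word-parsing regex are my own simplifications):

import re
from collections import Counter

BOUNDARY = re.compile(r"[.!?;:\n]")    # rough "negative boundary" markers

def ngram_counts(text, n):
    counts = Counter()
    # Split on boundaries first so that n-grams never cross them.
    for segment in BOUNDARY.split(text.lower()):
        words = re.findall(r"[a-z']+", segment)   # crude word parsing
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts

text = "for example, this works. for example, that works too."
print(ngram_counts(text, 2).most_common(5))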
Better methods (possibly more expensive, processing-wise and design/investment-wise) will make use of extra "priors" relevant to the domain and/or national languages of the input text.
POS (Part-Of-Speech) tagging is quite useful in several ways: it provides additional, more objective expression boundaries, and also "noise" word classes; for example, all articles, even when used in the context of entities, are typically of little value in tag clouds such as the ones the OP wants to produce.
Dictionaries, lexicons and the like can be quite useful too. In particular, those which identify "entities" (aka instances, in WordNet lingo) and their alternative forms. Entities are very important for tag clouds (though they are not the only class of words found in them), and by identifying them, it is also possible to normalize them (the many different expressions which can be used for, say, "Senator T. Kennedy"), hence eliminating duplicates but also increasing the frequency of the underlying entities.
If the corpus is structured as a document collection, it may be useful to use various tricks related to TF (Term Frequency) and IDF (Inverse Document Frequency).
[Sorry, gotta go for now (plus I would like more detail about your specific goals etc.). I'll try to provide more detail and pointers later]
[BTW, I want to plug here the Jonathan Feinberg and Dervin Thunk responses from this post, as they provide excellent pointers in terms of methods and tools for the kind of task at hand. In particular, NLTK and Python-at-large provide an excellent framework for experimenting]
I'd start with a wonderful chapter, by Peter Norvig, in the O'Reilly book Beautiful Data. He provides the ngram data you'll need, along with beautiful Python code (which may solve your problems as-is, or with some modification) on his personal web site.
It sounds like you're looking for collocation extraction. Manning and Schütze devote a chapter to the topic, explaining and evaluating the 'proposed formulas' mentioned in the Wikipedia article I linked to.
I can't fit the whole chapter into this response; hopefully some of their links will help. (NSP sounds particularly apposite.) nltk has a collocations module too, not mentioned by Manning and Schütze since their book predates it.
The other responses posted so far deal with statistical language processing and n-grams more generally; collocations are a specific subtopic.
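Using NLTK's collocations module mentioned above, a basic run looks roughly like this (the frequency filter value is an arbitrary example):

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = ("machine learning is fun and machine learning is useful "
        "and machine learning is everywhere")
tokens = text.split()

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)            # ignore bigrams seen fewer than 2 times
measures = BigramAssocMeasures()
print(finder.nbest(measures.pmi, 5))   # top candidate phrases by PMI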
Build a matrix of words. Then, whenever two words are consecutive, add one to the appropriate cell.
For example, for the sentence "for example you have this sentence":
mat['for']['example'] += 1
mat['example']['you'] += 1
mat['you']['have'] += 1
mat['have']['this'] += 1
mat['this']['sentence'] += 1
This will give you counts for pairs of consecutive words.
You can do this for three consecutive words as well. Beware, this requires O(n^3) memory.
You can also use a heap for storing the data like:
heap['for example'] += 1
heap['example you'] += 1
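In Python, the same consecutive-word matrix can be sketched with nested dictionaries:

from collections import Counter, defaultdict

def build_matrix(text):
    """mat[w1][w2] counts how often w2 directly follows w1."""
    mat = defaultdict(Counter)
    words = text.lower().split()
    for w1, w2 in zip(words, words[1:]):
        mat[w1][w2] += 1
    return mat

mat = build_matrix("for example you have this sentence")
print(mat["for"]["example"])           # 1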
One way would be to build yourself an automaton, most likely a Nondeterministic Finite Automaton (NFA).
Another, simpler way would be to create a file that contains the words and/or word groups that you want to ignore, find, compare, etc., store them in memory when the program starts, and then compare the file you are parsing against the words/word groups contained in that file.
