Keywords Ranking - ranking

I have a list of keywords taken from 95 documents. I'd like to rank their importance, but I have only the number of documents in which the keywords appear and the maximum frequency of a keyword among all the documents. I'm looking for a ranking formula that could help. At the moment I'm using IDF, but I'd like to know if there is any better formula.

Word-frequency work has already been done in the form of the Wiktionary frequency lists, which cover English (and many other languages) and offer several kinds of lists based on the most frequent words, including lists drawn from TV and movie subtitles and many others.
If you want to build an algorithm based on word ranking, I would suggest you don't stray far from TF-IDF; the Latent Semantic Indexing algorithm might also be an asset for you.
Hope that is what you needed.

TF-IDF is definitely a good base and easy to implement.
It is also really common to add other biases, such as the position of the terms inside your documents; a term occurring at the beginning of a document, or better, in its title, tends to be more relevant than one occurring in the middle or at the end.
But you have to keep in mind that choosing an algorithm and its biases also depends on the nature of your documents. For instance, long documents (e.g. research papers or books) would benefit from a position bias, but not necessarily news articles. The same goes for the IDF measure: it has to be computed on a large corpus of documents with content similar to your documents. You don't want a relevancy score computed on a "TV and Movies" corpus if, for instance, your documents are research papers about semiconductors.
My two cents.
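
Since the only data in the original question are a document frequency and a maximum per-document frequency for each keyword, here is a rough Python sketch of how a TF-IDF-style score could be built from just those two numbers. The sample keyword counts and the particular damping/smoothing choices are my own assumptions, not something prescribed by the answers above.

    import math

    N = 95  # total number of documents, as in the question

    # Hypothetical input: keyword -> (document frequency, max frequency in any one document)
    keywords = {
        "semiconductor": (4, 17),
        "wafer": (2, 9),
        "the": (95, 120),
    }

    def score(df, max_tf, n_docs=N):
        """TF-IDF-style score using the maximum per-document frequency as a tf proxy."""
        tf = 1 + math.log(1 + max_tf)                # sublinear tf, so huge counts don't dominate
        idf = math.log((n_docs + 1) / (df + 1)) + 1  # smoothed idf: rarer terms weigh more
        return tf * idf

    ranked = sorted(keywords, key=lambda k: score(*keywords[k]), reverse=True)
    for term in ranked:
        print(term, round(score(*keywords[term]), 3))

A position bias, as suggested above, could then be layered on top by multiplying the score of terms found in titles or opening paragraphs by a constant boost.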

Related

Finding keywords in a set of small texts

I have a set of almost 2000 texts.
My goal is to find the keywords across these texts to understand what they are about, or simply the most common words and expressions.
I would like some ideas for algorithms to score the words and to identify when they frequently come together.
I have read some other related questions here, but I'm trying to get more and more information about this subject. So any ideas are very welcome. Thank you so much!
--
I have already extracted stopwords. After removing them I have more than 7000 words remaining. My question is how to score these words, and at which point I can consider removing some of them from my list of keywords. Also, how do I get key expressions, i.e. find words that come together?
You may want to refer to a classical text on Information Retrieval. Most of the algorithms use a stop list to remove commonly occurring words such as "for" and "the", and then, extract the base or root word (change "seeing", "seen", "see", "sees" to the base word "see"). The remaining words form the keywords of the document and are weighted by things like term frequency (how many times the word occurs in the document) and inverse document frequency (how unique is the word in describing the content). You can use the weighted keywords as document representation and use them for retrieval.
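As a rough illustration of that pipeline (stop list, stemming to the base word, then tf-idf weighting), here is a short Python sketch; the documents and stop list are toy data, and it assumes the NLTK package is available for the Porter stemmer.

    import math
    from collections import Counter
    from nltk.stem import PorterStemmer  # assumes NLTK is installed

    docs = [
        "Seeing the sees and the seen: stemming maps them toward a base form.",
        "The retrieval system weights keywords by term frequency.",
    ]
    stop_words = {"the", "and", "a", "by", "them", "toward"}  # toy stop list
    stemmer = PorterStemmer()

    def keywords(text):
        tokens = [w.strip(".,:;").lower() for w in text.split()]
        return [stemmer.stem(w) for w in tokens if w and w not in stop_words]

    # Term frequency per document and document frequency across the corpus.
    tf = [Counter(keywords(d)) for d in docs]
    df = Counter(term for counts in tf for term in counts)
    n_docs = len(docs)

    # tf-idf weighted representation of each document.
    for i, counts in enumerate(tf):
        weights = {t: c * math.log(n_docs / df[t]) for t, c in counts.items()}
        print(i, sorted(weights.items(), key=lambda kv: -kv[1])[:5])

The weighted terms left at the end are the keyword representation the answer describes, and can be reused directly for retrieval.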
You can use the Lucene MoreLikeThis implementation, which extracts a list of the most important keywords from a given text document. The term scoring function it uses is tf-idf: it chooses the terms with the highest tf-idf scores, i.e. the terms which are relatively uncommon in the corpus yet occur frequently in the document.
If efficiency is an issue, it employs some common heuristics as follows.
Since you're trying to maximize a tf*idf score, you're probably most interested in terms with a high tf. Choosing a tf threshold even as low as two or three will radically reduce the number of terms under consideration. Another heuristic is that terms with a high idf (i.e., a low df) tend to be longer. So you could threshold the terms by the number of characters, not selecting anything less than, e.g., six or seven characters. With these sorts of heuristics you can usually find a small set of, e.g., ten or fewer terms that do a pretty good job of characterizing a document.
More details can be found in this javadoc.
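Here is a hedged sketch of those heuristics in Python. It is my own simplification, not Lucene's actual MoreLikeThis code; the thresholds are the example values mentioned above and the document-frequency table is invented.

    import math
    from collections import Counter

    def top_keywords(doc_tokens, doc_freq, n_docs,
                     min_tf=2, min_word_len=6, max_terms=10):
        """Pick the highest tf*idf terms, pruned by the cheap heuristics above."""
        tf = Counter(doc_tokens)
        scored = []
        for term, freq in tf.items():
            if freq < min_tf or len(term) < min_word_len:
                continue  # heuristic pruning before any idf lookup
            idf = math.log(n_docs / (1 + doc_freq.get(term, 0)))
            scored.append((freq * idf, term))
        return [t for _, t in sorted(scored, reverse=True)[:max_terms]]

    # Toy usage: doc_freq would come from the index; the values here are invented.
    tokens = "semantic indexing semantic retrieval ranking ranking ranking model".split()
    print(top_keywords(tokens, doc_freq={"ranking": 3, "semantic": 1}, n_docs=100))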

How to deal with very uncommon terms in tf-idf?

I'm implementing a naive "keyword extraction algorithm". I'm self-taught though so I lack some terminology and maths common in the online literature.
I'm finding "most relevant keywords" of a document thus:
I count how often each term is used in the current document. Let's call this tf.
I look up how often each of those terms is used in the entire database of documents. Let's call this df.
I calculate a relevance weight r for each term by r = tf / df.
Each document is a proper subset of the corpus so no document contains a term not in the corpus. This means I don't have to worry about division by zero.
I sort all terms by their r and keep however many of the top terms. These are the top keywords most closely associated with this document. Terms that are common in this document are more important. Terms that are common in the entire database of documents are less important.
I believe this is a naive form of tf-idf.
The problem is that when terms are very uncommon in the entire database but occur in the current document they seem to have too high an r value.
This can be thought of as some kind of artefact due to small sample size. What is the best way or the usual ways to compensate for this?
Should I throw away terms less common in the overall database than a certain threshold? If so, how is that threshold calculated? It seems it would depend on too many factors to be a hard-coded value.
Can it be weighted or smoothed by some kind of mathematical function such as inverse square or cosine?
I've tried searching the web and reading up on tf-idf but much of what I find deals with comparing documents, which I'm not interested in. Plus most of them have a low ratio of explanation vs. jargon and formulae.
(In fact my project is a generalization of this problem. I'm really working with tags on Stack Exchange sites so the total number of terms is small, stopwords are irrelevant, and low-usage tags might be more common than low-usage words in the standard case.)
I spent a lot of time trying to do targeted Google searches for particular tf-idf information and dug through many documents.
Finally I found a document with a clear and concise explanation accompanied by formulae even I can grok: Document Processing and the Semantic Web, Week 3 Lecture 1: Ranking for Information Retrieval by Robert Dale of the Department of Computing at Macquarie University.
On page 20 I found the two things I was missing: taking into account the number of documents in the collection, and using the logarithm of the inverse df rather than using the inverse df directly.
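
In code, the difference between the naive ratio and the log-of-inverse-df form from those notes is small, but the effect on very rare terms is large; a quick Python sketch with invented counts:

    import math

    N = 10000          # documents in the collection
    tf = 3             # occurrences of the term in the current document
    for df in (1, 10, 1000):              # how many documents contain the term
        r_naive = tf / df                 # the original r = tf / df
        r_tfidf = tf * math.log(N / df)   # tf * log(inverse document frequency)
        print(f"df={df:5d}  naive={r_naive:8.3f}  tf-idf={r_tfidf:8.3f}")

The logarithm keeps a term seen in only one other document from outscoring everything else by orders of magnitude, which is exactly the artefact described in the question.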

N-gram text categorization category size difference compensation

Lately I've been mucking about with text categorization and language classification based on Cavnar and Trenkle's article "N-Gram-Based Text Categorization" as well as other related sources.
For doing language classification I've found this method to be very reliable and useful. The size of the documents used to generate the N-gram frequency profiles is fairly unimportant as long as they are "long enough", since I'm just using the n most common N-grams from the documents.
On the other hand, well-functioning text categorization eludes me. I've tried both my own implementations of various variations of the algorithm, with and without tweaks such as idf weighting, and other people's implementations. It works quite well as long as I can generate somewhat similarly-sized frequency profiles for the category reference documents, but the moment they start to differ just a bit too much the whole thing falls apart and the category with the shortest profile ends up getting a disproportionate number of documents assigned to it.
Now, my question is: what is the preferred method of compensating for this effect? It's obviously happening because the algorithm assumes a maximum distance for any given N-gram that equals the length of the category frequency profile, but for some reason I just can't wrap my head around how to fix it. One reason I'm interested in a fix is that I'm trying to automate the generation of category profiles based on documents of known category which can vary in length (and even if they are the same length the profiles may end up being different lengths). Is there a "best practice" solution to this?
If you are still interested, and assuming I understand your question correctly, the answer to your problem would be to normalise your n-gram frequencies.
The simplest way to do this, on a per document basis, is to count the total frequency of all n-grams in your document and divide each individual n-gram frequency by that number. The result is that every n-gram frequency weighting now relates to a percentage of the total document content, regardless of the overall length.
Using these percentages in your distance metrics will discount the size of the documents and instead focus on the actual make up of their content.
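
A minimal Python sketch of that per-document normalisation, using character trigrams in the spirit of Cavnar & Trenkle (the documents are toy examples):

    from collections import Counter

    def ngrams(text, n=3):
        """Character n-grams of the padded, lower-cased text."""
        text = " " + text.lower() + " "
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    def normalised_profile(text, n=3):
        counts = Counter(ngrams(text, n))
        total = sum(counts.values())
        # Each n-gram weight becomes a fraction of the whole document,
        # so document length no longer influences the profile.
        return {g: c / total for g, c in counts.items()}

    short = normalised_profile("the cat sat")
    long_ = normalised_profile("the cat sat on the mat while the other cat slept")
    print(sum(short.values()), sum(long_.values()))  # both sum to 1.0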
It might also be worth noting that the n-gram representation only makes up a very small part of an entire categorisation solution. You might also consider using dimensionality reduction, different index weighting metrics and, obviously, different classification algorithms.
See here for an example of n-gram use in text classification
As I understand it, the task is to compute the probability that a language model M generates a given text.
Recently I was working on measuring the readability of texts using semantic, syntactic and lexical properties; it can also be measured with a language-model approach.
To answer properly you should consider these questions:
Are you using a log-likelihood approach?
What level of N-grams are you using: unigrams, bigrams or higher?
How big are the language corpora that you use?
Using only unigrams and bigrams I managed to classify some documents with good results. If your classification is weak, consider building bigger language corpora or using lower-order n-grams.
Also remember that classifying a text into the wrong category may simply be a consequence of its length (by chance, a few of its words may also appear in another language model).
In short, consider making your language corpora bigger, and be aware that analysing short texts carries a higher probability of misclassification.
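
To make the log-likelihood idea concrete, here is a small unigram language-model scorer in Python with add-one smoothing. The training texts are toy data and this is only a sketch of the approach, not the method from any particular paper.

    import math
    from collections import Counter

    def train_unigram(text):
        counts = Counter(text.lower().split())
        total = sum(counts.values())
        vocab = len(counts)
        # Add-one smoothing so unseen words do not get probability zero.
        return lambda w: (counts[w] + 1) / (total + vocab + 1)

    def log_likelihood(model, text):
        return sum(math.log(model(w)) for w in text.lower().split())

    models = {
        "english": train_unigram("the cat sat on the mat and the dog slept"),
        "pseudo":  train_unigram("das der die und katze hund schlief auf"),
    }
    text = "the dog and the cat"
    best = max(models, key=lambda name: log_likelihood(models[name], text))
    print(best)  # the model under which the text is most probable

The same scoring structure extends to bigrams; the shorter the text, the less evidence each model sees, which is why short texts are misclassified more often.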

How does Amazon's Statistically Improbable Phrases work?

How does something like Statistically Improbable Phrases work?
According to amazon:
Amazon.com's Statistically Improbable Phrases, or "SIPs", are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book.
SIPs are not necessarily improbable within a particular book, but they are improbable relative to all books in Search Inside!. For example, most SIPs for a book on taxes are tax related. But because we display SIPs in order of their improbability score, the first SIPs will be on tax topics that this book mentions more often than other tax books. For works of fiction, SIPs tend to be distinctive word combinations that often hint at important plot elements.
For instance, for Joel's first book, the SIPs are: leaky abstractions, antialiased text, own dog food, bug count, daily builds, bug database, software schedules
One interesting complication is that these are phrases of either 2 or 3 words. This makes things a little more interesting because these phrases can overlap with or contain each other.
It's a lot like the way Lucene ranks documents for a given search query. They use a metric called TF-IDF, where TF is term frequency and IDF is inverse document frequency. The former ranks a document higher the more often the query terms appear in that document, and the latter ranks a document higher if it has terms from the query that appear infrequently across all documents. The specific way they calculate it is log(number of documents / number of documents with the term), i.e. the logarithm of the inverse of the fraction of documents in which the term appears.
So in your example, those phrases are SIPs relative to Joel's book because they are rare phrases (appearing in few books) and they appear multiple times in his book.
Edit: in response to the question about 2-grams and 3-grams, overlap doesn't matter. Consider the sentence "my two dogs are brown". Here, the list of 2-grams is ["my two", "two dogs", "dogs are", "are brown"], and the list of 3-grams is ["my two dogs", "two dogs are", "dogs are brown"]. As I mentioned in my comment, with overlap you get N-1 2-grams and N-2 3-grams for a stream of N words. Because 2-grams can only equal other 2-grams and likewise for 3-grams, you can handle each of these cases separately. When processing 2-grams, every "word" will be a 2-gram, etc.
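A tiny Python sketch of that extraction, using the sentence from the answer; it produces the N-1 2-grams and N-2 3-grams described above:

    def word_ngrams(words, n):
        """All contiguous word n-grams of a token list: len(words) - n + 1 of them."""
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    tokens = "my two dogs are brown".split()
    print(word_ngrams(tokens, 2))  # ['my two', 'two dogs', 'dogs are', 'are brown']
    print(word_ngrams(tokens, 3))  # ['my two dogs', 'two dogs are', 'dogs are brown']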
They are probably using a variation on the tf-idf weight, detecting phrases that occur a high number of times in the specific book but few times in the whole corpus minus the specific book. Repeat for each book.
Thus 'improbability' is relative to the whole corpus and could be understood as 'uniqueness', or 'what makes a book unique compared to the rest of the library'.
Of course, I'm just guessing.
LingPipe has a tutorial on how to do this, and they link to references. They don't discuss the math behind it, but their source code is open, so you can look at it yourself.
I can't say I know what Amazon does, because they probably keep it a secret (or at least they just haven't bothered to tell anyone).
Sorry for reviving an old thread, but I ended up here with the same question and found that there is some newer work which might add to this great thread.
I feel SIPs are more unique to a document than just words with high TF-IDF scores. For example, in a document about Harry Potter, terms like Hermione Granger and Hogwarts tend to be better SIPs, whereas terms like magic and London aren't. TF-IDF is not great at making this distinction.
I came across an interesting definition of SIPs here. In this work, the phrases are modelled as n-grams and their probability of occurrence in a document is computed to identify their uniqueness.
As a starting point, I'd look at Markov Chains.
One option:
build a text corpus from the full index.
build a text corpus from just the one book.
for every m to n word phrase, find the probability that each corpus would generate it.
select the N phrases with the highest ratio of probabilities.
An interesting extension would be to run a Markov Chain generator where your weights table is a magnification of the difference between the global and local corpus. This would generate a "caricature" (literally) of the author's stylistic idiosyncrasies.
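Here is a very rough Python sketch of the ratio-of-probabilities option listed above, treating each corpus as a simple phrase-frequency model with add-one smoothing. The phrase counts are invented, and a real system would use properly smoothed n-gram language models over all m-to-n word phrases.

    from collections import Counter

    def phrase_probability(phrase, counts, total):
        # Add-one smoothing so phrases absent from one corpus still get a probability.
        return (counts[phrase] + 1) / (total + 1)

    def improbable_phrases(book_counts, corpus_counts, top_n=5):
        book_total = sum(book_counts.values())
        corpus_total = sum(corpus_counts.values())
        ratios = {
            p: phrase_probability(p, book_counts, book_total) /
               phrase_probability(p, corpus_counts, corpus_total)
            for p in book_counts
        }
        return sorted(ratios, key=ratios.get, reverse=True)[:top_n]

    # Toy counts of 2-/3-word phrases in one book versus the whole library.
    book = Counter({"leaky abstractions": 12, "bug count": 9, "new york": 4})
    library = Counter({"leaky abstractions": 15, "bug count": 20, "new york": 90000})
    print(improbable_phrases(book, library))

Phrases that are frequent in the book but rare in the library get the highest ratios, which matches the "improbability score" ordering Amazon describes.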
I am fairly sure it's the combination of SIPs that identifies the book as unique. In your example it is very rare, almost impossible, that another book contains both "leaky abstractions" and "own dog food".
I am, however, making an assumption here, as I do not know for sure.

how to get the similar texts from a lot of pages?

Get the x texts most similar to one given text out of a large collection of texts.
Maybe converting each page to plain text first is better.
You should not compare the text to every other text, because that is too slow.
The ability to identify similar documents/pages, whether web pages or more general forms of text or even code, has many practical applications. This topic is well represented in scholarly papers and also in less specialized forums. In spite of this relative wealth of documentation, it can be difficult to find the information and techniques relevant to a particular case.
By describing the specific problem at hand and associated requirements, it may be possible to provide you more guidance. In the meantime the following provides a few general ideas.
Many different functions may be used to measure, in some fashion, the similarity of pages. Selecting one (or possibly several) of these functions depends on various factors, including the amount of time and/or space one can allot to the problem and also the level of tolerance desired for noise.
Some of the simpler metrics are:
length of the longest common sequence of words
number of common words
number of common sequences of words of more than n words
number of common words for the top n most frequent words within each document.
length of the document
Some of the metrics above work better when normalized (for example to avoid favoring long pages which, through their sheer size, have more chances of sharing words with other pages).
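For instance, the "number of common words" metric with that normalisation is essentially the Jaccard coefficient on the two word sets; a quick Python sketch with made-up strings:

    def jaccard_similarity(text_a, text_b):
        """Number of common words, normalised by the size of the combined vocabulary."""
        a, b = set(text_a.lower().split()), set(text_b.lower().split())
        return len(a & b) / len(a | b) if a | b else 0.0

    print(jaccard_similarity("the quick brown fox", "the lazy brown dog"))  # 0.333...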
More complicated and/or computationally expensive measurements are:
Edit distance (which is in fact a generic term as there are many ways to measure the Edit distance. In general, the idea is to measure how many [editing] operations it would take to convert one text to the other.)
Algorithms derived from the Ratcliff/Obershelp algorithm (but counting words rather than letters); see the sketch after this list
Linear algebra-based measurements
Statistical methods such as Bayesian filters
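Here is the word-level Ratcliff/Obershelp idea sketched with Python's difflib, whose SequenceMatcher uses a Ratcliff/Obershelp-style matching algorithm; the example strings are made up:

    from difflib import SequenceMatcher

    def word_level_similarity(text_a, text_b):
        """Ratcliff/Obershelp-style similarity computed over words instead of characters."""
        return SequenceMatcher(None, text_a.lower().split(), text_b.lower().split()).ratio()

    print(word_level_similarity("the cat sat on the mat",
                                "the cat sat quietly on a mat"))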
In general, we can distinguish measurements/algorithms where most of the calculation can be done once per document, followed by an extra pass aimed at comparing or combining these measurements (with relatively little extra computation), from algorithms that require the documents to be compared in pairs.
Before choosing one (or indeed several) such measures, along with some weighting coefficients, it is important to consider additional factors beyond the similarity measurement per se. For example, it may be beneficial to...
normalize the text in some fashion (in the case of web pages, in particular, similar page contents, or similar paragraphs are made to look less similar because of all the "decorum" associated with the page: headers, footers, advertisement panels, different markup etc.)
exploit markup (e.g. giving more weight to similarities found in the title or in tables than to similarities found in plain text).
identify and eliminate domain-related (or even generally known) expressions. For example, two completely different documents may appear similar if they have in common two "boilerplate" paragraphs pertaining to some legal disclaimer or some general-purpose description, not truly associated with the essence of each document's content.
Tokenize the texts, remove stop words and arrange each text into a term vector. Calculate tf-idf. Arrange all vectors in a matrix and calculate distances between them to find similar docs, using for example the Jaccard index.
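A compact, dependency-free Python sketch of that pipeline; note that I compute cosine similarity on the tf-idf vectors here, whereas the Jaccard index mentioned above would be applied to the token sets instead. The documents and stop list are toy data.

    import math
    from collections import Counter

    stop_words = {"the", "a", "of", "and", "to"}
    docs = [
        "the ranking of keywords in a document",
        "keywords and ranking formulas for documents",
        "a recipe for tomato soup",
    ]

    def tfidf_vectors(texts):
        tokens = [[w for w in t.lower().split() if w not in stop_words] for t in texts]
        df = Counter(term for doc in tokens for term in set(doc))
        n = len(texts)
        return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
                for doc in tokens]

    def cosine(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        norm = (math.sqrt(sum(x * x for x in u.values())) *
                math.sqrt(sum(x * x for x in v.values())))
        return dot / norm if norm else 0.0

    vectors = tfidf_vectors(docs)
    # Similarity of every document to the first one.
    print([round(cosine(vectors[0], v), 3) for v in vectors])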
All depends on what you mean by "similar". If you mean "about the same subject", looking for matching N-grams usually works pretty well. For example, just make a map from trigrams to the text that contains them, and put all trigrams from all of your texts into that map. Then when you get your text to be matched, look up all its trigrams in your map and pick the most frequent texts that come back (perhaps with some normalization by length).
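A rough sketch of that map-based approach, using word trigrams and toy texts:

    from collections import Counter, defaultdict

    def word_trigrams(text):
        words = text.lower().split()
        return {" ".join(words[i:i + 3]) for i in range(len(words) - 2)}

    texts = [
        "the cat sat on the mat",
        "a cat sat on a sofa",
        "stock prices fell sharply today",
    ]

    # Map each trigram to the texts containing it.
    index = defaultdict(set)
    for i, t in enumerate(texts):
        for g in word_trigrams(t):
            index[g].add(i)

    def most_similar(query, top_n=2):
        votes = Counter(i for g in word_trigrams(query) for i in index.get(g, ()))
        return votes.most_common(top_n)

    print(most_similar("the cat sat on the rug"))  # texts sharing the most trigrams

As noted above, the vote counts could be normalised by the number of trigrams in each stored text so that long texts do not win by sheer size.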
I don't know what you mean by similar, but perhaps you ought to load your texts into a search system like Lucene and pose your 'one text' to it as a query. Lucene does pre-index the texts so it can quickly find the most similar ones (by its lights) at query-time, as you asked.
You will have to define a function to measure the "difference" between two pages. I can imagine a variety of such functions, one of which you have to choose for your domain:
Difference of Keyword Sets - You can prune the document of the most common words in the dictionary and end up with a list of unique keywords per document. The difference function would then calculate the difference between the sets of keywords per document.
Difference of Text - Calculate each distance based upon the number of edits it takes to turn one doc into another using a text diffing algorithm (see Text Difference Algorithm).
Once you have a difference function, simply calculate the difference of your current doc with every other doc, then return the other doc that is closest.
If you need to do this a lot and you have a lot of documents, then the problem becomes a bit more difficult.
