How do I data mine text? - sorting

Here's the problem. I have a bunch of large text files with paragraphs and paragraphs of written matter. Each para contains references to a few people (names), and documents a few topics (places, objects).
How do I data mine this pile to assemble some categorised library? ... in general, 2 things.
I don't know what I'm looking for, so I need a program to get the most used words/multiple words ("Jacob Smith" or "bluewater inn" or "arrow").
Then knowing the keywords, I need a program to help me search for related paras, then sort and refine results (manually by hand).

Your question is a tiny bit open-ended :)
Chances are, you will find modules for whatever analysis you want to do in the UIMA framework:
Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.
UIMA is made of many things
UIMA enables applications to be decomposed into components, for example "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names etc.)". Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.
You may also find Open Calais a useful API for text analysis; depending on how big your heap of documents is, it may be more or less appropriate.
If you want it quick and dirty -- create an inverted index that stores all locations of words (basically a big map of words to all file ids in which they occur, paragraphs in those files, lines in the paragraphs, etc). Also index tuples so that given a fileid and paragraph you can look up all the neighbors. This will do what you describe, but it takes quite a bit of tweaking to get it to pull up meaningful correlations (some keywords to start you off on your search: information retrieval, TF-IDF, Pearson correlation coefficient).

Looks like you're trying to create an index?
I think Learning Perl has information on finding the frequency of words in a text file, so that's not a particularly hard problem.
But do you really want to know that "the" or "a" is the most common word?
If you're looking for some kind of topical index, the words you actually care about are probably down the list a bit, intermixed with more words you don't care about.
You could start by getting rid of "stop words" at the front of the list to filter your results a bit, but nothing would beat associating keywords that actually reflect the topic of the paragraphs, and that requires context.
Anyway, I could be off base, but there you go. ;)

The problem with what you ask is that you don't know what you're looking for. If you had some sort of weighted list of terms that you cared about, then you'd be in good shape.
Semantically, the problem is twofold:
Generally the most-used words are the least relevant. Even if you use a stop-words file, a lot of chaff remains
Generally, the least-used words are the most relevant. For example, "bluewater inn" is probably infrequent.
Let's suppose that you had something that did what you ask, and produced a clean list of all the keywords that appear in your texts. There would be thousands of such keywords. Finding "bluewater inn" in a list of 1000s of terms is actually harder than finding it in the paragraph (assuming you don't know what you're looking for) because you can skim the texts and you'll find the paragraph that contains "bluewater inn" because of its context, but you can't find it in a list because the list has no context.
Why don't you talk more about your application and process and then perhaps we can help you better??

I think what you want to do is called "entity extraction". This Wikipedia article has a good overview and a list of apps, including open source ones. I used to work on one of the commercial tools in the list, but not in a programming capacity, so I can't help you there.

Ned Batchelder gave a great talk at DevDays Boston about Python.
He presented a spell-corrector written in Python that does pretty much exactly what you want.
You can find the slides and source code here:
http://nedbatchelder.com/text/devdays.html

I recommend that you have a look at R. In particular, look at the tm package. Here are some relevant links:
Paper about the package in the Journal of Statistical Computing: http://www.jstatsoft.org/v25/i05/paper. The paper includes a nice example of an analysis of the R-devel
mailing list (https://stat.ethz.ch/pipermail/r-devel/) newsgroup postings from 2006.
Package homepage: http://cran.r-project.org/web/packages/tm/index.html
Look at the introductory vignette: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
More generally, there are a large number of text mining packages on the Natural Language Processing view on CRAN.

Related

Fuzzy logic text transformation methodologies?

I have a large set of data (several hundred thousand records) that are unique entries in a CSV. These entries are essentially products that are being listed in a store from a vendor that offers these products. The problem is that while they offer us rights to copy these verbatim or to change wording, I don't want to list them verbatim obviously since Google will slap the ranking for having "duplicate" content. And then, also obviously, manually editing 500,000 items would take a ridiculous amount of time.
The solution, it would seem, would be to leverage fuzzy logic that would take certain phraseology and transform it to something different that would not then be penalized by Google. I have hitherto been unable to find any real library to address this or a solid solution that addresses such a situation.
I am thinking through my own algorithms to perhaps accomplish this, but I hate to reinvent the wheel or, worse, be beaten down by the big G after a failed attempt.
My idea is to simply search for various phrases and words (sans stop words) and then essentially map those to phrases and words that can be randomly inserted that still have equivalent meaning, but enough substance to hopefully not cause a deranking situation.
A solution for Ruby would be optimal, but absolutely not necessary as any language can be used.
Are there any existing algorithms, theories or implementations of a similar scenario that could be used to model or solve such a scenario?

What's the best way to generate keywords from a given Text?

I want to generate Keywords for my CMS.
Does someone know a good PHP Script (or something else) which generates keywords?
I have a HTML Site like this: http://pastebin.com/ZU8vdyeP
This is a very hard problem for a computer to solve. It would be much easier to get somebody (else?) to do it manually, or simply not do it at all.
If you'd really need a computer to do it, I'd head over to the excellent Python library NLTK which has many tools for this sort of thing (=natural language processing), and it's a lot of fun to work with.
For example, you could calculate a frequency distribution of the words, and then search for the most common hypernyms of larger (above say 5 char) words that appear most frequently and use that as a hint of what the keywords could be.
Again, it is much easier to get it done by a human, however.
to automate, get the words from the article, match them against a blacklist and dont include words under 4 chars.
Additionally, Let user manually edit. So only automate if no present keywords.
This can be done by trigger or application layer.
regards,
/t
If I understand the problem, you have text and you want to determine keywords that are most relevant to the text.
Three approaches:
1) Have user enter keywords
2) Statistical analysis of text, for example determine the words that are far more common in the text than they are in the language overall. Any good text on Information Retrieval will have some algorithms.
3) If you have a set of documents that are already classified (perhaps previously classified by humans) then you can use a machine learning algorithm (perhaps a Bayesian classifier) to train the system to classify the new documents. If you let the users override/correct the suggested keywords, the system can learn over time.
Personally, I'd do #3, since it is more adaptive.

Algorithms to detect phrases and keywords from text

I have around 100 megabytes of text, without any markup, divided to approximately 10,000 entries. I would like to automatically generate a 'tag' list. The problem is that there are word groups (i.e. phrases) that only make sense when they are grouped together.
If I just count the words, I get a large number of really common words (is, the, for, in, am, etc.). I have counted the words and the number of other words that are before and after it, but now I really cannot figure out what to do next The information relating to the 2 and 3 word phrases is present, but how do I extract this data?
Before anything, try to preserve the info about "boundaries" which comes in the input text.
(if such info has not readily be lost, your question implies that maybe the tokenization has readily been done)
During the tokenization (word parsing, in this case) process, look for patterns that may define expression boundaries (such as punctuation, particularly periods, and also multiple LF/CR separation, use these. Also words like "the", can often be used as boundaries. Such expression boundaries are typically "negative", in a sense that they separate two token instances which are sure to not be included in the same expression. A few positive boundaries are quotes, particularly double quotes. This type of info may be useful to filter-out some of the n-grams (see next paragraph). Also word sequencces such as "for example" or "in lieu of" or "need to" can be used as expression boundaries as well (but using such info is edging on using "priors" which I discuss later).
Without using external data (other than the input text), you can have a relative success with this by running statistics on the text's digrams and trigrams (sequence of 2 and 3 consecutive words). Then [most] the sequences with a significant (*) number of instances will likely be the type of "expression/phrases" you are looking for.
This somewhat crude method will yield a few false positive, but on the whole may be workable. Having filtered the n-grams known to cross "boundaries" as hinted in the first paragraph, may help significantly because in natural languages sentence ending and sentence starts tend to draw from a limited subset of the message space and hence produce combinations of token that may appear to be statistically well represented, but which are typically not semantically related.
Better methods (possibly more expensive, processing-wise, and design/investment-wise), will make the use of extra "priors" relevant to the domain and/or national languages of the input text.
POS (Part-Of-Speech) tagging is quite useful, in several ways (provides additional, more objective expression boundaries, and also "noise" words classes, for example all articles, even when used in the context of entities are typically of little in tag clouds such that the OP wants to produce.
Dictionaries, lexicons and the like can be quite useful too. In particular, these which identify "entities" (aka instances in WordNet lingo) and their alternative forms. Entities are very important for tag clouds (though they are not the only class of words found in them), and by identifying them, it is also possible to normalize them (the many different expressions which can be used for say,"Senator T. Kennedy"), hence eliminate duplicates, but also increase the frequency of the underlying entities.
if the corpus is structured as a document collection, it may be useful to use various tricks related to the TF (Term Frequency) and IDF (Inverse Document Frequency)
[Sorry, gotta go, for now (plus would like more detail from your specific goals etc.). I'll try and provide more detail and pointes later]
[BTW, I want to plug here Jonathan Feinberg and Dervin Thunk responses from this post, as they provide excellent pointers, in terms of methods and tools for the kind of task at hand. In particular, NTLK and Python-at-large provide an excellent framework for experimenting]
I'd start with a wonderful chapter, by Peter Norvig, in the O'Reilly book Beautiful Data. He provides the ngram data you'll need, along with beautiful Python code (which may solve your problems as-is, or with some modification) on his personal web site.
It sounds like you're looking for collocation extraction. Manning and Schütze devote a chapter to the topic, explaining and evaluating the 'proposed formulas' mentioned in the Wikipedia article I linked to.
I can't fit the whole chapter into this response; hopefully some of their links will help. (NSP sounds particularly apposite.) nltk has a collocations module too, not mentioned by Manning and Schütze since their book predates it.
The other responses posted so far deal with statistical language processing and n-grams more generally; collocations are a specific subtopic.
Do a matrix for words. Then if there are two consecutive words then add one to that appropriate cell.
For example you have this sentence.
mat['for']['example'] ++;
mat['example']['you'] ++;
mat['you']['have'] ++;
mat['have']['this'] ++;
mat['this']['sentence'] ++;
This will give you values for two consecutive words.
You can do this word three words also. Beware this requires O(n^3) memory.
You can also use a heap for storing the data like:
heap['for example']++;
heap['example you']++;
One way would be to build yourself an automaton. most likely a Nondeterministic Finite Automaton(NFA).
NFA
Another more simple way would be to create a file that has contains the words and/or word groups that you want to ignore, find, compare, etc. and store them in memory when the program starts and then you can compare the file you are parsing with the word/word groups that are contained in the file.

Building or Finding a "relevant terms" suggestion feature

Given a few words of input, I want to have a utility that will return a diverse set of relevant terms, phrases, or concepts. A caveat is that it would need to have a large graph of terms to begin with, or else the feature would not be very useful.
For example, submitting "baseball" would return
["shortstop", "Babe Ruth", "foul ball", "steroids", ... ]
Google Sets is the best example I can find of this kind of feature, but I can't use it since they have no public API (and I wont go against their TOS). Also, single-word input doesn't garner a very diverse set of results. I'm looking for a solution that goes off on tangents.
The closest I've experimented with is using WikiPedia's API to search Categories and Backlinks, but there's no way to directly sort those results by "relevance" or "popularity". Without that, the suggestion list is massive and all over the place, which is not immediately useful and very hard to whittle down.
Using A Thesaurus could also work minimally, but that would leave out any proper nouns or tangentially relevant terms (like any of the results listed above).
I would happily reuse an open service, if one exists, but I haven't found anything sufficient.
I'm looking for either a way to implement this either in-house with a decently-populated starting set, or reuse a free service that offers this.
Have a solution? Thanks ahead of time!
UPDATE: Thank you for the incredibly dense & informative answers. I'll choose a winning answer in 6 to 12 months, when I'll hopefully understand what you've all suggested =)
You might be interested in WordNet. It takes a bit of linguistic knowledge to understand the API, but basically the system is a database of meaning-based links between English words, which is more or less what you're searching for. I'm sure I can dig up more information if you want it.
Peter Norvig (director of research at Google) spoke about how they do this at Google (specifically mentioning Google Sets) in a Facebook Tech Talk. The idea is that a relatively simple algorithm on a huge dataset (e.g. the entire web) is much better than a complicated algorithm on a small data set.
You could look at Google's n-gram collection as a starting point. You'd start to see what concepts are grouped together. Norvig hinted that internally Google has up to 7-grams for use in things like Google Translate.
If you're more ambitious, you could download all of Wikipedia's articles in the language you desire and create your own n-gram database.
The problem is even more complicated if you just have a single word; check out this recent thesis for more details on word sense disambiguation.
It's not an easy problem, but it is useful as you mentioned. In the end, I think you'll find that a really successful implementation will have a relatively simple algorithm and a whole lot of data.
Take a look at the following two papers:
Clustering User Queries of a Search Engine [pdf]
Topic Detection by Clustering Keywords [pdf]
Here is my attempt at a very simplified explanation:
If we have a database of past user queries, we can define a similarity function between two queries. For example: number of words in common. Now for each query in our database, we compute its similarity with each other query, and remember the k most similar queries. The non-overlapping words from these can be returned as "related terms".
We can also take this approach with a database of documents containing information users might be searching for. We can define the similarity between two search terms as the number of documents containing both divided by the number of documents containing either. To decide which terms to test, we can scan the documents and throw out words that are either too common ('and', 'the', etc.) or that are too obscure.
If our data permits, then we could see which queries led users to choosing which results, instead of comparing documents by content. For example if we had data that showed us that users searching for "Celtics" and "Lakers" both ended up clicking on espn.com, then we could call these related terms.
If you're starting from scratch with no data about past user queries, then you can try Wikipedia, or the Bag of Words dataset as a database of documents. If you are looking for a database of user search terms and results, and if you are feeling adventurous, then you can take a look at the AOL Search Data.

Is there an algorithm that extracts meaningful tags of english text

I would like to extract a reduced collection of "meaningful" tags (10 max) out of an english text of any size.
http://tagcrowd.com/ is quite interesting but the algorithm seems very basic (just word counting)
Is there any other existing algorithm to do this?
There are existing web services for this. Two Three examples:
Yahoo's Term Extraction API
Topicalizer
OpenCalais
When you subtract the human element (tagging), all that is left is frequency. "Ignore common English words" is the next best filter, since it deals with exclusion instead of inclusion. I tested a few sites, and it is very accurate. There really is no other way to derive "meaning", which is why the Semantic Web gets so much attention these days. It is a way to imply meaning with HTML... of course, that has a human element to it as well.
Basically, this is a text categorization problem/document classification problem. If you have access to a number of already tagged documents, you could analyze which (content) words trigger which tags, and then use this information for tagging new documents.
If you don't want to use a machine-learning approach and you still have a document collection, then you can use metrics like tf.idf to filter out interesting words.
Going one step further, you can use Wordnet to find synonyms and replace words by their synonym, if the frequency of the synonym is higher.
Manning & Schütze contains a lot more introduction on text categorization.
In text classification, this problem is known as dimensionality reduction. There are many useful algorithms in the literature on this subject.
You want to do the semantic analysis of a text.
Word frequency analysis is one of the easiest ways to do the semantic analysis. Unfortunately (and obviously) it is the least accurate one. It can be improved by using special dictionaries (like for synonims or forms of a word), "stop-lists" with common words, other texts (to find those "common" words and exclude them)...
As for other algorithms they could be based on:
Syntax analysis (like trying to find the main subject and/or verb in a sentence)
Format analysis (analyzing headers, bold text, italic... where applicable)
Reference analysis (if the text is in Internet, for example, then a reference can describe it in several words... used by some search engines)
BUT... you should understand that these algorithms are mereley heuristics for semantic analysis, not the strict algorithms of achieving the goal.
The problem of semantic analysis is one of the main problems in Artificial Intelligence/Machine Learning studies since the first computers appeared.
Perhaps "Term Frequency - Inverse Document Frequency" TF-IDF would be useful...
You can use this in two steps:
1 - Try topic modeling algorithms:
Latent Dirichlet Allocation
Latent word Embeddings
2 - After that you can select the most representative word of every topic as a tag

Resources