How to extract keywords from lots of documents? - hadoop

I have many documents, over ten thousand (maybe more). I'd like to extract some keywords from each document, say 5 keywords per document, using Hadoop. Each document may cover a unique topic. My current approach is to use Latent Dirichlet Allocation (LDA) as implemented in Mahout. However, since each document covers a different topic, the number of extracted topics should equal the number of documents, which is very large. Because LDA becomes very inefficient when the number of topics grows large, my workaround is to randomly split the documents into small groups of 100 documents each and then use Mahout LDA to extract 100 topics from each group. This works, but may not be very efficient, because each Hadoop run only processes a small set of documents. Does anyone have a better (more efficient) idea for this?
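To make the batching idea above concrete, here is a minimal sketch. It uses gensim in Python purely as a stand-in for Mahout's LDA (gensim, the batch size, and the variable names are assumptions for illustration, not part of the original setup):

```python
import random
from gensim import corpora, models

def keywords_per_batch(docs, batch_size=100, keywords_per_doc=5):
    """docs: list of token lists, one per document."""
    docs = list(docs)
    random.shuffle(docs)                      # random grouping, as described above
    results = []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        dictionary = corpora.Dictionary(batch)
        corpus = [dictionary.doc2bow(d) for d in batch]
        # One topic per document in the batch, mirroring the assumption that
        # every document covers its own topic.
        lda = models.LdaModel(corpus, id2word=dictionary, num_topics=len(batch))
        for bow in corpus:
            # Take each document's dominant topic and read off its top words
            # as that document's keywords.
            topics = lda.get_document_topics(bow, minimum_probability=0.0)
            topic_id, _ = max(topics, key=lambda t: t[1])
            results.append([w for w, _ in lda.show_topic(topic_id, topn=keywords_per_doc)])
    return results
```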

Related

Finding documents containing a phrase in a big corpus

I have a big corpus of text documents, which is dynamically updated: nearly 100 new documents are added to it every second.
I want to find, in real time, the documents that contain an input phrase query (or one of several input phrases). The queries also arrive sequentially at a high rate.
What is the right tool to implement this? Is Elasticsearch the appropriate one, or are there lighter tools?
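If Elasticsearch is the route taken, the core of it is just an index plus a match_phrase query. A minimal sketch with the official Python client (the client version, index name, and field name here are assumptions):

```python
from elasticsearch import Elasticsearch  # elasticsearch-py 8.x assumed

es = Elasticsearch("http://localhost:9200")

# Index incoming documents as they arrive; they become searchable after the
# next index refresh (about 1 second by default), which is near-real-time.
es.index(index="docs", document={"text": "new document body goes here"})

# Phrase query: returns documents containing the exact phrase.
resp = es.search(index="docs", query={"match_phrase": {"text": "input phrase"}})
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```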

Computing Document Similarity Matrices in Sphinx?

Does Sphinx provide a way to precompute document similarity matrices? I have looked at Sphinx/Solr/Lucene; it seems Lucene can do this indirectly using term vectors (see Computing Document Similarity with Term Vectors).
Currently I am using the tf-idf-similarity gem to do these calculations, but it is incredibly slow as the dataset grows; something like O(n^(n-1)).
I'm currently trying to find a faster alternative. Lucene seems like a potential solution, but it doesn't have as much support within the Ruby community, so if Sphinx has a good way of doing this, that would be ideal.
Just to clarify: I am not trying to do live search similarity matching, which appears to be the most common use case for both Lucene and Sphinx. I am trying to precompute a similarity matrix relating all documents within the dataset. This will subsequently be used in data visualizations for different types of user analysis.
Also, for anyone with prior experience doing this, I'm curious about benchmarks: how long it took to process, and how much computing power and/or parallelization you used, relative to the number of documents and average document size.
Currently it takes me about 40 minutes to process roughly 4,000 documents and about 2 hours to process 6,400 records. I'm providing the two different sizes and times to give an indication of the growth rate, so you can see how slow this would become with significantly larger datasets.
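For scale, here is what the precomputation looks like outside of Ruby. This sketch uses scikit-learn in Python (the library choice and placeholder corpus are assumptions, just to illustrate that the whole matrix reduces to one sparse matrix product):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["first document text", "second document text", "third document text"]  # placeholder corpus

tfidf = TfidfVectorizer().fit_transform(documents)   # sparse matrix, shape (n_docs, n_terms)
similarity = cosine_similarity(tfidf)                # dense matrix, shape (n_docs, n_docs)

# similarity[i, j] is the cosine similarity between documents i and j.
# The whole matrix comes from a single sparse matrix product, so the cost
# grows roughly quadratically with the number of documents.
```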

What algorithms should I experiment with to try and classify these PDFs?

We are crawling and downloading lots of companies' PDFs and trying to pick out the ones that are Annual Reports. Such reports can be downloaded from most companies' investor-relations pages.
The PDFs are scanned and the database is populated with, among other things, the:
Title
Contents (full text)
Page count
Word count
Orientation
First line
Using this data we are checking for the obvious phrases such as:
Annual report
Financial statement
Quarterly report
Interim report
Then recording the frequency of these phrases and others. So far we have around 350,000 PDFs to scan and a training set of 4,000 documents that have been manually classified as either a report or not.
We are experimenting with a number of different approaches including Bayesian classifiers and weighting the different factors available. We are building the classifier in Ruby. My question is: if you were thinking about this problem, where would you start?
You should try a quick and basic approach first to form a baseline, which may be good enough for your purposes. Here is one such approach:
Scan all PDFs and build the vocabulary, which is a numbered list of all words that occur in any document.
Create a feature vector for each document from this vocabulary by counting the frequency of each word (all words; don't bother hand-picking them). Feature i of document j is the number of times word i appears in document j.
Then weight the features by word importance, which is inversely related to how often the word occurs across all documents (i.e., the more often a word occurs in all documents, e.g. "the", the less information it carries).
Then use an unsupervised clustering algorithm such as k-means to cluster the documents. Initialize by randomly placing k cluster centroids, assign each document to its nearest centroid, move each centroid to the average of the documents assigned to it, and repeat the last two steps until convergence.
Then find the cluster that contains annual reports by using a few hand-labeled examples.
Adjust the number of clusters on a cross-validation set until the accuracy on that set is high.
Finally, test on a held-out test set. If accuracy there is low, go back and revisit the earlier steps. A minimal sketch of this pipeline follows below.
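The following is one possible rendering of that baseline in Python with scikit-learn (the original poster works in Ruby, so the library, the hypothetical load_pdf_texts() loader, and the cluster count are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = load_pdf_texts()   # hypothetical loader returning the full text of each PDF
n_clusters = 20            # tune this against your cross-validation set

# Steps 1-3: vocabulary, per-document word counts, and downweighting of
# words that appear in many documents (TF-IDF).
features = TfidfVectorizer().fit_transform(texts)

# Step 4: unsupervised clustering with k-means.
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(features)

# Step 5: check which cluster(s) your hand-labeled annual reports fall into,
# then treat membership in those clusters as the "annual report" prediction.
```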
For my dissertation a few years back I did something similar, but with digitised lecture slides and exam papers. One of the nicest books I came across for a good, broad overview of search engines, search algorithms, and measuring search effectiveness was:
Search Engines: Information Retrieval in Practice, W. Bruce Croft, Donald Metzler, Trevor Strohman
There are some sample chapters on the publisher's website which will tell you whether the book is for you: pearsonhighered.com
Hope that helps.

Efficient Phrase Matching Algorithm

I have a set of about 7 million phrases to be matched against about 300 million queries.
Queries can be substrings of the phrases or contain the phrases themselves. Basically, I want a measure of 'similarity' between two phrases [not necessarily the edit distance].
Can someone give some pointers to efficient algorithms for doing this? I would prefer distributed algorithms, since I am going to do this on Hadoop via streaming with Python.
Bed trees look interesting
Bed-Tree: An All-Purpose Index Structure for String Similarity Search Based on Edit Distance (Pdf of presentation)
This is at least not trivial, because you have a huge amount of data on one side and even more on the other.
The simplest approach would be to build a Lucene index on the 7 million phrases and let the Hadoop job query that index. I'm not quite sure whether you need a Solr server for that, or whether there are similar implementations in Python.
The mapper should write out the phrase ID or line number, whatever you have to identify it, or at least the phrase itself, along with the matching score.
In the reduce step you could then reduce on the phrase key and write out all the related phrases with their scores (or whatever you want).
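For the streaming part, the skeleton might look like the following (Python, since that is what the question mentions; best_matches() is a placeholder for whatever lookup you run against the phrase index, such as a Lucene/Solr query or an in-memory structure, and is my assumption, not an existing API):

```python
#!/usr/bin/env python
# mapper.py -- emit (phrase_id, query, score) for every candidate match.
import sys

def best_matches(query):
    """Placeholder: return [(phrase_id, score), ...] for this query."""
    return []

for line in sys.stdin:
    query = line.rstrip("\n")
    for phrase_id, score in best_matches(query):
        print("%s\t%s\t%f" % (phrase_id, query, score))
```

```python
#!/usr/bin/env python
# reducer.py -- input arrives sorted by key, so all queries matching the
# same phrase are adjacent and can be written out together.
import sys
from itertools import groupby

def key_of(line):
    return line.split("\t", 1)[0]

for phrase_id, lines in groupby(sys.stdin, key=key_of):
    for line in lines:
        _, query, score = line.rstrip("\n").split("\t")
        print("%s\t%s\t%s" % (phrase_id, query, score))
```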
For similarity you can read further here:
Similarity of Apache Lucene
Apache Lucene itself

Optimal Document Size for LSI Similarity Model

I'm using Gensim's excellent library to compute similarity queries on a corpus using LSI. However, I have a distinct feeling that the results could be better, and I'm trying to figure out whether I can adjust the corpus itself in order to improve the results.
I have a certain amount of control over how to split the documents. My original data contains a lot of very short documents (the mean length is 12 words per document, but some documents are only 1-2 words long...), and there are a few logical ways to concatenate several documents into one. The problem is that I don't know whether it's worth doing this or not (and if so, to what extent). I can't find any material addressing this question, only material on the size of the corpus and the size of the vocabulary. I assume this is because, at the end of the day, the size of a document is bounded by the size of the vocabulary. But I'm sure there are still some general guidelines that could help with this decision.
What is considered a document that is too short? What is too long? (I assume the latter is a function of |V|, but the former could easily be a constant value.)
Does anyone have experience with this? Can anyone point me in the direction of any papers/blog posts/research that address this question? Much appreciated!
Edited to add:
Regarding the strategy for grouping documents - each document is a text message sent between two parties. The potential grouping is based on this, where I can also take into consideration the time at which the messages were sent. Meaning, I could group all the messages sent between A and B within a certain hour, or on a certain day, or simply group all the messages between the two. I can also decide on a minimum or maximum number of messages grouped together, but that is exactly what my question is about - how do I know what the ideal length is?
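For reference, this is roughly the pipeline in question, written out with gensim so that the "document" unit is explicit. Each entry of texts below would be one grouped unit (a single message, an hour of messages, a whole conversation, ...); the grouping itself, the placeholder data, and num_topics are assumptions:

```python
from gensim import corpora, models, similarities

texts = [["short", "message", "tokens"], ["another", "grouped", "document"]]  # placeholder

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Build the LSI space and a similarity index over the corpus.
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=100)
index = similarities.MatrixSimilarity(lsi[corpus])

# Similarity of a query to every document in the corpus.
query_bow = dictionary.doc2bow(["short", "query"])
sims = index[lsi[query_bow]]
```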
Looking at the number of words per document does not seem to me to be the correct approach. LSI/LSA is all about capturing the underlying semantics of the documents by detecting common co-occurrences.
You may want to read:
LSI: Probabilistic Analysis
Latent Semantic Analysis (particularly section 3.2)
A relevant excerpt from [2]:
An important feature of LSI is that it makes no assumptions about a particular generative model behind the data. Whether the distribution of terms in the corpus is “Gaussian”, Poisson, or some other has no bearing on the effectiveness of this technique, at least with respect to its mathematical underpinnings. Thus, it is incorrect to say that use of LSI requires assuming that the attribute values are normally distributed.
What I would be more concerned about is whether the short documents share similar co-occurring terms that will allow LSI to form an appropriate topic grouping all of the documents that, for a human, share the same subject. This can hardly be done automatically (maybe with WordNet or another ontology) by substituting rare terms with more frequent and general ones. But that is a very long shot requiring further research.
A more specific answer on the heuristic:
My best bet would be to treat conversations as your documents. So the grouping would be based on the time proximity of the exchanged messages. I would group together anything up to a few minutes apart (a quarter of an hour?). There may be false positives, though (strongly depending on the actual contents of your dataset). As with any hyper-parameter in NLP, your mileage will vary, so it is worth doing a few experiments.
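A sketch of that conversation heuristic (the 15-minute gap, the message tuple layout, and the function name are assumptions to be tuned, as the answer says):

```python
from datetime import timedelta

GAP = timedelta(minutes=15)

def group_into_conversations(messages):
    """messages: list of (timestamp, sender, receiver, text), sorted by timestamp."""
    conversations = {}   # (party_a, party_b) -> list of conversations
    for ts, sender, receiver, text in messages:
        pair = tuple(sorted((sender, receiver)))
        convs = conversations.setdefault(pair, [])
        if convs and ts - convs[-1]["last_ts"] <= GAP:
            convs[-1]["texts"].append(text)                 # continue current conversation
            convs[-1]["last_ts"] = ts
        else:
            convs.append({"last_ts": ts, "texts": [text]})  # start a new conversation
    # Each conversation becomes one document for the LSI model.
    return [" ".join(c["texts"]) for convs in conversations.values() for c in convs]
```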
Short documents are indeed a challenge when it comes to applying LDA, since the estimates of word co-occurrence statistics are significantly worse for short documents (sparse data). One way to alleviate this issue is, as you mentioned, to aggregate multiple short texts into one longer document using some heuristic.
One particularly nice test case for this situation is topic modeling of Twitter data, since tweets are by definition limited to 140 characters. In Empirical Study of Topic Modeling in Twitter (Hong et al., 2010), the authors argue that
Training a standard topic model on aggregated user messages leads to a faster training process and better quality.
However, they also mention that different aggregation methods lead to different results:
Topics learned by using different aggregation strategies of the data are substantially different from each other.
My recommendations:
If you are using your own heuristic for aggregating short messages into longer documents, make sure to experiment with different aggregation techniques (potentially all the sensible ones).
Consider using a "heuristic-free" LDA variant that is better tailored to short messages, e.g., Unsupervised Topic Modeling for Short Texts Using Distributed Representations of Words.
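As an example of the first recommendation, one of the aggregation strategies studied by Hong et al. pools all messages from the same author into a single training document before running LDA. A sketch (gensim and the data layout here are illustrative assumptions, not the paper's setup):

```python
from collections import defaultdict
from gensim import corpora, models

def lda_on_user_aggregates(messages, num_topics=50):
    """messages: list of (author, list_of_tokens) pairs."""
    pooled = defaultdict(list)
    for author, tokens in messages:
        pooled[author].extend(tokens)          # one pseudo-document per author

    docs = list(pooled.values())
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    return models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics)
```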
