Does Sphinx provide a way to precompute document similarity matrices? I have looked at Sphinx, Solr, and Lucene; it seems Lucene can do this indirectly using term vectors (see "Computing Document Similarity with Term Vectors").
Currently I am using the tf-idf-similarity gem to do these calculations, but it is incredibly slow as the dataset grows; the growth looks super-linear, roughly O(n^2), since every document has to be compared against every other.
I am currently trying to find a faster alternative. Lucene seems like a potential solution, but it doesn't have as much support within the Ruby community, so if Sphinx has a good way of doing this, that would be ideal.
Just to clarify: I am not trying to do live search similarity matching, which appears to be the most common use case for both Lucene and Sphinx. I am trying to precompute a similarity matrix that relates all documents within the dataset. This will subsequently be used in data visualizations for different types of user analysis.
Also, for anyone with prior experience doing this, I'm curious about benchmarks: how long it took to process, and how much computing power and/or parallelization you used, relative to the number of documents and their average size.
Currently it takes about 40 minutes to process roughly 4,000 documents and about 2 hours to process 6,400 records. I'm providing the two different sizes and times to give an indication of the growth rate, so you can see how slow this would become with significantly larger datasets.
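For reference, the whole pairwise matrix can be computed as one sparse matrix product rather than a per-pair loop. A minimal sketch, outside the Ruby ecosystem, assuming Python with scikit-learn and a hypothetical docs list of raw document strings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["first document text", "second document text"]  # hypothetical corpus

tfidf = TfidfVectorizer().fit_transform(docs)  # sparse (n_docs x n_terms) matrix
sim = cosine_similarity(tfidf)                 # dense (n_docs x n_docs) similarity matrix
```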
Related
I am trying to find a solution for finding nearest or approximate nearest neighbors of documents.
Right now I am using tfidf as the vector representation of each document. My data is pretty big (N ~ 1 million). If I use annoy with tfidf, I run out of memory. I figure this is because of tfidf's high dimensionality (my vocabulary is about 2,000,000 Chinese words).
I then tried pysparNN, which works great. However, my concern is that as my data grows, pysparNN will build a bigger index, and eventually it might not fit into RAM. This is a problem because pysparNN does not use a static file the way annoy does.
I am wondering what might be a good solution for finding nearest neighbors for text data. Right now I am looking into using gensim's annoy index with doc2vec.
I don't find tfidf to be a great solution when it comes to document embedding.
You might try extracting more sophisticated text (doc) embeddings using FastText, LASER, gensim, BERT, ELMo and others, and then use annoy or faiss to build an index for retrieving similarities.
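A minimal sketch of that embed-then-index approach, assuming a dense embedding matrix has already been produced by one of the models above (the file names and tree count are illustrative, not prescriptive):

```python
import numpy as np
from annoy import AnnoyIndex

vectors = np.load("doc_embeddings.npy")    # hypothetical (n_docs, dim) embedding matrix
dim = vectors.shape[1]

index = AnnoyIndex(dim, "angular")         # angular distance approximates cosine
for i, v in enumerate(vectors):
    index.add_item(i, v.tolist())
index.build(50)                            # 50 trees: more trees -> better recall, bigger index
index.save("docs.ann")                     # static file that can be memory-mapped later

neighbors = index.get_nns_by_item(0, 10)   # 10 approximate nearest neighbors of document 0
```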
This is kind of a long shot, but I am hoping that someone has been in a similar situation, as I am looking for advice on how to efficiently bring a set of large word2vec models into a production environment.
We have a range of trained w2v models with a dimensionality of 300. Due to the underlying data (a huge corpus of POS-tagged words; specialized vocabularies of up to 1 million words), these models became quite large, and we are currently looking into effective ways to expose them to our users without paying too high a price in infrastructure.
Besides trying to better control the vocabulary size, obviously, dimensionality reduction on the feature vectors would be an option. Is anyone aware of publications around that, particularly on how this would affect model quality, and how to best measure this?
Another option is to pre-calculate the top X most similar words for each vocabulary word and provide a lookup table. With the model being that big, this is currently also very inefficient. Are there any known heuristics that could be used to reduce the number of necessary distance calculations from n × (n−1) to a lower number?
Thank you very much!
There are pre-indexing techniques for similarity-search in high-dimensional spaces which can speed nearest-neighbor discovery, but usually at a cost of absolute accuracy. (They also need more memory for the index.)
An example is the ANNOY library. The gensim project includes a demo notebook showing its use with Word2Vec.
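Roughly what that demo does, as a sketch (gensim 4.x module paths and a previously trained model file are assumed here):

```python
from gensim.models import Word2Vec
from gensim.similarities.annoy import AnnoyIndexer

model = Word2Vec.load("w2v.model")            # hypothetical trained model
indexer = AnnoyIndexer(model, num_trees=100)  # more trees: better accuracy, more memory

exact = model.wv.most_similar("word", topn=10)                    # full-sweep, exact
approx = model.wv.most_similar("word", topn=10, indexer=indexer)  # ANNOY-backed, approximate
```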
I once did some experiments using just 16-bit (rather than 32-bit) floats in a Word2Vec model. It saved memory in the idle state, and nearest-neighbor top-N results were nearly unchanged. But, perhaps because some behind-the-scenes up-conversion to 32-bit floats was still occurring during the one-against-all distance-calculations, speed of operations was actually reduced. (And this suggests that each distance-calculation may have caused a temporary memory expansion offsetting any idle-state savings.) So it's not a quick fix, but further research here – perhaps involving finding/implementing the right routines for float16 array operations – could maybe mean 50% model-size savings and equivalent or even better speed.
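A sketch of that kind of experiment in plain NumPy (gensim 4.x attribute names and a saved vectors file are assumptions; the explicit float32 casts mirror the suspected up-conversion):

```python
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load("vectors.kv")        # hypothetical vectors saved with kv.save()
vecs16 = kv.vectors.astype(np.float16)      # roughly halves idle memory

q = vecs16[kv.key_to_index["word"]].astype(np.float32)
q /= np.linalg.norm(q)
# casting back to float32 for the one-against-all product mirrors the up-conversion
# suspected above, which is why the memory saving may not become a speed gain
mat = vecs16.astype(np.float32)
mat /= np.linalg.norm(mat, axis=1, keepdims=True)
sims = mat @ q
top10 = np.argsort(-sims)[:10]
```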
For many applications, discarding the least-frequent words doesn't hurt much – or even, when done before training, can improve the quality of the remaining vectors. As many implementations, including gensim, sort the word-vector array in most-to-least-frequent order, you can discard the tail-end of the array to save memory, or limit most_similar() searches to the first-N entries to speed calculations.
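Because the vectors are stored most-frequent-first, limiting the search is a single parameter in gensim (a sketch; the 200,000-word cutoff is arbitrary):

```python
from gensim.models import KeyedVectors

kv = KeyedVectors.load("vectors.kv")   # hypothetical saved vectors
# restrict the full-sweep distance calculation to the 200,000 most frequent words
kv.most_similar("word", topn=10, restrict_vocab=200000)
```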
Once you've minimized the vocabulary size, you want to be sure the full set is in RAM, and no swapping is triggered during the (typical) full-sweep distance-calculations. If you need multiple processes to serve answers from the same vector set, as in a web service on a multicore machine, gensim's memory-mapping operations can prevent each process from loading its own redundant copy of the vectors. You can see a discussion of this technique in this answer about speeding gensim Word2Vec loading time.
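The memory-mapping trick looks roughly like this (a sketch; file names are placeholders):

```python
from gensim.models import KeyedVectors

# done once, offline: kv.save("vectors.kv") writes the big array to its own file

# in each serving process: map the same array read-only instead of loading a private copy
kv = KeyedVectors.load("vectors.kv", mmap="r")
```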
Finally, while precomputing top-N neighbors for a larger vocabulary is both time-consuming and memory-intensive, if your pattern of access is such that some tokens are checked far more than others, a cache of the N most-recently or M most-frequently requested top-N could improve perceived performance a lot – making only less-frequently-requested neighbor-lists require the full distance calculations to every other token.
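A minimal caching sketch along those lines, using a plain LRU cache over most_similar() (the cache size and file name are placeholders):

```python
from functools import lru_cache
from gensim.models import KeyedVectors

kv = KeyedVectors.load("vectors.kv", mmap="r")   # hypothetical saved vectors

@lru_cache(maxsize=10_000)          # keep the 10k most recently requested neighbor lists
def cached_neighbors(word, topn=10):
    # only cache misses pay for the full one-against-all distance calculation
    return tuple(kv.most_similar(word, topn=topn))
```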
I'm implementing a naive "keyword extraction algorithm". I'm self-taught though so I lack some terminology and maths common in the online literature.
I'm finding "most relevant keywords" of a document thus:
I count how often each term is used in the current document. Let's call this tf.
I look up how often each of those terms is used in the entire database of documents. Let's call this df.
I calculate a relevance weight r for each term by r = tf / df.
Each document is a proper subset of the corpus so no document contains a term not in the corpus. This means I don't have to worry about division by zero.
I sort all terms by their r and keep however many of the top terms. These are the top keywords most closely associated with this document. Terms that are common in this document are more important. Terms that are common in the entire database of documents are less important.
I believe this is a naive form of tf-idf.
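In code, the scheme described above is roughly the following (a sketch; corpus_df is assumed to map each term to its frequency across the whole database):

```python
from collections import Counter

def top_keywords(doc_tokens, corpus_df, k=10):
    tf = Counter(doc_tokens)                    # how often each term is used in this document
    r = {t: tf[t] / corpus_df[t] for t in tf}   # relevance weight r = tf / df
    return sorted(r, key=r.get, reverse=True)[:k]
```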
The problem is that when terms are very uncommon in the entire database but occur in the current document they seem to have too high an r value.
This can be thought of as some kind of artefact due to small sample size. What is the best way or the usual ways to compensate for this?
Throw away terms less common in the overall database than a certain threshold. If so how is that threshold calculated? It seems it would depend on too many factors to be a hard-coded value.
Can it be weighted or smoothed by some kind of mathematical function such as inverse square or cosine?
I've tried searching the web and reading up on tf-idf but much of what I find deals with comparing documents, which I'm not interested in. Plus most of them have a low ratio of explanation vs. jargon and formulae.
(In fact my project is a generalization of this problem. I'm really working with tags on Stack Exchange sites so the total number of terms is small, stopwords are irrelevant, and low-usage tags might be more common than low-usage words in the standard case.)
I spent a lot of time trying to do targeted Google searches for particular tf-idf information and dug through many documents.
Finally I found a document with a clear and concise explanation accompanied by formulae even I can grok: "Document Processing and the Semantic Web, Week 3 Lecture 1: Ranking for Information Retrieval" by Robert Dale of the Department of Computing at Macquarie University.
The key part is on page 20: the two things I was missing were taking into account the number of documents in the collection, and using the logarithm of the inverse df rather than using the inverse df directly.
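Putting those two missing pieces into the naive scheme gives the standard weighting (a sketch; here n_docs is the number of documents in the collection and df is the number of documents containing the term, rather than its raw corpus count):

```python
import math

def tfidf(tf, df, n_docs):
    # idf = log(N / df): the logarithm damps the huge weights that very rare
    # terms would otherwise get from a raw 1/df factor
    return tf * math.log(n_docs / df)
```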
Lately I've been mucking about with text categorization and language classification based on Cavnar and Trenkle's article "N-Gram-Based Text Categorization" as well as other related sources.
For doing language classification I've found this method to be very reliable and useful. The size of the documents used to generate the N-gram frequency profiles is fairly unimportant as long as they are "long enough" since I'm just using the most common n N-grams from the documents.
On the other hand, well-functioning text categorization eludes me. I've tried both my own implementations of various variations of the algorithms, with and without tweaks such as idf weighting, and other people's implementations. It works quite well as long as I can generate somewhat similarly sized frequency profiles for the category reference documents, but the moment they start to differ just a bit too much, the whole thing falls apart and the category with the shortest profile ends up getting a disproportionate number of documents assigned to it.
Now, my question is: what is the preferred method of compensating for this effect? It's obviously happening because the algorithm assumes a maximum distance for any given N-gram equal to the length of the category frequency profile, but for some reason I just can't wrap my head around how to fix it. One reason I'm interested in a fix is that I'm trying to automate the generation of category profiles based on documents with a known category, which can vary in length (and even if they are the same length, the profiles may end up being different lengths). Is there a "best practice" solution to this?
If you are still interested, and assuming I understand your question correctly, the answer to your problem would be to normalise your n-gram frequencies.
The simplest way to do this, on a per document basis, is to count the total frequency of all n-grams in your document and divide each individual n-gram frequency by that number. The result is that every n-gram frequency weighting now relates to a percentage of the total document content, regardless of the overall length.
Using these percentages in your distance metrics will discount the size of the documents and instead focus on the actual make up of their content.
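A per-document sketch of that normalisation (trigram profiles are assumed just for illustration):

```python
from collections import Counter

def normalized_profile(tokens, n=3):
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    # each n-gram weight is now a fraction of the whole document, so profile
    # length no longer dominates the distance metric
    return {g: c / total for g, c in counts.items()}
```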
It might also be worth noting that the n-gram representation only makes up a very small part of an entire categorisation solution. You might also consider using dimensionality reduction, different index-weighting metrics, and obviously different classification algorithms.
See here for an example of n-gram use in text classification
As I understand it, the task is to compute the probability that some text was generated by a language model M.
Recently I was working on measuring the readability of texts using semantic, syntactic, and lexical properties. It can also be measured by a language-model approach.
To answer properly you should consider these questions:
Are you using a log-likelihood approach?
What order of n-grams are you using: unigrams, bigrams, or higher?
How big are the language corpora that you use?
Using only unigrams and bigrams, I managed to classify some documents with nice results. If your classification is weak, consider building bigger language corpora or using lower-order n-grams.
Also remember that classifying some text into the wrong category can happen simply because of the length of the text (by chance, a few words will appear in another language's model).
Just consider making your language corpora bigger, and keep in mind that analysing short texts carries a higher probability of misclassification.
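A toy sketch of the log-likelihood scoring this answer alludes to (whitespace tokenisation and add-alpha smoothing are simplifying assumptions, and the language names are placeholders):

```python
import math
from collections import Counter

def ngrams(text, n=2):
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def train(texts, n=2):
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text, n))
    return counts, sum(counts.values())

def log_likelihood(text, counts, total, n=2, alpha=1.0):
    vocab = len(counts) + 1
    # add-alpha smoothing keeps unseen n-grams from zeroing out the score
    return sum(math.log((counts[g] + alpha) / (total + alpha * vocab))
               for g in ngrams(text, n))

# classify by the model under which the text is most probable, e.g.:
# models = {"en": train(english_texts), "de": train(german_texts)}   # hypothetical corpora
# best = max(models, key=lambda lang: log_likelihood(doc, *models[lang]))
```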
I'm using Gensim's excellent library to compute similarity queries on a corpus using LSI. However, I have a distinct feeling that the results could be better, and I'm trying to figure out whether I can adjust the corpus itself in order to improve the results.
I have a certain amount of control over how to split the documents. My original data has a lot of very short documents (mean length is 12 words in a document, but there exist documents that are 1-2 words long...), and there are a few logical ways to concatenate several documents into one. The problem is that I don't know whether it's worth doing this or not (and if so, to what extent). I can't find any material addressing this question, but only regarding the size of the corpus, and the size of the vocabulary. I assume this is because, at the end of the day, the size of a document is bounded by the size of the vocabulary. But I'm sure there are still some general guidelines that could help with this decision.
What is considered a document that is too short? What is too long? (I assume the latter is a function of |V|, but the former could easily be a constant value.)
Does anyone have experience with this? Can anyone point me in the direction of any papers/blog posts/research that address this question? Much appreciated!
Edited to add:
Regarding the strategy for grouping documents - each document is a text message sent between two parties. The potential grouping is based on this, where I can also take into consideration the time at which the messages were sent. Meaning, I could group all the messages sent between A and B within a certain hour, or on a certain day, or simply group all the messages between the two. I can also decide on a minimum or maximum number of messages grouped together, but that is exactly what my question is about - how do I know what the ideal length is?
Looking at number of words per document does not seem to me to be the correct approach. LSI/LSA is all about capturing the underlying semantics of the documents by detecting common co-occurrences.
You may want to read:
LSI: Probabilistic Analysis
Latent Semantic Analysis (particularly section 3.2)
A relevant excerpt from the second:
An important feature of LSI is that it makes no assumptions about a particular generative model behind the data. Whether the distribution of terms in the corpus is “Gaussian”, Poisson, or some other has no bearing on the effectiveness of this technique, at least with respect to its mathematical underpinnings. Thus, it is incorrect to say that use of LSI requires assuming that the attribute values are normally distributed.
What I would be more concerned about is whether the short documents share similar co-occurring terms, which would allow LSI to form an appropriate topic grouping all of those documents that, to a human, share the same subject. This can hardly be done automatically (maybe with WordNet or an ontology) by substituting rare terms with more frequent and general ones. But this is a very long shot requiring further research.
A more specific answer on the heuristic:
My best bet would be to treat conversations as your documents, so the grouping would be based on the time proximity of the exchanged messages. Anything up to a few minutes (a quarter of an hour?) I would group together. There may be false positives, though (strongly depending on the actual contents of your dataset). As with any hyper-parameter in NLP, your mileage will vary, so it is worth doing a few experiments.
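A sketch of that conversation grouping, with the quarter-of-an-hour gap as the tunable hyper-parameter:

```python
from datetime import timedelta

def group_into_conversations(messages, gap=timedelta(minutes=15)):
    """messages: (timestamp, text) pairs between one pair of parties, sorted by time."""
    conversations, current, last_ts = [], [], None
    for ts, text in messages:
        if last_ts is not None and ts - last_ts > gap:
            conversations.append(" ".join(current))   # gap too large: start a new document
            current = []
        current.append(text)
        last_ts = ts
    if current:
        conversations.append(" ".join(current))
    return conversations
```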
Short documents are indeed a challenge when it comes to applying LDA, since the estimates for the word co-occurrence statistics are significantly worse for short documents (sparse data). One way to alleviate this issue is, as you mentioned, to somehow aggregate multiple short texts into one longer document by some heuristic measure.
One particularly nice test case for this situation is topic modeling on Twitter data, since tweets are limited by definition to 140 characters. In "Empirical Study of Topic Modeling in Twitter" (Hong et al., 2010), the authors argue that
Training a standard topic model on aggregated user messages leads to a faster training process and better quality.
However, they also mention that different aggregation methods lead to different results:
Topics learned by using different aggregation strategies of the data are substantially different from each other.
My recommendations:
If you are using your own heuristic for aggregating short messages into longer documents, make sure to experiment with different aggregation techniques (potentially all the "sensical" ones); a minimal aggregation sketch follows after this list.
Consider using a "heuristic-free" LDA variant that is better tailored for short messages, e.g, Unsupervised Topic Modeling for Short Texts Using Distributed
Representations of Words
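As one possible aggregation heuristic (referenced in the first recommendation above), here is a minimal sketch that pools all of a user's messages into one pseudo-document before training an LDA model with gensim; per-author pooling is only one of the strategies the paper compares, and the function and parameter names are illustrative:

```python
from collections import defaultdict
from gensim import corpora, models

def lda_on_aggregated_messages(messages, num_topics=20):
    """messages: (author, tokenized_text) pairs; builds one pseudo-document per author."""
    pseudo_docs = defaultdict(list)
    for author, tokens in messages:
        pseudo_docs[author].extend(tokens)
    docs = list(pseudo_docs.values())

    dictionary = corpora.Dictionary(docs)          # map tokens to integer ids
    bow = [dictionary.doc2bow(d) for d in docs]    # bag-of-words corpus
    return models.LdaModel(bow, id2word=dictionary, num_topics=num_topics)
```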