I am trying to find a solution for nearest or approximate nearest neighbour search over documents.
Right now I am using tf-idf as the vector representation of each document. My data is pretty big (N ~ a million documents). If I use Annoy with tf-idf vectors, I run out of memory; I figured this is because of tf-idf's high dimensionality (my vocabulary is about 2,000,000 Chinese words).
I then tried PySparNN, which works great. However, my concern is that as my data grows, PySparNN builds a bigger index, and eventually it might not fit into RAM. This is a problem because PySparNN does not back its index with a static file the way Annoy does.
I am wondering what a good solution for nearest neighbour search over text data might be. Right now I am looking into using gensim's Annoy integration with doc2vec.
I don't find tf-idf to be a great solution when it comes to document embeddings.
You might try extracting more sophisticated text (document) embeddings using FastText, LASER, gensim, BERT, ELMo and others, and then use Annoy or Faiss to build an index for similarity retrieval.
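A minimal sketch of that pipeline, assuming dense embeddings are already computed (random vectors stand in for real doc2vec/BERT output, and a brute-force NumPy scan stands in for the Annoy/Faiss index; all names here are illustrative):

```python
import numpy as np

# Stand-in for real document embeddings (e.g. doc2vec or BERT output):
# a few hundred dense dimensions instead of a 2M-dimensional tf-idf vector.
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(1000, 128)).astype(np.float32)

# Normalise once so cosine similarity reduces to a dot product.
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)

def nearest(query_vec, k=5):
    """Return the indices of the k most similar documents (brute force)."""
    q = query_vec / np.linalg.norm(query_vec)
    sims = doc_vectors @ q          # cosine similarity to every document
    return np.argsort(-sims)[:k]

print(nearest(doc_vectors[42]))     # document 42 is its own nearest neighbour
```

With Annoy or Faiss, the brute-force scan is replaced by an approximate index, which is what makes this workable at N ~ millions.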
Related
I was recently asked a system design question where I needed to "design a system for document search", and the first thing that came to my mind was how Elasticsearch works. So I came up with the inverted index approach that is used to support text search. The inverted index has a record for each term. Each record holds the list of documents that the term appears in. Documents are identified by an integer document ID, and the list of document IDs is sorted in ascending order.
So I said something along those lines, but I am not sure whether this is how it should work in a distributed fashion, because we may have a lot of documents to index, so we need some load balancing and some way to partition the inverted index or the data. That is: which process uploads the documents, and which process tokenizes them (just one machine, or a fleet of machines)? Basically, I want to understand the right way to design a system like this with the proper components. How should we talk about this in a system design interview? What are the things we should touch on for this problem?
What is the right way to design a distributed document search system, with the right components?
OK... it is a vast subject. Elasticsearch was built for exactly this. So was Google, and from Elasticsearch to Google Search there is a technological gap.
If you go for your own implementation it is still possible, but there is a ton of work to do to be as efficient as Elasticsearch. The quick answer is: use Elasticsearch.
You may be curious, or for some reason you may need to write it yourself. So here is how it works:
TFIDF and cosine distance
As you specified, first you tokenize;
then you represent the tokenized text as a vector and measure the angular distance between the text and the search term.
Imagine your language has only 3 words: "foo, bar, bird".
Then a text with "foo bar bird" can be represented by the vector [1,1,1].
A text with
A) "foo foo foo foo bird" will be [4,0,1]
and another with
B) "foo bar" will be [1,1,0].
If you search for "bar", which is represented by [0,1,0], you look for the text with the minimal angular distance. The angle between your search and B is 45°, which is lower than the 90° you get against A, so B is the better match.
A real language has far more than 3 words, so you compute the distance in a vector space of many more dimensions, because 1 word = 1 dimension :)
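The angles above can be verified in a few lines (the function name is mine):

```python
import math

def angle_deg(u, v):
    # angular distance between two term-count vectors, in degrees
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(dot / norm))

query = [0, 1, 0]   # "bar"
a = [4, 0, 1]       # "foo foo foo foo bird"
b = [1, 1, 0]       # "foo bar"
print(round(angle_deg(query, a)))   # 90 -> A shares no word with the query
print(round(angle_deg(query, b)))   # 45 -> B is the closer text
```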
TF-IDF stands for term frequency–inverse document frequency.
It rates the frequency of a word in a document against the inverse of that word's frequency across all documents. What it does is highlight the important words in a document.
Let me explain:
words like "that", "in", "a", "the" are everywhere, so they are not important;
suppose that across all of your texts the word "corpus" has a frequency of, say, 0.0001%,
while in one particular text it is cited 5 times and has a frequency of 0.1%;
then it is quite rare in the corpus but comparatively important in that text,
so when you search for "corpus" you want to get first the text where it appears those 5 times.
So instead of a vector of occurrence counts you use a vector of relative occurrence frequencies, for example [0.2, 1, 0.0005].
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
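Plugging the numbers from the "corpus" example into that idea (a sketch of the ratio only, not the exact TF-IDF formula):

```python
corpus_freq = 0.000001   # 0.0001%: the word is rare across all documents
doc_freq = 0.001         # 0.1%: but relatively frequent in this one text
importance = doc_freq / corpus_freq
print(round(importance)) # 1000 -> rare in the corpus, important in the text
```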
I have a little surprise for you at the end: this is exactly what is behind the scoring in Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
In addition, Elasticsearch will provide you with replication, scalability, distribution and anything else you could dream of.
There are still reasons not to use Elasticsearch:
research purposes;
you are actually writing another search engine, like DuckDuckGo (why not);
you are curious.
The scaling and distribution part
Read about Lucene first: https://en.wikipedia.org/wiki/Apache_Lucene
It depends greatly on the volume of text and the number of words to be indexed.
If it is around 1M texts, you don't need to distribute the index.
You will need a distributed inverted index if you index something big like Wikipedia (Wikipedia uses Elasticsearch for its search box).
"foo" is in texts A, B, C, R,
so I would partition my index.
I would use a distributed cache with words as keys and lists of pointers to the vectors as values, and I would store the values in memory-mapped files.
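A toy sketch of such a partitioned index (the hash-based sharding and all names are my own assumptions; a real system would keep its postings in memory-mapped files, not Python dicts):

```python
from collections import defaultdict

N_SHARDS = 4  # pretend each shard lives on a different machine

def build_sharded_index(docs):
    """docs: {doc_id: text}. Each shard maps term -> ascending doc-id list."""
    shards = [defaultdict(list) for _ in range(N_SHARDS)]
    for doc_id in sorted(docs):              # keeps posting lists sorted
        for term in set(docs[doc_id].split()):
            shards[hash(term) % N_SHARDS][term].append(doc_id)
    return shards

def lookup(shards, term):
    # only one shard (one machine) needs to answer for a given term
    return shards[hash(term) % N_SHARDS].get(term, [])

docs = {1: "foo bar bird", 2: "foo bird", 3: "foo bar"}
shards = build_sharded_index(docs)
print(lookup(shards, "bar"))   # [1, 3]
```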
A search engine must be fast, so if you build it yourself you will keep external libraries to a minimum. You will use C++.
At Google they ended up in a situation where the vectors took so much space that they had to store them across multiple machines, so they invented GFS, a distributed file system.
The cosine distance computation between multidimensional vectors is time-consuming, so I would go for computation on a GPU, because GPUs are efficient at floating-point operations on matrices and vectors.
Actually it is a little bit crazy to reimplement all of that unless you have a good reason to do so, for example a very good business model :)
I would probably use Kubernetes, Docker and Mesos to virtualize all my components. I would look for something similar to GFS if high volume is needed:
https://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems
You will need to get the text back, so I would use any NIO web server that scales, in any language: nginx to serve the static pages, and something like Netty or Vert.x to handle the search and build the answer of links to the texts (it depends on how many users you want to serve per second).
All of that only if I plan to index something bigger than Wikipedia, and if I plan to invent something better than Elasticsearch (hard task, good luck);
for example, Wikipedia is less than 1 TB of text.
Finally:
if you do it with Elasticsearch you are done in one week, maybe two, and in production. If you do it yourself, you will need at least one senior developer, a data scientist and an architect, and something like a year or more depending on the volume of text you want to index. I can't stop asking myself "what is it for?".
Actually, if you read the source code of Lucene you will know exactly what you need. They have done it: Lucene is the engine of Elasticsearch.
Twitter, for example, uses Lucene for its real-time search.
Does Sphinx provide a way to precompute document similarity matrices? I have looked at Sphinx/Solr/Lucene; it seems Lucene is able to do this indirectly using term vectors (Computing Document Similarity with Term Vectors).
Currently I am using the tf-idf-similarity gem to do these calculations, but it is incredibly slow as the dataset grows; the runtime appears to blow up roughly quadratically, since every document is compared against every other.
Currently trying to find a faster alternative to this. Lucene seems like a potential solution, but it doesn't have as much support within the Ruby community so if Sphinx has a good way of doing this that would be ideal.
Just to clarify: I am not trying to do live search similarity matching, which appears to be the most common use case for both Lucene and Sphinx. I am trying to precompute a similarity matrix that relates every document in the dataset to every other. This will subsequently be used in data visualizations for different types of user analysis.
Also, for anyone with prior experience doing this, I'm curious about benchmarks: how it looks in terms of processing time, and how much computing power and/or parallelization you were using, relative to the number of documents and average document size.
Currently it takes about 40 minutes for me to process roughly 4000 documents and about 2 hours to process 6400 records. I am providing the two different sizes and times here to give an indication of the growth rate, so you can see how slow this would become with significantly larger datasets.
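For reference, the matrix the asker describes can be computed in one vectorised step rather than pair by pair, though it remains quadratic in the number of documents. A NumPy sketch (not the tf-idf-similarity gem's API; names are mine):

```python
import numpy as np

def similarity_matrix(doc_vectors):
    # rows are tf-idf document vectors; normalising and taking one
    # matrix product yields every pairwise cosine similarity at once
    unit = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    return unit @ unit.T

m = similarity_matrix(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
print(m.shape)   # (3, 3); the diagonal is all ones
```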
I have to manually go through a long list of terms (~3500) which have been entered by users through the years. Beside other things, I want to reduce the list by looking for synonyms, typos and alternate spellings.
My work will be much easier if I can group the list into clusters of possible typos before starting. I was imagining using some metric which can calculate the similarity of a pair of terms, e.g. as a percentage, and then clustering everything with a similarity above some threshold. As I am going through it manually anyway, I don't mind a high failure rate, if it keeps the whole thing simple.
Ideally, there exists some easily available library to do this for me, implemented by people who know what they are doing. If there is no such, then at least one calculating a similarity metric for a pair of strings would be great, I can manage the clustering myself.
If this is not available either, do you know of a good algorithm which is simple to implement? I was first thinking a Hamming distance divided by word length will be a good metric, but noticed that while it will catch swapped letters, it won't handle deletions and insertions well (ptgs-1 will be caught as very similar to ptgs/1, but hematopoiesis won't be caught as very similar to haematopoiesis).
As for the requirements on the library/algorithm: it has to rely completely on spelling. I know that the usual NLP libraries don't work this way, but
there is no full text available for it to consider context.
it can't use a dictionary corpus of words, because the terms are far outside of any everyday language, frequently abbreviations of highly specialized terms.
Finally, I am most familiar with C# as a programming language, and I already have a C# pseudoscript which does some preliminary cleanup. If there is no one-step solution (feed list in, get grouped list out), I will prefer a library I can call from within a .NET program.
The whole thing should be relatively quick to learn for somebody with almost no previous knowledge in information retrieval. This will save me maybe 5-6 hours of manual work, and I don't want to spend more time than that in setting up an automated solution. OK, maybe up to 50% longer if I get the chance to learn something awesome :)
The question: What should I use, a library, or an algorithm? Which ones should I consider? If what I need is a library, how do I recognize one which is capable of delivering results based on spelling alone, as opposed to relying on context or dictionary use?
Edit: To clarify, I am not looking for actual semantic relatedness the way search or recommendation engines need it. I need to catch typos. So I am looking for a metric by which mouse and rodent have zero similarity, but mouse and house have a very high similarity. And I am afraid that tools like Lucene use a metric which gets these two examples wrong (for my purposes).
Basically you are looking to cluster terms according to Semantic Relatedness.
One (hard) way to do it is to follow the Gabrilovich and Markovitch approach.
A quicker way will be consisting of the following steps:
Download a Wikipedia dump and an open-source Information Retrieval library such as Lucene (or Lucene.NET).
Index the files.
Search each term in the index - and get a vector - denoting how relevant the term (the query) is for each document. Note that this will be a vector of size |D|, where |D| is the total number of documents in the collection.
Cluster your vectors in any clustering algorithm. Each vector represents one term from your initial list.
If you are interested only in "visual" similarity (words written similarly to each other) then you can settle for Levenshtein distance, but it won't be able to give you semantic relatedness of terms. For example, you won't be able to relate "fall" and "autumn".
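For the pure-spelling side, Python's standard library ships a ready-made similarity ratio (Ratcliff/Obershelp, close in spirit to edit distance). The asker prefers C#, so treat this as an illustration of the metric rather than a library recommendation:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # 0.0 = nothing in common, 1.0 = identical strings
    return SequenceMatcher(None, a, b).ratio()

print(similarity("mouse", "house"))                   # high: one letter differs
print(similarity("mouse", "rodent"))                  # low: spelling unrelated
print(similarity("hematopoiesis", "haematopoiesis"))  # catches the insertion
```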
I'm implementing a naive "keyword extraction algorithm". I'm self-taught though so I lack some terminology and maths common in the online literature.
I'm finding "most relevant keywords" of a document thus:
I count how often each term is used in the current document. Let's call this tf.
I look up how often each of those terms is used in the entire database of documents. Let's call this df.
I calculate a relevance weight r for each term by r = tf / df.
Each document is a proper subset of the corpus so no document contains a term not in the corpus. This means I don't have to worry about division by zero.
I sort all terms by their r and keep however many of the top terms. These are the top keywords most closely associated with this document. Terms that are common in this document are more important. Terms that are common in the entire database of documents are less important.
I believe this is a naive form of tf-idf.
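The steps above can be sketched as follows (names are mine; corpus_df is assumed to hold each term's count over the whole database):

```python
from collections import Counter

def top_keywords(doc_tokens, corpus_df, k=3):
    tf = Counter(doc_tokens)                       # counts in this document
    r = {t: tf[t] / corpus_df[t] for t in tf}      # relevance weight r = tf / df
    return sorted(r, key=r.get, reverse=True)[:k]  # top terms by r

corpus_df = {"the": 1000, "cat": 40, "astrolabe": 2}
doc = ["the", "the", "cat", "astrolabe"]
print(top_keywords(doc, corpus_df, k=2))   # ['astrolabe', 'cat']
```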
The problem is that when terms are very uncommon in the entire database but occur in the current document they seem to have too high an r value.
This can be thought of as some kind of artefact due to small sample size. What is the best way or the usual ways to compensate for this?
Throw away terms less common in the overall database than a certain threshold. If so how is that threshold calculated? It seems it would depend on too many factors to be a hard-coded value.
Can it be weighted or smoothed by some kind of mathematical function such as inverse square or cosine?
I've tried searching the web and reading up on tf-idf but much of what I find deals with comparing documents, which I'm not interested in. Plus most of them have a low ratio of explanation vs. jargon and formulae.
(In fact my project is a generalization of this problem. I'm really working with tags on Stack Exchange sites so the total number of terms is small, stopwords are irrelevant, and low-usage tags might be more common than low-usage words in the standard case.)
I spent a lot of time trying to do targeted Google searches for particular tf-idf information and dug through many documents.
Finally I found a document with clear and concise explanation accompanied by formulae even I can grok: Document Processing and the Semantic Web, Week 3 Lecture 1: Ranking for Information Retrieval by Robert Dale of the Department of Computing at Macquarie University:
Page 20:
The two things I was missing were taking into account the number of documents in the collection, and using the logarithm of the inverse df rather than using the inverse df directly.
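With those two fixes, the weight becomes the standard tf * log(N/df); a minimal sketch:

```python
import math

def tf_idf(tf, df, n_docs):
    # tf * log(N / df): a term occurring in every document (df == N)
    # scores zero, while a rare term gets a large boost
    return tf * math.log(n_docs / df)

print(tf_idf(5, 1, 1000))      # rare term: large weight
print(tf_idf(5, 1000, 1000))   # ubiquitous term: weight 0.0
```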
Lately I've been mucking about with text categorization and language classification based on Cavnar and Trenkle's article "N-Gram-Based Text Categorization" as well as other related sources.
For doing language classification I've found this method to be very reliable and useful. The size of the documents used to generate the N-gram frequency profiles is fairly unimportant as long as they are "long enough", since I'm just using the n most common N-grams from each document.
On the other hand, well-functioning text categorization eludes me. I've tried my own implementations of various variations of the algorithms at hand, with and without tweaks such as idf weighting, as well as other people's implementations. It works quite well as long as I can generate somewhat similarly-sized frequency profiles for the category reference documents, but the moment they start to differ just a bit too much the whole thing falls apart, and the category with the shortest profile ends up getting a disproportionate number of documents assigned to it.
Now, my question is. What is the preferred method of compensating for this effect? It's obviously happening because the algorithm assumes a maximum distance for any given N-gram that equals the length of the category frequency profile but for some reason I just can't wrap my head around how to fix it. One reason I'm interested in this fix is actually because I'm trying to automate the generation of category profiles based on documents with a known category which can vary in length (and even if they are the same length the profiles may end up being different lengths). Is there a "best practice" solution to this?
If you are still interested, and assuming I understand your question correctly, the answer to your problem would be to normalise your n-gram frequencies.
The simplest way to do this, on a per document basis, is to count the total frequency of all n-grams in your document and divide each individual n-gram frequency by that number. The result is that every n-gram frequency weighting now relates to a percentage of the total document content, regardless of the overall length.
Using these percentages in your distance metrics will discount the size of the documents and instead focus on the actual make up of their content.
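A minimal sketch of that per-document normalisation, using character n-grams (names are mine):

```python
from collections import Counter

def ngram_profile(text, n=3):
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    # each weight is now a fraction of the whole document,
    # so profiles of short and long documents are directly comparable
    return {g: c / total for g, c in counts.items()}

# a text and a longer repetition of it yield the same profile:
print(ngram_profile("abab") == ngram_profile("ababab"))   # True
```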
It might also be worth noting that the n-gram representation only makes up a very small part of an entire categorisation solution. You might also consider using dimensional reduction, different index weighting metrics and obviously different classification algorithms.
See here for an example of n-gram use in text classification
As I understand it, the task is to compute the probability that some text was generated by a language model M.
Recently I was working on measuring the readability of texts using semantic, syntactic and lexical properties; that can also be measured with a language-model approach.
To answer properly you should consider these questions:
Are you using a log-likelihood approach?
What level of n-grams are you using? Unigrams, bigrams, or higher?
How big are the language corpora that you use?
Using only bigrams and unigrams I managed to classify some documents with nice results. If your classification is weak, consider creating a bigger language corpus or using n-grams of lower orders.
Also remember that classifying some text into the wrong category can happen depending on the length of the text (by chance, a few words may match another language model).
Just consider making your language corpora bigger, and know that analysing short texts carries a higher probability of misclassification.
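A hedged sketch of that language-model scoring, with add-alpha smoothing so that a few stray words do not zero out the probability (the smoothing choice and all names are my own assumptions):

```python
import math
from collections import Counter

def train(text):
    return Counter(text.split())            # unigram counts

def log_likelihood(tokens, model, vocab_size, alpha=1.0):
    # log P(text | model) under an add-alpha smoothed unigram model
    total = sum(model.values())
    return sum(
        math.log((model[t] + alpha) / (total + alpha * vocab_size))
        for t in tokens
    )

def classify(text, models, vocab_size=1000):
    # pick the language model that most probably generated the text
    tokens = text.split()
    return max(models, key=lambda name: log_likelihood(tokens, models[name], vocab_size))

models = {"en": train("the cat sat on the mat"),
          "fr": train("le chat dort sur le tapis")}
print(classify("the cat", models))   # en
print(classify("le chat", models))   # fr
```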