How can we design a system for document search? - full-text-search

I was recently asked a system design question where I needed to "design a system for document search", and the first thing that came to my mind was how Elasticsearch works. So I came up with the inverted index approach that is used to support text search. The inverted index has a record for each term. Each record has the list of documents that the term appears in. Documents are identified by an integer document ID, and the list of document IDs is sorted in ascending order.
So I said something along those lines, but I am not sure whether this is how it should work in a distributed fashion, because we may have a lot of documents to index, so we need some load balancer and we somehow need to partition the inverted index or the data. In other words, what process uploads the documents and what process tokenizes them (is it just one machine or a group of machines)? Basically, I wanted to understand the right way to design a system like this with the proper components. How should we talk about this in a system design interview? What are the things we should touch on in an interview for this problem?
What's the right way to design a system for document search in a distributed fashion, with the right components?

OK... it is a vast subject. Elasticsearch was built for exactly that, and so was Google, although there is a technological gap between Elasticsearch and Google search.
If you go for your own implementation, it is still possible, but there is a ton of work to do to be as efficient as Elasticsearch. The quick answer is: use Elasticsearch.
You may be curious, or for some reason you may need to write it yourself. So here is how it works:
TF-IDF and cosine distance
As you specified, first you will tokenize,
then you will represent the tokenized text as a vector, and then measure the angular distance between the text and the search words.
Imagine your language has only 3 words: "foo", "bar", "bird".
A text with "foo bar bird" can then be represented by the vector [1,1,1].
A text with
A) "foo foo foo foo bird" will be [4,0,1]
and another with
B) "foo bar" will be [1,1,0].
If you search for "bar", which is represented by [0,1,0], you will look for the text with the minimal angular distance to your search. The angle between the search and B is 45°, which is smaller than the 90° between the search and A, so B is the better match.
Of course a real language has far more than 3 words, so you will compute the distance in a vector space of many more dimensions, because 1 word = 1 dimension :)
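To make that concrete, here is a minimal sketch in Python of turning those toy texts into count vectors and ranking them by cosine similarity against the search; the vocabulary is just the three words from the example.

    import math

    # Toy vocabulary from the example above: one dimension per word.
    vocab = ["foo", "bar", "bird"]

    def to_vector(text):
        """Count how often each vocabulary word appears in the text."""
        tokens = text.lower().split()
        return [tokens.count(word) for word in vocab]

    def cosine_similarity(a, b):
        """1.0 means 0 degrees apart, 0.0 means 90 degrees apart."""
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    doc_a = to_vector("foo foo foo foo bird")   # [4, 0, 1]
    doc_b = to_vector("foo bar")                # [1, 1, 0]
    query = to_vector("bar")                    # [0, 1, 0]

    print(cosine_similarity(query, doc_a))  # 0.0  -> 90 degrees, poor match
    print(cosine_similarity(query, doc_b))  # ~0.71 -> 45 degrees, better match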
TF-IDF stands for term frequency-inverse document frequency.
It weights the frequency of a word in a document by the inverse of the frequency of that word across all documents. What this does is highlight the important words in a document.
Let me explain:
Words like "that", "in", "a", "the", etc. are everywhere, so they are not important.
Across all of your texts the word "corpus" has a frequency of, say, 0.0001%.
In one particular text it is cited 5 times and has a frequency of 0.1%.
So it is quite rare in the corpus but comparatively important in that text.
So when you search for "corpus", what you want is to get first the text where it appears 5 times.
So instead of a vector of raw occurrence counts you will have a vector of relative occurrence frequencies, for example [0.2, 1, 0.0005].
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
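If it helps, here is a bare-bones sketch of that weighting on the toy texts above. This is the textbook tf*idf formula, not the exact normalisation Lucene/Elasticsearch applies.

    import math

    docs = [
        "foo foo foo foo bird".split(),
        "foo bar".split(),
        "foo bar bird".split(),
    ]

    def tf_idf(term, doc, docs):
        # Term frequency: share of this document's tokens that are the term.
        tf = doc.count(term) / len(doc)
        # Inverse document frequency: terms that are rare across the corpus
        # get boosted, terms that appear everywhere drop towards zero.
        containing = sum(1 for d in docs if term in d)
        idf = math.log(len(docs) / containing)
        return tf * idf

    for doc in docs:
        print({t: round(tf_idf(t, doc, docs), 3) for t in set(doc)})
    # "foo" appears in every document, so its weight is 0 everywhere,
    # exactly like the "that, in, a, the" case described above.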
I have a little surprise for you at the end: this is what is behind
the scoring in Elasticsearch https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
In addition, Elasticsearch gives you replication, scalability, distribution and anything else you could dream of.
There are still reasons not to use Elasticsearch:
research purposes
you are actually writing another search engine, like DuckDuckGo (why not)
you are curious
The scaling and distribution part
Read about Lucene first: https://en.wikipedia.org/wiki/Apache_Lucene
It depends greatly on the volume of text and words to be indexed.
If it amounts to about 1M texts, you don't need to distribute it.
You will need a distributed inverted index if you index something big like Wikipedia (Wikipedia uses Elasticsearch for its search box).
foo is in texts A, B, C, R
so I will partition my index.
I would use a distributed cache with the words as keys and a list of pointers to the vectors as values, and I would store the values in memory-mapped files.
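As a rough illustration of that partitioning (just an in-memory sketch; a real system would put each shard on its own machine and keep the postings in a memory-mapped store), the same hash routes a term to its shard at index time and at query time:

    import hashlib
    from collections import defaultdict

    NUM_SHARDS = 4  # hypothetical number of index partitions

    def shard_for(term):
        """Route a term to a shard with a stable hash."""
        digest = hashlib.md5(term.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_SHARDS

    # One inverted index per shard: term -> list of document IDs, ascending.
    shards = [defaultdict(list) for _ in range(NUM_SHARDS)]

    def index_document(doc_id, text):
        for term in set(text.lower().split()):
            postings = shards[shard_for(term)][term]
            postings.append(doc_id)
            postings.sort()

    def lookup(term):
        return shards[shard_for(term)].get(term, [])

    index_document(1, "foo bar bird")
    index_document(2, "foo foo bird")
    print(lookup("foo"))   # [1, 2]
    print(lookup("bar"))   # [1]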
A search engine is something that must be fast, so if you build it yourself you will keep external libraries to a minimum. You will use C++.
At Google they ended up in a situation where the vectors took so much space that they needed to store them on multiple machines, so they invented GFS, a distributed file system.
Cosine distance computations between high-dimensional vectors are time consuming, so I would go for computation on the GPU, because GPUs are efficient for floating-point operations on matrices and vectors.
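The reason the GPU helps is that scoring every document against the query is really one matrix-vector product. Here is the same toy scoring written that way with numpy (numpy runs on the CPU, but a GPU array library such as CuPy exposes a near-identical API):

    import numpy as np

    # One row per document, one column per vocabulary word (foo, bar, bird).
    doc_matrix = np.array([[4.0, 0.0, 1.0],
                           [1.0, 1.0, 0.0],
                           [1.0, 1.0, 1.0]])
    query = np.array([0.0, 1.0, 0.0])

    # Cosine similarity of every document against the query in one shot.
    doc_norms = np.linalg.norm(doc_matrix, axis=1)
    scores = doc_matrix @ query / (doc_norms * np.linalg.norm(query))

    print(scores)                    # [0.0, 0.707, 0.577]
    print(scores.argsort()[::-1])    # document indices, best match first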
Actually it is a little bit crazy to reimplement all of that unless you have a good reason to do so, for example a very good business model :)
I would probably use Kubernetes, Docker and Mesos to virtualize all my components, and I would look for something similar to GFS if high volume is needed.
https://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems
You will need to get the text back, so I would use any NIO web server that scales, in any language. I would use nginx to serve the static pages and something like Netty or Vert.x to handle the search, then build the response with links to the texts (it depends on how many users you want to serve per second).
All of that only if I plan to index something bigger than Wikipedia, and if I plan to invent something better than Elasticsearch (a hard task, good luck).
For example, Wikipedia is less than 1 TB of text.
Finally
If you do it with Elasticsearch you are done in one week, maybe two, and in production. If you do it yourself, you will need at least one senior developer, a data scientist and an architect, and something like a year or more depending on the volume of text you want to index. I can't stop asking myself "what is it for?".
Actually, if you read the source code of Lucene you will know exactly what you need. They have done it; Lucene is the engine of Elasticsearch.
Twitter uses Lucene for its real-time search.

Related

How do I get a quick and dirty recognition of possible typos in .net?

I have to manually go through a long list of terms (~3500) which have been entered by users through the years. Among other things, I want to reduce the list by looking for synonyms, typos and alternate spellings.
My work will be much easier if I can group the list into clusters of possible typos before starting. I was imagining using some metric which can calculate the similarity to a term, e.g. in percent, and then clustering everything which has a similarity higher than some threshold. As I am going through it manually anyway, I don't mind a high failure rate, if it can keep the whole thing simple.
Ideally, there exists some easily available library to do this for me, implemented by people who know what they are doing. If there is no such, then at least one calculating a similarity metric for a pair of strings would be great, I can manage the clustering myself.
If this is not available either, do you know of a good algorithm which is simple to implement? I was first thinking that Hamming distance divided by word length would be a good metric, but noticed that while it will catch swapped letters, it won't handle deletions and insertions well (ptgs-1 will be caught as very similar to ptgs/1, but hematopoiesis won't be caught as very similar to haematopoiesis).
As for the requirements on the library/algorithm: it has to rely completely on spelling. I know that the usual NLP libraries don't work this way, but
there is no full text available for it to consider context.
it can't use a dictionary corpus of words, because the terms are far outside of any everyday language, frequently abbreviations of highly specialized terms.
Finally, I am most familiar with C# as a programming language, and I already have a C# pseudoscript which does some preliminary cleanup. If there is no one-step solution (feed list in, get grouped list out), I will prefer a library I can call from within a .NET program.
The whole thing should be relatively quick to learn for somebody with almost no previous knowledge in information retrieval. This will save me maybe 5-6 hours of manual work, and I don't want to spend more time than that in setting up an automated solution. OK, maybe up to 50% longer if I get the chance to learn something awesome :)
The question: What should I use, a library, or an algorithm? Which ones should I consider? If what I need is a library, how do I recognize one which is capable of delivering results based on spelling alone, as opposed to relying on context or dictionary use?
edit To clarify, I am not looking for actual semantic relatedness the way search or recommendation engines need it. I need to catch typos. So, I am looking for a metric by which mouse and rodent have zero similarity, but mouse and house have a very high similarity. And I am afraid that tools like Lucene use a metric which gets these two examples wrong (for my purposes).
Basically you are looking to cluster terms according to Semantic Relatedness.
One (hard) way to do it is to follow the Markovitch and Gabrilovitch approach.
A quicker way consists of the following steps:
Download a Wikipedia dump and an open source Information Retrieval library such as Lucene (or Lucene.NET).
Index the files.
Search each term in the index - and get a vector - denoting how relevant the term (the query) is for each document. Note that this will be a vector of size |D|, where |D| is the total number of documents in the collection.
Cluster your vectors in any clustering algorithm. Each vector represents one term from your initial list.
If you are interested only in "visual" similarity (words that are written similarly to each other), then you can settle for Levenshtein distance, but it won't be able to give you the semantic relatedness of terms. For example, you won't be able to relate "fall" and "autumn".
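For the "visual similarity" route, a sketch of a normalized Levenshtein score is below (Python for brevity; the same few lines port directly to C#). It scores spelling variants high and unrelated words low, which matches the mouse/house vs. mouse/rodent requirement in the question:

    def levenshtein(a, b):
        """Classic dynamic-programming edit distance (insert, delete, substitute)."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    def similarity(a, b):
        """1.0 = identical spelling, 0.0 = completely different."""
        longest = max(len(a), len(b)) or 1
        return 1.0 - levenshtein(a, b) / longest

    print(similarity("hematopoiesis", "haematopoiesis"))  # ~0.93, likely variants
    print(similarity("ptgs-1", "ptgs/1"))                 # ~0.83
    print(similarity("mouse", "rodent"))                  # ~0.17, not a typo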

Would my approach to fuzzy search, for my dataset, be better than using Lucene?

I want to implement a fuzzy search facility in the web app I'm currently working on. The back-end is in Java, and it just so happens that the search engine that everyone recommends on here, Lucene, is coded in Java as well. I, however, am shying away from using it for several reasons:
I would feel accomplished building something of my own.
Lucene has a plethora of features that I don't see myself utilizing; I'd like to minimize bloat.
From what I understand, Lucene's fuzzy search implementation manually evaluates the edit distances of each term indexed. I feel the approach I want to take (detailed below) would be more efficient.
The data to be indexed could potentially be the entire set of nouns and pronouns in the English language, so you can see how Lucene's approach to fuzzy search makes me wary.
What I want to do is take an n-gram based approach to the problem: read and tokenize each item from the database and save them to disk in files named by a given n-gram and its location.
For example: let's assume n = 3 and my file-naming scheme is something like: [n-gram]_[location_of_n-gram_in_string].txt.
The file bea_0.txt would contain:
bear
beau
beacon
beautiful
beats by dre
When I receive a term to be searched, I can simply tokenize it into n-grams, and use them along with their corresponding locations to read in the corresponding n-gram files (if present). I can then perform any filtering operations (eliminating those not within a given length range, performing edit distance calculations, etc.) on this set of data instead of doing so for the entire dataset.
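To sanity-check the idea, here is a tiny in-memory sketch of that pre-filter, with a dictionary keyed by (n-gram, position) standing in for files like bea_0.txt (the extra terms and the misspelled query are made up for illustration):

    from collections import defaultdict

    N = 3  # trigram size, as in the example above

    def ngrams(term):
        """Yield (position, n-gram) pairs: 'beacon' -> (0,'bea'), (1,'eac'), ..."""
        term = term.lower()
        for i in range(len(term) - N + 1):
            yield i, term[i:i + N]

    # Each (n-gram, position) bucket plays the role of a file such as bea_0.txt.
    index = defaultdict(set)

    def add_term(term):
        for pos, gram in ngrams(term):
            index[(gram, pos)].add(term)

    def candidates(query):
        """Union of every bucket the query's n-grams point at; edit distance
        and length filtering then only run over this small set."""
        result = set()
        for pos, gram in ngrams(query):
            result |= index.get((gram, pos), set())
        return result

    for t in ["bear", "beau", "beacon", "beautiful", "beats by dre",
              "zebra", "python"]:
        add_term(t)

    print(candidates("beacn"))  # the 'bea...' terms, but not 'zebra' or 'python'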
My question is... well I guess I have a couple of questions.
Have there been any improvements in Lucene's fuzzy search that I'm not aware of that would make my approach unnecessary?
Is this a good approach to implement fuzzy-search, (considering the set of data I'm dealing with), or is there something I'm oversimplifying/missing?
Lucene 3.x fuzzy query used to evaluate the Levenshtein distance between the queried term and every index term (a brute-force approach). Given that this approach is rather inefficient, the Lucene spellchecker used to rely on something similar to what you describe: Lucene would first search for terms with n-grams similar to the queried term and would then score these terms according to a string distance (such as Levenshtein or Jaro-Winkler).
However, this has changed a lot in Lucene 4.0 (an ALPHA preview was released a few days ago): FuzzyQuery now uses a Levenshtein automaton to efficiently intersect the terms dictionary. This is so much faster that there is now a new direct spellchecker that doesn't require a dedicated index and directly intersects the terms dictionary with an automaton, similarly to FuzzyQuery.
For the record, as you are dealing with an English corpus, Lucene (or Solr, but I guess you could use them in vanilla Lucene) has some phonetic analyzers that might be useful (DoubleMetaphone, Metaphone, Soundex, RefinedSoundex, Caverphone).
Lucene 4.0 alpha was just released; many things are easier to customize now, so you could also build upon it and create a custom fuzzy search.
In any case, Lucene has many years of performance improvements behind it, so you would hardly be able to achieve the same performance. Of course, it might be good enough for your case...

Efficient Phrase Matching Algorithm

I have a set of about 7 Million phrases to be matched to about 300 Million queries.
Queries can be sub-strings or contain the phrases themselves. Basically I want a measure of 'similarity' between two phrases [ not necessarily the edit distance ]
Can someone give some pointers to efficient algorithms to do this. I would prefer distributed algorithms since I am going to do this on Hadoop via streaming using python.
Bed trees look interesting
Bed-Tree: An All-Purpose Index Structure for String Similarity Search Based on Edit Distance (Pdf of presentation)
This is at least not trivial, because you have a lot of data on one side and even more on the other.
The simplest approach would be to build a Lucene index on the 7 million phrases and let the Hadoop job query that index. I am not quite sure whether you need a Solr server for that, or any similar implementation in Python.
The mapper should write out the phrase ID or line number, whatever you have to identify it, or at least the phrase itself, along with the matching score.
In the reduce step you could group by a phrase key and write out all the related phrases with their scores (or whatever else you want).
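A minimal Hadoop-streaming mapper along those lines could look like the sketch below; the actual lookup against the phrase index (Lucene/Solr or anything else) is deliberately left as a hypothetical stub called find_candidates, since that part depends on how you host the 7 million phrases. The reducer then only has to group the emitted lines by phrase ID.

    #!/usr/bin/env python
    # Streaming mapper sketch: one query per stdin line, emits
    # "phrase_id <TAB> query <TAB> score" for every candidate phrase.
    import sys

    def find_candidates(query):
        """Hypothetical stand-in: return (phrase_id, phrase, score) tuples,
        e.g. from an HTTP call to a Solr/Elasticsearch server."""
        return []  # replace with the real phrase-index lookup

    for line in sys.stdin:
        query = line.strip()
        if not query:
            continue
        for phrase_id, phrase, score in find_candidates(query):
            print("%s\t%s\t%.4f" % (phrase_id, query, score))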
For similarity you can read further here:
Similarity of Apache Lucene
Apache Lucene itself

how to cluster evolving data streams

I want to incrementally cluster text documents, reading them as data streams, but there seems to be a problem. Most of the term weighting options are based on the vector space model with TF-IDF as the weight of a feature. However, in our case the IDF of an existing attribute changes with every new data point, so the previous clustering no longer remains valid, and hence popular algorithms like CluStream, CURE and BIRCH, which assume fixed-dimensional static data, cannot be applied.
Can anyone redirect me to any existing research related to this or give suggestions? Thanks !
Have you looked at
TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams
Here's an idea off the top of my head:
What's your input data like? I'm guessing it's at least similarly themed, so you could start with a base phrase dictionary and use that for the IDF. Apache Lucene is a great indexing engine. Since you have a base dictionary, you can run k-means or whatever you'd like. As documents come in, you'll have to rebuild the dictionary at some frequency (which can be off-loaded to another thread/machine/etc.) and then re-cluster.
With the data indexed in a high-performance, flexible engine like Lucene, you could run queries even as new documents are being indexed. I bet if you do some research on different clustering algorithms you'll find some good ideas.
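One way to picture the "base dictionary plus periodic rebuild" idea is the sketch below. It assumes scikit-learn rather than Lucene purely to keep the example short, and the rebuild frequency and corpus are made up:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    REBUILD_EVERY = 1000   # hypothetical: rebuild dictionary/clusters this often
    N_CLUSTERS = 2

    base_corpus = [                      # initial, similarly themed documents
        "stream clustering of text documents",
        "incremental tf idf weighting for text",
        "football match results from today",
        "basketball league scores and results",
    ]
    seen = list(base_corpus)

    # The IDF weights are frozen from the base corpus, so new documents can be
    # assigned to clusters without invalidating earlier assignments.
    vectorizer = TfidfVectorizer().fit(base_corpus)
    kmeans = KMeans(n_clusters=N_CLUSTERS).fit(vectorizer.transform(base_corpus))

    def on_new_document(text):
        global vectorizer, kmeans
        seen.append(text)
        cluster = kmeans.predict(vectorizer.transform([text]))[0]
        if len(seen) % REBUILD_EVERY == 0:
            # In practice, off-load this rebuild to another thread/machine.
            vectorizer = TfidfVectorizer().fit(seen)
            kmeans = KMeans(n_clusters=N_CLUSTERS).fit(vectorizer.transform(seen))
        return cluster

    print(on_new_document("latest football scores"))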
Some interesting paper/links:
http://en.wikipedia.org/wiki/Document_classification
http://www.scholarpedia.org/article/Text_categorization
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
Without more information, I can't see why you couldn't re-cluster every once in a while. You might wanna take a look at some of the recommender systems already out there.

Latent Semantic Indexing

It is said that through LSI, the matrices that are produced, U, S and V, bring together documents which have synonyms. For example, if we search for "car", we also get documents which contain "automobile". But LSI is nothing but manipulations of matrices. It only takes into account the frequency, not semantics. So what's the thing behind this magic that I am missing? Please explain.
LSI basically creates a frequency profile of each document, and looks for documents with similar frequency profiles. If the remainder of the frequency profile is enough alike, it'll classify two documents as being fairly similar, even if one systematically substitutes some words. Conversely, if the frequency profiles are different, it can/will classify documents as different, even if they share frequent use of a few specific terms (e.g., "file" being related to a computer in some cases, and a thing that's used to cut and smooth metal in other cases).
LSI is also typically used with relatively large groups of documents. The other documents can help in finding similarities as well -- even if documents A and B look substantially different, if document C uses quite a few terms from both A and B, it can help in finding that A and B are really fairly similar.
According to the Wikipedia article, "LSI is based on the principle that words that are used in the same contexts tend to have similar meanings." That is, if two words seem to be used interchangeably, they might be synonyms.
It's not infallible.
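A tiny numpy sketch of that "frequency profile" intuition is below: "car" and "automobile" never co-occur in the toy counts, but because they show up in similar contexts, their vectors land close together once the matrix is reduced to a couple of latent concepts (the counts are invented for illustration):

    import numpy as np

    terms = ["car", "automobile", "engine", "banana"]
    # Rows = terms, columns = documents, entries = raw counts.
    counts = np.array([
        [2, 0, 1, 0],   # car
        [0, 2, 1, 0],   # automobile
        [1, 1, 2, 0],   # engine
        [0, 0, 0, 3],   # banana
    ])

    U, s, Vt = np.linalg.svd(counts, full_matrices=False)
    k = 2                              # keep only the 2 strongest "concepts"
    term_vectors = U[:, :k] * s[:k]    # each term as a point in concept space

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cos(term_vectors[0], term_vectors[1]))  # car vs automobile: high
    print(cos(term_vectors[0], term_vectors[3]))  # car vs banana: low (~0)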
