Is there a way to show at what percentage a selected document is similar to others on ElasticSearch? - elasticsearch

I need to build a search engine using Elasticsearch and the steps will be as following:
Search on the search engine with a search string.
The relevant results will display and I can click on these documents.
If I select a document, I will be redirected to another page where I will see all the details of the documents and will have an option "More Like This" (which will return documents similar to the selected document). I know that this is done using the MLT query.
Now my question is: Except for returning documents similar to the selected one, how can I also return at what percentage the documents are similar to the selected one?

There are a couple of things you can do.
using function_score query
more_like_this query is essentially a full text search, and it returns documents ordered by their relevance score. It could be possible to convert the score directly to a percentage, but it is not advised (here
and more specifically here).
Instead one can define a custom score with help of a function_score query, which can be designed so it returns a meaningful percentage.
This, of course, comes with additional cost of complexity, and the definition of "similarity" becomes more of an art than of science.
using dense_vector
One may opt to use the (yet experimental) dense_vector data type, which allows storing and comparing dense vectors (that is, arrays of numbers of fixed size). Here's an article that describes this approach very well: Text similarity search with vector fields.
In this case the definition of similarity is as precise as it can possibly be: a distance of two vectors in a multidimensional space, which can be computed via, for instance, cosine similarity.
However, such dense vectors have to be somehow computed, and the quality of said vectors will equal the quality of the similarity itself.
As the bottom line I must say that to make this work with Elasticsearch a bunch of computation and logic should be added outside, either in form of pre-computed models, or custom curated scoring algorithms. Elasticsearch out of the box does not seem to be a good percentage-similarity kind of deal.
Hope that helps!

If you're going the route of using semantic search via dense_vector, as Nikolay mentioned, I would recommend NBoost. NBoost has a good out-of-the-box systems for improving Elasticsearch results with SOTA models.

Related

Does elastic search use previous search frequencies?

Does elastic search utilize the frequency of a previously searched document. For example there are document A and document B. Both have similar score in terms of edit distances and other metrics however document A is very frequently searched and B is not. Will elastic search score A better than B. If not, how to acheive this?
Elasticsearch does not change score based on previous searches in its default scoring algorithm. In fact, this is really a question about Lucene scoring, since Elasticsearch uses it for all of the actual Search logic.
I think you may be looking at this from the wrong viewpoint. Users search with a query, and Elasticsearch recommends documents. You have no way of knowing if the document it recommended was valid or not just based on the search. I think your question should really be, "How can I tune Search relevance in an intelligent way based on user data?".
Now, there are a number of ways you can achieve this, but they require you to gather user data and build the model yourself. So unfortunately, there is no easy way.
However, I would recommend taking a look at https://www.elastic.co/app-search/, which offers a managed solution with lots of custom relevant tuning which may save you lots of time depending on your use case.

How do normalization and internal optimization of boosting work? And how does that affect the relevance?

I'm new to elastic search. I'm having trouble understanding the calibration and scaling of boost values for fields in a document. As in how should we decide the boosting values for field so that it works as expected. I've gone through some of the online blogs and es doc as well, it's written that es does normalization and internal optimization of boosting values? How does that work?
E.g.: If we have tags, title, name and text fields in our doc, how should we decide the boosting values for these?
Elasticsearch uses a boolean model to match documents, and then a scoring model to determine relevance (i.e. ranking). The scoring model utilizes a TF/IDF score, coupled with some additional features. Those TF/IDF scores are calculated for each matching field within a query, and then aggregated to produce an overall score for a document. To dig into this process, I suggest running explain on your query to see how the score of each field is influencing the overall relevance of your document.
As the expert on your data, you're in the best position to determine which fields should most heavily influence the relevance of your document. Finding the right boost value for a field is about adjusting the levers until you find a formula that best suites your desired outcome (Also, if you have users, A/B testing can help here).

Algorithm to recognize keywords' categories in a One-search-box-for-all model query

I'm aiming at providing one-search-box-for-everything model in search engine project, like LinkedIn.
I've tried to express my problem using an analogy.
Let's assume that each result is an article and has multiple dimensions like author, topic, conference (if that's a publication), hosted website, etc.
Some sample queries:
"information retrieval papers at IEEE by authorXYZ": three dimensions {topic, conf-name, authorname}
"ACM paper by authoABC on design patterns" : three dimensions {conf-name, author, topic}
"Multi-threaded programming at javaranch" : two dimensions {topic, website}
I've to identify those dimensions and corresponding keywords in a big query before I can retrieve the final result from the database.
Points
I've access to all the possible values to all the dimensions. For example, I've all the conference names, author names, etc.
There's very little overlap of terms across dimensions.
My approach (naive)
Using Lucene, index all the keywords in each dimension with a dedicated field called "dimension" and another field with actual value.
Ex:
1) {name:IEEE, dimension:conference}, etc.
2) {name:ooad, dimension:topic}, etc.
3) {name:xyz, dimension:author}, etc.
Search the index with the query as-it-is.
Iterate through results up to some extent and recognize first document with a new dimension.
Problems
Not sure when to stop recognizing the dimensions from the result set. For example, the query may contain only two dimensions but the results may match 3 dimensions.
If I want to include spell-checking as well, it becomes more complex and the results tend to be less accurate.
References to papers, articles, or pointing-out the right terminology that describes my problem domain, etc. would certainly help.
Any guidance is highly appreciated.
Solution 1: Well how about solving your problem using Natural Language Processing Named Entity Recognition (NER). Now NER can be done using simple Regular Expressions (in case where the data is too static) or else you can use some Machine Learning Technique like Hidden Markov Models to actually figure out the named entities in your sequence data set. Why I stress on HMM as compared to other Machine Learning Supervised algorithms is because you have sequential data with each state dependent on the previous or next state. NER would output for you the dimensions along with the corresponding name. After that your search becomes a vertical search problem and you can just search for the identified words in different Solr/Lucene fields and set your boosts accordingly.
Now coming to the implementation part, I assume you know Java as you are working with Lucene, so Mahout is a good choice. Mahout has an HMM built in and you can train+test the model on your data set. I am also assuming you have large data set.
Solution 2: Try to model this problem as a property graph problem. Check out something like Neo4j. I suggest this as your problem falls under schema less domain. Your schema is not fixed and problem very well can be modelled as a graph where each node would be a set of key value pairs.
Solution 3: As you said that you have all possible values of dimensions than before anything else why not simply convert all your unstructured data from your text to structured data by using Regular Expressions and again as you do not have fixed schema so store the data in any NoSQL key value database. Most of them provided Lucene Integrations for full text search, then simply search on those database.
what you need to do is to calculate the similarity between the query and the document set you are looking in. Measures like cosine similarity should serve your need. However a hack that you can use is calculate the Tf/idf for the document and create an index using that score from there you can choose the appropriate one. I would recommend you to look into Vector Space Model to find a method that serves your need!!
give this algorithm a look aswell
http://en.wikipedia.org/wiki/Okapi_BM25

Score equivalence Oracle Text / Lucene

Is there an equivalence between the scores an Oracle Text Score would calculate and a Lucene one ?
Would you be able to mix the sources to get one unified resultset through the score ?
Scores are not comparable between queries or data changes in Lucene, much less being comparable to another technology. Lucene scores of the same document can be changed dramatically by having other documents added or removed from the index. Scoring as a percentage of maximum becomes the obvious solution, but the same problems remain, as well as that other algorithms in another technology will ikely render different distribution. You can read about why you should not compare scores like this here and here
A way I managed to lash something similar together was to fetch matches from the other data source, and create a temporary index in a RAMDirectory, and then search again incorporating it with a MultiSearcher. That way everything is getting scored on a single, cohesive data set, within a single search. Scoring should be reasonable enough, though this isn't exactly the most efficient way to search.

Would my approach to fuzzy search, for my dataset, be better than using Lucene?

I want to implement a fuzzy search facility in the web-app i'm currently working on. The back-end is in Java, and it just so happens that the search engine that everyone recommends on here, Lucene, is coded in Java as well. I, however, am shying away from using it for several reasons:
I would feel accomplished building something of my own.
Lucene has a plethora of features that I don't see myself utilizing; i'd like to minimize bloat.
From what I understand, Lucene's fuzzy search implementation manually evaluates the edit distances of each term indexed. I feel the approach I want to take (detailed below), would be more efficient.
The data to-be-indexed could potentially be the entire set of nouns and pro-nouns in the English language, so you can see how Lucene's approach to fuzzy search makes me weary.
What I want to do is take an n-gram based approach to the problem: read and tokenize each item from the database and save them to disk in files named by a given n-gram and its location.
For example: let's assume n = 3 and my file-naming scheme is something like: [n-gram]_[location_of_n-gram_in_string].txt.
The file bea_0.txt would contain:
bear
beau
beacon
beautiful
beats by dre
When I receive a term to be searched, I can simply tokenize it in to n-grams, and use them along with their corresponding locations to read in to the corresponding n-gram files (if present). I can then perform any filtering operations (eliminating those not within a given length range, performing edit distance calculations, etc.) on this set of data instead of doing so for the entire dataset.
My question is... well I guess I have a couple of questions.
Has there been any improvements in Lucene's fuzzy search that I'm not aware of that would make my approach unnecessary?
Is this a good approach to implement fuzzy-search, (considering the set of data I'm dealing with), or is there something I'm oversimplifying/missing?
Lucene 3.x fuzzy query used to evaluate the Levenshtein distance between the queried term and every index term (brute-force approach). Given that this approach is rather inefficient, Lucene spellchecker used to rely on something similar to what you describe: Lucene would first search for terms with similar n-grams to the queried term and would then score these terms according to a String distance (such as Levenshtein or Jaro-Winckler).
However, this has changed a lot in Lucene 4.0 (an ALPHA preview has been released a few days ago): FuzzyQuery now uses a Levenshtein automaton to efficiently intersect the terms dictionary. This is so much faster that there is now a new direct spellchecker that doesn't require a dedicated index and directly intersects the terms dictionary with an automaton, similarly to FuzzyQuery.
For the record, as you are dealing with English corpus, Lucene (or Solr but I guess you could use them in vanilla lucene) has some Phonetic analyzers that might be useful (DoubleMetaphone, Metaphone, Soundex, RefinedSoundex, Caverphone)
Lucene 4.0 alpha was just released, many things are easier to customize now, so you could also build upon it an create a custom fuzzy search.
In any case Lucene has many years of performance improvements so you hardly would be able to achieve the same perf. Of course it might be good enough for your case...

Resources