Score equivalence Oracle Text / Lucene

Is there an equivalence between the scores Oracle Text would calculate and those Lucene would calculate?
Would it be possible to mix both sources and get one unified result set, ranked by score?

Scores are not comparable between queries or across data changes in Lucene, much less comparable to those of another technology. The Lucene score of a given document can change dramatically when other documents are added to or removed from the index. Scoring as a percentage of the maximum score looks like the obvious solution, but the same problems remain, and another algorithm in another technology will likely produce a different score distribution. You can read about why you should not compare scores like this here and here.
One way I managed to lash something similar together was to fetch the matches from the other data source, create a temporary index in a RAMDirectory, and then search again, incorporating it with a MultiSearcher. That way everything gets scored on a single, cohesive data set, within a single search. The scoring should be reasonable enough, though this isn't exactly the most efficient way to search.
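A minimal sketch of that idea, assuming a reasonably recent Lucene where a MultiReader plays the role the old MultiSearcher used to; the field name "body" and the query handling are made up for illustration:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class CombinedSearchSketch {

    public static TopDocs searchBothSources(Directory mainIndex,
                                            Iterable<String> externalRows,
                                            String userQuery) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // 1. Build a throwaway in-memory index from the rows fetched from the other source.
        //    RAMDirectory is the class mentioned above; newer Lucene versions
        //    deprecate it in favour of ByteBuffersDirectory.
        Directory tempIndex = new RAMDirectory();
        try (IndexWriter writer = new IndexWriter(tempIndex, new IndexWriterConfig(analyzer))) {
            for (String row : externalRows) {
                Document doc = new Document();
                doc.add(new TextField("body", row, Field.Store.YES));
                writer.addDocument(doc);
            }
        }

        // 2. Search both indexes as one data set so everything shares the same statistics.
        IndexReader combined = new MultiReader(DirectoryReader.open(mainIndex),
                                               DirectoryReader.open(tempIndex));
        IndexSearcher searcher = new IndexSearcher(combined);
        Query query = new QueryParser("body", analyzer).parse(userQuery);
        return searcher.search(query, 10);
    }
}
```

Because both indexes are opened through one reader and searched by one IndexSearcher, the term statistics used for scoring come from the combined data set, which is the whole point of the trick.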

Related

Hold Elasticsearch document frequency constant as index changes

I'm using Elasticsearch to retrieve XML documents by terms. I have multiple indexes, one for each day. I have a large collection of documents that is, in some sense, representative. The document frequency of several terms varies from day to day.
The matching I'm doing depends on the inverse document frequency of terms. I'd like to not use the IDF of the indices I'm searching, and instead use the IDF based on the large, representative set. Is there a straightforward way to do this without writing custom scoring functions for large, complex queries?
There is no other way.
FWIW, to access and use IDF you need to write a custom ScriptEngine in Elasticsearch, and probably use a script based on that engine for sorting.
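This is not the custom ScriptEngine the answer refers to, but as a rough, hypothetical illustration of the underlying idea (keeping the term statistics frozen outside the daily indices), here is a sketch that passes a precomputed IDF table into a script_score query as parameters, using the Java low-level REST client against Elasticsearch 7.x. The index name, the tokens keyword field, and the IDF numbers are all made up:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class FrozenIdfSearchSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Hypothetical daily index and a keyword field holding each document's terms.
            Request request = new Request("GET", "/docs-2024.01.01/_search");
            request.setJsonEntity("{\n"
                + "  \"query\": {\n"
                + "    \"script_score\": {\n"
                + "      \"query\": { \"terms\": { \"tokens\": [\"alpha\", \"beta\"] } },\n"
                + "      \"script\": {\n"
                + "        \"source\": \"double s = 0; for (def t : params.terms) { if (doc['tokens'].contains(t)) { s += params.idf[t]; } } return s;\",\n"
                + "        \"params\": {\n"
                + "          \"terms\": [\"alpha\", \"beta\"],\n"
                + "          \"idf\": { \"alpha\": 2.1, \"beta\": 4.7 }\n"
                + "        }\n"
                + "      }\n"
                + "    }\n"
                + "  }\n"
                + "}");
            Response response = client.performRequest(request);
            System.out.println(response.getStatusLine());
        }
    }
}
```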

Is there a way to show at what percentage a selected document is similar to others on ElasticSearch?

I need to build a search engine using Elasticsearch and the steps will be as following:
Search on the search engine with a search string.
The relevant results will display and I can click on these documents.
If I select a document, I will be redirected to another page where I will see all the details of the document and will have an option "More Like This" (which will return documents similar to the selected document). I know that this is done using the MLT query.
Now my question is: Except for returning documents similar to the selected one, how can I also return at what percentage the documents are similar to the selected one?
There are a couple of things you can do.
using function_score query
A more_like_this query is essentially a full-text search, and it returns documents ordered by their relevance score. It might be possible to convert the score directly into a percentage, but it is not advised (here, and more specifically here).
Instead, one can define a custom score with the help of a function_score query, which can be designed so that it returns a meaningful percentage.
This, of course, comes with the additional cost of complexity, and the definition of "similarity" becomes more of an art than a science.
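For illustration only, here is a hedged sketch of that idea with the Java low-level REST client (Elasticsearch 7.x assumed): the more_like_this score is squashed into the 0..1 range with a simple saturation curve, _score / (_score + pivot), which can then be displayed as a percentage. The index name, the document id and the pivot value are made-up assumptions, and the curve itself is one possible design choice, not a standard:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class MltPercentageSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("GET", "/articles/_search");
            // more_like_this wrapped in function_score; the function's result
            // replaces the raw score and lands in the 0..1 range.
            request.setJsonEntity("{\n"
                + "  \"query\": {\n"
                + "    \"function_score\": {\n"
                + "      \"query\": {\n"
                + "        \"more_like_this\": {\n"
                + "          \"fields\": [\"title\", \"body\"],\n"
                + "          \"like\": [{ \"_index\": \"articles\", \"_id\": \"42\" }],\n"
                + "          \"min_term_freq\": 1\n"
                + "        }\n"
                + "      },\n"
                + "      \"script_score\": {\n"
                + "        \"script\": {\n"
                + "          \"source\": \"_score / (_score + params.pivot)\",\n"
                + "          \"params\": { \"pivot\": 10.0 }\n"
                + "        }\n"
                + "      },\n"
                + "      \"boost_mode\": \"replace\"\n"
                + "    }\n"
                + "  }\n"
                + "}");
            Response response = client.performRequest(request);
            System.out.println(response.getStatusLine());
        }
    }
}
```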
using dense_vector
One may opt to use the (yet experimental) dense_vector data type, which allows storing and comparing dense vectors (that is, arrays of numbers of fixed size). Here's an article that describes this approach very well: Text similarity search with vector fields.
In this case the definition of similarity is as precise as it can possibly be: a distance of two vectors in a multidimensional space, which can be computed via, for instance, cosine similarity.
However, such dense vectors have to be computed somehow, and the quality of the similarity will only be as good as the quality of those vectors.
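A hedged sketch of the vector route, following the approach the linked article describes for Elasticsearch 7.3 (which uses the doc['...'] form of cosineSimilarity; newer releases take the field name as a string instead). The index name, the embedding field and the toy query vector are assumptions; shifting by +1 and dividing by 2 maps the cosine into a 0..1 "percentage":

```java
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class VectorSimilaritySketch {
    // Rank documents by cosine similarity of a stored dense_vector against a query vector.
    public static Response rankByCosine(RestClient client) throws Exception {
        Request request = new Request("GET", "/articles/_search");
        request.setJsonEntity("{\n"
            + "  \"query\": {\n"
            + "    \"script_score\": {\n"
            + "      \"query\": { \"match_all\": {} },\n"
            + "      \"script\": {\n"
            + "        \"source\": \"(cosineSimilarity(params.query_vector, doc['embedding']) + 1.0) / 2.0\",\n"
            + "        \"params\": { \"query_vector\": [0.12, -0.43, 0.88] }\n"
            + "      }\n"
            + "    }\n"
            + "  }\n"
            + "}");
        return client.performRequest(request);
    }
}
```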
The bottom line is that to make this work with Elasticsearch, a fair amount of computation and logic has to be added outside of it, either in the form of pre-computed models or custom, curated scoring algorithms. Out of the box, Elasticsearch does not seem to be a good fit for percentage-style similarity.
Hope that helps!
If you're going the route of using semantic search via dense_vector, as Nikolay mentioned, I would recommend NBoost. NBoost has a good out-of-the-box system for improving Elasticsearch results with SOTA models.

How important is it to use separate indices for percolator queries and their documents?

The ElasticSearch documentation on the Percolate query recommends using separate indices for the query and the document being percolated:
Given the design of percolation, it often makes sense to use separate indices for the percolate queries and documents being percolated, as opposed to a single index as we do in examples. There are a few benefits to this approach:
Because percolate queries contain a different set of fields from the percolated documents, using two separate indices allows for fields to be stored in a denser, more efficient way.
Percolate queries do not scale in the same way as other queries, so percolation performance may benefit from using a different index configuration, like the number of primary shards.
(From the bottom of the page here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-percolate-query.html)
I understand this in theory, but I'd like to know more about how necessary this is for a large index (say, 1 million registered queries).
The tradeoff in my case is that maintaining a separate index for the documents is quite a bit of extra work, mainly because both indices need to stay "in sync". This is difficult to guarantee without transactions, so I'm wondering whether the effort is worth it for the scale I need.
In general I'm interested in any advice regarding the design of the index/mapping so that it can be queried efficiently. Thanks!
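Not an answer on how necessary the split is, but for concreteness, here is a hedged sketch of the two-index layout the documentation describes, using the Java low-level REST client; the index names, the message field and the sample query are illustrative assumptions:

```java
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class PercolatorSetupSketch {
    public static void setUpAndPercolate(RestClient client) throws Exception {
        // Dedicated index for the registered queries: a percolator field for the
        // query itself, plus the document fields those queries refer to (needed
        // so the queries can be parsed at registration time).
        Request createQueries = new Request("PUT", "/queries");
        createQueries.setJsonEntity("{\n"
            + "  \"mappings\": {\n"
            + "    \"properties\": {\n"
            + "      \"query\":   { \"type\": \"percolator\" },\n"
            + "      \"message\": { \"type\": \"text\" }\n"
            + "    }\n"
            + "  }\n"
            + "}");
        client.performRequest(createQueries);

        // Register one query.
        Request register = new Request("PUT", "/queries/_doc/1");
        register.setJsonEntity("{ \"query\": { \"match\": { \"message\": \"error timeout\" } } }");
        client.performRequest(register);

        // Percolate an incoming document against the registered queries only;
        // the document itself never has to be stored in this index.
        Request percolate = new Request("GET", "/queries/_search");
        percolate.setJsonEntity("{\n"
            + "  \"query\": {\n"
            + "    \"percolate\": {\n"
            + "      \"field\": \"query\",\n"
            + "      \"document\": { \"message\": \"connection error: timeout after 30s\" }\n"
            + "    }\n"
            + "  }\n"
            + "}");
        client.performRequest(percolate);
    }
}
```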

Would my approach to fuzzy search, for my dataset, be better than using Lucene?

I want to implement a fuzzy search facility in the web app I'm currently working on. The back end is in Java, and it just so happens that the search engine everyone recommends on here, Lucene, is coded in Java as well. I am, however, shying away from using it for several reasons:
I would feel accomplished building something of my own.
Lucene has a plethora of features that I don't see myself utilizing; I'd like to minimize bloat.
From what I understand, Lucene's fuzzy search implementation manually evaluates the edit distances of each term indexed. I feel the approach I want to take (detailed below), would be more efficient.
The data to be indexed could potentially be the entire set of nouns and pronouns in the English language, so you can see how Lucene's approach to fuzzy search makes me wary.
What I want to do is take an n-gram based approach to the problem: read and tokenize each item from the database and save them to disk in files named by a given n-gram and its location.
For example: let's assume n = 3 and my file-naming scheme is something like: [n-gram]_[location_of_n-gram_in_string].txt.
The file bea_0.txt would contain:
bear
beau
beacon
beautiful
beats by dre
When I receive a term to be searched, I can simply tokenize it into n-grams and use them, along with their corresponding locations, to read from the corresponding n-gram files (if present). I can then perform any filtering operations (eliminating entries not within a given length range, performing edit distance calculations, etc.) on this set of data instead of on the entire dataset.
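To make the bucketing concrete, here is a small sketch (in Java, since that is the back end) of the positional n-gram keys described above; file I/O and the later filtering/edit-distance steps are left out:

```java
import java.util.ArrayList;
import java.util.List;

public class PositionalNgrams {

    // "beacon", n = 3  ->  ["bea_0", "eac_1", "aco_2", "con_3"]
    // Both the indexer and the query side derive the same keys, so a query term
    // only has to be compared against the candidates sharing at least one bucket.
    public static List<String> keys(String term, int n) {
        List<String> result = new ArrayList<>();
        String normalized = term.toLowerCase();
        for (int i = 0; i + n <= normalized.length(); i++) {
            result.add(normalized.substring(i, i + n) + "_" + i);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(keys("beacon", 3));    // shares the "bea_0" bucket with "beautiful"
        System.out.println(keys("beautiful", 3));
    }
}
```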
My question is... well I guess I have a couple of questions.
Have there been any improvements in Lucene's fuzzy search that I'm not aware of that would make my approach unnecessary?
Is this a good approach to implement fuzzy-search, (considering the set of data I'm dealing with), or is there something I'm oversimplifying/missing?
Lucene 3.x's fuzzy query used to evaluate the Levenshtein distance between the queried term and every index term (a brute-force approach). Given that this approach is rather inefficient, the Lucene spellchecker used to rely on something similar to what you describe: Lucene would first search for terms with n-grams similar to those of the queried term, and would then rank those terms according to a string distance (such as Levenshtein or Jaro-Winkler).
However, this has changed a lot in Lucene 4.0 (an ALPHA preview was released a few days ago): FuzzyQuery now uses a Levenshtein automaton to efficiently intersect the terms dictionary. This is so much faster that there is now a new direct spellchecker that doesn't require a dedicated index and directly intersects the terms dictionary with an automaton, similarly to FuzzyQuery.
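For reference, a minimal sketch of what using the automaton-backed FuzzyQuery looks like; the field name and the maximum edit distance are assumptions:

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;

public class FuzzySearchSketch {
    public static TopDocs fuzzy(Directory index, String misspelled) throws Exception {
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(index));
        // Allow up to 2 edits between the query term and the indexed terms;
        // the query term is compiled into a Levenshtein automaton and
        // intersected with the terms dictionary instead of scanning every term.
        FuzzyQuery query = new FuzzyQuery(new Term("word", misspelled), 2);
        return searcher.search(query, 10);
    }
}
```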
For the record, since you are dealing with an English corpus, Lucene (or Solr, though I guess you could use them in vanilla Lucene) has some phonetic analyzers that might be useful (DoubleMetaphone, Metaphone, Soundex, RefinedSoundex, Caverphone).
Lucene 4.0 alpha was just released, and many things are easier to customize now, so you could also build upon it and create a custom fuzzy search.
In any case, Lucene has many years of performance improvements behind it, so you would hardly be able to achieve the same performance. Of course, it might be good enough for your case...

Efficient Phrase Matching Algorithm

I have a set of about 7 Million phrases to be matched to about 300 Million queries.
Queries can be substrings of the phrases or contain the phrases themselves. Basically, I want a measure of 'similarity' between two phrases (not necessarily the edit distance).
Can someone give me some pointers to efficient algorithms for doing this? I would prefer distributed algorithms, since I am going to do this on Hadoop via streaming using Python.
Bed-trees look interesting:
Bed-Tree: An All-Purpose Index Structure for String Similarity Search Based on Edit Distance (PDF of the presentation)
This is at least not trivial, because you have a great deal of data on one side and even more on the other.
The simplest approach would be to build a Lucene index on the 7 million phrases and let the Hadoop job query that index. I'm not quite sure whether you need a Solr server for that, or whether there is a similar implementation in Python.
The mapper should write out the phrase ID or line number, whatever you have to identify it (or at least the phrase itself), along with the matching score, as in the sketch below.
In the reduce step you can then reduce on the phrase key and write out all the related phrases with their scores (or whatever else you need).
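Here is a rough sketch of that mapper-side step, written as plain Java rather than the Python streaming mapper you would actually use; the id and phrase field names, the analyzer and the top-20 cut-off are assumptions, and printing "phraseId <tab> score" mirrors what a streaming mapper would emit for the reducer:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;

public class PhraseMatchSketch {
    // For each incoming query string, look up the best-matching phrases in a
    // prebuilt index over the 7M phrases and emit "phraseId \t score".
    public static void emitMatches(Directory phraseIndex, String queryString) throws Exception {
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(phraseIndex));
        Query query = new QueryParser("phrase", new StandardAnalyzer())
                .parse(QueryParser.escape(queryString));
        for (ScoreDoc hit : searcher.search(query, 20).scoreDocs) {
            Document doc = searcher.doc(hit.doc);
            // key = phrase id, value = similarity score; the reducer groups by key.
            System.out.println(doc.get("id") + "\t" + hit.score);
        }
    }
}
```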
For similarity you can read further here:
Similarity of Apache Lucene
Apache Lucene itself
