Filtering the results of a sorted query in Lucene.NET - sorting

I'm using Lucene.NET, which is currently up to date with Lucene 2.9. I'm trying to implement a kind of select distinct, but without the need to drill down into any groups. I know that Lucene 3.2 has a faceted search that may solve this, but I don't have the time to port it to 2.9 yet.
I figure that, in any event, when you perform a paged query with a sort, Lucene has to find all the documents that match the query, sort them, then take the top N results, where N is the page size. I'd like to build something that is also applied after the sorted query has completed, but that takes the top N unique results and returns them. I'm thinking of using a HashSet and one of the indexed fields to determine uniqueness. For performance reasons, I'd rather find a way to extend something in Lucene than try to do this once the results have already been returned.
Custom filters seem to run before the main query is even applied, and custom collectors run before sorting is applied, unless you are sorting by Lucene's document id. So what is the best approach to this problem? A pointer in the direction of the right component to extend may get the answer on this one; an example implementation most definitely will. Thanks in advance.

I'd run the search without sorting and, in a custom collector, collect the results into a sorted list of size N based on "uniqueness".
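Something along these lines, assuming the Lucene.NET 2.9-era Collector and FieldCache APIs (member names shifted slightly between 2.9.x and 3.0.x, so adjust as needed). The field names "productId" (the distinct key) and "title" (the sort key) are placeholders, and both need to be indexed as single, un-analyzed terms for the FieldCache lookup to make sense:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Lucene.Net.Index;
using Lucene.Net.Search;

// Sketch: for each distinct key, remember the best sort value seen and its doc id;
// GetTopDocIds(n) then returns the first N distinct hits in sort order.
public class DistinctCollector : Collector
{
    // key -> (sort value, absolute doc id)
    private readonly Dictionary<string, KeyValuePair<string, int>> bestPerKey =
        new Dictionary<string, KeyValuePair<string, int>>();
    private string[] keys;       // per-segment FieldCache values for the distinct field
    private string[] sortVals;   // per-segment FieldCache values for the sort field
    private int docBase;

    public override void SetScorer(Scorer scorer) { /* relevance score not used */ }

    public override void SetNextReader(IndexReader reader, int docBase)
    {
        this.docBase = docBase;
        keys = FieldCache_Fields.DEFAULT.GetStrings(reader, "productId");
        sortVals = FieldCache_Fields.DEFAULT.GetStrings(reader, "title");
    }

    public override void Collect(int doc)
    {
        string key = keys[doc];
        if (key == null) return;
        string sortVal = sortVals[doc] ?? string.Empty;

        KeyValuePair<string, int> existing;
        if (!bestPerKey.TryGetValue(key, out existing) ||
            string.CompareOrdinal(sortVal, existing.Key) < 0)
        {
            bestPerKey[key] = new KeyValuePair<string, int>(sortVal, docBase + doc);
        }
    }

    public override bool AcceptsDocsOutOfOrder() { return true; }

    public IEnumerable<int> GetTopDocIds(int n)
    {
        return bestPerKey.Values
            .OrderBy(v => v.Key, StringComparer.Ordinal)   // sort by the cached field
            .Take(n)
            .Select(v => v.Value);                          // absolute doc ids
    }
}
```

Run it with searcher.Search(query, collector), then load the documents for collector.GetTopDocIds(pageSize). If the number of distinct keys can get very large, a bounded priority queue inside Collect would cap memory instead of sorting everything at the end.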

Related

ElasticSearch Search Queries Count

We have a use case for aggregating the count of Elasticsearch search queries/operations. Initially we decided to use the /_stats endpoint to aggregate results on a per-index basis. However, we would also like to explore the option of filtering search operations so we can distinguish operations by origin/source. I was wondering how we can do this efficiently. Any references to documentation or implementations would be highly appreciated.

Sort results by relevance without filtering in Algolia

Is there a way to sort results in Algolia by relevance instead of filtering them? In our case we have quite a few important attributes, but only around 700 products, so searches using facets often end up with few or no results.
To avoid this, we are looking for a way to reorder the list by relevance, showing the best results on top while still allowing users to see the other, less relevant results. Basically, not filtering products out, but just reordering them by relevance based on a combination of attributes we set.
Thanks
When setting filters leads you to few or no results, and you'd like to avoid that by still showing less relevant results, two solutions come to mind:
Use optionalFilters instead of filters. You get the same behavior as with filtering, but the Algolia API also returns results that don't match the filters and ranks them lower. This is the ideal solution, as it takes a single API round trip.
Perform a second search without filters when the first search returns fewer records than a threshold of your choice. This is a more manual approach and takes up to two API calls.
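For what it's worth, here is a rough sketch of the second approach using the Algolia .NET client (a v6-style API is assumed; the index name, attributes, record class, and threshold are all made up). The first approach would instead set the optionalFilters search parameter on the same Query object:

```csharp
using Algolia.Search.Clients;
using Algolia.Search.Models.Search;

// Hypothetical credentials and index name.
var client = new SearchClient("YourApplicationID", "YourSearchOnlyApiKey");
var index = client.InitIndex("products");

// First attempt: the strict, filtered search.
var filtered = index.Search<Product>(new Query("running shoes")
{
    Filters = "brand:Acme AND color:red"
});

// If the filters leave too few hits, fall back to an unfiltered search so the
// less relevant products are still shown below the best matches.
const int threshold = 5;
var results = filtered.NbHits >= threshold
    ? filtered
    : index.Search<Product>(new Query("running shoes"));

public class Product
{
    public string ObjectID { get; set; }
    public string Name { get; set; }
}
```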

Is there a way to show at what percentage a selected document is similar to others on ElasticSearch?

I need to build a search engine using Elasticsearch, and the steps will be as follows:
Search on the search engine with a search string.
The relevant results will display and I can click on these documents.
If I select a document, I will be redirected to another page where I will see all the details of that document and will have a "More Like This" option (which will return documents similar to the selected document). I know that this is done using the MLT query.
Now my question is: besides returning documents similar to the selected one, how can I also return the percentage to which each of those documents is similar to it?
There are a couple of things you can do.
Using a function_score query
A more_like_this query is essentially a full-text search, and it returns documents ordered by their relevance score. It could be possible to convert the score directly to a percentage, but it is not advised (here, and more specifically here).
Instead, one can define a custom score with the help of a function_score query, which can be designed so that it returns a meaningful percentage.
This, of course, comes with the additional cost of complexity, and the definition of "similarity" becomes more art than science.
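As a rough illustration (not a recipe), one could wrap the more_like_this query in a function_score whose script squashes the raw score into the 0 to 1 range; the index name "docs", the document id "1", the fields, and the constant 10 below are all arbitrary assumptions, and Elasticsearch 7.x syntax is assumed:

```csharp
using System;
using System.Net.Http;
using System.Text;

// boost_mode "replace" makes the script result the final score, and
// _score / (_score + 10) maps it into 0..1 (the 10 is an arbitrary saturation constant).
var http = new HttpClient { BaseAddress = new Uri("http://localhost:9200/") };
var body = """
{
  "query": {
    "function_score": {
      "query": {
        "more_like_this": {
          "fields": ["title", "body"],
          "like": [{ "_index": "docs", "_id": "1" }],
          "min_term_freq": 1
        }
      },
      "script_score": { "script": { "source": "_score / (_score + 10)" } },
      "boost_mode": "replace"
    }
  }
}
""";
var response = await http.PostAsync("docs/_search",
    new StringContent(body, Encoding.UTF8, "application/json"));
Console.WriteLine(await response.Content.ReadAsStringAsync());
```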
Using dense_vector
One may opt to use the (still experimental) dense_vector data type, which allows storing and comparing dense vectors (that is, arrays of numbers of fixed size). Here's an article that describes this approach very well: Text similarity search with vector fields.
In this case the definition of similarity is as precise as it can possibly be: the distance between two vectors in a multidimensional space, which can be computed via, for instance, cosine similarity.
However, such dense vectors have to be computed somehow, and the quality of the similarity will only be as good as the quality of those vectors.
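For example (Elasticsearch 7.3+ syntax, where cosineSimilarity is available in script_score; the field name title_vector and the tiny example vector are placeholders), the cosine similarity can be shifted into the 0 to 1 range and used directly as the score:

```csharp
// Query body only; post it to the _search endpoint as in the previous snippet.
// cosineSimilarity returns a value in [-1, 1]; adding 1 and halving yields a
// percentage-like score in [0, 1] (and keeps the script score non-negative).
var body = """
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "(cosineSimilarity(params.query_vector, doc['title_vector']) + 1.0) / 2.0",
        "params": { "query_vector": [0.12, -0.37, 0.91] }
      }
    }
  }
}
""";
```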
The bottom line is that to make this work with Elasticsearch, a fair amount of computation and logic has to be added outside of it, either in the form of pre-computed models or custom-curated scoring algorithms. Out of the box, Elasticsearch is not a good fit for percentage-style similarity.
Hope that helps!
If you're going the route of semantic search via dense_vector, as Nikolay mentioned, I would recommend NBoost. NBoost has a good out-of-the-box system for improving Elasticsearch results with SOTA models.

Score equivalence Oracle Text / Lucene

Is there an equivalence between the scores Oracle Text would calculate and the ones Lucene would?
Would you be able to mix the sources to get one unified result set ranked by score?
Scores are not comparable between queries or across data changes in Lucene, much less comparable to another technology. Lucene scores for the same document can change dramatically when other documents are added to or removed from the index. Scoring as a percentage of the maximum looks like the obvious solution, but the same problems remain, and the algorithms of another technology will likely produce a different distribution. You can read about why you should not compare scores like this here and here.
A way I managed to lash something similar together was to fetch matches from the other data source, create a temporary index in a RAMDirectory, and then search again, incorporating it with a MultiSearcher. That way everything is scored against a single, cohesive data set within a single search. Scoring should be reasonable enough, though this isn't exactly the most efficient way to search.
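In Lucene.NET terms (2.9/3.0-era API, class and member names from memory, so treat this as a sketch; the oracleRows collection stands in for whatever you fetch from the other source), it looked roughly like this:

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);

// Placeholder for the rows fetched from the other data source.
var oracleRows = new[] { new { Id = "ORA-1", Text = "row text fetched from Oracle" } };

// 1. Index those rows into a temporary in-memory directory.
var ramDir = new RAMDirectory();
var writer = new IndexWriter(ramDir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
foreach (var row in oracleRows)
{
    var doc = new Document();
    doc.Add(new Field("id", row.Id, Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("text", row.Text, Field.Store.YES, Field.Index.ANALYZED));
    writer.AddDocument(doc);
}
writer.Close();

// 2. Search the permanent index and the temporary one together, so every hit
//    is scored against the same combined collection statistics.
var mainSearcher = new IndexSearcher(
    FSDirectory.Open(new System.IO.DirectoryInfo("path/to/main/index")), true);
var ramSearcher = new IndexSearcher(ramDir, true);
var multi = new MultiSearcher(new Searchable[] { mainSearcher, ramSearcher });

var query = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "text", analyzer)
    .Parse("some query");
var hits = multi.Search(query, 10);
```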

Does Lucene.Net Sort and then filter, or filter and then sort?

We are using the Lucene.Net IndexSearcher.Search method. We are passing a Filter and a Sort, but we're seeing some strange behaviour. Logic tells me that filtering would be done before sorting, for performance reasons, but we wanted to make sure.
Filter then Sort.
Sorting in Lucene is done by collecting documents, in order, into a queue. It keeps the top X documents, where X is the maximum number of results you asked for. The collectors won't compare documents that don't match either the Filter or the Query.
When you don't specify a Sort, the score is used to prioritize documents in the queue; if you use a Sort, a comparator for the Sort you asked for is used instead.
If you are more curious, have a look at the different Collector classes in the source code; the Collect() methods have all the info you want.
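For reference, this is roughly the call shape in Lucene.NET 2.9/3.0 (the field names are made up, and member casing shifted a little between versions, so adjust to yours):

```csharp
using System.IO;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Search(query, filter, n, sort): the filter is applied while collecting, and only
// the top n documents, ordered by the sort, are kept in the queue.
var searcher = new IndexSearcher(FSDirectory.Open(new DirectoryInfo("path/to/index")), true);
var query = new TermQuery(new Term("body", "lucene"));
var filter = new QueryWrapperFilter(new TermQuery(new Term("status", "published")));
var sort = new Sort(new SortField("publishedOn", SortField.STRING, true));   // reverse = descending

TopFieldDocs results = searcher.Search(query, filter, 10, sort);
foreach (ScoreDoc hit in results.ScoreDocs)
{
    var doc = searcher.Doc(hit.Doc);
    // ... read stored fields from doc
}
```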
