Exclude similar documents (duplicates) from result - elasticsearch

I store all articles from some news sources. A news article that originates from e.g. Cnn.com, might be reposted by others. In effect I end up saving the same articles many times.
If I do a search for 'Tesla' I might get 3 articles that are 90% equal to each other. I can compare and filter duplicates in my app using the Levenshtein distance, but I rather have ES filtering it.
Is there a way I can say give me all articles matching WORD, but only return the first if other hits are more than 90% equal to the first?
Cheers,
Martin

If you really need to keep all this records in ES (instead of filtering out with levenshtein before indexing), than you're probably looking for top hits aggregations with field collapsing.
Also take a look at this SO question

Related

Sort results by relevance without filtering in Algolia

Is there a way to sort results in Algolia by relevance instead of filtering them? In our case we have quite a few important attributes but we only have around 700 products so many times the search using facets end up with few or no results.
To avoid this, we are looking for a solution to reorder the list by relevance to show the best results on top while allowing users to still see the other less relevant results. Basically not filtering products, but just reorder them by relevance based on a combination of attributes we set.
Thanks
When setting filters leads you to few or no results, and you'd like to avoid that by still showing less relevant results, two solutions come to mind:
Use optionalFilters instead of filters. You get the same behavior as with filtering, but the Algolia API also returns results that don't match the filters and ranks them lower. This is the ideal solution, as it takes a single API round trip.
Perform a second search without filters when the first search returns fewer records than a threshold of your choice. This is a more manual approach and takes up to two API calls.

Is there a way to show at what percentage a selected document is similar to others on ElasticSearch?

I need to build a search engine using Elasticsearch and the steps will be as following:
Search on the search engine with a search string.
The relevant results will display and I can click on these documents.
If I select a document, I will be redirected to another page where I will see all the details of the documents and will have an option "More Like This" (which will return documents similar to the selected document). I know that this is done using the MLT query.
Now my question is: Except for returning documents similar to the selected one, how can I also return at what percentage the documents are similar to the selected one?
There are a couple of things you can do.
using function_score query
more_like_this query is essentially a full text search, and it returns documents ordered by their relevance score. It could be possible to convert the score directly to a percentage, but it is not advised (here
and more specifically here).
Instead one can define a custom score with help of a function_score query, which can be designed so it returns a meaningful percentage.
This, of course, comes with additional cost of complexity, and the definition of "similarity" becomes more of an art than of science.
using dense_vector
One may opt to use the (yet experimental) dense_vector data type, which allows storing and comparing dense vectors (that is, arrays of numbers of fixed size). Here's an article that describes this approach very well: Text similarity search with vector fields.
In this case the definition of similarity is as precise as it can possibly be: a distance of two vectors in a multidimensional space, which can be computed via, for instance, cosine similarity.
However, such dense vectors have to be somehow computed, and the quality of said vectors will equal the quality of the similarity itself.
As the bottom line I must say that to make this work with Elasticsearch a bunch of computation and logic should be added outside, either in form of pre-computed models, or custom curated scoring algorithms. Elasticsearch out of the box does not seem to be a good percentage-similarity kind of deal.
Hope that helps!
If you're going the route of using semantic search via dense_vector, as Nikolay mentioned, I would recommend NBoost. NBoost has a good out-of-the-box systems for improving Elasticsearch results with SOTA models.

Elastic query for large text document match

If I have an elastic index of news articles, with the news body text in a newsBody field, can you do a search to see if another newsBody 'matches' one in the index? The other newsBody text may have slight variations however.
So not exact matching, but being able to test for similarity between large bodies of text. This is important as often news articles will be nearly identical but differ in ~30 out of 400 words.
So I'd like to be able to pass in a newsBody, and query it against the whole index, looking for similarity to any 'matches'.
I think the similarity module may help, but haven't got anywhere yet: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html
Thanks,
Daniel

MongoDB text index search slow for common words in large table

I am hosting a mongodb database for a service that supports full text searching on a collection with 6.8 million records.
Its text index includes ten fields with varying weights.
Most searches take less than a second. Some searches take two to three seconds. However, some searches take 15 - 60 seconds! The 15-60 second search cases are unacceptable for my application. I need to find a way to speed those up.
Searching takes 15-60 seconds when words that are very common in the index are used in the search query.
I seems that the text search feature does not support lazy parameters. My first thought was to cache a list of the 50 most common words in my text index and then ask mongodb to evaluate those last (lazy) and on top of the filtered results returned by the less common parameters. Hopefully people are still with me. For example, say I have a query "products chocolate", where products is common and chocolate is uncommon. I would like to be able to ask mongodb to evaluate "chocolate" first, and then filter those results with the "products" term. Does anyone know of a way to achieve this?
I can achieve the above scenario by omitting the most common words (i.e. "products") from the db query and then reapplying the common term filter on the application side after it has received records found by db. It is preferable for all query logic to happen on the database, but am open to application side processing for a speed payout.
There are still some holes in this design. If a user only searches common terms, I have no choice but to hit the database with all the terms. From preliminary reading, I gather that it is not recommended (or not supported) to have multiple text indexes (with different names) on the same collection. My plan is to create two identical tables, each with my 6.8M records, with different indexes - one for common words and one for uncommon words. This feels kludgy and clunky, but am willing to do this for a speed increase.
Does anyone have any insight and/or advice on how to speed up this system. I'd like as much processing to happen on the database as possible to keep it fast. I'm sure my little 6.8M record table is not the largest that mongodb has seen. Thanks!
Well I worked around these performance issues by allowing MongoDB full text search to search in OR based format. I'm prioritizing my results by fine tuning the weights on my indexed fields and just ordering by rank. I do get more results than desired, but that's not a huge problem because my weighted results that appear at the top will most likely be consumed before my user gets to less relevant results at the bottom.
If anyone is struggling with MongoDB text search performance using AND searching only, just switch back to OR and control your results using weights. It performs leaps better.
hth
This is the exact same issue as $all versus $in. $all only uses the index for the first keyword in the array. I believe your seeing the same issue here, reason why the OR a.k.a. IN works for you.

Keywords Ranking

I have a list of keywords taken from 95 documents. I'd like to rank their importance, but I have only the number of documents in which the keywords appear and the maximum frequency of a keyword among all the documents. I'm looking for a ranking formula that could help. At the moment I'm using IDF, but I'd like to know if there is any better formula.
word frequency is already done by listing the most important words in English ( and many other langs ) by Wikitionary Frequency Lists which has many type of lists based on the most important and top words, besides the TV and Movies most frequent words and many others.
If you like to do some algorithm based on word ranking I would suggest you don't go far away from
TF-IDF
and here you can find the Latent semantic indexing algorithm which might me an asset for you.
Hope that is what you needed.
TF-IDF is definitely a good base and easy to implement.
It is also really common to add other bias such as the position of your terms inside your documents; a term occurring at the beginning of a document, or better, in its title tends to be more relevant than the ones occurring in the middle or at the end.
But you have to keep in mind that choosing an algorithm and its bias also depends on the nature of your documents. For instance, long documents (e.g. research papers or books) would need a position bias, but not necessarily news articles. Same thing for the "IDF" measure, it has to be computed on a large corpus of documents with similar type of content than your documents. You don't want to have a relevancy score computed on a "TV and Movies" corpus if for instance your documents are research papers about semi-conductors.
My two cents.

Resources