Elastic query for large text document match - elasticsearch

If I have an Elasticsearch index of news articles, with the news body text in a newsBody field, can I run a search to see if another newsBody 'matches' one already in the index? The other newsBody text may have slight variations, however.
So I don't need exact matching, but a way to test for similarity between large bodies of text. This matters because news articles will often be nearly identical, differing in only ~30 out of 400 words.
So I'd like to be able to pass in a newsBody and query it against the whole index, looking for any sufficiently similar 'matches'.
I think the similarity module may help, but haven't got anywhere yet: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html
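For concreteness, here is a minimal sketch of the kind of query I have in mind, built on more_like_this (the index name "news", the 70% threshold, and the Python 8.x client are all placeholder assumptions on my part):

```python
# Sketch only: find indexed articles whose newsBody overlaps heavily with a
# candidate body. Index name "news" and the 70% threshold are placeholders.
from elasticsearch import Elasticsearch  # official Python client, 8.x style

es = Elasticsearch("http://localhost:9200")

candidate_body = "..."  # full text of the incoming article

resp = es.search(
    index="news",
    query={
        "more_like_this": {
            "fields": ["newsBody"],
            "like": candidate_body,         # raw text, not a document id
            "min_term_freq": 1,             # keep rare-but-telling terms
            "max_query_terms": 250,         # sample many terms from a ~400-word body
            "minimum_should_match": "70%",  # require substantial term overlap
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_id"])
```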
Thanks,
Daniel

Related

Is there a way to show at what percentage a selected document is similar to others on ElasticSearch?

I need to build a search engine using Elasticsearch, and the steps are as follows:
Search on the search engine with a search string.
The relevant results are displayed, and I can click on these documents.
If I select a document, I am redirected to another page where I see all the details of the document and have a "More Like This" option (which returns documents similar to the selected document). I know that this is done using the MLT query.
Now my question is: besides returning documents similar to the selected one, how can I also return the percentage to which each of those documents is similar to it?
There are a couple of things you can do.
using function_score query
The more_like_this query is essentially a full-text search, and it returns documents ordered by their relevance score. It may be possible to convert the score directly to a percentage, but it is not advised (here, and more specifically here).
Instead, one can define a custom score with the help of a function_score query, designed so that it returns a meaningful percentage.
This, of course, comes at the additional cost of complexity, and the definition of "similarity" becomes more art than science.
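As one illustration of that idea, here is a hedged sketch that wraps more_like_this in a script_score function so the unbounded relevance score is squashed into [0, 1) and can be displayed as a rough percentage; the saturation constant k is an assumption you would have to tune on your own data:

```python
# Sketch: turn an unbounded relevance score into a rough "percent similar".
# _score / (_score + k) saturates toward 1.0; k is a tuning assumption.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="news",
    query={
        "function_score": {
            "query": {
                "more_like_this": {
                    "fields": ["newsBody"],
                    "like": "text of the selected document ...",
                }
            },
            "script_score": {
                "script": {
                    "source": "_score / (_score + params.k)",
                    "params": {"k": 10.0},  # raw score at which similarity reads as 50%
                }
            },
            "boost_mode": "replace",  # use the script output as the final score
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(f"{hit['_score'] * 100:.0f}% -> {hit['_id']}")
```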
using dense_vector
One may opt to use the (still experimental) dense_vector data type, which allows storing and comparing dense vectors (that is, arrays of numbers of fixed size). Here's an article that describes this approach very well: Text similarity search with vector fields.
In this case the definition of similarity is as precise as it can possibly be: a distance of two vectors in a multidimensional space, which can be computed via, for instance, cosine similarity.
However, such dense vectors have to be computed somehow, and the resulting similarity will only be as good as the vectors themselves.
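To make this concrete, here is a sketch along the lines of the linked article; the embed() function is a stand-in for whatever embedding model you use, and the index name, field names, and dimension are illustrative assumptions:

```python
# Sketch of the dense_vector approach: store one embedding per article and
# rank by cosine similarity. embed() is a stand-in for a real model.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
DIMS = 512  # must match the embedding model's output size

# One-time mapping: the article text plus its fixed-size vector.
es.indices.create(
    index="news-vectors",
    mappings={
        "properties": {
            "newsBody": {"type": "text"},
            "body_vector": {"type": "dense_vector", "dims": DIMS},
        }
    },
)

def embed(text: str) -> list[float]:
    """Stand-in: plug in a sentence-embedding model returning DIMS floats."""
    raise NotImplementedError

# cosineSimilarity ranges over [-1, 1]; adding 1.0 keeps scores non-negative,
# so (score - 1.0) recovers the actual cosine similarity of the two vectors.
resp = es.search(
    index="news-vectors",
    query={
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.qv, 'body_vector') + 1.0",
                "params": {"qv": embed("text of the selected document ...")},
            },
        }
    },
)
```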
The bottom line: to make this work with Elasticsearch, a good deal of computation and logic has to be added outside of it, either in the form of pre-computed models or of custom, curated scoring algorithms. Out of the box, Elasticsearch is not a good fit for percentage-style similarity.
Hope that helps!
If you're going the route of semantic search via dense_vector, as Nikolay mentioned, I would recommend NBoost. NBoost has a good out-of-the-box system for improving Elasticsearch results with SOTA models.

Are there any approaches/suggestions for classifying a keyword so the search space will be reduced in Elasticsearch?

I was wondering whether there is any way to classify a single word before running a search on Elasticsearch. Let's say I have 4 indexes, each one holding a few million documents about a specific category.
I'd like to avoid searching the whole search space each time.
This problem becomes more challenging because the query is not a sentence: it usually consists of only one or two words, so the usual NLP magic (named-entity recognition, POS tagging, etc.) can't be applied.
I have read a few questions on Stack Overflow, such as:
Semantic search with NLP and elasticsearch
Identifying a person's name vs. a dictionary word
and a few more, but couldn't find a workable approach. Are there any suggestions I should try?

Partial and Full Phrase Match

Say I have the sentence: "John likes to take his pet lamb in his Lamborghini Huracan more than in his Lamborghini Gallardo", and I have a dictionary containing "Lamborghini", "Lamborghini Gallardo" and "Lamborghini Huracan". What's a good way of extracting the dictionary terms, so that "Lamborghini Gallardo" and "Lamborghini Huracan" come out as phrase matches and "Lamborghini" and "lamb" as partial matches, giving preference to the phrase matches over individual keywords?
Elasticsearch provides exact term matching, match_phrase, and partial matching. An exact term match obviously wouldn't work here, and neither would match_phrase, since the whole sentence would be treated as the phrase. I believe partial matching would be appropriate if the sentence contained only the keywords of interest. Going through previous SO threads, I found proximity for relevance, which seems relevant, although I'm not sure it's the best option since it requires setting a threshold. Or are there simpler / better alternatives than Elasticsearch (which seems geared more toward full-text search than simple keyword matching against a database)?
It sounds like you'd like to perform keyphrase extraction from your documents using a controlled vocabulary (your dictionary of industry terms and phrases).
[Italicized terms above to help you find related answers on SO and Google]
This level of analysis takes you a bit out of the search stack into the natural-language processing stack. Since NLP tends to be resource-intensive, it tends to take place offline, or in the case of search-applications, at index-time.
To implement this, you'd:
Integrate a keyphrase extraction tool into your search-indexing code to generate a list of recognized key phrases for each document.
Index those key phrases as shingles into a new Elasticsearch field.
Include this shingled keyphrase field in the list of fields searched at query time, most likely with a score boost (see the sketch below).
For a quick-win tool to help you with controlled keyphrase extraction, check out KEA (written in Java).
(You could also probably write your own, but if you're also hoping to extract uncontrolled key phrases (not in dictionary) as well, a trained extractor will serve you better. More tools here.)
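Here is a hedged sketch of steps 2 and 3 above; the index/field names, analyzer settings, and boost value are illustrative, and step 1 (the extraction itself, e.g. with KEA) happens outside Elasticsearch:

```python
# Sketch: a "keyphrases" field analyzed with a shingle filter, searched with
# a boost so phrase-level hits outrank single-keyword hits.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="docs",
    settings={
        "analysis": {
            "filter": {
                "phrase_shingles": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 3,  # covers phrases like "Lamborghini Huracan"
                }
            },
            "analyzer": {
                "shingle_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "phrase_shingles"],
                }
            },
        }
    },
    mappings={
        "properties": {
            "body": {"type": "text"},
            "keyphrases": {"type": "text", "analyzer": "shingle_analyzer"},
        }
    },
)

# At index time, store the extractor's output alongside the body.
es.index(
    index="docs",
    document={
        "body": "John likes to take his pet lamb in his Lamborghini Huracan ...",
        "keyphrases": ["Lamborghini Huracan", "Lamborghini Gallardo", "Lamborghini"],
    },
)

# At query time, prefer phrase-level hits via a boost on the shingled field.
resp = es.search(
    index="docs",
    query={
        "multi_match": {
            "query": "Lamborghini Huracan",
            "fields": ["body", "keyphrases^3"],  # ^3 boost favors keyphrase matches
        }
    },
)
```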

More search suggestions with Elasticsearch

I'm building a small vertical search engine using Elasticsearch as the indexer and Nutch as the crawler. I was using the HTML title field to build search suggestions for ES with an edge n-gram strategy, thinking that the title field would be good because it should contain relevant terms for the subject content of the page, and it would keep the index smaller in terms of search suggestions, be they single words or phrases. However, in testing so far, it's not working out as I thought... there just aren't that many suggestions appearing.
At present I'm only testing with about 10 sites, but will eventually reach about 500 or so. I'm thinking that because of the small data set (10 sites, only the HTML title field), there probably aren't enough terms or phrases available to make good suggestions, at least phrase suggestions anyway.
Would it be advisable to just crawl more sites to create more suggestions (terms and phrases) with the edge n-gram strategy on the title field, or should I use the content field (which is obviously much larger than the title field)?
I'm trying to fine-tune this to get more search suggestions, especially phrase suggestions, while being mindful of the index size so that performance doesn't suffer. Any ideas?
These days one could say that suggestions are even more important than the search results themselves (which is slightly nonsensical, I know), but users tend to expect that if there is no suggestion, there is no search result. Therefore, make sure every searchable field is properly reflected in your suggestions, in particular your content. And "optimize later"! Don't look at your performance too early. 500 sites does not sound like you'll get a lot of documents to index anyway. What kind of hardware are you using?
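As a sketch of what reflecting the content field in your suggestions could look like, here is an edge n-gram "suggest" subfield applied to both title and content; the gram sizes, index name, and field names are assumptions you would tune:

```python
# Sketch: edge-n-gram subfields for autocomplete on title and content.
# search_analyzer stays "standard" so the user's input isn't n-grammed too.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

suggest_subfield = {
    "suggest": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard",
    }
}

es.indices.create(
    index="pages",
    settings={
        "analysis": {
            "filter": {
                "autocomplete_edge": {
                    "type": "edge_ngram",
                    "min_gram": 2,   # suggestions start after two characters
                    "max_gram": 15,  # cap index growth from long tokens
                }
            },
            "analyzer": {
                "autocomplete": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "autocomplete_edge"],
                }
            },
        }
    },
    mappings={
        "properties": {
            "title": {"type": "text", "fields": suggest_subfield},
            "content": {"type": "text", "fields": suggest_subfield},
        }
    },
)

# Suggestion lookup: match the user's prefix against both suggest subfields.
resp = es.search(
    index="pages",
    query={
        "multi_match": {
            "query": "lambo",
            "fields": ["title.suggest", "content.suggest"],
        }
    },
)
```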

Exclude similar documents (duplicates) from result

I store all articles from a number of news sources. A news article that originates from, e.g., CNN.com might be reposted by others, so in effect I end up saving the same article many times.
If I do a search for 'Tesla', I might get 3 articles that are 90% identical to each other. I can compare and filter duplicates in my app using the Levenshtein distance, but I'd rather have ES filter them.
Is there a way to say: give me all articles matching WORD, but only return the first if other hits are more than 90% identical to it?
Cheers,
Martin
If you really need to keep all these records in ES (instead of filtering duplicates out with Levenshtein before indexing), then you're probably looking for top hits aggregations with field collapsing.
Also take a look at this SO question.
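A hedged sketch of the field-collapsing route, assuming you precompute a fingerprint before indexing (e.g. a simhash of the body) so that near-duplicate articles share the same keyword value; Elasticsearch then returns only one hit per fingerprint:

```python
# Sketch: collapse search hits on a precomputed "fingerprint" keyword field.
# The fingerprint field and index name are assumptions; computing a hash that
# is stable across ~90%-identical texts (e.g. simhash) happens before indexing.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="news",
    query={"match": {"newsBody": "Tesla"}},
    collapse={
        "field": "fingerprint",  # keyword field shared by near-duplicates
        "inner_hits": {
            "name": "duplicates",  # expose the collapsed copies if needed
            "size": 3,
        },
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```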
