Fuzziness in phrase matching in elasticsearch

How do I include fuzziness in phrase matching? The Elasticsearch documentation says that fuzziness is not supported with phrase matching.
I have documents containing phrases, and I have a body of text. I want to find the phrases that the text and the documents have in common, but the search needs to tolerate phrases that might be spelled wrong.

There are a couple of ways to do this:
Remove whitespace and index the whole phrase as one token (I think there is a filter for that in Elastic). In your query you would have to do the same; a sketch of this approach follows below.
There is a tokenizer whose name I forgot (maybe someone can help out here? possibly the shingle token filter) which lets you index more than one word together. If your phrases have a common maximum length, like 5 words or so, this could do the trick.
Beware that fuzziness only works for a maximum edit distance of 2, so if you have a very long phrase, 2 might not be enough and you will have to split it.
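For the first approach, here is a rough sketch of what the index and query might look like (the index name phrases, the field phrase, and the analyzer names are just placeholders, and the mapping syntax assumes a recent Elasticsearch version):

PUT phrases
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_spaces": { "type": "pattern_replace", "pattern": "\\s+", "replacement": "" }
      },
      "analyzer": {
        "phrase_as_one_token": {
          "type": "custom",
          "char_filter": ["strip_spaces"],
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "phrase": { "type": "text", "analyzer": "phrase_as_one_token" }
    }
  }
}

GET phrases/_search
{
  "query": {
    "match": {
      "phrase": { "query": "quick browm fox", "fuzziness": "AUTO" }
    }
  }
}

Because the whole phrase is indexed (and analyzed at query time) as a single term, the fuzzy match tolerates at most 2 edits across the entire phrase, which is exactly the limitation mentioned above.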

Related

Elastic Search substring search

I need to implement search by substring. It is supposed to work the same as "Ctrl+F": a value should be found if the search string matches part of it.
The search is going to be performed on two fields only:
Name - no more than 255 chars
Id - no more than 200 chars
However, the number of records is going to be pretty large, about a million.
So far I'm using a query_string search with the keywords wrapped in wildcards, but that will definitely lead to performance problems later on once the number of records starts growing.
Do you have any suggestions for a more performant solution?
Searching with leading wildcards is going to be extremely slow on a large index
Avoid beginning patterns with * or ?. This can increase the iterations
needed to find matching terms and slow search performance.
As written in the documentation, wildcard queries are very slow.
It is better to use an n-gram strategy if you want queries to be fast. For partial matches, word prefixes, or any substring match, the n-gram tokenizer will improve the full-text search.
The ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word of the specified length.
Please go through this SO answer, which includes a working example of a partial match using ngrams.
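As a rough sketch of such an n-gram setup (index, field, and analyzer names are placeholders; the gram sizes would need tuning, and recent versions require index.max_ngram_diff to allow the min/max spread):

PUT catalog
{
  "settings": {
    "index": { "max_ngram_diff": 18 },
    "analysis": {
      "tokenizer": {
        "substring_ngrams": { "type": "ngram", "min_gram": 2, "max_gram": 20 }
      },
      "analyzer": {
        "substring_analyzer": {
          "type": "custom",
          "tokenizer": "substring_ngrams",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "Name": { "type": "text", "analyzer": "substring_analyzer", "search_analyzer": "standard" },
      "Id": { "type": "text", "analyzer": "substring_analyzer", "search_analyzer": "standard" }
    }
  }
}

GET catalog/_search
{
  "query": {
    "multi_match": {
      "query": "artial",
      "fields": ["Name", "Id"]
    }
  }
}

The n-grams are generated only at index time; the standard search analyzer leaves the query terms whole, so any substring between 2 and 20 characters matches without wildcards.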

Does Elasticsearch score different length shingles with the same IDF?

In Elasticsearch 5.6.5 I'm searching against a field with the following filter applied:
"filter_shingle":{
"max_shingle_size":"4",
"min_shingle_size":"2",
"output_unigrams":"true",
"type":"shingle"
}
When I perform a search for depreciation tax against a document with that exact text, I see the following explanation of the score:
weight(Synonym(content:depreciation content:depreciation tax)) .... [7.65]
weight(content:tax) ... [6.02]
If I change the search to depreciation taffy against the exact same document with depreciation tax in the content I get this explanation:
weight(Synonym(content:depreciation content:depreciation taffy)) .... [7.64]
This is not what I expected. I thought a match on the bigram token for depreciation tax would get a much higher score than a match on the unigram depreciation. However, this scoring seems to reflect a simple unigram match. There is an extremely small difference, and digging further, this is because termFreq=28 under the depreciation taffy match and termFreq=29 under the depreciation tax match. I'm also not sure how this relates, as I imagine that across the shard holding this document there are very different counts for depreciation, depreciation tax and depreciation taffy.
Is this expected behavior? Is ES treating all the different sized shingles, including unigrams, with the same IDF value? Do I need to split out each shingle size into different sub fields with different analyzers to get the behavior I expect?
TL;DR
Shingles and Synonyms are broken in Elastic/Lucene and a lot of hacks need to be applied until a fix is released (accurate as of ES 6).
Put unigrams, bigrams and so on in individual subfields and search them separately, combining the scores for an overall match. Don't use a single shingle filter that emits multiple shingle sizes on one field.
Don't combine a synonym and shingle filter on the same field.
In my case I do a must match with synonyms on a unigram field, then a series of should matches to boost the score on shingles of each size, without synonyms (a rough sketch of that query shape is below).
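A rough sketch of that query shape, assuming hypothetical subfields content.shingles2 and content.shingles3, each analyzed with a fixed-size shingle filter and no synonyms:

GET docs/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { "content": "depreciation tax" }
      },
      "should": [
        { "match": { "content.shingles2": "depreciation tax" } },
        { "match": { "content.shingles3": "depreciation tax" } }
      ]
    }
  }
}

The must clause on the unigram field (where synonyms are applied) decides what matches at all; the should clauses only boost documents where the longer shingles also match.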
Details
I got an answer on the elastic support forums:
https://discuss.elastic.co/t/does-elasticsearch-score-different-length-shingles-with-the-same-idf/126653/2
Yep, this is mostly expected.
It's not really the shingles causing the scoring oddness, but the fact
that SynonymQueries do the frequency blending behavior that you're
seeing. They use frequency of the original token for all the
subsequent 'synonym' tokens, as a way to help prevent skewing the
score results. Synonyms are often relatively rare, and would
drastically affect the scoring if they each used their individual
df's.
From the Lucene docs:
For scoring purposes, this query tries to score the terms as if you
had indexed them as one term: it will match any of the terms but only
invoke the similarity a single time, scoring the sum of all term
frequencies for the document.
The SynonymQuery also sets the docFrequency to the maximum
docFrequency of the terms in the document. So for example, if:
"deprecation"df == 5 "deprecation tax"df == 2, "deprecation taffy"df
== 1, it will use 5 as the docFrequency for scoring purposes.
The bigger issue is that Lucene doesn't have a way to differentiate
shingles from synonyms... they both use tokens that overlap the
position of other tokens in the token stream. So if unigrams are mixed
with bi-(or larger)-grams, Lucene is tricked into thinking it's
actually a synonym situation.
The fix is to keep your unigrams and bi-plus-grams in different
fields. That way Lucene won't attempt to use SynonymQueries in these
situations, because the positions won't be overlapping anymore.
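A sketch of the kind of mapping that advice points at, with a unigram-analyzed parent field and a bigram-only subfield (field names and shingle sizes are placeholders; further sizes would be added as additional subfields):

PUT docs
{
  "settings": {
    "analysis": {
      "filter": {
        "bigrams_only": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false
        }
      },
      "analyzer": {
        "bigram_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "bigrams_only"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "shingles2": { "type": "text", "analyzer": "bigram_analyzer" }
        }
      }
    }
  }
}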
Here's another related question I asked, about how actual synonyms also get mangled when combined with shingles: https://discuss.elastic.co/t/es-5-4-synonyms-and-shingles-dont-seem-to-work-together/127552
Elastic/Lucene expands the synonym set, injects the synonyms into the token stream, then creates shingles. E.g. query: econ supply and demand => econ, economics, supply, demand. Document: ... econ foo ... => econ, foo. Now we get the shingle "econ economics" from the query, and somehow this matches the document. I have no idea why, since I only applied synonyms to the query, not the document, so I don't see the match. Also, the way the shingles are created from the query is wrong too.
This is a known problem, and it is still not fully resolved. A number
of Lucene filters can't consume graphs as their inputs.
There is currently active work being done on developing a fixed
shingle filter, and also an idea to have a sub-field for indexing
shingles.

Limit of words in Phrase List

I am adding some keywords to a phrase list; there are ~8000 words. Is there any limit on LUIS phrase lists? I am getting the error "BadArgument. Too many words in Data Dictionary".
Can anyone tell me what the limit on phrase list words is?
Also, is there any other approach to incorporate these words?
You can have 10 Phrase Lists and the maximum length of each of these is 5,000 items. You can find the information here in the docs.
Regarding incorporating words, can you give an example of the words you're using?

Matching words and valid sub-words in elasticsearch

I've been working with ElasticSearch within an existing code base for a few days, so I expect that the answer is easy once I know what I'm doing. I want to extend a search to yield the same results when I search with a compound word, like "eyewitness", or its component words separated by a whitespace, like "eye witness".
For example, I have a catalog of toy cars that includes both "firetruck" toys and "fire truck" toys. I would like to ensure that if someone searched on either of these terms, the results would include both the "firetruck" and the "fire truck" entries.
I attempted to do this at first with the "fuzziness" of a match, hoping that "fire truck" would be considered one transform away from "firetruck", but that does not work: ES fuzziness is per-word and will not add or remove whitespace characters as a valid transformation.
I know that I could do some brute-forcing before generating the query by trying to come up with additional search terms by breaking big words into smaller words and also joining smaller words into bigger words and checking all of them against a dictionary, but that falls apart pretty quickly when "fuzziness" and proper names are part of the task.
It seems like this is exactly the kind of thing that ES should do well, and that I simply don't have the right vocabulary yet for searching for the solution.
Thanks, everyone.
There are two things you could do:
you could split words into their compounds, i.e. firetruck would be split into the two tokens fire and truck, see here
you could use n-grams, i.e. for 4-grams the original firetruck gets split into the tokens fire, iret, retr, etru, truc, ruck. In queries, the scoring function helps you end up with pretty decent results. Check out this.
Always remember to do the same tokenization on both the analysis and the query side.
I would start with the n-grams, and if that is not good enough, go with the compounds and split them yourself - but that's a lot of work depending on the vocabulary you have under consideration.
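If you do go down the compound route, one option is the dictionary_decompounder token filter with a hand-maintained word list; the sketch below is illustrative only (index name, field, and word list are made up):

PUT toys
{
  "settings": {
    "analysis": {
      "filter": {
        "toy_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["fire", "truck", "eye", "witness"]
        }
      },
      "analyzer": {
        "decompound_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "toy_decompounder"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": { "type": "text", "analyzer": "decompound_analyzer" }
    }
  }
}

With the same analyzer applied at index and query time, firetruck is indexed as firetruck, fire, truck, so a search for fire truck matches it and vice versa; the cost is maintaining the word list.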
hope the concepts and the links help, fricke

Does Algolia have a boost score feature like Elasticsearch?

I have a requirement where, when the query matches a particular field, that document should get a higher score than other documents. Can Algolia do this?
To reflect the importance of one attribute compared to another, the way to go with Algolia is definitely to order the attributes you want to search in via the searchableAttributes index setting.
For instance, if you want to search in both title and description, but title is more important; you should go for:
searchableAttributes:
- title
- description
Compared to the boosting approach, this ensures that the number of match occurrences won't impact the overall ranking (a common issue in ES: are 4 words matching here and there in description better than 2 words matching exactly in title?).
With Algolia, the longest matching expression (in terms of proximity between query words in the text) is always used to identify the best matching attribute, and the results are then sorted according to the attributes' importance.
