Elasticsearch exact matching and substring matching together

I know that Elasticsearch has a "keyword" type for exact matching. Ex:
"address": { "type": "keyword"}
That's cool. Exact matching works!
But I would like to have both exact matching and sub-string matching, so I decided to create the following mapping:
"address": { "type": "text" , "index": true }
Problem
If I have the "text" type, how can I search for an exact matching string (not a sub-string)? I've tried several ways, but none of them works:
GET testing_index/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "address": "washington"
        }
      }
    }
  }
}
or
GET testing_index/_search
{
  "query": {
    "match": {
      "address": "washington"
    }
  }
}
I just need one universal mapping:
to find an exact string
to find sub-strings
I hope Elasticsearch can do this.

By default, text fields use the standard analyzer, which drops most punctuation, breaks text up into individual words, and lowercases them. For instance, it would turn the string “Quick Brown Fox!” into the terms [quick, brown, fox]. As you can imagine, this makes it difficult to write an exact-match query against a text field. For your use case, I suggest one of two options:
store the field as keyword, and accomplish sub-string-like matching using wildcard or fuzzy queries. Wildcard queries, in particular queries with a leading wildcard, are notoriously slow, so proceed with caution.
store the field twice: once as keyword and once as text (a sketch follows below). The obvious downside here is bloating the size of the index.
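For the second option, here is a minimal sketch of a multi-field mapping, assuming Elasticsearch 7+ syntax and reusing the index and field names from the question (the "raw" sub-field name is just a placeholder):
PUT testing_index
{
  "mappings": {
    "properties": {
      "address": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      }
    }
  }
}
Exact matches then go against the keyword sub-field, while analyzed full-text matches keep using the text field:
GET testing_index/_search
{
  "query": {
    "term": { "address.raw": "washington" }
  }
}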
For more background, see the "Term Query" Elasticsearch documentation, and in particular the section on "Why doesn’t the term query match my document?": https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html

Related

Phrase suggester returns unexpected result when first letter is misspelled

I'm using the Elasticsearch phrase suggester to correct users' misspellings. Everything works as I expect, except when the user enters a query whose first letter is misspelled. In that situation the phrase suggester returns nothing, or returns unexpected results.
My query for suggestion:
{
  "suggest": {
    "text": "user_query",
    "simple_phrase": {
      "phrase": {
        "field": "title.phrase",
        "collate": {
          "query": {
            "inline": {
              "bool": {
                "should": [
                  { "match": { "title": "{{suggestion}}" } },
                  { "match": { "participants": "{{suggestion}}" } }
                ]
              }
            }
          }
        }
      }
    }
  }
}
Example when the first letter is misspelled:
"simple_phrase" : [
  {
    "text" : "گاشانچی",
    "offset" : 0,
    "length" : 11,
    "options" : [ {
      "text" : "گارانتی",
      "score" : 0.00253151
    } ]
  }
]
Example when the fifth letter is misspelled:
"simple_phrase" : [
  {
    "text" : "کاشاوچی",
    "offset" : 0,
    "length" : 11,
    "options" : [ {
      "text" : "کاشانچی",
      "score" : 0.1121
    },
    {
      "text" : "کاشانجی",
      "score" : 0.0021
    },
    {
      "text" : "کاشنچی",
      "score" : 0.0020
    } ]
  }
]
I expect these two misspelled queries to get the same suggestions (my expected suggestions are the second set). What is wrong?
P.S.: I'm using this feature for the Persian language.
I have a solution for your problem; you only need to add some fields to your schema.
P.S.: I don't have that much expertise in Elasticsearch, but I have solved the same problem using Solr, and you can implement it the same way in Elasticsearch too.
Create a new ngram field and copy all your title values into that field.
When you fire a query for a misspelled word and get an empty result, split the word and fire the same query again; you will get results as expected.
Example: suppose the user is searching for the word Akshay but types Skshay. Build the query as below and you should get the expected results.
I am giving you a Solr example here; an Elasticsearch sketch of the same idea follows below.
**(ngram:"skshay" OR ngram:"sk" OR ngram:"ks" OR ngram:"sh" OR ngram:"ha" OR ngram:"ay")**
We have split the word sequence-wise and fired the query on the ngram field.
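A rough Elasticsearch equivalent of that Solr query, assuming the same hypothetical ngram field (my_index is a placeholder, and the ngram field would need an n-gram analyzer in the mapping), would be a bool query with the split fragments as should clauses:
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "ngram": "skshay" } },
        { "match": { "ngram": "sk" } },
        { "match": { "ngram": "ks" } },
        { "match": { "ngram": "sh" } },
        { "match": { "ngram": "ha" } },
        { "match": { "ngram": "ay" } }
      ]
    }
  }
}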
Hope it will help you.
From the Elasticsearch docs:
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-suggesters-phrase.html
prefix_length
The minimal number of prefix characters that must match for a term to be a candidate suggestion. Defaults to 1. Increasing this number improves spellcheck performance. Usually misspellings don't occur at the beginning of terms. (The old name "prefix_len" is deprecated.)
So by default the phrase suggester assumes that the first character is correct, because the default value of prefix_length is 1.
Note: setting this value to 0 is not a good idea, because it has performance implications. If you do want to relax it, the sketch below shows where the parameter goes.
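In the phrase suggester, prefix_length is set on a direct_generator. Here is a sketch based on the request from the question (field names as in the question; this trades candidate-generation performance for first-letter tolerance):
{
  "suggest": {
    "text": "user_query",
    "simple_phrase": {
      "phrase": {
        "field": "title.phrase",
        "direct_generator": [
          {
            "field": "title.phrase",
            "prefix_length": 0
          }
        ]
      }
    }
  }
}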
You need to use the reverse analyzer.
I explained it in this post, so please go and check my answer; a sketch of the setup also follows below:
Elasticsearch spell check suggestions even if first letter missed
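The idea, roughly as described in the Elasticsearch phrase-suggester docs, is to index a reversed copy of the field and add a second direct_generator that reverses each term before and after candidate generation, so a misspelled first letter becomes a misspelled last letter. A sketch with assumed index and field names:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "reverse": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "reverse" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "reverse": { "type": "text", "analyzer": "reverse" }
        }
      }
    }
  }
}
The suggest request then uses two generators, one forward and one reversed:
"direct_generator": [
  { "field": "title" },
  {
    "field": "title.reverse",
    "pre_filter": "reverse",
    "post_filter": "reverse"
  }
]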
And regarding the duplicates, you can use skip_duplicates:
Whether duplicate suggestions should be filtered out (defaults to false).

Elasticsearch 6.2: terms query requires lowercase input when searching on keyword

I've created an example index, with the following mapping:
{
  "_doc": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "status": { "type": "keyword" }
    }
  }
}
And indexed a document:
{"status": "CMP"}
When searching for documents with this status using a terms query, I find no results:
{
  "query": {
    "terms": { "status": ["CMP"] }
  }
}
However, if I make the same query by putting the input in lowercase, I will find my document:
{
  "query": {
    "terms": { "status": ["cmp"] }
  }
}
Why is that? Since I'm searching on a keyword field, the indexed content should not be analyzed and should match an uppercase value...
@Oliver Charlesworth Now, in Elasticsearch 6.x, you can continue to use a keyword datatype and lowercase your text with a normalizer (doc here). However, in any case you will have to change your index mapping and reindex your docs.
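A sketch of such a normalizer applied to the mapping from the question (the index and normalizer names are placeholders; 6.x single-type syntax):
PUT my_index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "status": {
          "type": "keyword",
          "normalizer": "lowercase_normalizer"
        }
      }
    }
  }
}
Since the normalizer is applied at both index and query time, a terms query for either "CMP" or "cmp" should then match.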
The index and mapping creation and the search were part of a test suite. It seems that the setup part of the test suite was not executed, so the mapping was never applied to the index.
The index was then using the default dynamic mapping instead of the explicit mapping, resulting in string fields instead of keywords.
After fixing the setup method of the automated tests, the mappings are applied to the index correctly, and uppercase values for the status "CMP" now match documents.
The symptoms you're seeing shouldn't occur unless something else is wrong.
A keyword field is not analysed, so your index should contain only CMP. A terms query is also not analysed, so your index is searched only for CMP. Hence there should be a match; a minimal reproduction follows below.
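Here is that reproduction as a sketch (hypothetical index name, 6.x single-type syntax). If the final search does not return the document, the mapping was probably never applied, as the asker's own resolution above describes:
PUT test_status
{
  "mappings": {
    "_doc": {
      "properties": {
        "status": { "type": "keyword" }
      }
    }
  }
}

PUT test_status/_doc/1
{ "status": "CMP" }

GET test_status/_search
{
  "query": {
    "terms": { "status": ["CMP"] }
  }
}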

In Elasticsearch, how do I search for an arbitrary substring?

In Elasticsearch, how do I search for an arbitrary substring, perhaps including spaces? (Searching for part of a word isn't quite enough; I want to search any substring of an entire field.)
I imagine it has to be in a keyword field, rather than a text field.
Suppose I have only a few thousand documents in my Elasticsearch index, and I try:
"query": {
"wildcard" : { "description" : "*plan*" }
}
That works as expected--I get every item where "plan" is in the description, even ones like "supplantation".
Now, I'd like to do
"query": {
"wildcard" : { "description" : "*plan is*" }
}
...so that I might match documents with "Kaplan isn't" among many other possibilities.
It seems this isn't possible with wildcard, match prefix, or any other query type I might see. How do I simply search on any substring? (In SQL, I would just do description LIKE '%plan is%')
(I am aware any such query would be slow or perhaps even impossible for large data sets.)
Have you tried the regexp query in Elasticsearch? It sure does sound like something you might be interested in.
I was hoping there might be something built into Elasticsearch for this, given that this simple substring search seems like a very basic capability (thinking about it, it is implemented as strstr() in C, LIKE '%%' in SQL, Ctrl+F in most text editors, String.IndexOf in C#, etc.), but this seems not to be the case. Note that the regexp query doesn't support case insensitivity, so I also needed to pair it with the custom analyzer below, so that the index is all-lowercase. Then I can convert my search string to lowercase as well.
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    ...
    "description": { "type": "text", "analyzer": "lowercase_keyword" }
  }
}
Example query:
"query": {
"regexp" : { "description" : ".*plan is.*" }
}
Thanks to Jai Sharma for leading me; I just wanted to provide more detail.

Elasticsearch find missing word in phrase

How can I use Elasticsearch to find the missing word in a phrase? For example, I want to find all documents which contain the pattern make * great again. I tried using a wildcard query, but it returned no results:
{
  "fields": [
    "file_name",
    "mime_type",
    "id",
    "sha1",
    "added_at",
    "content.title",
    "content.keywords",
    "content.author"
  ],
  "highlight": {
    "encoder": "html",
    "fields": {
      "content.content": {
        "number_of_fragments": 5
      }
    },
    "order": "score",
    "tags_schema": "styled"
  },
  "query": {
    "wildcard": {
      "content.content": "make * great again"
    }
  }
}
If I put in a word and use a match_phrase query, I get results, so I know I have data which matches the pattern.
Which type of query should I use? Or do I need to add some kind of custom analyzer to the field?
Wildcard queries operate on terms, so if you run one against an analyzed field, it will actually try to match every term in that field separately. In your case, you can create a not_analyzed sub-field (such as content.content.raw) and run the wildcard query on that; a sketch follows below. Or just map the actual field as not analyzed, if you don't need to query it in other ways.
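Here is a sketch of that sub-field approach in current syntax, where keyword plays the role of the older not_analyzed setting (index name assumed; note that very large documents may exceed the maximum term length a keyword field can store):
PUT my_index
{
  "mappings": {
    "properties": {
      "content": {
        "properties": {
          "content": {
            "type": "text",
            "fields": {
              "raw": { "type": "keyword" }
            }
          }
        }
      }
    }
  }
}

GET my_index/_search
{
  "query": {
    "wildcard": {
      "content.content.raw": "*make * great again*"
    }
  }
}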

Analyzer to find, e.g., "starbucks" when mistakenly querying "star bucks"

How would I define an analyzer so that a query recalls a document containing the term "starbucks" when the user mistakenly queries "star bucks"?
Or in general: how would I define an analyzer that can find combined terms when the supplied query omits the term separators/spaces?
N-grams clearly don't work, since you'd have to know at indexing time to split the term 'starbucks' into the 2 separate terms 'star' and 'bucks'. Splitting on syllables might be enough, but I'm not sure that's possible (or that it scales).
Thoughts?
You can use a fuzzy query.
Here is a full working sample:
PUT test1

POST test1/a
{
  "item1": "starbucks"
}

POST test1/a
{
  "item1": "foo"
}

GET test1/a/_search
{
  "query": {
    "fuzzy": {
      "item1": "star bucks"
    }
  }
}
