I'm building an autocomplete input to search a city database.
When I type "Wash", I'd like it to suggest Washington. Right now, typing "Wash" returns nothing, while "washingt" makes ES find "washington".
I'm using FOS Elastica.
mapping:
indexes:
    cities:
        finder: ~
        use_alias: false
        settings:
            analysis:
                analyzer:
                    text_analyzer:
                        tokenizer: 'whitespace'
                        filter: [ lowercase, trim, asciifolding, elision, name_filter ]
                filter:
                    name_filter:
                        type: edgeNGram
                        max_gram: 100
                        min_gram: 2
        properties:
            zipCode:
                type: keyword
            name:
                analyzer: text_analyzer
ES request:
{
"query": {
"query_string": {
"query": "washingt",
"fields": [
"name",
"zipCode"
]
}
},
"from": 0,
"size": 1000
}
Any ideas what I'm missing?
Edit:
POST cities/_analyze
{ "field": "name", "text": ["washington"] }
"tokens": [
{
"token": "w",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
}
,
{
"token": "wa",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
}
,
{
"token": "was",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
}
,
{
"token": "wash",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
}
,
{
"token": "washi",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
}
,
{
"token": "washin",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
}
,
{
"token": "washing",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
}
,
{
"token": "washingt",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
}
,
{
"token": "washingto",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
}
,
{
"token": "washington",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
}
]
}
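Since the index-time tokens clearly include "wash", it may be worth comparing what happens at query time. A debugging sketch (my assumption: the same text_analyzer should be run on the query text to see which tokens the search actually uses):

```json
GET cities/_analyze
{
  "analyzer": "text_analyzer",
  "text": "Wash"
}
```

If the query goes through a different analyzer than the one shown here, the mismatch between index-time and search-time tokens would explain the missing results.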
I am looking for a way to get results from Elasticsearch by matching EXACT WHOLE strings. This is for "EQ" ("=") operations from the UI.
{
"_index": "docs",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"DocId": 1,
"DocDate": "2020-07-24T10:16:44.0000000Z",
"Conversation": "I just need to know how frequently I should remind you"
}
},
{
"_index": "docs",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"DocId": 2,
"DocDate": "2020-07-25T10:16:45.0000000Z",
"Conversation": "Building a work culture in your firm"
}
}
Here, when querying Conversation with "I just need to know how frequently I should remind you", ES should return only the DocId 1 data.
Even if the query is "I just need to know how frequently I should remind", it should return empty.
I tried these ES queries, but couldn't figure it out.
GET docs/_search
{
"query": {
"bool": {
"must": [
{"match_phrase": {
"Conversation": "just need to know"
}}
]
}
}
}
GET docs/_search
{
"query": {
"query_string": {
"default_field": "Conversation",
"query": "\"just need to know\""
}
}
}
GET docs/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"Conversation": {"query": "just need to know",
"operator": "and"
}
}
}
]
}
}
}
You need to add .keyword to the Conversation field. This uses the keyword analyzer instead of the standard analyzer.
When using standard analyzer
GET /_analyze
{
"analyzer" : "standard",
"text" : "I just need to know how frequently I should remind you"
}
The following tokens are generated:
{
"tokens": [
{
"token": "i",
"start_offset": 0,
"end_offset": 1,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "just",
"start_offset": 2,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "need",
"start_offset": 7,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "to",
"start_offset": 12,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "know",
"start_offset": 15,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "how",
"start_offset": 20,
"end_offset": 23,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "frequently",
"start_offset": 24,
"end_offset": 34,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "i",
"start_offset": 35,
"end_offset": 36,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "should",
"start_offset": 37,
"end_offset": 43,
"type": "<ALPHANUM>",
"position": 8
},
{
"token": "remind",
"start_offset": 44,
"end_offset": 50,
"type": "<ALPHANUM>",
"position": 9
},
{
"token": "you",
"start_offset": 51,
"end_offset": 54,
"type": "<ALPHANUM>",
"position": 10
}
]
}
Whereas the keyword analyzer returns the entire input string as a single token, so only an exact full-value match will hit.
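You can verify this directly with the built-in keyword analyzer, using the same _analyze API as above:

```json
GET /_analyze
{
  "analyzer": "keyword",
  "text": "I just need to know how frequently I should remind you"
}
```

This returns one token spanning the whole string instead of eleven word tokens.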
If you have not defined any explicit mapping, then your modified search query will be:
{
"query": {
"match": {
"Conversation.keyword": "I just need to know how frequently I should remind you"
}
}
}
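Since Conversation.keyword is not analyzed, an equivalent (and arguably more explicit) option is a term query, which skips query-time analysis entirely:

```json
{
  "query": {
    "term": {
      "Conversation.keyword": "I just need to know how frequently I should remind you"
    }
  }
}
```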
You can also change your index mapping in the following way:
{
"mappings": {
"properties": {
"Conversation": {
"type": "keyword"
}
}
}
}
My use case is to search for edge_ngrams with synonym support where the tokens to match should be in sequence.
While trying out the analysis, I observed 2 different behaviour of the filter chain with respect to position increments.
With the filter chain lowercase, synonym, there is no position increment from the SynonymFilter.
With the filter chain lowercase, edge_ngram, synonym, there is a position increment from the SynonymFilter.
Here are the queries I'm running for each case:
Case 1. No position increment
PUT synonym_test
{
"index": {
"analysis": {
"analyzer": {
"by_smart": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"custom_synonym"
]
}
},
"filter": {
"custom_synonym": {
"type": "synonym",
"synonyms": [
"begin => start"
]
}
}
}
}
}
GET synonym_test/_analyze
{
"text": "begin working",
"analyzer": "by_smart"
}
Output:
{
"tokens": [
{
"token": "start",
"start_offset": 0,
"end_offset": 5,
"type": "SYNONYM",
"position": 0
},
{
"token": "working",
"start_offset": 6,
"end_offset": 13,
"type": "word",
"position": 1
}
]
}
Case 2. Position increment
PUT synonym_test
{
"index": {
"analysis": {
"analyzer": {
"by_smart": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"custom_edge_ngram",
"custom_synonym"
]
}
},
"filter": {
"custom_synonym": {
"type": "synonym",
"synonyms": [
"begin => start"
]
},
"custom_edge_ngram": {
"type": "edge_ngram",
"min_gram": "2",
"max_gram": "60"
}
}
}
}
}
GET synonym_test/_analyze
{
"text": "begin working",
"analyzer": "by_smart"
}
Output:
{
"tokens": [
{
"token": "be",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "beg",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "begi",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "start",
"start_offset": 0,
"end_offset": 5,
"type": "SYNONYM",
"position": 1
},
{
"token": "wo",
"start_offset": 6,
"end_offset": 13,
"type": "word",
"position": 2
},
{
"token": "wor",
"start_offset": 6,
"end_offset": 13,
"type": "word",
"position": 2
},
{
"token": "work",
"start_offset": 6,
"end_offset": 13,
"type": "word",
"position": 2
},
{
"token": "worki",
"start_offset": 6,
"end_offset": 13,
"type": "word",
"position": 2
},
{
"token": "workin",
"start_offset": 6,
"end_offset": 13,
"type": "word",
"position": 2
},
{
"token": "working",
"start_offset": 6,
"end_offset": 13,
"type": "word",
"position": 2
}
]
}
Notice how in Case 1 the token begin, when replaced by start, keeps the same position and there is no position increment. However, in Case 2, when the begin token is replaced by start, the position is incremented for the subsequent token stream.
Now here are my questions :
Why does this not happen in Case 1 but only in Case 2?
The main issue this causes: with the input query begi wor in a match_phrase query (and the default slop of 0), it doesn't match begin work, since begi and wor end up 2 positions apart. Any suggestions on how I can achieve this behaviour without impacting my use case?
I'm using Elasticsearch 5.6.8 (Lucene 6.6.1).
I've read several documentation links and articles, but I couldn't find one that properly explains why this happens, or whether there is a setting that gives my desired behaviour.
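For reference, one partial workaround I've considered (an assumption on my part, not a fix for the position increment itself) is to allow some slop in the match_phrase query; the field name name here is hypothetical:

```json
GET synonym_test/_search
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "begi wor",
        "slop": 1
      }
    }
  }
}
```

This would tolerate the one-position gap introduced by the synonym filter, but at the cost of also matching other near-phrases, which is why I'd prefer to fix the positions themselves.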
Below is the Elasticsearch mapping with one field called hostname and another field called catch_all, which is basically a copy_to field (there will be many more fields copying values into it):
{
"settings": {
"analysis": {
"filter": {
"myNGramFilter": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 40
}},
"analyzer": {
"myNGramAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "myNGramFilter"]
}
}
}
},
"mappings": {
"test": {
"properties": {
"catch_all": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"store": true,
"ignore_above": 256
},
"grams": {
"type": "text",
"store": true,
"analyzer": "myNGramAnalyzer"
}
}
},
"hostname": {
"type": "text",
"copy_to": "catch_all"
}
}
}
}
}
When I run the following _analyze request, the tokens below are returned:
GET index/_analyze
{
"analyzer": "myNGramAnalyzer",
"text": "Dell PowerEdge R630"
}
{
"tokens": [
{
"token": "d",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "de",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "del",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "dell",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "p",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "po",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "pow",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "powe",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "power",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "powere",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "powered",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "poweredg",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "poweredge",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "r",
"start_offset": 15,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "r6",
"start_offset": 15,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "r63",
"start_offset": 15,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "r630",
"start_offset": 15,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
}
]
}
There is a token called "poweredge".
Right now we search with the query below:
{
"query": {
"multi_match": {
"fields": ["catch_all.grams"],
"query": "poweredge",
"operator": "and"
}
}
}
When we query with "poweredge" we get 1 result, but when we search for just "edge" there are no results.
Even a match query does not yield results for the search word "edge".
Can somebody help here?
I suggest not querying with the multi_match API for your use case, but using a match query instead. The edge_ngram filter works like this: it builds n-grams from the tokens produced by splitting your text into words. As written in the documentation - read here:
The edge_ngram tokenizer first breaks text down into words whenever it
encounters one of a list of specified characters, then it emits
N-grams of each word where the start of the N-gram is anchored to the
beginning of the word.
As you verified with your call to the _analyze API, it does not produce "edge" from poweredge as an n-gram, because it produces n-grams anchored to the beginning of the word - look at the output of your _analyze call. Take a look here: https://www.elastic.co/guide/en/elasticsearch/guide/master/ngrams-compound-words.html
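To make the "anchored to the beginning" point concrete, here is a small standalone Java sketch (illustration only, not Elasticsearch code) that generates edge n-grams the same way the filter does:

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNgramDemo {
    // Edge n-grams are anchored at the START of the word: always substring(0, len).
    // There are no mid-word grams, which is why "edge" never appears.
    static List<String> edgeNgrams(String word, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int limit = Math.min(maxGram, word.length());
        for (int len = minGram; len <= limit; len++) {
            grams.add(word.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNgrams("poweredge", 1, 40));
        // -> [p, po, pow, powe, power, powere, powered, poweredg, poweredge]
    }
}
```

Since "edge" is not among the indexed grams, a match on catch_all.grams for "edge" finds nothing, which is exactly the behaviour observed. For infix matching you would need the (plain) ngram filter instead, at a substantial index-size cost.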
When we pass a query containing special characters, Elasticsearch splits the text.
E.g. if we pass "test-test" in a query, how can we make Elasticsearch treat it as a single word and not split it up?
Analyzer used on the field we are searching:
"text_search_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 15
},
"standard_stop_filter": {
"type": "stop",
"stopwords": "_english_"
}
},
"analyzer": {
"text_search_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"text_search_filter"
]
}
}
And the search query:
"query": {
"multi_match": {
"query": "test-test",
"type": "cross_fields",
"fields": [
"FIELD_NAME"
],
}
}
{
"tokens": [
{
"token": "'",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'t",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'te",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'tes",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'test",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'test-",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'test-t",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'test-te",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'test-tes",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'test-test",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "'test-test'",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 1
}
]
}
In my code I catch all words which contain "-" and add quotes around them.
Example:
joe-doe -> "joe-doe"
Java code for this:
import java.util.Arrays;
import java.util.stream.Collectors;

// Wraps every whitespace-separated word containing "-" in double quotes,
// unless it already starts with a quote.
static String placeWordsWithDashInQuote(String value) {
    return Arrays.stream(value.split("\\s"))
            .filter(v -> !v.isEmpty())
            .map(v -> v.contains("-") && !v.startsWith("\"") ? "\"" + v + "\"" : v)
            .collect(Collectors.joining(" "));
}
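A quick check of the helper (copied here so the snippet compiles on its own):

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class DashQuoteDemo {
    // Same helper as in the answer: quote any dash-containing word
    // so query_string treats it as a phrase instead of splitting it.
    static String placeWordsWithDashInQuote(String value) {
        return Arrays.stream(value.split("\\s"))
                .filter(v -> !v.isEmpty())
                .map(v -> v.contains("-") && !v.startsWith("\"") ? "\"" + v + "\"" : v)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(placeWordsWithDashInQuote("joe-doe from new-york"));
        // -> "joe-doe" from "new-york"
    }
}
```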
After this, an example query looks like:
{
"query": {
"bool": {
"must": [
{
"query_string": {
"fields": [
"lastName",
"firstName"
],
"query": "\"joe-doe\"",
"default_operator": "AND"
}
}
]
}
},
"sort": [],
"from": 0,
"size": 10 }