I am trying to write an Elasticsearch multi-match query (with the Java API) to create a "search-as-you-type" program. The query is applied to two fields, title and description, which are analyzed as ngrams.
My problem is that Elasticsearch seems to match only words that begin with my query. For instance, if I search for "nut", it matches documents featuring "nut", "nuts", "Nutella", etc., but it does not match documents featuring "walnut", which it should.
Here are my settings:
{
"index": {
"analysis": {
"analyzer": {
"edgeNGramAnalyzer": {
"tokenizer": "edgeTokenizer",
"filter": [
"word_delimiter",
"lowercase",
"unique"
]
}
},
"tokenizer": {
"edgeTokenizer": {
"type": "edgeNGram",
"min_gram": "3",
"max_gram": "8",
"token_chars": [
"letter",
"digit"
]
}
}
}
}
}
Here is the relevant part of my mapping:
{
"content": {
"properties": {
"title": {
"type": "text",
"analyzer": "edgeNGramAnalyzer",
"fields": {
"sort": {
"type": "keyword"
}
}
},
"description": {
"type": "text",
"analyzer": "edgeNGramAnalyzer",
"fields": {
"sort": {
"type": "keyword"
}
}
}
}
}
}
And here is my query:
new MultiMatchQueryBuilder(query).field("title", 3).field("description", 1).fuzziness(0).tieBreaker(1).minimumShouldMatch("100%")
Do you have any idea what I could be doing wrong?
That's because you're using an edgeNGram tokenizer instead of an nGram one. The former only indexes prefixes, while the latter indexes prefixes, suffixes, and also inner sub-parts of your data.
Change your analyzer definition to this instead and it should work as expected:
{
"index": {
"analysis": {
"analyzer": {
"edgeNGramAnalyzer": {
"tokenizer": "edgeTokenizer",
"filter": [
"word_delimiter",
"lowercase",
"unique"
]
}
},
"tokenizer": {
"edgeTokenizer": {
"type": "nGram", <---- change this
"min_gram": "3",
"max_gram": "8",
"token_chars": [
"letter",
"digit"
]
}
}
}
}
}
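A quick way to verify the change (the index name content here is just a placeholder for your own index) is to run the _analyze API against a sample word; with the nGram tokenizer, "walnut" now produces inner sub-tokens such as "nut" and "aln", which is why the search starts matching:
POST content/_analyze
{
  "analyzer": "edgeNGramAnalyzer",
  "text": "walnut"
}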
I have a field name in my Elasticsearch index with a value of Single V.
Now if I search it with a value of S or Sing, I get no results, but if I enter the full value Single, then I get the result Single V. The query I am using is the following:
{
"query": {
"match": {
"name": "singl"
}
},
"sort": []
}
This gives me no results. Do I need to change the mapping/settings for name, or the analyzer?
EDIT:-
I am trying to create the following index with the following mapping/setting
PUT my_cars
{
"settings": {
"analysis": {
"normalizer": {
"sortable": {
"filter": ["lowercase"]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 1,
"max_gram": 36,
"token_chars": [
"letter"
]
}
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword",
"normalizer": "sortable"
}
}
}
}
}
}
But I get the following error:
{
"error" : {
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "analyzer [tokenizer] must specify either an analyzer type, or a tokenizer"
}
],
"type" : "illegal_argument_exception",
"reason" : "analyzer [tokenizer] must specify either an analyzer type, or a tokenizer"
},
"status" : 400
}
Elasticsearch by default uses the standard analyzer for text fields if no analyzer is specified. This tokenizes "Single V" into "single" and "v". Due to this, you get a result for "Single" but not for the other terms.
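You can see this directly with the _analyze API; the standard analyzer splits on whitespace and lowercases, so "Single V" becomes the two tokens single and v:
POST _analyze
{
  "analyzer": "standard",
  "text": "Single V"
}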
If you want to do a partial search, you can use an edge n-gram tokenizer or a wildcard query.
The mapping for the Edge n-gram tokenizer would be
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 6,
"token_chars": [
"letter",
"digit"
]
}
}
},
"max_ngram_diff": 10
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
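With this mapping in place (my_index is a placeholder name here), a plain match query for a prefix such as "sing" should return the document, because both the indexed text and the query are analyzed into edge n-grams of 2 to 6 characters:
GET my_index/_search
{
  "query": {
    "match": {
      "name": "sing"
    }
  }
}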
Update 1:
In the index settings given above, the analyzer object is not closed before the tokenizer object (one } bracket is missing), so my_tokenizer ends up nested inside analyzer, which is what triggers the "analyzer [tokenizer] must specify either an analyzer type, or a tokenizer" error. Modify your index mapping as shown below:
{
"settings": {
"analysis": {
"normalizer": {
"sortable": {
"filter": [
"lowercase"
]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
}, // note this
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 1,
"max_gram": 36,
"token_chars": [
"letter"
]
}
}
},
"max_ngram_diff": 50
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword",
"normalizer": "sortable"
}
}
}
}
}
}
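Once the index is created with these settings, you can check what the ngram tokenizer emits for the stored value; "Single V" is broken into grams such as S, Si, Sin, Sing, Singl, Single, i, in, ing, ingl, ingle, and V, which is why partial queries like sing or singl now find the document:
POST my_cars/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Single V"
}
Note that my_analyzer has no lowercase filter, so matching relies on the lower-case grams inside the word; adding a lowercase filter to the analyzer would make prefix matches like Sing and sing behave the same way.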
This is because of the default analyzer. The field is broken into the tokens [single, v] by that analyzer.
A match query looks for an exact match of any of the analyzed query tokens. Since you are only passing singl, that is the only query token, and it does not match either of the two tokens stored in the index. You can use a wildcard query instead:
{
"query": {
"wildcard": {
"user.id": {
"name": "*singl*"
}
}
}
}
I have multiple documents containing a sentence such as "welcome to how are you doing today?". I applied a simple_query_string query to search the above sentence. When I first search for "welcome to how", it returns 0 hits. However, when I search for "how are you doing today", it returns all the documents. Can someone tell me what causes this?
The query is like:
query: {
simple_query_string : {
query: '"welcome to"',
fields : ['content'],
default_operator: 'AND' }
}
The settings for the analyzer are:
{
"number_of_shards": 2,
"refresh_interval": "30s",
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"charSplit": {
"type": "custom",
"tokenizer": "ngram_tokenizer",
"char_filter": [
"my_char_filter"
],
"filter": [
"lowercase",
"autocomplete_filter"
]
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "nGram",
"min_gram": "1",
"max_gram": "1"
}
},
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": "specialCharacters"
}
}
}
}
I am using a custom NGRAM analyzer which has an ngram tokenizer. I have also used a lowercase filter. The query works fine for searches without symbols, but when I search for certain symbols it fails: since the tokenizer only keeps letter and digit characters, Elasticsearch doesn't index the symbols. I know a whitespace tokenizer could help me solve the issue. How can I use two tokenizers in a single analyzer? Below is the mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer":"my_tokenizer",
"filter":"lowercase"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
Is there a way I could solve this issue?
As per the Elasticsearch documentation,
An analyzer must have exactly one tokenizer.
However, you can have multiple analyzers defined in the settings, and you can configure a separate analyzer for each field.
If you want the same field to be analyzed with different analyzers, one option is to make it a multi-field, as per this link:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "whitespace"
"fields": {
"ngram": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
}
}
So if you configure it as above, your query needs to make use of the title and title.ngram fields.
GET my_index/_search
{
"query": {
"multi_match": {
"query": "search ##$ whatever",
"fields": [
"title",
"title.ngram"
],
"type": "most_fields"
}
}
}
As another option, here is what you can do:
Create two indexes.
The first index has the field title with the analyzer my_analyzer.
The second index has the field title with the analyzer whitespace.
Create the same alias for both of them, as below.
Execute the following:
POST _aliases
{
"actions":[
{
"add":{
"index":"index A",
"alias":"index"
}
},
{
"add":{
"index":"index B",
"alias":"index"
}
}
]
}
So when you eventually write a query, it must point to this alias, which in turn queries both indexes.
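For example, a search against the alias (named index above) runs on both backing indexes, so the title field is matched with both analyzers; the query text here is just an example:
GET index/_search
{
  "query": {
    "match": {
      "title": "search ##$ whatever"
    }
  }
}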
Hope this helps!
If you want to use two tokenizers, you should have two analyzers.
Something like this:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer":"my_tokenizer",
"filter":"lowercase"
},
"my_analyzer_2": {
"tokenizer":"whitespace",
"filter":"lowercase"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
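The mapping above only wires up my_analyzer; if you also want my_analyzer_2 applied to the same title field, one way (a sketch along the lines of the multi-field approach in the other answer; the sub-field name whitespace is arbitrary) is:
"mappings": {
  "_doc": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {
          "whitespace": {
            "type": "text",
            "analyzer": "my_analyzer_2"
          }
        }
      }
    }
  }
}
Queries would then target title and title.whitespace as needed.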
In general, you should also pay attention to where the analyzer is set in the mapping.
Sometimes it is necessary to set an analyzer for both search time and index time.
"mappings":{
"_doc":{
"properties":{
"title":{
"type":"text",
"analyzer":"my_analyzer",
"search_analyzer":"my_analyzer"
}
}
}
}
1) You can try updating your token_chars like below:
"token_chars":[
"letter",
"digit",
"symbol",
"punctuation"
]
2) If that does not work, then try the analyzer below:
{
"settings":{
"analysis":{
"filter":{
"my_filter":{
"type":"ngram",
"min_gram":3,
"max_gram":3,
"token_chars":[
"letter",
"digit",
"symbol",
"punctuation"
]
}
},
"analyzer":{
"my_analyzer":{
"type":"custom",
"tokenizer":"keyword",
"filter":[
"lowercase",
"like_filter"
]
}
}
}
},
"mappings":{
"_doc":{
"properties":{
"title":{
"type":"text",
"analyzer":"my_analyzer"
}
}
}
}
}
You need to use the keyword tokenizer and then an ngram token filter in your analyzer.
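To confirm the behaviour, you can run the point-2 analyzer against a value that contains a symbol (my_index is a placeholder index name); because the keyword tokenizer keeps the whole input as one token, the 3-character grams produced by the filter include the symbol, e.g. "abc#1" yields abc, bc#, and c#1:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "abc#1"
}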
I found a solution for using a synonym filter and the edge_ngram filter at the same time.
The main logic works like this: one analyzer is used at index time and the other at search time.
PUT test_synonym_and_autocomplete
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 3,
"max_gram": 10
},
"synonym": {
"type": "synonym",
"synonyms_path": "synonyms.txt"
}
},
"analyzer": {
"autocomplete_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
},
"synonym_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"kstem",
"synonym"
]
}
}
}
},
"mappings": {
"properties": {
"text": {
"type": "text",
"analyzer": "autocomplete_analyzer",
"search_analyzer": "synonym_analyzer"
}
}
}
}
POST test_synonym_and_autocomplete/_doc
{
"text": "Quick Brown Fox"
}
POST test_synonym_and_autocomplete/_doc
{
"text": "B3Round"
}
GET test_synonym_and_autocomplete/_search
{
"query": {
"match": {
"text": {
"query": "test"
}
}
}
}
GET test_synonym_and_autocomplete/_search
{
"query": {
"match": {
"text": {
"query": "qui"
}
}
}
}
Below is a shortened version of my config mapping, using an ngram tokenizer (I have not included all the fields, only region).
I have region data indexed as "Stuttgart" and "Munich" for two separate documents.
When I search for the text "Stut" or "tutt", it does not return any documents.
Is there anything I am missing in the config?
{
"mappings": {
"address": {
"properties": {
"region": {
"type": "text",
"analyzer": "address_analyzer"
}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"address_analyzer": {
"tokenizer": "address_tokenizer"
}
},
"tokenizer": {
"address_tokenizer": {
"type": "ngram",
"min_gram": 4,
"max_gram": 4,
"token_chars": [
"letter",
"digit"
]
}
}
}
}
}
I've been trying to create my own index for users, where queries are run against the "name" value.
This is my current index settings:
{
"users": {
"settings": {
"index": {
"analysis": {
"filter": {
"shingle_filter": {
"max_shingle_size": "2",
"min_shingle_size": "2",
"output_unigrams": "true",
"type": "shingle"
},
"edgeNGram_filter": {
"type": "edgeNGram",
"min_gram": "1",
"max_gram": "20"
}
},
"analyzer": {
"autocomplete_query_analyzer": {
"filter": [
"standard",
"asciifolding",
"lowercase"
],
"tokenizer": "standard"
},
"autocomplete_index_analyzer": {
"filter": [
"standard",
"asciifolding",
"lowercase",
"shingle_filter",
"edgeNGram_filter"
],
"tokenizer": "standard"
}
}
},
"number_of_shards": "1",
"number_of_replicas": "1"
}
}
}
}
and my mapping:
{
"users": {
"mappings": {
"data": {
"properties": {
"name": {
"type": "string",
"analyzer": "autocomplete_index_analyzer",
"search_analyzer": "autocomplete_query_analyzer"
}
}
}
}
}
}
Right now my problem is that search queries do not return results that merely contain the term. For example, if I have a user "David", the search queries "Da", "Dav", "Davi", etc. will return the value, but searching for "vid" or "avid" will not return any values.
Is this because of some value I'm missing in the settings?
You need to use nGram instead of edgeNGram. So simply change this
"edgeNGram_filter": {
"type": "edgeNGram",
"min_gram": "1",
"max_gram": "20"
}
into this
"edgeNGram_filter": {
"type": "nGram", <--- change here
"min_gram": "1",
"max_gram": "20"
}
Note that you need to wipe your index, recreate it, and then populate it again.
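After reindexing, you can sanity-check the change with the _analyze API (the index name users is taken from the settings shown in the question); with the nGram filter, "David" now produces grams that include vid and avid, which is what makes those searches return the user:
POST users/_analyze
{
  "analyzer": "autocomplete_index_analyzer",
  "text": "David"
}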