Elasticsearch ngram slash escaping

I am using an ngram analyzer on my Elasticsearch index. This is needed for the search capability I require. I am searching for a document with a name of "l/test_V0001". When I search using "l/test" I only get results for "l": the / is acting as a separator/escape character rather than as text. I have searched and found this is a common and expected issue, but I can find no workaround.
When I search via the REST API for "l/test_V0001" I can find the result I am after. However, when doing the same search via the Java API I still only get results for "l".
Here is the REST API search:
{
"query": {
"multi_match": {
"query": "l/test_V0001",
"fields": ["name", "name.partial", "name.text"]
}
}
}
And the mapping for the index:
{
"settings": {
"index": {
"max_ngram_diff": 20,
"search.idle.after": "10m"
},
"analysis": {
"analyzer": {
"ngram3_analyzer": {
"tokenizer": "ngram3_tokenizer",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"ngram3_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 20
}
}
}
},
"mappings": {
"dynamic": "strict",
"properties": {
"name": {
"type": "keyword",
"fields": {
"partial": {
"type": "text",
"analyzer": "ngram3_analyzer",
"search_analyzer": "keyword"
},
"text": {
"type": "text"
}
}
},
"value": {
"type": "integer"
}
}
}
}
Any help on this or a workaround would be great!

So after a bit of digging I found the answer: custom token characters. This was added to the tokenizer definition in the index settings:
"tokenizer": {
"ngram3_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"symbol",
"custom"
],
"custom_token_chars": "/"
}
So my full index now looks like this:
{
"settings": {
"index": {
"max_ngram_diff": 20,
"search.idle.after": "10m"
},
"analysis": {
"analyzer": {
"ngram3_analyzer": {
"tokenizer": "ngram3_tokenizer",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"ngram3_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"symbol",
"custom"
],
"custom_token_chars": "/"
}
}
}
},
"mappings": {
"dynamic": "strict",
"properties": {
"name": {
"type": "keyword",
"fields": {
"partial": {
"type": "text",
"analyzer": "ngram3_analyzer",
"search_analyzer": "keyword"
},
"text": {
"type": "text"
}
}
},
"value": {
"type": "integer"
}
}
}
}
This works for both the REST client and the Java API.
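If you want to double-check what the analyzer emits for a value such as "l/test_V0001", the _analyze API is a quick sanity check. A minimal sketch, assuming the index is called my_index (the question doesn't name it):
GET /my_index/_analyze
{
  "analyzer": "ngram3_analyzer",
  "text": "l/test_V0001"
}
With "/" listed via custom_token_chars it is treated as a token character, so the returned grams can span the slash instead of the input being split on it.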

Related

How to match partial words in Elasticsearch text search

I have a field name in my Elasticsearch index with a value of "Single V".
Now if I search with a value of "S" or "Sing", I get no results, but if I enter the full value "Single", then I get the result "Single V". The query I am using is the following:
{
"query": {
"match": {
"name": "singl"
}
},
"sort": []
}
This gives me no results. Do I need to change the mapping/settings for name, or the analyzer?
EDIT:
I am trying to create the following index with the following mapping/settings:
PUT my_cars
{
"settings": {
"analysis": {
"normalizer": {
"sortable": {
"filter": ["lowercase"]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 1,
"max_gram": 36,
"token_chars": [
"letter"
]
}
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword",
"normalizer": "sortable"
}
}
}
}
}
}
But I get the following error:
{
"error" : {
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "analyzer [tokenizer] must specify either an analyzer type, or a tokenizer"
}
],
"type" : "illegal_argument_exception",
"reason" : "analyzer [tokenizer] must specify either an analyzer type, or a tokenizer"
},
"status" : 400
}
Elasticsearch by default uses the standard analyzer for a text field if no analyzer is specified. This tokenizes "Single V" into "single" and "v". Because of this, you get a result for "Single" but not for the other terms.
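You can see this with the _analyze API, which shows the tokens the standard analyzer actually puts in the index:
POST _analyze
{
  "analyzer": "standard",
  "text": "Single V"
}
This returns the tokens "single" and "v", which is why only the full word "Single" produces a match.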
If you want to do a partial search, you can use the edge n-gram tokenizer or a wildcard query.
The mapping for the edge n-gram tokenizer would be:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 6,
"token_chars": [
"letter",
"digit"
]
}
}
},
"max_ngram_diff": 10
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
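With such a mapping in place (and the documents reindexed), a plain match query on a partial term should return results. A sketch, reusing the my_cars index name from the question; note that the analyzer above has no lowercase filter, so the match is case-sensitive ("Sing" will match "Single V", "sing" will not):
POST my_cars/_search
{
  "query": {
    "match": {
      "name": "Sing"
    }
  }
}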
Update 1:
In the index settings given above, one closing bracket } is missing after the analyzer section, which is why the tokenizer ends up nested inside it. Modify your index mapping as shown below:
{
"settings": {
"analysis": {
"normalizer": {
"sortable": {
"filter": [
"lowercase"
]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
}, // note this
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 1,
"max_gram": 36,
"token_chars": [
"letter"
]
}
}
},
"max_ngram_diff": 50
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword",
"normalizer": "sortable"
}
}
}
}
}
}
This is because of the default analyzer. The field is broken into the tokens [single, v] by that analyzer.
A match query looks for an exact match of any of the query tokens. Since you are only passing "singl", that is the only query token, and it does not match either of the two tokens stored in the index.
You can use a wildcard query instead:
{
"query": {
"wildcard": {
"name": {
"value": "*singl*"
}
}
}
}

Elasticsearch NGram Analyser - Change the Order of the results of Query

I want to change the order in which the Elasticsearch query displays results according to the scoring.
The current query returns results for the field title in the following order:
1. Quick 123
2. Foxes Quick
3. Quick
4. Foxes Quick Quick
5. Quick Foxes
Shouldn't 3. Quick be coming as the first result instead?
Also, Foxes Quick Quick has two occurrences of Quick, so it should get some preference in the query result, but it is coming in 4th position.
Index settings:
{
"fundraisers": {
"settings": {
"index": {
"number_of_shards": "5",
"provided_name": "fundraisers",
"creation_date": "1546515635025",
"analysis": {
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "my_tokenizer"
},
"search_analyzer_search": {
"filter": [
"lowercase"
],
"tokenizer": "search_tokenizer_search"
}
},
"tokenizer": {
"my_tokenizer": {
"token_chars": [
"letter",
"digit"
],
"min_gram": "3",
"type": "edge_ngram",
"max_gram": "50"
},
"search_tokenizer_search": {
"token_chars": [
"letter",
"digit",
"whitespace"
],
"min_gram": "3",
"type": "ngram",
"max_gram": "50"
}
}
},
"number_of_replicas": "1",
"uuid": "mVweO4_sT3Ww00MzdLyavw",
"version": {
"created": "6020399"
}
}
}
}
}
Query
GET fundraisers/_search?explain=true
{
"query": {
"match_phrase": {
"title": {
"query": "qui",
"analyzer": "my_analyzer"
}
}
}
}
Mapping
{
"fundraisers": {
"mappings": {
"fundraisers": {
"properties": {
"status": {
"type": "text"
},
"suggest": {
"type": "completion",
"analyzer": "simple",
"preserve_separators": true,
"preserve_position_increments": true,
"max_input_length": 50
},
"title": {
"type": "text",
"analyzer": "my_analyzer"
},
"twitterUrl": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"videoLinks": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"zipCode": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
Am I complicating this too much by using match_phrase, a search analyzer and ngrams, or is there a simpler way to achieve the expected result?
Ref:
https://www.elastic.co/guide/en/elasticsearch/reference/6.5/query-dsl-match-query.html
Ok, first let's create a minimal and reproducible setup:
PUT test
{
"settings": {
"index": {
"number_of_shards": "1",
"number_of_replicas": "1",
"analysis": {
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "my_tokenizer"
},
"search_analyzer_search": {
"filter": [
"lowercase"
],
"tokenizer": "search_tokenizer_search"
}
},
"tokenizer": {
"my_tokenizer": {
"token_chars": [
"letter",
"digit"
],
"min_gram": "3",
"type": "edge_ngram",
"max_gram": "50"
},
"search_tokenizer_search": {
"token_chars": [
"letter",
"digit",
"whitespace"
],
"min_gram": "3",
"type": "ngram",
"max_gram": "50"
}
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
PUT test/_doc/1
{
"title": "Quick 123"
}
PUT test/_doc/2
{
"title": "Foxes Quick"
}
PUT test/_doc/3
{
"title": "Quick"
}
PUT test/_doc/4
{
"title": "Foxes Quick Quick"
}
PUT test/_doc/5
{
"title": "Quick Foxes"
}
Then let's try the simplest query:
GET test/_search
{
"query": {
"match": {
"title": {
"query": "qui"
}
}
}
}
And now your order is:
1. Quick
2. Foxes Quick Quick
3. Quick 123
4. Foxes Quick
5. Quick Foxes
That's pretty much what you were expecting, right? There might be other use cases that are not covered by this query, but IMO you'll have to use multi_match and search with different analyzers, because I'm not sure a phrase search on an edge n-gram field makes much sense.
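As a rough sketch of that idea (not part of the original answer): add a sub-field to title that uses the already-defined search_analyzer_search, reindex, and query both fields with multi_match so the scores combine:
PUT test/_mapping/_doc
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my_analyzer",
      "fields": {
        "search": {
          "type": "text",
          "analyzer": "search_analyzer_search"
        }
      }
    }
  }
}

GET test/_search
{
  "query": {
    "multi_match": {
      "query": "qui",
      "fields": ["title", "title.search"],
      "type": "most_fields"
    }
  }
}
The new sub-field is only populated for documents indexed after the mapping change, so existing documents would need to be reindexed (for example with _update_by_query).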

Elasticsearch: Run multiple analyzers on the same data

I am looking for a way to make Elasticsearch search the data with multiple analyzers: an ngram analyzer and one or a few language analyzers.
A possible solution would be to use multi-fields and explicitly declare which analyzer to use for each field.
For example, set the following mappings:
"mappings": {
"my_entity": {
"properties": {
"my_field": {
"type": "text",
"fields": {
"ngram": {
"type": "string",
"analyzer": "ngram_analyzer"
},
"spanish": {
"type": "string",
"analyzer": "spanish"
},
"english": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
}
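For reference (this query is not part of the original question), searching such a mapping means spelling out every sub-field, along these lines, with the index name and query text made up for illustration:
GET /my_index/_search
{
  "query": {
    "multi_match": {
      "query": "ejemplo",
      "fields": [
        "my_field.ngram",
        "my_field.spanish",
        "my_field.english"
      ]
    }
  }
}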
The problem with that is that I have to explicitly list every field and its analyzers in a search query.
And it does not allow searching with "_all" using multiple analyzers.
Is there a way to make an "_all" query use multiple analyzers?
Something like "_all.ngram", "_all.spanish", and without using copy_to to duplicate the data?
Is it possible to combine the ngram analyzer with Spanish (or any other foreign language) into a single custom analyzer?
I have tested the following settings but these did not work:
PUT /ngrams_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"tokenizer": {
"ngram_tokenizer": {
"type": "nGram",
"min_gram": 3,
"max_gram": 3
}
},
"filter": {
"ngram_filter": {
"type": "nGram",
"min_gram": 3,
"max_gram": 3
},
"spanish_stop": {
"type": "stop",
"stopwords": "_spanish_"
},
"spanish_keywords": {
"type": "keyword_marker",
"keywords": ["ejemplo"]
},
"spanish_stemmer": {
"type": "stemmer",
"language": "light_spanish"
}
},
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer",
"filter": [
"lowercase",
"spanish_stop",
"spanish_keywords",
"spanish_stemmer"
]
}
}
}
},
"mappings": {
"my_entity": {
"_all": {
"enabled": true,
"analyzer": "ngram_analyzer"
},
"properties": {
"my_field": {
"type": "text",
"fields": {
"analyzer1": {
"type": "string",
"analyzer": "ngram_analyzer"
},
"analyzer2": {
"type": "string",
"analyzer": "spanish"
},
"analyzer3": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
}
}
GET /ngrams_index/_analyze
{
"field": "_all",
"text": "Hola, me llamo Juan."
}
returns just ngram results, without Spanish analysis, whereas
GET /ngrams_index/_analyze
{
"field": "my_field.analyzer2",
"text": "Hola, me llamo Juan."
}
properly analyzes the search string.
Is it possible to build a custom analyzer which combines Spanish and ngram?
There is a way to create a custom ngram+language analyzer:
PUT /ngrams_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"ngram_filter": {
"type": "nGram",
"min_gram": 3,
"max_gram": 3
},
"spanish_stop": {
"type": "stop",
"stopwords": "_spanish_"
},
"spanish_keywords": {
"type": "keyword_marker",
"keywords": [
"ejemplo"
]
},
"spanish_stemmer": {
"type": "stemmer",
"language": "light_spanish"
}
},
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"spanish_stop",
"spanish_keywords",
"spanish_stemmer",
"ngram_filter"
]
}
}
}
},
"mappings": {
"my_entity": {
"_all": {
"enabled": true,
"analyzer": "ngram_analyzer"
},
"properties": {
"my_field": {
"type": "text",
"analyzer": "ngram_analyzer"
}
}
}
}
}
GET /ngrams_index/_analyze
{
"field": "my_field",
"text": "Hola, me llamo Juan."
}
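Since my_field uses ngram_analyzer for both indexing and searching (no separate search_analyzer is set), an ordinary match query gets the same treatment, for example:
GET /ngrams_index/_search
{
  "query": {
    "match": {
      "my_field": "llamo"
    }
  }
}
In practice you may want to configure a separate search_analyzer without the ngram filter, so that queries don't match on every shared trigram.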

How to add custom analyzer to mapping ElasticSearch-2.3.5 for partial searching?

I use Elasticsearch 2.3.5. I want to add my custom analyzers to the mapping while creating the index.
PUT /library
{
"settings": {
"analysis": {
"tokenizer": {
"ngram_tokenizer": {
"type": "nGram",
"min_gram": "1",
"max_gram": "15",
"token_chars": [
"letter",
"digit"
]
}
},
"analyzer": {
"index_ngram_analyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer",
"filter": [
"lowercase"
]
}
},
"search_term_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": "lowercase"
}
}
},
"mappings": {
"book": {
"properties": {
"Id": {
"type": "long",
"search_analyzer": "search_term_analyzer",
"index_analyzer": "index_ngram_analyzer",
"term_vector":"with_positions_offsets"
},
"Title": {
"type": "string",
"search_analyzer": "search_term_analyzer",
"index_analyzer": "index_ngram_analyzer",
"term_vector":"with_positions_offsets"
}
}
}
}
}
I took the template example from the official guide:
{
"settings" : {
"number_of_shards" : 1
},
"mappings" : {
"type1" : {
"properties" : {
"field1" : { "type" : "string", "index" : "not_analyzed" }
}
}
}
}
But I get an error when trying to execute the first part of the code. Here is my error:
{
"error": {
"root_cause": [
{
"type": "mapper_parsing_exception",
"reason": "analyzer [search_term_analyzer] not found for field [Title]"
}
],
"type": "mapper_parsing_exception",
"reason": "Failed to parse mapping [book]: analyzer [search_term_analyzer] not found for field [Title]",
"caused_by": {
"type": "mapper_parsing_exception",
"reason": "analyzer [search_term_analyzer] not found for field [Title]"
}
},
"status": 400
}
I can do it if I put my mappings inside of settings, but I think that is the wrong way. I want to find my book by using part of the title. I have the "King Arthur" book, for example. My query looks like this:
POST /library/book/_search
{
"query": {
"match": {
"Title": "kin"
}
}
}
Nothing is found. What am I doing wrong? Could you help me? It seems my analyzer and tokenizer don't work. How can I get the terms "k", "i", "ki", "king", etc.? I think that I only have two terms right now: 'king' and 'arthur'.
You have misplaced the search_term_analyzer analyzer; it should be inside the analyzer section:
PUT /library
{
"settings": {
"analysis": {
"tokenizer": {
"ngram_tokenizer": {
"type": "nGram",
"min_gram": "1",
"max_gram": "15",
"token_chars": [
"letter",
"digit"
]
}
},
"analyzer": {
"index_ngram_analyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer",
"filter": [
"lowercase"
]
},
"search_term_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": "lowercase"
}
}
}
},
"mappings": {
"book": {
"properties": {
"Id": {
"type": "long", <---- you probably need to make this a string or remove the analyzers
"search_analyzer": "search_term_analyzer",
"analyzer": "index_ngram_analyzer",
"term_vector":"with_positions_offsets"
},
"Title": {
"type": "string",
"search_analyzer": "search_term_analyzer",
"analyzer": "index_ngram_analyzer",
"term_vector":"with_positions_offsets"
}
}
}
}
}
Also make sure to use analyzer instead of index_analyzer; the latter was removed in ES 2.x.
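As a quick sanity check once the corrected index is created (a sketch, using the 2.x query-string form of the _analyze API):
GET /library/_analyze?analyzer=index_ngram_analyzer&text=king
The response should include partial grams such as "k", "ki", "kin" and "king", and after reindexing the documents, the original match query for "kin" on Title should return the "King Arthur" book.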

How to match query terms containing hyphens or trailing space in elasticsearch

The char_filter section of the Elasticsearch mapping documentation is kind of vague, and I'm having a lot of difficulty understanding if and how to use the mapping char filter: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html
Basically, the data we are storing in the index are IDs of type string that look like this: "008392342000". I want to be able to match such IDs when the query terms contain a hyphen or a trailing space, like this: "008392342-000 ".
How would you advise I set up the analyzer?
Currently this is the definition of the field:
"mappings": {
"client": {
"properties": {
"ucn": {
"type": "multi_field",
"fields": {
"ucn_autoc": {
"type": "string",
"index": "analyzed",
"index_analyzer": "autocomplete_index",
"search_analyzer": "autocomplete_search"
},
"ucn": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
Here are the settings for the index, containing the analyzers etc.:
"settings": {
"analysis": {
"filter": {
"autocomplete_ngram": {
"max_gram": 15,
"min_gram": 1,
"type": "edge_ngram"
},
"ngram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 8
}
},
"analyzer": {
"lowercase_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
},
"autocomplete_index": {
"filter": [
"lowercase",
"autocomplete_ngram"
],
"tokenizer": "keyword"
},
"ngram_index": {
"filter": [
"ngram_filter",
"lowercase"
],
"tokenizer": "keyword"
},
"autocomplete_search": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
},
"ngram_search": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
},
"index": {
"number_of_shards": 6,
"number_of_replicas": 1
}
}
}
You haven't provided your actual analyzers, what data goes in and what your expectations are, but based on the info you provided I would start with this:
{
"settings": {
"analysis": {
"char_filter": {
"my_mapping": {
"type": "mapping",
"mappings": [
"-=>"
]
}
},
"analyzer": {
"autocomplete_search": {
"tokenizer": "keyword",
"char_filter": [
"my_mapping"
],
"filter": [
"trim"
]
},
"autocomplete_index": {
"tokenizer": "keyword",
"filter": [
"trim"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"ucn": {
"type": "multi_field",
"fields": {
"ucn_autoc": {
"type": "string",
"index": "analyzed",
"index_analyzer": "autocomplete_index",
"search_analyzer": "autocomplete_search"
},
"ucn": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
The char_filter replaces - with nothing: -=>. I would also use the trim filter to get rid of any trailing or leading whitespace. I have no idea what your autocomplete_index analyzer looks like, so I just used a keyword tokenizer.
Testing the analyzer with GET /my_index/_analyze?analyzer=autocomplete_search&text= 0123-34742-000 results in:
"tokens": [
{
"token": "012334742000",
"start_offset": 0,
"end_offset": 17,
"type": "word",
"position": 1
}
]
which means it does eliminate the - and the whitespace.
And the typical query would be:
{
"query": {
"match": {
"ucn.ucn_autoc": " 0123-34742-000 "
}
}
}
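For completeness, the index-side analyzer (here just a keyword tokenizer plus trim, as sketched above) can be tested the same way:
GET /my_index/_analyze?analyzer=autocomplete_index&text=008392342000
It keeps the whole ID as a single token, so the cleaned-up search input ends up matching it exactly.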
