Elasticsearch query_string query: wildcard on a query with multiple tokens

For example, I have a record with the contents "FileV2UpdateRequest", and based on my analyzer it is broken into these tokens:
filev
2
updaterequest
I want to be able to search for filev2update* in a "query_string" query to find it, but for whatever reason the * doesn't match the rest of 'updaterequest' the way I would expect.
If I enter the query filev2 update* instead, it returns the results.
Is there anything I can do to make this work without needing the space?
I have tried setting auto_generate_phrase_queries to true, but that doesn't solve the issue either. It seems that when you add the wildcard symbol, the entire input is treated as one token rather than only the token the wildcard is touching.
If I add analyze_wildcard and set it to true, it tries to put the * on every token in the query: costv* 2* add*
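For reference, the kind of query being described looks roughly like this (the index and field names my_index and message are placeholders, not from the original post):
GET my_index/_search
{
  "query": {
    "query_string": {
      "fields": ["message"],
      "query": "filev2update*",
      "analyze_wildcard": true
    }
  }
}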

I think you can change your index analysis to use the word_delimiter token filter (the Word Delimiter Token Filter) when indexing your content.
If you use this filter, FileV2UpdateRequest will be analyzed into these tokens:
{
"tokens": [{
"token": "File",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 1
}, {
"token": "V",
"start_offset": 4,
"end_offset": 5,
"type": "word",
"position": 2
}, {
"token": "2",
"start_offset": 5,
"end_offset": 6,
"type": "word",
"position": 3
}, {
"token": "Update",
"start_offset": 6,
"end_offset": 12,
"type": "word",
"position": 4
}, {
"token": "Request",
"start_offset": 12,
"end_offset": 19,
"type": "word",
"position": 5
}]
}
For the search side you also need to use word_delimiter as a filter, without using the wildcard. filev2update will then be analyzed into these tokens:
{
"tokens": [{
"token": "file",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 1
}, {
"token": "V",
"start_offset": 4,
"end_offset": 5,
"type": "word",
"position": 2
}, {
"token": "2",
"start_offset": 5,
"end_offset": 6,
"type": "word",
"position": 3
}, {
"token": "update",
"start_offset": 6,
"end_offset": 12,
"type": "word",
"position": 4
}]
}
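Putting this together, a minimal sketch of index settings that wire word_delimiter into a custom analyzer might look like the following (the index and analyzer names are placeholders, and the exact _analyze request syntax depends on your Elasticsearch version):
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "delimiter_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["word_delimiter", "lowercase"]
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "delimiter_analyzer",
  "text": "FileV2UpdateRequest"
}
The same analyzer should be applied both at index time and at search time (via the field's analyzer, or analyzer plus search_analyzer), so that filev2update on the query side splits into the same sub-tokens as the indexed content.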

Related

Separators in the standard analyzer of Elasticsearch

I know that Elasticsearch's standard analyzer uses the standard tokenizer to generate tokens.
The Elasticsearch docs say it does grammar-based tokenization, but the separators used by the standard tokenizer are not clear.
My use case is as follows:
In my Elasticsearch index I have some fields which use the default standard analyzer.
In those fields I want the # character to be searchable and . to act as an additional separator.
Can I achieve my use case with the standard analyzer?
I checked which tokens it generates for the string hey john.s #100 is a test name.
POST _analyze
{
"text": "hey john.s #100 is a test name",
"analyzer": "standard"
}
It generated the following tokens:
{
"tokens": [
{
"token": "hey",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "john.s",
"start_offset": 4,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "100",
"start_offset": 12,
"end_offset": 15,
"type": "<NUM>",
"position": 2
},
{
"token": "is",
"start_offset": 16,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "a",
"start_offset": 19,
"end_offset": 20,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "test",
"start_offset": 21,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "name",
"start_offset": 26,
"end_offset": 30,
"type": "<ALPHANUM>",
"position": 6
}
]
}
So now I have a doubt: is only whitespace used as a separator in the standard tokenizer?
Thank you in advance.
Let's first see why it does not break tokens on . for some of the words:
The standard analyzer uses the standard tokenizer, and the standard tokenizer provides grammar-based tokenization based on the Unicode Text Segmentation algorithm. You can read more about the algorithm here, here and here. It is not using the whitespace tokenizer.
Now let's see how you can tokenize on . (dot) but not on #:
You can use the Character Group (char_group) tokenizer and provide the list of characters on which you want to tokenize.
POST _analyze
{
"tokenizer": {
"type": "char_group",
"tokenize_on_chars": [
"whitespace",
".",
"\n"
]
},
"text": "hey john.s #100 is a test name"
}
Response:
{
"tokens": [
{
"token": "hey",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "john",
"start_offset": 4,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "s",
"start_offset": 9,
"end_offset": 10,
"type": "word",
"position": 2
},
{
"token": "#100",
"start_offset": 11,
"end_offset": 15,
"type": "word",
"position": 3
},
{
"token": "is",
"start_offset": 16,
"end_offset": 18,
"type": "word",
"position": 4
},
{
"token": "a",
"start_offset": 19,
"end_offset": 20,
"type": "word",
"position": 5
},
{
"token": "test",
"start_offset": 21,
"end_offset": 25,
"type": "word",
"position": 6
},
{
"token": "name",
"start_offset": 26,
"end_offset": 30,
"type": "word",
"position": 7
}
]
}
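To apply this to your fields rather than only through the _analyze API, a minimal sketch of index settings defining a char_group-based custom analyzer might look like this (the tokenizer and analyzer names are placeholders, and char_group is only available in more recent Elasticsearch versions); you would then set this analyzer on the relevant fields in the mapping:
PUT my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "dot_and_space_tokenizer": {
          "type": "char_group",
          "tokenize_on_chars": ["whitespace", ".", "\n"]
        }
      },
      "analyzer": {
        "dot_and_space_analyzer": {
          "type": "custom",
          "tokenizer": "dot_and_space_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
Because # is not in tokenize_on_chars, tokens like #100 are preserved and remain searchable.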

elasticsearch edgengram copy_to field partial search not working

Below is the Elasticsearch mapping, with one field called hostname and another field called catch_all, which is basically a copy_to target field (there will be many more fields copying values into it):
{
"settings": {
"analysis": {
"filter": {
"myNGramFilter": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 40
}},
"analyzer": {
"myNGramAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "myNGramFilter"]
}
}
}
},
"mappings": {
"test": {
"properties": {
"catch_all": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"store": true,
"ignore_above": 256
},
"grams": {
"type": "text",
"store": true,
"analyzer": "myNGramAnalyzer"
}
}
},
"hostname": {
"type": "text",
"copy_to": "catch_all"
}
}
}
}
}
When I run the following _analyze request, I get the tokens below:
GET index/_analyze
{
"analyzer": "myNGramAnalyzer",
"text": "Dell PowerEdge R630"
}
{
"tokens": [
{
"token": "d",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "de",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "del",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "dell",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "p",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "po",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "pow",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "powe",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "power",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "powere",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "powered",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "poweredg",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "poweredge",
"start_offset": 5,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "r",
"start_offset": 15,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "r6",
"start_offset": 15,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "r63",
"start_offset": 15,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "r630",
"start_offset": 15,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
}
]
}
There is a token called "poweredge".
Right now we search with the query below:
{
"query": {
"multi_match": {
"fields": ["catch_all.grams"],
"query": "poweredge",
"operator": "and"
}
}
}
When we query with "poweredge" we get 1 result, but when we search for only "edge" there is no result.
Even a match query does not yield results for the search word "edge".
Can somebody help here?
I suggest not querying with multi_match for this use case, but using a match query instead. The edge n-gram works this way: it makes n-grams from the tokens generated by the tokenizer (standard, in your mapping) on your text. As written in the documentation - read here:
The edge_ngram tokenizer first breaks text down into words whenever it
encounters one of a list of specified characters, then it emits
N-grams of each word where the start of the N-gram is anchored to the
beginning of the word.
As you have verified with your call to the analyze API, it does not produce "edge" (from poweredge) as an n-gram, because it produces n-grams anchored to the beginning of the word; look at the output of your analyze API call. Take a look here: https://www.elastic.co/guide/en/elasticsearch/guide/master/ngrams-compound-words.html
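If you really need matches in the middle of a word (such as edge inside poweredge), one option from that guide is a plain ngram filter instead of edge_ngram. A rough sketch, reusing the filter and analyzer names from the question's mapping (the gram sizes are only an example, and plain n-grams will noticeably grow the index):
PUT my_test_index
{
  "settings": {
    "analysis": {
      "filter": {
        "myNGramFilter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4
        }
      },
      "analyzer": {
        "myNGramAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "myNGramFilter"]
        }
      }
    }
  }
}
With this, poweredge produces grams such as edg, dge and edge, so a search for edge can match. You would typically also configure a plain search_analyzer on the field so that the query text itself is not n-grammed.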

Analyze API does not work for Elasticsearch 1.7

We are running Elasticsearch 1.7 (planning to upgrade very soon) and I am trying to use the Analyze API to understand what the different analyzers do, but the result returned by Elasticsearch is not what I expect.
If I run the following query against our Elasticsearch instance:
GET _analyze
{
"analyzer": "stop",
"text": "Extremely good food! We had the happiest waiter and the crowd's always flowing!"
}
I get this result:
{
"tokens": [
{
"token": "analyzer",
"start_offset": 6,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "stop",
"start_offset": 18,
"end_offset": 22,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "text",
"start_offset": 30,
"end_offset": 34,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "extremely",
"start_offset": 38,
"end_offset": 47,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "good",
"start_offset": 48,
"end_offset": 52,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "food",
"start_offset": 53,
"end_offset": 57,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "we",
"start_offset": 59,
"end_offset": 61,
"type": "<ALPHANUM>",
"position": 7
},
{
"token": "had",
"start_offset": 62,
"end_offset": 65,
"type": "<ALPHANUM>",
"position": 8
},
{
"token": "the",
"start_offset": 66,
"end_offset": 69,
"type": "<ALPHANUM>",
"position": 9
},
{
"token": "happiest",
"start_offset": 70,
"end_offset": 78,
"type": "<ALPHANUM>",
"position": 10
},
{
"token": "waiter",
"start_offset": 79,
"end_offset": 85,
"type": "<ALPHANUM>",
"position": 11
},
{
"token": "and",
"start_offset": 86,
"end_offset": 89,
"type": "<ALPHANUM>",
"position": 12
},
{
"token": "the",
"start_offset": 90,
"end_offset": 93,
"type": "<ALPHANUM>",
"position": 13
},
{
"token": "crowd's",
"start_offset": 94,
"end_offset": 101,
"type": "<ALPHANUM>",
"position": 14
},
{
"token": "always",
"start_offset": 102,
"end_offset": 108,
"type": "<ALPHANUM>",
"position": 15
},
{
"token": "flowing",
"start_offset": 109,
"end_offset": 116,
"type": "<ALPHANUM>",
"position": 16
}
]
}
This does not make sense to me. I am using the stop analyzer, so why are the words "and" and "the" in the result? I have tried changing the analyzer to both whitespace and standard, but I get exactly the same result as above. There is no difference between them.
However, if I run the exact same query against an instance of Elasticsearch 5.x, the result no longer contains "and" and "the", and it looks much more like what I expect.
Is this because we are using 1.7 or is it something in our setup of Elasticsearch that is causing this issue?
Edit:
I am using the Sense plugin in Chrome to run my queries, and the plugin does not support GET with a request body, so it changes the request to a POST. The Analyze API in Elasticsearch 1.7 does not seem to support POST requests :( If I change the query to GET _analyze?analyzer=stop&text=THIS+is+a+test&pretty it works.
In 1.x the syntax is different from 2.x and 5.x. According to the 1.x documentation, you should be using the _analyze API like this:
GET _analyze?analyzer=stop
{
"text": "Extremely good food! We had the happiest waiter and the crowd's always flowing!"
}
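For completeness, the query-parameter form mentioned in the question's edit also works in 1.x (with the text URL-encoded), for example:
GET _analyze?analyzer=stop&text=Extremely+good+food&pretty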

Smart Chinese Analysis Elasticsearch returns unicodes

I'm trying to analyze documents in Elasticsearch using the Smart Chinese Analyzer, but instead of getting the analyzed Chinese tokens, Elasticsearch returns the Unicode code points of these characters. For example:
PUT /test_chinese
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"default": {
"type": "smartcn"
}
}
}
}
}
}
GET /test_chinese/_analyze?text='我说世界好!'
I expect to get every Chinese character, but instead I get:
{
"tokens": [
{
"token": "25105",
"start_offset": 3,
"end_offset": 8,
"type": "word",
"position": 4
},
{
"token": "35828",
"start_offset": 11,
"end_offset": 16,
"type": "word",
"position": 8
},
{
"token": "19990",
"start_offset": 19,
"end_offset": 24,
"type": "word",
"position": 12
},
{
"token": "30028",
"start_offset": 27,
"end_offset": 32,
"type": "word",
"position": 16
},
{
"token": "22909",
"start_offset": 35,
"end_offset": 40,
"type": "word",
"position": 20
}
]
}
Do you have any idea what's going on?
Thank you!
I found the cause of the problem: it seems there is a bug in Sense.
Here is the conversation with Zachary Tong, an Elasticsearch developer: https://discuss.elastic.co/t/smart-chinese-analysis-returns-unicodes-instead-of-chinese-tokens/37133
And here is the ticket for the bug: https://github.com/elastic/sense/issues/88
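As a workaround while that bug is open, one way to confirm that the analyzer itself behaves correctly is to call the API outside Sense, for example with curl (a sketch, assuming a local node on port 9200):
curl -G 'localhost:9200/test_chinese/_analyze?pretty' --data-urlencode 'text=我说世界好'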

Elasticsearch and Spanish Accents

I am trying to use Elasticsearch to index some data about research papers, but I am fighting with accents. For instance, if I use
GET /_analyze?tokenizer=standard&filter=asciifolding&text="Boletínes de investigaciónes"
I get:
{
"tokens": [
{
"token": "Bolet",
"start_offset": 1,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "nes",
"start_offset": 7,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "de",
"start_offset": 11,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "investigaci",
"start_offset": 14,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "nes",
"start_offset": 26,
"end_offset": 29,
"type": "<ALPHANUM>",
"position": 5
}
]
}
but I would expect to get something like this:
{
"tokens": [
{
"token": "Boletines",
"start_offset": 1,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "de",
"start_offset": 11,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "investigacion",
"start_offset": 14,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 4
}
]
}
What should I do?
To prevent the extra tokens from being formed, you need to use an alternative tokenizer, e.g. try the whitespace tokenizer.
Alternatively use a language analyzer and specify the language.
You should use an ASCII folding filter in your analyzer.
For example, the filter changes à to a.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-asciifolding-tokenfilter.html
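Building on the second suggestion, a minimal sketch of a custom analyzer that adds an asciifolding filter might look like this (the index and analyzer names are placeholders; swap the tokenizer for whitespace if you prefer the first suggestion):
PUT /my_papers
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}

GET /my_papers/_analyze
{
  "analyzer": "folding_analyzer",
  "text": "Boletínes de investigaciónes"
}
This should yield boletines, de and investigaciones as whole tokens. If you also want stemming (so that investigaciónes ends up closer to investigacion), the language analyzer route mentioned above is the alternative.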
