Whitespace in queries - Elasticsearch

I have an analyzer which ignores whitespace. When I search for a string without a space, it returns proper results. This is the analyzer:
{
"index": {
"number_of_shards": 1,
"analysis": {
"filter": {
"word_joiner": {
"type": "word_delimiter",
"catenate_all": true
}
},
"analyzer": {
"word_join_analyzer": {
"type": "custom",
"filter": [
"word_joiner"
],
"tokenizer": "keyword"
}
}
}
}
}
This is how it works:
curl -XGET "http://localhost:9200/cake/_analyze?analyzer=word_join_analyzer&pretty" -d 'ONE"\ "TWO'
Result:
{
"tokens" : [ {
"token" : "ONE",
"start_offset" : 1,
"end_offset" : 5,
"type" : "word",
"position" : 0
}, {
"token" : "ONETWO",
"start_offset" : 1,
"end_offset" : 13,
"type" : "word",
"position" : 0
}, {
"token" : "TWO",
"start_offset" : 7,
"end_offset" : 13,
"type" : "word",
"position" : 1
} ]
}
What I want is to also get a "token" : "ONE TWO" from this analyzer. How can I do this?
Thanks!

You need to enable the preserve_original setting, which is false by default:
{
"index": {
"number_of_shards": 1,
"analysis": {
"filter": {
"word_joiner": {
"type": "word_delimiter",
"catenate_all": true,
"preserve_original": true <--- add this
}
},
"analyzer": {
"word_join_analyzer": {
"type": "custom",
"filter": [
"word_joiner"
],
"tokenizer": "keyword"
}
}
}
}
}
This will yield:
{
"tokens": [
{
"token": "ONE TWO",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "ONE",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "ONETWO",
"start_offset": 0,
"end_offset": 7,
"type": "word",
"position": 0
},
{
"token": "TWO",
"start_offset": 4,
"end_offset": 7,
"type": "word",
"position": 1
}
]
}
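If you want to re-run the same check after updating the index settings, a request-body form of the _analyze call used above looks like the sketch below on recent Elasticsearch versions (the index name cake is taken from the question's curl call; the sample text "ONE TWO" is an assumption):
POST /cake/_analyze
{
  "analyzer": "word_join_analyzer",
  "text": "ONE TWO"
}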

Related

ElasticSearch Edge NGram Preserve Numbers

I'm working on creating an autocompletion API for residential addresses.
I would like to preserve the numbers, so I don't get the following problem:
Let's say the index contains a couple of documents:
{"fullAddressLine": "Kooimanweg 10 1442BZ Purmerend", "streetName": "Kooimanweg", houseNumber: "10", "postCode": "1442BZ", "cityName": "Purmerend"}
{"fullAddressLine": "Kooimanweg 1009 1442BZ Purmerend", "streetName": "Kooimanweg", houseNumber: "1009", "postCode": "1442BZ", "cityName": "Purmerend"}
{"fullAddressLine": "Kooimanweg 1011 1442BZ Purmerend", "streetName": "Kooimanweg", houseNumber: "1011", "postCode": "1442BZ", "cityName": "Purmerend"}
{"fullAddressLine": "Kooimanweg 1013 1442BZ Purmerend", "streetName": "Kooimanweg", houseNumber: "1013", "postCode": "1442BZ", "cityName": "Purmerend"}
These are the settings and mappings:
{
"settings": {
"analysis": {
"filter": {
"EdgeNGramFilter": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 40
}
},
"analyzer": {
"EdgeNGramAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"EdgeNGramFilter"
]
},
"keywordAnalyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"asciifolding",
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"fullAddressLine": {
"type": "text",
"analyzer": "EdgeNGramAnalyzer",
"search_analyzer": "standard",
"fields": {
"raw": {
"type": "text",
"analyzer": "keywordAnalyzer"
}
}
}
}
}
}
And this would be the ElasticSearch query:
{
"query": {
"bool": {
"must": [{
"match": {
"fullAddressLine": {
"query": "kooiman 10",
"operator": "and"
}
}
}]
}
}
}
The result of this is:
Kooimanweg 10 1442BZ Purmerend
Kooimanweg 1009 1442BZ Purmerend
Kooimanweg 1011 1442BZ Purmerend
Kooimanweg 1013 1442BZ Purmerend
This works, but I would only like to see this:
Kooimanweg 10 1442BZ Purmerend
How can I change the query or mappings/settings to achieve this result?
When using the "EdgeNgramAnalyzer" analyzer on "Test 1009" I get:
{
"tokens" : [
{
"token" : "t",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "te",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "tes",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "test",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "1",
"start_offset" : 5,
"end_offset" : 9,
"type" : "<NUM>",
"position" : 1
},
{
"token" : "10",
"start_offset" : 5,
"end_offset" : 9,
"type" : "<NUM>",
"position" : 1
},
{
"token" : "100",
"start_offset" : 5,
"end_offset" : 9,
"type" : "<NUM>",
"position" : 1
},
{
"token" : "1009",
"start_offset" : 5,
"end_offset" : 9,
"type" : "<NUM>",
"position" : 1
}
]
}
I want to preserve numbers so they don't get split.
Thanks to everyone in advance.
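A rough sketch of one possible direction (not a verified solution; the fullAddressLine.std subfield below is hypothetical and would need to be added to the mapping with the standard analyzer): keep the required edge-ngram match and add a should clause on the standard-analyzed subfield, so that only the document whose full token is exactly 10 gets the extra score. Note that this reorders results rather than filtering the partial matches out.
{
  "query": {
    "bool": {
      "must": [{
        "match": {
          "fullAddressLine": {
            "query": "kooiman 10",
            "operator": "and"
          }
        }
      }],
      "should": [{
        "match": {
          "fullAddressLine.std": "kooiman 10"
        }
      }]
    }
  }
}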

Split text containing <number><unit> into 3 tokens

We index a lot of documents that may contain titles like "lightbulb 220V" or "Box 23cm" or "Varta Super-charge battery 74Ah".
However, our users tend to separate the number and the unit with whitespace when searching, so when they search for "Varta 74 Ah" they do not get what they expect.
The above is a simplification of the problem, but the main question is hopefully valid. How can I analyze "Varta Super-charge battery 74Ah" so that (on top of other tokens) 74, Ah and 74Ah are created?
Thanks,
Michal
I guess this will help you:
PUT index_name
{
"settings": {
"analysis": {
"filter": {
"custom_filter": {
"type": "word_delimiter",
"split_on_numerics": true
}
},
"analyzer": {
"custom_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": ["custom_filter"]
}
}
}
}
}
You can use the split_on_numerics property in your custom filter. This will give you the following response:
POST
POST /index_name/_analyze
{
"analyzer": "custom_analyzer",
"text": "Varta Super-charge battery 74Ah"
}
Response
{
"tokens" : [
{
"token" : "Varta",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "Super",
"start_offset" : 6,
"end_offset" : 11,
"type" : "word",
"position" : 1
},
{
"token" : "charge",
"start_offset" : 12,
"end_offset" : 18,
"type" : "word",
"position" : 2
},
{
"token" : "battery",
"start_offset" : 19,
"end_offset" : 26,
"type" : "word",
"position" : 3
},
{
"token" : "74",
"start_offset" : 27,
"end_offset" : 29,
"type" : "word",
"position" : 4
},
{
"token" : "Ah",
"start_offset" : 29,
"end_offset" : 31,
"type" : "word",
"position" : 5
}
]
}
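As a quick usage sketch (the title field, document and query below are assumptions, not part of the original answer): if custom_analyzer is attached to a field, a spaced query such as "Varta 74 Ah" matches a value indexed as "Varta Super-charge battery 74Ah", because both sides are split into Varta / 74 / Ah. On 7.x this could look like:
PUT index_name/_mapping
{
  "properties": {
    "title": { "type": "text", "analyzer": "custom_analyzer" }
  }
}

POST index_name/_doc/1
{ "title": "Varta Super-charge battery 74Ah" }

POST index_name/_search
{
  "query": {
    "match": { "title": { "query": "Varta 74 Ah", "operator": "and" } }
  }
}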
You would need to create a Custom Analyzer which implements an Ngram Tokenizer and then apply it to the text field you create.
Below is the sample mapping, document, query and the response:
Mapping:
PUT my_split_index
{
"settings": {
"index":{
"max_ngram_diff": 3
},
"analysis": {
"analyzer": {
"my_analyzer": { <---- Custom Analyzer
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 5,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"properties": {
"product":{
"type": "text",
"analyzer": "my_analyzer", <--- Note this as how custom analyzer is applied on this field
"fields": {
"keyword":{
"type": "keyword"
}
}
}
}
}
}
The feature that you are looking for is called Ngram, which creates multiple tokens from a single token. The size of the tokens depends on the min_gram and max_gram settings mentioned above.
Note that I've set max_ngram_diff to 3 because in version 7.x ES's default value is 1. Looking at your use case, I've set it to 3; this value is nothing but max_gram - min_gram.
Sample Documents:
POST my_split_index/_doc/1
{
"product": "Varta 74 Ah"
}
POST my_split_index/_doc/2
{
"product": "lightbulb 220V"
}
Query Request:
POST my_split_index/_search
{
"query": {
"match": {
"product": "74Ah"
}
}
}
Response:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.7029606,
"hits" : [
{
"_index" : "my_split_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.7029606,
"_source" : {
"product" : "Varta 74 Ah"
}
}
]
}
}
Additional Info:
To understand what tokens are actually generated, you can make use of the Analyze API below:
POST my_split_index/_analyze
{
"analyzer": "my_analyzer",
"text": "Varta 74 Ah"
}
You can see that the tokens below get generated when I execute the above API:
{
"tokens" : [
{
"token" : "Va",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "Var",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
},
{
"token" : "Vart",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 2
},
{
"token" : "Varta",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 3
},
{
"token" : "ar",
"start_offset" : 1,
"end_offset" : 3,
"type" : "word",
"position" : 4
},
{
"token" : "art",
"start_offset" : 1,
"end_offset" : 4,
"type" : "word",
"position" : 5
},
{
"token" : "arta",
"start_offset" : 1,
"end_offset" : 5,
"type" : "word",
"position" : 6
},
{
"token" : "rt",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 7
},
{
"token" : "rta",
"start_offset" : 2,
"end_offset" : 5,
"type" : "word",
"position" : 8
},
{
"token" : "ta",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 9
},
{
"token" : "74",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 10
},
{
"token" : "Ah",
"start_offset" : 9,
"end_offset" : 11,
"type" : "word",
"position" : 11
}
]
}
Notice that the query mentioned in the Query Request section is 74Ah, yet it still returns the document. That is because ES applies the analyzer twice: at index time and at search time. By default, if you do not specify a search_analyzer, the same analyzer applied at index time is also applied at query time.
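To see the search-time half of that in action, you can run the same analyzer over the query string itself (this extra call is an illustration, not part of the original answer); among the grams produced for 74Ah are 74 and Ah, which also appear in the indexed token list above, and that is why the match succeeds:
POST my_split_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "74Ah"
}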
Hope this helps!
You can define your index mapping as below and see that it generates the tokens you mentioned in your question. It also doesn't create a lot of tokens, so the size of your index would be smaller.
Index mapping
{
"settings": {
"analysis": {
"filter": {
"my_filter": {
"type": "word_delimiter",
"split_on_numerics": "true",
"catenate_words": "true",
"preserve_original": "true"
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"my_filter",
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
And check the tokens generated using the _analyze API:
{
"text": "Varta Super-charge battery 74Ah",
"analyzer" : "my_analyzer"
}
Tokens generated
{
"tokens": [
{
"token": "varta",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "super-charge",
"start_offset": 6,
"end_offset": 18,
"type": "word",
"position": 1
},
{
"token": "super",
"start_offset": 6,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "supercharge",
"start_offset": 6,
"end_offset": 18,
"type": "word",
"position": 1
},
{
"token": "charge",
"start_offset": 12,
"end_offset": 18,
"type": "word",
"position": 2
},
{
"token": "battery",
"start_offset": 19,
"end_offset": 26,
"type": "word",
"position": 3
},
{
"token": "74ah",
"start_offset": 27,
"end_offset": 31,
"type": "word",
"position": 4
},
{
"token": "74",
"start_offset": 27,
"end_offset": 29,
"type": "word",
"position": 4
},
{
"token": "ah",
"start_offset": 29,
"end_offset": 31,
"type": "word",
"position": 5
}
]
}
Edit: The tokens generated by the two answers might look the same at first glance, but I made sure this satisfies all the requirements given in the question, and on closer inspection the generated tokens are quite different; the details are below:
The tokens I generate are all lower-cased to provide case-insensitive search, which is implicit in all search engines.
The critical thing to note is the tokens 74ah and supercharge; these are mentioned in the question, and my analyzer provides them as well.
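As a quick check against the whitespace issue from the question (assuming the settings and mapping above were created as an index called products; the index name, document and query are illustrations, not part of the original answer): a spaced search such as "Varta 74 Ah" matches a title indexed as "Varta Super-charge battery 74Ah", since the index contains the tokens varta, 74 and ah, and a search for "74Ah" works as well via the 74ah / 74 / ah tokens.
POST products/_doc/1
{ "title": "Varta Super-charge battery 74Ah" }

POST products/_search
{
  "query": {
    "match": {
      "title": { "query": "Varta 74 Ah", "operator": "and" }
    }
  }
}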

Elasticsearch Analyzer first 4 and last 4 characters

With Elasticsearch, I would like to specify a search analyzer where the first 4 characters and last 4 characters are tokenized.
For example: supercalifragilisticexpialidocious => ["supe", "ious"]
I have had a go with an ngram as follows
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 4,
"max_gram": 4
}
}
}
}
}
I am testing the analyzer as follows
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "supercalifragilisticexpialidocious."
}
And I get back 'supe', loads of tokens I don't want, and finally 'ous.'. The problem for me is: how can I take only the first and last results from the ngram tokenizer specified above?
{
"tokens": [
{
"token": "supe",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "uper",
"start_offset": 1,
"end_offset": 5,
"type": "word",
"position": 1
},
...
{
"token": "ciou",
"start_offset": 29,
"end_offset": 33,
"type": "word",
"position": 29
},
{
"token": "ious",
"start_offset": 30,
"end_offset": 34,
"type": "word",
"position": 30
},
{
"token": "ous.",
"start_offset": 31,
"end_offset": 35,
"type": "word",
"position": 31
}
]
}
One way to achieve this is to leverage the pattern_capture token filter and take the first 4 and last 4 characters.
First, define your index like this:
PUT my_index
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"first_last_four"
]
}
},
"filter": {
"first_last_four": {
"type": "pattern_capture",
"preserve_original": false,
"patterns": [
"""(\w{4}).*(\w{4})"""
]
}
}
}
}
}
}
Then, you can test your new custom analyzer:
POST my_index/_analyze
{
"text": "supercalifragilisticexpialidocious",
"analyzer": "my_analyzer"
}
And see that the tokens you expect are there:
{
"tokens" : [
{
"token" : "supe",
"start_offset" : 0,
"end_offset" : 34,
"type" : "word",
"position" : 0
},
{
"token" : "ious",
"start_offset" : 0,
"end_offset" : 34,
"type" : "word",
"position" : 0
}
]
}
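To actually use this at search time, a minimal follow-up sketch (the code_word field name is an assumption, and this uses the 7.x typeless mapping API) is to attach the analyzer to a field, so both indexed values and query strings are reduced to their first and last four characters:
PUT my_index/_mapping
{
  "properties": {
    "code_word": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}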

Search for both numbers and text using an in-built or custom analyzer in Elasticsearch

This question is a continuation of my previous SO question. I have some text on which I want to search both numbers and text.
My Text:-
8080.foobar.getFooLabelFrombar(test.java:91)
And I want to search on getFooLabelFrombar, fooBar, 8080 and 91.
Earlier I was using the simple analyzer, which tokenized the above text into the tokens below:
{
"tokens": [
{
"token": "foobar",
"start_offset": 10,
"end_offset": 16,
"type": "word",
"position": 2
},
{
"token": "getfoolabelfrombar",
"start_offset": 17,
"end_offset": 35,
"type": "word",
"position": 3
},
{
"token": "test",
"start_offset": 36,
"end_offset": 40,
"type": "word",
"position": 4
},
{
"token": "java",
"start_offset": 41,
"end_offset": 45,
"type": "word",
"position": 5
}
]
}
Because of this, searching on foobar and getFooLabelFrombar returned results, but searching on 8080 and 91 did not, as the simple analyzer doesn't tokenize numbers.
Then, as suggested in the previous SO post, I changed the analyzer to standard, which makes the numbers searchable but not the other two word search strings, as the standard analyzer creates the tokens below:
{
"tokens": [
{
"token": "8080",
"start_offset": 0,
"end_offset": 4,
"type": "<NUM>",
"position": 1
},
{
"token": "foobar.getfoolabelfrombar",
"start_offset": 5,
"end_offset": 35,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "test.java",
"start_offset": 36,
"end_offset": 45,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "91",
"start_offset": 46,
"end_offset": 48,
"type": "<NUM>",
"position": 4
}
]
}
I went through all the existing analyzers in ES, but nothing seems to fulfil my requirement. I tried creating the custom analyzer below, but it doesn't work either.
{
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "letter"
"filter" : ["lowercase", "extract_numbers"]
}
},
"filter" : {
"extract_numbers" : {
"type" : "keep_types",
"types" : [ "<NUM>","<ALPHANUM>","word"]
}
}
}
}
Please suggest how I can build my custom analyzer to suit my requirements.
What about using a character filter to replace dots with spaces?
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": ["replace_dots"]
}
},
"char_filter": {
"replace_dots": {
"type": "mapping",
"mappings": [
". => \\u0020"
]
}
}
}
}
}
POST /my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "8080.foobar.getFooLabelFrombar(test.java:91)"
}
Which outputs what you want:
{
"tokens" : [
{
"token" : "8080",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<NUM>",
"position" : 0
},
{
"token" : "foobar",
"start_offset" : 10,
"end_offset" : 16,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "getFooLabelFrombar",
"start_offset" : 17,
"end_offset" : 35,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "test",
"start_offset" : 36,
"end_offset" : 40,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "java",
"start_offset" : 41,
"end_offset" : 45,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "91",
"start_offset" : 46,
"end_offset" : 48,
"type" : "<NUM>",
"position" : 6
}
]
}
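As a usage sketch (the message field, document and query below are assumptions, not part of the original answer), attaching my_analyzer to a field makes the numeric parts such as 8080 and 91 searchable; note that the analyzer as defined has no lowercase filter, so the case of terms like getFooLabelFrombar must match exactly:
PUT /my_index/_mapping
{
  "properties": {
    "message": { "type": "text", "analyzer": "my_analyzer" }
  }
}

POST /my_index/_doc/1
{ "message": "8080.foobar.getFooLabelFrombar(test.java:91)" }

POST /my_index/_search
{
  "query": {
    "match": { "message": "91" }
  }
}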

Custom tokenizer not generating tokens as expected if text contains special characters like #

I have defined the following tokenizer:
PUT /testanlyzer2
{
"settings" : {
"analysis" : {
"analyzer" : {
"my_ngram_analyzer" : {
"tokenizer" : "my_ngram_tokenizer"
}
},
"tokenizer" : {
"my_ngram_tokenizer" : {
"type" : "nGram",
"min_gram" : "1",
"max_gram" : "3",
"token_chars": [ "letter", "digit","symbol","currency_symbol","modifier_symbol","other_symbol" ]
}
}
}
}
}
For the following request:
GET /testanlyzer2/_analyze?analyzer=my_ngram_analyzer&text="i a#m not available 9177"
Result is:
{
"tokens": [
{
"token": "i",
"start_offset": 1,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "a",
"start_offset": 3,
"end_offset": 4,
"type": "word",
"position": 2
}
]
}
For the following request:
GET /testanlyzer2/_analyze?analyzer=my_ngram_analyzer&text="i a#m not available 9177"
Result is:
{
"tokens": [
{
"token": "i",
"start_offset": 1,
"end_offset": 2,
"type": "word",
"position": 1
},
{
"token": "a",
"start_offset": 3,
"end_offset": 4,
"type": "word",
"position": 2
}
]
}
For the following request:
GET /testanlyzer2/_analyze?analyzer=my_ngram_analyzer&text="i a#m not available 9177"
Result is:
Request failed to get to the server (status code: 0):
The expected result should contain these special characters (#, currency symbols, etc.) as tokens. Please correct me if anything is wrong in my custom tokenizer.
--Thanks
# is a special character in Sense (if you are using Marvel's Sense dashboard) and it comments out the line.
To avoid any HTML-escaping/Sense special-character issues, I would test it like this:
PUT /testanlyzer2
{
"settings": {
"analysis": {
"analyzer": {
"my_ngram_analyzer": {
"tokenizer": "keyword",
"filter": [
"substring"
]
}
},
"filter": {
"substring": {
"type": "nGram",
"min_gram": 1,
"max_gram": 3
}
}
}
},
"mappings": {
"test": {
"properties": {
"text": {
"type": "string",
"analyzer": "my_ngram_analyzer"
}
}
}
}
}
POST /testanlyzer2/test/1
{
"text": "i a#m not available 9177"
}
POST /testanlyzer2/test/2
{
"text": "i a#m not available 9177"
}
GET /testanlyzer2/test/_search
{
"fielddata_fields": ["text"]
}
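On newer Elasticsearch versions you can also sidestep the escaping problem by moving the text out of the URL and into a JSON request body, so the # never appears on the request line; this call is an assumption, not part of the original answer:
GET /testanlyzer2/_analyze
{
  "analyzer": "my_ngram_analyzer",
  "text": "i a#m not available 9177"
}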
