Elasticsearch custom analyzer for hyphens, underscores, and numbers

Admittedly, I'm not that well versed on the analysis part of ES. Here's the index layout:
{
"mappings": {
"event": {
"properties": {
"ipaddress": {
"type": "string"
},
"hostname": {
"type": "string",
"analyzer": "my_analyzer",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
},
"settings": {
"analysis": {
"filter": {
"my_filter": {
"type": "word_delimiter",
"preserve_original": true
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": ["lowercase", "my_filter"]
}
}
}
}
}
You can see that I've attempted to use a custom analyzer for the hostname field. This kind of works when I use this query to find the host named "WIN_1":
{
"query": {
"match": {
"hostname": "WIN_1"
}
}
}
The issue is that it also returns any hostname that has a 1 in it. Using the _analyze endpoint, I can see that the numbers are tokenized as well.
{
"tokens": [
{
"token": "win_1",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 1
},
{
"token": "win",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 1
},
{
"token": "1",
"start_offset": 4,
"end_offset": 5,
"type": "word",
"position": 2
}
]
}
What I'd like to be able to do is search for WIN and get back any host that has WIN in its name. But I also need to be able to search for WIN_1 and get back that exact host or any host with WIN_1 in its name. Below is some test data.
{
"ipaddress": "192.168.1.253",
"hostname": "WIN_8_ENT_1"
}
{
"ipaddress": "10.0.0.1",
"hostname": "server1"
}
{
"ipaddress": "172.20.10.36",
"hostname": "ServA-1"
}
Hopefully someone can point me in the right direction. It could be that my simple query isn't the right approach either. I've pored over the ES docs, but they aren't great with examples.

You could change your analysis to use a pattern analyzer that discards the digits and underscores:
{
"analysis": {
"analyzer": {
"word_only": {
"type": "pattern",
"pattern": "([^\\p{L}]+)"
}
}
}
}
Using the analyze API:
curl -XGET 'localhost:9200/{yourIndex}/_analyze?analyzer=word_only&pretty=true' -d 'WIN_8_ENT_1'
returns:
"tokens" : [ {
"token" : "win",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
}, {
"token" : "ent",
"start_offset" : 6,
"end_offset" : 9,
"type" : "word",
"position" : 2
} ]
Your mapping would become:
{
"event": {
"properties": {
"ipaddress": {
"type": "string"
},
"hostname": {
"type": "string",
"analyzer": "word_only",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
You can use a multi_match query to get the results you want:
{
"query": {
"multi_match": {
"fields": [
"hostname",
"hostname.raw"
],
"query": "WIN_1"
}
}
}
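To see why both fields are needed, you can run the query string itself through the analyzer, the same way as above; under word_only, "WIN_1" should reduce to the single token "win", so the exact form only survives in hostname.raw:
```shell
curl -XGET 'localhost:9200/{yourIndex}/_analyze?analyzer=word_only&pretty=true' -d 'WIN_1'
```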

Here's the analyzer and queries I ended up with:
{
"mappings": {
"event": {
"properties": {
"ipaddress": {
"type": "string"
},
"hostname": {
"type": "string",
"analyzer": "hostname_analyzer",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
},
"settings": {
"analysis": {
"filter": {
"hostname_filter": {
"type": "pattern_capture",
"preserve_original": 0,
"patterns": [
"(\\p{Ll}{3,})"
]
}
},
"analyzer": {
"hostname_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [ "lowercase", "hostname_filter" ]
}
}
}
}
}
Queries:
Find host name starting with:
{
"query": {
"prefix": {
"hostname.raw": "WIN_8"
}
}
}
Find host name containing:
{
"query": {
"multi_match": {
"fields": [
"hostname",
"hostname.raw"
],
"query": "WIN"
}
}
}
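For completeness, an exact-match lookup can go directly against the not_analyzed sub-field; a sketch using the mapping above:
Find host name exactly:
```json
{
"query": {
"term": {
"hostname.raw": "WIN_8_ENT_1"
}
}
}
```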
Thanks to Dan for getting me in the right direction.

When ES 1.4 is released, there will be a new token filter called keep_types that will allow you to keep only certain token types once the string is tokenized (i.e. keep only words, only numbers, etc.).
Check it out here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-keep-types-tokenfilter.html#analysis-keep-types-tokenfilter
This may be a more convenient solution for your needs in the future.
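As a sketch of what that might look like (the filter and analyzer names here are made up; keep_types keeps only tokens whose type matches the list):
```json
{
"analysis": {
"filter": {
"words_only_filter": {
"type": "keep_types",
"types": [ "<ALPHANUM>" ]
}
},
"analyzer": {
"words_only": {
"type": "custom",
"tokenizer": "standard",
"filter": [ "lowercase", "words_only_filter" ]
}
}
}
}
```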

It looks like you want to apply two different types of searches on your hostname field. One for exact matches, and one for a variation of wildcard (maybe in your specific case, a prefix query).
After trying to implement all types of different searches using several different analyzers, I've found it sometimes simpler to add another field to represent each type of search you want to do. Is there a reason you do not want to add another field like the following:
{
"ipaddress": "192.168.1.253",
"hostname": "WIN_8_ENT_1",
"system": "WIN"
}
Otherwise, you could consider writing your own custom filter that does effectively the same thing under the hood. Your filter will read in your hostname field and index the exact keyword and a substring that matches your stemming pattern (e.g. WIN in WIN_8_ENT_1).
I do not think there is any existing analyzer/filter combination that can do what you are looking for, assuming I have understood your requirements correctly.

Elasticsearch analyzer doesn't replace the apostrophes (')

Using Elasticsearch v7.0
This is the analyzer I've implemented (http://phoenyx2:9200/search_dev/_settings?pretty=true):
{
"search_dev": {
"settings": {
"index": {
"refresh_interval": "30s",
"number_of_shards": "1",
"provided_name": "search_dev",
"creation_date": "1558444846417",
"analysis": {
"analyzer": {
"my_standard": {
"filter": [
"lowercase"
],
"char_filter": [
"my_char_filter"
],
"tokenizer": "standard"
}
},
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"' => "
]
}
}
},
"number_of_replicas": "1",
"uuid": "hYz0ZlWFTDKearW1rpx8lw",
"version": {
"created": "7000099"
}
}
}
}
}
I've recreated the whole index, and there is still no change in the analysis.
I've also run this against phoenyx2:9200/search_dev/_analyze:
{
"analyzer":"my_standard",
"field":"stakeholderName",
"text": "test't"
}
Reply was:
{
"tokens": [
{
"token": "test't",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
}
]
}
I was hoping the returned token would be "testt".

When you re-create an index, it's not enough to define a new analyzer in the settings.
You also have to specify in the mapping which fields use which analyzer, for example:
"mappings": {
"properties": {
"stakeholderName": {
"type": "text",
"analyzer": "my_standard"
}
}
}
Your mapping (probably) looks like:
"mappings": {
"properties": {
"stakeholderName": {
"type": "text"
}
}
}
Basically, if you run your "analyze" test again and drop the field:
{
"analyzer":"my_standard",
"text": "test't"
}
You'll get:
{
"token": "testt",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
}
As you expected. So, bad news buddy: you have to re-index all your data again, and this time specify in the mapping which analyzer you want used for each field; otherwise Elasticsearch will default to its standard analyzer every time.
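A minimal sketch of re-creating the index with both pieces wired together (same names and ES 7.x syntax as above), so the field actually uses my_standard:
```json
PUT search_dev
{
"settings": {
"analysis": {
"analyzer": {
"my_standard": {
"tokenizer": "standard",
"char_filter": ["my_char_filter"],
"filter": ["lowercase"]
}
},
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": ["' => "]
}
}
}
},
"mappings": {
"properties": {
"stakeholderName": {
"type": "text",
"analyzer": "my_standard"
}
}
}
}
```
After re-indexing, the _analyze call with "field": "stakeholderName" should also return "testt".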

Get exact match after doing mapping as not_analyzed

I have an Elasticsearch type that I mapped as below:
"mappings": {
"jardata": {
"properties": {
"groupID": {
"index": "not_analyzed",
"type": "string"
},
"artifactID": {
"index": "not_analyzed",
"type": "string"
},
"directory": {
"type": "string"
},
"jarFileName": {
"index": "not_analyzed",
"type": "string"
},
"version": {
"index": "not_analyzed",
"type": "string"
}
}
}
}
I left the directory field analyzed because I want to be able to supply only the last folder and still get results. But when I want to search for a specific directory I need to supply the whole path, since the same folder name can exist under two different paths. The problem is that, because the field is analyzed, the query returns all matching data instead of the specific directory I want.
In other words, I want the field to act as both analyzed and not_analyzed. Is there a way to do that?
Let's say you have the following document indexed:
{
"directory": "/home/docs/public"
}
The standard analyzer is not enough in your case, as it will create the following terms while indexing:
[home, docs, public]
Note that it misses the [/home/docs/public] token - characters like "/" act as separators here.
One solution could be to use an NGram tokenizer with the punctuation character class in the token_chars list. Elasticsearch would then treat "/" as if it were a letter or digit. This allows searching with tokens such as:
[/hom, /home, ..., /home/docs/publi, /home/docs/public, ..., /docs/public, etc...]
Index mapping:
{
"settings": {
"analysis": {
"analyzer": {
"ngram_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 4,
"max_gram": 18,
"token_chars": [
"letter",
"digit",
"punctuation"
]
}
}
}
},
"mappings": {
"jardata": {
"properties": {
"directory": {
"type": "string",
"analyzer": "ngram_analyzer"
}
}
}
}
}
Now both search queries:
{
"query": {
"bool" : {
"must" : {
"term" : {
"directory": "/docs/private"
}
}
}
}
}
and
{
"query": {
"bool" : {
"must" : {
"term" : {
"directory": "/home/docs/private"
}
}
}
}
}
will return the indexed document.
One thing you have to consider is the maximum token length specified in the "max_gram" setting. For directory paths it may need to be longer.
An alternative solution is to use the Whitespace tokenizer, which breaks the phrase into terms only on whitespace, together with an NGram filter, with the following mapping:
{
"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": 4,
"max_gram": 20
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"ngram_filter"
]
}
}
}
},
"mappings": {
"jardata": {
"properties": {
"directory": {
"type": "string",
"analyzer": "my_analyzer"
}
}
}
}
}
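You can check what this analyzer produces with the analyze API (index name is a placeholder); since the whole path contains no whitespace, the full path itself should appear among the n-grams as long as it is no longer than max_gram:
```shell
curl -XGET 'localhost:9200/{yourIndex}/_analyze?analyzer=my_analyzer&pretty=true' -d '/home/docs/public'
```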
Update the mapping of the directory field to contain a raw sub-field, like this:
"directory": {
"type": "string",
"fields": {
"raw": {
"index": "not_analyzed",
"type": "string"
}
}
}
And modify your query to include directory.raw, which will be treated as not_analyzed. See the multi-fields documentation for details.
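For example, an exact lookup for the document above would be a term query on the raw sub-field (a sketch):
```json
{
"query": {
"term": {
"directory.raw": "/home/docs/public"
}
}
}
```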

Elasticsearch keyword tokenizer doesn't work with phonetic analyzer

I want to add a custom phonetic analyzer, and I also don't want my given string analyzed. Suppose I have two strings:
KAMRUL ISLAM
KAMRAL ISLAM
I don't want any result for the query string KAMRUL, but I do want both back for the query string KAMRUL ISLAM.
For this, I have set up a custom phonetic analyzer with a keyword tokenizer.
Index settings:
PUT /my_index
{
"settings": {
"analysis": {
"filter": {
"dbl_metaphone": {
"tokenizer": "keyword",
"type": "phonetic",
"encoder": "double_metaphone"
}
},
"analyzer": {
"dbl_metaphone": {
"tokenizer": "keyword",
"filter": "dbl_metaphone"
}
}
}
}
}
Type Mappings:
PUT /my_index/_mapping/my_type
{
"properties": {
"name": {
"type": "string",
"analyzer": "dbl_metaphone"
}
}
}
I have inserted data with:
PUT /my_index/my_type/5
{
"name": "KAMRUL ISLAM"
}
And my query string:
GET /my_index/my_type/_search
{
"query": {
"match": {
"name": {
"query": "KAMRAL"
}
}
}
}
Unfortunately I get both strings back. I am using ES 1.7.1. Is there any way to solve this?
Additionally, when I run:
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=dbl_metaphone' -d 'KAMRUL ISLAM'
I got the result:
{
"tokens": [
{
"token": "KMRL",
"start_offset": 0,
"end_offset": 12,
"type": "word",
"position": 1
}
]
}
And while running:
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=dbl_metaphone' -d 'KAMRAL'
I got:
{
"tokens": [
{
"token": "KMRL",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 1
}
]
}

Elasticsearch: index first char of string

I'm using version 5.3.
I have a text field a. I'd like to aggregate on the first char of a. I also need the entire original value.
I'm assuming the most efficient way is to have a keyword field a.firstLetter with a custom normalizer. I've tried to achieve this with a pattern replace char filter but am struggling with the regexp.
Am I going at this entirely wrong? Can you help me?
EDIT
This is what I've tried.
settings.json
{
"settings": {
"index": {
"analysis": {
"char_filter": {
"first_char": {
"type": "pattern_replace",
"pattern": "(?<=^.)(.*)",
"replacement": ""
}
},
"normalizer": {
"first_letter": {
"type": "custom",
"char_filter": ["first_char"],
"filter": ["lowercase"]
}
}
}
}
}
}
mappings.json
{
"properties": {
"a": {
"type": "text",
"index_options": "positions",
"fields": {
"firstLetter": {
"type": "keyword",
"normalizer": "first_letter"
}
}
}
}
}
I get no buckets when I try to aggregate like so:
"aggregations": {
"grouping": {
"terms": {
"field": "a.firstLetter"
}
}
}
So basically my approach was "replace all but the first char with an empty string." The regexp is something I was able to gather by googling.
EDIT 2
I had misconfigured the normalizer (I've fixed the examples above). The correct configuration reveals that normalizers do not support pattern-replace char filters, due to issue 23142. Apparently support will land in version 5.4 at the earliest.
So are there any other options? I'd hate to do this in code by adding a field to the doc for the first letter, since I'm using Elasticsearch features for every other aggregation.

You can use the truncate filter with a length of one:
PUT foo
{
"mappings": {
"bar" : {
"properties": {
"name" : {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
},
"settings": {
"index": {
"analysis": {
"analyzer" : {
"my_analyzer" : {
"type" : "custom",
"tokenizer" : "keyword",
"filter" : [ "my_filter", "lowercase" ]
}
},
"filter": {
"my_filter": {
"type": "truncate",
"length": 1
}
}
}
}
}
}
GET foo/_analyze
{
"field" : "name",
"text" : "New York"
}
# response
{
"tokens": [
{
"token": "n",
"start_offset": 0,
"end_offset": 8,
"type": "word",
"position": 0
}
]
}
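Note that a terms aggregation on an analyzed text field still needs fielddata; assuming you enable "fielddata": true on name (or put this analyzer on a dedicated sub-field), a sketch of the aggregation would be:
```json
{
"size": 0,
"aggs": {
"by_first_letter": {
"terms": {
"field": "name"
}
}
}
}
```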

In a span_first query, can we specify the "end" parameter based on the actual string stored in ES, or do we have to specify it in terms of the tokens stored in ES?

I asked a previous question here: Query in Elasticsearch for retrieving strings that start with a particular word on elasticsearch. My problem was solved by using a span_first query, but now it has changed a bit. My mapping has changed, because I now want to store words ending with an apostrophe "s" as "word", "words", and "word's". For example:
"joseph's" -> "joseph's", "josephs", "joseph"
My mapping is given below:
curl -X PUT "http://localhost:9200/colleges/" -d
'{
"settings": {
"index": {
"analysis": {
"char_filter": {
"apostrophe_comma": {
"type": "pattern_replace",
"pattern": "\\b((\\w+)\\u0027S)\\b",
"replacement": "$1 $2s $2"
}
},
"analyzer": {
"simple_wildcard": {
"type": "custom",
"tokenizer": "whitespace",
"char_filter" : ["apostrophe_comma"],
"filter": ["lowercase", "unique"]
}
}
}
}
},
"mappings" : {
"college": {
"properties":{
"college_name" : { "type" : "string", "index": "analyzed", "analyzer": "simple_wildcard"}
}
}
}
}'
The span_first query I was using:
"span_first" : {
"match" : {
"span_term" : { "college_name" : first_string[0] }
},
"end" : 1
}
Now for the problem I am facing. Suppose I have "Donald Duck's". If anyone searches for "Donald Duck", "Donald Duck's", "Donald Ducks", etc., I want "Donald Duck's" back, but with the span_first query that is not happening: due to the mapping I now have four tokens, "donald", "duck", "ducks", and "duck's". For "donald" the "end" used in the span_first query is 1, but for the other three I used 2, and since "end" differs across tokens derived from the same word, I am not getting the desired result.
In short: span_first uses the "end" parameter to bound how far from the beginning my token may appear. Because my mapping breaks the one word "Duck's" into "duck's", "ducks", and "duck", each has a different "end" value, but I can only use one "end" parameter per query, so I don't know how to get my desired output.
If any of you have worked with the span_first query, please help.
You can use the english possessive stemmer to remove the trailing 's, and the english stemmer (which maps to the Porter stemming algorithm) to handle plurals.
POST colleges
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"simple_wildcard": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"unique",
"english_possessive_stemmer",
"light_english_stemmer"
]
}
},
"filter": {
"light_english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
}
}
}
},
"mappings": {
"college": {
"properties": {
"college_name": {
"type": "string",
"index": "analyzed",
"analyzer": "simple_wildcard"
}
}
}
}
}
After that you will have to make two queries to get the right result. First, run the user query through the analyze API to get the tokens that you will pass to the span queries.
GET colleges/_analyze
{
"text" : "donald ducks duck's",
"analyzer" : "simple_wildcard"
}
The output would be the tokens that get passed to the next phase, i.e. the span query.
{
"tokens": [
{
"token": "donald",
"start_offset": 0,
"end_offset": 6,
"type": "word",
"position": 0
},
{
"token": "duck",
"start_offset": 7,
"end_offset": 12,
"type": "word",
"position": 1
},
{
"token": "duck",
"start_offset": 13,
"end_offset": 19,
"type": "word",
"position": 2
}
]
}
The tokens donald, duck, duck will be passed with end positions 1, 2, and 3 respectively.
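Putting the two phases together: if the analyze call returns, say, duck at position 1, the corresponding span query would use "end": 2, along the lines of this sketch:
```json
{
"query": {
"span_first": {
"match": {
"span_term": { "college_name": "duck" }
},
"end": 2
}
}
}
```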
NOTE: No stemming algorithm is 100% accurate; you might miss some singular/plural combinations. For that you could log your queries and then use either a synonym token filter or a mapping char filter.
Hope this solves the problem.
