Elasticsearch: index first char of string

I'm using version 5.3.
I have a text field a. I'd like to aggregate on the first char of a. I also need the entire original value.
I'm assuming the most efficient way is to have a keyword field a.firstLetter with a custom normalizer. I've tried to achieve this with a pattern replace char filter but am struggling with the regexp.
Am I going about this entirely wrong? Can you help me?
EDIT
This is what I've tried.
settings.json
{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "first_char": {
            "type": "pattern_replace",
            "pattern": "(?<=^.)(.*)",
            "replacement": ""
          }
        },
        "normalizer": {
          "first_letter": {
            "type": "custom",
            "char_filter": ["first_char"],
            "filter": ["lowercase"]
          }
        }
      }
    }
  }
}
mappings.json
{
  "properties": {
    "a": {
      "type": "text",
      "index_options": "positions",
      "fields": {
        "firstLetter": {
          "type": "keyword",
          "normalizer": "first_letter"
        }
      }
    }
  }
}
I get no buckets when I try to aggregate like so:
"aggregations": {
"grouping": {
"terms": {
"field": "a.firstLetter"
}
}
}
So basically my approach was "replace all but the first char with an empty string." The regexp is something I was able to gather by googling.
EDIT 2
I had misconfigured the normalizer (I've fixed the examples). The correct configuration reveals that normalizers do not support pattern replace char filters (see issue 23142). Apparently support for them will land in version 5.4 at the earliest.
So are there any other options? I'd hate to do this in code, by adding a field in the doc for the first letter, since I'm using Elasticsearch features for every other aggregation.

You can use the truncate token filter with a length of one:
PUT foo
{
  "mappings": {
    "bar" : {
      "properties": {
        "name" : {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "analyzer" : {
          "my_analyzer" : {
            "type" : "custom",
            "tokenizer" : "keyword",
            "filter" : [ "my_filter", "lowercase" ]
          }
        },
        "filter": {
          "my_filter": {
            "type": "truncate",
            "length": 1
          }
        }
      }
    }
  }
}
GET foo/_analyze
{
  "field" : "name",
  "text" : "New York"
}
# response
{
  "tokens": [
    {
      "token": "n",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    }
  ]
}
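One way this might be wired into the original mapping is as a text sub-field that uses the analyzer, with fielddata enabled so the terms aggregation can run on it. This is a sketch only, not a tested setup: the field and analyzer names just mirror the examples above, and enabling fielddata has memory costs.
# sketch: sub-field reusing my_analyzer from the settings above, with fielddata for the aggregation
PUT foo/_mapping/bar
{
  "properties": {
    "a": {
      "type": "text",
      "fields": {
        "firstLetter": {
          "type": "text",
          "analyzer": "my_analyzer",
          "fielddata": true
        }
      }
    }
  }
}

GET foo/_search
{
  "size": 0,
  "aggregations": {
    "grouping": {
      "terms": {
        "field": "a.firstLetter"
      }
    }
  }
}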

Related

Elasticsearch - Special Characters in Query String

I'm having trouble trying to search for special characters using query_string. I need to search for an email address in the format "xxx#xxx.xxx". At index time I use a custom normalizer that applies lowercase and ASCII folding. At search time I use a custom analyzer with a whitespace tokenizer and a filter that applies lowercase and ASCII folding. Even so, I am not able to search for a simple email address.
This is my mapping
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "normalizer": {
        "lowerasciinormalizer": {
          "type": "custom",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "integer"
      },
      "email": {
        "type": "keyword",
        "normalizer": "lowerasciinormalizer"
      }
    }
  }
}
And this is my search query
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "pippo#pluto.it",
            "fields": [
              "email"
            ],
            "analyzer": "folding"
          }
        }
      ]
    }
  }
}
Searching without special characters works fine. In fact, if I do "query": "pippo*" I get the correct results.
I also tested the tokenizer with:
GET /_analyze
{
  "analyzer": "whitespace",
  "text": "pippo#pluto.com"
}
I get what I expect
{
  "tokens" : [
    {
      "token" : "pippo#pluto.com",
      "start_offset" : 0,
      "end_offset" : 15,
      "type" : "word",
      "position" : 0
    }
  ]
}
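The normalizer on the email field can be checked the same way, since the _analyze API also accepts a normalizer parameter. A quick sketch; the index name my_index is just a placeholder, and the normalizer must exist in that index's settings:
# placeholder index name
GET /my_index/_analyze
{
  "normalizer": "lowerasciinormalizer",
  "text": "Pippo#Pluto.IT"
}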
Any suggestions?
Thanks.
Edit:
I'm using Elasticsearch 7.5.1.
This works correctly. My problem was somewhere else.

How to create and add values to a standard lowercase analyzer in elastic search

I've been around the houses with this for the past few days, trying things in various orders, but can't figure out why it's not working.
I am trying to create an index in Elasticsearch with an analyzer which is the same as the "standard" analyzer but retains upper case characters when records are stored.
I create my analyzer and index as follows:
PUT /upper
{
  "settings": {
    "index" : {
      "analysis" : {
        "analyzer": {
          "rebuilt_standard": {
            "tokenizer": "standard",
            "filter": [
              "standard"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "rebuilt_standard"
        }
      }
    }
  }
}
Then add two records to test like this...
POST /upper/doc
{
  "text" : "TEST"
}
Add a second record...
POST /upper/doc
{
  "text" : "test"
}
Using /upper/_settings gives the following:
{
  "upper": {
    "settings": {
      "index": {
        "number_of_shards": "5",
        "provided_name": "upper",
        "creation_date": "1537788581060",
        "analysis": {
          "analyzer": {
            "rebuilt_standard": {
              "filter": [
                "standard"
              ],
              "tokenizer": "standard"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "s4oDgdsFTxOwsdRuPAWEkg",
        "version": {
          "created": "6030299"
        }
      }
    }
  }
}
But when I search with the following query I still get two matches, both the upper- and lower-case documents, which must mean the analyzer is not applied when I store the records.
Search like so...
GET /upper/_search
{
  "query": {
    "term": {
      "text": {
        "value": "test"
      }
    }
  }
}
Thanks in advance!
First things first: you set your analyzer on the title field instead of the text field (your search is on the text property, and you are indexing documents with only a text property):
"properties": {
"title": {
"type": "text",
"analyzer": "rebuilt_standard"
}
}
try
"properties": {
"text": {
"type": "text",
"analyzer": "rebuilt_standard"
}
}
and keep us posted ;)
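For completeness, here is a sketch of the corrected index creation with the analyzer applied to text. It reuses the question's settings unchanged; only the field name differs, so treat it as illustrative rather than verified.
# sketch: same settings as the question, analyzer moved to the "text" field
PUT /upper
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "rebuilt_standard": {
            "tokenizer": "standard",
            "filter": [ "standard" ]
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "rebuilt_standard"
        }
      }
    }
  }
}
With that in place, the term query for "test" should match only the lower-case document, since rebuilt_standard does not lowercase at index time.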

Wildcard / regexp in a phrase which has space

Create an index:
Here I am using edge_ngram:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "keyword",
          "fields": {
            "raw": {
              "type": "text",
              "analyzer": "my_analyzer"
            }
          }
        }
      }
    }
  }
}
POST my_index/my_type/1
{
  "text": "2 #Quick Foxes lived and died"
}
POST my_index/my_type/2
{
  "text": "2 #Quick Foxes lived died"
}
Now when we search
GET my_index/my_type/_search
{
  "query": {
    "query_string": {
      "default_operator" : "AND",
      "query" : "f* d*",
      "fields": ["text.raw"]
    }
  }
}
Only ID 2 should be returned, but nothing comes back.
When you try this:
GET my_index/my_type/_search
{
  "query": {
    "query_string": {
      "default_operator" : "AND",
      "query" : "f* d*",
      "fields": ["text"]
    }
  }
}
It will return both.
If we have an index with a huge amount of data and we want to search with wildcards, how do we do it?
A single keyword will work, but if we add phrases like the one I mentioned in the example, it won't give any proper result.
To generate a regex expression you can use these websites:
Generate a regex expression here: http://buildregex.com/
Test your string against the generated expression here: https://regex101.com/
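It can also help to look at what the edge_ngram analyzer actually emits before writing wildcard queries. A quick sketch against the index above; note that the mapping defines the analyzer on city.raw, so it is never applied to the text field the queries target:
# sketch: inspect the tokens my_analyzer produces
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2 #Quick Foxes lived died"
}
With min_gram and max_gram both set to 3, this should emit only three-character prefixes such as Qui, Fox, liv and die.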

Get exact match after doing mapping as not_analyzed

I have an Elasticsearch type that I mapped as below.
mappings": {
"jardata": {
"properties": {
"groupID": {
"index": "not_analyzed",
"type": "string"
},
"artifactID": {
"index": "not_analyzed",
"type": "string"
},
"directory": {
"type": "string"
},
"jarFileName": {
"index": "not_analyzed",
"type": "string"
},
"version": {
"index": "not_analyzed",
"type": "string"
}
}
}
}
I am leaving the directory field analyzed since I want to be able to give only the last folder name and still get results. But when I want to search for a specific directory I need to give the whole path, because the same folder name can exist under two different paths. The problem is that, since the field is analyzed, it returns all matching data instead of the specific directory I want.
So I want the field to act like both analyzed and not_analyzed. Is there a way to do that?
Let's say you have the following document indexed:
{
  "directory": "/home/docs/public"
}
The standard analyzer is not enough in your case, as it will create the following terms while indexing:
[home, docs, public]
Note that it misses the [/home/docs/public] token; characters like "/" act as separators here.
One solution could be to use the ngram tokenizer with the punctuation character class in the token_chars list. Elasticsearch would then treat "/" as it would a letter or digit. This allows searching with tokens like:
[/hom, /home, ..., /home/docs/publi, /home/docs/public, ..., /docs/public, etc.]
Index mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 4,
          "max_gram": 18,
          "token_chars": [
            "letter",
            "digit",
            "punctuation"
          ]
        }
      }
    }
  },
  "mappings": {
    "jardata": {
      "properties": {
        "directory": {
          "type": "string",
          "analyzer": "ngram_analyzer"
        }
      }
    }
  }
}
Now both search queries:
{
  "query": {
    "bool" : {
      "must" : {
        "term" : {
          "directory": "/docs/public"
        }
      }
    }
  }
}
and
{
  "query": {
    "bool" : {
      "must" : {
        "term" : {
          "directory": "/home/docs/public"
        }
      }
    }
  }
}
will both return the indexed document.
One thing to consider is the maximum token length specified in the "max_gram" setting. For directory paths it may need to be longer.
An alternative solution is to use the whitespace tokenizer, which breaks the phrase into terms only on whitespace, combined with an ngram token filter, using the following mapping:
{
  "settings": {
    "analysis": {
      "filter": {
        "ngram_filter": {
          "type": "ngram",
          "min_gram": 4,
          "max_gram": 20
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "ngram_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "jardata": {
      "properties": {
        "directory": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
Update the mapping of the directory field to contain a raw sub-field, like this:
"directory": {
"type": "string",
"fields": {
"raw": {
"index": "not_analyzed",
"type": "string"
}
}
}
Then modify your query to use directory.raw, which will be treated as not_analyzed.
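A sketch of the exact-match query that might then be used; the raw sub-field name comes from the mapping above and the path value from the earlier example document:
{
  "query": {
    "term": {
      "directory.raw": "/home/docs/public"
    }
  }
}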

Elasticsearch custom analyzer for hyphens, underscores, and numbers

Admittedly, I'm not that well versed on the analysis part of ES. Here's the index layout:
{
  "mappings": {
    "event": {
      "properties": {
        "ipaddress": {
          "type": "string"
        },
        "hostname": {
          "type": "string",
          "analyzer": "my_analyzer",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "my_filter": {
          "type": "word_delimiter",
          "preserve_original": true
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "my_filter"]
        }
      }
    }
  }
}
You can see that I've attempted to use a custom analyzer for the hostname field. This kind of works when I use this query to find the host named "WIN_1":
{
  "query": {
    "match": {
      "hostname": "WIN_1"
    }
  }
}
The issue is that it also returns any hostname that has a 1 in it. Using the _analyze endpoint, I can see that the numbers are tokenized as well.
{
  "tokens": [
    {
      "token": "win_1",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
    {
      "token": "win",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 1
    },
    {
      "token": "1",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 2
    }
  ]
}
What I'd like to be able to do is search for WIN and get back any host that has WIN in its name. But I also need to be able to search for WIN_1 and get back that exact host, or any host with WIN_1 in its name. Below is some test data.
{
  "ipaddress": "192.168.1.253",
  "hostname": "WIN_8_ENT_1"
}
{
  "ipaddress": "10.0.0.1",
  "hostname": "server1"
}
{
  "ipaddress": "172.20.10.36",
  "hostname": "ServA-1"
}
Hopefully someone can point me in the right direction. It could be that my simple query isn't the right approach either. I've pored over the ES docs, but they aren't great with examples.
You could change your analysis to use a pattern analyzer that discards the digits and underscores:
{
  "analysis": {
    "analyzer": {
      "word_only": {
        "type": "pattern",
        "pattern": "([^\\p{L}]+)"
      }
    }
  }
}
Using the analyze API:
curl -XGET 'localhost:9200/{yourIndex}/_analyze?analyzer=word_only&pretty=true' -d 'WIN_8_ENT_1'
returns:
"tokens" : [ {
"token" : "win",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
}, {
"token" : "ent",
"start_offset" : 6,
"end_offset" : 9,
"type" : "word",
"position" : 2
} ]
Your mapping would become:
{
  "event": {
    "properties": {
      "ipaddress": {
        "type": "string"
      },
      "hostname": {
        "type": "string",
        "analyzer": "word_only",
        "fields": {
          "raw": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}
You can use a multi_match query to get the results you want:
{
  "query": {
    "multi_match": {
      "fields": [
        "hostname",
        "hostname.raw"
      ],
      "query": "WIN_1"
    }
  }
}
Here's the analyzer and queries I ended up with:
{
  "mappings": {
    "event": {
      "properties": {
        "ipaddress": {
          "type": "string"
        },
        "hostname": {
          "type": "string",
          "analyzer": "hostname_analyzer",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "hostname_filter": {
          "type": "pattern_capture",
          "preserve_original": 0,
          "patterns": [
            "(\\p{Ll}{3,})"
          ]
        }
      },
      "analyzer": {
        "hostname_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [ "lowercase", "hostname_filter" ]
        }
      }
    }
  }
}
Queries:
Find host name starting with:
{
  "query": {
    "prefix": {
      "hostname.raw": "WIN_8"
    }
  }
}
Find host name containing:
{
  "query": {
    "multi_match": {
      "fields": [
        "hostname",
        "hostname.raw"
      ],
      "query": "WIN"
    }
  }
}
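To sanity-check what hostname_analyzer emits, the analyze API can be called the same way as earlier in this thread (a sketch; {yourIndex} is a placeholder and assumes the settings above are in place):
curl -XGET 'localhost:9200/{yourIndex}/_analyze?analyzer=hostname_analyzer&pretty=true' -d 'WIN_8_ENT_1'
With the pattern_capture filter above, this should return the tokens win and ent.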
Thanks to Dan for getting me in the right direction.
When ES 1.4 is released, there will be a new filter called keep_types that will allow you to keep only certain token types once the string is tokenized (i.e. keep only words, only numbers, etc.).
Check it out here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-keep-types-tokenfilter.html#analysis-keep-types-tokenfilter
This may be a more convenient solution for your needs in the future.
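For reference, a minimal sketch of what such a filter definition looks like in releases that have it. This is a generic example (index, filter, and analyzer names are made up for illustration) and is not tied to the hostname mapping above:
# generic sketch: keep only numeric tokens produced by the standard tokenizer
PUT keep_types_example
{
  "settings": {
    "analysis": {
      "filter": {
        "numbers_only": {
          "type": "keep_types",
          "types": [ "<NUM>" ]
        }
      },
      "analyzer": {
        "numbers_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "numbers_only" ]
        }
      }
    }
  }
}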
It looks like you want to apply two different types of searches on your hostname field. One for exact matches, and one for a variation of wildcard (maybe in your specific case, a prefix query).
After trying to implement all types of different searches using several different analyzers, I've found it sometimes simpler to add another field to represent each type of search you want to do. Is there a reason you do not want to add another field like the following:
{
  "ipaddress": "192.168.1.253",
  "hostname": "WIN_8_ENT_1",
  "system": "WIN"
}
Otherwise, you could consider writing your own custom filter that does effectively the same thing under the hood. Your filter would read the hostname field and index both the exact keyword and a substring that matches your stemming pattern (e.g. WIN in WIN_8_ENT_1).
I do not think there is any existing analyzer/filter combination that can do what you are looking for, assuming I have understood your requirements correctly.
