Index email with Elasticsearch - mapping problem

I use ES v7. I want to index email addresses with Elasticsearch using the uax_url_email tokenizer.
I want to search Elasticsearch by full email address.
I tried this mapping:
PUT /test
{
"settings": {
"analysis": {
"filter": {
"email": {
"type": "pattern_capture",
"preserve_original": 1,
"patterns": [
"([^#]+)",
"(\\p{L}+)",
"(\\d+)",
"#(.+)",
"([^-#]+)"
]
}
},
"analyzer": {
"email": {
"tokenizer": "uax_url_email",
"filter": [
"email",
"lowercase",
"unique"
]
}
}
}
},
"mappings": {
"emails": {
"properties": {
"email": {
"type": "string",
"analyzer": "email"
}
}
}
}
}
but I get this error:
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "Failed to parse value [1] as only [true] or [false] are allowed."
}
],
"type": "illegal_argument_exception",
"reason": "Failed to parse value [1] as only [true] or [false] are allowed."
},
"status": 400
}
What is wrong with it? How should this mapping look?

Your request is malformed: you are passing 1 to the preserve_original parameter, which accepts only true or false, as the exception says.
Apart from this, there are a few more issues. The string data type was replaced by text and keyword in 5.0 and is not accepted in v7, and your JSON puts a mapping type name (emails) above properties, which v7 no longer accepts by default.
The corrected mapping, tested locally, looks like this:
{
"settings": {
"analysis": {
"filter": {
"email": {
"type": "pattern_capture",
"preserve_original": true,
"patterns": [
"([^#]+)",
"(\\p{L}+)",
"(\\d+)",
"#(.+)",
"([^-#]+)"
]
}
},
"analyzer": {
"email": {
"tokenizer": "uax_url_email",
"filter": [
"email",
"lowercase",
"unique"
]
}
}
}
},
"mappings": {
"properties": {
"email": {
"type": "text",
"analyzer": "email"
}
}
}
}

Thank you.
I inserted a few emails into the index with this corrected mapping.
Now when I search for a specific email, I get all of the documents back.
I want to get only one record. How can I do this?
http://localhost:9200/test/_search?q=email:abc#abc.net
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": 0.21149008,
"hits": [
{
"_index": "test",
"_type": "_doc",
"_id": "0IWQlXcBnPuV0JvQXCHW",
"_score": 0.21149008,
"_source": {
"email": "abc#abc.net"
}
},
{
"_index": "test",
"_type": "_doc",
"_id": "0oWUlXcBnPuV0JvQISFe",
"_score": 0.21149008,
"_source": {
"email": "abc1#abc.net"
}
},
{
"_index": "test",
"_type": "_doc",
"_id": "z4WQlXcBnPuV0JvQNCGn",
"_score": 0.19982167,
"_source": {
"email": "abc2#abc.net"
}
},
{
"_index": "test",
"_type": "_doc",
"_id": "0YWQlXcBnPuV0JvQdiHo",
"_score": 0.19982167,
"_source": {
"email": "abc3#abc.net"
}
}
]
}
}
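One plausible reason all four documents match is that the pattern_capture filter emits several sub-terms per address, and the query text is analyzed the same way. The Python sketch below only approximates that filter chain; it assumes the tokenizer hands the filter each address as a single token, and it swaps \p{L} for [^\W\d_] because Python's re module has no \p{L}. The function name email_terms is made up for illustration.

```python
import re

# Patterns copied from the mapping above. Python's re has no \p{L},
# so [^\W\d_] (any Unicode letter) stands in for it here.
PATTERNS = [r"([^#]+)", r"([^\W\d_]+)", r"(\d+)", r"#(.+)", r"([^-#]+)"]


def email_terms(token):
    """Approximate pattern_capture (preserve_original) + lowercase + unique."""
    terms = [token]  # preserve_original keeps the whole token
    for pattern in PATTERNS:
        for match in re.finditer(pattern, token):
            # every non-empty capture group becomes its own term
            terms.extend(group for group in match.groups() if group)
    seen, unique = set(), []
    for term in (t.lower() for t in terms):
        if term not in seen:
            seen.add(term)
            unique.append(term)
    return unique


# The query string goes through the same analyzer as the documents.
query_terms = set(email_terms("abc#abc.net"))
for doc in ["abc#abc.net", "abc1#abc.net", "abc2#abc.net", "abc3#abc.net"]:
    shared = sorted(query_terms & set(email_terms(doc)))
    print(doc, "shares", shared)
```

Under these assumptions every address shares terms such as abc, abc.net and net with the query, so a query that ORs its terms matches all of them. If that is what happens in the real index, options worth testing include an exact-match query on a keyword sub-field, or a search-time analyzer that does not apply the pattern_capture filter.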

Related

Elasticsearch template to support case insensitive searches

I've set up a normalizer on an index field to support case-insensitive searches, but I can't seem to get it to work.
GET users/
Returns the following mapping:
{
"users": {
"aliases": {},
"mappings": {
"user": {
"properties": {
"active": {
"type": "boolean"
},
"first_name": {
"type": "keyword",
"fields": {
"normalize": {
"type": "keyword",
"normalizer": "search_normalizer"
}
}
}
}
}
},
"settings": {
"index": {
"number_of_shards": "5",
"provided_name": "users",
"creation_date": "1567936315432",
"analysis": {
"normalizer": {
"search_normalizer": {
"filter": [
"lowercase"
],
"type": "custom"
}
}
},
"number_of_replicas": "1",
"uuid": "5SknFdwJTpmF",
"version": {
"created": "6040299"
}
}
}
}
}
Although first_name is normalized to lowercase, queries on the first_name field are case sensitive.
Using the following query for a user with first name Dave
GET users/_search
{
"query": {
"bool": {
"should": [
{
"regexp": {
"first_name": {
"value": ".*dave.*"
}
}
}
]
}
}
}
GET users/_analyze
{
"analyzer" : "standard",
"text": "Dave"
}
returns
{
"tokens": [
{
"token": "dave",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
}
]
}
Although "Dave" is tokenized to "dave" the following query
GET users/_search
{
"query": {
"match": {
"first_name": "dave"
}
}
}
Returns no hits.
Is there an issue with my current mapping? or the query?
I think you missed first_name.normalize in the query.
Indexing records:
{"index": {}}
{"first_name": "Daveraj"}
{"index": {}}
{"first_name": "RajdaveN"}
{"index": {}}
{"first_name": "Dave"}
Query:
{
"query": {
"bool": {
"should": [
{
"regexp": {
"first_name.normalize": {
"value": ".*dave.*"
}
}
}
]
}
}
}
Result:
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1.0,
"hits": [
{
"_index": "test3",
"_type": "test3_type",
"_id": "M8-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "Dave"
}
},
{
"_index": "test3",
"_type": "test3_type",
"_id": "Mc-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "Daveraj"
}
},
{
"_index": "test3",
"_type": "test3_type",
"_id": "Ms-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "RajdaveN"
}
}
]
}
}
You have created a normalized multi-field, first_name.normalize, but you are searching on the original field first_name. That field is a plain keyword with no normalizer, so its values are indexed verbatim and matched case-sensitively.
The examples given here might help:
https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html
You need to explicitly specify the multi-field you want to search on. Even though a multi-field can't have its own content, it may index different terms than its parent, because it can use a different analyzer, normalizer, or char/token filters.
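As a rough illustration of why the field name matters, the Python sketch below models the two fields as dictionaries of indexed terms: first_name keeps each value verbatim, while first_name.normalize lowercases it. It is a simulation, not Elasticsearch code, and regexp_hits is a made-up helper name. It relies on the fact that regexp queries are matched against the whole term.

```python
import re

names = ["Daveraj", "RajdaveN", "Dave"]

# first_name: plain keyword, each value is indexed as one verbatim term.
first_name_terms = {name: name for name in names}
# first_name.normalize: keyword with a lowercase normalizer.
normalize_terms = {name: name.lower() for name in names}


def regexp_hits(terms, pattern):
    """Return documents whose indexed term matches the pattern in full."""
    return [doc for doc, term in terms.items() if re.fullmatch(pattern, term)]


print(regexp_hits(first_name_terms, ".*dave.*"))
print(regexp_hits(normalize_terms, ".*dave.*"))
# A match query for "dave" compares whole terms, so only the normalized
# field can find the document whose source value is "Dave".
print("dave" in first_name_terms.values(), "dave" in normalize_terms.values())
```

On the verbatim field only a value that already contains lowercase dave can match, while on the normalized field all three values qualify, which is consistent with the three hits shown in the answer above.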

How to control scoring or ordering of results while using ngram in Elasticsearch?

I am using Elasticsearch 6.x.
I have created an index test_index with mapping type doc as follows:
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "my_ngram_tokenizer"
}
},
"tokenizer": {
"my_ngram_tokenizer": {
"type": "nGram",
"min_gram": "1",
"max_gram": "7",
"token_chars": [
"letter",
"digit",
"punctuation"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"my_text": {
"type": "text",
"fielddata": true,
"fields": {
"ngram": {
"type": "text",
"fielddata": true,
"analyzer": "my_analyzer"
}
}
}
}
}
}
}
I have indexed data as follows:
PUT /test_index/doc/1
{
"my_text": "ohio"
}
PUT /test_index/doc/2
{
"my_text": "ohlin"
}
PUT /test_index/doc/3
{
"my_text": "john"
}
Then I used search query:
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "oh",
"fields": [
"my_text^5",
"my_text.ngram"
]
}
}
]
}
}
}
And got the response:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 1.0042334,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1.0042334,
"_source": {
"my_text": "ohio"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 0.97201055,
"_source": {
"my_text": "john"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 0.80404717,
"_source": {
"my_text": "ohlin"
}
}
]
}
}
Here we can see that when I searched for oh, I got results in this order:
-> ohio
-> john
-> ohlin
But I want scoring and ordering that gives higher priority to a matching prefix:
-> ohio
-> ohlin
-> john
How can I achieve such a result? What approaches can I take here?
Thanks in advance.
You should add a new subfield with a new analyzer that uses the edge_ngram tokenizer, then add the new subfield to your multi_match query.
You then need to use the most_fields type for your multi_match query. Only the documents that start with the search term will match on this subfield, so they will be boosted above the other matching documents.
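To see why an edge_ngram subfield separates prefix matches from the rest, the Python sketch below generates both kinds of grams for the three documents. It is only a simulation: the subfield name my_text.edge_ngram is hypothetical, and counting matching fields is a crude stand-in for how most_fields adds up per-field scores, not real relevance scoring.

```python
def ngrams(text, min_gram=1, max_gram=7):
    """All substrings of length min_gram..max_gram (ngram tokenizer)."""
    return {
        text[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(text) - n + 1)
    }


def edge_ngrams(text, min_gram=1, max_gram=7):
    """Only the prefixes of length min_gram..max_gram (edge_ngram tokenizer)."""
    return {text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)}


docs = ["ohio", "ohlin", "john"]
query = "oh"


def matching_fields(doc):
    """Names of the (simulated) subfields whose terms contain the query."""
    fields = []
    if query in ngrams(doc):
        fields.append("my_text.ngram")
    if query in edge_ngrams(doc):
        fields.append("my_text.edge_ngram")  # hypothetical subfield name
    return fields


# Documents that match on more fields rank first; sorted() is stable,
# so ties keep their original order.
ranked = sorted(docs, key=lambda doc: len(matching_fields(doc)), reverse=True)
for doc in ranked:
    print(doc, matching_fields(doc))
```

In this model john still matches through the plain ngram field, but only ohio and ohlin also match on the prefix field, which is what lets them rise above it.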

elasticsearch do not analyze field

I use Elasticsearch (2.4) and I have an index with a field that is, in theory, analyzed at index time. In practice, though, it does not seem to be analyzed. I think I'm missing something, but what?
The complete index definition:
{
"test_index": {
"aliases": {},
"mappings": {
"users": {
"properties": {
"name": {
"type": "string",
"analyzer": "my_analyser"
},
"id": {
"type": "long"
}
}
}
},
"settings": {
"index": {
"index_directly": "1",
"number_of_shards": "1",
"cron_limit": "50",
"creation_date": "1496150121337",
"analysis": {
"analyzer": {
"standard": {
"type": "standard",
"max_token_length": "255",
"stopwords": ""
},
"my_analyser": {
"type": "custom",
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"token_chars": [
"letter",
"digit"
],
"min_gram": "3",
"type": "ngram",
"max_gram": "3"
}
}
},
"fields": {
"name": {
"type": "text"
}
},
"number_of_replicas": "0",
"uuid": "lmwPFWoISlC2knZZn2nNZQ",
"version": {
"created": "2040599"
}
}
},
"warmers": {}
}
}
A simple document to index:
{
"id": 0,
"name": "John"
}
The result:
{
"_index": "test_index",
"_type": "users",
"_id": "0",
"_version": 1,
"found": true,
"_source": {
"id": 0,
"name": "John"
}
}
What I am expecting:
{
"_index": "test_index",
"_type": "users",
"_id": "0",
"_version": 1,
"found": true,
"_source": {
"id": 0,
"name": [
"Joh",
"ohn"
]
}
}
I have other fields on this index, and I want my custom analyzer applied only to the name field.
Your analyzer won't affect the _source object; it only affects the terms that are stored in the index and used for search.
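The distinction can be sketched in Python: the stored document stays exactly as submitted, while the analyzer output goes into a separate term-to-document structure. This is a toy model of an inverted index, not Elasticsearch internals, and trigram_terms is a made-up name that mimics an ngram tokenizer with min_gram and max_gram both set to 3.

```python
def trigram_terms(text):
    """Every substring of length 3, as an ngram tokenizer (3, 3) would emit."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


source = {"id": 0, "name": "John"}  # what a GET on the document returns

# Toy inverted index: term -> set of document ids that contain it.
inverted_index = {}
for term in trigram_terms(source["name"]):
    inverted_index.setdefault(term, set()).add(source["id"])

print(source)          # unchanged by analysis
print(inverted_index)  # the terms that searches are matched against
```

To inspect the terms a real analyzer produces, the _analyze API can be called with the analyzer name and a sample text.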

Elasticsearch search for Turkish characters

I have some documents that I am indexing with Elasticsearch. Some of the documents are written in upper case, with the Turkish characters replaced. For example, "kürşat" is written as "KURSAT".
I want to find this document by searching for "kürşat". How can I do that?
Thanks
Take a look at the asciifolding token filter.
Here is a small example for you to try out in Sense:
Index:
DELETE test
PUT test
{
"settings": {
"analysis": {
"filter": {
"my_ascii_folding": {
"type": "asciifolding",
"preserve_original": true
}
},
"analyzer": {
"turkish_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_ascii_folding"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"name": {
"type": "string",
"analyzer": "turkish_analyzer"
}
}
}
}
}
POST test/test/1
{
"name": "kürşat"
}
POST test/test/2
{
"name": "KURSAT"
}
Query:
GET test/_search
{
"query": {
"match": {
"name": "kursat"
}
}
}
Response:
"hits": {
"total": 2,
"max_score": 0.30685282,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.30685282,
"_source": {
"name": "KURSAT"
}
},
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 0.30685282,
"_source": {
"name": "kürşat"
}
}
]
}
Query:
GET test/_search
{
"query": {
"match": {
"name": "kürşat"
}
}
}
Response:
"hits": {
"total": 2,
"max_score": 0.4339554,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 0.4339554,
"_source": {
"name": "kürşat"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.09001608,
"_source": {
"name": "KURSAT"
}
}
]
}
The 'preserve_original' flag makes sure that if a user types 'kürşat', documents with that exact match are ranked higher than documents that have 'kursat' (notice the difference in scores between the two query responses).
If you want the scores to be equal, set the flag to false.
Hope I got your problem right!
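The effect of preserve_original can be approximated in Python. The sketch below folds accents through Unicode decomposition, which is an assumption made for illustration and not the mapping table the asciifolding filter actually uses (for example, it would not fold the Turkish dotless ı). The names ascii_fold and analyze are hypothetical.

```python
import unicodedata


def ascii_fold(token):
    """Strip combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text, preserve_original=True):
    """Whitespace split, lowercase, then fold; optionally keep the original."""
    terms = []
    for token in text.lower().split():
        folded = ascii_fold(token)
        terms.append(folded)
        if preserve_original and folded != token:
            terms.append(token)
    return terms


name = "k\u00fcr\u015fat"  # "kürşat" written with escapes to pin the encoding
query_terms = set(analyze(name))
for doc in [name, "KURSAT"]:
    shared = query_terms & set(analyze(doc))
    print(doc, "matches", len(shared), "query term(s)")
```

In this model the accented document matches both the folded and the original term, while KURSAT matches only the folded one, which mirrors the score gap in the second response. With preserve_original set to false, each side produces a single folded term and the two documents tie.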

Unexpected (case-insensitive) string sorting in Elasticsearch

I have a list of console platforms that I'm sorting in Elasticsearch.
Here is the mapping for the "name" field:
{
"name": {
"type": "multi_field",
"fields": {
"name": {
"type": "string",
"index": "analyzed"
},
"sort_name": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
When I execute the following query
{
"query": {
"match_all": {}
},
"sort": [
{
"name.sort_name": { "order": "asc" }
}
],
"fields": ["name"]
}
I get these results:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"failed": 0
},
"hits": {
"total": 17,
"max_score": null,
"hits": [
{
"_index": "platforms",
"_type": "platform",
"_id": "1393602489",
"_score": null,
"fields": {
"name": "GameCube"
},
"sort": [
"GameCube"
]
},
{
"_index": "platforms",
"_type": "platform",
"_id": "1393602490",
"_score": null,
"fields": {
"name": "Gameboy Advance"
},
"sort": [
"Gameboy Advance"
]
},
{
"_index": "platforms",
"_type": "platform",
"_id": "1393602498",
"_score": null,
"fields": {
"name": "Nintendo 3DS"
},
"sort": [
"Nintendo 3DS"
]
},
... removed for brevity ...
{
"_index": "platforms",
"_type": "platform",
"_id": "1393602493",
"_score": null,
"fields": {
"name": "Xbox 360"
},
"sort": [
"Xbox 360"
]
},
{
"_index": "platforms",
"_type": "platform",
"_id": "1393602502",
"_score": null,
"fields": {
"name": "Xbox One"
},
"sort": [
"Xbox One"
]
},
{
"_index": "platforms",
"_type": "platform",
"_id": "1393602497",
"_score": null,
"fields": {
"name": "iPhone/iPod"
},
"sort": [
"iPhone/iPod"
]
}
]
}
}
Everything is sorted as expected, except that the iPhone/iPod result is at the end (instead of after Gameboy Advance). Why does the / in the name affect the sorting?
Thanks
Okay, so I discovered the reason had nothing to do with the /. A not_analyzed string is sorted by its raw characters, so ES puts all capital letters before lower-case letters.
I added a custom analyzer to the settings of the index creation:
{
"analysis": {
"analyzer": {
"sortable": {
"tokenizer": "keyword",
"filter": [
"lowercase"
]
}
}
}
}
Then in the field mapping I added 'analyzer': 'sortable' to the sort_name multi field.
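The same ordering shows up in plain Python, which also compares strings by code point. This small sketch uses six of the platform names from the question and contrasts the raw order with a lowercase sort key, which plays the role of the lowercase filter in the sortable analyzer.

```python
names = [
    "Xbox One",
    "iPhone/iPod",
    "GameCube",
    "Nintendo 3DS",
    "Gameboy Advance",
    "Xbox 360",
]

# Raw comparison: every uppercase letter sorts before every lowercase one.
raw_order = sorted(names)

# Lowercasing the sort key removes the case distinction.
lowercase_order = sorted(names, key=str.lower)

print(raw_order)
print(lowercase_order)
```

In the raw order GameCube comes before Gameboy Advance and iPhone/iPod lands last, as in the search response; with the lowercase key, iPhone/iPod moves up between GameCube and Nintendo 3DS.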
Use a normalizer with a keyword field to handle the sort:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-normalizers.html#analysis-normalizers
PUT index_name
{
"settings": {
"analysis": {
"normalizer": {
"my_normalizer": {
"type": "custom",
"filter": ["lowercase", "asciifolding"]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "keyword",
"normalizer": "my_normalizer"
}
}
}
}
The search query can then sort directly on the normalized name field:
{
"query": {
"match_all": {}
},
"sort": [
{
"name": { "order": "asc" }
}
],
"_source": ["name"]
}
According to https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-normalizers.html (ElasticSearch 7.16) ...
Elasticsearch ships with a lowercase built-in normalizer.
So you can define an additional field (in the example below named "lowersortable"):
PUT /myindex/_mapping
{
"properties": {
"myproperty": {
"type": "text",
"fields": {
"lowersortable": {
"type": "keyword",
"normalizer": "lowercase"
}
}
}
}
}
... and use this field myproperty.lowersortable for sorting in the search query.

Resources