different analyzers in different fields - elasticsearch

Why is it that, with the mapping below, where the url field uses a different analyzer from the title and description fields, a search across all three fields at once returns nothing, even though each field contains one of the three search words?
{
"settings": {
"index": {
"number_of_shards": "5",
"number_of_replicas": "0",
"analysis": {
"filter": {
"stemmer_plural_portugues": {
"name": "minimal_portuguese",
"stopwords" : ["http", "https", "ftp", "www"],
"type": "stemmer"
},
"synonym_filter": {
"type": "synonym",
"lenient": true,
"synonyms_path": "analysis/synonym.txt",
"updateable" : true
},
"shingle_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 3
}
},
"analyzer": {
"analyzer_customizado": {
"filter": [
"lowercase",
"stemmer_plural_portugues",
"asciifolding",
"synonym_filter",
"shingle_filter" ],
"tokenizer": "standard"
},
"analyzer_url": {
"filter": [
"lowercase",
"stemmer_plural_portugues",
"asciifolding" ],
"tokenizer": "lowercase"
}
}
}
}
},
"mappings": {
"properties": {
"id": {
"type": "long"
},
"data": {
"type": "date"
},
"quebrado": {
"type": "byte"
},
"pgrk": {
"type": "integer"
},
"url_length": {
"type": "integer"
},
"title": {
"analyzer": "analyzer_customizado",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"description": {
"analyzer": "analyzer_customizado",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"url": {
"analyzer": "analyzer_url",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
}
}
}
}
In the query below, each of the three words exists in one of the fields. It returns results only if I search for the words that are in title and description; as soon as I also search for the word in the url field, which uses the different analyzer, nothing is returned.
Searching only for the words in title and description works, and searching only for the word in url also works, but searching for all three words, which exist across the three fields, returns nothing.
{
"from": 0,
"size": 10,
"query": {
"multi_match": {
"query": "carro moto aviao",
"type": "cross_fields",
"fields": [
"title",
"description",
"url"
],
"operator": "and"
}
}
}

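The issue is that you are using the and operator, which means all three words carro moto aviao must be present. cross_fields can only treat fields that share an analyzer as one combined field; fields with different analyzers are put into separate groups and the operator is applied within each group, so here no single group contains all three words. Can you change it to or and see whether it returns results?
Below is a working example with your mapping, sample data and a search query using the or operator, confirming that it works.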
Sample doc
{
"title": "carro",
"description": "moto",
"url": "aviao"
}
Search query with OR param
{
"from": 0,
"size": 10,
"query": {
"multi_match": {
"query": "carro moto aviao",
"type": "cross_fields",
"fields": [
"title",
"description",
"url"
],
"operator": "or"
}
}
}
Search result
"hits": [
{
"_index": "jean",
"_type": "_doc",
"_id": "1",
"_score": 0.5753642,
"_source": {
"title": "carro",
"description": "moto",
"url": "aviao"
}
}
]
Note: I confirmed that your query does not return the document with the and operator.
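A simplified Python model of why the and operator fails here, based on the documented behaviour that cross_fields groups fields by analyzer and applies the operator within each group. The two analyzer functions are rough stand-ins (stemming, synonyms and shingles are ignored), cross_fields_matches is a hypothetical helper rather than an Elasticsearch API, and scoring is not modelled.

```python
import re
from collections import defaultdict


def standard_like(text):
    # Rough stand-in for the standard tokenizer plus lowercase filter.
    return re.findall(r"\w+", text.lower())


def letters_only(text):
    # Rough stand-in for the lowercase tokenizer: letters only, lowercased.
    return re.findall(r"[^\W\d_]+", text.lower())


ANALYZERS = {
    "analyzer_customizado": standard_like,
    "analyzer_url": letters_only,
}


def cross_fields_matches(doc, query, field_analyzers, operator):
    """Return True if any analyzer group satisfies the operator."""
    # Fields that share an analyzer form one group.
    groups = defaultdict(list)
    for field, analyzer_name in field_analyzers.items():
        groups[analyzer_name].append(field)

    for analyzer_name, fields in groups.items():
        analyze = ANALYZERS[analyzer_name]
        # All tokens indexed in any field of this group.
        indexed = set()
        for field in fields:
            indexed.update(analyze(doc.get(field, "")))
        terms = analyze(query)
        if not terms:
            continue
        hits = [term in indexed for term in terms]
        # "and": every term must be found inside this one group.
        if all(hits) if operator == "and" else any(hits):
            return True
    return False


doc = {"title": "carro", "description": "moto", "url": "aviao"}
query = "carro moto aviao"
mixed = {
    "title": "analyzer_customizado",
    "description": "analyzer_customizado",
    "url": "analyzer_url",
}
same = {field: "analyzer_customizado" for field in mixed}

print(cross_fields_matches(doc, query, mixed, "and"))
print(cross_fields_matches(doc, query, mixed, "or"))
print(cross_fields_matches(doc, query, same, "and"))
```

In this model the mixed-analyzer mapping only matches with or, while giving all three fields the same analyzer lets and succeed because the three fields then form a single group.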

Related

Elastic Search_as_you_type case insensitive match

I want to perform a partial search on 3 fields: UUID, tracking_id, and zip_code. Each contains a single word with no spaces or special characters, except for the hyphens in the UUID.
I'm not sure whether I should use search_as_you_type, the edge ngram tokenizer or the edge ngram token filter, so I tried search_as_you_type first.
I have created this index:
{
"settings": {
"index": {
"sort.field": [ "created_at", "id" ],
"sort.order": [ "desc", "desc" ]
}
},
"mappings": {
"properties": {
"id": { "type": "keyword", "fields": { "raw": { "type": "search_as_you_type" }}},
"current_status": { "type": "keyword" },
"tracking_id": { "type": "wildcard" },
"invoice_number": { "type": "keyword" },
"created_at": { "type": "date" }
}
}
}
and inserted this doc:
{
"id": "SIGRID",
"current_status": "unassigned",
"tracking_id": "AXXH",
"invoice_number": "xxx",
"created_at": "2021-03-24T09:36:10.717672467Z"
}
I sent this query:
{"query": {
"multi_match": {
"query": "sigrid",
"type": "bool_prefix",
"fields": [
"id"
]
}
}
}
This returns no result, but SIGRID, S and SIG do return the document. How can I make the search_as_you_type query case insensitive? Should I use the edge ngram tokenizer instead? Thanks

You can define a custom normalizer with a lowercase filter. The lowercase filter ensures that all letters are converted to lowercase both before the document is indexed and before the search is run. Modify your index mapping as follows:
{
"settings": {
"index": {
"sort.field": [
"created_at",
"id"
],
"sort.order": [
"desc",
"desc"
]
},
"analysis": {
"normalizer": {
"my_normalizer": {
"type": "custom", // note this
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"id": {
"type": "keyword",
"normalizer": "my_normalizer", // note this
"fields": {
"raw": {
"type": "search_as_you_type"
}
}
},
"current_status": {
"type": "keyword"
},
"tracking_id": {
"type": "wildcard"
},
"invoice_number": {
"type": "keyword"
},
"created_at": {
"type": "date"
}
}
}
}
Search Query:
{
"query": {
"multi_match": {
"query": "sigrid",
"type": "bool_prefix"
}
}
}
Search Result:
"hits": [
{
"_index": "66792606",
"_type": "_doc",
"_id": "1",
"_score": 2.0,
"_source": {
"id": "SIGRID",
"current_status": "unassigned",
"tracking_id": "AXXH",
"invoice_number": "xxx",
"created_at": "2021-03-24T09:36:10.717672467Z"
}
}
]
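As a toy illustration of the idea (not Elasticsearch's implementation), the Python sketch below applies the same lowercase step to the stored value and to the query before a prefix comparison. The function names lowercase_normalizer and prefix_matches are made up for this example.

```python
def lowercase_normalizer(value):
    # Plays the role of a custom normalizer with a single lowercase filter.
    return value.lower()


def prefix_matches(indexed_value, query, normalizer=None):
    # The normalizer, if any, is applied to both sides,
    # mirroring index-time and search-time normalization.
    if normalizer is not None:
        indexed_value = normalizer(indexed_value)
        query = normalizer(query)
    return indexed_value.startswith(query)


print(prefix_matches("SIGRID", "sigrid"))
print(prefix_matches("SIGRID", "SIG"))
print(prefix_matches("SIGRID", "sigrid", lowercase_normalizer))
print(prefix_matches("SIGRID", "Sig", lowercase_normalizer))
```

Without the normalizer only the uppercase prefix matches, which is the behaviour described in the question; with it, the case of the query no longer matters.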

elasticsearch mapping with numeric token

I have the mapping below and it works normally
{
"settings": {
"index": {
"number_of_shards": "5",
"number_of_replicas": "0",
"analysis": {
"filter": {
"stemmer_plural_portugues": {
"name": "minimal_portuguese",
"stopwords" : ["http", "https", "ftp", "www"],
"type": "stemmer"
},
"synonym_filter": {
"type": "synonym",
"lenient": true,
"synonyms_path": "analysis/synonym.txt",
"updateable" : true
},
"shingle_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 3
}
},
"analyzer": {
"analyzer_customizado": {
"filter": [
"lowercase",
"stemmer_plural_portugues",
"asciifolding",
"synonym_filter",
"shingle_filter"
],
"tokenizer": "lowercase"
}
}
}
}
},
"mappings": {
"properties": {
"id": {
"type": "long"
},
"data": {
"type": "date"
},
"quebrado": {
"type": "byte"
},
"pgrk": {
"type": "integer"
},
"url_length": {
"type": "integer"
},
"title": {
"analyzer": "analyzer_customizado",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"description": {
"analyzer": "analyzer_customizado",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"url": {
"analyzer": "analyzer_customizado",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
}
}
}
}
I insert the doc below
{
"title": "rocket 1960",
"description": "space",
"url": "www.nasa.com"
}
If I execute the query below using the AND operator, it will find the doc normally, because all the words searched exist in the doc.
{
"from": 0,
"size": 10,
"query": {
"multi_match": {
"query": "space nasa rocket",
"type": "cross_fields",
"fields": [
"title",
"description",
"url"
],
"operator": "and"
}
}
}
But if I also add "1960" to the search, as in the query below, nothing is returned.
{
"from": 0,
"size": 10,
"query": {
"multi_match": {
"query": "1960 space nasa rocket",
"type": "cross_fields",
"fields": [
"title",
"description",
"url"
],
"operator": "and"
}
}
}
I found that the "lowercase" tokenizer does not generate numeric tokens, so I changed my tokenizer to "standard" and the numeric token 1960 is now generated.
But the query still finds nothing, because the url field containing the link www.nasa.com no longer produces the separate tokens "www nasa com"; the only token generated is the entire link www.nasa.com.
The query only works if I enter the full URL www.nasa.com, as shown below:
{
"from": 0,
"size": 10,
"query": {
"multi_match": {
"query": "1960 space www.nasa.com rocket",
"type": "cross_fields",
"fields": [
"title",
"description",
"url"
],
"operator": "and"
}
}
}
If I define another analyzer with the "lowercase" tokenizer just for the url field, the link www.nasa.com is again split into the separate tokens "www nasa com",
but my query below finds nothing, because the url field has a different tokenizer from the title and description fields. The query below only works if I use the OR operator, but I need the AND operator.
{
"from": 0,
"size": 10,
"query": {
"multi_match": {
"query": "1960 space nasa rocket",
"type": "cross_fields",
"fields": [
"title",
"description",
"url"
],
"operator": "and"
}
}
}
I cannot use ngrams in my mapping because I use the Phrase Suggester, and with ngrams the suggestions are generated from hundreds of tokens, which makes them inaccurate.
Does anyone know a solution that lets my mapping generate numeric tokens in the title and description fields, keeps website links in the url field split into several tokens ("www nasa com") instead of one whole link ("www.nasa.com"), and lets my query work with the AND operator across all fields at the same time?
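The tokenizer behaviour described in this question can be reproduced with a small Python approximation. The two regular expressions are assumptions that imitate the lowercase tokenizer (letters only) and the standard tokenizer followed by a lowercase filter (digits kept, a dot between word characters does not split); they are not the real Lucene rules.

```python
import re


def lowercase_tokenizer(text):
    # Splits on anything that is not a letter, so digits are dropped.
    return re.findall(r"[^\W\d_]+", text.lower())


def standard_like_tokenizer(text):
    # Keeps digits and keeps dots that sit between word characters.
    return re.findall(r"\w+(?:\.\w+)*", text.lower())


for sample in ["rocket 1960", "www.nasa.com"]:
    print(sample, lowercase_tokenizer(sample), standard_like_tokenizer(sample))
```

The first tokenizer loses 1960 but splits the URL into www, nasa and com; the second keeps 1960 but leaves www.nasa.com as a single token, which is the trade-off described above.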

You wrote that when you also add "1960" to the search, the query does not return anything.
In the following index mapping I have removed synonym_filter. After removing it, indexing the sample document and running the same search query as in your question, I get the desired result.
Index Mapping :
{
"settings": {
"index": {
"number_of_shards": "5",
"number_of_replicas": "0",
"analysis": {
"filter": {
"stemmer_plural_portugues": {
"name": "minimal_portuguese",
"stopwords": [
"http",
"https",
"ftp",
"www"
],
"type": "stemmer"
},
"shingle_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 3
}
},
"analyzer": {
"analyzer_customizado": {
"filter": [
"lowercase",
"stemmer_plural_portugues",
"asciifolding",
"shingle_filter"
],
"tokenizer": "lowercase"
}
}
}
}
},
"mappings": {
"properties": {
"id": {
"type": "long"
},
"data": {
"type": "date"
},
"quebrado": {
"type": "byte"
},
"pgrk": {
"type": "integer"
},
"url_length": {
"type": "integer"
},
"title": {
"analyzer": "analyzer_customizado",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"description": {
"analyzer": "analyzer_customizado",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"url": {
"analyzer": "analyzer_customizado",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
}
}
}
}
Search Query:
{
"from": 0,
"size": 10,
"query": {
"multi_match": {
"query": "1960 space nasa rocket",
"type": "cross_fields",
"fields": [
"title",
"description",
"url"
],
"operator": "and"
}
}
}
Search Result:
"hits": [
{
"_index": "my-index",
"_type": "_doc",
"_id": "1",
"_score": 0.9370217,
"_source": {
"title": "rocket 1960",
"description": "space",
"url": "www.nasa.com"
}
}
]
As stated by @Gibbs, I think there is some issue with synonym_filter, so it would be better if you shared synonym.txt. Otherwise, the search query runs perfectly.
Update 1: (including synonym_filter)
If you want to include the synonym token filter, keep your index mapping as it is and make just one change to the filter definition:
"synonym_filter": {
"type": "synonym",
"lenient": true,
"synonyms_path": "analysis/synonym.txt",
"updateable" : false --> set this to false
},
You set your synonym filter to "updateable", presumably because you want to change synonyms without having to close and reopen the index but instead use the reload API. Updatable synonyms restrict the analyzer they are used in to be only used at search time.
For the full explanation, you can refer to this ES discussion.
Use the same search query as above (after making the change in the mapping) and you will get your desired result.
But if you still want to set "updateable": true, then you can refer to the official documentation of the Reload search analyzers API.

Elasticsearch Sorting fields anomaly

I am trying to sort a list on certain fields, firstName and lastName, but I have noticed some inconsistent results.
I am running a simple query:
//Return all the employees from a specific company ordering by lastName asc | desc
GET employee-index-sorting
{
"query": {
"bool": {
"filter": {
"term": {
"companyId": 3179
}
}
}
},
"sort": [
{
"lastName.keyword": { <-- Should this be keyword? or not_analyzed
"order": "desc"
}
}
]
}
In the result why would van der Mescht and van Breda be before Zwane and Zwezwe?
I suspect there is something wrong with my mappings
{
"_index": "employee-index",
"_type": "_doc",
"_id": "637467",
"_score": null,
"_source": {
"companyId": 3179,
"firstName": "Name",
"lastName": "van der Mescht",
},
"sort": [
"van der Mescht"
]
},
{
"_index": "employee-index",
"_type": "_doc",
"_id": "678335",
"_score": null,
"_source": {
"companyId": 3179,
"firstName": "Name3",
"lastName": "van Breda",
},
"sort": [
"van Breda"
]
},
{
"_index": "employee-index",
"_type": "_doc",
"_id": "113896",
"_score": null,
"_source": {
"companyId": 3179,
"firstName": "Name2",
"lastName": "Zwezwe",
},
"sort": [
"Zwezwe"
]
},
{
"_index": "employee-index",
"_type": "_doc",
"_id": "639639",
"_score": null,
"_source": {
"companyId": 3179,
"firstName": "Name1",
"lastName": "Zwane",
},
"sort": [
"Zwane"
]
}
Mappings
I am posting the entire mapping because I am not sure whether something else might be wrong with it.
How should I change the lastName and firstName properties to allow sorting on them?
PUT employee-index-sorting
{
"settings": {
"index": {
"analysis": {
"filter": {},
"analyzer": {
"keyword_analyzer": {
"filter": [
"lowercase",
"asciifolding",
"trim"
],
"char_filter": [],
"type": "custom",
"tokenizer": "keyword"
},
"edge_ngram_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "edge_ngram_tokenizer"
},
"edge_ngram_search_analyzer": {
"tokenizer": "lowercase"
}
},
"tokenizer": {
"edge_ngram_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 5,
"token_chars": [
"letter"
]
}
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"employeeId": {
"type": "keyword"
},
"companyGroupId": {
"type": "keyword"
},
"companyId": {
"type": "keyword"
},
"number": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"preferredName": {
"type": "text",
"index": false
},
"firstName": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"middleName": {
"type": "text",
"index": false
},
"lastName": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"fullName": {
"type": "text",
"fields": {
"keywordstring": {
"type": "text",
"analyzer": "keyword_analyzer"
},
"edgengram": {
"type": "text",
"analyzer": "edge_ngram_analyzer",
"search_analyzer": "edge_ngram_search_analyzer"
}
},
"analyzer": "standard"
},
"terminationDate": {
"type": "date"
},
"companyName": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"email": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"idNumber": {
"type": "text"
},
"description": {
"type": "text",
"index": false
},
"jobNumber": {
"type": "keyword"
},
"frequencyId": {
"type": "long"
},
"frequencyCode": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"frequencyAccess": {
"type": "boolean"
}
}
}
}
}

For sorting you need to use lastName.keyword; that's correct and there is no need to change anything there.
The reason van der Mescht and van Breda come before Zwane and Zwezwe is that strings are sorted lexicographically, i.e. basically by position in the ASCII table, where uppercase characters come before lowercase ones, so words are sorted in that same order. But since you're sorting in desc mode, the order is exactly the opposite:
z...
...
van der Mescht
...
van Breda
...
a...
...
Zwezwe
...
Zwane
...
Z...
...
A...
To fix this, what you simply need to do is to add a normalizer to your lastName.keyword field, i.e. change your mapping to this and it will work:
{
"settings": {
"index": {
"analysis": {
"filter": {},
"analyzer": {
...
},
"tokenizer": {
...
},
"normalizer": { <-- add this
"lowersort": {
"type": "custom",
"filter": [
"lowercase"
]
}
}
}
}
},
"mappings": {
"_doc": {
"properties": {
...
"lastName": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"normalizer": "lowersort", <-- add this
"ignore_above": 256
}
}
},
...
}
}
}
}
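The effect of the normalizer on sort order can be illustrated outside Elasticsearch with a few lines of Python, using the four last names from the question. Python's default string sort also compares by code point, and key=str.lower stands in for the lowersort normalizer; this is only an analogy, not how Elasticsearch sorts internally.

```python
names = ["van der Mescht", "Zwane", "van Breda", "Zwezwe"]

# Plain code point order, descending: lowercase "v" ranks above uppercase "Z".
raw_desc = sorted(names, reverse=True)

# Compare lowercased values instead, as a lowercase normalizer would.
normalized_desc = sorted(names, key=str.lower, reverse=True)

print(raw_desc)
print(normalized_desc)
```

The first list reproduces the order seen in the question; the second puts Zwezwe and Zwane ahead of the van names, which is the case-insensitive descending order.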

Not able to search a phrase in elasticsearch 5.4

I am searching for a phrase in an email body. I need exact phrase filtering: if I search for 'Avenue New', it should return only results that contain the phrase 'Avenue New', not 'Avenue Street', 'Park Avenue', etc.
My mapping is:
{
"exchangemailssql": {
"aliases": {},
"mappings": {
"email": {
"dynamic_templates": [
{
"_default": {
"match": "*",
"match_mapping_type": "string",
"mapping": {
"doc_values": true,
"type": "keyword"
}
}
}
],
"properties": {
"attachments": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"body": {
"type": "text",
"analyzer": "keylower",
"fielddata": true
},
"count": {
"type": "short"
},
"emailId": {
"type": "long"
}
}
}
},
"settings": {
"index": {
"refresh_interval": "3s",
"number_of_shards": "1",
"provided_name": "exchangemailssql",
"creation_date": "1500527793230",
"analysis": {
"filter": {
"nGram": {
"min_gram": "4",
"side": "front",
"type": "edge_ngram",
"max_gram": "100"
}
},
"analyzer": {
"keylower": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "keyword"
},
"email": {
"filter": [
"lowercase",
"unique",
"nGram"
],
"type": "custom",
"tokenizer": "uax_url_email"
},
"full": {
"filter": [
"lowercase",
"snowball",
"nGram"
],
"type": "custom",
"tokenizer": "standard"
}
}
},
"number_of_replicas": "0",
"uuid": "2XTpHmwaQF65PNkCQCmcVQ",
"version": {
"created": "5040099"
}
}
}
}
}
My search query is:
{
"query": {
"match_phrase": {
"body": "Avenue New"
}
},
"highlight": {
"fields" : {
"body" : {}
}
}
}

The problem here is that you're tokenizing the full body content using the keyword tokenizer, i.e. it will be one big lowercase string and you cannot search inside of it.
If you simply change the analyzer of your body field to standard instead of keylower, you'll find what you need using the match_phrase query.
"body": {
"type": "text",
"analyzer": "standard", <---change this
"fielddata": true
},
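To see why the keyword tokenizer prevents phrase matching, here is a rough Python model. keylower_analyze imitates the keylower analyzer (one lowercased token for the whole body), standard_like_analyze is a crude approximation of the standard analyzer, and phrase_matches is a hypothetical helper that checks for consecutive tokens; the sample body text is invented for the example.

```python
import re


def keylower_analyze(text):
    # keyword tokenizer + lowercase filter: the whole input is one token.
    return [text.lower()]


def standard_like_analyze(text):
    # Crude approximation of the standard analyzer: lowercased word tokens.
    return re.findall(r"\w+", text.lower())


def phrase_matches(body, phrase, analyze):
    # True if the analyzed phrase appears as consecutive tokens in the body.
    doc_tokens = analyze(body)
    phrase_tokens = analyze(phrase)
    if not phrase_tokens:
        return False
    n = len(phrase_tokens)
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


body = "Meet me at 5 Park Avenue New York tomorrow"

print(phrase_matches(body, "Avenue New", keylower_analyze))
print(phrase_matches(body, "Avenue New", standard_like_analyze))
print(phrase_matches(body, "Avenue Street", standard_like_analyze))
```

With the single-token analyzer the phrase would only match if it were the entire body; with word tokens it matches 'Avenue New' and rejects 'Avenue Street'.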

Why is this term query not returning any results?

I am currently implementing a simple person search in Elasticsearch. I did some research and found quite a lot of content about how to implement features such as full-text search.
The problem is that some queries just don't return any results.
I have the following index template:
PUT /_template/template_hca_bp
{
"template": "test",
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "ngram",
"min_gram": 3,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
},
"search_ngram": {
"type": "custom",
"tokenizer": "lowercase",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"persons": {
"properties": {
"address": {
"properties": {
"city": {
"type": "text",
"search_analyzer": "standard",
"analyzer": "autocomplete"
},
"countryCode": {
"type": "keyword"
},
"doorNumber": {
"type": "keyword"
},
"id": {
"type": "text",
"index": "no",
"include_in_all": false
},
"stairwayNumber": {
"type": "keyword"
},
"street": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
},
"streetNumber": {
"type": "keyword"
},
"zipCode": {
"type": "keyword"
}
}
},
"id": {
"type": "keyword",
"index": "no",
"include_in_all": false
},
"name": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard",
"boost":2
},
"personType": {
"type": "keyword",
"index": "no",
"include_in_all": false
},
"title": {
"type": "text"
}
}
}
}
}
My query looks like the following:
POST test/_search
{
"query": {
"multi_match": {
"query": "Maria",
"type":"cross_fields",
"fields": [
"name^2", "city", "street", "streetNumber", "zipCode"
]
}
}
}
If I search for "Maria", for example, I get a result. But if I search for a zipCode (e.g. 12345), I don't get any result.
The analyze api has the following response:
"detail": {
"custom_analyzer": false,
"analyzer": {
"name": "default",
"tokens": [
{
"token": "12345",
"start_offset": 0,
"end_offset": 5,
"type": "<NUM>",
"position": 0,
"bytes": "[31 32 33 34 35]",
"positionLength": 1
}
]
}
}
I'm not getting any hits. I have tried term and match queries and all kinds of other things, but I can't get it working.
The desired document:
"id": "V2718984F3A0ADA95176424457A068F9DC93FC8BDA0898A4E8248F194AE1AF4FCE04C29F46367DDEC33721C15C2679B7BB",
"name": "Maria Smith",
"personType": "APO",
"address": {
"countryCode": "A",
"city": "Testcity",
"zipCode": "12345",
"street": "Avenue",
"streetNumber": "2"
}
