I am using the elastic-builder npm package.
Using esb.termQuery("Email", "test").
Mapping:
"CompanyName": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
Database fields:
"Email": "test#mycompany.com",
"CompanyName": "my company"
Query JSON: { term: { CompanyName: 'my' } }. or { term: { Email: 'test' } }
Result :
"Email": "test#mycompany.com",
"CompanyName": "my company"
Expectation:
No result, need a full-text match, Match here is acting like 'like' or queryStringQuery.
I have 3 filters prefix, exact match, include.
The standard analyzer is the default analyzer which is used if none is
specified. It provides grammar-based tokenization.
In your example, you are probably not specifying any analyzer explicitly in the index mapping, so the text fields are analyzed with the default, i.e. the standard analyzer.
Refer to this SO answer for a detailed explanation.
The following tokens are generated if no analyzer is defined.
POST /_analyze
{
"analyzer" : "standard",
"text" : "test#mycompany.com"
}
Tokens are:
{
"tokens": [
{
"token": "test",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "mycompany.com",
"start_offset": 5,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 1
}
]
}
If you want a full-text search, you can define a custom analyzer with a lowercase filter; the lowercase filter ensures that all letters are changed to lowercase both when indexing the document and when searching.
The normalizer property of keyword fields is similar to analyzer
except that it guarantees that the analysis chain produces a single
token.
The uax_url_email tokenizer is like the standard tokenizer except that
it recognises URLs and email addresses as single tokens.
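To see the difference, you can run the uax_url_email tokenizer through the _analyze API (a quick sketch, using the email value from the question):
POST /_analyze
{
  "tokenizer": "uax_url_email",
  "text": "test@mycompany.com"
}
This keeps the whole address as a single token of type <EMAIL>, instead of splitting it into test and mycompany.com.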
Index Mapping:
{
"settings": {
"analysis": {
"normalizer": {
"my_normalizer": {
"type": "custom",
"filter": [
"lowercase"
]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "uax_url_email"
}
}
}
},
"mappings": {
"properties": {
"CompanyName": {
"type": "keyword",
"normalizer": "my_normalizer"
},
"Email": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Index Data:
{
"Email": "test#mycompany.com",
"CompanyName": "my company"
}
Search Query:
{
"query": {
"bool": {
"should": [
{
"match": {
"CompanyName": "My Company"
}
},
{
"match": {
"Email": "test"
}
}
],
"minimum_should_match": 1
}
}
}
Search Result:
"hits": [
{
"_index": "stof_64220291",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"Email": "test#mycompany.com",
"CompanyName": "my company"
}
}
]
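With this mapping in place, the exact-match filter can be a plain term query on the normalized CompanyName keyword field (a sketch against the same index; the normalizer also lowercases the search term, so case does not matter):
POST /stof_64220291/_search
{
  "query": {
    "term": {
      "CompanyName": "My Company"
    }
  }
}
A partial value such as { "term": { "CompanyName": "my" } } now returns no hits, which is the behaviour you expected.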
Trying to fetch two documents that fit the searched params; searching for each document separately works fine.
The query:
{
"query":{
"bool":{
"should":[
{
"match_phrase":{
"email":"elpaso"
}
},
{
"match_phrase":{
"email":"walker"
}
}
]
}
}
}
I'm expecting to retrieve both documents that have these words in their email address field, but the query is only returning the first one (elpaso).
Is this an issue related to index mapping? I'm using type text for this field.
Any concept I am missing?
Index mapping:
{
"mappings": {
"properties": {
"id": {
"type": "keyword"
},
"name":{
"type": "text"
},
"email":{
"type" : "text"
}
}
}
}
Sample data:
{
"id":"4a43f351-7b62-42f2-9b32-9832465d271f",
"name":"Walker, Gary (Mr.) .",
"email":"walkergrym#mail.com"
}
{
"id":"1fc18c05-da40-4607-a901-3d78c523cea6",
"name":"Texas Chiropractic Association P.A.C.",
"email":"txchiro#mail.com"
}
{
"id":"9a2323f4-e008-45f0-9f7f-11a1f4439042",
"name":"El Paso Energy Corp. PAC",
"email":"elpaso#mail.com"
}
I also noticed that if I use elpaso and txchiro instead of walker, the query works as expected!
I noticed that the issue happens when I use only parts of the field. If I search by the exact, entire email address, everything works fine.
Is this expected from match_phrase?
You are not getting any result for walker because Elasticsearch uses the standard analyzer if no analyzer is specified, which will tokenize walkergrym@mail.com as follows:
GET /_analyze
{
"analyzer" : "standard",
"text" : "walkergrym@mail.com"
}
The following tokens are generated:
{
"tokens": [
{
"token": "walkergrym",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "mail.com",
"start_offset": 11,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 1
}
]
}
Since there is no token walker, you are not getting "walkergrym@mail.com" in your search result.
Whereas for "txchiro@mail.com" the generated tokens are txchiro and mail.com, and for "elpaso@mail.com" the tokens are elpaso and mail.com.
You can use the edge_ngram tokenizer to achieve your required result.
Adding a working example with index data, mapping, search query, and search result.
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 3,
"max_gram": 6,
"token_chars": [
"letter"
]
}
}
}
},
"mappings": {
"properties": {
"email": {
"type": "text",
"analyzer": "my_analyzer"
},
"id": {
"type": "keyword"
},
"name": {
"type": "text"
}
}
}
}
Search Query:
{
"query": {
"bool": {
"should": [
{
"match": {
"email": "elpaso"
}
},
{
"match": {
"email": "walker"
}
}
]
}
}
}
Search Result:
"hits": [
{
"_index": "66907434",
"_type": "_doc",
"_id": "1",
"_score": 3.9233165,
"_source": {
"id": "4a43f351-7b62-42f2-9b32-9832465d271f",
"name": "Walker, Gary (Mr.) .",
"email": "walkergrym#mail.com"
}
},
{
"_index": "66907434",
"_type": "_doc",
"_id": "3",
"_score": 3.9233165,
"_source": {
"id": "9a2323f4-e008-45f0-9f7f-11a1f4439042",
"name": "El Paso Energy Corp. PAC",
"email": "elpaso#mail.com"
}
}
]
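To see why walker now matches, you can run the custom analyzer from the mapping above through the _analyze API on the stored value (a sketch against the same index):
POST /66907434/_analyze
{
  "analyzer": "my_analyzer",
  "text": "walkergrym@mail.com"
}
With min_gram 3 and max_gram 6 this emits edge n-grams such as wal, walk, walke, walker, mai, mail, and com, so the search term walker has a matching token in the index.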
I want to perform partial search on 3 fields: UUID, tracking_id, and zip_code. They only contain one word and no special characters/spaces, except a hyphen in the UUID.
I'm not sure whether I should use search_as_you_type or edge ngram tokenizer or edge ngram token filter, so I tried search_as_you_type first.
I have created this index:
{
"settings": {
"index": {
"sort.field": [ "created_at", "id" ],
"sort.order": [ "desc", "desc" ]
}
},
"mappings": {
"properties": {
"id": { "type": "keyword", "fields": { "raw": { "type": "search_as_you_type" }}},
"current_status": { "type": "keyword" },
"tracking_id": { "type": "wildcard" },
"invoice_number": { "type": "keyword" },
"created_at": { "type": "date" }
}
}
}
and inserted this doc:
{
"id": "SIGRID",
"current_status": "unassigned",
"tracking_id": "AXXH",
"invoice_number": "xxx",
"created_at": "2021-03-24T09:36:10.717672467Z"
}
I sent this query:
{"query": {
"multi_match": {
"query": "sigrid",
"type": "bool_prefix",
"fields": [
"id"
]
}
}
}
This returns no result, but SIGRID, S, and SIG return the result. How can I make the search_as_you_type query case-insensitive? Should I use the edge_ngram tokenizer instead? Thanks.
You can define a custom normalizer with a lowercase filter; the lowercase filter ensures that all letters are changed to lowercase both when indexing the document and when searching. Modify your index mapping as follows:
{
"settings": {
"index": {
"sort.field": [
"created_at",
"id"
],
"sort.order": [
"desc",
"desc"
]
},
"analysis": {
"normalizer": {
"my_normalizer": {
"type": "custom", // note this
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"id": {
"type": "keyword",
"normalizer": "my_normalizer", // note this
"fields": {
"raw": {
"type": "search_as_you_type"
}
}
},
"current_status": {
"type": "keyword"
},
"tracking_id": {
"type": "wildcard"
},
"invoice_number": {
"type": "keyword"
},
"created_at": {
"type": "date"
}
}
}
}
Search Query:
{
"query": {
"multi_match": {
"query": "sigrid",
"type": "bool_prefix"
}
}
}
Search Result:
"hits": [
{
"_index": "66792606",
"_type": "_doc",
"_id": "1",
"_score": 2.0,
"_source": {
"id": "SIGRID",
"current_status": "unassigned",
"tracking_id": "AXXH",
"invoice_number": "xxx",
"created_at": "2021-03-24T09:36:10.717672467Z"
}
}
]
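Alternatively, you can point the multi_match explicitly at the search_as_you_type sub-fields, which are created automatically by that field type and are analyzed (and therefore lowercased) by default (a sketch against the same index, using the id.raw field from the mapping above):
POST /66792606/_search
{
  "query": {
    "multi_match": {
      "query": "sigrid",
      "type": "bool_prefix",
      "fields": [
        "id.raw",
        "id.raw._2gram",
        "id.raw._3gram"
      ]
    }
  }
}
Because the search_as_you_type field uses the index's default analyzer, this variant is case-insensitive as well.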
I am currently using this elasticsearch DSL query:
{
"_source": [
"title",
"bench",
"id_",
"court",
"date"
],
"size": 15,
"from": 0,
"query": {
"bool": {
"must": {
"multi_match": {
"query": "i r coelho",
"fields": [
"title",
"content"
]
}
},
"filter": [],
"should": {
"multi_match": {
"query": "i r coelho",
"fields": [
"title.standard^16",
"content.standard"
]
}
}
}
},
"highlight": {
"pre_tags": [
"<tag1>"
],
"post_tags": [
"</tag1>"
],
"fields": {
"content": {}
}
}
}
Here's what's happening. If I search for I.r coelho it returns the correct results. But if I search for I R coelho (without the period) then it returns a different result. How do I prevent this from happening? I want the search to behave the same even if there are extra periods, spaces, commas, etc.
Mapping
{
"courts_2": {
"mappings": {
"properties": {
"author": {
"type": "text",
"analyzer": "my_analyzer"
},
"bench": {
"type": "text",
"analyzer": "my_analyzer"
},
"citation": {
"type": "text"
},
"content": {
"type": "text",
"fields": {
"standard": {
"type": "text"
}
},
"analyzer": "my_analyzer"
},
"court": {
"type": "text"
},
"date": {
"type": "text"
},
"id_": {
"type": "text"
},
"title": {
"type": "text",
"fields": {
"standard": {
"type": "text"
}
},
"analyzer": "my_analyzer"
},
"verdict": {
"type": "text"
}
}
}
}
}
Settings:
{
"courts_2": {
"settings": {
"index": {
"highlight": {
"max_analyzed_offset": "19000000"
},
"number_of_shards": "5",
"provided_name": "courts_2",
"creation_date": "1581094116992",
"analysis": {
"filter": {
"my_metaphone": {
"replace": "true",
"type": "phonetic",
"encoder": "metaphone"
}
},
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase",
"my_metaphone"
],
"tokenizer": "standard"
}
}
},
"number_of_replicas": "1",
"uuid": "MZSecLIVQy6jiI6YmqOGLg",
"version": {
"created": "7010199"
}
}
}
}
}
EDIT
Here are the results for I.R coelho from my analyzer:
{
"tokens": [
{
"token": "IR",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "KLH",
"start_offset": 4,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 1
}
]
}
Standard analyzer:
{
"tokens": [
{
"token": "i.r",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "coelho",
"start_offset": 4,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 1
}
]
}
The reason why you have a different behaviour when searching for I.r coelho and I R coelho is that you are using different analyzers on the same fields, i.e., my_analyzer for title and content (must block), and standard (the default) for title.standard and content.standard (should block).
The two analyzers generate different tokens, and thus a different score, when you're searching for I.r coelho (e.g., 2 tokens with the standard analyzer) or I R coelho (e.g., 3 tokens with the standard analyzer). You can test the behaviour of your analyzers with the _analyze API (see the Elastic documentation).
You have to decide whether this is your desired behaviour.
Updates (after requested clarifications from OP)
The results of the _analyze query confirmed the hypothesis: the two analyzers lead to a different score contribution, and, subsequently, to different results depending on whether your query includes symbol chars or not.
If you don't want the results of your query to be affected by symbols such as dots or upper/lower case, you will need to reconsider what analyzers you want to apply. The ones currently used will never satisfy your requirements. If I understood your requirements correctly, the simple built-in analyzer should be the right one for your use case.
In a nutshell: (1) you should consider replacing the standard built-in analyzer with the simple one, and (2) you should decide whether you want your query to apply different scores to the hits based on different analyzers (i.e., the custom phonetic one on the title and content fields, and the simple one on their respective subfields).
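To verify, you can compare the two query strings with the simple analyzer through the _analyze API (a quick sketch):
POST /_analyze
{
  "analyzer": "simple",
  "text": "I.R coelho"
}
Both "I.R coelho" and "I R coelho" produce the same tokens here (i, r, coelho), because the simple analyzer splits on any non-letter character and lowercases, so the two queries would score identically.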
I'm trying to match text with an "@" prefix, e.g. "@stackoverflow", on Elasticsearch. I'm using a boolean query, and both of these return the exact same results and actually ignore my @ sign:
Query 1 with @:
{"query":{"bool":{"must":[{"query_string":{"default_field":"text","default_operator":"AND","query":"@stackoverflow"}}]}},"size":20}
Query 2 without:
{"query":{"bool":{"must":[{"query_string":{"default_field":"text","default_operator":"AND","query":"stackoverflow"}}]}},"size":20}
My Mapping:
{"posts":{"mappings":{"post":{"properties":{"upvotes":{"type":"long"},"created_time":{"type":"date","format":"strict_date_optional_time||epoch_millis"},"ratings":{"type":"long"},"link":{"type":"string"},"pic":{"type":"string"},"text":{"type":"string"},"id":{"type":"string"}}}}}}
I've tried encoding it to \u0040 but that didn't make any difference.
Your text field is an analyzed string, so the standard analyzer is applied by default, which means that @stackoverflow will be indexed as stackoverflow after the analysis process, as can be seen below:
GET /_analyze?analyzer=standard&text=@stackoverflow
{
"tokens": [
{
"token": "stackoverflow",
"start_offset": 1,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 0
}
]
}
You probably want to either use the keyword type if you need exact matching or specify a different analyzer, such as whitespace, which will preserve the @ sign in your data:
GET /_analyze?analyzer=whitespace&text=@stackoverflow
{
"tokens": [
{
"token": "@stackoverflow",
"start_offset": 0,
"end_offset": 14,
"type": "word",
"position": 0
}
]
}
UPDATE:
Then I suggest using a custom analyzer for that field so you can control how the values are indexed. Recreate your index like this and then you should be able to do your searches:
PUT posts
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [ "lowercase" ]
}
}
}
}
},
"mappings": {
"post": {
"properties": {
"upvotes": {
"type": "long"
},
"created_time": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"ratings": {
"type": "long"
},
"link": {
"type": "string"
},
"pic": {
"type": "string"
},
"text": {
"type": "string",
"analyzer": "my_analyzer"
},
"id": {
"type": "string"
}
}
}
}
}
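After reindexing, the original query_string search should behave as intended, since @stackoverflow is now kept (lowercased) as a single token (a sketch reusing the query from the question):
POST /posts/_search
{
  "size": 20,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "default_field": "text",
            "default_operator": "AND",
            "query": "@stackoverflow"
          }
        }
      ]
    }
  }
}
Searching for stackoverflow without the @ no longer matches documents that only contain @stackoverflow, and vice versa.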
I have an elasticsearch index with the following data:
"The A-Team" (as an example)
My index settings are :
"index": {
"number_of_shards": "1",
"provided_name": "tyh.tochniyot",
"creation_date": "1481039136127",
"analysis": {
"analyzer": {
"whitespace_analyzer": {
"type": "whitespace"
},
"ngram_analyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer"
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "ngram",
"min_gram": "3",
"max_gram": "7"
}
}
},
When I search for:
_search
{
"from": 0,
"size": 20,
"track_scores": true,
"highlight": {
"fields": {
"*": {
"fragment_size": 100,
"number_of_fragments": 10,
"require_field_match": false
}
}
},
"query": {
"match": {
"_all": {
"query": "Tea"
}
}
}
}
I expect to get the highlight result:
"highlight": {
"field": [
"The A-<em>Tea</em>m"
]
}
But I don't get any highlight at all.
The reason I am using whitespace for search and ngram for indexing is that I don't want the search phase to break up the word I am searching for; e.g., if I search for "Team", it would otherwise find "Tea", "eam", and "Team".
Thank you
The problem was that my analyzer and search analyzer were running on the _all field.
When I placed the analyzer attribute on the specific fields, the highlight started working.
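For reference, a minimal sketch of that fix on a recent Elasticsearch version (older versions need the mapping type in the URL); the analyzer names come from the settings above, while the title field is a hypothetical field holding values like "The A-Team":
PUT tyh.tochniyot/_mapping
{
  "properties": {
    "title": {                                    // note: hypothetical field name
      "type": "text",
      "analyzer": "ngram_analyzer",               // ngram at index time
      "search_analyzer": "whitespace_analyzer"    // whitespace at search time
    }
  }
}
With the analyzers attached to a concrete field instead of _all, searching for "Tea" matches the indexed n-grams and the highlighter can map the hit back to the title field.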