Search query about phone number in Elasticsearch

I have a question about Elasticsearch.
I made a search query for a phone number. My goal is that even if I don't type the hyphen or brackets, the result would still show the phone number.
For example, the phone number is (213)234-1111 and the search query is:
GET _msearch
{}
{ "query": {"fuzzy": { "tel": {"value": "2132341111", "max_expansions" : 100}}}}
the result is
{
"took" : 0,
"responses" : [
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"status" : 200
}
]
}
I need help so that even if I enter the number without the brackets and hyphen, the result shows the real phone number with its information.

To allow efficient querying, make sure to index the documents accordingly.
In the example below, I am making sure that phone numbers are indexed without the hyphens and parentheses. This allows me to query without those characters as well.
Example:
(1) Create the index:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"default": {
"tokenizer": "standard",
"char_filter": [
"my_char_filter"
]
}
},
"char_filter": {
"my_char_filter": {
"type": "pattern_replace",
"pattern": "\\((\\d+)\\)(\\d+)-(\\d+)",
"replacement": "$1$2$3"
}
}
}
}
}
(2) Add a document to the index:
POST my_index/doc
{
"Description": "My phone number is (213)234-1111"
}
(3) Query with the original phone number:
GET my_index/_search
{
"query": {
"match": {
"Description": "(213)234-1111"
}
}
}
(1 result)
(4) Query without special characters:
GET my_index/_search
{
"query": {
"match": {
"Description": "2132341111"
}
}
}
(1 result)
So how did that work?
By using the pattern_replace char filter, we're stripping away everything but the raw digits whenever the pattern matches a phone number, meaning that "(213)234-1111" is actually indexed as "2132341111". Since this pattern_replace is also applied at query time, we can now search both with and without the special characters in the phone number and get a match.
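To double-check what actually gets indexed, you can run the original text through the index's default analyzer with the _analyze API (a quick sketch, assuming the my_index setup from step (1)):
GET my_index/_analyze
{
  "text": "My phone number is (213)234-1111"
}
The token list should contain 2132341111 rather than the punctuated form.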

Related

Elasticsearch's minimumShouldMatch for each member of an array

Consider an Elasticsearch entity:
{
"id": 123456,
"keywords": ["apples", "bananas"]
}
Now, imagine I would like to find this entity by searching for apple.
{
"match" : {
"keywords" : {
"query" : "apple",
"operator" : "AND",
"minimum_should_match" : "75%"
}
}
}
The problem is that the 75% minimum for matching would be required for both of the strings in the array, so nothing will be found. Is there a way to say something like minimumShouldMatch: "75% of any array field"?
Note that I need to use AND as each item of keywords may be composed of longer text.
EDIT:
I tried the proposed solutions, but none of them gave the expected results. I guess the problem is that the text might be quite long, e.g.:
["national gallery in prague", "narodni galerie v praze"]
I guess the fuzzy expansion is just not able to expand such long strings if you just start searching by "national g".
Would this maybe be possible somehow via nested objects?
{ "keywords": [{"keyword": "apples"}, {"keyword": "bananas"}] }
and then have minimumShouldMatch=1 on keywords and then 75% on each keyword?
As per the docs:
The match query is of type boolean. It means that the text provided is analyzed and the analysis process constructs a boolean query from the provided text. The operator parameter can be set to or or and to control the boolean clauses (defaults to or). The minimum number of optional should clauses to match can be set using the minimum_should_match parameter.
If you are searching for multiple tokens, for example "apples mangoes", and set the minimum to 100%, it means both tokens must be present in the document. If you set it to 50%, it means at least one of them must be present.
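As a sketch of that behaviour (the index name index-name only matches the mapping example further down and is illustrative):
GET index-name/_search
{
  "query": {
    "match": {
      "keywords": {
        "query": "apples mangoes",
        "minimum_should_match": "100%"
      }
    }
  }
}
With "minimum_should_match": "50%" the same query would also match documents containing only one of the two tokens.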
If you want to match tokens partially, you can use the fuzziness parameter.
Using fuzziness you can set the maximum edit distance allowed for matching:
{
"query": {
"match": {
"keywords": {
"query": "apple",
"fuzziness": "auto"
}
}
}
}
If you are trying to match a word to its root form, you can use the "stemmer" token filter:
PUT index-name
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [ "stemmer" ]
}
}
}
},
"mappings": {
"properties": {
"keywords":{
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Tokens generated
GET index-name/_analyze
{
"text": ["apples", "bananas"],
"analyzer": "my_analyzer"
}
"tokens" : [
{
"token" : "appl",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "banana",
"start_offset" : 7,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 101
}
]
Stemming reduces words to their root form.
You can also explore n-grams and edge n-grams for partial matching.
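For example, with an edge n-gram analyzer each token is indexed as all of its prefixes, so a partial query such as "national g" can match "national gallery in prague". A minimal sketch (the index and analyzer names are made up for illustration):
PUT index-name-ngram
{
  "settings": {
    "analysis": {
      "analyzer": {
        "edge_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "edge_ngram_tokenizer",
          "filter": [ "lowercase" ]
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [ "letter" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "keywords": {
        "type": "text",
        "analyzer": "edge_ngram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
Using "standard" as the search_analyzer keeps the query terms whole, while the indexed prefixes provide the partial matching.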

Get the number of appearances of a particular term in an elasticsearch field

I have an Elasticsearch index (posts) with the following mappings:
{
"id": "integer",
"title": "text",
"description": "text"
}
I want to simply find the number of occurrences of a particular term inside the description field for a single particular document (I have the document id and the term to find).
E.g. I have a post like this: {id: 123, title:"some title", description: "my city is LA, this post description has two occurrences of word city "}.
I have the document id / post id for this post and just want to find how many times the word "city" appears in the description of this particular post (the result should be 2 in this case).
I can't seem to find a way to do this search; I don't want the occurrences across ALL the documents, just for a single document and inside one of its fields. Please suggest a query for this. Thanks
Elasticsearch Version: 7.5
You can use a terms aggregation on your description field, but you need to make sure fielddata is set to true on it.
PUT kamboh/
{
"mappings": {
"properties": {
"id": {
"type": "integer"
},
"title": {
"type": "text"
},
"description": {
"type": "text",
"fields": {
"simple_analyzer": {
"type": "text",
"fielddata": true,
"analyzer": "simple"
},
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Ingesting a sample doc:
PUT kamboh/_doc/1
{
"id": 123,
"title": "some title",
"description": "my city is LA, this post description has two occurrences of word city "
}
Aggregating:
GET kamboh/_search
{
"size": 0,
"aggregations": {
"terms_agg": {
"terms": {
"field": "description.simple_analyzer",
"size": 20
}
}
}
}
Yielding:
"aggregations" : {
"terms_agg" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "city",
"doc_count" : 1
},
{
"key" : "description",
"doc_count" : 1
},
...
]
}
}
Now, as you can see, the simple analyzer split the string into lowercased words, but city still only shows 1: a terms aggregation reports doc_count, i.e. in how many documents a term appears, not how often it appears within one document, so it cannot give you the per-document occurrence count directly. With that being said,
It's advisable to do these word counts before you index!
You would split your string on whitespace and index the words as an array (or as precomputed counts) instead of one long string.
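One possible shape for that, if the counting is done client-side before indexing (the word_counts field is hypothetical and not part of the mapping above):
PUT kamboh/_doc/2
{
  "id": 123,
  "title": "some title",
  "description": "my city is LA, this post description has two occurrences of word city ",
  "word_counts": {
    "my": 1, "city": 2, "is": 1, "la": 1, "this": 1, "post": 1,
    "description": 1, "has": 1, "two": 1, "occurrences": 1, "of": 1, "word": 1
  }
}
The count for a given word can then be read straight from the document; with a large vocabulary, mapping word_counts as a flattened field (or disabling it with "enabled": false) avoids creating one mapping field per distinct word.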
This is also possible at search time, albeit it's very expensive, does not scale well, and requires script.painless.regex.enabled: true in your elasticsearch.yml:
GET kamboh/_search
{
"size": 0,
"aggregations": {
"terms_script": {
"scripted_metric": {
"params": {
"word_of_interest": ""
},
"init_script": "state.map = [:];",
"map_script": """
if (!doc.containsKey('description')) return;
def split_by_whitespace = / /.split(doc['description.keyword'].value);
for (def word : split_by_whitespace) {
if (params['word_of_interest'] !== "" && params['word_of_interest'] != word) {
return;
}
if (state.map.containsKey(word)) {
state.map[word] += 1;
return;
}
state.map[word] = 1;
}
""",
"combine_script": "return state.map;",
"reduce_script": "return states;"
}
}
}
}
yielding
...
"aggregations" : {
"terms_script" : {
"value" : [
{
"occurrences" : 1,
"post" : 1,
"city" : 2, <------
"LA," : 1,
"of" : 1,
"this" : 1,
"description" : 1,
"is" : 1,
"has" : 1,
"my" : 1,
"two" : 1,
"word" : 1
}
]
}
}
...

Wildcard-querying an array with forward slash in the query

In my documents indexed by Elasticsearch, I have a field called IPC8s.IPC8 which is an array of strings and can look like these:
["B63H011/00"]
["B60F3", "B60K1", "B60K17", "B60K17/23", "B60K6", "B60K6"]
["G06F017/00"]
etc...
(for anyone curious, these are CPC patent classification numbers)
I need to query this field with trailing wildcards. In other words, if I put in "B63H", the document containing "B63H011/00" should match. Same if I put in "B63H011/" or "B63H011/0".
I tried multiple queries, none of which worked:
{
query_string: {
default_field: "IPC8s.IPC8",
query: "(B63H*) OR (B63H011/*)",
analyze_wildcard: true
}
}
I also tried this one with \"B63H*\" OR \"B63H011/*\"; it doesn't work.
Then I tried:
[{
wildcard: {
"IPC8s.IPC8": { value: "B63H*" }
}
},
{
wildcard: {
"IPC8s.IPC8": { value: "B63H011/*" }
}
}]
This doesn't work either. I then tried escaping the "/" because it has to be taken literally. That didn't work either.
What am I doing wrong? Thanks.
Edit: Here is the mapping for that specific field:
"IPC8s": {
"properties": {
"IPC8": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
Here is my latest try that still didn't work (if I don't escape the forward slash, Elasticsearch returns an error):
{
query_string: {
default_field: "IPC8s.IPC8",
query: "(B63H*) OR (B63H011\\/*)",
analyze_wildcard: true,
analyzer: "keyword"
}
}
Edit 2: This seems to do the trick:
{
query_string: {
default_field: "IPC8s.IPC8.keyword",
query: "(B63H*) OR (B63H011\\/*)",
analyze_wildcard: true,
analyzer: "keyword"
}
}
The text type with the standard analyzer creates the following tokens, hence you are not able to search on /:
{
"tokens" : [
{
"token" : "b63h011",
"start_offset" : 0,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "00",
"start_offset" : 8,
"end_offset" : 10,
"type" : "<NUM>",
"position" : 1
}
]
}
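(For reference, a token listing like this can be reproduced with the _analyze API, using the value from the question:)
GET _analyze
{
  "analyzer": "standard",
  "text": "B63H011/00"
}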
Create a subfield for IPC8 with type keyword, which stores the text as it is.
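The mapping in the question already defines such a subfield; a minimal sketch of it, using the index21 name from the query below:
PUT index21
{
  "mappings": {
    "properties": {
      "IPC8s": {
        "properties": {
          "IPC8": {
            "type": "text",
            "fields": {
              "keyword": { "type": "keyword", "ignore_above": 256 }
            }
          }
        }
      }
    }
  }
}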
GET index21/_search
{
"query": {
"wildcard": {
"IPC8s.IPC8.keyword": {
"value": "B63H011/*"
}
}
}
}

Elastic Search multilingual field

I have read through a few articles and pieces of advice, but unfortunately I haven't found a working solution.
The problem is that I have a field in the index that can contain content in any language, and I don't know which language it is. I need to search and sort on it. It is not localisation, just values in different languages.
The first language (excluding a few European ones) I tried it on was Japanese. To begin with, I set only one analyzer for this field and tried to search only for Japanese words/phrases. I took an example from here. Here is what I used:
'analysis': {
"filter": {
...
"ja_pos_filter": {
"type": "kuromoji_part_of_speech",
"stoptags": [
"\\u52a9\\u8a5e-\\u683c\\u52a9\\u8a5e-\\u4e00\\u822c",
"\\u52a9\\u8a5e-\\u7d42\\u52a9\\u8a5e"]
},
...
},
"analyzer": {
...
"ja_analyzer": {
"type": "custom",
"filter": ["kuromoji_baseform", "ja_pos_filter", "icu_normalizer", "icu_folding", "cjk_width"],
"tokenizer": "kuromoji_tokenizer"
},
...
},
"tokenizer": {
"kuromoji": {
"type": "kuromoji_tokenizer",
"mode": "search"
}
}
}
Mapper:
'name': {
'type': 'string',
'index': 'analyzed',
'analyzer': 'ja_analyzer',
}
And here are a few attempts to get results from it:
{
'filter': {
'query': {
'bool': {
'must': [
{
# 'wildcard': {'name': u'*ネバーランド福島*'}
# 'match': {'name": u'ネバーランド福島'
# },
"query_string": {
"fields": ['name'],
"query": u'ネバーランド福島',
"default_operator": 'AND'
}
},
],
'boost': 1.0
}
}
}
}
None of them works.
If I just take the standard analyser and query with query_string, or break the phrase myself (breaking on whitespace, which I don't have here) and use a wildcard *<>* for it, it again finds nothing. The analyser says that ネバーランド and 福島 are separate words/parts:
curl -XPOST 'http://localhost:9200/test/_analyze?analyzer=ja_analyzer&pretty' -d 'ネバーランド福島'
{
"tokens" : [ {
"token" : "ネハラント",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 1
}, {
"token" : "福島",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 2
} ]
}
And in the case of the standard analyser, if I search for ネバーランド I get what I want. But if I use the customised analyser and try the same, or even just one symbol, I still get nothing.
The behaviour I'm looking for is: break the query string into words/parts, and all of those words/parts should be present in the resulting name field.
Thank you in advance

elasticsearch: match query can get result, but more like query can not

I have a question about an elasticsearch query:
I have two docs; the content field uses the standard analyzer, and the values are:
"content": "But Buffett's stake in Wells Fargo, and johnson hot stakes in a number of other major banks"
"content": "Taco Bell's chief information johnson hot Fancher told Nation's Restaurant News that mobile is their prime focus at the moment"
I can get these two docs when I used match query:
localhost:9200/content/en-us/_search?pretty=true -d '
{
"query" : {
"match" : {
"content" : "johnson hot"
}
}
}'
But I can't get them using the more like this query:
localhost:9200/content/en-us/_search?pretty=true -d '
{
"from": 0,
"size": 200,
"query": {
"more_like_this_field": {
"content": {
"like_text": "johnson hot",
"min_term_freq": 1,
"max_query_terms": 12
}
}
}
}'
result:
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
Does anybody know why?
Is the more like this query similar to LIKE in SQL?
Thanks
Johnson
The more like this field query gives you similar docs by matching the content of the specified fields.
You have to set min_doc_freq to 1. The default value is 5, so it ignores terms from like_text that appear in fewer than 5 docs. As you have only two docs, both the terms johnson and hot will be ignored. The query below should work:
curl 'http://localhost:9200/content/en-us/_search?pretty=true' -d '{
"from": 0,
"size": 200,
"query": {
"more_like_this_field": {
"content": {
"like_text": "johnson hot",
"min_term_freq": 1,
"max_query_terms": 12,
"min_doc_freq": 1
}
}
}
}'
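For reference, in later Elasticsearch versions the more_like_this_field query was folded into more_like_this; a rough equivalent of the query above would be:
GET content/_search
{
  "from": 0,
  "size": 200,
  "query": {
    "more_like_this": {
      "fields": [ "content" ],
      "like": "johnson hot",
      "min_term_freq": 1,
      "max_query_terms": 12,
      "min_doc_freq": 1
    }
  }
}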
