Wildcard-querying an array with forward slash in the query - elasticsearch

In my documents indexed by Elasticsearch, I have a field called IPC8s.IPC8 which is an array of strings that can look like these:
["B63H011/00"]
["B60F3", "B60K1", "B60K17", "B60K17/23", "B60K6", "B60K6"]
["G06F017/00"]
etc...
(for anyone curious, these are CPC patent classification numbers)
I need to query this field with trailing wildcards. In other words, if I put in "B63H", the document containing "B63H011/00" should match. Same if I put in "B63H011/" or "B63H011/0".
I tried multiple queries, none of which worked:
{
  "query_string": {
    "default_field": "IPC8s.IPC8",
    "query": "(B63H*) OR (B63H011/*)",
    "analyze_wildcard": true
  }
}
I also tried this one with "B63H*" OR "B63H011/*"; that doesn't work either.
Then I tried:
[{
  "wildcard": {
    "IPC8s.IPC8": { "value": "B63H*" }
  }
},
{
  "wildcard": {
    "IPC8s.IPC8": { "value": "B63H011/*" }
  }
}]
This doesn't work either. I then tried escaping the "/" because it has to be taken literally; that didn't work either.
What am I doing wrong? Thanks.
Edit: Here is the mapping for that specific field:
"IPC8s": {
  "properties": {
    "IPC8": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    }
  }
}
Here is my latest try, which still didn't work (if I don't escape the forward slash, Elasticsearch returns an error):
{
  "query_string": {
    "default_field": "IPC8s.IPC8",
    "query": "(B63H*) OR (B63H011\\/*)",
    "analyze_wildcard": true,
    "analyzer": "keyword"
  }
}
Edit 2: This seems to do the trick:
{
  "query_string": {
    "default_field": "IPC8s.IPC8.keyword",
    "query": "(B63H*) OR (B63H011\\/*)",
    "analyze_wildcard": true,
    "analyzer": "keyword"
  }
}

A text field with the standard analyzer will create the following tokens, hence you are not able to search on /:
{
  "tokens" : [
    {
      "token" : "b63h011",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "00",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "<NUM>",
      "position" : 1
    }
  ]
}
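The tokens above can be reproduced with the _analyze API (a sketch; no index is needed, since it uses the built-in standard analyzer):

```json
POST _analyze
{
  "analyzer": "standard",
  "text": "B63H011/00"
}
```

The forward slash is dropped at tokenization time, which is why a wildcard containing "/" can never match the analyzed text field.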
Create a subfield for IPC8 of type keyword, which will store the text as it is:
GET index21/_search
{
  "query": {
    "wildcard": {
      "IPC8s.IPC8.keyword": {
        "value": "B63H011/*"
      }
    }
  }
}

Related

Get the number of appearances of a particular term in an elasticsearch field

I have an elasticsearch index (posts) with the following mappings:
{
  "id": "integer",
  "title": "text",
  "description": "text"
}
I want to simply find the number of occurrences of a particular term inside the description field for a single particular document (I have the document id and the term to find).
E.g. I have a post like this: {id: 123, title: "some title", description: "my city is LA, this post description has two occurrences of word city "}.
I have the document id/post id for this post and just want to find how many times the word "city" appears in the description for this particular post (the result should be 2 in this case).
I can't seem to find a way to do this search; I don't want the occurrences across ALL the documents, just for a single document and inside one of its fields. Please suggest a query for this. Thanks
Elasticsearch Version: 7.5
You can use a terms aggregation on your description field, but you need to make sure fielddata is set to true on it.
PUT kamboh/
{
  "mappings": {
    "properties": {
      "id": {
        "type": "integer"
      },
      "title": {
        "type": "text"
      },
      "description": {
        "type": "text",
        "fields": {
          "simple_analyzer": {
            "type": "text",
            "fielddata": true,
            "analyzer": "simple"
          },
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
Ingesting a sample doc:
PUT kamboh/_doc/1
{
  "id": 123,
  "title": "some title",
  "description": "my city is LA, this post description has two occurrences of word city "
}
Aggregating:
GET kamboh/_search
{
  "size": 0,
  "aggregations": {
    "terms_agg": {
      "terms": {
        "field": "description.simple_analyzer",
        "size": 20
      }
    }
  }
}
Yielding:
"aggregations" : {
"terms_agg" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "city",
"doc_count" : 1
},
{
"key" : "description",
"doc_count" : 1
},
...
]
}
}
Now, as you can see, the simple analyzer split the string into words and made them lowercase, but it also got rid of the duplicate city in your string! I could not come up with an analyzer that would keep the duplicates... With that being said,
it's advisable to do these word counts before you index!
You would split your string by whitespace and index the words as an array instead of one long string.
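A minimal pre-indexing sketch of that idea in Python (the extra field name description_word_counts is hypothetical, and the punctuation stripping only roughly mimics a lowercase word tokenizer):

```python
from collections import Counter

def word_counts(description: str) -> dict:
    # Split on whitespace, lowercase, and strip trailing punctuation,
    # roughly mirroring what a simple lowercase tokenizer would produce.
    return dict(Counter(w.lower().strip(",.") for w in description.split()))

doc = {
    "id": 123,
    "title": "some title",
    "description": "my city is LA, this post description has two occurrences of word city ",
}
# Store the counts alongside the document before indexing it,
# so a term count is a plain field lookup at search time.
doc["description_word_counts"] = word_counts(doc["description"])
```

With the counts precomputed, "how often does city appear in document 123" becomes a simple field access rather than an aggregation or script.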
This is also possible at search time, albeit very expensive; it does not scale well, and you need script.painless.regex.enabled: true in your elasticsearch.yml:
GET kamboh/_search
{
  "size": 0,
  "aggregations": {
    "terms_script": {
      "scripted_metric": {
        "params": {
          "word_of_interest": ""
        },
        "init_script": "state.map = [:];",
        "map_script": """
          if (!doc.containsKey('description')) return;
          def split_by_whitespace = / /.split(doc['description.keyword'].value);
          for (def word : split_by_whitespace) {
            if (params['word_of_interest'] != "" && params['word_of_interest'] != word) {
              continue;
            }
            if (state.map.containsKey(word)) {
              state.map[word] += 1;
              continue;
            }
            state.map[word] = 1;
          }
        """,
        "combine_script": "return state.map;",
        "reduce_script": "return states;"
      }
    }
  }
}
yielding:
...
"aggregations" : {
  "terms_script" : {
    "value" : [
      {
        "occurrences" : 1,
        "post" : 1,
        "city" : 2, <------
        "LA," : 1,
        "of" : 1,
        "this" : 1,
        "description" : 1,
        "is" : 1,
        "has" : 1,
        "my" : 1,
        "two" : 1,
        "word" : 1
      }
    ]
  }
}
...

How do I query a null date inside an array in elasticsearch?

In an elasticsearch query I am trying to search Document objects that have an array of approval notifications. The notifications are considered complete when dateCompleted is populated with a date, and considered pending when either dateCompleted doesn't exist or exists with null. If the document does not contain an array of approval notifications then it is out of the scope of the search.
I am aware of putting null_value for field dateCompleted and setting it to some arbitrary old date but that seems hackish to me.
I've tried bool queries with must exists on doc.approvalNotifications and must_not exists on doc.approvalNotifications.dateCompleted, but that does not work if a document contains a mix of complete and pending approvalNotifications: it only returns the document with ID 2 below. I am expecting documents with IDs 1 and 2 to be found.
How can I find pending approval notifications using elasticsearch?
PUT my_index/_mapping/Document
{
  "properties" : {
    "doc" : {
      "properties" : {
        "approvalNotifications" : {
          "properties" : {
            "approvalBatchId" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "approvalTransitionState" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "approvedByUser" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "dateCompleted" : {
              "type" : "date"
            }
          }
        }
      }
    }
  }
}
Documents:
{
  "id": 1,
  "status": "Pending Notifications",
  "approvalNotifications": [
    {
      "approvalBatchId": "e6c39194-5475-4168-9729-8ddcf46cf9ab",
      "dateCompleted": "2018-11-15T16:09:15.346+0000"
    },
    {
      "approvalBatchId": "05eaeb5d-d802-4a28-b699-5e593a59d445"
    }
  ]
}
{
  "id": 2,
  "status": "Pending Notifications",
  "approvalNotifications": [
    {
      "approvalBatchId": "e6c39194-5475-4168-9729-8ddcf46cf9ab"
    }
  ]
}
{
  "id": 3,
  "status": "Complete",
  "approvalNotifications": [
    {
      "approvalBatchId": "e6c39194-5475-4168-9729-8ddcf46cf9ab",
      "dateCompleted": "2018-11-15T16:09:15.346+0000"
    },
    {
      "approvalBatchId": "05eaeb5d-d802-4a28-b699-5e593a59d445",
      "dateCompleted": "2018-11-16T16:09:15.346+0000"
    }
  ]
}
{
  "id": 4,
  "status": "No Notifications"
}
You are almost there; you can achieve the desired behavior by using the nested datatype for the "approvalNotifications" field.
What happens is that Elasticsearch flattens your approvalNotifications objects, treating their subfields as subfields of the original document. The nested type instead tells Elasticsearch to index each inner object as an implicit separate document, though related to the original one.
To query nested objects, one should use a nested query.
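A sketch of such a query, assuming approvalNotifications has been re-mapped with "type": "nested" (field paths taken from the question's mapping):

```json
GET my_index/_search
{
  "query": {
    "nested": {
      "path": "doc.approvalNotifications",
      "query": {
        "bool": {
          "must_not": {
            "exists": {
              "field": "doc.approvalNotifications.dateCompleted"
            }
          }
        }
      }
    }
  }
}
```

Because each inner object is matched separately, this should return documents 1 and 2: each contains at least one notification without dateCompleted, while document 3 (all complete) and document 4 (no notifications at all) are excluded.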
Hope that helps!

ElasticSearch: preserve_position_increments not working

According to the docs
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html
preserve_position_increments=false is supposed to make consecutive keywords in a string searchable. But for me it's not working. Is this a bug? Steps to reproduce in Kibana:
PUT /example-index/
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_doc": {
      "properties": {
        "example-suggest-field": {
          "type": "completion",
          "analyzer": "stop",
          "preserve_position_increments": false,
          "max_input_length": 50
        }
      }
    }
  }
}
PUT /example-index/_doc/1
{
  "example-suggest-field": [
    {
      "input": "Nevermind Nirvana",
      "weight": 10
    }
  ]
}
POST /example-index/_search
{
  "suggest": {
    "bib-suggest": {
      "prefix": "nir",
      "completion": {
        "field": "example-suggest-field"
      }
    }
  }
}
POST /example-index/_search
{
  "suggest": {
    "bib-suggest": {
      "prefix": "nev",
      "completion": {
        "field": "example-suggest-field"
      }
    }
  }
}
If yes, I will file a bug report.
It's not a bug; preserve_position_increments is only useful when you are removing stopwords and would like to search for the token coming after the stopword (i.e. search for Beat and find The Beatles).
In your case, you should probably index ["Nevermind", "Nirvana"] instead, i.e. an array of tokens.
If you try indexing "The Nirvana" instead, you'll find it by searching for nir.
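Indexing the inputs as an array of tokens could look like this (same index and field as in the question; a completion field's input accepts an array of strings):

```json
PUT /example-index/_doc/1
{
  "example-suggest-field": [
    {
      "input": ["Nevermind", "Nirvana"],
      "weight": 10
    }
  ]
}
```

With each word as its own input, both the nev and nir prefix suggestions above should return the document.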

Elastic Search multilingual field

I have read through a few articles and pieces of advice, but unfortunately I haven't found a working solution.
The problem is that I have a field in the index that can have content in any possible language, and I don't know which language it is in. I need to search and sort on it. It is not localisation, just values in different languages.
The first language (excluding a few European ones) I tried it on was Japanese. To start with, I set only one analyzer for this field and tried to search only for Japanese words/phrases. I took the example from here. Here is what I used:
'analysis': {
  "filter": {
    ...
    "ja_pos_filter": {
      "type": "kuromoji_part_of_speech",
      "stoptags": [
        "\\u52a9\\u8a5e-\\u683c\\u52a9\\u8a5e-\\u4e00\\u822c",
        "\\u52a9\\u8a5e-\\u7d42\\u52a9\\u8a5e"]
    },
    ...
  },
  "analyzer": {
    ...
    "ja_analyzer": {
      "type": "custom",
      "filter": ["kuromoji_baseform", "ja_pos_filter", "icu_normalizer", "icu_folding", "cjk_width"],
      "tokenizer": "kuromoji_tokenizer"
    },
    ...
  },
  "tokenizer": {
    "kuromoji": {
      "type": "kuromoji_tokenizer",
      "mode": "search"
    }
  }
}
Mapper:
'name': {
  'type': 'string',
  'index': 'analyzed',
  'analyzer': 'ja_analyzer',
}
And here are a few tries to get a result from it:
{
  'filter': {
    'query': {
      'bool': {
        'must': [
          {
            # 'wildcard': {'name': u'*ネバーランド福島*'}
            # 'match': {'name': u'ネバーランド福島'
            # },
            "query_string": {
              "fields": ['name'],
              "query": u'ネバーランド福島',
              "default_operator": 'AND'
            }
          },
        ],
        'boost': 1.0
      }
    }
  }
}
None of them works.
If I just take the standard analyser and query with query_string, or break the phrase myself (on whitespace, which I don't have here) and use a wildcard *<>*, it finds nothing again. The analyser says that ネバーランド and 福島 are separate words/parts:
curl -XPOST 'http://localhost:9200/test/_analyze?analyzer=ja_analyzer&pretty' -d 'ネバーランド福島'
{
  "tokens" : [ {
    "token" : "ネハラント",
    "start_offset" : 0,
    "end_offset" : 6,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "福島",
    "start_offset" : 6,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  } ]
}
And with the standard analyser, I do get a result if I look for ネバーランド; I get what I want. But if I use the customised analyser and try the same, or even just one symbol, I still get nothing.
The behaviour I'm looking for: break the query string into words/parts, and all of those words/parts should be present in the resulting name field.
Thank you in advance.

Elastic exact match w/o changing indexing

I have the following query to Elasticsearch:
"query": {
"filtered": {
"filter": {
"and": {
"filters": [
{
"term": {
"entities.hashtags": "gf"
}
}
]
}
},
"query": {
"match_phrase": {
"body": "anime"
}
}
}
},
entities.hashtags is an array, and as a result I receive entries with hashtags gf_anime, gf_whatever, gf_foobar, etc.
But what I need is to receive entries where the exact hashtag "gf" exists.
I've looked at other questions on SO and saw that the solution in this case is to change the analysis of entities.hashtags so it'll match only exact values (I am pretty new to Elasticsearch, so I may get the terminology wrong).
My question is whether it's possible to define an exact-match search INSIDE THE QUERY, i.e. without changing how Elasticsearch indexes its fields?
Are you sure that you need to do anything? Given your examples, you don't, and you probably don't want to use not_analyzed:
curl -XPUT localhost:9200/test -d '{
  "mappings": {
    "test" : {
      "properties": {
        "body" : { "type" : "string" },
        "entities" : {
          "type" : "object",
          "properties": {
            "hashtags" : {
              "type" : "string"
            }
          }
        }
      }
    }
  }
}'
curl -XPUT localhost:9200/test/test/1 -d '{
  "body" : "anime", "entities" : { "hashtags" : "gf_anime" }
}'
curl -XPUT localhost:9200/test/test/2 -d '{
  "body" : "anime", "entities" : { "hashtags" : ["GF", "gf_anime"] }
}'
curl -XPUT localhost:9200/test/test/3 -d '{
  "body" : "anime", "entities" : { "hashtags" : ["gf_whatever", "gf_anime"] }
}'
With the above data indexed, your query only returns document 2. (Note: this is a simplified version of your query without the unnecessary and filter; at least for the time being, you should always use the bool filter rather than and/or, as it understands how to use the filter caches.)
curl -XGET localhost:9200/test/_search
{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "entities.hashtags": "gf"
        }
      },
      "query": {
        "match_phrase": {
          "body": "anime"
        }
      }
    }
  }
}
Where this breaks down is when you start putting in hashtag values that get split into multiple tokens, thereby triggering false hits with the term filter. You can determine how the field's analyzer will treat any value by passing it to the _analyze endpoint and telling it which field to take the analyzer from:
curl -XGET localhost:9200/test/_analyze?field=entities.hashtags\&pretty -d 'gf_anime'
{
  "tokens" : [ {
    "token" : "gf_anime",
    "start_offset" : 0,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}
# Note the space instead of the underscore:
curl -XGET localhost:9200/test/_analyze?field=entities.hashtags\&pretty -d 'gf anime'
{
  "tokens" : [ {
    "token" : "gf",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "anime",
    "start_offset" : 3,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}
If you were to add a fourth document with the "gf anime" variant, then you will get a false hit.
curl -XPUT localhost:9200/test/test/4 -d '{
"body" : "anime", "entities" : { "hashtags" : ["gf whatever", "gf anime"] }
}'
This is really not an indexing problem, but a bad data problem.
With all of the explanation out of the way, you can inefficiently solve this by using a script that always follows the term filter (to efficiently rule out the more common cases that don't hit it):
curl -XGET localhost:9200/test/_search
{
  "query": {
    "filtered": {
      "filter": {
        "bool" : {
          "must" : [{
            "term" : {
              "entities.hashtags" : "gf"
            }
          },
          {
            "script" : {
              "script" : "_source.entities.hashtags == tag || _source.entities.hashtags.find { it == tag } != null",
              "params" : {
                "tag" : "gf"
              }
            }
          }]
        }
      },
      "query": {
        "match_phrase": {
          "body": "anime"
        }
      }
    }
  }
}
This works by parsing the original _source (and not using the indexed doc values). That is why it is not going to be very efficient, but it will work until you reindex. The _source.entities.hashtags == tag portion is only necessary if hashtags is not always an array (in my example, document 1 would not be an array). If it is always an array, then you can use _source.entities.hashtags.contains(tag) instead of _source.entities.hashtags == tag || _source.entities.hashtags.find { it == tag } != null.
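Under the always-an-array assumption, the simplified script clause would look like this (a sketch, dropping only the non-array branch from the filter above):

```json
{
  "script" : {
    "script" : "_source.entities.hashtags.contains(tag)",
    "params" : {
      "tag" : "gf"
    }
  }
}
```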
Note: The script language is Groovy, which is the default starting in 1.4.0. It is not the default in earlier versions, and it must be explicitly enabled using script.default_lang : groovy.