How do I query a null date inside an array in elasticsearch? - elasticsearch

In an elasticsearch query I am trying to search Document objects that have an array of approval notifications. The notifications are considered complete when dateCompleted is populated with a date, and considered pending when either dateCompleted doesn't exist or exists with null. If the document does not contain an array of approval notifications then it is out of the scope of the search.
I am aware of putting null_value for field dateCompleted and setting it to some arbitrary old date but that seems hackish to me.
I've tried to use Bool queries with must exist doc.approvalNotifications and must not exist doc.approvalNotifications.dateCompleted but that does not work if a document contains a mix of complete and pending approvalNotifications. e.g. it only returns document with ID 2 below. I am expecting documents with IDs 1 and 2 to be found.
How can I find pending approval notifications using elasticsearch?
PUT my_index/_mapping/Document
"properties" : {
"doc" : {
"properties" : {
"approvalNotifications" : {
"properties" : {
"approvalBatchId" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"approvalTransitionState" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"approvedByUser" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"dateCompleted" : {
"type" : "date"
}
}
}
}
}
}
Documents:
{
"id": 1,
"status": "Pending Notifications",
"approvalNotifications": [
{
"approvalBatchId": "e6c39194-5475-4168-9729-8ddcf46cf9ab",
"dateCompleted": "2018-11-15T16:09:15.346+0000"
},
{
"approvalBatchId": "05eaeb5d-d802-4a28-b699-5e593a59d445",
}
]
}
{
"id": 2,
"status": "Pending Notifications",
"approvalNotifications": [
{
"approvalBatchId": "e6c39194-5475-4168-9729-8ddcf46cf9ab",
}
]
}
{
"id": 3,
"status": "Complete",
"approvalNotifications": [
{
"approvalBatchId": "e6c39194-5475-4168-9729-8ddcf46cf9ab",
"dateCompleted": "2018-11-15T16:09:15.346+0000"
},
{
"approvalBatchId": "05eaeb5d-d802-4a28-b699-5e593a59d445",
"dateCompleted": "2018-11-16T16:09:15.346+0000"
}
]
}
{
"id": 4
"status": "No Notifications"
}

You are almost there, you can achieve the desired behavior by using nested datatype for the "approvalNotifications" field.
What happens is that Elasticsearch flattens your approvalNotifications objects, treating their subfields as subfields of the original document. The nested field instead will tell ES to index each inner object as an implicit separate object, though related to the original one.
To query nested objects one should use nested query.
Hope that helps!

Related

ElasticSearch Accessing Nested Documents in Script - Null Pointer Exception

Gist: Trying to write a custom filter on nested documents using painless. Want to write error checks when there are no nested documents to surpass null_pointer_exception
I have a mapping as such (simplified and obfuscated)
{
"video_entry" : {
"aliases" : { },
"mappings" : {
"properties" : {
"captions_added" : {
"type" : "boolean"
},
"category" : {
"type" : "keyword"
},
"is_votable" : {
"type" : "boolean"
},
"members" : {
"type" : "nested",
"properties" : {
"country" : {
"type" : "keyword",
},
"date_of_birth" : {
"type" : "date",
}
}
}
}
Each video_entry document can have 0 or more members nested documents.
Sample Document
{
"captions_added": true,
"category" : "Mental Health",
"is_votable: : true,
"members": [
{"country": "Denmark", "date_of_birth": "1998-04-04T00:00:00"},
{"country": "Denmark", "date_of_birth": "1999-05-05T00:00:00"}
]
}
If one or more nested document exist, we want to write some painless scripts that'd check certain fields across all the nested documents. My script works on mappings with a few documents but when I try it on larger set of documents I get null pointer exceptions despite having every null check possible. I've tried various access patterns, error checking mechanisms but I get exceptions.
POST /video_entry/_search
{
"query": {
"script": {
"script": {
"source": """
// various NULL checks that I already tried
// also tried short circuiting on finding null values
if (!params['_source'].empty && params['_source'].containsKey('members')) {
def total = 0;
for (item in params._source.members) {
// custom logic here
// if above logic holds true
// total += 1;
}
return total > 3;
}
return true;
""",
"lang": "painless"
}
}
}
}
Other Statements That I've Tried
if (params._source == null) {
return true;
}
if (params._source.members == null) {
return true;
}
if (!ctx._source.contains('members')) {
return true;
}
if (!params['_source'].empty && params['_source'].containsKey('members') &&
params['_source'].members.value != null) {
// logic here
}
if (doc.containsKey('members')) {
for (mem in params._source.members) {
}
}
Error Message
&& params._source.members",
^---- HERE"
"caused_by" : {
"type" : "null_pointer_exception",
"reason" : null
}
I've looked into changing the structure (flattening the document) and the usage of must_not as indicated in this answer. They don't suit our use case as we need to incorporate some more custom logic.
Different tutorials use ctx, doc and some use params. To add to the confusion Debug.explain(doc.members), Debug.explain(params._source.members) return empty responses and I'm having a hard time figuring out the types.
Gist: Trying to write a custom filter on nested documents using painless. Want to write error checks when there are no nested documents to surpass null_pointer_exception
Any help is appreciated.
TLDr;
Elastic flatten objects. Such that
{
"group" : "fans",
"user" : [
{
"first" : "John",
"last" : "Smith"
},
{
"first" : "Alice",
"last" : "White"
}
]
}
Turn into:
{
"group" : "fans",
"user.first" : [ "alice", "john" ],
"user.last" : [ "smith", "white" ]
}
To access members inner value you need to reference it using doc['members.<field>'] as members will not exist on its own.
Details
As you may know, Elastic handles inner documents in its own way. [doc]
So you will need to reference them accordingly.
Here is what I did to make it work.
Btw, I have been using the Dev tools of kibana
PUT /so_test/
PUT /so_test/_mapping
{
"properties" : {
"captions_added" : {
"type" : "boolean"
},
"category" : {
"type" : "keyword"
},
"is_votable" : {
"type" : "boolean"
},
"members" : {
"properties" : {
"country" : {
"type" : "keyword"
},
"date_of_birth" : {
"type" : "date"
}
}
}
}
}
POST /so_test/_doc/
{
"captions_added": true,
"category" : "Mental Health",
"is_votable" : true,
"members": [
{"country": "Denmark", "date_of_birth": "1998-04-04T00:00:00"},
{"country": "Denmark", "date_of_birth": "1999-05-05T00:00:00"}
]
}
PUT /so_test/_doc/
{
"captions_added": true,
"category" : "Mental breakdown",
"is_votable" : true,
"members": []
}
POST /so_test/_doc/
{
"captions_added": true,
"category" : "Mental success",
"is_votable" : true,
"members": [
{"country": "France", "date_of_birth": "1998-04-04T00:00:00"},
{"country": "Japan", "date_of_birth": "1999-05-05T00:00:00"}
]
}
And then I did this query (it is only a bool filter, but I guess making it work for your own use case should not prove too difficult)
GET /so_test/_search
{
"query":{
"bool": {
"filter": {
"script": {
"script": {
"lang": "painless",
"source": """
def flag = false;
// /!\ notice how the field is referenced /!\
if(doc['members.country'].size() != 0)
{
for (item in doc['members.country']) {
if (item == params.country){
flag = true
}
}
}
return flag;
""",
"params": {
"country": "Japan"
}
}
}
}
}
}
}
BTW you were saying you were a bit confused about the context for painless. you can find in the documentation so details about it.
[doc]
In this case the filter context is the one we want to look at.

Creating a new field into existing index - ElasticSearch

I am wanting to create a new field and add it to an existing index so that way I can send a unique value to that new field. I was hoping there was an API to do this without having to do it in the CLI of Kibana. But I ran into this article that tells you how to add new fields to an existing index.
I tried to add it under _source field but it did not allow me.
PUT customer-simulation-es-app-logs-development-2021-07/_mapping
{
"_source":{
"TransactionKey":{
"type": "keyword"
}
}
}
So I then added it to properties which allowed me:
PUT customer-simulation-es-app-logs-development-2021-07/_mapping
{
"properties":{
"TransactionKey":{
"type": "keyword"
}
}
}
To make sure it was updated I ran the cmd GET customer-simulation-es-app-logs-development-2021-07/_mapping which did return it.
{
"customer-simulation-es-app-logs-development-2021-07" : {
"mappings" : {
"properties" : {
"#timestamp" : {
"type" : "date"
},
"TransactionKey" : {
"type" : "keyword"
},
"exceptions" : {
"properties" : {
"ClassName" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
.....
But when I go to Discover and type in TransactionKey for the fields nothing pops up. Did I not add the new field correctly to the existing index?
If you're running a version prior to 7.11, then you need to go to Stack Management > Index pattern and refresh your index pattern before seeing your new field in the Discover view. You need to do this every time your index mapping changes.
Since 7.11, the index pattern are being refreshed automatically whenever needed.

Wildcard-querying an array with forward slash in the query

In my documents indexed by elasticsearch, I have a field called IPC8s.IPC8 which is an array of strings, which can look like these :
["B63H011/00"]
["B60F3", "B60K1", "B60K17", "B60K17/23", "B60K6", "B60K6"]
["G06F017/00"]
etc...
(for anyone curious, these are CPC patent classification numbers)
I need to query this field with trailing wildcards. In other words, if I put in "B63H", the document containing "B63H011/00" should match. Same if I put in "B63H011/" or "B63H011/0".
I tried multiple queries, none of which worked :
{
query_string: {
default_field: "IPC8s.IPC8",
query: "(B63H*) OR (B63H011/*)",
analyze_wildcard: true
}
}
I tried this one also with \"B63H*\" OR \"B63H011/*\", doesn't work.
Then I tried :
[{
wildcard: {
"IPC8s.IPC8": { value: "B63H*" }
}
},
{
wildcard: {
"IPC8s.IPC8": { value: "B63H011/*" }
}
}]
This doesn't work either. I then tried escaping the "/" because it has to be taken literally. Didn't work.
What am I doing wrong ? Thanks.
Edit : Here is the mapping for that specific field :
"IPC8s": {
"properties": {
"IPC8": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
Here is my latest try that still didn't work (if I don't escape the forward slash, elasticsearch returns an error) :
{
query_string: {
default_field: "IPC8s.IPC8",
query: "(B63H*) OR (B63H011\\/*)",
analyze_wildcard: true,
analyzer: "keyword"
}
}
Edit 2 : This seems to do the trick :
{
query_string: {
default_field: "IPC8s.IPC8.keyword",
query: "(B63H*) OR (B63H011\\/*)",
analyze_wildcard: true,
analyzer: "keyword"
}
}
Text type with standard analyzer will create following token, hence you are not able to search on /
{
"tokens" : [
{
"token" : "b63h011",
"start_offset" : 0,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "00",
"start_offset" : 8,
"end_offset" : 10,
"type" : "<NUM>",
"position" : 1
}
]
}
Create a subfield for IPC8 with type keyword, which will store text as it is
GET index21/_search
{
"query": {
"wildcard": {
"IPC8s.IPC8.keyword": {
"value": "B63H011/*"
}
}
}
}`

How do I get the occurences and doc_count of every term of an ES Index?

I have an ES Index of the form
{
"adminfile" : {
"mappings" : {
"properties" : {
"text" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"title" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
With the field 'title' being the title of the string found in the field 'text'. The titles do not contain any spaces, while the texts are normal texts (sentences with spaces and dots etc).
I want to get all the terms in the index and their doc_count and/or frequency. I found this query in the ES doc: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
GET /adminfile/_search
{
"size": 10,
"aggs" : {
"text" : {
"terms" : {
"field" : "text.keyword",
"order" : { "_count" : "asc" },
"size": 10
}
}
}
}
This returns all the sources but the aggregation buckets are empty. If I change "text.keyword" to "title.keyword" in that command, it does work and return all the titles as keys.
Why does it not work on the text fields?
Is there a better command to use? I know that this:
GET /adminfile/_search
{
"query" : {
"match" : {"text" : "WordToSearch"}
},
"_source":false,
"aggregations": {
"keywords" : {
"significant_text" : {
"field" : "text",
"filter_duplicate_text": true,
"size": 100
}
}
},
"highlight": {
"fields": {
"text": {}
}
}
}
works to get all occurences of wordToSearch in every document of the index, with the counts and frequency. Is there a way to ask this command to match every word of every doc?
EDIT: I have also tried changing the name of the text field to "contenu" in case ES didn't like have a field of name 'text' and of type 'text'. No effect.
Another option could be using https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html but it _termvectors only works for one specific ID (or _mtermvectors for mutliple specific ID, not all the documents in any case)
EDIT2: I realised that the ignore_above could be a problem. I tried cutting all my texts to 200 chars as a test. The query now runs, except that it returns the entire text as a key instead of cutting it into words.
When you use the keyword version of the field, the content is kept as a single, large token. You're right in assuming that ignore_above is the source of your problem, since these tokens apparently will be longer than 256 characters in your data set.
If you instead aggregate across the tokenized field (the normal text field), instead of the keyword version, you'll get counts for each word (i.e. each token) as processed by the field.

How to add default values while adding a new field in existing mapping in elasticsearch

This is my existing mapping in elastic search for one of the child document
sessions" : {
"_routing" : {
"required" : true
},
"properties" : {
"operatingSystem" : {
"index" : "not_analyzed",
"type" : "string"
},
"eventDate" : {
"format" : "dateOptionalTime",
"type" : "date"
},
"durations" : {
"type" : "integer"
},
"manufacturer" : {
"index" : "not_analyzed",
"type" : "string"
},
"deviceModel" : {
"index" : "not_analyzed",
"type" : "string"
},
"applicationId" : {
"type" : "integer"
},
"deviceId" : {
"type" : "string"
}
},
"_parent" : {
"type" : "userinfo"
}
}
in above mapping "durations" field is an integer array. I need to update the existing mapping by adding a new field called "durationCount" whose default value should be the size of durations array.
PUT sessions/_mapping
{
"properties" : {
"sessionCount" : {
"type" : "integer"
}
}
}
using above mapping I am able to update the existing mapping but I am not able to figure out how to assign a value ( which would vary for each session document like it should be durations array size ) while updating the mapping. any ideas ?
Well 2 recommendations here -
Instead of adding default value , you can adjust it in the query using missing filter. Lets say , you want to search based on a match query - Instead of just match query , use a bool query with should clause having the match and missing filter. inside filtered query. This way , those documents which did not have the field is also accounted.
If you absolutely need the value in that field for existing documents , you need to reindex the whole set of documents. Or , use the out of box plugin , update by query -

Resources