Elasticsearch: remove/update field inside nested object - elasticsearch

{
"_index" : "test",
"_type" : "test",
"_id" : "1212",
"_version" : 5,
"found" : true,
"_source" : {
"count" : 42,
"list_data" : [ {
"list_id" : 11,
"timestamp" : 1469125397
}, {
"list_id" : 122,
"timestamp" : 1469050965
} ]
}
}
This is my document schema.list_data is nested object. I have requirement to update/delete particular filed inside list_data. I am able to update count field using groovy script.
$ curl -XPOST 'localhost:9200/index/type/1212/_update?pretty' -d '
{
"script" : "ctx._source.count = 41"
}'
But don't know how to update nested object.
For example i want to add this into list_data.
{
"list_id" : 121,
"timestamp" : 1469050965
}
and my document should change to:
{
"_index" : "test",
"_type" : "test",
"_id" : "1212",
"_version" : 6,
"found" : true,
"_source" : {
"count" : 41,
"list_data" : [ {
"list_id" : 11,
"timestamp" : 1469125397
}, {
"list_id" : 121,
"timestamp" : 1469050965
}, {
"list_id" : 122,
"timestamp" : 1469050965
} ]
}
}
and if i perform delete based on list_id = 122 my record should look like
{
"_index" : "test",
"_type" : "test",
"_id" : "1212",
"_version" : 7,
"found" : true,
"_source" : {
"count" : 41,
"list_data" : [ {
"list_id" : 11,
"timestamp" : 1469125397
}, {
"list_id" : 121,
"timestamp" : 1469050965
}]
}
}

To add a new element to your nested field you can proceed like this:
$ curl -XPOST 'localhost:9200/index/type/1212/_update?pretty' -d '
{
"script" : "ctx._source.list_data += newElement",
"params": {
"newElement": {
"list_id" : 121,
"timestamp" : 1469050965
}
}
}'
To remove an existing element from your nested field list, you can proceed like this:
$ curl -XPOST 'localhost:9200/index/type/1212/_update?pretty' -d '
{
"script" : "ctx._source.list_data.removeAll{it.list_id == remove_id}",
"params": {
"remove_id" : 122
}
}'

I was getting the error [UpdateRequest] unknown field [params] as I was using the latest version of ElasticSearch 7.9.0 (When wrote this answer 7.9.0 was the latest), seems like the syntax is changed a bit.
Following should work for newer versions of ElasticSearch:
$ curl -XPOST 'localhost:9200/<index-name>/_update/1212'
{
"script": {
"source": "ctx._source.list_data.removeIf(list_item -> list_item.list_id == params.remove_id);",
"params": {
"remove_id": 122
}
}
}

I don't know why, but I find that
ctx._source.list_data.removeAll{it.list_id == remove_id}
can't work. Instead I use removeIf like this:
ctx._source.list_data.removeIf{list_item -> list_item.list_id == remove_id}
where list_item could be arbitrary string.

What worked for me was the instructions in the following link. Perhaps it is the version of ES.

Related

Enriching documents in ElasticSearch with only matching nested elements by ID

We're creating some packages, but that process is currently rather slow, because of the sheer amount of data being sent between microservices. Therefore, I have pruned the information being sent between those microservices and instead want to enrich the documents with the necessary information directly from within ElasticSearch. This gives documents of the following shape:
{
"_index" : "packages-2022.02.28",
"_type" : "_doc",
"_id" : "SG_DH-8019-ao-74783-20220315-12",
"_score" : 1.0,
"_source" : {
"id" : "SG_DH-8019-ao-74783-20220315-12",
"updatedOn" : "2022-02-28T14:45:57.7511562+01:00",
"code" : "SG",
"createdDate" : "2022-02-28T15:17:48.2571391+01:00",
"content" : {
"contentId" : "74783",
"units" : [
{
"id" : "HB_DBL.ST_RO_NFP",
"globalId" : "74783_HB_DBL.ST_RO_NFP",
"globalIntId" : -592692223,
"forPackaging" : false
},
{
"id" : "HB_DBL.ST_BB_NFP",
"globalId" : "74783_HB_DBL.ST_BB_NFP",
"globalIntId" : 446952442,
"forPackaging" : false
},
{
"id" : "HB_DBL.ST_AI_NFP",
"globalId" : "74783_HB_DBL.ST_AI_NFP",
"globalIntId" : -1174348304,
"forPackaging" : false
},
{
"id" : "HB_DBL.SU_RO_NFP",
"globalId" : "74783_HB_DBL.SU_RO_NFP",
"globalIntId" : -2111509049,
"forPackaging" : false
},
{
"id" : "HB_DBL.SU_BB_NFP",
"globalId" : "74783_HB_DBL.SU_BB_NFP",
"globalIntId" : 307969427,
"forPackaging" : false
},
{
"id" : "HB_DBL.SU_AI_NFP",
"globalId" : "74783_HB_DBL.SU_AI_NFP",
"globalIntId" : 1418623211,
"forPackaging" : false
},
{
"id" : "HB_DBL.PO-1_RO_NFP",
"globalId" : "74783_HB_DBL.PO-1_RO_NFP",
"globalIntId" : 1328251159,
"forPackaging" : false
},
{
"id" : "HB_DBL.PO-1_BB_NFP",
"globalId" : "74783_HB_DBL.PO-1_BB_NFP",
"globalIntId" : -1228155826,
"forPackaging" : false
},
{
"id" : "HB_DBL.PO-1_AI_NFP",
"globalId" : "74783_HB_DBL.PO-1_AI_NFP",
"globalIntId" : 749215308,
"forPackaging" : false
},
{
"id" : "HB_DBL.OF_RO_NFP",
"globalId" : "74783_HB_DBL.OF_RO_NFP",
"globalIntId" : 1981865239,
"forPackaging" : false
},
{
"id" : "HB_DBL.OF_BB_NFP",
"globalId" : "74783_HB_DBL.OF_BB_NFP",
"globalIntId" : 545563435,
"forPackaging" : false
},
{
"id" : "HB_DBL.OF_AI_NFP",
"globalId" : "74783_HB_DBL.OF_AI_NFP",
"globalIntId" : -481310774,
"forPackaging" : false
}
]
"duration" : {
"value" : 12,
"durationType" : "Day"
}
},
"generatedInfo" : {
"productGroupName" : null,
"subProductGroupName" : "Foo",
"version" : 0
}
}
}
]
with information from an enrich policy's index of the shape (when queried):
{
"_index" : ".enrich-package-enrich-1646044129711",
"_type" : "_doc",
"_id" : "zt_gP38BZeMUiw0-LxLa",
"_score" : 1.0,
"_source" : {
"contentId" : "365114",
"name" : "PackageName",
"board" : [
"B1",
"B2"
],
"units" : [
{
"price" : [
{
"margin" : 0,
"combination" : 10000,
"value" : 189030,
"currency" : "EUR"
}
],
"id" : "W2M_AX2_SC_NFP",
"globalId" : "365114_W2M_AX2_SC_NFP",
"globalIntId" : -988330164,
"name" : "UnitName",
"prop1": "Foo",
"prop2": "Bar"
}
]
}
}
]
I originally could get this working. However, when enriching, I only want to keep the units with the same global ID as those in the document to save. To this end, I have tried also enriching each unit with a simple Enrich processor and a ForEach processor referencing the enrich policy, matching on globalId and have even attempted matching on its hash code globalIntId (although in even in the latter case I would often get the error that it 'is not an integer', even though it clearly is one). This separate enrich-policy index has a shape similar to the following:
{
"_index" : ".enrich-package-unit-enrich-1646044158417",
"_type" : "_doc",
"_id" : "dN_gP38BZeMUiw0-t2Io",
"_score" : 1.0,
"_source" : {
"units" : [
{
"price" : [
{
"margin" : 0,
"combination" : 10000,
"value" : 189030,
"currency" : "EUR"
}
],
"globalId" : "365114_W2M_AX2_SC_NFP",
"globalIntId" : -988330164,
"name" : "UnitName",
"prop1": "Foo",
"prop2": "Bar",
"id" : "W2M_AX2_SC_NFP"
}
]
}
}
]
I have also tried to use Painless script, but so far my experience hasn't been exactly painless (pun intended). Every time I would try to access any data (I've tried various ways I encountered), I would get nothing but compilation errors. Also, given that I'm working on making this process faster, I'm a bit worried about performance here if I were to get it to work. I've read that Painless is fast, yet I've also heard it's actually fairly slow (I think compared to using processors, not necessarily other scripts).
Now, I'm at a loss about how to get this to work. I would prefer to do this without scripting if possible. However, if it is only possible using scripting, that's okay as long as the performance is acceptable. I'm using Elastic 7.12.
Update 1:
I'm creating the enrich policy from C# using Nest like so:
var enrichPolicyRequest = new PutEnrichPolicyRequest(enrichPolicyName)
{
Match = new MyPackageBedEnrichPolicy(index)
};
var putEnrichPolicyResponse = await elasticClient.Enrich.PutPolicyAsync(enrichPolicyRequest);
var executeEnrichPolicyResponse = await elasticClient.Enrich.ExecutePolicyAsync(enrichPolicyName);
...
public class MyPackageBedEnrichPolicy : IEnrichPolicy
{
public MyPackageBedEnrichPolicy(string index)
{
Indices = index;
MatchField = "contentId";
EnrichFields = new[] { "name", "board", "units" };
}
public Indices Indices { get; set; }
public Field MatchField { get; set; }
public Fields EnrichFields { get; set; }
public string Query { get; set; }
}
and the index for the units very similarly, but with
public class MyPackageUnitEnrichPolicy : IEnrichPolicy
{
public MyPackageUnitEnrichPolicy(string index)
{
Indices = index;
MatchField = "units.globalId";
EnrichFields = new[] { "units" };
}
...
For now, I have created the ingest processors in Kibana for easier prototyping, though I will have take care of that using Nest later as well. I have defined them basically as follows:
This is the definition of the ingest pipeline in JSON:
[
{
"enrich": {
"field": "content.contentId",
"policy_name": "enrichPolicyName",
"target_field": "enrichTest"
}
},
{
"foreach": {
"field": "content.units.globalId",
"processor": {
"enrich": {
"field": "content.units.globalId",
"policy_name": "unitEnrichPolicyName",
"target_field": "enrichTest.units",
"tag": "enrich-units-on-globalId-processor"
}
}
}
}
]

Elasticsearch - Count of matches per document

I'm using this query to search a field for occurrences of phrases.
"query": {
"match_phrase": {
"content": "my test phrase"
}
}
I need to calculate how many matches occurred for each phrase per document (if this is even possible?)
I've considered aggregators but think these don't meet the requirements as these will give me the number of matches over the whole index not per document.
Thanks.
This can be achieved by using Script Fields /painless script.
You can count the number of occurrences per field and add it up for the document.
Example:
## Here's my test index with some sample values
POST t1/doc/1 <-- this has one occurence
{
"content" : "my test phrase"
}
POST t1/doc/2 <-- this document has 5 occurences
{
"content": "my test phrase ",
"content1" : "this is my test phrase 1",
"content2" : "this is my test phrase 2",
"content3" : "this is my test phrase 3",
"content4" : "this is my test phrase 4"
}
POST t1/doc/3
{
"content" : "my test new phrase"
}
Now using the script I can count the phrase match for each field. I'm counting it once per field, but you can modify script to multi match per field.
Obviously, the Drawback here is that you need to mention each and every field from the document in the script, unless there's a way to loop through doc field that i am not aware of.
POST t1/_search
{
"script_fields": {
"phrase_Count": {
"script": {
"lang": "painless",
"source": """
int count = 0;
if(doc['content.keyword'].size() > 0 && doc['content.keyword'].value.indexOf(params.phrase)!=-1) count++;
if(doc['content1.keyword'].size() > 0 && doc['content1.keyword'].value.indexOf(params.phrase)!=-1) count++;
if(doc['content2.keyword'].size() > 0 && doc['content2.keyword'].value.indexOf(params.phrase)!=-1) count++;
if(doc['content3.keyword'].size() > 0 && doc['content3.keyword'].value.indexOf(params.phrase)!=-1) count++;
if(doc['content4.keyword'].size() > 0 && doc['content4.keyword'].value.indexOf(params.phrase)!=-1) count++;
return count;
""",
"params": {
"phrase": "my test phrase"
}
}
}
}
}
This will give me the phrase count per document as a scripted field
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 1.0,
"hits" : [
{
"_index" : "t1",
"_type" : "doc",
"_id" : "2",
"_score" : 1.0,
"fields" : {
"phrase_Count" : [
5 <--- count of occurrences of the phrase in the document
]
}
},
{
"_index" : "t1",
"_type" : "doc",
"_id" : "1",
"_score" : 1.0,
"fields" : {
"phrase_Count" : [
1
]
}
},
{
"_index" : "t1",
"_type" : "doc",
"_id" : "3",
"_score" : 1.0,
"fields" : {
"phrase_Count" : [
0
]
}
}
]
}
}
You can use Term Vectors to achieve this functionality. Please have a look
Term Vectors

Elasticsearch Top 10 Most Frequent Values In Array Across All Records

I have an index "test". Document structure is as shown below. Each document has an array of "tags". I am not able to figure out how to query this index to get top 10 most frequently occurring tags?
Also, what are the best practices one should follow if we have more than 2mil docs in this index?
{
"_index" : "test",
"_type" : "data",
"_id" : "1412879673545024927_1373991666",
"_score" : 1.0,
"_source" : {
"instagramuserid" : "1373991666",
"likes_count" : 163,
"#timestamp" : "2017-06-08T08:52:41.803Z",
"post" : {
"created_time" : "1482648403",
"comments" : {
"count" : 9
},
"user_has_liked" : true,
"link" : "https://www.instagram.com/p/BObjpPMBWWf/",
"caption" : {
"created_time" : "1482648403",
"from" : {
"full_name" : "PARAMSahib ™",
"profile_picture" : "https://scontent.cdninstagram.com/t51.2885-19/s150x150/12750236_1692144537739696_350427084_a.jpg",
"id" : "1373991666",
"username" : "parambanana"
},
"id" : "17845953787172829",
"text" : "This feature talks about how to work pastels .\n\nDull gold pullover + saffron khadi kurta + baby pink pants + Deep purple patka and white sneakers - Perfect colours for a Happy sunday christmas morning . \n#paramsahib #men #menswear #mensfashion #mensfashionblog #mensfashionblogger #menswearofficial #menstyle #fashion #fashionfashion #fashionblog #blog #blogger #designer #fashiondesigner #streetstyle #streetfashion #sikh #sikhfashion #singhstreetstyle #sikhdesigner #bearded #indian #indianfashionblog #indiandesigner #international #ootd #lookbook #delhistyleblog #delhifashionblog"
},
"type" : "image",
"tags" : [
"men",
"delhifashionblog",
"menswearofficial",
"fashiondesigner",
"singhstreetstyle",
"fashionblog",
"mensfashion",
"fashion",
"sikhfashion",
"delhistyleblog",
"sikhdesigner",
"indianfashionblog",
"lookbook",
"fashionfashion",
"designer",
"streetfashion",
"international",
"paramsahib",
"mensfashionblogger",
"indian",
"blog",
"mensfashionblog",
"menstyle",
"ootd",
"indiandesigner",
"menswear",
"blogger",
"sikh",
"streetstyle",
"bearded"
],
"filter" : "Normal",
"attribution" : null,
"location" : null,
"id" : "1412879673545024927_1373991666",
"likes" : {
"count" : 163
}
}
}
},
If your tags type in mapping is object (which is by default) you can use an aggregation query like this:
{
"size": 0,
"aggs": {
"frequent_tags": {
"terms": {"field": "post.tags"}
}
}
}

Elasticsearch river - no _meta document found after 5 attempts

I am using elasticsearch version 1.3.0. when I create a river using wikipedia plugin version 2.3.0 as thus
PUT _river/my_river/_meta -d
{
"type" : "wikipedia",
"wikipedia" : {
"url" : "http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
},
"index" : {
"index" : "wikipedia",
"type" : "wiki",
"bulk_size" : 1000,
"max_concurrent_bulk" : 3
}
}
the server responds with this message
{
"_index": "_river",
"_type": "my_river",
"_id": "_meta -d",
"_version": 1,
"created": true
}
however, I don't see the wikipedia documents when I run a search. also, when I restart my server I get river-routing no _meta document found after 5 attempts
Remove the -d at the end as it creates a document named _meta -d and not _meta.
PUT _river/my_river/_meta
{
"type" : "wikipedia",
"wikipedia" : {
"url" : "http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
},
"index" : {
"index" : "wikipedia",
"type" : "wiki",
"bulk_size" : 1000,
"max_concurrent_bulk" : 3
}
}

ElasticSearch doesn't seem to support array lookups

I currently have a fairly simple document stored in ElasticSearch that I generated with an integration test:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "unit-test_project600",
"_type" : "recordDefinition505",
"_id" : "400",
"_score" : 1.0, "_source" : {
"field900": "test string",
"field901": "500",
"field902": "2050-01-01T00:00:00",
"field903": [
"Open"
]
}
} ]
}
}
I would like to filter for specifically field903 and a value of "Open", so I perform the following query:
{
query: {
filtered: {
filter: {
term: {
field903: "Open",
}
}
}
}
}
This returns no results. However, I can use this with other fields and it will return the record:
{
query: {
filtered: {
filter: {
term: {
field901: "500",
}
}
}
}
}
It would appear that I'm unable to search in arrays with ElasticSearch. I have read a few instances of people with a similar problem, but none of them appear to have solved it. Surely this isn't a limitation of ElasticSearch?
I thought that it might be a mapping problem. Here's my mapping:
{
"unit-test_project600" : {
"recordDefinition505" : {
"properties" : {
"field900" : {
"type" : "string"
},
"field901" : {
"type" : "string"
},
"field902" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"field903" : {
"type" : "string"
}
}
}
}
}
However, the ElasticSearch docs indicate that there is no difference between a string or an array mapping, so I don't think I need to make any changes here.
Try searching for "open" rather than "Open." By default, Elasticsearch uses a standard analyzer when indexing fields. The standard analyzer uses a lowercase filter, as described in the example here. From my experience, Elasticsearch does search arrays.

Resources