Enriching documents in ElasticSearch with only matching nested elements by ID - elasticsearch

We're creating some packages, but that process is currently rather slow, because of the sheer amount of data being sent between microservices. Therefore, I have pruned the information being sent between those microservices and instead want to enrich the documents with the necessary information directly from within ElasticSearch. This gives documents of the following shape:
{
"_index" : "packages-2022.02.28",
"_type" : "_doc",
"_id" : "SG_DH-8019-ao-74783-20220315-12",
"_score" : 1.0,
"_source" : {
"id" : "SG_DH-8019-ao-74783-20220315-12",
"updatedOn" : "2022-02-28T14:45:57.7511562+01:00",
"code" : "SG",
"createdDate" : "2022-02-28T15:17:48.2571391+01:00",
"content" : {
"contentId" : "74783",
"units" : [
{
"id" : "HB_DBL.ST_RO_NFP",
"globalId" : "74783_HB_DBL.ST_RO_NFP",
"globalIntId" : -592692223,
"forPackaging" : false
},
{
"id" : "HB_DBL.ST_BB_NFP",
"globalId" : "74783_HB_DBL.ST_BB_NFP",
"globalIntId" : 446952442,
"forPackaging" : false
},
{
"id" : "HB_DBL.ST_AI_NFP",
"globalId" : "74783_HB_DBL.ST_AI_NFP",
"globalIntId" : -1174348304,
"forPackaging" : false
},
{
"id" : "HB_DBL.SU_RO_NFP",
"globalId" : "74783_HB_DBL.SU_RO_NFP",
"globalIntId" : -2111509049,
"forPackaging" : false
},
{
"id" : "HB_DBL.SU_BB_NFP",
"globalId" : "74783_HB_DBL.SU_BB_NFP",
"globalIntId" : 307969427,
"forPackaging" : false
},
{
"id" : "HB_DBL.SU_AI_NFP",
"globalId" : "74783_HB_DBL.SU_AI_NFP",
"globalIntId" : 1418623211,
"forPackaging" : false
},
{
"id" : "HB_DBL.PO-1_RO_NFP",
"globalId" : "74783_HB_DBL.PO-1_RO_NFP",
"globalIntId" : 1328251159,
"forPackaging" : false
},
{
"id" : "HB_DBL.PO-1_BB_NFP",
"globalId" : "74783_HB_DBL.PO-1_BB_NFP",
"globalIntId" : -1228155826,
"forPackaging" : false
},
{
"id" : "HB_DBL.PO-1_AI_NFP",
"globalId" : "74783_HB_DBL.PO-1_AI_NFP",
"globalIntId" : 749215308,
"forPackaging" : false
},
{
"id" : "HB_DBL.OF_RO_NFP",
"globalId" : "74783_HB_DBL.OF_RO_NFP",
"globalIntId" : 1981865239,
"forPackaging" : false
},
{
"id" : "HB_DBL.OF_BB_NFP",
"globalId" : "74783_HB_DBL.OF_BB_NFP",
"globalIntId" : 545563435,
"forPackaging" : false
},
{
"id" : "HB_DBL.OF_AI_NFP",
"globalId" : "74783_HB_DBL.OF_AI_NFP",
"globalIntId" : -481310774,
"forPackaging" : false
}
]
"duration" : {
"value" : 12,
"durationType" : "Day"
}
},
"generatedInfo" : {
"productGroupName" : null,
"subProductGroupName" : "Foo",
"version" : 0
}
}
}
]
with information from an enrich policy's index of the shape (when queried):
{
"_index" : ".enrich-package-enrich-1646044129711",
"_type" : "_doc",
"_id" : "zt_gP38BZeMUiw0-LxLa",
"_score" : 1.0,
"_source" : {
"contentId" : "365114",
"name" : "PackageName",
"board" : [
"B1",
"B2"
],
"units" : [
{
"price" : [
{
"margin" : 0,
"combination" : 10000,
"value" : 189030,
"currency" : "EUR"
}
],
"id" : "W2M_AX2_SC_NFP",
"globalId" : "365114_W2M_AX2_SC_NFP",
"globalIntId" : -988330164,
"name" : "UnitName",
"prop1": "Foo",
"prop2": "Bar"
}
]
}
}
]
I originally could get this working. However, when enriching, I only want to keep the units with the same global ID as those in the document to save. To this end, I have tried also enriching each unit with a simple Enrich processor and a ForEach processor referencing the enrich policy, matching on globalId and have even attempted matching on its hash code globalIntId (although in even in the latter case I would often get the error that it 'is not an integer', even though it clearly is one). This separate enrich-policy index has a shape similar to the following:
{
"_index" : ".enrich-package-unit-enrich-1646044158417",
"_type" : "_doc",
"_id" : "dN_gP38BZeMUiw0-t2Io",
"_score" : 1.0,
"_source" : {
"units" : [
{
"price" : [
{
"margin" : 0,
"combination" : 10000,
"value" : 189030,
"currency" : "EUR"
}
],
"globalId" : "365114_W2M_AX2_SC_NFP",
"globalIntId" : -988330164,
"name" : "UnitName",
"prop1": "Foo",
"prop2": "Bar",
"id" : "W2M_AX2_SC_NFP"
}
]
}
}
]
I have also tried to use Painless script, but so far my experience hasn't been exactly painless (pun intended). Every time I would try to access any data (I've tried various ways I encountered), I would get nothing but compilation errors. Also, given that I'm working on making this process faster, I'm a bit worried about performance here if I were to get it to work. I've read that Painless is fast, yet I've also heard it's actually fairly slow (I think compared to using processors, not necessarily other scripts).
Now, I'm at a loss about how to get this to work. I would prefer to do this without scripting if possible. However, if it is only possible using scripting, that's okay as long as the performance is acceptable. I'm using Elastic 7.12.
Update 1:
I'm creating the enrich policy from C# using Nest like so:
var enrichPolicyRequest = new PutEnrichPolicyRequest(enrichPolicyName)
{
Match = new MyPackageBedEnrichPolicy(index)
};
var putEnrichPolicyResponse = await elasticClient.Enrich.PutPolicyAsync(enrichPolicyRequest);
var executeEnrichPolicyResponse = await elasticClient.Enrich.ExecutePolicyAsync(enrichPolicyName);
...
public class MyPackageBedEnrichPolicy : IEnrichPolicy
{
public MyPackageBedEnrichPolicy(string index)
{
Indices = index;
MatchField = "contentId";
EnrichFields = new[] { "name", "board", "units" };
}
public Indices Indices { get; set; }
public Field MatchField { get; set; }
public Fields EnrichFields { get; set; }
public string Query { get; set; }
}
and the index for the units very similarly, but with
public class MyPackageUnitEnrichPolicy : IEnrichPolicy
{
public MyPackageUnitEnrichPolicy(string index)
{
Indices = index;
MatchField = "units.globalId";
EnrichFields = new[] { "units" };
}
...
For now, I have created the ingest processors in Kibana for easier prototyping, though I will have take care of that using Nest later as well. I have defined them basically as follows:
This is the definition of the ingest pipeline in JSON:
[
{
"enrich": {
"field": "content.contentId",
"policy_name": "enrichPolicyName",
"target_field": "enrichTest"
}
},
{
"foreach": {
"field": "content.units.globalId",
"processor": {
"enrich": {
"field": "content.units.globalId",
"policy_name": "unitEnrichPolicyName",
"target_field": "enrichTest.units",
"tag": "enrich-units-on-globalId-processor"
}
}
}
}
]

Related

Problem re-adding the same fields to start with a lowercase letter when updating

I am encountering a problem while updating. The problem I'm having is, for example, I want to update the Title field, but I see that a new field has been created for the Title field in the document. (A title field that starts with a lowercase letter). I'm doing the update process with NEST, can anyone share an idea with me? Thank you in advance for your help.
The state of the document before the update:
{
"_index" : "my_test_index",
"_type" : "_doc",
"_id" : "uPggFnoBChFNLIc8qdjW",
"_score" : 31.908756,
"_source" : {
"RelatedPassiveCompanyId" : "0d075c1681106286cfe9f31999f8247c",
"CreateTime" : "2021-06-16T21:41:17.2697847+03:00",
"Title" : "FE NEW CENTURY INDUSTRY(SINGAPORE)PTE LTD",
"IsBannedFromOpenCorpCompanies" : false,
"CreatedBy" : 1,
"IsActivated" : false,
"IsCancelled" : false,
"IsMembershipTypeBought" : false
}
}
The state of the document after the update:
{
"_index" : "my_test_index",
"_type" : "_doc",
"_id" : "uPggFnoBChFNLIc8qdjW",
"_score" : 26.380388,
"_source" : {
"RelatedPassiveCompanyId" : "0d075c1681106286cfe9f31999f8247c",
"CreateTime" : "2021-06-16T21:41:17.2697847+03:00",
"Title" : "FE NEW CENTURY INDUSTRY(SINGAPORE)PTE LTD",
"IsBannedFromOpenCorpCompanies" : false,
"CreatedBy" : 1,
"IsActivated" : false,
"IsCancelled" : false,
"IsMembershipTypeBought" : false,
"isBannedFromOpenCorpCompanies" : false,
"contactInformations" : {
"contactPerson" : { },
"phones" : [ ]
},
"isCancelled" : false,
"dnbInformation" : {
"processId" : "fba921ee-493d-4f12-aa0a-0a432b9e8b3a",
"requestLogs" : [
{
"requestTime" : "2021-11-23T10:03:09.8302661+03:00",
"message" : "Company not found on Dnb",
"resultType" : 2
}
]
},
"createTime" : "2021-06-16T21:41:17.2697847+03:00",
"createdBy" : 1,
"isMembershipTypeBought" : false,
"isActivated" : false,
"title" : "FE NEW CENTURY INDUSTRY(SINGAPORE)PTE LTD",
"relatedPassiveCompanyId" : "0d075c1681106286cfe9f31999f8247c"
}
},
my update function:
public bool UpdateDocuments(IHit<MyESModel> documentHitItem)
{
var response = elasticClient.Update<MyESModel, object>(DocumentPath<MyESModel>
.Id(documentHitItem.Id), u => u
.Index("my_test_index")
.Doc(documentHitItem.Source)
.DocAsUpsert(true)
.RetryOnConflict(8)
);
return response.IsValid;
}
I would be very happy if anyone has any idea what the problem could be.

QuickSight or Elasticsearch - Column wise aggregration

Is this possible to do in QuickSight or Elasticsearch? I have tried calculated fields in QuickSight and runtime scripts in Elasticsearch, not sure how to do it? Also, is what I'm not what I'm expecting is even possible in this tool.
Trying out a simple date difference between columns based on their action, here... "Time taken for 'creating a post' after a user registered"
Data Input:
Data output
It is possible using scripted metric aggregation
Data
"hits" : [
{
"_index" : "index121",
"_type" : "_doc",
"_id" : "aqJ3HnoBF6_U07qsNY-s",
"_score" : 1.0,
"_source" : {
"user" : "Jen",
"activity" : "Logged In",
"activity_Time" : "2020-01-08"
}
},
{
"_index" : "index121",
"_type" : "_doc",
"_id" : "a6J3HnoBF6_U07qsXY_8",
"_score" : 1.0,
"_source" : {
"user" : "Jen",
"activity" : "Created a post",
"activity_Time" : "2020-05-08"
}
},
{
"_index" : "index121",
"_type" : "_doc",
"_id" : "bKJ3HnoBF6_U07qsk4-0",
"_score" : 1.0,
"_source" : {
"user" : "Mark",
"activity" : "Logged In",
"activity_Time" : "2020-01-03"
}
},
{
"_index" : "index121",
"_type" : "_doc",
"_id" : "baJ3HnoBF6_U07qsu48g",
"_score" : 1.0,
"_source" : {
"user" : "Mark",
"activity" : "Created a post",
"activity_Time" : "2020-01-08"
}
}
]
Query
{
"size": 0,
"aggs": {
"user": {
"terms": {
"field": "user.keyword",
"size": 10000
},
"aggs": {
"distinct_sum_feedback": {
"scripted_metric": {
"init_script": "state.docs = []",
"map_script": """ Map span = [
'timestamp':doc['activity_Time'],
'activity':doc['activity.keyword'].value
];
state.docs.add(span)
""",
"combine_script": "return state.docs;",
"reduce_script": """
def all_docs = [];
for (s in states)
{
for (span in s) {
all_docs.add(span);
}
}
all_docs.sort((HashMap o1, HashMap o2)->o1['timestamp'].getValue().toInstant().toEpochMilli().compareTo(o2['timestamp'].getValue().toInstant().toEpochMilli()));
Hashtable result= new Hashtable();
boolean found = false;
JodaCompatibleZonedDateTime loggedIn;
for (s in all_docs)
{
if(s.activity =='Logged In')
{
loggedIn=s.timestamp.getValue();
found= true;
}
if(s.activity =='Created a post' && found==true)
{
found=false;
def dt=loggedIn.getYear()+ '-' + loggedIn.getMonth() + '-' + loggedIn.getDayOfMonth();
def diff= s.timestamp.getValue().toInstant().toEpochMilli() - loggedIn.toInstant().toEpochMilli();
if(result.get(dt) == null)
{
result.put(dt, diff / 1000 / 60 / 60 / 24 )
}
}
}
return result;
"""
}
}
}
}
}
}
Result
"user" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Jen",
"doc_count" : 2,
"distinct_sum_feedback" : {
"value" : {
"2020-JANUARY-8" : 121
}
}
},
{
"key" : "Mark",
"doc_count" : 2,
"distinct_sum_feedback" : {
"value" : {
"2020-JANUARY-3" : 5
}
}
}
]
}
Explanation
"init_script":
Executed prior to any collection of documents. Allows the aggregation
to set up any initial state.
Have declared a Map"
"map_script"
Executed once per document collected
Loop through all document and add activity and timestamp to map
combine_script
Executed once on each shard after document collection is complete
Return collection of Map for all shards
reduce_script
Executed once on the coordinating node after all shards have returned their results
Once again go through through all Map and create a single collection and sort on timestamp. Then go through sorted Map and insert logged in and next "created post" time (diff of logged in and post created time)

Elasticsearch Top 10 Most Frequent Values In Array Across All Records

I have an index "test". Document structure is as shown below. Each document has an array of "tags". I am not able to figure out how to query this index to get top 10 most frequently occurring tags?
Also, what are the best practices one should follow if we have more than 2mil docs in this index?
{
"_index" : "test",
"_type" : "data",
"_id" : "1412879673545024927_1373991666",
"_score" : 1.0,
"_source" : {
"instagramuserid" : "1373991666",
"likes_count" : 163,
"#timestamp" : "2017-06-08T08:52:41.803Z",
"post" : {
"created_time" : "1482648403",
"comments" : {
"count" : 9
},
"user_has_liked" : true,
"link" : "https://www.instagram.com/p/BObjpPMBWWf/",
"caption" : {
"created_time" : "1482648403",
"from" : {
"full_name" : "PARAMSahib ™",
"profile_picture" : "https://scontent.cdninstagram.com/t51.2885-19/s150x150/12750236_1692144537739696_350427084_a.jpg",
"id" : "1373991666",
"username" : "parambanana"
},
"id" : "17845953787172829",
"text" : "This feature talks about how to work pastels .\n\nDull gold pullover + saffron khadi kurta + baby pink pants + Deep purple patka and white sneakers - Perfect colours for a Happy sunday christmas morning . \n#paramsahib #men #menswear #mensfashion #mensfashionblog #mensfashionblogger #menswearofficial #menstyle #fashion #fashionfashion #fashionblog #blog #blogger #designer #fashiondesigner #streetstyle #streetfashion #sikh #sikhfashion #singhstreetstyle #sikhdesigner #bearded #indian #indianfashionblog #indiandesigner #international #ootd #lookbook #delhistyleblog #delhifashionblog"
},
"type" : "image",
"tags" : [
"men",
"delhifashionblog",
"menswearofficial",
"fashiondesigner",
"singhstreetstyle",
"fashionblog",
"mensfashion",
"fashion",
"sikhfashion",
"delhistyleblog",
"sikhdesigner",
"indianfashionblog",
"lookbook",
"fashionfashion",
"designer",
"streetfashion",
"international",
"paramsahib",
"mensfashionblogger",
"indian",
"blog",
"mensfashionblog",
"menstyle",
"ootd",
"indiandesigner",
"menswear",
"blogger",
"sikh",
"streetstyle",
"bearded"
],
"filter" : "Normal",
"attribution" : null,
"location" : null,
"id" : "1412879673545024927_1373991666",
"likes" : {
"count" : 163
}
}
}
},
If your tags type in mapping is object (which is by default) you can use an aggregation query like this:
{
"size": 0,
"aggs": {
"frequent_tags": {
"terms": {"field": "post.tags"}
}
}
}

Update an existing collection mongodb

I have a collection
{
"_id" : 100000001,
"horses" : []
"race" : {
"date" : ISODate("2014-06-05T00:00:00.000Z"),
"time" : ISODate("2014-06-05T02:40:00.000Z"),
"type" : "Flat",
"name" : "Hindwoods Maiden Stakes (Div I)",
"run_befor" : 11,
"finish" : null,
"channel" : "ATR",
},
"track" : {
"fences" : 0,
"omitted" : 0,
"hdles" : 0,
"name" : "Lingfield",
"country" : "GB",
"type" : "",
"going" : "Good"
}
}
I'm trying to update it
#result value
{
"race":{
"run_after":"10",
"finish":{
"time":152.34,
"slow":1,
"fast":0,
"gap":5.34
}
},
"track":{
"name":"Lingfield",
"country":"GB",
"type":"",
"going":"Good",
"fences":0,
"omitted":0,
"hdles":0
}
}
Card.where(_id:100000001).update(#result)
When I update the collection of all data is deleted and inserted new
If do set() same
How do to upgrade the existing collection records are updated and not existing added?

ElasticSearch doesn't seem to support array lookups

I currently have a fairly simple document stored in ElasticSearch that I generated with an integration test:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "unit-test_project600",
"_type" : "recordDefinition505",
"_id" : "400",
"_score" : 1.0, "_source" : {
"field900": "test string",
"field901": "500",
"field902": "2050-01-01T00:00:00",
"field903": [
"Open"
]
}
} ]
}
}
I would like to filter for specifically field903 and a value of "Open", so I perform the following query:
{
query: {
filtered: {
filter: {
term: {
field903: "Open",
}
}
}
}
}
This returns no results. However, I can use this with other fields and it will return the record:
{
query: {
filtered: {
filter: {
term: {
field901: "500",
}
}
}
}
}
It would appear that I'm unable to search in arrays with ElasticSearch. I have read a few instances of people with a similar problem, but none of them appear to have solved it. Surely this isn't a limitation of ElasticSearch?
I thought that it might be a mapping problem. Here's my mapping:
{
"unit-test_project600" : {
"recordDefinition505" : {
"properties" : {
"field900" : {
"type" : "string"
},
"field901" : {
"type" : "string"
},
"field902" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"field903" : {
"type" : "string"
}
}
}
}
}
However, the ElasticSearch docs indicate that there is no difference between a string or an array mapping, so I don't think I need to make any changes here.
Try searching for "open" rather than "Open." By default, Elasticsearch uses a standard analyzer when indexing fields. The standard analyzer uses a lowercase filter, as described in the example here. From my experience, Elasticsearch does search arrays.

Resources