Elasticsearch pre-processing to remove null fields as part of ingest

I have a use case where an API I'm calling to retrieve data to put into Elasticsearch is returning nulls.
I need to write an ingest pipeline that uses processors to remove all null fields before the document is written to Elasticsearch. The processors may or may not use Painless scripting.
Here is a sample payload that I currently get from the API:
{
  "master_desc": "TESTING PART",
  "date_added": "2019-10-24T09:30:03",
  "master_no": {
    "master_no": 18460110,
    "barcode": "NLSKYTEST1-1",
    "external_key": null,
    "umid": null
  }
}
The pipeline should ideally insert the document as:
{
  "master_desc": "TESTING PART",
  "date_added": "2019-10-24T09:30:03",
  "master_no": {
    "master_no": 18460110,
    "barcode": "NLSKYTEST1-1"
  }
}
Note: the fields are dynamic, so I can't write a processor that checks for nulls against a defined set of fields.
Thanks!

Null fields are neither indexed nor searchable. I have written the pipeline below to remove such fields; please test it on all of your scenarios before use. Note that after posting documents through this pipeline, you won't be able to find the removed fields with an "exists" query.
Pipeline:
PUT _ingest/pipeline/remove_null_fields
{
  "description": "Remove any null field",
  "processors": [
    {
      "script": {
        "source": """
          // return the list of fields with null values
          def loopAllFields(def x) {
            def ret = [];
            if (x instanceof Map) {
              for (entry in x.entrySet()) {
                // skip internal fields such as _index and _id
                if (entry.getKey().indexOf("_") == 0) {
                  continue;
                }
                def val = entry.getValue();
                if (val instanceof HashMap || val instanceof Map || val instanceof ArrayList) {
                  def list = [];
                  if (val instanceof ArrayList) {
                    def index = 0;
                    // recurse into each object of the array
                    for (v in val) {
                      list = loopAllFields(v);
                      for (item in list) {
                        ret.add(entry.getKey() + "[" + index + "]." + item);
                      }
                      index++;
                    }
                  } else {
                    list = loopAllFields(val);
                  }
                  // if the number of null fields matches the container size, remove the container itself
                  if (list.size() == val.size()) {
                    ret.add(entry.getKey());
                  } else {
                    for (item in list) {
                      ret.add(entry.getKey() + "." + item);
                    }
                  }
                }
                if (val == null) {
                  ret.add(entry.getKey());
                }
              }
            }
            return ret;
          }

          /* remove a field from the source, recursing into fields that are part of other fields */
          def removeField(def ctx, def fieldname) {
            def pos = fieldname.indexOf(".");
            if (pos > 0) {
              def str = fieldname.substring(0, pos);
              if (str.indexOf('[') > 0 && str.indexOf(']') > 0) {
                // the path segment addresses an array element, e.g. items[0]
                def s = str.substring(0, str.indexOf('['));
                def i = str.substring(str.indexOf('[') + 1, str.length() - 1);
                removeField(ctx[s][Integer.parseInt(i)], fieldname.substring(pos + 1, fieldname.length()));
              } else {
                if (ctx[str] instanceof Map) {
                  removeField(ctx[str], fieldname.substring(pos + 1, fieldname.length()));
                }
              }
            } else {
              ctx.remove(fieldname);
            }
            return ctx;
          }

          def list = loopAllFields(ctx);
          for (item in list) {
            removeField(ctx, item);
          }
        """
      }
    }
  ]
}
Post Document:
POST index8/_doc?pipeline=remove_null_fields
{
  "master_desc": "TESTING PART",
  "ddd": null,
  "date_added": "2019-10-24T09:30:03",
  "master_no": {
    "master_no": 18460110,
    "barcode": "NLSKYTEST1-1",
    "external_key": null,
    "umid": null
  }
}
Result:
"hits" : [
{
"_index" : "index8",
"_type" : "_doc",
"_id" : "06XAyXEBAWHHnYGOSa_M",
"_score" : 1.0,
"_source" : {
"date_added" : "2019-10-24T09:30:03",
"master_no" : {
"master_no" : 18460110,
"barcode" : "NLSKYTEST1-1"
},
"master_desc" : "TESTING PART"
}
}
]
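To verify the removal, an "exists" query against one of the stripped paths (a quick sanity check; index8 is just the sample index used above) should now return no hits:
GET index8/_search
{
  "query": {
    "exists": {
      "field": "master_no.external_key"
    }
  }
}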

@Jaspreet, the script almost worked. It didn't, however, eliminate empty objects, empty arrays or empty values. Here is a doc I tried to index:
{
  "master_desc": "TESTING PART",
  "date_added": "2019-10-24T09:30:03",
  "master_no": {
    "master_no": 18460110,
    "barcode": "NLSKYTEST1-1",
    "external_key": null,
    "umid": null
  },
  "remote_sync_state": "",
  "lib_title_footage": [],
  "prj_no": {
    "prj_no": null,
    "prj_desc": null
  }
}
The above returned:
{
  "master_desc": "TESTING PART",
  "date_added": "2019-10-24T09:30:03",
  "master_no": {
    "master_no": 18460110,
    "barcode": "NLSKYTEST1-1"
  },
  "remote_sync_state": "",
  "lib_title_footage": [ ],
  "prj_no": { }
}
I tried updating the script with conditions checking for these patterns, but unfortunately got a compile error.
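For anyone attempting the same extension, one option (an untested sketch; the empty-value checks are my own addition, not part of the original answer) is to broaden the final null test inside loopAllFields so that empty strings, arrays and maps are also flagged for removal:
// inside loopAllFields, replace the `val == null` check with:
if (val == null ||
    (val instanceof String && val.isEmpty()) ||
    (val instanceof ArrayList && val.isEmpty()) ||
    (val instanceof Map && val.isEmpty())) {
  ret.add(entry.getKey());
}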

Related

How to create a ResponseBody without an "entity" tag

I have built a controller which returns a list of objects:
fun getUserCommunicationSettings(
    @PathVariable commId: String,
    @CommunicationTypeConstraint @RequestParam(required = false) commType: CommunicationType?
): ResponseEntity<out UserCommResponse> {
    return communicationSettingsService.getUserCommSettings(commType, commId)
        .mapError { mapUserCommErrorToResponse(it) }
        .map { ResponseEntity.ok(SuccessResponse(it)) }
        .fold<ResponseEntity<SuccessResponse<List<CommunicationSettingsDto>>>, ResponseEntity<ErrorResponse>, ResponseEntity<out UserCommResponse>>(
            { it },
            { it },
        )
}
The problem is that it returns the following JSON with an "entity" tag, which I'd like to get rid of:
{
  "entity": [
    {
      "userId": "1075",
      "userType": "CUSTOMER",
      "communicationId": "972547784682",
      "communicationType": "CALL",
      "messageType": "CALL_COLLECTION"
    }
  ]
}
Any ideas?
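One possibility, assuming SuccessResponse is your own wrapper class and Jackson handles the serialization (a sketch, not a confirmed fix for your setup): annotate the wrapped property with Jackson's @JsonValue so the wrapper serializes as its payload alone, without the enclosing object:
import com.fasterxml.jackson.annotation.JsonValue

// Hypothetical shape of the wrapper: with @JsonValue on the getter,
// Jackson serializes a SuccessResponse instance as just `entity`,
// i.e. the bare list instead of {"entity": [...]}.
data class SuccessResponse<T>(
    @get:JsonValue
    val entity: T
)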

Mongodb aggregation remove null values from object with nested properties

Is there a way to remove literally all null or empty-string values from an object? We have an aggregation which produces an object containing empty fields and empty objects when the underlying values are null.
What we wish to do is remove all null properties and empty objects and recreate the object, in order to keep the data as small as possible.
E.g. in the following object, only 'test' and 'more-nested-data' should be taken into account; the rest can be removed:
{
  "test": "some",
  "test2": {
  },
  "test3": {
    "some-key": {
    },
    "some-other-key": {
      "more-nested-data": true,
      "more-nested-emtpy": null
    }
  }
}
which should become:
{
  "test": "some",
  "test3": {
    "some-other-key": {
      "more-nested-data": true
    }
  }
}
I have tried a lot; I think something could be done using $objectToArray, but I have not found the solution yet. The required aggregation should recursively (or up to a defined number of levels) remove null properties and empty objects.
Use the $function operator, available since MongoDB 4.4, to do this recursively as you note. Given this input, which is a slightly expanded version of that supplied in the question:
var dd = {
  "test": "some",
  "test2": { },
  "test3": {
    "some-key": { },
    "some-other-key": {
      "more-nested-data": true,
      "more-nested-emtpy": null,
      "emptyArr": [],
      "notEmptyArr": [
        "XXX",
        null,
        {"corn": "dog"},
        {"bad": null},
        {"other": {zip: null, empty: [], zap: "notNull"}}
      ]
    }
  }
};
db.foo.insert(dd);
then this pipeline:
db.foo.aggregate([
  {$replaceRoot: {newRoot: {$function: {
    body: function(obj) {
      var process = function(holder, spot, value) {
        var remove_it = false;
        // test arrays FIRST since [] instanceof Object is true!
        if(Array.isArray(value)) {
          // walk BACKWARDS due to the potential splice() later,
          // which will change the length...
          for(var jj = value.length - 1; jj >= 0; jj--) {
            process(value, jj, value[jj]);
          }
          if(0 == value.length) {
            remove_it = true;
          }
        } else if(value instanceof Object) {
          walkObj(value);
          if(0 == Object.keys(value).length) {
            remove_it = true;
          }
        } else {
          if(null == value) {
            remove_it = true;
          }
        }
        if(remove_it) {
          if(Array.isArray(holder)) {
            holder.splice(spot, 1); // snip out the val
          } else if(holder instanceof Object) {
            delete holder[spot];
          }
        }
      };
      var walkObj = function(obj) {
        Object.keys(obj).forEach(function(k) {
          process(obj, k, obj[k]);
        });
      }
      walkObj(obj); // entry point!
      return obj;
    },
    args: [ "$$CURRENT" ],
    lang: "js"
  }}
  }}
]);
produces this result:
{
  "_id" : 0,
  "test" : "some",
  "test3" : {
    "some-other-key" : {
      "more-nested-data" : true,
      "notEmptyArr" : [
        "XXX",
        {
          "corn" : "dog"
        },
        {
          "other" : {
            "zap" : "notNull"
          }
        }
      ]
    }
  }
}
A convenient way to debug such complex functions is to declare them as variables outside of the pipeline and run data through them to simulate the documents (objects) coming out of the database, e.g.:
ff = function(obj) {
  var process = function(holder, spot, value) {
    var remove_it = false;
    // test arrays FIRST since [] instanceof Object is true!
    if(Array.isArray(value)) {
...
printjson(ff(dd)); // use the same doc as above
You can put print and other debugging aids into the code; when you are done, remove them and call the pipeline to process the real data as follows:
db.foo.aggregate([
  {$replaceRoot: {newRoot: {$function: {
    body: ff, // substitute here!
    args: [ "$$CURRENT" ],
    lang: "js"
  }}
  }}
]);
Sounds like the unwind operator would help. Check out the unwind operator at https://docs.mongodb.com/manual/reference/operator/aggregation/unwind/

Elasticsearch merging documents in response

I have data in 3 indices. I want to generate an invoice report using information from the other indices. For example, the following are sample documents from each index:
Users index
{
  "_id": "userId1",
  "name": "John"
}
Invoice index
{
  "_id": "invoiceId1",
  "userId": "userId1",
  "cost": "10000",
  "startdate": "",
  "enddate": ""
}
Orders index
{
  "_id": "orderId1",
  "userId": "userId1",
  "productName": "Mobile"
}
I want to generate an invoice report by combining information from these three indices, as follows:
{
  "_id": "invoiceId1",
  "userName": "John",
  "productName": "Mobile",
  "cost": "10000",
  "startdate": "",
  "enddate": ""
}
How do I write an Elasticsearch query that returns a response combining information from documents in other indices?
You cannot do query-time joins in Elasticsearch and will need to denormalize your data in order to efficiently retrieve and group it.
Having said that, you could:
leverage the multi-target syntax and query multiple indices at once
use an OR query on the id and userId -- since either of those is referenced at least once in any of your docs
and then trivially join your data through a map/reduce tool called scripted metric aggregations
Quick side note: you won't be able to use the _id keyword inside your docs because it's reserved.
Assuming your docs and indices are structured as follows:
POST users_index/_doc
{"id":"userId1","name":"John"}
POST invoices_index/_doc
{"id":"invoiceId1","userId":"userId1","cost":"10000","startdate":"","enddate":""}
POST orders_index/_doc
{"id":"orderId1","userId":"userId1","productName":"Mobile"}
Here's what the scripted metric aggregation could look like:
POST users_index,invoices_index,orders_index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "id.keyword": {
              "value": "userId1"
            }
          }
        },
        {
          "term": {
            "userId.keyword": {
              "value": "userId1"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "group_by_invoiceId": {
      "scripted_metric": {
        "init_script": "state.users = []; state.invoices = []; state.orders = []",
        "map_script": """
          def source = params._source;
          if (source.containsKey("name")) {
            // we're dealing with the users index
            state.users.add(source);
          } else if (source.containsKey("cost")) {
            // we're dealing with the invoices index
            state.invoices.add(source);
          } else if (source.containsKey("productName")) {
            // we're dealing with the orders index
            state.orders.add(source);
          }
        """,
        "combine_script": """
          def non_empty_state = [:];
          for (entry in state.entrySet()) {
            if (entry != null && entry.getValue().length > 0) {
              non_empty_state[entry.getKey()] = entry.getValue();
            }
          }
          return non_empty_state;
        """,
        "reduce_script": """
          def final_invoices = [];
          def all_users = [];
          def all_invoices = [];
          def all_orders = [];
          // flatten all resources
          for (state in states) {
            for (kind_entry in state.entrySet()) {
              def map_kind = kind_entry.getKey();
              if (map_kind == "users") {
                all_users.addAll(kind_entry.getValue());
              } else if (map_kind == "invoices") {
                all_invoices.addAll(kind_entry.getValue());
              } else if (map_kind == "orders") {
                all_orders.addAll(kind_entry.getValue());
              }
            }
          }
          // iterate the invoices and enrich them
          for (invoice_entry in all_invoices) {
            def invoiceId = invoice_entry.id;
            def userId = invoice_entry.userId;
            def userName = all_users.stream().filter(u -> u.id == userId).findFirst().get().name;
            def productName = all_orders.stream().filter(o -> o.userId == userId).findFirst().get().productName;
            def cost = invoice_entry.cost;
            def startdate = invoice_entry.startdate;
            def enddate = invoice_entry.enddate;
            final_invoices.add([
              'id': invoiceId,
              'userName': userName,
              'productName': productName,
              'cost': cost,
              'startdate': startdate,
              'enddate': enddate
            ]);
          }
          return final_invoices;
        """
      }
    }
  }
}
which would return:
{
  ...
  "aggregations" : {
    "group_by_invoiceId" : {
      "value" : [
        {
          "cost" : "10000",
          "enddate" : "",
          "id" : "invoiceId1",
          "userName" : "John",
          "startdate" : "",
          "productName" : "Mobile"
        }
      ]
    }
  }
}
Summing up, there are workarounds to achieve query-time joins. At the same time, scripts like this shouldn't be used in production because they can take forever.
Instead, this aggregation should be emulated outside of Elasticsearch, after the query resolves and returns the index-specific hits.
BTW, I set "size": 0 to return just the aggregation results, so increase this parameter if you want to get some actual hits too.

Elasticsearch ingest pipeline: how to recursively modify values in a HashMap

Using an ingest pipeline, I want to iterate over a HashMap and remove underscores from all string values (where underscores exist), leaving underscores in the keys intact. Some values are arrays that must further be iterated over to do the same modification.
In the pipeline, I use a function to traverse and modify the values of a Collection view of the HashMap.
PUT /_ingest/pipeline/samples
{
  "description": "preprocessing of samples.json",
  "processors": [
    {
      "script": {
        "tag": "remove underscore from sample_tags values",
        "source": """
          void findReplace(Collection collection) {
            collection.forEach(element -> {
              if (element instanceof String) {
                element.replace('_',' ');
              } else {
                findReplace(element);
              }
              return true;
            })
          }
          Collection samples = ctx.samples;
          samples.forEach(sample -> { // sample.sample_tags is a HashMap
            Collection sample_tags = sample.sample_tags.values();
            findReplace(sample_tags);
            return true;
          })
        """
      }
    }
  ]
}
When I simulate the pipeline ingestion, I find the string values are not modified. Where am I going wrong?
POST /_ingest/pipeline/samples/_simulate
{
  "docs": [
    {
      "_index": "samples",
      "_id": "xUSU_3UB5CXFr25x7DcC",
      "_source": {
        "samples": [
          {
            "sample_tags": {
              "Entry_A": [
                "A_hyphentated-sample",
                "sample1"
              ],
              "Entry_B": "A_multiple_underscore_example",
              "Entry_C": [
                "sample2",
                "another_example_with_underscores"
              ],
              "Entry_E": "last_example"
            }
          }
        ]
      }
    }
  ]
}
// Result
{
  "docs" : [
    {
      "doc" : {
        "_index" : "samples",
        "_type" : "_doc",
        "_id" : "xUSU_3UB5CXFr25x7DcC",
        "_source" : {
          "samples" : [
            {
              "sample_tags" : {
                "Entry_E" : "last_example",
                "Entry_C" : [
                  "sample2",
                  "another_example_with_underscores"
                ],
                "Entry_B" : "A_multiple_underscore_example",
                "Entry_A" : [
                  "A_hyphentated-sample",
                  "sample1"
                ]
              }
            }
          ]
        },
        "_ingest" : {
          "timestamp" : "2020-12-01T17:29:52.3917165Z"
        }
      }
    }
  ]
}
Here is a modified version of your script that will work on the data you provided:
PUT /_ingest/pipeline/samples
{
  "description": "preprocessing of samples.json",
  "processors": [
    {
      "script": {
        "tag": "remove underscore from sample_tags values",
        "source": """
          String replaceString(String value) {
            return value.replace('_',' ');
          }
          void findReplace(Map map) {
            map.keySet().forEach(key -> {
              if (map[key] instanceof String) {
                map[key] = replaceString(map[key]);
              } else {
                map[key] = map[key].stream().map(this::replaceString).collect(Collectors.toList());
              }
            });
          }
          ctx.samples.forEach(sample -> {
            findReplace(sample.sample_tags);
            return true;
          });
        """
      }
    }
  ]
}
The result looks like this:
{
  "samples" : [
    {
      "sample_tags" : {
        "Entry_E" : "last example",
        "Entry_C" : [
          "sample2",
          "another example with underscores"
        ],
        "Entry_B" : "A multiple underscore example",
        "Entry_A" : [
          "A hyphentated-sample",
          "sample1"
        ]
      }
    }
  ]
}
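For context on why the original attempt was a no-op: Painless strings are immutable, so element.replace('_',' ') inside the lambda returns a new string that is immediately discarded. The replacement has to be written back into the map, which is what the map[key] = ... assignment above does.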
You were on the right path, but you were working on copies of the values and weren't setting the modified values back onto the document context ctx, which is what the pipeline eventually returns. This means you'll need to keep track of the current iteration indexes, for the array lists as well as the hash maps and everything in between, so that you can then target the fields' positions in the deeply nested context.
Here's an example taking care of strings and (string-only) array lists. You'll need to extend it to handle hash maps (and other types) and then perhaps extract the whole process into a separate function. But AFAIK you cannot return multiple data types in Java so it may be challenging...
PUT /_ingest/pipeline/samples
{
  "description": "preprocessing of samples.json",
  "processors": [
    {
      "script": {
        "tag": "remove underscore from sample_tags values",
        "source": """
          ArrayList samples = ctx.samples;
          for (int i = 0; i < samples.size(); i++) {
            def sample = samples.get(i).sample_tags;
            for (def entry : sample.entrySet()) {
              def key = entry.getKey();
              def val = entry.getValue();
              def replaced_val;
              if (val instanceof String) {
                replaced_val = val.replace('_',' ');
              } else if (val instanceof ArrayList) {
                replaced_val = new ArrayList();
                for (int j = 0; j < val.size(); j++) {
                  replaced_val.add(val[j].replace('_',' '));
                }
              }
              // else if (val instanceof HashMap) {
              //   do your thing
              // }
              // crucial part: write the new value back onto the context
              ctx.samples[i].sample_tags[key] = replaced_val;
            }
          }
        """
      }
    }
  ]
}

Painless scripting initialize new array

I'm trying to add or update a nested object in Elasticsearch using a script. The script below works fine if integrations is already an array, but when it is null, it throws a null pointer exception. How do I initialize ctx._source.integrations to an empty array when its value is null? (Something like the equivalent of JavaScript's myObject.integrations = myObject.integrations ?? [])
POST /products/_update/VFfrnQrKlC5bwdfdeaQ7
{
  "script": {
    "source": """
      ctx._source.integrations.removeIf(i -> i.id == params.integration.id);
      ctx._source.integrations.add(params.integration);
      ctx._source.integrationCount = ctx._source.integrations.length;
    """,
    "params": {
      "integration": {
        "id": "dVTV8GjHj8pXFnlYUUlI",
        "from": true,
        "to": false,
        "vendor": "sfeZWDpZXlF5Qa8mUsiF",
        "targetProduct": {
          "id": "nyILhphvCrGYm53cfaOx",
          "name": "Test Product",
          "categoryIds": []
        }
      }
    }
  }
}
OK, I think this does the trick:
if (ctx._source.integrations == null) {
  ctx._source.integrations = new ArrayList();
}
Is there a shorthand for this, like in the JS example?
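There is: Painless supports the Elvis operator ?:, which falls back to the right-hand side when the left-hand side is null (a minimal sketch of the same null guard, assuming a reasonably recent Painless version):
// equivalent one-liner: keep integrations if non-null, else use an empty list
ctx._source.integrations = ctx._source.integrations ?: [];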
