Run Elasticsearch processor on all the fields of a document - elasticsearch

I am trying to trim and lowercase all the values of a document that is getting indexed into Elasticsearch.
The available processors all require a mandatory field key, which means a processor can be applied to only one field at a time.
Is there a way to run a processor on all the fields of a document?

There sure is. Use a script processor, but beware of reserved keys like _type, _id, etc.:
PUT _ingest/pipeline/my_string_trimmer
{
  "description": "Trims and lowercases all string values",
  "processors": [
    {
      "script": {
        "source": """
          def forbidden_keys = [
            '_type',
            '_id',
            '_version_type',
            '_index',
            '_version'
          ];
          def corrected_source = [:];
          for (pair in ctx.entrySet()) {
            def key = pair.getKey();
            if (forbidden_keys.contains(key)) {
              continue;
            }
            def value = pair.getValue();
            if (value instanceof String) {
              corrected_source[key] = value.trim().toLowerCase();
            } else {
              corrected_source[key] = value;
            }
          }
          // overwrite the original
          ctx.putAll(corrected_source);
        """
      }
    }
  ]
}
Test with a sample doc:
POST my-index/_doc?pipeline=my_string_trimmer
{
  "abc": " DEF ",
  "def": 123,
  "xyz": false
}
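To sanity-check the pipeline without indexing anything, the simulate API can be used as well; under the pipeline above, a doc like the one just shown should come back with its string values trimmed and lowercased and everything else untouched:
POST _ingest/pipeline/my_string_trimmer/_simulate
{
  "docs": [
    {
      "_source": {
        "abc": " DEF ",
        "def": 123,
        "xyz": false
      }
    }
  ]
}
The simulated doc's _source should come back as { "abc": "def", "def": 123, "xyz": false }.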

Related

Painless script to increase the count if the full path exists or else add the full path and add the count

I am creating a script that increments the count value of a field if the field's full path exists, or else adds the full path dynamically. For example, in the example below:
If the record already has inner->board1->count, I should increment it by the incoming count value.
If inner, board1, or count does not exist, I should add the missing levels and set the count value. Please also note that the names inner, board1, and count are not fixed.
If the value is not an object, I can check it using ctx._source.myCounts == null, but I am not sure how to check object fields, their subfields, and sub-subfields.
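For a single, known path the existence check can be chained level by level with short-circuiting null checks, something like this sketch (field names taken from the example above); the difficulty is doing this when the path segments are not fixed:
if (ctx._source.myCounts != null
    && ctx._source.myCounts.inner != null
    && ctx._source.myCounts.inner.board1 != null
    && ctx._source.myCounts.inner.board1.count != null) {
  ctx._source.myCounts.inner.board1.count += params.myCounts.inner.board1.count;
} else {
  // the missing levels would have to be created here
}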
Code
POST test/_update/3
{
  "script": {
    "source": "ctx._source.board_counts = params.myCounts",
    "lang": "painless",
    "params": {
      "myCounts": {
        "inner": {
          "board1": { "count": 5 },
          "board2": { "count": 4 },
          "board3": { "temp": 1, "temp2": 3 }
        },
        "outer": {
          "board1": { "count": 5 },
          "board10": { "temp": 1, "temp2": 3 }
        }
      }
    }
  }
}
I was able to come up with the following, and it is working fine.
POST test/_update/3
{
  "script": {
    "source": """
      if (ctx._source['myCounts'] == null) {
        ctx._source['myCounts'] = [:];
      }
      for (mainItem in params.myCounts) {
        for (accessItemKey in mainItem.keySet()) {
          if (ctx._source.myCounts[accessItemKey] == null) {
            ctx._source.myCounts[accessItemKey] = [:];
          }
          for (boardItemKey in mainItem[accessItemKey].keySet()) {
            if (ctx._source.myCounts[accessItemKey][boardItemKey] == null) {
              ctx._source.myCounts[accessItemKey][boardItemKey] = [:];
            }
            for (countItemKey in mainItem[accessItemKey][boardItemKey].keySet()) {
              if (ctx._source.myCounts[accessItemKey][boardItemKey][countItemKey] == null) {
                ctx._source.myCounts[accessItemKey][boardItemKey][countItemKey] = mainItem[accessItemKey][boardItemKey][countItemKey];
              } else {
                ctx._source.myCounts[accessItemKey][boardItemKey][countItemKey] += mainItem[accessItemKey][boardItemKey][countItemKey];
              }
            }
          }
        }
      }
    """,
    "lang": "painless",
    "params": {
      "myCounts": {
        "inner": { "board1": {"count": 5}, "board2": {"count": 4}, "board3": {"temp": 1, "temp2": 3} },
        "outer": { "board1": {"count": 5}, "board10": {"temp": 1, "temp2": 3} }
      }
    }
  }
}
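To make the intended behaviour concrete, here is a hypothetical worked example: suppose the stored document already contained a partial myCounts object like this before the update:
"myCounts": {
  "inner": {
    "board1": { "count": 2 }
  }
}
Running the update with the params shown above should then leave inner.board1.count incremented to 7 (2 + 5) and every other path created with the incoming values:
"myCounts": {
  "inner": {
    "board1": { "count": 7 },
    "board2": { "count": 4 },
    "board3": { "temp": 1, "temp2": 3 }
  },
  "outer": {
    "board1": { "count": 5 },
    "board10": { "temp": 1, "temp2": 3 }
  }
}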

Aggregating sequence of connected events

Let's say I have events like this in my log:
{type:"approval_revokation", approval_id=22}
{type:"approval", request_id=12, approval_id=22}
{type:"control3", request_id=12}
{type:"control2", request_id=12}
{type:"control1", request_id=12}
{type:"request", request_id=12 requesting_user="user1"}
{type:"registration", userid="user1"}
I would like to do a search that aggregates one bucket for each approval_id, containing all events connected to it as above. As you can see, there is no single id field that is shared by all the events, but they are all connected in a chain.
The reason I would like this is to feed it into an anomaly detector, to verify things like whether all controls were executed and to validate the registration event for an eventual approval.
Can this be done using aggregations, or are there any other suggestions?
If there's no single unique "glue" parameter to tie these events together, I'm afraid the only choice is a brute-force map-reduce iterator on all the docs in the index.
After ingesting the above events:
POST _bulk
{"index":{"_index":"events","_type":"_doc"}}
{"type":"approval_revokation","approval_id":22}
{"index":{"_index":"events","_type":"_doc"}}
{"type":"approval","request_id":12,"approval_id":22}
{"index":{"_index":"events","_type":"_doc"}}
{"type":"control3","request_id":12}
{"index":{"_index":"events","_type":"_doc"}}
{"type":"control2","request_id":12}
{"index":{"_index":"events","_type":"_doc"}}
{"type":"control1","request_id":12}
{"index":{"_index":"events","_type":"_doc"}}
{"type":"request","request_id":12,"requesting_user":"user1"}
{"index":{"_index":"events","_type":"_doc"}}
{"type":"registration","userid":"user1"}
we can link them together like so:
POST events/_search
{
  "size": 0,
  "aggs": {
    "log_groups": {
      "scripted_metric": {
        "init_script": "state.groups = [];",
        "map_script": """
          int fetchIndex(List groups, def key, def value, def backup_key) {
            if (key == null || value == null) {
              // nothing to search
              return -1;
            }
            return IntStream.range(0, groups.size())
                            .filter(i -> groups.get(i)['docs']
                                               .stream()
                                               .anyMatch(_doc -> _doc.get(key) == value
                                                                 || (backup_key != null
                                                                     && _doc.get(backup_key) == value)))
                            .findFirst()
                            .orElse(-1);
          }

          def approval_id = doc['approval_id'].size() != 0
                              ? doc['approval_id'].value
                              : null;
          def request_id = doc['request_id'].size() != 0
                             ? doc['request_id'].value
                             : null;
          def requesting_user = doc['requesting_user.keyword'].size() != 0
                                  ? doc['requesting_user.keyword'].value
                                  : null;
          def userid = doc['userid.keyword'].size() != 0
                         ? doc['userid.keyword'].value
                         : null;

          HashMap valueMap = ['approval_id':approval_id,
                              'request_id':request_id,
                              'requesting_user':requesting_user,
                              'userid':userid];

          def found = false;
          for (def entry : valueMap.entrySet()) {
            def field = entry.getKey();
            def value = entry.getValue();
            def backup_key = field == 'userid'
                               ? 'requesting_user'
                               : field == 'requesting_user'
                                   ? 'userid'
                                   : null;
            def found_index = fetchIndex(state.groups, field, value, backup_key);
            if (found_index != -1) {
              state.groups[found_index]['docs'].add(params._source);
              if (approval_id != null) {
                state.groups[found_index]['approval_id'] = approval_id;
              }
              found = true;
              break;
            }
          }

          if (!found) {
            HashMap nextInLine = ['docs': [params._source]];
            if (approval_id != null) {
              nextInLine['approval_id'] = approval_id;
            }
            state.groups.add(nextInLine);
          }
        """,
        "combine_script": "return state",
        "reduce_script": "return states"
      }
    }
  }
}
returning the grouped events + the inferred approval_id:
"aggregations" : {
"log_groups" : {
"value" : [
{
"groups" : [
{
"docs" : [
{...}, {...}, {...}, {...}, {...}, {...}, {...}
],
"approval_id" : 22
},
{ ... }
]
}
]
}
}
Keep in mind that such scripts are going to be quite slow, esp. when run on large numbers of events.

Terraform output object with multiple attributes for each of `for` resources?

I have Terraform code where a resource is created in a loop. As is typical, each instance of this resource has several attributes. At the moment I have a series of map outputs for this resource group, but each consists of only a single key-value pair. I would like my Terraform output to include a list or map of maps or objects with all of the attributes grouped by resource instance. How do I do this without using flatten, zipmap, etc. to construct them from my current outputs? This example uses aws_route53_record, but the question is generic:
Current code
output "r53record_zonal_fqdn" {
value = {
for entry in aws_route53_record.zonal :
entry.name => entry.fqdn
}
}
output "r53record_zonal_records" {
value = {
for entry in aws_route53_record.zonal :
entry.name => entry.records
}
}
output "r53record_zonal_zone_id" {
value = {
for entry in aws_route53_record.zonal :
entry.name => entry.zone_id
}
}
As you would expect, this renders three maps with aws_route53_record.zonal.name as the key and the other attribute(s) as the value.
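For illustration, with two records named site1 and site2 in a mydomain.com zone (the same setup used in the answer below), the first of those outputs renders something like:
r53record_zonal_fqdn = {
  "site1" = "site1.mydomain.com"
  "site2" = "site2.mydomain.com"
}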
What I would like is to have these outputs grouped by resource with a predefined key for each value, e.g. (pseudocode):
output "r53record_zonal_zone_id" {
value = {
for entry in aws_route53_record.zonal : {
instance[count.index] {
"name" = entry.name
"fqdn" = entry.fqdn
"records" = entry.records
"zone_id" = entry.zone_id
}
}
}
}
This would produce a map or list of maps for each instance.
How can this, or something like it, be done?
I created a sample route53_record resource block with two name values in a for_each loop and tried to output something close to what you were looking for.
Assuming "mydomain.com" is the hosted zone in Route53, as an example...
resource "aws_route53_record" "zonal" {
for_each=toset(["site1","site2"])
name = each.key
zone_id = "ABCDZONEIDSTRING"
type = "A"
ttl = "300"
records = ["192.168.1.10"]
}
output "test" {
value = {
for dns, details in aws_route53_record.zonal:
dns => ({"fqdn" = details.fqdn , "zone_id" = details.zone_id , "records" = details.records})
}
}
This will create output in this fashion:
test = {
  "site1" = {
    "fqdn" = "site1.mydomain.com"
    "records" = [
      "192.168.1.10",
    ]
    "zone_id" = "Z0630117NTQNSYTXYQ4Z"
  }
  "site2" = {
    "fqdn" = "site2.mydomain.com"
    "records" = [
      "192.168.1.10",
    ]
    "zone_id" = "Z0630117NTQNSYTXYQ4Z"
  }
}
Suppose you passed the name values with the domain name included, like below:
for_each = toset(["site1.mydomain.com", "site2.mydomain.com"])
The output would then look like:
test = {
  "site1.mydomain.com" = {
    "fqdn" = "site1.mydomain.com"
    "records" = [
      "192.168.1.10",
    ]
    "zone_id" = "ABCDMYZONEIDSTRING"
  }
  "site2.mydomain.com" = {
    "fqdn" = "site2.mydomain.com"
    "records" = [
      "192.168.1.10",
    ]
    "zone_id" = "ABCDMYZONEIDSTRING"
  }
}
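If you also want the record's name inside each object, as in the pseudocode from the question, the same for expression can simply emit more attributes; a sketch (the output name here is arbitrary):
output "r53record_zonal" {
  value = {
    for dns, details in aws_route53_record.zonal :
    dns => {
      "name"    = details.name
      "fqdn"    = details.fqdn
      "records" = details.records
      "zone_id" = details.zone_id
    }
  }
}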

Elasticsearch: Multiply each nested element plus aggregation

Let's imagine an index composed of 2 documents like these:
doc1 = {
"x":1,
"y":[{brand:b1, value:1},
{brand:b2, value:2}]
},
doc2 = {
"x":2,
"y":[{brand:b1, value:0},
{brand:b2, value:3}]
}
Is it possible to multiply each value of y by x for each document and then do a sum aggregation based on the brand term? Doc1 contributes 1*1 to b1 and 1*2 to b2, and doc2 contributes 2*0 to b1 and 2*3 to b2, so the expected result is:
b1: 1
b2: 8
If not, could it be done with any other mapping types?
This is a highly custom use case, so I don't think there's a pre-optimized mapping for it.
What I would suggest is the following:
Set up an index w/ y being nested:
PUT xy/
{"mappings":{"properties":{"y":{"type":"nested"}}}}
Ingest the docs from your example:
POST xy/_doc
{"x":1,"y":[{"brand":"b1","value":1},{"brand":"b2","value":2}]}
POST xy/_doc
{"x":2,"y":[{"brand":"b1","value":0},{"brand":"b2","value":3}]}
Use a scripted_metric aggregation to compute the products and add them up in a shared HashMap:
GET xy/_search
{
  "size": 0,
  "aggs": {
    "multiply_and_add": {
      "scripted_metric": {
        "init_script": "state.by_brands = [:]",
        "map_script": """
          def x = params._source['x'];
          for (def brand_pair : params._source['y']) {
            def brand = brand_pair['brand'];
            def product = x * brand_pair['value'];
            if (state.by_brands.containsKey(brand)) {
              state.by_brands[brand] += product;
            } else {
              state.by_brands[brand] = product;
            }
          }
        """,
        "combine_script": "return state",
        "reduce_script": "return states"
      }
    }
  }
}
which would yield something along the lines of
{
  ...
  "aggregations": {
    "multiply_and_add": {
      "value": [
        {
          "by_brands": {      <----
            "b2": 8,
            "b1": 1
          }
        }
      ]
    }
  }
}
UPDATE
The reduce_script (the step that receives the per-shard states) could look like this:
def combined_states = [:];
for (def state : states) {
  for (def brand_pair : state['by_brands'].entrySet()) {
    def key = brand_pair.getKey();
    def value = brand_pair.getValue();
    if (combined_states.containsKey(key)) {
      // brand already seen in another shard's state: add to the running total
      combined_states[key] += (float) value;
      continue;
    }
    combined_states[key] = (float) value;
  }
}
return combined_states;
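With that reduce_script plugged into the search above in place of "return states", the aggregation value should collapse into a single combined map, roughly:
"aggregations": {
  "multiply_and_add": {
    "value": {
      "b1": 1.0,
      "b2": 8.0
    }
  }
}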

elasticsearch-painless - Manipulate date

I am trying to manipulate a date in Elasticsearch's scripting language, Painless.
Specifically, I am trying to add 4 hours, which is 14,400 seconds.
{
  "script_fields": {
    "new_date_field": {
      "script": {
        "inline": "doc['date_field'] + 14400"
      }
    }
  }
}
This throws Cannot apply [+] operation to types [org.elasticsearch.index.fielddata.ScriptDocValues.Longs] and [java.lang.Integer].
Thanks
The solution was to use .value
{
  "script_fields": {
    "new_date_field": {
      "script": {
        "inline": "doc['date_field'].value + 14400"
      }
    }
  }
}
However, I actually wanted to use it for reindexing, where the format is a bit different.
Here is my version for manipulating time in the _reindex API:
POST _reindex
{
  "source": {
    "index": "some_index_v1"
  },
  "dest": {
    "index": "some_index_v2"
  },
  "script": {
    "inline": "def sf = new SimpleDateFormat(\"yyyy-MM-dd'T'HH:mm:ss\"); def dt = sf.parse(ctx._source.date_field); def calendar = sf.getCalendar(); calendar.setTime(dt); def instant = calendar.toInstant(); def localDateTime = LocalDateTime.ofInstant(instant, ZoneOffset.UTC); ctx._source.date_field = localDateTime.plusHours(4);"
  }
}
Here is the inline script in a readable version
def sf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
def dt = sf.parse(ctx._source.date_field);
def calendar = sf.getCalendar();
calendar.setTime(dt);
def instant = calendar.toInstant();
def localDateTime = LocalDateTime.ofInstant(instant, ZoneOffset.UTC);
ctx._source.date_field = localDateTime.plusHours(4);
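As a side note, java.time is also available in Painless, so the same transformation can likely be written without SimpleDateFormat; a minimal sketch, assuming date_field really holds strings in yyyy-MM-dd'T'HH:mm:ss format:
def localDateTime = LocalDateTime.parse(ctx._source.date_field);
ctx._source.date_field = localDateTime.plusHours(4);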
Here is the list of functions supported by painless, it was painful.
An addition: converting the date to a string (your first part, I believe) can be done with:
def dt = String.valueOf(ctx._source.date_field);
I just spent a couple of hours playing with this, so that I could concatenate a date field (in UTC format, with 00:00:00 appended) with a string holding the time, to get a valid datetime to index into ES. Don't ask why it was split; it's an old Oracle system.

Resources