Delete empty attributes in NiFi - apache-nifi

Because this issue is still unresolved, I have an EvaluateJsonPath processor that sometimes outputs attributes with empty strings.
Is there a straightforward way to delete attributes from a flowfile?
I tried using the UpdateAttribute processor, but it is only able to delete based on matching an attribute's name (I need to match on the attribute's value).

You can use the ExecuteGroovyScript processor (available since NiFi 1.5.0) with the following code:
def ff = session.get()
if (!ff) return
// collect the names of all attributes whose value is null or an empty string
def emptyKeys = ff.getAttributes().findAll { it.value == null || it.value == '' }.collect { it.key }
// drop those attributes and transfer the flowfile to success
ff.removeAllAttributes(emptyKeys)
REL_SUCCESS << ff

After the EvaluateJsonPath processor, use a RouteOnAttribute processor and check for attributes with empty values using Expression Language.
RouteOnAttribute configs:
Add a new property:
emptyattribute
${anyAttribute("id","age"):isEmpty()}
or, by using the or function:
${id:isEmpty():or(${age:isEmpty()})}
In the above Expression Language we check whether either the id or age attribute has an empty value and route those flowfiles to the emptyattribute relationship.
${allAttributes("id","age"):isEmpty()}
or, by using the and function:
${id:isEmpty():and(${age:isEmpty()})}
This expression routes only when both the id and age attributes are empty.
Connect the emptyattribute relationship to an UpdateAttribute processor and delete the attributes you want to remove.
UpdateAttribute configs:
In the Delete Attributes Expression property, specify that the id and age attributes need to be deleted.
By using RouteOnAttribute after the EvaluateJsonPath processor we can check whether the required attributes have values, and then use UpdateAttribute to delete the attributes that have empty values.
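For reference, UpdateAttribute's Delete Attributes Expression property takes a regular expression matching the attribute names to drop; a minimal sketch, assuming the attributes to remove are id and age:
Delete Attributes Expression : ^(id|age)$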

You can use a JOLT transform, but I can only get it to work for fields at the top level of the JSON. Any nested fields are lost, although perhaps a real JOLT expert can improve on the solution to prevent that.
[
  {
    "operation": "shift",
    "spec": {
      "*": {
        "": "TRASH",
        "*": {
          "$": "&2"
        }
      }
    }
  },
  {
    "operation": "remove",
    "spec": {
      "TRASH": ""
    }
  }
]
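For illustration (a sketch I have not run through JoltTransformJSON myself), a flat input such as
{
  "id": "123",
  "age": "",
  "name": "Bob"
}
should come out as
{
  "id": "123",
  "name": "Bob"
}
because the empty age value is shifted into TRASH by the first operation and TRASH is dropped by the second.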

Once you validate which of the required attribute values are empty strings, make use of UpdateAttribute Advanced Usage: check for the attributes whose values are empty strings, then change the value to null. For advanced usage of UpdateAttribute refer to this link: community.hortonworks.com/questions/141774/… Add Rule: idnull. Conditions: ${id:isEmpty():or(${id:isNull()})}. Actions: id (attribute), null (value). – Shu Feb 3 '18 at 3:08
This approach does not remove the attribute; it just sets the attribute value to null.

Related

Elasticsearch query subfield directly without prefix

If I have an object like this in Elasticsearch, where a is an object with some fields (dynamically mapped):
{
  "a": {
    "b": "b_value",
    "c": "c_value"
  }
}
How can I use the query 'b:b_value' to get matching documents without having to specify 'a.b:b_value'?
I tried searching online but none of the suggestions worked. Is this possible?
You can use a field alias.
An alias mapping defines an alternate name for a field in the index. The alias can be used in place of the target field in search requests, and selected other APIs like field capabilities.
https://www.elastic.co/blog/introducing-field-aliases-in-elasticsearch
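A minimal sketch of such a mapping (the index name my-index and the keyword type for a.b are assumptions; field aliases need Elasticsearch 6.4+, and the typeless mapping syntax shown here is for 7.x):
PUT my-index
{
  "mappings": {
    "properties": {
      "a": {
        "properties": {
          "b": { "type": "keyword" },
          "c": { "type": "keyword" }
        }
      },
      "b": {
        "type": "alias",
        "path": "a.b"
      }
    }
  }
}
With the alias in place, a query string such as b:b_value resolves to the target field a.b.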

How to copy a value from one field to another if a field exists, using an ingest node pipeline

I want to create a new field called kubernetes.pod.name if a field called prometheus.labels.pod exists in the logs. I found that with the set processor I could copy the value present in prometheus.labels.pod to a new field kubernetes.pod.name, but I need to do this conditionally as the pod name keeps changing.
How do I set a condition such that only if the field prometheus.labels.pod exists do I add the new field kubernetes.pod.name (both have the same value)?
ctx.prometheus?.labels?.namespace == "name_of_namespace"
can be done; similarly, can we do
ctx.prometheus?.labels?.pod == "*"
to check whether this field exists or not?
If the field is a string and you need a condition that only fires when it exists, the best way is to use the following condition in the set processor:
ctx.prometheus?.labels?.namespace != null
This is how I implemented the above scenario using an ingest node pipeline:
"set": {
"field": "kubernetes.pod.name",
"copy_from": "prometheus.labels.pod",
"if": "ctx.prometheus?.labels?.pod!=null",
"ignore_failure": true
}
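Wrapped in a complete pipeline definition it might look like this (the pipeline name copy-pod-name is made up for the example; copy_from on the set processor needs Elasticsearch 7.9+):
PUT _ingest/pipeline/copy-pod-name
{
  "processors": [
    {
      "set": {
        "field": "kubernetes.pod.name",
        "copy_from": "prometheus.labels.pod",
        "if": "ctx.prometheus?.labels?.pod != null",
        "ignore_failure": true
      }
    }
  ]
}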

Kibana scripted field which loops through an array

I am trying to use the Metricbeat http module to monitor F5 pools.
I make a request to the F5 API and bring back JSON, which is saved to Kibana. But the JSON contains an array of pool members and I want to count the number which are up.
The advice seems to be that this can be done with a scripted field. However, I can't get the script to retrieve the array, e.g.
doc['http.f5pools.items.monitor'].value.length()
returns the following in the preview results (with the same 'Additional Field' added for comparison):
[
  {
    "_id": "rT7wdGsBXQSGm_pQoH6Y",
    "http": {
      "f5pools": {
        "items": [
          {
            "monitor": "default"
          },
          {
            "monitor": "default"
          }
        ]
      }
    },
    "pool.MemberCount": [
      7
    ]
  },
If I try
doc['http.f5pools.items']
or similar, I just get an error:
"reason": "No field found for [http.f5pools.items] in mapping with types []"
Googling suggests that the doc construct does not contain arrays?
Is it possible to make a scripted field which can access the set of values? I.e. is my code or the way I'm indexing the data wrong?
If not, is there an alternative approach within Metricbeat? I don't want to have to make a whole new API to do the calculation and add a separate field.
-- update
Weirdly, it seems that the number values in the array do return the expected results, i.e.
doc['http.f5pools.items.ratio']
returns
{
  "_id": "BT6WdWsBXQSGm_pQBbCa",
  "pool.MemberCount": [
    1,
    1
  ]
},
-- update 2
OK, so if the strings in the field have different values then you get all of the values; if they are the same you just get one. What is going on?
I'm adding another answer instead of deleting my previous one, which does not address the actual question but may still be helpful for someone else in the future.
I found a hint in the same documentation:
Doc values are a columnar field value store
Googling this further I found this Doc Values intro, which says that doc values are essentially an "uninverted index" useful for operations like sorting; my hypothesis is that while sorting you essentially don't want the same values repeated, and hence the data structure they use removes those duplicates. That still did not answer why it works differently for strings than for numbers: numbers are preserved but strings are filtered down to unique values.
This “uninverted” structure is often called a “column-store” in other
systems. Essentially, it stores all the values for a single field
together in a single column of data, which makes it very efficient for
operations like sorting.
In Elasticsearch, this column-store is known as doc values, and is
enabled by default. Doc values are created at index-time: when a field
is indexed, Elasticsearch adds the tokens to the inverted index for
search. But it also extracts the terms and adds them to the columnar
doc values.
Some more deep-diving into doc values revealed it is a compression technique which actually de-duplicates the values for efficient and memory-friendly operations.
Here's a NOTE given on the link above which answers the question:
You may be thinking "Well that’s great for numbers, but what about
strings?" Strings are encoded similarly, with the help of an ordinal
table. The strings are de-duplicated and sorted into a table, assigned
an ID, and then those ID’s are used as numeric doc values. Which means
strings enjoy many of the same compression benefits that numerics do.
The ordinal table itself has some compression tricks, such as using
fixed, variable or prefix-encoded strings.
Also, if you don't want this behavior then you can disable doc values.
OK, solved it.
https://discuss.elastic.co/t/problem-looping-through-array-in-each-doc-with-painless/90648
So, as I discovered, arrays are pre-filtered to only return distinct values (except in the case of ints, apparently?).
The solution is to use params._source instead of doc[].
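As a rough sketch, a Painless scripted field along these lines can walk the array via params._source and count the members that are up (the check monitor == "up" is an assumption; substitute whatever value the F5 API actually reports for a healthy member):
int up = 0;
// _source keeps the full JSON objects, unlike doc values
def http = params._source.http;
if (http != null && http.f5pools != null && http.f5pools.items != null) {
  for (def member : http.f5pools.items) {
    if (member.monitor == "up") {
      up++;
    }
  }
}
return up;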
The answer for why doc doesn't work:
Quoting below:
Doc values are a columnar field value store, enabled by default on all
fields except for analyzed text fields.
Doc-values can only return "simple" field values like numbers, dates,
geo- points, terms, etc, or arrays of these values if the field is
multi-valued. It cannot return JSON objects
Also, it is important to add a null check, as mentioned below:
Missing fields
The doc['field'] will throw an error if field is
missing from the mappings. In painless, a check can first be done with
doc.containsKey('field') to guard accessing the doc map.
Unfortunately, there is no way to check for the existence of the field
in mappings in an expression script.
Also, here is why _source works
Quoting below:
The document _source, which is really just a special stored field, can
be accessed using the _source.field_name syntax. The _source is loaded
as a map-of-maps, so properties within object fields can be accessed
as, for example, _source.name.first.
Responding to your comment with an example:
The keyword here is: It cannot return JSON objects. The field doc['http.f5pools.items'] is a JSON object.
Try running the request below and see the mapping it creates:
PUT t5/doc/2
{
  "items": [
    {
      "monitor": "default"
    },
    {
      "monitor": "default"
    }
  ]
}
GET t5/_mapping
{
  "t5" : {
    "mappings" : {
      "doc" : {
        "properties" : {
          "items" : {
            "properties" : {
              "monitor" : {   <-- monitor is a property of the items property (Object)
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

NiFi - attributes to JSON - not able to generate the required JSON from an attribute

The flowfile content is
{
  "resourceType": "Patient",
  "myArray": [1, 2, 3, 4]
}
I use the EvaluateJsonPath processor to load "myArray" into an attribute named myArray.
Then I use the AttributesToJSON processor to create JSON from myArray.
But in the flowfile content, what I get is
{"myArray":"[1,2,3,4]"}
I expected the flowfile to have the following content.
{"myArray":[1,2,3,4]}
Here are the flowfile attributes
How can I get "myArray" as an array again in the content?
Use record-oriented processors like the ConvertRecord processor instead of the EvaluateJsonPath and AttributesToJSON processors.
Record Reader: JsonPathReader
JsonPathReader configs:
AvroSchemaRegistry:
{
  "namespace": "nifi",
  "name": "person",
  "type": "record",
  "fields": [
    {
      "name": "myArray",
      "type": {
        "type": "array",
        "items": "int"
      }
    }
  ]
}
JsonRecordSetWriter:
Use the same AvroSchemaRegistry controller service to access the schema.
To access the Avro schema you need to set the schema.name attribute on the flowfile.
The output flowfile content would be
[{"myArray":[1,2,3,4]}]
Please refer to this link for how to configure the ConvertRecord processor.
(or)
If your desired output is {"myArray":[1,2,3,4]} without the enclosing [] (array), then use a
ReplaceText processor instead of the AttributesToJSON processor.
ReplaceText configs:
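Roughly, the ReplaceText configuration would rebuild the content from the attribute (a sketch, since the original screenshot is not shown; the property names are the standard ReplaceText ones):
Replacement Strategy : Always Replace
Evaluation Mode      : Entire text
Replacement Value    : {"myArray":${myArray}}
Because the replacement value references the myArray attribute directly, the array is written into the content unquoted, giving {"myArray":[1,2,3,4]}.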
Not all the credit goes to me, but I was pointed to a better, simpler way to achieve this. There are two ways.
Solution 1 - the simplest and most elegant
Use the NiFi JoltTransformJSON processor. The processor can make use of NiFi Expression Language and attributes on both the left and right hand side of the specification syntax. This allows you to quickly use the JOLT default spec to add new fields (from flowfile attributes) to a new or existing JSON.
Ex:
{"customer_id": 1234567, "vckey_list": ["test value"]}
Both of those field values are stored in flowfile attributes as a result of an EvaluateJsonPath operation. Assume "customer_id_attr" and "vckey_list_attr". We can simply generate a new JSON from those flowfile attributes with the "default" JOLT spec and the right hand syntax. You can even add additional Expression Language functions to the processing:
[
  {
    "operation": "default",
    "spec": {
      "customer_id": ${customer_id_attr},
      "vckey_list": ${vckey_list_attr:toLower()}
    }
  }
]
This worked for me even when storing the entire JSON, path of "$", in a flow-file attribute.
Solution 2 - more complicated and uglier
Use a sequence of NiFi ReplaceText processors. First use a ReplaceText processor to append the desired flowfile attribute to the file content.
replace_text_processor_1
If you are generating a totally new JSON, this would do it. If you are trying to modify an existing one, you would need to first append the desired keys, then use ReplaceText again to properly format them as new keys in the existing JSON, from
{"original_json_key": original_json_obj}{"customer_id": 1234567, "vckey_list": ["test value"]}
to
{"original_json_key": original_json_obj, "customer_id": 1234567, "vckey_list": ["test value"]}
using
replace_text_processor_2
Then use JOLT to do further processing (that's why Solution 1 always makes sense).
Hope this helps; I spent about half a day figuring out the second solution and was pointed to Solution 1 by someone with more experience in NiFi.

Aggregating nested fields of varying datatypes in Elasticsearch

I have an index based on Products and one of the fields declared in the mapping is Attributes. This field is a nested type as it will contain two values - key and value. The problem I have is that depending on the context of the attribute, the datatype of value can vary between an integer and a string.
For example:
{"attributes":[{"key":"StrEx","value":"Red"},{"key":"IntEx","value":2}]}
It seems the datatype for every instance of 'value' within all future nested documents within Attributes is decided based on the first data entered. I need to be able to store it as an integer/long datatype so I can perform range queries.
Any help or alternative ideas would be greatly appreciated.
You need a mapping like this one, for the value field:
"value": {
"type": "string",
"fields": {
"as_number": {
"type": "integer",
"ignore_malformed": true
}
}
}
Basically, your field is a string, but using fields (multi-fields) you can also attempt to index it as a numeric field.
When you want to run range queries, use value.as_number; for anything else use value.
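A range query would then look something like this (a sketch; the index name products, the lowercase nested path attributes, and the exact bounds are assumptions):
GET products/_search
{
  "query": {
    "nested": {
      "path": "attributes",
      "query": {
        "bool": {
          "filter": [
            { "match": { "attributes.key": "IntEx" } },
            { "range": { "attributes.value.as_number": { "gte": 1, "lte": 10 } } }
          ]
        }
      }
    }
  }
}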
