NiFi MergeRecord processor to merge null values - apache-nifi

I am splitting the list of fields and trying to merge them at the end. I have 2 kinds of fields, standard fields and custom fields. The way I process custom fields is different from the way I process standard fields.
{
"standardfield1" : "fieldValue1",
"customField1" : "customValue"
}
This has to be translated into
{
"standardfield1" : "fieldValue1",
"customFields" : [
{ "type" : "customfield",
"id" : 1212 //this is id of the customField1, retrieved at run time
"value" : "customValue"
} ]
}
My MergeRecord schema is set to
{
"name": "custom field",
"namespace": "nifi",
"type": "record",
"fields": [
{ "name": "id", "type": "string" },
{ "name": "type", "type": "string" },
{ "name": "value", "type": "string" }
]
}
As per my need, I am setting the content of the standard field as a new flowfile attribute (so I can extract it from there later) and putting an empty value in the flowfile content.
So, both custom fields and standard fields are connected to the MergeRecord processor.
It works fine as long as custom fields are available in the payload. If there are only standard fields and no custom fields, the MergeRecord processor won't merge anything and won't fail either; it just throws a NullPointerException, and the flowfile is stuck in the queue forever.
I want the MergeRecord processor to merge even the flowfiles with empty content.
Any help would be appreciated.

I'm not sure I fully understand your use case, but for your input above, if you have extracted/populated the ID for customField1 into an attribute (let's call it myId), then you could use JoltTransformJSON to get your desired output above, using this Chain spec:
[
{
"operation": "shift",
"spec": {
"standardfield1": "standardfield1",
"customField*": {
"#": "customFields.[&(1,1)].value",
"#customfield": "customFields.[&(1,1)].type",
"#${myId}": "customFields.[&(1,1)].id"
}
}
},
{
"operation": "remove",
"spec": {
"customFields": {
"0": ""
}
}
},
{
"operation": "modify-overwrite-beta",
"spec": {
"customFields": {
"*": {
"id": "=toInteger"
}
}
}
}
]
This will create the customFields array if there is a customField present, and populate it with the values you have above (including the value of the myId attribute). You can tweak things (like adding a Default spec to the above Chain) to add an empty array for customFields if you wish (to keep the schema happy, e.g.).
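For example, appending a Default operation as a fourth entry in the Chain above is one way to get that empty array when no custom fields come through the shift. This is only a minimal sketch of the idea, not part of the original spec:
{
  "operation": "default",
  "spec": {
    "customFields": []
  }
}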
If I've misunderstood what you're trying to do, please let me know and I will do my best to help.

Related

Good practice for ElasticSearch index mapping

I'm new to Elasticsearch and I would like to know if there are any good practices for the use case I have.
I have heterogeneous data sent from an API that I save into a database (as JSON) and then index in Elasticsearch for search purposes. The data is sent in this format (because it's heterogeneous, users can send any type of data: some metadata can be multivalued, other values are single-valued, and the name of the key in the JSON may vary):
{
"indices":{
"MultipleIndices":[
{
"index":"editors",
"values":[
"The Editing House",
"Volcan Editing"
]
},
{
"index":"colors",
"values":[
"Red",
"Blue"
]
}
],
"SimpleIndices":[
{
"index":"AuthorName",
"value": "George R. R. Martin"
},
{
"index":"NumberOfPages",
"value":"2898"
},
{
"index":"BookType",
"value":"Fantasy"
}
]
}
}
Once we receive this JSON, it's formatted in the code and stored as JSON in a database with this format:
{
"indices":{
"editors":[
"The Editing House",
"Volcan Editing"
],
"colors":[
"Red",
"Blue"
],
"AuthorName" : "George R. R. Martin"
"NumberOfPages" : "2898",
"BookType" : "Fantasy"
}
}
I then want to save this data into Elasticsearch. What's the best way to map it? Store it as JSON in one field? Will the search be efficient if I do it this way?
You must map each field individually.
You can take a look at the field types to understand which type is ideal for your schema.
Another suggestion is to study text analysis, because it is responsible for structuring the text to optimize the search.
My suggested mapping:
PUT indices
{
"mappings": {
"properties": {
"editors": {
"type": "keyword"
},
"colors":{
"type": "keyword"
},
"author_name":{
"type": "text"
},
"number_pages":{
"type": "integer"
},
"book_type":{
"type": "keyword"
}
}
}
}
I think in your case you don't have much choice apart from dynamic mapping, which Elasticsearch will generate for you as soon as the first document is indexed in a particular index.
However, you can improve the process by using dynamic templates so that you can optimize your mapping; there are good examples of that in the official documentation.
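For example, a minimal dynamic template sketch that maps every new string field to keyword might look like this (the index name and the keyword choice are assumptions for illustration, not from the original answer):
PUT indices
{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_keywords": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  }
}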

Elasticsearch custom mapping definition

I have to upload data to ELK in the following format:
{
"location":{
"timestamp":1522751098000,
"resources":[
{
"resource":{
"name":"Node1"
},
"probability":0.1
},
{
"resource":{
"name":"Node2"
},
"probability":0.01
}]
}
}
I'm trying to define a mapping for this kind of data and I produced the following mapping:
{
"mappings": {
"doc": {
"properties": {
"location": {
"properties" : {
"timestamp": {"type": "date"},
"resources": []
}
}
}
}
}
I have 2 questions:
how can I define the "resources" array in my mapping?
is it possible to define a custom type (e.g. resource) and use this type in my mapping (e.g "resources": [{type:resource}]) ?
There are a lot of things to know about Elasticsearch mapping. I highly suggest reading through at least some of the documentation.
Short answers first, in case you don't care:
Elasticsearch automatically allows storing one or multiple values of defined objects; there is no need to specify an array. See Marker 1 or refer to the documentation on array types.
I don't think there is. Since Elasticsearch 6, only one type per index is allowed. Nested objects are probably the closest, but you define them in the same mapping. Nested objects are stored as separate hidden documents internally.
Long answer and some thoughts
Take a look at the following mapping:
"mappings": {
"doc": {
"properties": {
"location": {
"properties": {
"timestamp": {
"type": "date"
},
"resources": { [1]
"type": "nested", [2]
"properties": {
"resource": {
"properties": {
"name": { [3]
"type": "text"
}
}
},
"probability": {
"type": "float"
}
}
}
}
}
}
}
}
This is what your mapping could look like. It can be done differently, but I think it makes sense this way - maybe except for Marker 3. I'll come to these markers right now:
Marker 1: If you define a field, you usually give it a type. I defined resources as a nested type, but your timestamp is of type date. Elasticsearch automatically allows storing one or multiple values of these objects. timestamp could actually also contain an array of dates, there is no need to specify an array.
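As a quick illustration (assuming the mapping above lives in an index called foo with type doc, both placeholder names), the same date field accepts either a single value or an array:
PUT foo/doc/1
{
  "location": {
    "timestamp": ["2018-04-03", "2018-04-04"]
  }
}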
Marker 2: I defined resources as a nested type, but it could also be an object like resource a little below (where no type is given). Read about nested objects here. In the end I don't know what your queries would look like, so not sure if you really need the nested type.
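For reference, a query against the nested type has to go through a nested query clause. A rough sketch (the index name foo and the example values are assumptions):
GET foo/_search
{
  "query": {
    "nested": {
      "path": "location.resources",
      "query": {
        "bool": {
          "must": [
            { "match": { "location.resources.resource.name": "Node1" } },
            { "range": { "location.resources.probability": { "gte": 0.05 } } }
          ]
        }
      }
    }
  }
}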
Marker 3: I want to address two things here. First, I want to mention again that resource is defined as a normal object with property name. You could do that for resources as well.
Second thing is more a thought-provoking impulse: Don't take it too seriously if something absolutely doesn't fit your case. Just take it as an opinion.
This mapping structure looks very much inspired by a relational database approach. I think you usually want to define document structures for Elasticsearch more around the expected searches. Redundancy is not a problem, but nested objects can make your queries complicated. I think I would omit the whole resources part and do something like this:
"mappings": {
"doc": {
"properties": {
"location": {
"properties": {
"timestamp": {
"type": "date"
},
"resource": {
"properties": {
"resourceName": {
"type": "text"
},
"resourceProbability": {
"type": "float"
}
}
}
}
}
}
}
}
Because as I said, in this case resource can contain an array of objects, each having a resourceName and a resourceProbability.
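To make that concrete, a document indexed under this flattened mapping could look like the following sketch (the index and type names foo/doc are placeholders):
PUT foo/doc/2
{
  "location": {
    "timestamp": 1522751098000,
    "resource": [
      { "resourceName": "Node1", "resourceProbability": 0.1 },
      { "resourceName": "Node2", "resourceProbability": 0.01 }
    ]
  }
}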

Elasticsearch: Duplicate properties in a single record

I have to find every document in Elasticsearch that has duplicate properties. My mapping looks something like this:
"type": {
"properties": {
"thisProperty": {
"properties" : {
"id":{
"type": "keyword"
},
"other_id":{
"type": "keyword"
}
}
}
The documents I have to find have a pattern like this:
"thisProperty": [
{
"other_id": "123",
"id": "456"
},
{
"other_id": "123",
"id": "456"
},
{
"other_id": "4545",
"id": "789"
}]
So, I need to find any document of this type that has repeated property fields. Also, I cannot search by term because I do not know what the value of either id field is. So far the API hasn't shown a clear way to do this via query, and the programmatic approach is possible but cumbersome. Is it possible to get this result set with an Elasticsearch query? If so, how?
(The version of Elasticsearch is 5.3)

How to generate N FlowFiles and set the content of each FlowFile according to the data in Elastic?

In Elasticsearch I have this index and mapping:
PUT /myindex
{
"mappings": {
"myentries": {
"_all": {
"enabled": false
},
"properties": {
"yid": {"type": "keyword"},
"days": {
"properties": {
"Type1": { "type": "date" },
"Type2": { "type": "date" }
}
},
"directions": {
"properties": {
"name": {"type": "keyword"},
"recorder": { "type": "keyword" },
"direction": { "type": "integer" }
}
}
}
}
}
}
I want to generate N FlowFiles, one for each combination of the values of recorder and direction in the directions mapping. How can I do it in NiFi? I was thinking of using GenerateFlowFile, but how can I apply this logic related to Elasticsearch?
One possible workaround might be to generate N FlowFiles using GenerateFlowFile, where the Batch field could be hardcoded and set to 10 (the number of entries in Elastic). But then I don't know what the next step should be.
GenerateFlowFile is probably not the right tool here, as it doesn't accept incoming connections, so you would not be able to parameterize it with the count. You can use SplitJson, which will split the flowfile into multiple flowfiles given a JSONPath expression that returns an array from the JSON content.
Update
Here is a great tool you can use to evaluate JSONPath dynamically and see what it matches. In your example, let's say you received data like the following:
{
"yid": "nifi",
"days" : [{"Type1": "09/07/2017"},{"Type2":"10/07/2017"}],
"directions": [
{
"name": "San Francisco",
"recorder" : "Samsung",
"direction": "0"
},
{
"name": "Santa Monica",
"recorder" : "iPhone",
"direction": "270"
},
{
"name": "San Diego",
"recorder" : "Razr",
"direction": "180"
},
{
"name": "Santa Clara",
"recorder" : "Android",
"direction": "0"
}
]
}
The JSONPath expression $.directions[*].direction would return:
[
"0",
"270",
"180",
"0"
]
This would allow SplitJson to create four flowfiles with the derived content and fragment attributes to correlate them back to the original flowfile.
If you actually need to perform permutation logic on the resulting direction & recorder values, you may want to use ExecuteScript with a simple Groovy/Ruby/Python script to do that operation inline and split out the resulting values.

Extract text from field arrays

One of the fields, called "resources", has the following 2 inner documents:
{
"type": "AWS::S3::Object",
"ARN": "arn:aws:s3:::sms_vild/servers_backup/db_1246/db/reports_201706.schema"
},
{
"accountId": "934331768510612",
"type": "AWS::S3::Bucket",
"ARN": "arn:aws:s3:::sms_vild"
}
I need to split the ARN field and get the last part of it, i.e. "reports_201706.schema", preferably using a scripted field.
What I have tried:
1) I checked the fields list and found only 2 entries, resources.accountId and resources.type.
2) I tried with a date-time field and it worked correctly in the scripted field option (expression):
doc['eventTime'].value
3) But the same does not work with other text fields, e.g.
doc['eventType'].value
Getting this error:
"caused_by":{"type":"script_exception","reason":"link error","script_stack":["doc['eventType'].value","^---- HERE"],"script":"doc['eventType'].value","lang":"expression","caused_by":{"type":"illegal_argument_exception","reason":"Fielddata is disabled on text fields by default. Set fielddata=true on [eventType] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."}}},"status":500}
It means I need to change the mapping. Is there any other way to extract text from nested arrays in an object?
Update:
Please visit the sample Kibana here:
https://search-accountact-phhofxr23bjev4uscghwda4y7m.us-east-1.es.amazonaws.com/_plugin/kibana/
Search for "ebs_attach.png" and then check the resources field. You will see 2 nested arrays like this:
{
"type": "AWS::S3::Object",
"ARN": "arn:aws:s3:::datameetgeo/ebs_attach.png"
},
{
"accountId": "513469704633",
"type": "AWS::S3::Bucket",
"ARN": "arn:aws:s3:::datameetgeo"
}
I need to split the ARN field and extract the last part, which is again "ebs_attach.png".
If I can somehow display it as a scripted field, then I can see the bucket name and the file name side by side on the Discover tab.
Update 2
In other words, I am trying to extract the text shown in this image as a new field on the Discover tab.
While you can use scripting for this, I highly encourage you to extract that kind of information at index time. I have provided two examples below, which are far from failsafe (you need to test with different paths or with this field missing entirely), but they should provide a base to start with.
PUT foo/bar/1
{
"resources": [
{
"type": "AWS::S3::Object",
"ARN": "arn:aws:s3:::sms_vild/servers_backup/db_1246/db/reports_201706.schema"
},
{
"accountId": "934331768510612",
"type": "AWS::S3::Bucket",
"ARN": "arn:aws:s3:::sms_vild"
}
]
}
# this is slow!!! the script runs for every search hit, reads _source each time,
# and extracts everything after the last '/' in the ARN of the first AWS::S3::Object resource
GET foo/_search
{
"script_fields": {
"document": {
"script": {
"inline": "return params._source.resources.stream().filter(r -> 'AWS::S3::Object'.equals(r.type)).map(r -> r.ARN.substring(r.ARN.lastIndexOf('/') + 1)).findFirst().orElse('NONE')"
}
}
}
}
# Do this at index time instead, by adding an ingest pipeline that stores the extracted name in a filename field
PUT _ingest/pipeline/my-pipeline-id
{
"description" : "describe pipeline",
"processors" : [
{
"script" : {
"inline": "ctx.filename = ctx.resources.stream().filter(r -> 'AWS::S3::Object'.equals(r.type)).map(r -> r.ARN.substring(r.ARN.lastIndexOf('/') + 1)).findFirst().orElse('NONE')"
}
}
]
}
# Store the document, specify the pipeline
PUT foo/bar/1?pipeline=my-pipeline-id
{
"resources": [
{
"type": "AWS::S3::Object",
"ARN": "arn:aws:s3:::sms_vild/servers_backup/db_1246/db/reports_201706.schema"
},
{
"accountId": "934331768510612",
"type": "AWS::S3::Bucket",
"ARN": "arn:aws:s3:::sms_vild"
}
]
}
# let's check the filename field of the indexed document by getting it
GET foo/bar/1
# We can even search for this file now
GET foo/_search
{
"query": {
"match": {
"filename": "reports_201706.schema"
}
}
}
Note: Considering "resources" is a kind of array:
NSArray *array_ARN_Values = [resources valueForKey:@"ARN"];
Hope it will work for you!
