Removing brackets in JSON using Apache NiFi

I am trying to remove the {} around each seriesID entry:
"seriesID" : "CUUR0000SA0"
}, {
"seriesID" : "CUSR0000SA0"
}, {
"seriesID" : "LNS14000000"
}, {
"seriesID" : "CES0000000001"
}, {
"seriesID" : "CUUR0000SA0L1E"
}, {
"seriesID" : "CES0500000003"
}, {
"seriesID" : "WPUFD4"
}, {
"seriesID" : "LNS12000000"
}, {
"seriesID" : "WPSFD4"
}, {
"seriesID" : "CUSR0000SA0L1E"
}, {
"seriesID" : "WPUFD49104"
}, {
"seriesID" : "WPSFD49104"
}, {
"seriesID" : "LNS13000000"
}, {
"seriesID" : "LNS11300000"
}, {
I tried using a JoltTransformJSON and a ReplaceText processor in NiFi, but I am not able to remove these brackets. Any help is appreciated.

You won't be able to do that with a JSON library, because the desired result is not valid JSON. Without array brackets around it, the input is not valid JSON either.
If you had array brackets around the input, you could use the ConvertRecord processor and choose a FreeFormTextRecordSetWriter. There you can choose to output things like:
"seriesID": "${seriesID}",
That will give you the output you described (except for the extra comma at the end).
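For the sample input above, the writer would emit one line per record, roughly like this (note the trailing comma just mentioned):
"seriesID": "CUUR0000SA0",
"seriesID": "CUSR0000SA0",
"seriesID": "LNS14000000",
...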
Alternatively you should be able to replace something like \n}, { with , in ReplaceText, although you will likely need another ReplaceText to get rid of the very first and last curly braces.
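As a rough sketch of that ReplaceText configuration (the regex here is an assumption; adjust it for the whitespace actually present in your data):
Replacement Strategy: Regex Replace
Evaluation Mode: Entire text
Search Value: \}\s*,\s*\{
Replacement Value: ,
A second ReplaceText (or a tweak to the regex) would still be needed to get rid of the very first and last curly braces, as noted above.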
Lastly, you could always use ExecuteScript to read in the input and write it out as you see fit.

The exact output you are looking for is not valid JSON, so a Jolt transformation cannot produce it directly. I can, however, suggest two possible Jolt-based alternatives for your query:
Input:
[
  { "seriesID": "CUUR0000SA0" },
  { "seriesID": "CUSR0000SA0" }
]
Option 1: append the array index to each seriesID key to differentiate the keys from one another, e.g. seriesID0, seriesID1.
Spec for this:
[
  {
    "operation": "shift",
    "spec": {
      "*": {
        "seriesID": "seriesID&1"
      }
    }
  }
]
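For the two-record input above, this produces keys suffixed with the array index:
{
  "seriesID0": "CUUR0000SA0",
  "seriesID1": "CUSR0000SA0"
}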
Option 2: create a JSON array under a single seriesID key with all the values in it. Spec for this:
[
  {
    "operation": "shift",
    "spec": {
      "*": {
        "seriesID": "seriesID.[&1]"
      }
    }
  }
]
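For the same input, this spec gathers all the values into one array:
{
  "seriesID": [ "CUUR0000SA0", "CUSR0000SA0" ]
}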

Related

Enriching the Data in Elastic Search

We will be ingesting data into an index (Index1); however, one of the fields in the document (field1) is an ENUM value, which needs to be converted into a string value using a lookup through a REST API call.
The REST API call returns a JSON response like this, which has the string values for all the ENUMs:
{
  "values" : {
    "ENUMVALUE1" : "StringValue1",
    "ENUMVALUE2" : "StringValue2"
  }
}
I am thinking of building an index from this response document and using that for the lookup.
The incoming document has field1 set to ENUMVALUE1 or ENUMVALUE2 (only one of them), and we eventually want to save StringValue1 or StringValue2 under field1 in the document, not ENUMVALUE1.
I went through the documentation of the enrich processor, but I am not sure if that is the correct approach for this scenario.
While forming the match enrich policy, I am not sure how match_field and enrich_fields should be configured.
Could you please advise whether this can be done in Elasticsearch and, if so, what options I have in case the above is not an optimal approach.
OK, 150-200 enums might not be enough to justify an enrich index, but here is a potential solution.
You first need to build the source index containing all the enum mappings; it would look like this:
POST enums/_doc/_bulk
{"index":{}}
{"enum_id": "ENUMVALUE1", "string_value": "StringValue1"}
{"index":{}}
{"enum_id": "ENUMVALUE2", "string_value": "StringValue2"}
Then you need to create an enrich policy out of this index:
PUT /_enrich/policy/enum-policy
{
  "match": {
    "indices": "enums",
    "match_field": "enum_id",
    "enrich_fields": [ "string_value" ]
  }
}
POST /_enrich/policy/enum-policy/_execute
Once it's built (with 200 values it should take a few seconds), you can build your ingest pipeline using an enrich processor:
PUT _ingest/pipeline/enum-pipeline
{
  "description": "Enum enriching pipeline",
  "processors": [
    {
      "enrich": {
        "policy_name": "enum-policy",
        "field": "field1",
        "target_field": "tmp"
      }
    },
    {
      "set": {
        "if": "ctx.tmp != null",
        "field": "field1",
        "value": "{{tmp.string_value}}"
      }
    },
    {
      "remove": {
        "if": "ctx.tmp != null",
        "field": "tmp"
      }
    }
  ]
}
Testing this pipeline, we get this:
POST _ingest/pipeline/enum-pipeline/_simulate
{
  "docs": [
    { "_source": { "field1": "ENUMVALUE1" } },
    { "_source": { "field1": "ENUMVALUE4" } }
  ]
}
Results =>
{
  "docs" : [
    {
      "doc" : {
        "_source" : {
          "field1" : "StringValue1"      <--- value has been replaced
        }
      }
    },
    {
      "doc" : {
        "_source" : {
          "field1" : "ENUMVALUE4"        <--- value has NOT been replaced
        }
      }
    }
  ]
}
For the sake of completeness, I'm sharing the other solution without an enrich index, so you can test both and use whichever makes most sense for you.
In this second option, we're simply going to use an ingest pipeline with a script processor whose parameters contain a map of your enums. field1 will be replaced by whatever value is mapped to the enum value it contains, or will keep its value if there's no corresponding enum value.
PUT _ingest/pipeline/enum-pipeline
{
  "description": "Enum enriching pipeline",
  "processors": [
    {
      "script": {
        "source": """
          ctx.field1 = params.getOrDefault(ctx.field1, ctx.field1);
        """,
        "params": {
          "ENUMVALUE1": "StringValue1",
          "ENUMVALUE2": "StringValue2",
          ... // add all your enums here
        }
      }
    }
  ]
}
Testing this pipeline, we get this:
POST _ingest/pipeline/enum-pipeline/_simulate
{
  "docs": [
    { "_source": { "field1": "ENUMVALUE1" } },
    { "_source": { "field1": "ENUMVALUE4" } }
  ]
}
Results =>
{
  "docs" : [
    {
      "doc" : {
        "_source" : {
          "field1" : "StringValue1"      <--- value has been replaced
        }
      }
    },
    {
      "doc" : {
        "_source" : {
          "field1" : "ENUMVALUE4"        <--- value has NOT been replaced
        }
      }
    }
  ]
}
So both solutions would work for your case; you just need to pick the one that is the best fit. Just know that with the first option, if your enums change, you'll need to rebuild your source index and enrich policy, while with the second option you only need to modify the parameters map of your pipeline.

Nifi MergeRecord Processor to merge null values

I am splitting the list of fields and trying to merge them at the end. I have two kinds of fields, standard fields and custom fields. The way I process custom fields is different from the way I process standard fields.
{
"standardfield1" : "fieldValue1",
"customField1" : "customValue"
}
This has to be translated into:
{
  "standardfield1" : "fieldValue1",
  "customFields" : [
    {
      "type" : "customfield",
      "id" : 1212,   // this is the id of customField1, retrieved at run time
      "value" : "customValue"
    }
  ]
}
My MergeRecord schema is set to:
{
  "name": "custom field",
  "namespace": "nifi",
  "type": "record",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "type", "type": "string" },
    { "name": "value", "type": "string" }
  ]
}
As per my need, I am setting the content of the standard fields into a new flowfile attribute (so I can extract them from there) and putting an empty value in the flowfile content.
So both custom fields and standard fields are connected to the MergeRecord processor.
It works fine as long as custom fields are available in the payload. If there are only standard fields and no custom fields, the MergeRecord processor won't merge anything and also won't fail; it just throws a NullPointerException, and the flowfile is stuck in the queue forever.
I want to make the MergeRecord processor merge even the flowfiles with empty content.
Any help would be appreciated.
I'm not sure I fully understand your use case, but for your input above, if you have extracted/populated the ID for customField1 into an attribute (let's call it myId), then you could use JoltTransformJSON to get your desired output above, using this Chain spec:
[
  {
    "operation": "shift",
    "spec": {
      "standardfield1": "standardfield1",
      "customField*": {
        "#": "customFields.[&(1,1)].value",
        "#customfield": "customFields.[&(1,1)].type",
        "#${myId}": "customFields.[&(1,1)].id"
      }
    }
  },
  {
    "operation": "remove",
    "spec": {
      "customFields": {
        "0": ""
      }
    }
  },
  {
    "operation": "modify-overwrite-beta",
    "spec": {
      "customFields": {
        "*": {
          "id": "=toInteger"
        }
      }
    }
  }
]
This will create the customFields array if there is a customField present, and populate it with the values you have above (including the value of the myId attribute). You can tweak things (like adding a Default spec to the above Chain) to add an empty array for customFields if you wish (e.g., to keep the schema happy).
If I've misunderstood what you're trying to do, please let me know and I will do my best to help.

Elasticsearch indexed search template generates empty strings in array

First of all, this is taken from the documentation:
Passing an array of strings
GET /_search/template
{
  "template": {
    "query": {
      "terms": {
        "status": [
          "{{#status}}",
          "{{.}}",
          "{{/status}}"
        ]
      }
    }
  },
  "params": {
    "status": [ "pending", "published" ]
  }
}
which is rendered as:
{
  "query": {
    "terms": {
      "status": [ "pending", "published" ]
    }
  }
}
However, in my scenario I've done exactly the same template (at least I think so), but it produces a slightly different output for me:
.."filter" : {
"bool" : {
"must" : [{
"terms" : {
"myTerms" : [
"{{#myTerms}}",
"{{.}}",
"{{/myTerms}}"
],
"_cache" : true
}
}
]
}
}..
That's how I call it later:
GET /passport/_search/template
{
  "template": {
    "id": "myTemplate"
  },
  "params": {
    "myTerms": ["1", "2"]
  }
}
However it's rendered as:
.."myTerms" : ["", "1", "2", ""]..
That wouldn't be an issue, but the myTerms values are stored as integers and I would like to keep it that way (although if strings are the only solution, fine, I can live with it). The query then throws an exception that it cannot convert "" into an integer type, which is expected behaviour:
NumberFormatException[For input string: \"\"]
How should I deal with this? I don't want to store my templates as files; I prefer them being indexed.
This SO question looked promising: Pass an array of integers to ElasticSeach template, but it's not clear and the answer didn't solve my issue (I wasn't allowed to store my template like that).
Elasticsearch version used: 1.6.0
Please advise.
I've seen this requirement before and the solution looks hacky, but it works. Basically, the commas in the template are the problem: Mustache will go over the array and, for each element, output the element itself - {{.}} - but also the commas you are specifying between {{#myTerms}} and {{/myTerms}}.
Also, in your case you shouldn't use double quotes around {{.}}, because the element itself will then be surrounded with double quotes. That's why you are seeing "1" in the result. If you want to match numbers, that should be a list of numbers, not strings.
So, first of all, get rid of those double quotes. This means surrounding the whole template with double quotes and escaping any double quotes that should make it into the final result (you'll understand shortly from the example below).
Secondly, the hacky part is to emit the commas between the values while skipping the last one; that is, something like 1,2,4 shouldn't end with a trailing comma. The solution is to provide the parameters as a list of tuples - one element of the tuple is the value itself, the other is a boolean: [{"value":1,"comma":true},{"value":2,"comma":true},{"value":4}]. If comma is true, Mustache puts the ,; otherwise it doesn't (this case is for the last element in the array).
POST /_search/template/myTemplate
{"template":"{\"filter\":{\"bool\":{\"must\":[{\"terms\":{\"myTerms\":[{{#myTerms}}{{value}}{{#comma}},{{/comma}}{{/myTerms}}],\"_cache\":true}}]}}}"}
And this is how you should pass the parameters:
{
  "template": {
    "id": "myTemplate"
  },
  "params": {
    "myTerms": [{"value":1,"comma":true},{"value":2,"comma":true},{"value":4}]
  }
}
What this does is to generate something like this:
{
  "filter": {
    "bool": {
      "must": [
        {
          "terms": {
            "myTerms": [1,2,4],
            "_cache": true
          }
        }
      ]
    }
  }
}
Try this out (using the 'toJson' function):
GET /_search/template
{
  "template": {
    "query": {
      "terms": {
        "status": {{#toJson}}status{{/toJson}}
      }
    }
  },
  "params": {
    "status": [ "pending", "published" ]
  }
}
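Assuming your Elasticsearch version supports the toJson Mustache extension, this renders the whole array in one go, with no empty strings and no extra quoting of the values:
{
  "query": {
    "terms": {
      "status": [ "pending", "published" ]
    }
  }
}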

Elasticsearch mapping - different data types in same field

I am trying to create a mapping that will allow me to have a document looking like this:
{
  "created_at" : "2014-11-13T07:51:17+0000",
  "updated_at" : "2014-11-14T12:31:17+0000",
  "account_id" : 42,
  "attributes" : [
    {
      "name" : "firstname",
      "value" : "Morten",
      "field_type" : "string"
    },
    {
      "name" : "lastname",
      "value" : "Hauberg",
      "field_type" : "string"
    },
    {
      "name" : "dob",
      "value" : "1987-02-17T00:00:00+0000",
      "field_type" : "datetime"
    }
  ]
}
The attributes array must be of type nested and dynamic, so I can add more objects to the array and index them by the field_type value.
Is this even possible?
I have been looking at dynamic_templates. Can I use that?
You actually can index multiple datatypes into the same field using a multi-field mapping and the ignore_malformed parameter, as long as you are willing to query the specific sub-field when you want to do type-specific queries (like comparisons).
This allows Elasticsearch to populate the sub-fields that are pertinent for each input and ignore the others. It also means you don't need to do anything in your indexing code to deal with the different types.
For example, for a field called user_input that you want to be able to do date or integer range queries over if that is what the user has entered, or a regular text search if the user has entered a string, you could do something like the following:
PUT multiple_datatypes
{
  "mappings": {
    "_doc": {
      "properties": {
        "user_input": {
          "type": "text",
          "fields": {
            "numeric": {
              "type": "double",
              "ignore_malformed": true
            },
            "date": {
              "type": "date",
              "ignore_malformed": true
            }
          }
        }
      }
    }
  }
}
We can then add a few documents with different user inputs:
PUT multiple_datatypes/_doc/1
{
"user_input": "hello"
}
PUT multiple_datatypes/_doc/2
{
"user_input": "2017-02-12"
}
PUT multiple_datatypes/_doc/3
{
"user_input": 5
}
When you search over these, ranges and other type-specific queries work as expected:
// Returns only document 2
GET multiple_datatypes/_search
{
  "query": {
    "range": {
      "user_input.date": {
        "gte": "2017-01-01"
      }
    }
  }
}
// Returns only document 3
GET multiple_datatypes/_search
{
  "query": {
    "range": {
      "user_input.numeric": {
        "lte": 9
      }
    }
  }
}
// Returns only document 1
GET multiple_datatypes/_search
{
  "query": {
    "term": {
      "user_input": {
        "value": "hello"
      }
    }
  }
}
I wrote about this as a blog post here
No - you cannot have different datatypes for the same field within the same type.
For example, the field index/type/value cannot be both a string and a date.
A dynamic template can be used to set the datatype and analyzer based on the format of the field name.
For example:
set all fields with field names ending in "_dt" to type date.
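A minimal sketch of such a dynamic template (the index, type, and template names here are illustrative, and the exact mapping syntax varies between Elasticsearch versions):
PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "dates_by_suffix": {
            "match": "*_dt",
            "mapping": {
              "type": "date"
            }
          }
        }
      ]
    }
  }
}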
But this won't help in your scenario: once the datatype is set, you can't change it.

ElasticSearch filter by array item

I have the following record in ES:
"authInput" : {
"uID" : "foo",
"userName" : "asdfasdfasdfasdf",
"userType" : "External",
"clientType" : "Unknown",
"authType" : "Redemption_regular",
"uIDExtensionFields" :
[
{
"key" : "IsAccountCreation",
"value" : "true"
}
],
"externalReferences" : []
}
"uIDExtensionFields" is an array of key/value pairs. I want to query ES to find all records where:
"uIDExtensionFields.key" = "IsAccountCreation"
AND "uIDExtensionFields.value" = "true"
This is the filter that I think I should be using, but it never returns any data:
GET devdev/authEvent/_search
{
  "size": 10,
  "filter": {
    "and": {
      "filters": [
        {
          "term": {
            "authInput.uIDExtensionFields.key": "IsAccountCreation"
          }
        },
        {
          "term": {
            "authInput.uIDExtensionFields.value": "true"
          }
        }
      ]
    }
  }
}
Any help you guys could give me would be much appreciated.
Cheers!
UPDATE: with the help of the responses below, here is how I solved my problem:
Lowercased the value I was searching for (changed "IsAccountCreation" to "isaccountcreation")
Updated the mapping so that "uIDExtensionFields" is a nested type
Updated my filter to the following:
GET devhilden/authEvent/_search
{
  "size": 10,
  "filter": {
    "nested": {
      "path": "authInput.uIDExtensionFields",
      "query": {
        "bool": {
          "must": [
            {
              "term": {
                "authInput.uIDExtensionFields.key": "isaccountcreation"
              }
            },
            {
              "term": {
                "authInput.uIDExtensionFields.value": "true"
              }
            }
          ]
        }
      }
    }
  }
}
There are a few things probably going wrong here.
First, as mconlin points out, you probably have a mapping with the standard analyzer for your key field. It'll lowercase the key. You probably want to specify "index": "not_analyzed" for the field.
Secondly, you'll have to use nested mappings for this document structure and specify the key and the value in a nested filter. That's because otherwise, you'll get a match for the following document:
"uIDExtensionFields" : [
{
"key" : "IsAccountCreation",
"value" : "false"
},
{
"key" : "SomeOtherField",
"value" : "true"
}
]
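A rough sketch of such a nested mapping for your structure (the index and type names are taken from your query; making key and value not_analyzed also addresses the first point, and you will most likely need to reindex for the mapping change to take effect):
PUT devdev/_mapping/authEvent
{
  "properties": {
    "authInput": {
      "properties": {
        "uIDExtensionFields": {
          "type": "nested",
          "properties": {
            "key": { "type": "string", "index": "not_analyzed" },
            "value": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}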
Thirdly, you'll want to use the bool filter's must rather than and, to ensure proper cacheability.
Lastly, you'll want to put your filter in the filtered query. The top-level filter is for when you want hits to be filtered but facets/aggregations not to be; that's why it was renamed to post_filter in 1.0.
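Putting these together, one way to write the search (a sketch against the 1.x query DSL, assuming the nested and not_analyzed mapping sketched above) would be:
GET devdev/authEvent/_search
{
  "size": 10,
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "nested": {
          "path": "authInput.uIDExtensionFields",
          "filter": {
            "bool": {
              "must": [
                { "term": { "authInput.uIDExtensionFields.key": "IsAccountCreation" } },
                { "term": { "authInput.uIDExtensionFields.value": "true" } }
              ]
            }
          }
        }
      }
    }
  }
}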
Here are a few resources you'll want to check out:
Troubleshooting Elasticsearch searches, for Beginners covers the first two issues.
Managing Relations in ElasticSearch covers nested docs (and parent/child)
all about elasticsearch filter bitsets covers and vs. bool.
