Apache NiFi: merge SQL rows into JSON

I have a SQL database, and I extract some rows and transform them into JSON to feed a MongoDB. I'm stuck at the transformation step. I have tried this flow:
The process stalls on the MergeRecord processor, and I don't know why.
The aim is to transform this kind of (simplified) SQL query result:
ID ROUTE_CODE STATUS SITE_ID SITE_CODE
379619 1801300001 10 220429 100001
379619 1801300001 10 219414 014037
379619 1801300001 10 220429 100001
379620 1801300002 10 220429 100001
379620 1801300002 10 219454 014075
379620 1801300002 10 220429 100001
To this JSON:
[
  {
    "routeId": "379619",
    "routeCode": "1801300001",
    "routeStatus": "10",
    "sites": [
      { "siteId": "220429", "siteCode": "100001" },
      { "siteId": "219414", "siteCode": "014037" }
    ]
  },
  {
    "routeId": "379620",
    "routeCode": "1801300002",
    "routeStatus": "10",
    "sites": [
      { "siteId": "220429", "siteCode": "100001" },
      { "siteId": "219454", "siteCode": "014075" }
    ]
  }
]
MergeRecord should group by routeId; also, I don't yet know the correct Jolt transform to group the sites into an array...

The flow is stuck because back-pressure has engaged on the queue between ConvertAvroToJson and MergeRecord, which you can see from the red indicator showing that the queue has reached its maximum size of 10k flow files. This means the ConvertAvroToJson processor will no longer execute until the queue drops back below its threshold, but MergeRecord is likely waiting for more flow files, so the queue isn't going to shrink.
You could change the settings on the queue to increase the threshold to be higher than the number of records you are waiting for, or you could implement the flow differently...
After ExecuteSQL, it looks like three processors are being used to split, convert to JSON, and merge back together. This could be done much more efficiently by not splitting at all and just using ConvertRecord with an Avro reader and a JSON writer; that way you can go ExecuteSQL -> ConvertRecord -> JOLT.
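For the sample rows above, the flow file coming out of ConvertRecord (Avro reader, JSON writer) would then contain a single flat JSON array along these lines (a sketch showing only the first route's rows; whether the values are strings or numbers depends on the source column types):
[
  { "ID": "379619", "ROUTE_CODE": "1801300001", "STATUS": "10", "SITE_ID": "220429", "SITE_CODE": "100001" },
  { "ID": "379619", "ROUTE_CODE": "1801300001", "STATUS": "10", "SITE_ID": "219414", "SITE_CODE": "014037" },
  ...
]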
Also, you may want to look at JoltTransformRecord as an alternative to JoltTransformJson.

After ExecuteSQL (or ExecuteSQLRecord), you can then use PartitionRecord with the following user-defined properties added (property name is left of =, value to the right):
routeId = /ID
routeCode = /ROUTE_CODE
routeStatus = /STATUS
PartitionRecord should use a JSON writer, then you can use JoltTransformJson with the following spec:
[
  {
    "operation": "shift",
    "spec": {
      "*": {
        "ID": "routeId",
        "ROUTE_CODE": "routeCode",
        "STATUS": "routeStatus",
        "SITE_ID": "sites[#2].siteId",
        "SITE_CODE": "sites[#2].siteCode"
      }
    }
  },
  {
    "operation": "modify-overwrite-beta",
    "spec": {
      "routeId": "=firstElement(@(1,routeId))",
      "routeCode": "=firstElement(@(1,routeCode))",
      "routeStatus": "=firstElement(@(1,routeStatus))"
    }
  }
]
That will group each of the site IDs/codes into the sites field. Then you just need MergeRecord to patch them back together. Unfortunately PartitionRecord doesn't yet support the fragment.* attributes (I have written up NIFI-6139 to cover this improvement), so MergeRecord won't be able to guarantee that all the transformed records from the original input file will be in the same merged flow file. However each merged flow file will contain records with the sites array for some number of unique routeId/routeCode/routeStatus values.
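For example, a partition holding the two distinct rows for routeId 379619 would come out of that spec roughly like this (a sketch; note that the duplicated source rows in the sample would also produce duplicated sites entries unless you remove them first, e.g. with DISTINCT in the SQL):
{
  "routeId": "379619",
  "routeCode": "1801300001",
  "routeStatus": "10",
  "sites": [
    { "siteId": "220429", "siteCode": "100001" },
    { "siteId": "219414", "siteCode": "014037" }
  ]
}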

Related

Remove ECS data from metricbeat for smaller documents

I use the graphite beat to get Graphite-protocol metrics into ES.
The metric document is much bigger than the metric data itself (timestamp, value, metric name).
I also get all the ECS data inserted, and I think it will make my queries slower (and my documents bigger), and I don't need this data.
Can I remove the ECS data somehow in the metricbeat configuration?
You might be able to use Metricbeat's drop_fields processor, but it might not be able to remove all the fields you specify as some are added after the processor chain.
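For reference, a drop_fields attempt in metricbeat.yml would look roughly like this (a sketch; the listed fields are only examples, and a few fields such as @timestamp cannot be dropped this way):
processors:
  - drop_fields:
      fields: ["agent", "ecs", "host.os"]
      ignore_missing: true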
Acting on the ES side instead guarantees that you can change the event source the way you like. Also, if you have many Beats deployed, you only need to configure this in a single place.
One way to achieve this is to create an index template for Metricbeat events and attach an ingest pipeline to it.
PUT _index_template/my-template
{
  "index_patterns": [
    "metricbeat-*"
  ],
  "template": {
    "settings": {
      "index": {
        "lifecycle": {
          "name": "metric-lifecycle"
        },
        "codec": "best_compression",
        "default_pipeline": "metric-pipeline"
      }
    },
    ...
Then the metric-pipeline would simply look like this and remove all the fields listed in the field array:
PUT _ingest/pipeline/metric-pipeline
{
  "processors": [
    {
      "remove": {
        "field": ["agent", "host", "..."]
      }
    }
  ]
}

How to handle arrays with QueryRecord?

I'm working in Apache NiFi and I have a question: how do I handle nested arrays in JSON with the QueryRecord processor? For example, I have this JSON:
{
  "offerName": "Viatti Strada Asimmetrico V-130 205/55 R16 91V",
  "detailedStats": [
    {
      "type": "mobile",
      "clicks": 4,
      "spending": "2.95"
    }
  ]
}
How can I extract the array to get the following result:
{
"offerName": "Viatti Strada Asimmetrico V-130 205/55 R16 91V",
"type": "mobile",
"clicks": 4,
"spending": "2.95"
}
I read about RPATH, but didn't find good examples.
I tried:
SELECT RPATH(detailedStats, '/detailedStats[1]')
FROM FLOWFILE
But it throws an error. How can I get the expected result with RPATH?
You can select like below via QueryRecord. However, it seems you are having an issue while writing; I used JsonRecordSetWriter with Inherit Record Schema. This is a good tutorial if you prefer an Avro schema.
SELECT offerName,
RPATH_STRING(detailedStats, '/type') type,
RPATH_INT(detailedStats, '/clicks') clicks,
RPATH_STRING(detailedStats, '/spending') spending
FROM FLOWFILE
The result is an array, so you should split it with $.* downstream.
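Assuming the query behaves as described and the writer is JsonRecordSetWriter, the output flow file would look roughly like this; SplitJson with $.* then turns it into the single object you want:
[ {
  "offerName": "Viatti Strada Asimmetrico V-130 205/55 R16 91V",
  "type": "mobile",
  "clicks": 4,
  "spending": "2.95"
} ]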
An alternative method might be adding a JoltTransformJSON processor with a shift-type specification, reached from the Advanced button of the Settings tab, with the following spec
[
  {
    "operation": "shift",
    "spec": {
      "detailedStats": {
        "*": {
          "@(2,offerName)": "offerName",
          "*": "&"
        }
      }
    }
  }
]
in order to extract your desired result.

How to iterate json in each Flowfile in Nifi?

For example, there are 8 FlowFiles, and I've converted JSON to attributes for each FlowFile, as follows:
I've added 5 properties and values with EvaluateJsonPath, as shown in the picture.
If I need to convert 1000 attributes, setting 1000 properties/values with EvaluateJsonPath is too much trouble.
What can I do to make this easier?
Any help is appreciated! TIA
You don't have to (and shouldn't) split the individual JSON objects out of the array if you intend to keep them as a group (i.e. merge them back in). In most cases the split-transform-merge pattern has been replaced by record-based processors such as UpdateRecord or JoltTransformRecord. In your case since the data is JSON you can use JoltTransformJSON with the following spec to change the ID field to ID2 without splitting up the array:
[
  {
    "operation": "shift",
    "spec": {
      "*": {
        "ID": "[#2].ID2",
        "*": "[#2].&"
      }
    }
  }
]
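For illustration (the real field names are in your screenshot, so apart from ID these are made up), an input array like the first snippet would come out as the second:
Input:
[ { "ID": 1, "name": "a" }, { "ID": 2, "name": "b" } ]
Output:
[ { "ID2": 1, "name": "a" }, { "ID2": 2, "name": "b" } ]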
Note that you can also do this (especially for non-JSON input/output) with JoltTransformRecord, the only difference being that the spec is applied to each object in the array rather than JoltTransformJSON which applies the spec to the entire array. The JoltTransformRecord spec would look like this:
[
  {
    "operation": "shift",
    "spec": {
      "ID": "ID2",
      "*": "&"
    }
  }
]

How to extract and visualize values from a log entry in OpenShift EFK stack

I have an OKD cluster setup with EFK stack for logging, as described here. I have never worked with one of the components before.
One deployment logs requests that contain a specific value that I'm interested in. I would like to extract just this value and visualize it with an area map in Kibana that shows the amount of requests and where they come from.
The content of the message field basically looks like this:
[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}
This plz is a German zip code, which I would like to visualize as described.
My problem here is that I have no idea how to extract this value.
A nice first success would be if I could find it with a regexp, but Kibana doesn't seem to work the way I think it does. Following its docs, I expect /\"plz\":\"[0-9]{5}\"/ to deliver the result, but I get 0 hits (the time interval is set correctly). Even if this regexp matched, I would only find the log entries that contain it, not the specific value itself. How do I go on from here?
I guess I also need an external geocoding service, but at which point would I include it? Or does Kibana itself know how to map zip codes to geometries?
A beginner-friendly step-by-step guide would be perfect, but I could settle for some hints that guide me there.
It would be possible to parse the message field as the document gets indexed into ES, using an ingest pipeline with a grok processor.
First, create the ingest pipeline like this:
PUT _ingest/pipeline/parse-plz
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{POSINT:plz}"
        ]
      }
    }
  ]
}
Then, when you index your data, you simply reference that pipeline:
PUT plz/_doc/1?pipeline=parse-plz
{
"message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}"""
}
And you will end up with a document like the one below, which now has a field called plz with the 12345 value in it:
{
"message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
"plz": "12345"
}
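If you want to check the pipeline without indexing anything, you can also run the same document through the simulate API first (optional, purely for testing):
POST _ingest/pipeline/parse-plz/_simulate
{
  "docs": [
    {
      "_source": {
        "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}"""
      }
    }
  ]
}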
When indexing your document from Fluentd, you can specify a pipeline to be used in the configuration. If you can't or don't want to modify your Fluentd configuration, you can also define a default pipeline for your index that will kick in every time a new document is indexed. Simply run this on your index and you won't need to specify ?pipeline=parse-plz when indexing documents:
PUT index/_settings
{
"index.default_pipeline": "parse-plz"
}
If you have several indexes, a better approach might be to define an index template instead, so that whenever a new index called project.foo-something is created, the settings are going to be applied:
PUT _template/project-indexes
{
  "index_patterns": ["project.foo*"],
  "settings": {
    "index.default_pipeline": "parse-plz"
  }
}
Now, in order to map that PLZ on a map, you'll first need to find a data set that provides you with geolocations for each PLZ.
You can then add a second processor in your pipeline in order to do the PLZ/ZIP to lat,lon mapping:
PUT _ingest/pipeline/parse-plz
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{POSINT:plz}"
        ]
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "ctx.location = params[ctx.plz];",
        "params": {
          "12345": {"lat": 42.36, "lon": 7.33}
        }
      }
    }
  ]
}
Ultimately, your document will look like this and you'll be able to leverage the location field in a Kibana visualization:
{
  "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
  "plz": "12345",
  "location": {
    "lat": 42.36,
    "lon": 7.33
  }
}
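One detail worth noting: for Kibana's map visualizations the location field has to be mapped as a geo_point, so the index template from above should also carry a mapping along these lines (a sketch extending the template shown earlier):
PUT _template/project-indexes
{
  "index_patterns": ["project.foo*"],
  "settings": {
    "index.default_pipeline": "parse-plz"
  },
  "mappings": {
    "properties": {
      "location": { "type": "geo_point" }
    }
  }
}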
So to sum it all up, it all boils down to only two things:
Create an ingest pipeline to parse documents as they get indexed
Create an index template for all project* indexes whose settings include the pipeline created in step 1

Using elastic search to build flow/funnel results based on unique identifiers

I want to be able to return a set of counts of individual documents from a single index based on a previous set of results, and am wondering if there is a way to do it without running a separate query for each.
So, given a data set like this (simplified version of my ES documents):
{
"name": "visit",
"sessionId": "session1"
},
{
"name": "visit",
"sessionId": "session2"
},
{
"name": "visit",
"sessionId": "session3"
},
{
"name": "click",
"sessionId": "session1"
},
{
"name": "click",
"sessionId": "session3"
}
What I would like to do is search for name: visit and get a count of all those; that part is easy. But I would also like to count the name: click docs whose sessionId appears in the name: visit result set, and return that count alongside the name: visit count.
Is there an easy way to do this? I have looked at the aggregation APIs, but they all seem to not quite fit my needs. There also seems to be a parent/child relationship, but it doesn't apply to my situation since both kinds of documents I want individual counts of are of the same type.
Expected result would be something like this:
{
  "count": {
    // total number of visit events since this is my start point
    "visit": 3,
    // the amount of click results that have sessionId
    // matching my previous search's sessionId values
    "click": 2
  }
}
At first glance, you need to do this in two queries:
the first aggregation query to retrieve the sessionIds and
a second aggregation query filtered with those sessionIds to find the count of clicks.
I don't think it's a big deal to run those two queries, but that depends on how much data you have and how many sessionIds you want to retrieve at once.
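Roughly, the two queries could look like this (a sketch; the index name events is made up, and name/sessionId are assumed to be keyword fields, otherwise use their .keyword sub-fields). The first returns the visit count in hits.total and the sessionIds in the terms aggregation; the second, fed with those sessionIds, returns the click count in hits.total:
POST events/_search
{
  "size": 0,
  "query": { "term": { "name": "visit" } },
  "aggs": {
    "sessions": {
      "terms": { "field": "sessionId", "size": 10000 }
    }
  }
}

POST events/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "name": "click" } },
        { "terms": { "sessionId": ["session1", "session3"] } }
      ]
    }
  }
}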
