NiFi's PutHDFS, custom file name based on content

In NiFi, the PutHDFS processor is used to ingest data into an HDFS directory. There are 100+ possible file types (all in JSON format), and each JSON document starts with its file type. This file type should become the file name prefix. How can this be achieved? Please advise.
{
  "FILE_TYPE_1" : [
    {
      "ORG_FIELD_1" : 38,
      "ORG_FIELD_2" : 1,
      "ORG_FIELD_3" : "Per Km",
      "ORG_FIELD_4" : "x1",
      "ORG_FIELD_5" : 1,
      "ORG_FIELD_6" : 10.0,
      "ORG_FIELD_7" : 0.0,
      "ORG_FIELD_8" : 0.0,
      "ORG_FIELD_9" : 96.0,
      "ORG_FIELD_10" : 0
    }
  ]
}

You can use EvaluateJsonPath to extract the name prefix into an attribute, then use UpdateAttribute to change the filename.
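A minimal sketch of one way to wire this up (the attribute name filetype and the regex are assumptions; note the prefix here is the top-level JSON key rather than a value, so if your NiFi's bundled JsonPath supports the keys() function you could use EvaluateJsonPath instead of ExtractText):

ExtractText processor (dynamic property capturing the first quoted key, adjust the regex to your data):
  filetype = ^\s*\{\s*"([^"]+)"

UpdateAttribute processor (prefix the existing filename with the captured file type):
  filename = ${filetype}_${filename}

PutHDFS then writes the FlowFile using the updated filename attribute.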

Related

Kafka Connect Sink Connector - Single Message Transforms - Filter by value(s)

Context:
I am writing a custom Kafka JDBC Sink Connector configuration with a filter transform (SMT). I want to filter the records by a particular field and value(s).
Static way (this works):
{
  "connector.class" : "io.confluent.connect.jdbc.JdbcSinkConnector",
  ...
  "transforms" : "filterUnits",
  "transforms.filterUnits.type" : "io.confluent.connect.transforms.Filter$Value",
  "transforms.filterUnits.filter.condition" : "$[?(#.units in [30, 35, 40, 45, 50])]",
  "transforms.filterUnits.filter.type" : "include"
}
I want to read the filtering values from a file, like below.
Dynamic way (read from file):
{
  "connector.class" : "io.confluent.connect.jdbc.JdbcSinkConnector",
  ...
  "transforms" : "filterUnits",
  "transforms.filterUnits.type" : "io.confluent.connect.transforms.Filter$Value",
  "transforms.filterUnits.filter.condition" : "$[?(#.units in ${file:/data/app.properties:FILTER_VALUES})]",
  "transforms.filterUnits.filter.type" : "include"
}
Question: Is there a way to read the filtering values from a file or an environment variable, as above?
References:
https://docs.confluent.io/platform/current/connect/transforms/filter-confluent.html
https://docs.confluent.io/platform/current/connect/transforms/filter-ak.html
https://github.com/json-path/JsonPath#filter-operators
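One mechanism worth exploring here (a sketch, not verified against the Filter SMT specifically) is Kafka Connect's externalized configuration: the ${file:...} placeholder in the dynamic example is exactly the syntax resolved by FileConfigProvider, which has to be enabled in the worker configuration. Assuming /data/app.properties holds the values:

# worker configuration (e.g. connect-distributed.properties)
config.providers=file
config.providers.file.class=org.apache.kafka.common.config.provider.FileConfigProvider

# /data/app.properties
FILTER_VALUES=[30, 35, 40, 45, 50]

The worker resolves the placeholder when the connector configuration is read, so the SMT should receive the expanded list in filter.condition. Recent Kafka versions also ship an EnvVarConfigProvider for reading values from environment variables.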

Databricks Autoloader is getting stuck and does not pass to the next batch

I have a simple job scheduled every 5 minutes. Basically, it listens for cloud files on a storage account and writes them into a Delta table, extremely simple. The code is something like this:
df = (spark
      .readStream
      .format("cloudFiles")
      .option('cloudFiles.format', 'json')
      .load(input_path, schema=my_schema)
      .select(cols)
      .writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", f"{output_path}/_checkpoint")
      .trigger(once=True)
      .start(output_path))
Sometimes there are new files, sometimes not. After 40-60 batches it gets stuck on one particular batchId, as if there were no new files in the folder. If I run the script manually I get the same result: it points to the last actually processed batch.
{
  "id" : "xxx",
  "runId" : "xxx",
  "name" : null,
  "timestamp" : "2022-01-13T15:25:07.512Z",
  "batchId" : 64,
  "numInputRows" : 0,
  "inputRowsPerSecond" : 0.0,
  "processedRowsPerSecond" : 0.0,
  "durationMs" : {
    "latestOffset" : 663,
    "triggerExecution" : 1183
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "CloudFilesSource[/mnt/source/]",
    "startOffset" : {
      "seqNum" : 385,
      "sourceVersion" : 1,
      "lastBackfillStartTimeMs" : 1641982820801,
      "lastBackfillFinishTimeMs" : 1641982823560
    },
    "endOffset" : {
      "seqNum" : 385,
      "sourceVersion" : 1,
      "lastBackfillStartTimeMs" : 1641982820801,
      "lastBackfillFinishTimeMs" : 1641982823560
    },
    "latestOffset" : null,
    "numInputRows" : 0,
    "inputRowsPerSecond" : 0.0,
    "processedRowsPerSecond" : 0.0,
    "metrics" : {
      "numBytesOutstanding" : "0",
      "numFilesOutstanding" : "0"
    }
  } ],
  "sink" : {
    "description" : "DeltaSink[/mnt/db/table_name]",
    "numOutputRows" : -1
  }
}
But if I run only the readStream part, it correctly reads the entire list of files (and starts a new batchId: 0). The strangest part is that I have absolutely no idea what causes it, or why it takes around 40-60 batches to hit this kind of error. Can anyone help, or give me some suggestions?
I was thinking about using foreachBatch() to append new data, or using the trigger .trigger(continuous='5 minutes').
I'm new to Auto Loader.
Thank you so much!
I resolved it by using
.option('cloudFiles.useIncrementalListing', 'false')
My filenames are composed of a flow name + timestamp, like this:
flow_name_2022-01-18T14-19-50.018Z.json
So my guess is that some combination of dots makes RocksDB look in a non-existing directory, which is why it reports "found no new files". Once I disabled incremental listing, RocksDB stopped making its mini checkpoints based on filenames and now reads the whole directory. This is the only explanation that I have.
If anyone is having the same issue, try changing the filenames.
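For reference, a minimal sketch of where the option fits in the source definition from the question (same input_path, my_schema and cloudFiles format as above):

df = (spark
      .readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.useIncrementalListing", "false")  # force a full directory listing instead of RocksDB incremental listing
      .schema(my_schema)
      .load(input_path))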

NiFi Route on JSON Attribute

Trying to use NiFi to route on an attribute.
I am attempting to take a JSON file where two of the JSON records contain the following attributes (there are other JSON documents with different attributes in this file):
{
  "ts" : "2020-10-07T12:00:00.448392Z",
  "uid" : "CHh3F30dkfueLhnxSk",
  "id.orig_h" : "10.10.10.10",
  "id.orig_p" : 19726,
  "id.resp_h" : "172.10.10.20",
  "id.resp_p" : 443,
  "proto" : "tcp",
  "conn_state" : "SH",
  "local_orig" : false,
  "local_resp" : false,
  "missed_bytes" : 0,
  "history" : "F",
  "orig_pkts" : 1,
  "orig_ip_bytes" : 52,
  "resp_pkts" : 0,
  "resp_ip_bytes" : 0
}
{
  "ts" : "2020-10-10T12:00:00.461880Z",
  "uid" : "CdoiLnRscrxO1BSYb",
  "id.orig_h" : "10.10.17.777",
  "id.orig_p" : 40433,
  "id.resp_h" : "172.10.10.77",
  "id.resp_p" : 443,
  "version" : "TLSv12",
  "cipher" : "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
  "curve" : "secp777r1",
  "server_name" : "connect-stackoverflow.questions.com",
  "resumed" : false,
  "established" : true,
  "cert_chain_fuids" : [ "FR84qjkl2342SZLwV7", "Ffweqiof48b8j" ],
  "client_cert_chain_fuids" : [ ],
  "subject" : "CN=connect-ulastm.bentley.com",
  "issuer" : "CN=Let's Encrypt Authority X3,O=Let's Encrypt,C=US",
  "validation_status" : "ok"
}
I want to specifically route on the $.conn_state attribute, but it is not working. I have tried to match the expression with the EvaluateJsonPath processor and passed it to RouteOnAttribute. Here are my settings:
EvaluateJsonPath processor (settings screenshot not included): the processor does not match the JSON or forward the document.
The EvaluateJsonPath is followed by the RouteOnAttribute processor (settings screenshot not included).
I have also attempted to use RouteOnAttribute directly on my JSON record, but it does not appear to pull out or identify the attribute for routing.
How would I do this?
The JSON sample contains more than one object; EvaluateJsonPath expects a single record when matching $.conn_state. To work with multiple records against JSON (or any data stream) you should use QueryRecord. Once it is configured with a record reader and record writer, you just click + in that processor to create a route, and the value is a SELECT statement where conn_state is not null. You can then drag that route relationship to the next processor.
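A sketch of such a QueryRecord dynamic property (the property name conn_state_present is an assumption; the reader and writer would be a JSON record reader and writer configured for this data):

conn_state_present = SELECT * FROM FLOWFILE WHERE conn_state IS NOT NULL

QueryRecord exposes the FlowFile content as a table named FLOWFILE, and each dynamic property becomes an outbound relationship carrying the records that match its query.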

FHIR - Extending an operation parameter (additional parameters in the Patient resource)

https://www.hl7.org/fhir/parameters.html
Is it right to add the additional parameters in an extended operation, or can we add the parameters to the Patient resource type? Because if we have multiple values, we are not able to map the patient data to the extended operation parameter.
How do we add additional parameters to the Patient resource type?
Short answer:
Every element in a resource can have extension child elements to represent additional information that is not part of the basic definition of the resource.
See the HL7 FHIR documentation on Extensibility for detailed info and samples.
Every element in a resource or data type includes an optional "extension" child element that may be present, so we can add the additional parameter with an extension.
Eg:
{
  "resourceType" : "Patient",
  "extension" : [{
    "url" : "http://hl7.org/fhir/StructureDefinition/patient-citizenship",
    "extension" : [{
      "url" : "code",
      "valueCodeableConcept" : {
        "coding" : [{
          "system" : "urn:iso:std:iso:3166",
          "code" : "DE"
        }]
      }
    }, {
      "url" : "period",
      "valuePeriod" : {
        "start" : "2009-03-14"
      }
    }]
  }]
}

How to store a nested document as a string in Elasticsearch

Context:
1) We are building a CDC pipeline (using Kafka and the Connect framework)
2) We are using Debezium for capturing MySQL transaction logs
3) We are using the Elasticsearch connector to add documents to an ES index
Sample change event generated by Debezium:
{
  "source" : {
    "before" : {
      "Id" : 97,
      "name" : "Northland",
      "code" : "NTL",
      "country_id" : 6,
      "is_business_mapped" : 0
    },
    "after" : {
      "Id" : 97,
      "name" : "Northland",
      "code" : "NTL",
      "country_id" : 6,
      "is_business_mapped" : 1
    },
    "source" : {
      "version" : "0.7.5",
      "name" : "__",
      "server_id" : 252639387,
      "ts_sec" : 1547805940,
      "gtid" : null,
      "file" : "mysql-bin-changelog.000570",
      "pos" : 236,
      "row" : 0,
      "snapshot" : false,
      "thread" : 614,
      "db" : "bazaarify",
      "table" : "state"
    },
    "op" : "u",
    "ts_ms" : 1547805939683
  }
}
What we want:
We want to visualize only 3 columns in Kibana:
1) before - containing the nested JSON as a string
2) after - containing the nested JSON as a string
3) source - containing the nested JSON as a string
I can think of the following possibilities here:
a) converting the nested JSON to a string
b) combining the column data in Elasticsearch
I am a newbie to Elasticsearch. Can someone please guide me on how to do this?
I tried defining a custom mapping as well, but it gives me an exception.
You can always view your document as raw JSON in Kibana.
You don't need to manipulate it before indexing into Elasticsearch.
As this is related to visualization, handle it in Kibana only.
Check this link for a screenshot.
Refer to this to add the columns you want to see in the results.
I don't fully understand your use case, but if you would like to turn some JSON objects into their string representations, you can use Logstash for that, or even Elasticsearch ingest capabilities to convert an object (JSON) to a string.
An example of such an ingest pipeline:
PUT _ingest/pipeline/my-pipeline-id
{
  "description" : "converts the content of the source field to a string",
  "processors" : [
    {
      "convert" : {
        "field" : "source",
        "type" : "string"
      }
    }
  ]
}
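To apply the pipeline, reference it when indexing (a sketch using a hypothetical index name; if the documents arrive through the sink connector, the index's default pipeline setting can point at it instead):

POST my-index/_doc?pipeline=my-pipeline-id
{
  "before" : { "Id" : 97, "is_business_mapped" : 0 },
  "after" : { "Id" : 97, "is_business_mapped" : 1 },
  "op" : "u"
}

Note that the convert processor above only handles the source field; you would add one convert processor per field (before, after, source) if all three should be stored as strings.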
