How to get hold of the content? - apache-nifi

I have spent several hours now trying to figure out the Expression Language needed to get hold of the FlowFile content.
I have a simple test flow, to try and learn NiFi, where I have:
GetMongo -> LogAttribute -> PutSlack
-----------------------LOG1-----------------------
Standard FlowFile Attributes
Key: 'entryDate'
Value: 'Wed Sep 28 23:58:36 GMT 2016'
Key: 'lineageStartDate'
Value: 'Wed Sep 28 23:58:36 GMT 2016'
Key: 'fileSize'
Value: '70'
FlowFile Attribute Map Content
Key: 'filename'
Value: '43546945658800'
Key: 'path'
Value: './'
Key: 'uuid'
Value: 'd1e10623-0e90-44af-a620-6bed9776ed62'
-----------------------LOG1-----------------------
{ "_id" : { "$oid" : "57ec27ec35a0759d54fb465d" }, "keyA" : "valueA" }
In the PutSlack processor, as a test, I have tried the following expressions:
${flowfile.content}
${message}
${payload}
${msg}
${flowfile-content}
${content}

There is no Expression Language construct that accesses the content of the FlowFile. Attributes and content are purposely stored very differently in order to make it cheap to move around a FlowFile that could represent a large payload; Expression Language operates on attributes only.
The ExtractText processor can be used to extract the whole content of the FlowFile into an attribute; just keep in mind that this should only be done when you know the content will have no problem fitting in memory.
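For example, a minimal ExtractText configuration along those lines (a sketch; the attribute name content is an arbitrary choice, and property display names may vary slightly by NiFi version) could be:
ExtractText
  Enable DOTALL Mode: true
  Maximum Capture Group Length: 4096
  content (dynamic property): (.*)
Downstream, ${content} would then be usable in PutSlack. Note that ExtractText truncates each capture at its Maximum Capture Group Length (1024 characters by default), so that property may need to be raised for larger documents.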

Related

Nifi InvokeHTTP processor not triggering response relationship

Consider the following flow, which authenticates via HTTP to a service. I'm seeing an HTTP status code of 201 (Created) come back, which should trigger the response relationship/flow. However, as you can see in the log below, only the original relationship is triggered.
The Flow
Green lines indicate "response" flow. Magenta indicates "original" flow.
POST /token properties
Log
You can see here that the original relationship is triggered, but the response is not -- even though the status code, 201, is in the "success" range.
2023-01-29 15:22:08,341 INFO [Timer-Driven Process Thread-7] o.a.n.processors.standard.LogAttribute LogAttribute[id=fe0ace38-0185-1000-376d-8737d0e020f8] logging for flow file StandardFlowFileRecord[uuid=6b9f010a-f287-449c-8bef-94840c5cfa2f,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1674862641879-1, container=default, section=1], offset=13494, length=107],offset=0,name=6b9f010a-f287-449c-8bef-94840c5cfa2f,size=107]
---------------------ORIGINAL---------------------
FlowFile Properties
Key: 'entryDate'
Value: 'Sun Jan 29 15:22:07 UTC 2023'
Key: 'lineageStartDate'
Value: 'Sun Jan 29 15:22:07 UTC 2023'
Key: 'fileSize'
Value: '107'
FlowFile Attribute Map Content
Key: 'filename'
Value: '6b9f010a-f287-449c-8bef-94840c5cfa2f'
Key: 'invokehttp.request.duration'
Value: '738'
Key: 'invokehttp.request.url'
Value: '...'
Key: 'invokehttp.response.url'
Value: '...'
Key: 'invokehttp.status.code'
Value: '201'
Key: 'invokehttp.status.message'
Value: ''
Key: 'invokehttp.tx.id'
Value: 'efca13ac-16a1-4a27-a8e1-d04110d48523'
Key: 'mime.type'
Value: 'application/json'
Key: 'path'
Value: './'
Key: 'responseBody'
Value: '...'
Key: 'uuid'
Value: '6b9f010a-f287-449c-8bef-94840c5cfa2f'
---------------------ORIGINAL---------------------
The only thing I thought of which might be causing an issue is that I'm writing the response body to an attribute. I tried to test by setting this attribute name to an empty string, but that just gives me an error in the log. I assumed that without the attribute name set, the response body would be the FlowFile sent to the response relationship, but that doesn't seem to be working.
Update: I created a second InvokeHTTP processor and replaced the relationships / disabled the old one. The flow worked correctly until I set the Response Body Attribute Name, and then the response relationship stopped triggering. I need to set this attribute though, so I can extract the error message from the response in the case of failure. I think I'll have to enable the Response Generation Required option, and check the status code in the response relationship flow. This is not ideal, though.
When you use Response Body Attribute Name, only the original relationship is triggered. That is InvokeHTTP's documented behaviour; from the documentation:
FlowFile attribute name used to write an HTTP response body for FlowFiles transferred to the Original relationship.
You can handle your case this way:
InvokeHTTP (original relationship) -> RouteOnAttribute with a routing property such as Success = ${invokehttp.status.code:ge(200):and(${invokehttp.status.code:le(299)})}
When you set the Response Body Attribute Name property, it means that you don't want a new FlowFile for the response; you just want to add a new attribute to the existing FlowFile.
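A minimal RouteOnAttribute configuration for that (a sketch; the property name Success is arbitrary) might be:
RouteOnAttribute
  Routing Strategy: Route to Property name
  Success (dynamic property): ${invokehttp.status.code:ge(200):and(${invokehttp.status.code:le(299)})}
FlowFiles matching the expression are then routed to a relationship named Success, which can continue the "succeeded" branch of the flow.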

Using terraform yamldecode to access multi level element

I have a YAML file (also used in an Azure DevOps pipeline, so it needs to stay in this format) which contains some settings I'd like to access directly from my Terraform module.
The file looks something like:
variables:
- name: tenantsList
  value: tenanta,tenantb
- name: unitName
  value: canary
I'd like to have a module like this to access the settings but I can't see how to get to the bottom level:
locals {
  settings = yamldecode(file("../settings.yml"))
}

module "infra" {
  source   = "../../../infra/terraform/"
  unitname = local.settings.variables.unitName
}
But the terraform plan errors with this:
Error: Unsupported attribute

  on canary.tf line 16, in module "infra":
  16:   unitname = local.settings.variables.unitName
    |----------------
    | local.settings.variables is tuple with 2 elements

This value does not have any attributes.
It seems like the main reason this is difficult is because this YAML file is representing what is logically a single map but is physically represented as a YAML list of maps.
When reading data from a separate file like this, I like to write an explicit expression to normalize it and optionally transform it for more convenient use in the rest of the Terraform module. In this case, it seems like having variables as a map would be the most useful representation as a Terraform value, so we can write a transformation expression like this:
locals {
  raw_settings = yamldecode(file("${path.module}/../settings.yml"))

  settings = {
    variables = tomap({
      for v in local.raw_settings.variables : v.name => v.value
    })
  }
}
The above uses a for expression to project the list of maps into a single map using the name values as the keys.
With the list of maps converted to a single map, you can then access it the way you originally tried:
module "infra" {
source = "../../../infra/terraform/"
unitname = local.settings.variables.unitName
}
If you were to output the transformed value of local.settings as YAML, it would look something like this, which is why accessing the map elements directly is now possible:
variables:
  tenantsList: tenanta,tenantb
  unitName: canary
This will work only if all of the name strings in your input are unique, because otherwise there would not be a unique map key for each element.
(Writing a normalization expression like this also doubles as some implicit validation for the shape of that YAML file: if variables were not a list or if the values were not all of the same type then Terraform would raise a type error evaluating that expression. Even if no transformation is required, I like to write out this sort of expression anyway because it serves as some documentation for what shape the YAML file is expected to have, rather than having to study all of the references to it throughout the rest of the configuration.)
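If tenantsList is also meant to be consumed as a list of tenant names rather than as a single comma-separated string (an assumption based on its name), a similar normalization step could split it, for example:
locals {
  # Turns "tenanta,tenantb" into ["tenanta", "tenantb"]
  tenants = split(",", local.settings.variables.tenantsList)
}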
With my multidecoder for YAML and JSON you are able to access multiple YAML and/or JSON files with their relative paths in one step.
Documentations can be found here:
Terraform Registry -
https://registry.terraform.io/modules/levmel/yaml_json/multidecoder/latest?tab=inputs
GitHub:
https://github.com/levmel/terraform-multidecoder-yaml_json
Usage
Place this module wherever you need to access multiple different YAML and/or JSON files (different paths are possible) and pass your path(s) in the filepaths parameter, which takes a set of strings of relative paths to YAML and/or JSON files. You can change the module name if you want.
module "yaml_json_decoder" {
source = "levmel/yaml_json/multidecoder"
version = "0.2.1"
filepaths = ["routes/nsg_rules.yml", "failover/cosmosdb.json", "network/private_endpoints/*.yaml", "network/private_links/config_file.yml", "network/private_endpoints/*.yml", "pipeline/config/*.json"]
}
Patterns to access YAML and/or JSON files from relative paths:
To access all YAML and/or JSON files in a folder, enter your path as follows: "folder/rest_of_folders/*.yaml", "folder/rest_of_folders/*.yml" or "folder/rest_of_folders/*.json".
To access a specific YAML or JSON file in a folder structure, use "folder/rest_of_folders/name_of_yaml.yaml", "folder/rest_of_folders/name_of_yaml.yml" or "folder/rest_of_folders/name_of_yaml.json".
To select all YAML and/or JSON files within the current folder, use the "*.yml", "*.yaml" or "*.json" notation (see the Usage section above).
YAML delimiter support is available from version 0.1.0!
WARNING: Only the relative path must be specified. Do not pass path.root (it is already included in the module by default), only everything after it.
Access YAML and JSON entries
Now you can access all entries within all the YAML and/or JSON files you've selected like this: "module.yaml_json_decoder.files.[name of your YAML or JSON file].entry". If the name of your YAML or JSON file is "name_of_your_config_file", then access it as follows: "module.yaml_json_decoder.files.name_of_your_config_file.entry".
Example of multi YAML and JSON file accesses from different paths (directories)
first YAML file:
routes/nsg_rules.yml
rdp:
  name: rdp
  priority: 80
  direction: Inbound
  access: Allow
  protocol: Tcp
  source_port_range: "*"
  destination_port_range: 3399
  source_address_prefix: VirtualNetwork
  destination_address_prefix: "*"
---
ssh:
  name: ssh
  priority: 70
  direction: Inbound
  access: Allow
  protocol: Tcp
  source_port_range: "*"
  destination_port_range: 24
  source_address_prefix: VirtualNetwork
  destination_address_prefix: "*"
second YAML file:
services/logging/monitoring.yml
application_insights:
  application_type: other
  retention_in_days: 30
  daily_data_cap_in_gb: 20
  daily_data_cap_notifications_disabled: true
  logs:
    # Optional fields
    - "AppMetrics"
    - "AppAvailabilityResults"
    - "AppEvents"
    - "AppDependencies"
    - "AppBrowserTimings"
    - "AppExceptions"
    - "AppExceptions"
    - "AppPerformanceCounters"
    - "AppRequests"
    - "AppSystemEvents"
    - "AppTraces"
first JSON file:
test/config/json_history.json
{
  "glossary": {
    "title": "example glossary",
    "GlossDiv": {
      "title": "S",
      "GlossList": {
        "GlossEntry": {
          "ID": "SGML",
          "SortAs": "SGML",
          "GlossTerm": "Standard Generalized Markup Language",
          "Acronym": "SGML",
          "Abbrev": "ISO 8879:1986",
          "GlossDef": {
            "para": "A meta-markup language, used to create markup languages such as DocBook.",
            "GlossSeeAlso": ["GML", "XML"]
          },
          "GlossSee": "markup"
        }
      }
    }
  }
}
main.tf
module "yaml_json_multidecoder" {
source = "levmel/yaml_json/multidecoder"
version = "0.2.1"
filepaths = ["routes/nsg_rules.yml", "services/logging/monitoring.yml", test/config/*.json]
}
output "nsg_rules_entry" {
value = module.yaml_json_multidecoder.files.nsg_rules.aks.ssh.source_address_prefix
}
output "application_insights_entry" {
value = module.yaml_json_multidecoder.files.monitoring.application_insights.daily_data_cap_in_gb
}
output "json_history" {
value = module.yaml_json_multidecoder.files.json_history.glossary.title
}
Changes to Outputs:
  nsg_rules_entry            = "VirtualNetwork"
  application_insights_entry = 20
  json_history               = "example glossary"

How to fetch multiline with ExtractGrok processor in ApacheNifi?

I am trying to convert log file events (which are recorded by a LogAttribute processor) to JSON.
I am using ExtractGrok with this configuration:
The STACK pattern in the pattern file is (?m).*
Each log entry has this format:
2019-11-21 15:26:06,912 INFO [Timer-Driven Process Thread-4] org.apache.nifi.processors.standard.LogAttribute LogAttribute[id=143515f8-1f1d-1032-e7d2-8c07f50d1c5a] logging for flow file StandardFlowFileRecord[uuid=02eb9f21-4587-458b-8cee-ad052cb8e634,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1574339166853-1, container=default, section=1], offset=0, length=0],offset=0,name=0df20cc1-3f93-49df-81b1-dac18318ccd9,size=0]
------------- request was received----------
Standard FlowFile Attributes
Key: 'entryDate'
Value: 'Thu Nov 21 15:26:06 AST 2019'
Key: 'lineageStartDate'
Value: 'Thu Nov 21 15:26:06 AST 2019'
Key: 'fileSize'
Value: '0'
FlowFile Attribute Map Content
Key: 'filename'
Value: '0df20cc1-3f93-49df-81b1-dac18318ccd9'
Key: 'http.context.identifier'
Value: '9552bd22-ec3b-4ada-93a9-a5ce9b27de25'
Key: 'path'
Value: './'
Key: 'uuid'
Value: '02eb9f21-4587-458b-8cee-ad052cb8e634'
-------------- request was received----------
I expect the rest of the message after the first line to be captured as well, but I get only the first line:
-------------- request was received----------
I checked the expression in the Grok Debugger and it works, but it doesn't work in NiFi.
How do I configure ExtractGrok to capture all lines of the log in the value?
I found the solution: I replaced (?m).* with (?s).* and it works. ((?s) enables DOTALL mode so that . also matches newlines, whereas (?m) only changes how ^ and $ match.)
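For reference, a configuration along those lines (a sketch; the field name message and the pattern file path are arbitrary choices, and property display names may vary by NiFi version) might look like:
Pattern file (e.g. /opt/nifi/patterns/custom_patterns):
STACK (?s).*
ExtractGrok
  Grok Expression: %{STACK:message}
  Grok Pattern File: /opt/nifi/patterns/custom_patterns
  Destination: flowfile-attribute
With Destination set to flowfile-attribute, the captured text ends up in an attribute named grok.message, which can then be used to build the JSON output.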

How can I timestamp messages in nifi?

Disclaimer: I know absolutely nothing about nifi.
I need to receive messages from the ListenHTTP processor, and then convert each message into a timestamped json message.
So, say I receive the message hello world at 5 am. It should transform it into {"timestamp": "5 am", "message":"hello world"}.
How do I do that?
Each flowfile has attributes, which are pieces of metadata stored in key/value pairs in memory (available for rapid read/write). When any operation occurs, pieces of metadata get written by the NiFi framework, both to the provenance events related to the flowfile, and sometimes to the flowfile itself. For example, if ListenHTTP is the first processor in the flow, any flowfile that enters the flow will have an attribute entryDate with the value of the time it originated in the format Thu Jan 24 15:53:52 PST 2019. You can read and write these attributes with a variety of processors (i.e. UpdateAttribute, RouteOnAttribute, etc.).
For your use case, you could use a ReplaceText processor immediately following the ListenHTTP processor, with a Search Value of (?s)(^.*$) (the entire flowfile content, or "what you received via the HTTP call") and a Replacement Value of {"timestamp_now":"${now():format('YYYY-MM-dd HH:mm:ss.SSS Z')}", "timestamp_ed": "${entryDate:format('YYYY-MM-dd HH:mm:ss.SSS Z')}", "message":"$1"}.
The example above provides two options:
The entryDate is when the flowfile came into existence via the ListenHTTP processor
The now() function gets the current timestamp in milliseconds since the epoch
Those two values can differ slightly based on performance/queuing/etc. In my simple example, they were 2 milliseconds apart. You can format them using the format() method and the normal Java time format syntax, so you could get "5 am", for example, by using h a (full example: ${now():format('h a'):toLower()}).
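As a sketch, the ReplaceText configuration described above would be roughly the following (property display names from a typical NiFi 1.x install; properties not listed keep their defaults):
ReplaceText
  Search Value: (?s)(^.*$)
  Replacement Value: {"timestamp_now":"${now():format('YYYY-MM-dd HH:mm:ss.SSS Z')}", "timestamp_ed": "${entryDate:format('YYYY-MM-dd HH:mm:ss.SSS Z')}", "message":"$1"}
  Replacement Strategy: Regex Replace
  Evaluation Mode: Entire text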
Example
ListenHTTP running on port 9999 with path contentListener
ReplaceText as above
LogAttribute with log payload true
Curl command: curl -d "helloworld" -X POST http://localhost:9999/contentListener
Example output:
2019-01-24 16:04:44,529 INFO [Timer-Driven Process Thread-6] o.a.n.processors.standard.LogAttribute LogAttribute[id=8246b0a0-0168-1000-7254-2c2e43d136a7] logging for flow file StandardFlowFileRecord[uuid=5e1c6d12-298d-4d9c-9fcb-108c208580fa,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1548374015429-1, container=default, section=1], offset=3424, length=122],offset=0,name=5e1c6d12-298d-4d9c-9fcb-108c208580fa,size=122]
--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate'
Value: 'Thu Jan 24 16:04:44 PST 2019'
Key: 'lineageStartDate'
Value: 'Thu Jan 24 16:04:44 PST 2019'
Key: 'fileSize'
Value: '122'
FlowFile Attribute Map Content
Key: 'filename'
Value: '5e1c6d12-298d-4d9c-9fcb-108c208580fa'
Key: 'path'
Value: './'
Key: 'restlistener.remote.source.host'
Value: '127.0.0.1'
Key: 'restlistener.remote.user.dn'
Value: 'none'
Key: 'restlistener.request.uri'
Value: '/contentListener'
Key: 'uuid'
Value: '5e1c6d12-298d-4d9c-9fcb-108c208580fa'
--------------------------------------------------
{"timestamp_now":"2019-01-24 16:04:44.518 -0800", "timestamp_ed": "2019-01-24 16:04:44.516 -0800", "message":"helloworld"}
So, I added an ExecuteScript processor with this code:
import org.apache.commons.io.IOUtils
import org.apache.nifi.processor.io.InputStreamCallback
import org.apache.nifi.processor.io.StreamCallback
import java.nio.charset.StandardCharsets
import java.time.LocalDateTime

flowFile = session.get()
if (!flowFile) return

def text = ''
// Cast a closure with an inputStream parameter to InputStreamCallback to read the incoming content
session.read(flowFile, { inputStream ->
    text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
} as InputStreamCallback)

// Build the JSON message (note: this does not escape quotes inside the incoming text)
def outputMessage = '{"timestamp":"' + LocalDateTime.now().toString() + '", "message":"' + text + '"}'

// Overwrite the flowfile content with the JSON message
flowFile = session.write(flowFile, { inputStream, outputStream ->
    outputStream.write(outputMessage.getBytes(StandardCharsets.UTF_8))
} as StreamCallback)

session.transfer(flowFile, REL_SUCCESS)
and it worked.
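If the incoming text can contain quotes or newlines, a safer variation (a sketch, not part of the original answer) is to let Groovy's JsonOutput do the JSON escaping:
import groovy.json.JsonOutput
import org.apache.commons.io.IOUtils
import org.apache.nifi.processor.io.StreamCallback
import java.nio.charset.StandardCharsets
import java.time.LocalDateTime

flowFile = session.get()
if (!flowFile) return

flowFile = session.write(flowFile, { inputStream, outputStream ->
    def text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
    // JsonOutput.toJson handles quoting/escaping of the message text
    def json = JsonOutput.toJson([timestamp: LocalDateTime.now().toString(), message: text])
    outputStream.write(json.getBytes(StandardCharsets.UTF_8))
} as StreamCallback)

session.transfer(flowFile, REL_SUCCESS)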

API Blueprint and Dredd - Required field missing from response, but tests still pass

I am using a combination of API Blueprint and Dredd to test an API my application is dependent on. I am using attributes in API blueprint to define the structure of the response's body.
Apparently I'm missing something, though, because the tests always pass even though I've purposefully defined a fake "required" parameter that I know is missing from the API's response. It seems that Dredd is only testing the type of the response body (array), rather than the type and the parameters within it.
My API Blueprint file:
FORMAT: 1A
HOST: http://somehost.net
# API Title
## Endpoints [GET /endpoint/{date}]
+ Parameters
    + date: `2016-09-01` (string, required) - Date

+ Response 200 (application/json; charset=utf-8)
    + Attributes (array[Data])
## Data Structures
### Data
- realParameter: 2432432 (number)
- realParameter2: `some string` (string, required)
- realParameter3: `Something else` (string, required)
- realParameter4: 1 (number, required)
- fakeParam: 1 (number, required)
The response body:
[
  {
    "realParameter": 31,
    "realParameter2": "some value",
    "realParameter3": "another value",
    "realParameter4": 8908
  },
  {
    "realParameter": 54,
    "realParameter2": "something here",
    "realParameter3": "and here too",
    "realParameter4": 6589
  }
]
And my Dredd config file:
reporter: apiary
custom:
  apiaryApiKey: somekey
  apiaryApiName: somename
dry-run: null
hookfiles: null
language: nodejs
sandbox: false
server: null
server-wait: 3
init: false
names: false
only: []
output: []
header: []
sorted: false
user: null
inline-errors: false
details: false
method: []
color: true
level: info
timestamp: false
silent: false
path: []
blueprint: myApiBlueprintFile.apib
endpoint: 'http://ahost.com'
Does anyone have any idea why Dredd ignores the fact that "fakeParameter" doesn't actually show up in the response body and still allows the test to pass?
You've run into a limitation of MSON, the language API Blueprint uses for describing attributes. In many cases, MSON describes what MAY be present in the data structure rather than what MUST exactly be present.
The most prominent case is arrays, where basically any content of the array is optional and thus the underlying generated JSON Schema doesn't put any constraints on the array contents. Dredd just respects that, so indirectly it becomes a Dredd issue too; however, there's not much Dredd can do about it.
There's an issue for the problem: apiaryio/mson#66. You can follow and comment on the issue to get updates about this. Dredd is usually very prompt in picking up the latest API Blueprint parser, so once it's implemented in the language itself, it won't take long to appear in Dredd.
An obvious (but tedious) workaround is to specify your own JSON Schema with stricter rules, using the + Schema section alongside the + Attributes section.
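A minimal sketch of that workaround for the response above (the schema is hand-written for the example fields; exact indentation requirements for the + Schema asset may vary with the parser version) could look like:
+ Response 200 (application/json; charset=utf-8)
    + Attributes (array[Data])
    + Schema

            {
              "type": "array",
              "items": {
                "type": "object",
                "required": ["realParameter2", "realParameter3", "realParameter4", "fakeParam"],
                "properties": {
                  "realParameter": {"type": "number"},
                  "realParameter2": {"type": "string"},
                  "realParameter3": {"type": "string"},
                  "realParameter4": {"type": "number"},
                  "fakeParam": {"type": "number"}
                }
              }
            }
With a schema like this in place, Dredd should fail the test when fakeParam is missing from the response body.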
