How do I use FreeFormTextRecordSetWriter - apache-nifi

In my NiFi controller I want to configure the FreeFormTextRecordSetWriter, but I have no idea what I should put in the "Text" field. I'm getting the text from my source (in my case GetSolr) and just want to write it out unchanged.
The documentation and the mailing list do not seem to explain how this is done; any help is appreciated.
EDIT: Here is the sample input and the output I want to achieve (as you can see: no transformation needed, plain text, no JSON input).
EDIT: I now realize that I can't tell GetSolr to return just CSV data; I have to use JSON.
So referencing fields as attributes seems to be fine. What the documentation omits is that the ${flowFile} attribute should contain the complete flowfile that is returned.
Sample input:
{
"responseHeader": {
"zkConnected": true,
"status": 0,
"QTime": 0,
"params": {
"q": "*:*",
"_": "1553686715465"
}
},
"response": {
"numFound": 3194,
"start": 0,
"docs": [
{
"id": "{402EBE69-0000-CD1D-8FFF-D07756271B4E}",
"MimeType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"FileName": "Test.docx",
"DateLastModified": "2019-03-27T08:05:00.103Z",
"_version_": 1629145864291221504,
"LAST_UPDATE": "2019-03-27T08:16:08.451Z"
}
]
}
}
Wanted output:
{402EBE69-0000-CD1D-8FFF-D07756271B4E}
BTW: The documentation says this:
The text to use when writing the results. This property will evaluate the Expression Language using any of the fields available in a Record.
Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
I want to use my source's text as-is, so I'm confused about what to put there.

You need to use Expression Language as if the record's fields were the FlowFile's attributes.
Example:
Input:
{
"t1": "test",
"t2": "ttt",
"hello": true,
"testN": 1
}
Text property in FreeFormTextRecordSetWriter:
${t1} k!${t2} ${hello}:boolean
${testN}Num
Output (using ConvertRecord):
test k!ttt true:boolean
1Num
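As a rough illustration of the substitution above (this is not NiFi code), the following Python sketch mimics how plain ${field} references in the Text property are replaced with a record's field values. The function name render_record is hypothetical, and the sketch assumes only simple references, with none of the Expression Language functions that NiFi supports.

```python
import json
from string import Template


def render_record(text: str, record: dict) -> str:
    """Replace ${field} references in `text` with values from `record`."""
    values = {
        # Render booleans JSON-style (true/false), everything else via str().
        name: json.dumps(value) if isinstance(value, bool) else str(value)
        for name, value in record.items()
    }
    # safe_substitute leaves unknown ${...} references untouched instead of raising.
    return Template(text).safe_substitute(values)


record = {"t1": "test", "t2": "ttt", "hello": True, "testN": 1}
text = "${t1} k!${t2} ${hello}:boolean\n${testN}Num"
print(render_record(text, record))
```

Running it on the example record prints the same two lines as the ConvertRecord output shown above.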
EDIT:
It seems that what you need is to read from Solr and write a single-column CSV. For that, use CSVRecordSetWriter rather than FreeFormTextRecordSetWriter.
As for the schema, consider upgrading to NiFi 1.9.1: starting from 1.9.0, the schema can be inferred for you.
Otherwise, you can set Schema Access Strategy to Use 'Schema Text' Property,
then use the following schema in Schema Text:
{
"name": "MyClass",
"type": "record",
"namespace": "com.acme.avro",
"fields": [
{
"name": "id",
"type": "string"
}
]
}
This should work.
I'll edit it into my answer. If it works for you, please choose my answer :)
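To make the intended result concrete, here is a minimal Python sketch of the same transformation done outside NiFi: it reads a Solr JSON response shaped like the sample input and writes one id per line. The function name solr_ids_to_csv is hypothetical, and the sketch assumes every document carries the requested field.

```python
import csv
import io
import json


def solr_ids_to_csv(solr_json: str, field: str = "id") -> str:
    """Return a single-column CSV (no header) with `field` from each Solr doc."""
    docs = json.loads(solr_json)["response"]["docs"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for doc in docs:
        # One row per document, containing only the requested field.
        writer.writerow([doc[field]])
    return buffer.getvalue()


sample = json.dumps({
    "response": {
        "numFound": 1,
        "start": 0,
        "docs": [{"id": "{402EBE69-0000-CD1D-8FFF-D07756271B4E}", "FileName": "Test.docx"}],
    }
})
print(solr_ids_to_csv(sample), end="")
```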

Related

Streamsets Data Collector: Replace a Field With Its Child Value

I have a data structure like this
{
"id": 926267,
"updated_sequence": 2304899,
"published_at": {
"unix": 1589574240,
"text": "2020-05-15 21:24:00 +0100",
"iso_8601": "2020-05-15T20:24:00Z"
},
"updated_at": {
"unix": 1589574438,
"text": "2020-05-15 21:27:18 +0100",
"iso_8601": "2020-05-15T20:27:18Z"
}
}
I want to replace the updated_at field with the value of its unix child field using StreamSets Data Collector. As far as I know, this can be done with Field Replacer, but I still don't understand how to write the mapping expression. How can I achieve that?
In Field Replacer, set Fields to /rec/updated_at and New value to ${record:value('/rec/updated_at/unix')}, and it will replace the value.
Cheers,
Dash
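For readers who want to see the effect of that replacement on the data itself, the following Python sketch performs the same operation on a plain dictionary. It is only an illustration of the result, not StreamSets code, and the helper name replace_with_child is hypothetical.

```python
import copy


def replace_with_child(record: dict, field: str, child: str) -> dict:
    """Return a copy of `record` where `field` is replaced by its `child` value."""
    result = copy.deepcopy(record)  # leave the caller's record untouched
    result[field] = result[field][child]
    return result


record = {
    "id": 926267,
    "updated_at": {
        "unix": 1589574438,
        "text": "2020-05-15 21:27:18 +0100",
        "iso_8601": "2020-05-15T20:27:18Z",
    },
}
print(replace_with_child(record, "updated_at", "unix"))
```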

How to use the Nifi JoltJSONTransform spec?

I want to write a JoltTransformJSON spec that converts the input below into the output below.
I have tried mapping to a list and other syntax, but have not been successful so far.
Expected input:
{
"params": "sn=GH6747246T4JLR6AZ&c=QUERY_RECORD&p=test_station_name&p=station_id&p=result&p=mac_address"
}
Expected output:
{
"queryType": "scan",
"dataSource": "xyz",
"resultFormat": "list",
"columns": ["test_station_name", "station_id", "result", "mac_address"],
"intervals": ["2018-01-01/2018-02-09"],
"filter": {
"type": "selector",
"dimension": "sn",
"value": "GH6747246T4JLR6AZ"
}
}
Except for the contents of the columns, dimension and value attributes, the rest of the fields are hardcoded.
As all of the data is contained in a single JSON key/value, I don't think JoltTransformJSON is the best option here. I actually think writing a simple script in Python/Groovy/Ruby to split the querystring value and write it out as JSON is easier and less complicated to maintain. I would recommend Groovy specifically (you can use the specialized ExecuteGroovyScript processor), as it is the most performant & robust in Apache NiFi and has excellent JSON handling.
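A minimal sketch of such a script is shown here in Python rather than Groovy. It covers only the transformation logic, not the NiFi processor wiring, and the function name build_scan_query is hypothetical. It assumes that sn is always present and that every p parameter is a column name; the hardcoded values are copied from the expected output in the question.

```python
import json
from urllib.parse import parse_qs


def build_scan_query(params: str) -> dict:
    """Turn the query string from the "params" key into the expected JSON."""
    parsed = parse_qs(params)  # e.g. {"sn": [...], "c": [...], "p": [...]}
    return {
        "queryType": "scan",
        "dataSource": "xyz",
        "resultFormat": "list",
        # Repeated p=... parameters become the column list, in order.
        "columns": parsed.get("p", []),
        "intervals": ["2018-01-01/2018-02-09"],
        "filter": {
            "type": "selector",
            "dimension": "sn",
            "value": parsed["sn"][0],
        },
    }


incoming = {"params": "sn=GH6747246T4JLR6AZ&c=QUERY_RECORD&p=station_id&p=result"}
print(json.dumps(build_scan_query(incoming["params"]), indent=2))
```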

How to get name/confidence individually from classify_text?

Most of the other methods in the Language API, such as analyze_syntax and analyze_sentiment, can return their constituent elements, for example:
sentiment.score
sentiment.magnitude
token.part_of_speech.tag
and so on,
but I have not found a way to return name and confidence in isolation from classify_text. It doesn't look like it's possible, but that seems weird. Am I missing something? Thanks
The language.documents.classifyText method returns a ClassificationCategory object which contains name and confidence. If you only want one of the fields you can filter by categories/name or categories/confidence. As an example I executed:
POST https://language.googleapis.com/v1/documents:classifyText?fields=categories%2Fname&key={YOUR_API_KEY}
{
"document": {
"content": "this is a test for a StackOverflow question. I get an error because I need more words in the document and I don't know what else to say",
"type": "PLAIN_TEXT"
}
}
Which returns:
{
"categories": [
{
"name": "/Science/Computer Science"
},
{
"name": "/Computers & Electronics/Programming"
},
{
"name": "/Jobs & Education"
}
]
}
Direct link to API explorer for interactive testing of my example (change content, filters, etc.)
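As a sketch of how the pieces fit together on the client side, the Python below builds the request URL with a fields filter and pulls name and confidence out of a parsed response. It makes no network call; the helper names build_url and names_and_confidences are hypothetical, and the sample response is a shortened copy of the one above.

```python
from urllib.parse import urlencode

BASE_URL = "https://language.googleapis.com/v1/documents:classifyText"


def build_url(api_key: str, fields: str) -> str:
    """Build the classifyText URL restricted to the given response fields."""
    return BASE_URL + "?" + urlencode({"fields": fields, "key": api_key})


def names_and_confidences(response: dict) -> list:
    """Return (name, confidence) pairs; a filtered-out field comes back as None."""
    return [
        (category.get("name"), category.get("confidence"))
        for category in response.get("categories", [])
    ]


response = {"categories": [{"name": "/Science/Computer Science"}, {"name": "/Jobs & Education"}]}
print(build_url("YOUR_API_KEY", "categories/name"))
print(names_and_confidences(response))
```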

Which is the better design for this API response

I'm trying to decide upon the best format of response for my API. I need to return a reports response which provides information on the report itself and the fields contained on it. Fields can be of differing types, so there can be: SelectList; TextArea; Location etc..
They each use different properties, so "SelectList" might use "Value" to store its string value and "Location" might use "ChildItems" to hold "Longitude" "Latitude" etc.
Here's what I mean:
"ReportList": [
  {
    "Fields": [
      {
        "Id": {},
        "Label": "",
        "Value": "",
        "FieldType": "",
        "FieldBankFieldId": {},
        "ChildItems": [
          {
            "Item": "",
            "Value": ""
          }
        ]
      }
    ]
  }
]
The problem with this is that I'm expecting users to know when a value is supposed to be null. So I'm expecting a person who wants the value of "Location" to extract it from "ChildItems" and not from "Value". The benefit, however, is that it's much easier to query than the alternative, which is the following:
"ReportList": [
  {
    "Fields": [
      {
        "SelectList": [
          {
            "Id": {},
            "Label": "",
            "Value": ""
          }
        ],
        "Location": [
          {
            "Id": {},
            "Label": "",
            "Latitude": "",
            "Longitude": "",
            "etc": ""
          }
        ]
      }
    ]
  }
]
So this one is a report list that contains a list of fields, each of which in turn contains one list per field type for every field type I have (15 or so). This is opposed to just having a list of reports, each with a list of fields carrying a "FieldType" enum, which I think is fairly easy to manipulate.
So the question: which format is best for a response? Any alternatives and comments are appreciated.
EDIT:
To query all fields by fieldtype in a report and get values with the first way it would go something like this:
foreach (var field in fields)
{
    switch (field.FieldType)
    {
        case FieldType.Location:
            // Location keeps its data in ChildItems, not Value.
            var locationValue = field.ChildItems;
            break;
        case FieldType.SelectList:
            var selectListValue = field.Value;
            break;
    }
}
The second one would be like:
foreach (var field in fields)
{
    foreach (var location in field.Locations)
    {
        var latitude = location.Latitude;
    }
    foreach (var selectList in field.SelectLists)
    {
        var value = selectList.Value;
    }
}
I think the right answer is the first one, with the switch statement. It makes it easier to query for things like: get me the value of the field whose id is this GUID. It just means putting it through a big switch statement.
I went with the first one because it's easier to query for the most common use case. I'll expect the client code to map it into their own schema if they want to change it.
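To illustrate why the first format suits the lookup-by-id case, here is a small Python sketch of client code consuming it. The function names extract_value and value_by_id are hypothetical, and the sketch assumes that only Location stores its data in ChildItems while the other types use Value.

```python
def extract_value(field: dict):
    """Return the payload of a field, dispatching on its FieldType."""
    if field["FieldType"] == "Location":
        # Location keeps its data in ChildItems as Item/Value pairs.
        return {item["Item"]: item["Value"] for item in field["ChildItems"]}
    # Simple types such as SelectList or TextArea use Value directly.
    return field["Value"]


def value_by_id(fields: list, field_id: str):
    """Find the field with the given Id and return its extracted value."""
    for field in fields:
        if field["Id"] == field_id:
            return extract_value(field)
    return None  # no field with that Id


fields = [
    {"Id": "a", "FieldType": "SelectList", "Value": "Open", "ChildItems": []},
    {"Id": "b", "FieldType": "Location", "Value": None,
     "ChildItems": [{"Item": "Latitude", "Value": "51.5"},
                    {"Item": "Longitude", "Value": "-0.1"}]},
]
print(value_by_id(fields, "b"))
```

With the second format, the same lookup would have to search every per-type list in turn, since the id alone does not say which list holds the field.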

Remove fields by their name pattern

We are currently using Logstash with Elasticsearch to log some of our application events.
Some of the events hold fields that are dynamically named.
We want to apply a filter that will remove or merge them before they are sent to Elasticsearch.
For example:
{
"Root": {
"EventType": "Info",
"Timestamp": 20150713153757.758
},
"Event": {
"Message": "itemsViews Created in 1 mSec",
"Cache_11542": true,
"Cache_10242": false,
"Cache_55240": 124
}
}
In this case we would like to remove all the fields starting with "Cache_" under the Event object,
so the output to Elasticsearch will be:
{
"Root": {
"EventType": "Info",
"Timestamp": 20150713153757.758
},
"Event": {
"Message": "itemsViews Created in 1 mSec"
}
}
Is there a way to define a filter in the Logstash configuration file to achieve this?
Many thanks in advance.
It looks like the Ruby filter solution that @magnus-bäck points out might be your solution. I had originally suggested the mutate filter using the "remove_field" array in conjunction with the gsub filter: gsub to regex-match your Cache* fields, which can then be renamed into a variable for use in mutate. However, since you have an arbitrary number of Cache fields, I like the Ruby script better. :)
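The logic such a script needs is small. The sketch below expresses it in Python purely to show the idea; it is not a Logstash configuration, and inside Logstash the equivalent would be written in Ruby within the ruby filter. The function name remove_fields_by_prefix and its parameters are hypothetical.

```python
def remove_fields_by_prefix(event: dict, parent: str = "Event", prefix: str = "Cache_") -> dict:
    """Return a copy of `event` without the keys under `parent` that start with `prefix`."""
    cleaned = dict(event)  # shallow copy; only `parent` is rebuilt
    if isinstance(event.get(parent), dict):
        cleaned[parent] = {
            key: value
            for key, value in event[parent].items()
            if not key.startswith(prefix)
        }
    return cleaned


event = {
    "Root": {"EventType": "Info", "Timestamp": 20150713153757.758},
    "Event": {
        "Message": "itemsViews Created in 1 mSec",
        "Cache_11542": True,
        "Cache_10242": False,
        "Cache_55240": 124,
    },
}
print(remove_fields_by_prefix(event))
```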
