I hope someone will be able to help me. I am trying to learn Apache NiFi by doing a project where I have JSON files in the following format:
{
"network": "reddit",
"posted": "2021-12-24 10:46:51 +00000",
"postid": "rnjv0z",
"title": "A gil commission artwork of my friends who are in-game couples!",
"text": "A gil commission artwork of my friends who are in-game couples! ",
"lang": "en",
"type": "status",
"sentiment": "neutral",
"image": "https://a.thumbs.redditmedia.com/ShKq9bu4_ZIo4k5QIBYotstmyGidRgn8046RcqPo_p0.jpg",
"url": "http://www.reddit.com/r/ffxiv/comments/rnjv0z/a_gil_commission_artwork_of_my_friends_who_are/",
"user": {
"userid": "Suhteeven",
"name": "Suhteeven",
"url": "http://www.reddit.com/user/Suhteeven"
},
"popularity": [
{
"name": "ups",
"count": 1
},
{
"name": "comments",
"count": 0
}
]
}
I want to remove all non-alphanumeric characters from the "text" attribute. I want only this one attribute to be modified, while the rest of the file remains the same.
I tried using an EvaluateJsonPath processor where I added a text attribute, and then I created a ReplaceText processor.
This configuration cleaned special characters from the text, but as a result I have only the value of the text attribute. I don't want to lose the other information; my goal is to have all attributes in the output, with the text attribute's value modified.
I also tried the UpdateAttribute processor, but it didn't do anything to my JSON (the output is the same as the input).
Can you please tell me what processors I should use, and with what configurations? I have tried many different things but I am stuck.
It's possible with the ScriptedTransformRecord processor:
Record Reader: JsonTreeReader
Record Writer: JsonRecordSetWriter
Script Language (default): Groovy
Script Body
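// Copy the cleaned value from the flow file attribute 'text' into the record's 'text' field,
// then return the record so the Record Writer emits it with all other fields untouched.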
record.setValue("text", attributes['text'])
record
Data flow: EvaluateJsonPath (extract the text attribute) -> UpdateAttribute (modify the text attribute) -> ScriptedTransformRecord (write the cleaned text back into the record)
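For the UpdateAttribute step, a minimal sketch (assuming EvaluateJsonPath wrote the field into a flow file attribute named text): add a property named text with the value

${text:replaceAll('[^a-zA-Z0-9 ]', '')}

The Expression Language replaceAll function strips every character outside the allowed class, and the Groovy script above then writes the cleaned attribute back into the record.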
The problem I am having is:
SharePoint Get files (Properties Only) can only do one filter for OData, not a second AND clause, so I need to use Filter Array to make the secondary filter work. And it does work...
But now I need to take my filtered array, somehow get the {FullPath} property, and get the file content by passing a path, and I get this error...
[ {
"#odata.etag": ""1"",
"ItemInternalId": "120",
"ID": 120,
"Modified": "2022-03-21T15:03:31Z",
"Editor": {
"#odata.type": "#Microsoft.Azure.Connectors.SharePoint.SPListExpandedUser",
"Claims": "i:0#.f|membership|dev#email.com",
"DisplayName": "Bob dole",
"Email": "dev#email.com",
"Picture": "https://company.sharepoint.us/sites/devtest/_layouts/15/UserPhoto.aspx?Size=L&AccountName=dev#email.com",
"Department": "Information Technology",
"JobTitle": "Senior Applications Developer II"
},
"Editor#Claims": "data",
"Created": "2022-03-21T15:03:31Z",
"Author": {
"#odata.type": "#Microsoft.Azure.Connectors.SharePoint.SPListExpandedUser",
"Claims": "i:0#.f|membership|dev#email.com",
"DisplayName": "Bob Dole",
"Email": "dev#email.com",
"Picture": "https://company.sharepoint.us/sites/devtest/_layouts/15/UserPhoto.aspx?Size=L&AccountName=dev#email.com",
"Department": "Information Technology",
"JobTitle": "Senior Applications Developer II"
},
"Author#Claims": "i:0#.f|membership|dev#email.com",
"OData__DisplayName": "",
"{Identifier}": "Shared%2bDocuments%252fSDS%252fFiles%252fA10_NICKEL%2bVANADIUM%2bPRODUCT_PIS-USA_French.pdf",
"{IsFolder}": false,
"{Thumbnail}": ...DATA,
"{Link}": "https://company.sharepoint.us/sites/devtest/Shared%20Documents/SDS/Files/A10_NICKEL%20VANADIUM%20PRODUCT_PIS-USA_French.pdf",
"{Name}": "A10_NICKEL VANADIUM PRODUCT_PIS-USA_French",
"{FilenameWithExtension}": "A10_NICKEL VANADIUM PRODUCT_PIS-USA_French.pdf",
"{Path}": "Shared Documents/SDS/Files/",
"{FullPath}": "Shared Documents/SDS/Files/A10_NICKEL VANADIUM PRODUCT_PIS-USA_French.pdf",
"{IsCheckedOut}": false,
"{VersionNumber}": "1.0" } ]
So from what I can see, I think it's what I thought. Even though you're filtering an array down to a single element, you need to treat it like an array.
I'm going to make an assumption that you're always going to retrieve a single item as a result of your filter step.
I created a variable (SharePoint Documents) to store your "filtered" array so I could then do the work to extract the {FullPath} property.
I then created a variable that is initialised with the first element (again, I'm making the assumption that your filter will only ever return a single element) and used this expression ...
variables('SharePoint Documents')?[0]['{FullPath}']
This is the result and you can use that in your next step to get the file content from SharePoint ...
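With the sample payload above, that expression resolves to Shared Documents/SDS/Files/A10_NICKEL VANADIUM PRODUCT_PIS-USA_French.pdf, which you can pass straight to your get-file-content step.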
If my assumption is wrong and you can have more than one then you'll need to throw it in a loop and do the same sort of thing ...
This is the expression contained within ...
items('For_Each_in_Array')['{FullPath}']
Result ...
I actually ended up doing this and it works.
The flat file has the following data, without a header, which needs to be loaded into a MySQL table.
101,AAA,1000,10
102,BBB,5000,20
I use the GetFile or GetSFTP processor to read the data. Once the data is read, the flow file contains the above data. I want to load only the 1st, 2nd, and 4th columns into the MySQL table. The output I expect in the MySQL table is as below.
101,AAA,10
102,BBB,20
Can you please help me with how to extract only a few columns from an incoming flow file in NiFi and load them into MySQL?
This is just one way to do it, but there are several other ways. This method uses Records, and otherwise avoids modifying the underlying data - it simply ignores the fields you don't want during the insert. This is beneficial when integrating with a larger Flow, where the data is used by other Processors that might expect the original data, or where you are already using Records.
Let's say your Table has the columns
id | name | value
and your data looks like
101,AAA,1000,10
102,BBB,5000,20
You could use a PutDatabaseRecord processor with Unmatched Field Behavior and Unmatched Column Behavior set to Ignore Unmatched... and add a CSVReader as the Record Reader.
In the CSVReader you could set the Schema Access Strategy to Use 'Schema Text' Property. Then set the Schema Text property to the following:
{
"type": "record",
"namespace": "nifi",
"name": "db",
"fields": [
{ "name": "id", "type": "string" },
{ "name": "name", "type": "string" },
{ "name": "ignoredField", "type": "string" },
{ "name": "value", "type": "string" }
]
}
This would match the NiFi record fields against the DB table columns, matching fields 1, 2 and 4 while ignoring field 3 (as it does not match a column name).
Obviously, amend the field names in the Schema Text schema to match the column names of your DB table. You can also do data type checking/conversion here.
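A minimal sketch of the PutDatabaseRecord configuration described above (the connection pool and table name are placeholders for your own):

Record Reader: CSVReader
Database Connection Pooling Service: (your DBCPConnectionPool)
Statement Type: INSERT
Table Name: (your table)
Unmatched Field Behavior: Ignore Unmatched Fields
Unmatched Column Behavior: Ignore Unmatched Columns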
Documentation: PutDatabaseRecord, CSVReader
Another method could be to convert your flowfile to a record, with the help of ConvertRecord.
It lets you transform the CSV into whatever format you prefer (you can still keep the CSV format).
But with your flowfile being a record, you can now use additional processors like:
QueryRecord, so you can run SQL-like commands on the flow file:
"SELECT * FROM FLOWFILE"
and in your case, you can do:
"SELECT col1,col2,col3 FROM FLOWFILE"
You can also directly apply filtering:
"SELECT col1,col2,col3 FROM FLOWFILE WHERE col1>500"
I recommend the following reading:
Query Record tutorial
Thank you very much pdeuxa and Sdairs for your replies; your inputs were helpful. I tried a similar method to what both of you did: I used ConvertRecord and configured a CSVReader and a CSVRecordSetWriter. The CSVReader has the following schema to read the data:
{
"type": "record",
"namespace": "nifi",
"name": "db",
"fields": [
{ "name": "id", "type": "string" },
{ "name": "name", "type": "string" },
{ "name": "Salary", "type": "string" },
{ "name": "dept", "type": "string" }
]
}
while the CSVRecordSetWriter has the following output schema. There are 4 fields in the input schema while the output schema only has 3 columns:
{
"type": "record",
"namespace": "nifi",
"name": "db",
"fields": [
{ "name": "id", "type": "string" },
{ "name": "name", "type": "string" },
{ "name": "dept", "type": "string" }
]
}
I was able to run this successfully. Thanks for your input, guys.
I have syslog lines like:
<333>1 2020-10-09T09:03:00 Myv2 Myv2 - - - {"_id": "authentication", "a_device": {"hostname": null, "ip": "10.10.10.10", "location": {"city": "Lviv", "country": "Ukraine", "state": "Lviv"}}, "alias": "example#email.com", "application": {"key": "XXXXXXXXXXXXX", "name": "Name"}, "auth_device": {"ip": "10.10.10.10", "location": {"city": "Lviv", "country": "Ukraine", "state": "Lviv"}, "name": "+380 00 000 000"}
I need to convert the JSON part of the logs to attributes, so it will be in "key": "value" format like:
"_id": "authentication",
"a_device_ip": "10.10.10.10",
"location_city": "Lviv"
etc.
I am using the flow below with the ExtractGrok processor, with these options for ExtractGrok:
But with the 'flowfile-content' option, ExtractGrok adds an extra string for the pattern name 'GREEDYDATA' to my JSON, plus escape characters, like:
and after that the EvaluateJsonPath processor receives incorrect JSON and returns empty results:
If I select 'flowfile-attribute' in ExtractGrok, it works fine without the extra stuff, but I don't see how to send that attribute value to the EvaluateJsonPath processor; it works only with flow file content, not with attribute values.
Please help with this issue or suggest an alternative flow.
After your ExtractGrok processor, you could add an AttributesToJSON processor, which already creates a result JSON containing your Grok-parsed fields.
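A minimal sketch of that AttributesToJSON step (property names as in NiFi 1.x; the attribute name grok.json is an assumption — use whatever attribute name your ExtractGrok writes in flowfile-attribute mode):

Attributes List: grok.json
Destination: flowfile-content
Include Core Attributes: false

This replaces the flow file content with a JSON object built from the listed attributes, which you can then feed into EvaluateJsonPath.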
I'm using NiFi 1.8 and trying a simple workflow:
GenerateFlowFile -> ConvertRecord -> ConvertAvroToJSON
The generated flow file contains 5 lines, but the View Data Provenance output claim in ConvertAvroToJSON shows only one line.
Is it expected behaviour or did I make a mistake?
My initial file in GenerateFlowFile is
1,Millard,McKinley
2,Warren,Hoover
3,Dwight,Kennedy
4,Martin,Ford
5,Martin,Roosevelt
In ConvertRecord I have:
a CSVReader using ${avro.schema} (screenshot: CSVReader properties)
an AvroRecordSetWriter using ${avro.schema} (screenshot: AvroRecordSetWriter properties)
My variable avro.schema defines the following schema:
{
"type": "record",
"name": "LongList",
"fields" : [
{"name": "id", "type": "long"},
{"name": "firstname", "type": "string"},
{"name": "lastname", "type": "string"}
]
}
I reuse the same schema text in my ConvertAvroToJSON processor (screenshot: ConvertAvroToJSON schema).
I would expect to get 5 lines in the output of my pipeline
(output claim of ConvertAvroToJSON)
However I get only the first line:
[{"id": 1, "firstname": "Millard", "lastname": "McKinley"}]
Did I get something wrong?
Is this expected behaviour?
One workaround is to split per line beforehand, but is there a way to do this without splitting?
Thanks
I am trying to experiment with a tutorial I came across online, and here is its template:
While the template ended with converting CSV to JSON, I want to go ahead and dump this into a MySQL table.
So I create a new processor, "ConvertJSONToSQL".
Here are its properties:
And these are the controller services:
When I run this, I get the following error:
Here is the sample input file:
Here is the MySQL Table Structure:
Sample JSON Generated shown below:
[{
"id": 1,
"title": "miss",
"first": "marlene",
"last": "shaw",
"street": "3450 w belt line rd",
"city": "abilene",
"state": "florida",
"zip": "31995",
"gender": "F",
"nationality": "US"
},
{
"id": 2,
"title": "ms",
"first": "letitia",
"last": "jordan",
"street": "2974 mockingbird hill",
"city": "irvine",
"state": "new jersey",
"zip": "64361",
"gender": "F",
"nationality": "US"
}]
I don't understand the error description. There is no field called "CURRENT_CONNECTIONS". I would appreciate your inputs here, please.
In your case, you want to use the PutDatabaseRecord processor instead of ConvertJSONToSQL. This is because the output of ConvertRecord - CSVtoJSON is a record-oriented flow file (that is, a single flow file containing multiple records and a defined schema). ConvertJSONToSQL, from its documentation, would expect a single JSON element:
The incoming FlowFile is expected to be "flat" JSON message, meaning that it consists of a single JSON element and each field maps to a simple type
Record-oriented processors are designed to work together in a data flow that operates on structured data. They do require defining (or inferring) a schema for the data in your flowfiles, which is what the Controller Services are doing in your case, but the power is they allow you to encode/decode, operate on, and manipulate multiple records in a single flow file, which is much more efficient!
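If it helps, a minimal sketch of how PutDatabaseRecord could be configured for this flow (the controller service names and the table name are placeholders for your own):

Record Reader: JsonTreeReader (reusing the same schema strategy as your CSVtoJSON step)
Database Connection Pooling Service: (your MySQL DBCPConnectionPool)
Statement Type: INSERT
Table Name: (your table)

PutDatabaseRecord reads every record in the incoming flow file and issues one INSERT per record against the named table, so no separate SQL-generation step is needed.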
Additional resources that may be helpful:
An introduction to effectively using the record-oriented processors together, such as ConvertRecord and PutDatabaseRecord:
https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
An example that includes PutDatabaseRecord: https://gist.github.com/ijokarumawak/b37db141b4d04c2da124c1a6d922f81f