Validate CSV with a schema in the ValidateCsv processor - apache-nifi

I am converting JSON to CSV, and I want to validate the CSV before pushing it to PutFile. How should I specify the schema in the ValidateCsv processor to do that?
Below is the sample JSON; it needs to be validated before being pushed to PutFile.
[
  {
    "name": "Tony",
    "age": 20,
    "regdate": "2022-07-01 02:15:15",
    "due_date": "2021-05-01 03:30:33",
    "start_date": "2021-05-01 03:30:33"
  },
  {
    "name": "Steve",
    "age": 21,
    "regdate": "2022-03-01 05:22:15",
    "due_date": "2022-03-01 05:22:15",
    "start_date": "2022-04-01 02:30:33"
  },
  {
    "name": "Peter",
    "age": 23,
    "regdate": "2021-08-06 02:20:15",
    "due_date": "2022-01-03 05:30:33",
    "start_date": "2022-01-03 05:30:33"
  }
]
I have tried giving the schema as the JSON/CSV field names, but I get an error.
My schema is: name,age,start_date,regdate,due_date
Suggest a valid schema for me to proceed further.

Check the documentation of the ValidateCsv processor:
The cell processors cannot be nested (except with Optional which gives the possibility to define a CellProcessor for values that could be null) and must be defined in a comma-delimited string as the Schema property.
Input to the processor (after ConvertRecord, JSON -> CSV):
name,age,regdate,due_date,start_date
Tony,20,2022-07-01 02:15:15,2021-05-01 03:30:33,2021-05-01 03:30:33
Steve,21,2022-03-01 05:22:15,2022-03-01 05:22:15,2022-04-01 02:30:33
Peter,23,2021-08-06 02:20:15,2022-01-03 05:30:33,2022-01-03 05:30:33
For your use case, the schema would probably be:
StrNotNullOrEmpty(),ParseInt(),ParseDate("yyyy-MM-dd HH:mm:ss"),ParseDate("yyyy-MM-dd HH:mm:ss"),ParseDate("yyyy-MM-dd HH:mm:ss")
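If one of the columns could legitimately be empty, the Optional wrapper from the documentation quote above can be applied to that column's cell processor, for example (a sketch, assuming due_date may be blank):
StrNotNullOrEmpty(),ParseInt(),ParseDate("yyyy-MM-dd HH:mm:ss"),Optional(ParseDate("yyyy-MM-dd HH:mm:ss")),ParseDate("yyyy-MM-dd HH:mm:ss")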

Related

Is it possible to use the cockroach gen_random_uuid() function inside JSON data while inserting into a JSON column in CockroachDB

I am new to CockroachDB and was wondering if the following is possible.
One of the columns in my table is of JSON type, and the sample data in it is as follows:
{
  "first_name": "Lola",
  "friends": 547,
  "last_name": "Dog",
  "location": "NYC",
  "online": true,
  "Education": [
    {
      "id": "4ebb11a5-8e9a-49dc-905d-fade67027990",
      "UG": "UT Austin",
      "Major": "Electrical",
      "Minor": "Electronics"
    },
    {
      "id": "6724adfa-610a-4efe-b53d-fd67bd3bd9ba",
      "PG": "North Eastern",
      "Major": "Computers",
      "Minor": "Electrical"
    }
  ]
}
Is there a way to replace the "id" field in JSON as below to get the id generated dynamically?
"id": gen_random_uuid(),
Yes, this should be possible. To generate JSON data that includes a randomly-generated UUID, you can use a query like:
root@:26257/defaultdb> select jsonb_build_object('id', gen_random_uuid());
                jsonb_build_object
--------------------------------------------------
  {"id": "d50ad318-62ba-45c0-99a4-cb7aa32ad1c3"}
If you want to update existing JSON data in place, you can use the jsonb_set function (see JSONB Functions).
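For example, a minimal sketch of such an in-place update, assuming a hypothetical table people with a JSONB column info holding the document above (the path targets the id of the first Education entry):
UPDATE people
SET info = jsonb_set(info, '{Education,0,id}', to_jsonb(gen_random_uuid()::STRING));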

How to extract only a few columns from a NiFi flow file after reading the data from a flat file

The flat file has the following data without a header which needs to be loaded into the MySQL table.
101,AAA,1000,10
102,BBB,5000,20
I use a GetFile or GetSFTP processor to read the data. Once the data is read, the flow file contains the rows above. I want to load only the 1st, 2nd, and 4th columns into the MySQL table. The output I expect in the MySQL table is as below.
101,AAA,10
102,BBB,20
Can you please help me with how to extract only a few columns from an incoming flow file in NiFi and load them into MySQL?
This is just one way to do it, but there are several other ways. This method uses Records, and otherwise avoids modifying the underlying data - it simply ignores the fields you don't want during the insert. This is beneficial when integrating with a larger Flow, where the data is used by other Processors that might expect the original data, or where you are already using Records.
Let's say your Table has the columns
id | name | value
and your data looks like
101,AAA,1000,10
102,BBB,5000,20
You could use a PutDatabaseRecord processor with Unmatched Field Behavior and Unmatched Column Behavior set to Ignore Unmatched... and add a CSVReader as the Record Reader.
In the CSVReader you could set the Schema Access Strategy to Use 'Schema Text' Property. Then set the Schema Text property to the following:
{
  "type": "record",
  "namespace": "nifi",
  "name": "db",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "name", "type": "string" },
    { "name": "ignoredField", "type": "string" },
    { "name": "value", "type": "string" }
  ]
}
This matches the NiFi record fields against the DB table columns: fields 1, 2, and 4 match, while field 3 is ignored (as it does not match a column name).
Obviously, amend the field names in the Schema Text schema to match the column names of your DB table. You can also do data type checking/conversion here.
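In effect, with both unmatched behaviors set to ignore, each record results in an insert that references only the matched columns, roughly like this (the table name is illustrative):
INSERT INTO my_table (id, name, value) VALUES ('101', 'AAA', '10');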
References: PutDatabaseRecord, CSVReader
Another method is to convert your flow file to a record, with the help of ConvertRecord.
It helps transform the CSV into whatever format you prefer; you can also keep the CSV format.
With your flow file being a record, you can now use additional processors like QueryRecord, which lets you run SQL-like commands on the flow file:
"SELECT * FROM FLOWFILE"
In your case, you can do:
"SELECT col1,col2,col4 FROM FLOWFILE"
You can also directly apply filtering:
"SELECT col1,col2,col4 FROM FLOWFILE WHERE col1>500"
I recommend the following reading:
Query Record tutorial
Thank you very much pdeuxa and Sdairs for your replies; your inputs were helpful. I tried a method similar to both of yours: I used ConvertRecord and configured a CSVRecordReader and a CSVRecordSetWriter. The CSVRecordReader has the following schema to read the data:
{
  "type": "record",
  "namespace": "nifi",
  "name": "db",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "name", "type": "string" },
    { "name": "Salary", "type": "string" },
    { "name": "dept", "type": "string" }
  ]
}
The CSVRecordSetWriter has the following output schema. There are 4 fields in the input schema, while the output schema has only 3:
{
  "type": "record",
  "namespace": "nifi",
  "name": "db",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "name", "type": "string" },
    { "name": "dept", "type": "string" }
  ]
}
I was able to run this successfully. Thanks for your input, guys.

Apache Nifi - ConvertJSONToSQL - JSON Does not have a value for the required column

I am trying to experiment with a tutorial I came across online. While the tutorial's template ended with converting CSV to JSON, I want to go ahead and dump the result into a MySQL table.
So I create a new ConvertJSONToSQL processor and configure its properties and controller services (shown in screenshots that are not reproduced here, along with the sample input file and the MySQL table structure). When I run this, I get the error from the title.
A sample of the generated JSON is shown below:
[
  {
    "id": 1,
    "title": "miss",
    "first": "marlene",
    "last": "shaw",
    "street": "3450 w belt line rd",
    "city": "abilene",
    "state": "florida",
    "zip": "31995",
    "gender": "F",
    "nationality": "US"
  },
  {
    "id": 2,
    "title": "ms",
    "first": "letitia",
    "last": "jordan",
    "street": "2974 mockingbird hill",
    "city": "irvine",
    "state": "new jersey",
    "zip": "64361",
    "gender": "F",
    "nationality": "US"
  }
]
I don't understand the error description; there is no field called "CURRENT_CONNECTIONS". I would appreciate your input here, please.
In your case, you want to use the PutDatabaseRecord processor instead of ConvertJSONToSQL. This is because the output of ConvertRecord - CSVtoJSON is a record-oriented flow file (that is, a single flow file containing multiple records and a defined schema). ConvertJSONToSQL, from its documentation, would expect a single JSON element:
The incoming FlowFile is expected to be "flat" JSON message, meaning that it consists of a single JSON element and each field maps to a simple type
Record-oriented processors are designed to work together in a data flow that operates on structured data. They do require defining (or inferring) a schema for the data in your flowfiles, which is what the Controller Services are doing in your case, but the power is they allow you to encode/decode, operate on, and manipulate multiple records in a single flow file, which is much more efficient!
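As a rough sketch, a minimal PutDatabaseRecord configuration for this flow might be (property values are assumptions based on the scenario, not taken from the tutorial):
Record Reader: JsonTreeReader (schema inferred from, or matching, the sample JSON)
Database Connection Pooling Service: your MySQL DBCPConnectionPool
Statement Type: INSERT
Table Name: your target MySQL table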
Additional resources that may be helpful:
An introduction to effectively using the record-oriented processors together, such as ConvertRecord and PutDatabaseRecord:
https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
An example that includes PutDatabaseRecord: https://gist.github.com/ijokarumawak/b37db141b4d04c2da124c1a6d922f81f

Validate the list elements in evaluateJsonPath in apache nifi

Using Apache NiFi, I want to get all the input Twitter files which have the hashtag "tech".
The input JSON is:
{
  "created_at": "Sun Mar 25 18:00:43 +0000 2018",
  "id": 977968537028481025,
  "id_str": "977968537028481025",
  "text": "#bby__nim You know like datttt",
  "entities": {
    "hashtags": [
      {
        "text": "tech",
        "indices": [12, 17]
      },
      {
        "text": "BusinessPlan",
        "indices": [48, 61]
      }
    ],
    "urls": [],
    "user_mentions": [
      {
        "screen_name": "bby__nim",
        "name": "bbynim\ud83d\udc7d",
        "id": 424356807,
        "id_str": "424356807",
        "indices": [0, 9]
      }
    ],
    "symbols": []
  },
  "favorited": false,
  "retweeted": false,
  "filter_level": "low",
  "lang": "en",
  "timestamp_ms": "1522000843661"
}
Under EvaluateJsonPath, I am checking whether the hashtags are present using $.{entities.hashtags:jsonPath('$[0]')}, which validates successfully.
But in RouteOnAttribute, can someone tell me how to check whether entities.hashtags has the value "tech"?
Use the EvaluateJsonPath processor configured as described below.
We extract all the text values from the hashtags array and keep them as a flow file attribute.
In addition, you have to set the Destination property to flowfile-attribute and the Return Type to json.
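The original screenshot is not available, but the property would look something like this (the JsonPath expression is an assumption consistent with extracting all text values from the hashtags array):
entities.hashtags
$.entities.hashtags[*].text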
Now that we have all the text values in an array, use the RouteOnAttribute processor to check whether the entities.hashtags attribute has a tech value in it or not.
RouteOnAttribute configs:
Add a new property:
contains tech
${entities.hashtags:contains("tech")} //substring check: matches tech anywhere in the array string
The contains function in Expression Language evaluates whether the attribute has the substring tech in it, so it would also match values that merely contain tech.
But since we need to check the actual values in the array, use the expression below for that:
contains tech
${anyDelineatedValue("${entities.hashtags:replace('[',''):replace(']','')}",","):equals('"tech"')} //checks values in the array
Here we use the anyDelineatedValue, replace, and equals functions; the expression evaluates whether the array has a "tech" value in it or not.
In addition, if you want to check only the first text value in the hashtags array, your EvaluateJsonPath property would be:
entities.hashtags
$.entities.hashtags[0].text

Edit parsed JSON

I have a JSON file contact.txt that has been parsed into an object called JSONObj that is structured like this:
[
  {
    "firstName": "John",
    "lastName": "Smith",
    "address": {
      "streetAddress": "21 2nd Street",
      "city": "New York",
      "state": "NY",
      "postalCode": "10021"
    },
    "phoneNumbers": [
      { "type": "home", "number": "212 555-1234" },
      { "type": "fax", "number": "646 555-4567" }
    ]
  },
  {
    "firstName": "Mike",
    "lastName": "Jackson",
    "address": {
      "streetAddress": "21 Barnes Street",
      "city": "Abeokuta",
      "state": "Ogun",
      "postalCode": "10122"
    },
    "phoneNumbers": [
      { "type": "home", "number": "101 444-0123" },
      { "type": "fax", "number": "757 666-5678" }
    ]
  }
]
I envision editing the file/object by taking in data from a form so as to add more contacts. How can I do this?
The following method for adding a new contact to the JSONObj array doesn't seem to be working. What's the problem?
var newContact = {
  "firstName": "Jaseph",
  "lastName": "Lamb",
  "address": {
    "streetAddress": "25 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "13021"
  },
  "phoneNumbers": [
    { "type": "home", "number": "312 545-1234" },
    { "type": "fax", "number": "626 554-4567" }
  ]
}
var z = contact.JSONObj.length;
contact.JSONObj.push(newContact);
It depends on what technology you're using. The basic process is to read the file in, convert it to whatever native datatypes (hash, dict, list, etc.) using a JSON parsing library, modify or add data to the native object, then convert it back to JSON and store it to the file.
In Python, using the simplejson library it would look like this:
import simplejson

# use load() for a file object (loads() expects a string)
jsonobj = simplejson.load(open('contact.txt'))

# python's dict syntax looks almost like JSON
jsonobj.append({
    'firstName': 'Steve',
    'lastName': 'K.',
    'address': {
        'streetAddress': '123 Testing',
        'city': 'Test',
        'state': 'MI',
        'postalCode': '12345'
    },
    'phoneNumbers': [
        { 'type': 'home', 'number': '248 555-1234' }
    ]
})

simplejson.dump(jsonobj, open('contact.txt', 'w'), indent=True)
The data in this example is hardcoded strings, but it could come from another file or a web application request / form data, etc. If you're doing this in a web app though I would advise against reading and writing to the same file (in case two requests come in at the same time).
Please provide more information if this doesn't answer your question.
In response to "isn't there way to do this using standard javascript?":
To parse a JSON string in Javascript you can either eval it (not safe) or use a JSON parser's JSON.parse. Once you have the converted JSON object you can perform whatever modifications you want to it in standard JS. You can then use the same library to convert a JS object back to a JSON string (JSON.stringify). Javascript does not allow file access (unless you're doing serverside JS), so that prevents you from reading and writing your contact.txt file directly; you'd have to use a serverside language (like Python, Java, etc.) to read and write the file.
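For example, a minimal sketch in standard Javascript, assuming the file contents have already been loaded into a string variable jsonText (via serverside code, an XHR response, etc.):
var JSONObj = JSON.parse(jsonText);        // JSON string -> native JS array
JSONObj.push(newContact);                  // modify it like any other array
var updatedText = JSON.stringify(JSONObj); // back to a JSON string for storage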
Once you have read in the JSON, you just have an associative array - or rather you have a pseudo-associative array, since this is Javascript. Either way, you can treat the thing as one big list of dictionaries. You can access it by key and index.
So, to play with this object:
var firstPerson = JSONObj[0];
var secondPerson = JSONObj[1];
var name = firstPerson['firstName'] + ' ' + firstPerson['lastName'];
Since you will usually have more than two people, you probably just want to loop through each dictionary in your list and do something:
for (var i = 0; i < jsonList.length; i++) {  // for...in would iterate indices, not the objects themselves
    alert(jsonList[i]['address']);
}
If you want to edit the JSON and save it back to a file, then read it into memory, edit the list of dictionaries, and rewrite back to the file.
Your JSON library will have a function for turning JSON into a string, just as it turns a string into JSON.
p.s. I suggest you observe JavaScript conventions and use camelCase for your variable names, unless you have some other customs at your place of employment. http://javascript.crockford.com/code.html
