NiFi - How can i change ReplacementValue on ReplaceText? - apache-nifi

I'm using ExecuteSQL,SplitAvro,ConvertAvroToJSON,EvaluateJsonPath,ReplaceText,ExecuteSQL.
I am trying to replace the content in the flowfile using replaceText processor.
Now, i can replace like this. -> INSERT INTO values (${id},'${name}') . ReplaceText process send excute sql like this :
INSERT into x values(1,'xx')
INSERT into x values(2,'yy')
INSERT into x values(3,'zz')
I'm sending INSERT query for each line.
But i want to send executesql process like this
INSERT INTO x values (1,'xx'),(2,'yy'),(3,'zz')

i'm not sure it's the best approach but you could do this:
SplitAvro # i guess you are splitting the records here (here you should get fragment.* attributes)
ConvertAvroToJSON # converting each record to json
EvaluateJsonPath # getting id,name values from json
ReplaceText # (${id}, '${name}')
MergeContent # merge rows back to single file with header and delimiter
Binary Concatenation
Header = Insert into X values
Demarcator = ,
ExecuteSQL

Although it won't generate exactly the SQL you're looking for, take a look at ConvertRecord and/or JoltTransformRecord -> PutDatabaseRecord. The former is used to get each of your Avro records into the form you want (id, name), and PutDatabaseRecord will use a PreparedStatement in batches to send the records to the database. It might not be quite as efficient as a single INSERT, but should be much more efficient than a Split -> Convert -> ExecuteSQL with separate INSERTs per FlowFile.
To truly get the SQL you want, you'll likely need a scripted processor such as InvokeScriptedProcessor with a RecordReader, I have a blog post on the subject.

Related

trying to add a field based on conditionals with NifI

I'm new to Apache NiFi and currently using it to route message data to various locations. I'm looking to add some fields based on a set of conditionals.
Currently I have a GetFile Processor that reads log files ---> ExtractGrok that applies a grok pattern to parse ---> ConvertRecord to convert from Grok to Json. The next part is where I'm stumped/not sure what to do next.
In my json I have a field refresh_time I need to create 2 new fields based on some conditions about the field refresh_time
something along the lines of if refresh_time < 10 then cache = 1; else if refresh_time > 10 then reprocess = 1
The end goal here is numeric fields cache and refresh_time that can be used down the road in aggregations.
What would be the best way to add 2 numerical fields based on a condition. Is there a processor for adding additional fields or updating the record to include new fields?
Thanks.
There's a couple ways you could achieve what you want to.
One option (More readable)
A QueryRecord would let you write a SQL statement across your Records and let you split them by the result. E.g.
Add a dynamic property called cache with a value SELECT * FROM FLOWFILE WHERE refresh_time < 10.
Add a dynamic property called refresh with a value SELECT * FROM FLOWFILE WHERE refresh_time > 10.
The QueryRecord will now have the relationships failure, original, cache and refresh.
Branching off from cache and refresh will be one UpdateRecord each, with Replacement Value Strategy set to Literal Value.
For the cache relationship, you can add a new dynamic property called cache with a value 1. For the refresh relationship, you can add a new dynamic property called refresh with a value 1.
Similar option (Possibly more performant)
If you want to avoid the additional UpdateRecord, you can add fields in the QueryRecord with something like this:
Two dynamic properties set as:
cache = SELECT *, 1 AS cache FROM FLOWFILE WHERE REFRESH < 10
reprocess = SELECT *, 1 AS reprocess FROM FLOWFILE WHERE REFRESH > 10
This option may be more performant due to fewer disk reads.
This gist is an example of the second option, you can import it to NiFi to try it out.
Also, FYI there is a GrokReader that you could use in ConvertRecord to parse with Grok straight to JSON, potentially skipping the ExtractGrok.

How to set start and end row or interval rows for CSV in Nifi?

I want to get particular part of excel file in Nifi. My Nifi template like that;
GetFileProcessor
ConvertExcelToCSVProcessor
PutDatabaseRecordProcessor
I should parse data between step 2 and 3.
Is there a solution for getting specific rows and columns ?
Note:If there is a option for cutting ConvertExcelToCSVProcessor, it will work for me.
You can use Record processors between ConvertExcelToCSV and PutDatabaseRecord.
to remove or override a column use UpdateRecord. this processor can receive your data via CSVReader and prepare an output for PutDatabaseRecord or QueryRecord . check View usage -> Additional Details...
in order to filter by column use QueryRecord.
here an example. this example receives data through CSVReader and makes some aggregations, you can as well do some filtering according to doc
also this post had helped me to understand Records in Nifi

Passing parameter from different source into insert statement using Nifi

I'm still new in NiFi. What I want to achieve is to pass a parameter from a different source.
Scenario:
I have 2 datasource which is Json data and record id (from oracle function). I declared record id using extract text as "${recid}" and json string default is "$1" .
How to insert into table using sql statement insert into table1 (json,recid) value ('$1','${recid}')
After I run the processor. I'm not able to get both attribute into one insert statement.
Please help.
Nifi flowfile
Flowfile after mergecontent
you should merge these 2 flowfiles to make one.
Use mergeFlowfile processor with Attribute Strategy set to Keep All Unique Attributes
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.MergeContent/index.html
Take a look at LookupAttribute with a SimpleDatabaseLookupService. You can pass your JSON flow file into that, look up the recid into an attribute, then do the ExtractText -> ReplaceText to get it into SQL form.

Nifi add attribute from DB

I am currently getting files from FTP in Nifi, but I have to check some conditions before I fetch the file. The scenario goes some thing like this.
List FTP -> Check Condition -> Fetch FTP
In the Check Condition part, I have fetch some values from DB and compare with the file name. So can I use update attribute to fetch some records from DB and make it like this?
List FTP -> Update Attribute (from DB) -> Route on Attribute -> Fetch FTP
I think your flow looks something like below
Flow:
1.ListFTP //to list the files
2.ExecuteSQL //to execute query in db(sample query:select max(timestamp) db_time from table)
3.ConvertAvroToJson //convert the result of executesql to json format
4.EvaluateJsonPath //keep destination as FlowfileAttribute and add new property as db_time as $.db_time
5.ROuteOnAttribute //perform check filename timestamp vs extracted timestamp by using nifi expresson language
6.FetchFile //if condition is true then fetch the file
RouteOnAttribute Configs:
I have assumed filename is something like fn_2017-08-2012:09:10 and executesql has returned 2017-08-2012:08:10
Expression:
${filename:substringAfter('_'):toDate("yyyy-MM-ddHH:mm:ss"):toNumber()
:gt(${db_time:toDate("yyyy-MM-ddHH:mm:ss"):toNumber()})}
By using above expression we are having filename value same as ListFTP filename and db_time attribute is added by using EvaluateJsonPath processor and we are changing the time stamp to number then comparing.
Refer to this link for more details regards to NiFi expression language.
So if I understand your use case correctly, it is like you are using the external DB only for tracking purpose. So I guess only the latest processed timestamp is enough. In that case, I would suggest you to use DistributedCache processors and ControllerServices offered by NiFi instead of relying on an external DB.
With this method, your flow would be like:
ListFile --> FetchDistributedMapCache --(success)--> RouteOnAttribute -> FetchFile
Configure FetchDistributedMapCache
Cache Entry Identifier - This is the key for your Cache. Set it to something like lastProcessedTime
Put Cache Value In Attribute - Whatever name you give here will be added as a FlowFile attribute with its value being the Cache value. Provide a name, like latestTimestamp or lastProcessedTime
Configure RouteOnAttribute
Create a new dynamic relationship by clicking the (+) button in the Properties tab. Give it a name, like success or matches. Let's assume, your filenames are of the format somefile_1534824139 i.e. it has a name and an _ and the epoch timestamp appended.
In such case, you can leverage NiFi Expression Language and make use of the functions it offer. So for the new dynamic relation, you can have an expression like:
success - ${filename:substringAfter('_'):gt(${lastProcessedTimestamp})}
This is with the assumption that, in FetchDistributedMapCache, you have configured the property Put Cache Value In Attribute with the value lastProcessedTimestamp.
Useful Links
https://community.hortonworks.com/questions/83118/how-to-put-data-in-putdistributedmapcache.html
https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#dates

How can I do a double delimiter in Hive?

let's say I have some sample rows of data
site1^http://article1.com?datacoll=5|4|3|2|1&test=yes
site1^http://article1.com?test=yes
site1^http://article1.com?datacoll=5|4|3|2|1&test=yes
I want to create a table like so
create table clicklogs (sitename string, url string)
ROW format delimited fields terminated by '^';
As you can see I have some data in the url parameter I'd like to extract, namely
datacoll=5|4|3|2|1
I also want to work with those individual elements seperated by pipes so I can do group bys on them to show for example how many urls had a 2nd position of "4" which would be 2 rows in this case. So in this case I have the "url" field that has additional data I'd like to parse out and use in my queries.
The question is, what is the best way to do that in hive?
thanks!
First, use parse_url(string urlString, string partToExtract [, string keyToExtract]) to grab the data in question:
parse_url('http://article1.com?datacoll=5|4|3|2|1&test=yes', 'QUERY', 'datacol1')
This returns '5|4|3|2|1', which gets us halfway there. Now, use split(string str, string pat) to break those out of each sub-delimiter into an array:
split(parse_url(url, 'QUERY', 'datacol1'), '\|')
With the result of this, you should be able to grab the columns that you want.
See the UDF documentation for more built-in functions.
Note: I wasn't able to verify this works in Hive from where I am, sorry if there are some minor issues.
This looks very similar to something I've done a couple weeks ago, I think the best approach in your case would be to apply a pre-processing step (possibly with hadoop streaming), and change the prototype of your table to be:
create table clicklogs(sitename string, datacol Array<int>) row format delimited fields terminated by '^' collection items terminated by '|'
Once you have that you can easily manipulate your data in Hive using lateral views and the builtin explode. The following code should help you get the counts of URLs per col.
select col, count(1) from clicklogs lateral view explode(datacol) dataTable as col group by col

Resources