ExtractText : Split Flowfile into multiple copies based on dynamic attribute generated from each regex match? - apache-nifi

Is there a way to split an incoming flowfile into multiple flowfiles (each carrying its parent's attributes), one for each matching regex capture?
Example:
The incoming flowfile contains the data below:
Datafeed-Manifest-Version: 1.0
Lookup-Files: 1
Data-Files: 5
Total-Records: 2848792
Lookup-File: inventory-030000-lookup_data.tar.gz
MD5-Digest: fb7b275e624fb36f19eeedcdfa1aab09
File-Size: 37648783
Data-File: 01-inventory_20230110-030000.tsv.gz
MD5-Digest: 46b54b81c7103b45cbc8ab90b6119605
File-Size: 84247165
Record-Count: 355842
Data-File: 02-inventory_20230110-030000.tsv.gz
MD5-Digest: 8d1be438f98a172d0ff7e2d91ca7157e
File-Size: 85464370
Record-Count: 357974
Data-File: 03-inventory_20230110-030000.tsv.gz
MD5-Digest: c0b7a21a50a3cc43f32ad3d839cbb900
File-Size: 85037037
Record-Count: 354455
Data-File: 04-inventory_20230110-030000.tsv.gz
MD5-Digest: e5c8bc72108e1cb638dcdce080f32fa2
File-Size: 80764351
Record-Count: 339897
ExtractText is able to extract the regex matches into dynamic attributes successfully using a regex with a repeating capture group.
But the output is only the single parent flowfile carrying the first match as the base attribute (despite enabling 'Enable repeating capture group'), along with the numbered attributes below, as expected:
Attribute values:
datafilename: 01-inventory_20230110-040000.tsv.gz
datafilename.1: 01-inventory_20230110-040000.tsv.gz
datafilename.2: 02-inventory_20230110-040000.tsv.gz
datafilename.3: 03-inventory_20230110-040000.tsv.gz
datafilename.4: 04-inventory_20230110-040000.tsv.gz
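(The exact regex used in ExtractText is not shown above; as an assumption, a dynamic property of roughly the form datafilename = Data-File:\s*(\S+), with 'Enable repeating capture group' set to true, would produce numbered attributes like these.)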
What is the best way to dynamically split the flowfile into multiple copies based on the datafilename.X attributes before sending them to the downstream processors? I see RouteText can do something similar, but I'm not sure whether that would be efficient.
NOTE: The content of the original flowfile is not relevant, as long as every unique datafilename.X value can be extracted and attached to the new flowfiles, which should carry the existing attributes.
Desired output:
Incoming flowfile1 > ExtractText > flowfile1.1, flowfile1.2, flowfile1.3, ... (one child flowfile per regex match)

I figured out a way combining RouteText and SplitText. It's working, but if anyone can suggest a better way, please share.
Flowfile content:
Data-File: 01-inventory_20230110-060000.tsv.gz
Data-File: 02-inventory_20230110-060000.tsv.gz
Data-File: 03-inventory_20230110-060000.tsv.gz
Data-File: 04-inventory_20230110-060000.tsv.gz
Data-File: 05-inventory_20230110-060000.tsv.gz
Data-File: 06-inventory_20230110-060000.tsv.gz
Data-File: 07-inventory_20230110-060000.tsv.gz
Data-File: 08-inventory_20230110-060000.tsv.gz
After RouteText and SplitText, each flowfile carries only a single Data-File row; each one is sent to ExtractText, which adds the unique datafilename attribute for its matching line from the original flowfile.
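For reference, a rough sketch of the processor settings for this flow (property names come from the standard RouteText, SplitText, and ExtractText processors; the exact regex values are assumptions):
RouteText
  Routing Strategy: Route to each matching Property Name
  Matching Strategy: Contains Regular Expression
  datafile (dynamic property): ^Data-File:
SplitText
  Line Split Count: 1
ExtractText
  datafilename (dynamic property): Data-File:\s*(\S+)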

Related

Forward only one flowfile from multiple same flowfiles

I have multiple flowfiles as the output of a SplitJson processor; all are the same flowfile with the same content. I just want to forward one flowfile. Is there a way to do that in NiFi?
Yes, it should be possible.
The SplitJson processor writes a fragment.index attribute to each flowfile:
A one-up number that indicates the ordering of the split FlowFiles that were created from a single parent FlowFile
Each flowfile will receive a unique fragment.index starting from 1 up to the number of flowfiles, so you can use RouteOnAttribute to keep only the flowfile with fragment.index equal to 1.
Flow :
input data :
[{"name":"Gayle Hays"},{"name":"Merritt Calhoun"},{"name":"Tamara Lane"},{"name":"Contreras Heath"},{"name":"Martinez Watson"}]
RouteOnAttribute processor configuration:
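The configuration screenshot is not included here; a dynamic property along the following lines would implement the routing described above (the property name keep_first is an assumption):
keep_first = ${fragment.index:equals('1')}
Flowfiles matching this property are routed to the keep_first relationship; the rest go to unmatched and can be auto-terminated.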
Documentation : https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.12.1/org.apache.nifi.processors.standard.SplitJson/index.html

How to extract key-value pairs from CSV using Talend

I have data for one column in my CSV file as :
`column1`
row1 : {'name':'Steve Jobs','location':'America','status':'none'}
row2 : {'name':'Mark','location':'America','status':'present'}
row3 : {'name':'Elan','location':'Canada','status':'present'}
I want the output for that column to be:
`name` `location` `status`
Steve jobs America none
Mark America present
Elan Canada present
But sometimes I have a row value like {'name':'Steve Jobs','location':'America','status':'none'},{'name':'Mark','location':'America','status':'present'}
Please help!
You have to use tMap and tExtractDelimitedFields components.
Flow:
Below is the step-by-step explanation:
1. Original data - row1: {'name':'Steve Jobs','location':'America','status':'none'}
2. Substring the value inside the braces using the function below:
row1.Column0.substring(row1.Column0.indexOf("{")+1, row1.Column0.indexOf("}"))
Now the result is - 'name':'Steve Jobs','location':'America','status':'none'
3. Extract the single column into multiple columns using tExtractDelimitedFields. Since the fields are separated by ',', the delimiter should be set to comma, and since we have 3 fields in the data, create 3 fields in the component schema. Below is a screenshot of the tExtractDelimitedFields component configuration.
Now the result is,
name location status
'name':'Steve Jobs' 'location':'America' 'status':'none'
'name':'Mark' 'location':'America' 'status':'present'
'name':'Elan' 'location':'Canada' 'status':'present'
4. Using one more tMap, strip the column-name prefixes and the single quotes from the data:
row2.name.replaceAll("'name':", "").replaceAll("'", "")
row2.location.replaceAll("'location':", "").replaceAll("'", "")
row2.status.replaceAll("'status':", "").replaceAll("'", "")
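For example, after these expressions run, a value like 'name':'Steve Jobs' becomes Steve Jobs: the first replaceAll removes the 'name': prefix and the second strips the remaining single quotes.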
Your final result is below,

Loop over attribute values for executing SQL in Nifi

I would like to know how I can accomplish the following use case in a NiFi flow:
I would like to execute a SQL query for each date range in a loop, where the date ranges are provided as a list of attribute values.
For example, if my list of attribute values is 2013-01-01 2013-02-01 2013-03-01, I would like to execute the SQL operations in a loop such that:
select * from where startdate>=2013-01-01 and enddate<2013-02-01
followed by:
select * from where startdate>=2013-02-01 and enddate<2013-03-01
Therefore, I roughly know the idea but can't implement it concretely:
UpdateAttribute (containing list of date values) -> SplitText-> RouteOnAttribute -> ExecuteSQL
Thanks
In NiFi 1.8.0, you can use DuplicateFlowFile for this (via NIFI-5454). You can start with UpdateAttribute to add the count of delineated values in your list (let's assume it is an attribute called datelist), perhaps set list.count to
${allDelineatedValues(${datelist}, " "):count()}
Then in DuplicateFlowFile you can set Number of Copies to ${list.count:minus(1)}. Each flow file downstream will have a copy.index attribute set (the original having index 0), so you can use that in ReplaceText in conjunction with getDelimitedField(), perhaps setting the content to the following:
select * from myTable where
startdate >= ${datelist:getDelimitedField(${copy.index:plus(1)})} and
enddate < ${datelist:getDelimitedField(${copy.index:plus(2)})}
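As a worked example under the assumptions above: with datelist = "2013-01-01 2013-02-01 2013-03-01", list.count evaluates to 3, DuplicateFlowFile makes 2 copies, and the flowfile with copy.index = 0 produces the first query (startdate >= 2013-01-01 and enddate < 2013-02-01), while copy.index = 1 produces the second. Note that getDelimitedField uses a comma as its default delimiter, so for a space-delimited datelist the delimiter may need to be passed explicitly, e.g. ${datelist:getDelimitedField(${copy.index:plus(1)}, ' ')}.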

How to split a Webix datatable column into multiple columns?

In my webix datatable, I am showing multiple values in the cells for some columns.
To identify which values belong to which header, I have separated the column headers by a '|' (pipe) and similarly the values under them as well.
Now, in place of delimiting the columns by '|', I need to split the columns into separate editable columns with the same names.
Please refer to this snippet : https://webix.com/snippet/8ce1148e
In this above snippet, for example the Scores column will be split into two more editable columns as Rank and Vote. Similarly for Place column into Type and Name.
The way the values of the first array elements are shown under each of them should remain as is.
How can this be done ?
Thanks
While creating the column configuration for Webix, you can provide an array for the header field of the first column, along with a colspan, like below:
var columns = [];
columns[0] = {"id":"From", "header":[{"text":"Date","colspan":2},{"text":"From"}]};
columns[1] = {"id":"To", "header":[null, {"text":"To"}]};
columns[0] will create Date and From, and columns[1] will create To.
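Applied to the datatable in the question, the same header/colspan idea would look roughly like the sketch below. The column ids and the assumption that the combined "rank|vote" style values are split into separate data fields are mine, not from the original snippet; editable:true must also be set on the datatable for the editors to work.
columns: [
  { id:"rank", header:[{ text:"Scores", colspan:2 }, { text:"Rank" }], editor:"text" },
  { id:"vote", header:[null, { text:"Vote" }], editor:"text" },
  { id:"type", header:[{ text:"Place", colspan:2 }, { text:"Type" }], editor:"text" },
  { id:"name", header:[null, { text:"Name" }], editor:"text" }
]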

Manipulate row data in hadoop to add missing columns

I have log files from IIS stored in hdfs, but due to webserver configuration some of the logs do not have all the columns or they appear in different order. I want to generate files that have a common schema so I can define a Hive table over them.
Example good log:
#Fields: date time s-ip cs-method cs-uri-stem useragent
2013-07-16 00:00:00 10.1.15.8 GET /common/viewFile/1232 Mozilla/5.0+Chrome/27.0.1453.116
Example log with missing columns (cs-method and useragent missing):
#Fields: date time s-ip cs-uri-stem
2013-07-16 00:00:00 10.1.15.8 /common/viewFile/1232
The log with missing columns needs to be mapped to the full schema like this:
#Fields: date time s-ip cs-method cs-uri-stem useragent
2013-07-16 00:00:00 10.1.15.8 null /common/viewFile/1232 null
The bad logs can have any combination of columns enabled and in different order.
How can I map the available columns to the full schema according to the Fields row within the log file?
Edit:
Normally I would approach this by defining my column schema as a dict mapping column name to index, i.e. col['date']=0, col['time']=1, etc. Then I would read the #Fields row from the file, parse out the enabled columns, and build a header dict mapping header name to column index in the file. Then, for the remaining rows of data, I know each value's header by index, map that to my column schema by header = column name, and generate a new row in the correct order, inserting missing columns with null data. My issue is that I do not understand how to do this within Hadoop, since each map executes alone, so how can I share the #Fields information with each map?
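For what it's worth, a minimal Python sketch of the per-file remapping described above (plain Python outside Hadoop; the hard-coded full schema and the 'null' placeholder string are assumptions):
# Full output schema, in the desired column order (from the question's example)
FULL_SCHEMA = ['date', 'time', 's-ip', 'cs-method', 'cs-uri-stem', 'useragent']

def remap(lines):
    # Yield rows padded to FULL_SCHEMA, driven by the file's own #Fields: header.
    header = []
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('#Fields:'):
            # The enabled columns, in the order they appear in this file
            header = line[len('#Fields:'):].split()
            continue
        values = dict(zip(header, line.split()))
        # Emit every column of the full schema, 'null' where this file lacks it
        yield [values.get(col, 'null') for col in FULL_SCHEMA]
This still leaves the question of sharing the #Fields header across map tasks, which the answer below handles with a loader that applies the header and a Pig UDF.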
You can use a custom loader (such as the ExampleCSVLoader used in the sample script below) to apply the header row to the columns, creating a map. From there you can use a UDF like the following:
myudf.py
#!/usr/bin/python
# outputSchema is provided by Pig when this script is registered as a Jython UDF

@outputSchema('newM:map[]')
def completemap(M):
    # Add any missing keys to the map with a null value
    if M is None:
        return None
    to_add = ['A', 'D', 'F']
    for item in to_add:
        if item not in M:
            M[item] = None
    return M

@outputSchema('A:chararray, B:chararray, C:chararray, D:chararray, E:chararray, F:chararray')
def completemap_v2(M):
    # Flatten the map into a fixed tuple of columns, using null for missing keys
    if M is None:
        return (None, None, None, None, None, None)
    return (M.get('A', None),
            M.get('B', None),
            M.get('C', None),
            M.get('D', None),
            M.get('E', None),
            M.get('F', None))
These UDFs add the missing entries to the map (or, in completemap_v2, emit them as a fixed tuple with nulls for the missing fields).
Sample Input:
csv1.in csv2.in
------- ---------
A|B|C D|E|F
Hello|This|is PLEASE|WORK|FOO
FOO|BAR|BING OR|EVERYTHING|WILL
BANG|BOSH BE|FOR|NAUGHT
Sample Script:
A = LOAD 'tests/csv' USING myudfs.ExampleCSVLoader('\\|') AS (M:map[]);
B = FOREACH A GENERATE FLATTEN(myudf.completemap_v2(M));
Output:
B: {null::A: chararray,null::B: chararray,null::C: chararray,null::D: chararray,null::E: chararray,null::F: chararray}
(,,,,,)
(,,,PLEASE,WORK,FOO)
(,,,OR,EVERYTHING,WILL)
(,,,BE,FOR,NAUGHT)
(,,,,,)
(Hello,This,is,,,)
(FOO,BAR,BING,,,)
(BANG,BOSH,,,,)
