Using Apache NiFi to enrich raw unstructured text content with CSV

I'm thinking of using NiFi to split text records into individual flow files. These flow files will contain unstructured text. However, I want to enrich each flow file with fields from a separate CSV file whenever one of its keywords appears in the text. For example:
Raw text
Text
$2 to $4 billion lost annually by an estimated 1.7 million people
who would be affected by their inability to afford healthcare. And
about $1.3 billion in Medicare payments go unpaid each year.
Enrichment CSV
Keywords | Enrichment field 1 | Enrichment field 2
estimated | cat 1 | user 1
Medicare | cat 2 | user 2
Desired output
header 1 | Enrichment field 1 | Enrichment field 2
$2 to $4 billion lost annually by an estimated 1.7 million people | cat 1 | user 1
who would be affected by their inability to afford healthcare. And about $1.3 billion in Medicare payments go unpaid each year. | cat 2 | user 2
Thanks in advance.
I'm not sure whether anyone has similar experience, or whether this can be done efficiently, since both the flow files and the enrichment CSV are potentially large. Also, which NiFi processors would be involved? I initially explored RouteOnContent, but the enrichment CSV is too big for that, so we need some kind of lookup instead.
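To make the desired transformation concrete, here is a minimal Python sketch of the matching logic on its own, outside NiFi. It is not a NiFi configuration. It assumes that keywords are single, case-sensitive words and that the first keyword found in a record decides the enrichment; the names `load_enrichment` and `enrich` are hypothetical. The enrichment CSV is loaded into a dictionary once, so each record costs one lookup per word rather than one scan per keyword, which matters when the CSV is large.

```python
import csv
import io
import re

# Sample data copied from the question.
ENRICHMENT_CSV = """Keywords,Enrichment field 1,Enrichment field 2
estimated,cat 1,user 1
Medicare,cat 2,user 2
"""

RECORDS = [
    "$2 to $4 billion lost annually by an estimated 1.7 million people",
    "who would be affected by their inability to afford healthcare. And "
    "about $1.3 billion in Medicare payments go unpaid each year.",
]


def load_enrichment(csv_text):
    """Build a keyword -> (field 1, field 2) dictionary from the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["Keywords"]: (row["Enrichment field 1"], row["Enrichment field 2"])
        for row in reader
    }


def enrich(record, table):
    """Return (record, field 1, field 2) for the first keyword found.

    Assumption: keywords are single words and matching is case-sensitive.
    Records without any keyword get None for both enrichment fields.
    """
    for word in re.findall(r"\w+", record):
        if word in table:
            return (record, *table[word])
    return (record, None, None)


TABLE = load_enrichment(ENRICHMENT_CSV)
for rec in RECORDS:
    print(enrich(rec, TABLE))
```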

Related

Apache NiFi - Split a large JSON file into multiple files with a specified number of records

I am a newbie to NiFi and would like some guidance, please.
We want to split a large JSON file into multiple files with a specified number of records. I am able to split a file into individual records using SplitJson with the JsonPath Expression set to $..*. I have also added an UpdateAttribute processor with filename set to ${filename}_${fragment.index} so that we keep the sequence of the files, as order is important.
However, we might want to split, say, 100,000 records into 100 files of 1,000 records each. What is the easiest way to do this?
Thanks very much in advance.
There is a SplitRecord processor. You can set the number of records per output file, for example:
Record Reader: CSVReader
Record Writer: CSVRecordSetWriter
Records Per Split: 3
I tested it with the following records:
id
1
...
8
and they were split into 3 files with id = (1,2,3), (4,5,6), (7,8).
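The chunking behaviour described in this answer can be illustrated with a short Python sketch. This is only a model of the grouping, not NiFi code; `split_records` is a hypothetical helper name.

```python
from itertools import islice


def split_records(records, records_per_split):
    """Yield consecutive chunks of at most records_per_split records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, records_per_split))
        if not chunk:
            return
        yield chunk


# Same data as the test above: ids 1..8, three records per split.
splits = list(split_records(range(1, 9), 3))
print(splits)  # [[1, 2, 3], [4, 5, 6], [7, 8]]
```

With 100,000 records and 1,000 records per split, the same grouping gives 100 chunks.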

Apache NiFi - Consume Kafka + Merge Content + Put HDFS to avoid small files

I have around 2,000,000 messages in a Kafka topic and I want to put these records into HDFS using NiFi. I am using the PutHDFS processor along with ConsumeKafka_0_10, but this generates small files in HDFS, so I am using the MergeContent processor to merge the records before pushing the file.
Please help if the configuration needs changes. This works fine for a small number of messages, but it writes a single file for every record when it comes to topics with massive data.
Thank you!!
The Minimum Number of Entries is set to 1, which means a merged file could contain anywhere from 1 entry up to the Maximum Number of Entries. Try setting it to something higher, like 100k.
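The effect of the minimum can be seen in a toy model. The sketch below is not how MergeContent is implemented and ignores its other settings, such as size and age limits. It only assumes that a bin is released as soon as it holds at least the minimum number of entries when the processor runs; `merge_bins` is a hypothetical name.

```python
def merge_bins(arrivals_per_run, min_entries, max_entries):
    """Toy model: return the sizes of merged files and the leftover count.

    arrivals_per_run lists how many flow files arrived before each run.
    Assumes min_entries >= 1 and max_entries >= min_entries.
    """
    merged = []
    pending = 0
    for arrived in arrivals_per_run:
        pending += arrived
        # A full bin is always released.
        while pending >= max_entries:
            merged.append(max_entries)
            pending -= max_entries
        # A partial bin is released once it reaches the minimum.
        if pending >= min_entries:
            merged.append(pending)
            pending = 0
    return merged, pending


# One message arrives before each run.
print(merge_bins([1, 1, 1, 1], min_entries=1, max_entries=1000))
print(merge_bins([1, 1, 1, 1], min_entries=3, max_entries=1000))
```

In this model, a minimum of 1 produces one output per incoming message, while a higher minimum makes the bin wait until enough entries have accumulated.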

NiFi record counts

I am getting files from a remote server using NiFi. My files look as follows:
timestamp (ms), nodeID,value
12345,x,12.4
12346,x,12.7
12348,x,13.4
12356,x,13.6
12355,y,12.0
Right now I just get and fetch the files, split the lines, and send them to Kafka. Beforehand, however, I need to apply a checksum approach to my records and aggregate them by timestamp. What do I need to do to add an additional column to my content that counts the records per aggregated timestamp, for example aggregating over each 10 milliseconds and nodeID?
timestamp (ms), nodeID,value, counts
12345,x,12.4,3
12346,x,12.7,3
12348,x,13.4,3
12356,x,13.6,1
12355,y,12.0,1
How can I do the above in NiFi? I am totally new to NiFi but need to add this functionality to my flow. I am currently using the NiFi flow below.
This may not answer your question directly, but you should consider refactoring your flow to use the "record" processors. It would greatly simplify things and would probably get you closer to being able to do the aggregation.
The idea is to not split up the records, and instead process them in place. Given your current flow, the 4 processors after FetchSFTP would likely change to a single ConvertRecord processor that converts CSV to JSON. You would first need to define a simple Avro schema for your data.
Once you have the record processing set up, you might be able to use PartitionRecord to partition the records by the node ID, and from there the missing piece would be how to count by the timestamps.
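The counting step itself can be sketched in plain Python, independent of NiFi. This assumes that "each 10 milliseconds" means integer buckets of the timestamp (`timestamp // 10`), which reproduces the counts in the question; `add_counts` is a hypothetical helper name.

```python
from collections import Counter

# Rows from the question: (timestamp in ms, nodeID, value).
ROWS = [
    (12345, "x", "12.4"),
    (12346, "x", "12.7"),
    (12348, "x", "13.4"),
    (12356, "x", "13.6"),
    (12355, "y", "12.0"),
]


def add_counts(rows, window_ms=10):
    """Append to each row the number of rows sharing its window and nodeID."""
    counts = Counter((ts // window_ms, node) for ts, node, _ in rows)
    return [
        (ts, node, value, counts[(ts // window_ms, node)])
        for ts, node, value in rows
    ]


for row in add_counts(ROWS):
    print(",".join(str(field) for field in row))
```

Two passes are needed because the count for a row is only known once every row in its window has been seen.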
Some additional resources...
https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
https://www.slideshare.net/BryanBende/apache-nifi-record-processing

Approach to upload multiple interconnected csv files to HBase

I am new to HBase and still not sure which component of the Hadoop ecosystem I should use in my case or how I will analyse my data later, so I am just exploring options.
I have an Excel sheet with a summary about all the customers like this but with ≈ 400 columns:
CustomerID Country Age E-mail
251648 Russia 27 boo#yahoo.com
487985 USA 30 foo#yahoo.com
478945 England 15 lala#yahoo.com
789456 USA 25 nana#yahoo.com
I also have a separate .xls file for each customer with information about that customer (one customer = one .xls file). The number and names of the columns are the same in every file, and each file is named with the CustomerID. One of them looks like this:
'customerID_251648.xls':
feature1 feature2 feature3 feature4
0 33,878 yes 789,598
1 48,457 yes 879,594
1 78,495 yes 487,457
0 94,589 no 787,475
I have converted all these files into .csv format and now feel stuck on which component of the Hadoop ecosystem I should use for storing and querying such data.
My eventual goal is to query by customerID and get all the information about that customer from all the files.
I think HBase fits this perfectly because I can create a schema like this:
row key timestamp Column Family 1 Column Family 2
251648 Country Age E-Mail Feature1 Feature2 Feature3 Feature4
What is the best approach to upload and query such data in HBase? Should I first combine the information about a customer from the different sources and then upload it to HBase? Or can I keep separate .csv files for each customer and, when uploading to HBase, somehow choose which .csv to use for each column family?
For querying data stored in HBase I am going to write MapReduce tasks via the Python API.
Any help would be very appreciated!
Your schema design is correct. Also remember that HBase loads the whole column family during scans, so if you need all the data at one time, it may be better to place everything in one column family.
A simple way to load the data would be to scan the first file with the customers and fetch the data from the second file on the fly. A bulk CSV load could be faster in execution time, but you'll spend more time writing code.
You may also need to think about the row key, because HBase stores rows sorted lexicographically by row key. If you have a lot of data, you'd better create the table with given split keys rather than let HBase do the splits, because it can end up with unbalanced regions.
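As an illustration of the last point, one simple way to pick split keys is to sort a sample of the row keys and take evenly spaced boundaries. The sketch below only computes the keys; creating the table with them is left to the HBase shell or client and is not shown. It assumes row keys are stored as strings, and `choose_split_keys` is a hypothetical helper name.

```python
def choose_split_keys(row_keys, num_regions):
    """Pick num_regions - 1 boundary keys from a sample of row keys.

    Keys are sorted as strings, which mirrors lexicographic ordering
    for ASCII row keys. Returns an empty list when the sample is too
    small to split.
    """
    keys = sorted(set(row_keys))
    if num_regions < 2 or len(keys) < num_regions:
        return []
    step = len(keys) / num_regions
    return [keys[int(i * step)] for i in range(1, num_regions)]


# CustomerIDs from the question, used as string row keys.
customer_ids = ["251648", "487985", "478945", "789456"]
print(choose_split_keys(customer_ids, 2))  # ['487985']
```

Note that numeric IDs of different lengths would not sort numerically as strings, so fixed-width keys are the safer assumption here.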

Concatenate thousands of files using EMR

I currently have a process which reads files from AWS S3 and concatenates them using EMR.
The input files have the following format: 1 header row and 1 data row.
Fields are comma-separated and wrapped in double-quotes.
Example:
"header-field1","header-field2","header-field3",...
"data-field1","data-field2","data-field3",...
The files vary in size between 90 and 200 bytes.
The output file has the following format:
"header-field1","header-field2","header-field3",...
"file1-data-field1","file1-data-field2","file1-data-field3",...
"file2-data-field1","file2-data-field2","file2-data-field3",...
"file3-data-field1","file3-data-field2","file3-data-field3",...
....
My current approach uses a default mapper and a single reducer to concatenate all the data rows and prepend 1 header row at the top of the final output file.
Because I want a single header row in the final output, I was forced to use only a single reducer in my EMR job. This, I feel, drastically increases run time.
Early tests ran great with tens of files.
However, I am trying to scale this application to run for thousands of files with the final goal of concatenating 1 million.
My current process for 1000 files is still running after 30+ minutes, which is too long.
Do you have any suggestions on where I can improve my application to dramatically improve overall performance?
Thank you.
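For reference, the concatenation described in this question amounts to the following, written as a plain Python sketch rather than an EMR job. It assumes every input has exactly one header row followed by one data row and that all headers are identical; `concatenate` is a hypothetical name.

```python
def concatenate(file_contents):
    """Combine two-line inputs into one output with a single header row.

    Inputs with fewer than two lines are skipped as malformed. The
    header is taken from the first well-formed input.
    """
    header = None
    data_rows = []
    for content in file_contents:
        lines = content.splitlines()
        if len(lines) < 2:
            continue
        if header is None:
            header = lines[0]
        data_rows.append(lines[1])
    if header is None:
        return ""
    return "\n".join([header] + data_rows) + "\n"


files = [
    '"h1","h2","h3"\n"a1","a2","a3"\n',
    '"h1","h2","h3"\n"b1","b2","b3"\n',
    '"h1","h2","h3"\n"c1","c2","c3"\n',
]
print(concatenate(files), end="")
```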
