How to define/process NiFi FlowFile content without a schema? - apache-nifi

I have been experimenting with NiFi (via Hortonworks HDF) and can't grasp something that seems basic. I start with a file... I want to filter records based on a value (e.g., field number 2 contains "xyz")... then write only the matching records to HDFS.
I have a non-filtered flow working, but I can't find any docs or examples showing how to apply a schema to the file (or in any other way make sense of its "content").
What am I missing?
(I have seen references to FlowFiles being "schema-less" - not sure how that is useful).
For example, a file like this:
[app1][field1][field2]
[app2][field1][field2]
[app2][field1][field2][field3]
[app3][field1]
I want to filter to select records containing "[app2]" and create an HDFS file that looks like this:
field1,field2
field1,field2,field3
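To make the intended transformation concrete, here is a minimal sketch in plain Java of the filter/reshape step being described: keep only records whose first bracketed field is "app2", then emit the remaining fields comma-separated. This is not NiFi code (in NiFi one might reach for processors such as RouteText or the record-oriented processors); the class name and hard-coded input lines are only for illustration.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class App2Filter {
    // Matches one bracketed field at a time, e.g. "[app2]" or "[field1]".
    private static final Pattern FIELD = Pattern.compile("\\[([^\\]]*)\\]");

    public static void main(String[] args) {
        String[] lines = {
                "[app1][field1][field2]",
                "[app2][field1][field2]",
                "[app2][field1][field2][field3]",
                "[app3][field1]"
        };
        for (String line : lines) {
            List<String> fields = new ArrayList<>();
            Matcher m = FIELD.matcher(line);
            while (m.find()) {
                fields.add(m.group(1));
            }
            // Keep only records whose first field is "app2", then emit the
            // remaining fields as a comma-separated line.
            if (!fields.isEmpty() && "app2".equals(fields.get(0))) {
                System.out.println(String.join(",", fields.subList(1, fields.size())));
            }
        }
    }
}

Run against the example above, this prints exactly the two desired output lines.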

Related

Apache NiFi how to check output from each processor

I am new to using Apache NiFi, and I am trying to create a template that takes a JSON file and turns it into a set of SQL insert statements.
So far I have created a template that takes the JSON file and I have got it to the point of PutSQL. There is no database to connect to at the moment, but what I have not been able to check is the output. Can this be done? What I need to check is whether the array of JSON has been turned into an INSERT per element in the array.
As far as inspecting the output goes, what does your flow look like? If you have something like ConvertJSONToSQL -> PutSQL, you can leave PutSQL stopped and run ConvertJSONToSQL; you will then see FlowFile(s) in the connection between the two processors. Right-click on the connection, choose List Queue, then click the "eye" icon on the right for the FlowFile you wish to inspect. That will show you the contents of the FlowFile right before it goes into PutSQL.
Having said all that, if your JSON file contains fields that correspond to columns in your database, consider PutDatabaseRecord instead of ConvertJSONToSQL -> PutSQL. That can use a JsonTreeReader to parse each record, and it will generate and execute the necessary SQL as a prepared statement using the values in all records of the FlowFile. That way you don't need to generate the SQL yourself or worry about fragmented transactions or any of that.
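To picture what "an INSERT per element in the array" ends up looking like, here is a rough standalone Java sketch (Jackson + JDBC) of that end result. It is not what NiFi runs internally; the JSON shape, connection URL, table, and column names are placeholders.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class JsonToInserts {
    public static void main(String[] args) throws Exception {
        // Hypothetical input: a JSON array of records with "id" and "name" fields.
        String json = "[{\"id\":1,\"name\":\"a\"},{\"id\":2,\"name\":\"b\"}]";
        JsonNode records = new ObjectMapper().readTree(json);

        // Placeholder connection URL, table, and columns.
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/db", "user", "pass");
             PreparedStatement ps = conn.prepareStatement("INSERT INTO my_table (id, name) VALUES (?, ?)")) {
            for (JsonNode rec : records) {
                ps.setInt(1, rec.get("id").asInt());
                ps.setString(2, rec.get("name").asText());
                ps.addBatch();  // one parameterized INSERT per array element
            }
            ps.executeBatch();
        }
    }
}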

I need to get latest data in data ingest template nifi

Hi sir,
In the data ingest template I need to get the following behavior.
For example, I have data with a date field:
date data
12-07-2018 a
13-07-2018 b
14-07-2018 c
15-07-2018 d
From that, I would like to take the latest one, i.e., 15-07-2018.
If the date field gets new data,
16-07-2018 e
then I have to get 16-07-2018 by checking against the last updated date, 15-07-2018, rather than checking from the first one, 12-07-2018.
Likewise, if I get 17-08-2018 f, then I have to get 17-08-2018 by checking against the last new date, 16-07-2018.
How do I achieve this? In which processor do I have to make modifications, or do I have to add new properties?
When the feed runs again, how does it take the latest watermark and work from there?
Two possible approaches come to mind:
Write your own Spark app which would be used (via ExecuteSparkJob) to read through the file that is being ingested. In this case, you keep track of the max date and, when you are done with the ingestion, persist it somewhere. If you're in the HDP world, an easy option would be to insert the max date into a Hive (transactional) table. You could also leverage a ZooKeeper znode for persistence, or even the PutDistributedMapCache processor that NiFi offers. (A rough sketch of this approach follows below.)
Write a custom NiFi processor which would basically do the same thing as the above, except that you have to make it work with data in different formats (CSV, JSON) yourself. Spark, in this regard, comes with many of those things built in.
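As a rough illustration of the first approach, a Spark job (Java API) that reads the ingested file, applies the previously persisted watermark, and writes the new max date back to a Hive table could look like the sketch below. The input path, date format, and the ingest_watermark(last_date DATE) table are assumptions, not part of the original answer.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.max;
import static org.apache.spark.sql.functions.to_date;

public class MaxDateWatermark {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("max-date-watermark")
                .enableHiveSupport()
                .getOrCreate();

        // Hypothetical input file being ingested, with a "date" column in dd-MM-yyyy format.
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .csv("/data/ingest/input.csv")
                .withColumn("event_date", to_date(col("date"), "dd-MM-yyyy"));

        // Hypothetical watermark table: ingest_watermark(last_date DATE).
        Row prev = spark.sql("SELECT max(last_date) FROM ingest_watermark").first();
        if (prev.get(0) != null) {
            // Only look at rows newer than the last persisted watermark.
            df = df.filter(col("event_date").gt(prev.getDate(0)));
        }

        // Compute the new max date and persist it for the next run.
        Row newest = df.agg(max(col("event_date"))).first();
        if (newest.get(0) != null) {
            spark.sql("INSERT INTO ingest_watermark VALUES (DATE '" + newest.getDate(0) + "')");
        }
        spark.stop();
    }
}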

How to query file in hdfs which has xml as one column

Context:
I have data in a table in MySQL with XML as one column.
For example: the table application has 3 fields:
id (integer), details (xml), address (text)
(In the real case I have 10-12 fields here.)
Now we want to query the whole table, with all its fields, using Pig.
I transferred the data from MySQL into HDFS using Sqoop, with the record delimiter '\u0005' and the column delimiter "`", to /x.xml.
Then I load the data from /x.xml into Pig using:
app = LOAD '/x.xml' USING PigStorage('\u0005') AS (id:int , details:chararray , address:chararray);
What is the best way to query such data?
Solutions I can currently think of:
Use a custom loader and extend LoadFunc to read the data.
Is there some way to load a particular column using an XML path loader while loading the rest normally? Please suggest whether this can be done.
All the examples I have seen that use XPath use an XML loader when loading the file.
For example:
A = LOAD 'xmls/hadoop_books.xml' using org.apache.pig.piggybank.storage.XMLLoader('BOOK') as (x:chararray);
Is Pig a good fit for querying this kind of data? Please suggest any alternative technologies that handle it more effectively.
The size of the data is around 500 GB.
FYI, I am new to the Hadoop ecosystem and might be missing something trivial.
Load a specific column:
Some other Stack Overflow answers suggest preprocessing the data with awk (generating a new input that contains only the XML part).
A nicer workaround is to generate the specific data with an extra FOREACH on the xml column, like:
B = FOREACH app GENERATE details;
and store it so that it can be loaded with an XML loader.
Check the StreamingXMLLoader.
(You can also check Apache Drill; it may support this case out of the box.)
Or use a UDF for the XML processing; in Pig you just hand over the related xml field (see the sketch below).
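For the UDF route, a minimal Java EvalFunc sketch that pulls one value out of the xml column with XPath might look like this; the class name and the XPath expression are assumptions, not something from the original question.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.w3c.dom.Document;

// Pig UDF: extract one value from an XML chararray field using XPath.
public class XmlXPathExtract extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        try {
            String xml = (String) input.get(0);
            if (xml == null) {
                return null;
            }
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Hypothetical expression; adjust it to the actual structure of the details column.
            return xpath.evaluate("/details/name/text()", doc);
        } catch (Exception e) {
            throw new IOException("Failed to extract value from XML field", e);
        }
    }
}

In the Pig script you would then REGISTER the jar and call the UDF on the details field inside a FOREACH, leaving id and address untouched.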

How do I store data in multiple, partitioned files on HDFS using Pig

I've got a pig job that analyzes a large number of log files and generates a relationship between a group of attributes and a bag of IDs that have those attributes. I'd like to store that relationship on HDFS, but I'd like to do so in a way that is friendly for other Hive/Pig/MapReduce jobs to operate on the data, or subsets of the data without having to ingest the full output of my pig job, as that is a significant amount of data.
For example, if the schema of my relationship is something like:
relation: {group: (attr1: long,attr2: chararray,attr3: chararray),ids: {(id: chararray)}}
I'd really like to be able to partition this data, storing it in a file structure that looks like:
/results/attr1/attr2/attr3/file(s)
where the attrX values in the path are the values from the group, and the file(s) contain only ids. This would allow me to easily subset my data for subsequent analysis without duplicating data.
Is such a thing possible, even with a custom StoreFunc? Is there a different approach that I should be taking to accomplish this goal?
I'm pretty new to Pig, so any help or general suggestions about my approach would be greatly appreciated.
Thanks in advance.
MultiStorage wasn't a perfect fit for what I was trying to do, but it proved a good example of how to write a custom StoreFunc that writes multiple, partitioned output files. I downloaded the Pig source code and created my own storage function that parsed the group tuple, using each of the items to build up the HDFS path, and then parsed the bag of ids, writing one ID per line into the result file.
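As a sketch of the two parsing steps described (not the poster's actual code), the core logic of building the /results/attr1/attr2/attr3 path from the group tuple and flattening the bag into one id per line might look like this in Java, assuming Pig's Tuple and DataBag types. In a real StoreFunc these values would drive where the record writer places each line.

import java.util.ArrayList;
import java.util.List;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

// Helper logic only: derive the output path from the group tuple and
// flatten the bag of ids into one line per id.
public class PartitionPathHelper {

    // e.g. group = (attr1, attr2, attr3) -> "/results/attr1/attr2/attr3"
    public static String partitionPath(String baseDir, Tuple group) throws ExecException {
        StringBuilder path = new StringBuilder(baseDir);
        for (int i = 0; i < group.size(); i++) {
            path.append('/').append(group.get(i));
        }
        return path.toString();
    }

    // e.g. ids = {(id1),(id2)} -> ["id1", "id2"]
    public static List<String> idLines(DataBag ids) throws ExecException {
        List<String> lines = new ArrayList<>();
        for (Tuple t : ids) {
            lines.add(String.valueOf(t.get(0)));
        }
        return lines;
    }
}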

Get filename of Record in Hive

Is it possible to get the filename of a record in Hive? That would be incredibly helpful for debugging.
In my particular case, I have incorrect values in a table that is mapped to a folder with more than 100 large files. Using grep would be very inefficient.
Hive supports virtual columns, for example INPUT__FILE__NAME, which gives the input file's name for a mapper task.
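For example, the virtual column can be selected alongside regular columns. The sketch below does this from a small Java JDBC client; the HiveServer2 URL, table, column names, and predicate are all placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Requires the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) on the classpath.
public class FindBadRecordFiles {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 URL, table, and predicate.
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT INPUT__FILE__NAME, id FROM my_table WHERE some_col = 'bad_value'")) {
            while (rs.next()) {
                // Shows which HDFS file each offending record came from.
                System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }
        }
    }
}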
Have a look at the documentation here. It provides some examples of how to do this.
Unfortunately, I'm unable to test the same now. Let me know if this is working or not.
