Hi sir,
In the data ingest template I need to pick up the latest value of a date field.
For example, I have data with a date field:
date data
12-07-2018 a
13-07-2018 b
14-07-2018 c
15-07-2018 d
From that, I would like to take the latest one, i.e. 15-07-2018.
If the date field then receives new data
16-07-2018 e
then I have to get 16-07-2018 by checking against the last updated date, 15-07-2018, rather than starting again from the first one, 12-07-2018.
Likewise, if I get 17-08-2018 f, then I have to get 17-08-2018 by checking against the last new date, 16-07-2018.
How can I achieve this? In which processor do I have to make modifications or add new properties?
When the feed runs again, how does it pick up the latest watermark and continue from there?
Two possible approaches come to mind:
1) Write your own Spark app, run through ExecuteSparkJob, that reads the file being ingested. In this case you keep track of the max date and, when the ingestion is done, persist it somewhere. If you're in the HDP world, an easy option is to insert the max date into a Hive (transactional) table. You could also persist it in a ZooKeeper znode, or even use the PutDistributedMapCache processor that NiFi offers.
2) Write a custom NiFi processor that does basically the same thing as the one above, except that you have to make it work with different data formats (CSV, JSON) yourself. Spark, in this regard, comes with many things built in.
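Whichever option you pick, the core logic is the same: load the last persisted max date, keep only rows newer than it, and persist the new max date at the end. A minimal sketch in plain Python, assuming dd-MM-yyyy dates as in the question. The JSON file used as the watermark store and all function names are hypothetical; in practice the store would be the Hive table, znode, or map cache mentioned above.

```python
import json
from datetime import datetime
from pathlib import Path

DATE_FORMAT = "%d-%m-%Y"  # matches dates like 15-07-2018


def load_watermark(path):
    """Return the last persisted max date, or None on the first run."""
    p = Path(path)
    if not p.exists():
        return None
    stored = json.loads(p.read_text())
    return datetime.strptime(stored["max_date"], DATE_FORMAT)


def save_watermark(path, watermark):
    """Persist the max date so the next run can start from it."""
    payload = {"max_date": watermark.strftime(DATE_FORMAT)}
    Path(path).write_text(json.dumps(payload))


def ingest_new_rows(rows, watermark):
    """Keep only rows newer than the watermark; also return the new watermark."""
    new_rows = []
    new_watermark = watermark
    for date_text, data in rows:
        row_date = datetime.strptime(date_text, DATE_FORMAT)
        if watermark is None or row_date > watermark:
            new_rows.append((date_text, data))
            if new_watermark is None or row_date > new_watermark:
                new_watermark = row_date
    return new_rows, new_watermark
```

On the first run the watermark is None, so every row is taken and 15-07-2018 is saved. On the next run only 16-07-2018 passes the comparison, and it becomes the new watermark.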
Hi everyone. I'm learning about some NiFi processors.
I want to obtain all the data from several tables automatically.
So I used a ListDatabaseTables processor to get the names of the tables in a specific catalog.
After that, I used other processors, such as GenerateTableFetch and ReplaceText, to generate the queries. Everything works perfectly up to this point.
Finally, the ExecuteSQL processor comes into play, and here an error is displayed. It says that a datetime column cannot be converted to Avro format.
The problem is that there are several tables, so specifying and casting those columns for each one would be complicated.
Is there a way to fix this error?
The connection is to Microsoft SQL Server.
Here is an image of my flow:
I have a requirement to fetch data from Oracle and upload it to Google Cloud Storage.
I am using the ExecuteSQL processor, but it fails for large tables, and even for a table with 1 million records (approx. 45 MB) it takes 2 hours to pull.
The table names are passed via a REST API to ListenHTTP, which passes them to ExecuteSQL. I can't use QueryDatabaseTable because the number of tables is dynamic, and the calls that start the fetch are also dynamic, coming from a UI and the NiFi REST API.
Please suggest any tuning parameters for the ExecuteSQL processor.
I believe you are talking about having the capability to have smaller flow files and possibly sending them downstream while the processor is still working on the (large) result set. For QueryDatabaseTable this was added in NiFi 1.6.0 (via NIFI-4836) and in an upcoming release (NiFi 1.8.0 via NIFI-1251) this capability will be available for ExecuteSQL as well.
You should be able to use GenerateTableFetch to do what you want. There you can set the Partition Size (which will end up being the number of rows per flow file), and you don't need a Maximum Value Column if you want to fetch the entire table each time a flow file comes in (which also allows you to handle multiple tables as you described). GenerateTableFetch will generate the SQL statements to fetch "pages" of data from the table, which should give you better, incremental performance on very large tables.
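To make the "pages" idea concrete, here is a rough sketch of what splitting one table fetch into several statements looks like. This is not the SQL GenerateTableFetch actually emits (that depends on the database type configured on the processor). The function name, the ordering column, and the OFFSET/FETCH syntax are assumptions for illustration only.

```python
def page_queries(table, row_count, partition_size, order_column):
    """Yield one SELECT per page of at most partition_size rows."""
    if partition_size <= 0:
        raise ValueError("partition_size must be positive")
    for offset in range(0, row_count, partition_size):
        # Illustrative paging syntax; real statements vary by database.
        yield (
            f"SELECT * FROM {table} ORDER BY {order_column} "
            f"OFFSET {offset} ROWS FETCH NEXT {partition_size} ROWS ONLY"
        )
```

With 1,000,000 rows and a partition size of 10,000 this would yield 100 statements. Each one becomes its own flow file and can be executed downstream independently, instead of one query holding the whole result set.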
I am getting files from a remote server using NiFi. My files look as follows:
timestamp (ms), nodeID,value
12345,x,12.4
12346,x,12.7
12348,x,13.4
12356,x,13.6
12355,y,12.0
At the moment I just get and fetch the files, split the lines, and send them to Kafka. Beforehand, though, I need to apply a checksum approach to my records and aggregate them by timestamp. What do I need to do to add an additional column to my content and count the records per aggregated timestamp, for example aggregating over each 10 milliseconds and nodeID?
timestamp (ms), nodeID,value, counts
12345,x,12.4,3
12346,x,12.7,3
12348,x,13.4,3
12356,x,13.6,1
12355,y,12.0,1
How can I do the above in NiFi? I am totally new to NiFi but need to add this functionality to my NiFi flow. I am currently using the NiFi flow below.
This may not answer your question directly, but you should consider refactoring your flow to use the "record" processors. It would greatly simplify things and would probably get you closer to being able to do the aggregation.
The idea is to not split up the records, and instead process them in place. Given your current flow, the 4 processors after FetchSFTP would likely change to a single ConvertRecord processor that converts CSV to JSON. You would first need to define a simple Avro schema for your data.
Once you have the record processing set up, you might be able to use PartitionRecord to partition the records by the node ID, and from there the missing piece would be how to count by the timestamps.
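For the counting piece itself, the logic is small: bucket each timestamp into a 10 ms window, count records per (window, nodeID) pair, and append that count to every record in the group. Here is a sketch in plain Python, outside of NiFi, just to pin down the logic. It assumes the input has no header line and that the value in the fourth sample row is 13.6; the function name is made up.

```python
import csv
import io
from collections import Counter

BUCKET_MS = 10  # aggregation window from the question


def add_counts(csv_text):
    """Append the number of records sharing the same 10 ms bucket and nodeID."""
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    # First pass: count records per (bucket, nodeID).
    counts = Counter((int(ts) // BUCKET_MS, node) for ts, node, _ in rows)
    # Second pass: emit each record followed by its group's count.
    return [
        [ts, node, value, str(counts[(int(ts) // BUCKET_MS, node)])]
        for ts, node, value in rows
    ]
```

Timestamps 12345, 12346 and 12348 all fall into bucket 1234 for node x, so each of those rows gets a count of 3. 12356 (node x) and 12355 (node y) share bucket 1235 but differ in nodeID, so each gets 1. Note that two passes are needed because the count is only known once the whole group has been seen.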
Some additional resources...
https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
https://www.slideshare.net/BryanBende/apache-nifi-record-processing
I have a scenario I ran into while working with HBase. Initially I had to bulk-upload a CSV file to an HBase table, which I did successfully using HBase bulk loading.
Now I want to update a particular field in the HBase table by comparing it against a newly provided CSV, and if the value has changed, maintain a flag that says the row key was updated. Any hint on how I can do this easily?
Any help is really appreciated.
Thanks
HBase maintains versions for each cell. As long as you have the row key, you have a handle on the row and can just use put to add the updated column. Internally HBase maintains the versions, so you also have access to the history of the updated values.
However, as I can see, you also need the comparison. So after bulk loading, the fastest way to do it is a MapReduce job with HBase as both source and sink. Look here at section 7.2.2.
The idea is to have MapReduce perform the scan, do the comparison in the map, and write the updated put as output. It's a basic fetch, modify, and update sequence, but we use MapReduce's parallelism because we are dealing with a large amount of data.
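The comparison done in the map step can be sketched roughly like this. It is plain Python with dictionaries standing in for the scanned table and the new CSV, not HBase client code. The column names ("cf:price", "meta:updated") and the function name are hypothetical.

```python
def compare_and_flag(existing_rows, new_rows, column, flag_column="meta:updated"):
    """Return the puts to apply: row key -> {column: new value, flag: "true"}."""
    puts = {}
    for row_key, new_value in new_rows.items():
        current = existing_rows.get(row_key)
        if current is None:
            continue  # key not in the table; how to treat new keys is left open
        if current.get(column) != new_value:
            # Value changed: write the new value and mark the row as updated.
            puts[row_key] = {column: new_value, flag_column: "true"}
    return puts


# Example: only r2 changed, so only r2 gets a put with the flag set.
existing = {"r1": {"cf:price": "10"}, "r2": {"cf:price": "20"}}
incoming = {"r1": "10", "r2": "25"}
changed = compare_and_flag(existing, incoming, "cf:price")
```

Rows whose value is unchanged produce no put at all, so they keep their current version and no flag is written for them.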
I have a requirement to parse both Apache access logs and Tomcat logs, one after another, using MapReduce. A few fields are extracted from the Tomcat log and the rest from the Apache log. I need to merge/map the extracted fields based on the timestamp and export the mapped fields into a traditional relational DB (e.g. MySQL).
I can parse and extract the information using regular expressions or Pig. The challenge I am facing is how to map the information extracted from both logs into a single aggregate format or file, and how to export this data to MySQL.
A few approaches I am thinking of:
1) Write the MapReduce output from both the parsed Apache access logs and the Tomcat logs into separate files and merge those into a single file (again based on timestamp). Export this data to MySQL.
2) Use HBase or Hive to store the data in table format in Hadoop and export that to MySQL.
3) Directly write the output of MapReduce to MySQL using JDBC.
Which approach would be most viable? Please also suggest any alternative solutions you know of.
It's almost always preferable to have smaller, simpler MR jobs and chain them together than to have large, complex jobs. I think your best option is to go with something like #1. In other words:
1) Process Apache httpd logs into a unified format.
2) Process Tomcat logs into a unified format.
3) Join the output of 1 and 2 using whatever logic makes sense, writing the result into the same format.
4) Export the resulting dataset to your database.
You can probably perform the join and transform (1 and 2) in the same step. Use the map to transform and do a reduce side join.
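In case the reduce-side join is unfamiliar: each mapper emits its record keyed by the join key (here the timestamp) and tagged with its source, and the reducer merges everything that arrives under the same key. Here is a minimal in-memory sketch in Python, with a dictionary standing in for the shuffle. The field names and source tags are made up for illustration.

```python
from collections import defaultdict


def map_record(source, record):
    """Map step: key each unified record by timestamp and tag its source."""
    return record["timestamp"], (source, record)


def reduce_join(timestamp, tagged_records):
    """Reduce step: merge the fields of all records sharing a timestamp."""
    merged = {"timestamp": timestamp}
    for source, record in tagged_records:
        for field, value in record.items():
            if field != "timestamp":
                # Prefix with the source so fields from both logs cannot clash.
                merged[f"{source}_{field}"] = value
    return merged


def run_join(apache_records, tomcat_records):
    """Simulate the shuffle: group mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for source, records in (("apache", apache_records), ("tomcat", tomcat_records)):
        for record in records:
            key, value = map_record(source, record)
            groups[key].append(value)
    return [reduce_join(key, groups[key]) for key in sorted(groups)]
```

A timestamp that appears in only one of the logs still produces an output record, just with the other source's fields missing. Whether to keep or drop those is the "whatever logic makes sense" part.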
It doesn't sound like you need or want the overhead of random access, so I wouldn't look at HBase. This isn't its strong point (although you could do it in the random access sense by looking up each record in HBase by timestamp, seeing if it exists, and merging the record in or simply inserting it if it doesn't, but this is very slow, comparatively). Hive could be convenient for storing the "unified" result of the two formats, but you'd still have to transform the records into that format.
You absolutely do not want to have the reducers write to MySQL directly. This effectively creates a DDoS attack on the database. Consider a cluster of 10 nodes, each running 5 reducers: you'll have 50 concurrent writers to the same table. As you grow the cluster you'll exceed max connections very quickly and choke the RDBMS.
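One way to picture the alternative is a separate export step that reads the finished job output and loads it over a single connection in batches. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for MySQL, so that it runs anywhere. The table name, columns, and batch size are hypothetical, and a MySQL driver would typically use a different placeholder style.

```python
import sqlite3


def export_in_batches(conn, rows, batch_size=1000):
    """Load rows through one connection, committing once per batch."""
    cursor = conn.cursor()
    total = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        cursor.executemany(
            "INSERT INTO log_summary (ts, hits) VALUES (?, ?)", batch
        )
        conn.commit()
        total += len(batch)
    return total


# Stand-in database with a hypothetical summary table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_summary (ts TEXT, hits INTEGER)")
loaded = export_in_batches(conn, [(f"t{i}", i) for i in range(25)], batch_size=10)
```

The point is that the number of database connections stays fixed at one no matter how many reducers produced the data.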
All of that said, ask yourself if it makes sense to put this much data into the database, if you're considering the full log records. This amount of data is precisely the type of case Hadoop itself is meant to store and process long term. If you're computing aggregates of this data, by all means, toss it into MySQL.
Hope this helps.