How to perform windowing on a timestamp column for a 5 minute time interval in ADF mapping data flow

I want to apply a 5 minute window operation on a timestamp column in Mapping Data Flow. First I use Stream Analytics to get the telemetry data from Event Hub and store that data in CSV files on Blob Storage. After that I want to perform windowing with a 5 minute interval on the data stored in the CSV files through Mapping Data Flow, and then perform some aggregation on each window. How do I apply a 5 minute window to the timestamp column?
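For reference, the core of the operation is flooring each timestamp to the start of its 5 minute interval and grouping on that value; in a Mapping Data Flow this would typically be a derived window-start column feeding an Aggregate transformation. Below is a hedged sketch of the same bucketing logic in Python with pandas, not ADF expression syntax; the file name telemetry.csv and the columns ts and value are assumptions.

    import pandas as pd

    # Hypothetical file and column names; adjust to the real telemetry schema.
    df = pd.read_csv("telemetry.csv", parse_dates=["ts"])

    # Floor each timestamp to the start of its 5 minute window.
    df["window_start"] = df["ts"].dt.floor("5min")

    # Aggregate per window, e.g. average value and row count.
    agg = df.groupby("window_start").agg(
        avg_value=("value", "mean"),
        row_count=("value", "count"),
    )
    print(agg)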

Related

Incremental data mapping

How can we efficiently load the data from an incremental CSV file without reading the whole file repetitively?
I have used the timestamp information from the given file to load the data every 5 minutes, but if the timestamp information is not available, how can we make it work?
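One possible workaround when there is no timestamp column is to persist how far into the file the previous run got (for example the byte offset) and resume from there on the next run; this only works if the file is append-only. A minimal sketch in Python, where the file names and the print placeholder are assumptions:

    import os

    STATE_FILE = "last_offset.txt"   # hypothetical file holding the saved offset
    DATA_FILE = "incoming.csv"       # hypothetical incremental CSV

    # Load the offset reached by the previous run (0 on the first run).
    offset = 0
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            offset = int(f.read().strip() or 0)

    with open(DATA_FILE) as f:
        f.seek(offset)               # skip everything already processed
        for line in f:
            print(line.rstrip())     # placeholder for the real load logic
        offset = f.tell()            # remember where this run stopped

    with open(STATE_FILE, "w") as f:
        f.write(str(offset))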

Nifi record counts

I am getting files from a remote server using NiFi; my files are as follows:
timestamp (ms), nodeID,value
12345,x,12.4
12346,x,12.7
12348,x,13.4
12356,x,13.6
12355,y,12.0
Right now I just get, fetch, and split the lines and send them to Kafka, but beforehand I need to apply a checksum approach to my records and aggregate them by timestamp. What I need to do is add an additional column to my content and count the records based on aggregated timestamps, for example aggregating on each 10 milliseconds and nodeID:
timestamp (ms), nodeID,value, counts
12345,x,12.4,3
12346,x,12.7,3
12348,x,13.4,3
12356,x,13.6,1
12355,y,12.0,1
How do I do the above in NiFi? I am totally new to NiFi but need to add this functionality to my NiFi process. My current NiFi flow is shown below.
This may not answer your question directly, but you should consider refactoring your flow to use the "record" processors. It would greatly simplify things and would probably get you closer to being able to do the aggregation.
The idea is to not split up the records, and instead process them in place. Given your current flow, the 4 processors after FetchSFTP would likely change to a single ConvertRecord processor that converts CSV to JSON. You would first need to define a simple Avro schema for your data.
Once you have the record processing set up, you might be able to use PartitionRecord to partition the records by the node id, and then from there the missing piece would be how to count by the timestamps.
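That counting step is just a group-by on the (timestamp bucket, nodeID) pair from your question. As a hedged sketch of the logic in plain Python (outside NiFi), assuming the CSV has headers named timestamp, nodeID, and value:

    import csv
    from collections import Counter

    BUCKET_MS = 10  # aggregate timestamps into 10 millisecond buckets

    with open("records.csv") as f:       # hypothetical input file
        rows = list(csv.DictReader(f))

    # Count records per (timestamp bucket, nodeID) pair.
    counts = Counter(
        (int(r["timestamp"]) // BUCKET_MS, r["nodeID"]) for r in rows
    )

    # Emit each original row with its group count appended.
    for r in rows:
        key = (int(r["timestamp"]) // BUCKET_MS, r["nodeID"])
        print(r["timestamp"], r["nodeID"], r["value"], counts[key], sep=",")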
Some additional resources...
https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
https://www.slideshare.net/BryanBende/apache-nifi-record-processing

Sync database extraction with Hadoop

Let's say you have a periodic task that extracts data from a database and loads that data into Hadoop.
How does Apache Sqoop/NiFi maintain sync between the source database (SQL or NoSQL) and the destination storage (Hadoop HDFS or HBase, even S3)?
For example, let's say that at time A the database has 500 records and at time B it has 600 records, with some of the old records updated. Is there a mechanism that efficiently knows the difference between time A and time B, updating only the rows that changed and adding the missing rows?
Yes, NiFi has the QueryDatabaseTable processor, which can store state and incrementally fetch the records that got updated.
If your table has a date column that is updated whenever a record changes, you can use that column in the Maximum-value Columns property, and the processor will pull only the changes made since the last state value.
Here is a good article on the QueryDatabaseTable processor:
https://community.hortonworks.com/articles/51902/incremental-fetch-in-nifi-with-querydatabasetable.html

Spark data frame, JOIN two datasets and De-dup the records by a key and latest timestamp of a record

I need some help with an efficient way to JOIN two datasets and de-dup the records by a key and the latest timestamp of a record.
Use case: need to run a daily incremental refresh for each table and provide a snapshot of the extract every day.
For each table we get a daily incremental file of about 150 million records, which needs to go through a de-dup process against a full-volume history file (about 3 billion records). The de-dup process needs to run on a composite primary key and keep the latest record by timestamp; every record contains the key and a timestamp. The files are available in ORC and Parquet format, and we are using Spark.
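A common Spark pattern for this is to union the daily increment with the history and keep one row per composite key using a window ordered by the timestamp descending. Below is a hedged PySpark sketch; the paths, the key columns key1/key2, and the ts column are assumptions, not names from the actual extract.

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical paths and column names; adjust to the real schema.
    history = spark.read.orc("/data/history")          # full-volume history (~3B rows)
    daily = spark.read.orc("/data/daily_increment")    # daily increment (~150M rows)

    key_cols = ["key1", "key2"]                        # composite primary key

    # Rank rows within each key by timestamp, newest first, and keep rank 1.
    w = Window.partitionBy(*key_cols).orderBy(F.col("ts").desc())

    snapshot = (
        history.unionByName(daily)
        .withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)
        .drop("rn")
    )

    snapshot.write.mode("overwrite").parquet("/data/snapshot")

This rewrites the full history every day; if that becomes too expensive, bucketing or partitioning both datasets by the key columns is the usual next step.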

Apache Nifi ExecuteSQL Processor

I am trying to fetch data from an Oracle database using the ExecuteSQL processor. Suppose there are 15 records in my Oracle database. When I run the ExecuteSQL processor, it runs continuously as a streaming process, stores all the records as a single file in HDFS, and then repeatedly does the same. As a result, many files end up in the HDFS location, each one re-fetching the already fetched records from the Oracle DB, so these files all contain the same data. How can I make this processor fetch all the data from the Oracle DB once, store it as a single file, and then, whenever new records are inserted into the DB, ingest only those into the HDFS location?
Take a look at the QueryDatabaseTable processor:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.QueryDatabaseTable/index.html
You will need to tell this processor one or more columns to use to track new records; this is the Maximum Value Columns property. If your table has a one-up id column you can use that, and every time it runs it will track the last id that was seen and start there on the next execution.
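To make that concrete, here is a hedged Python sketch of the same tracking logic, using an in-memory SQLite table with 15 rows as a stand-in for your Oracle table; the table and column names are assumptions, and this is not NiFi's internal implementation.

    import sqlite3

    def fetch_new_rows(conn, last_id):
        """Return rows with id greater than last_id, plus the new high-water mark."""
        rows = conn.execute(
            "SELECT id, payload FROM source_table WHERE id > ? ORDER BY id",
            (last_id,),
        ).fetchall()
        return rows, (rows[-1][0] if rows else last_id)

    # Hypothetical in-memory stand-in for the Oracle table with 15 records.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE source_table (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO source_table VALUES (?, ?)",
                     [(i, "row %d" % i) for i in range(1, 16)])

    last_id = 0
    rows, last_id = fetch_new_rows(conn, last_id)   # first run: all 15 rows
    print(len(rows), last_id)                       # 15 15

    conn.execute("INSERT INTO source_table VALUES (16, 'row 16')")
    rows, last_id = fetch_new_rows(conn, last_id)   # next run: only the new row
    print(len(rows), last_id)                       # 1 16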
