Fetch files from the last minute relative to the current time using NiFi - apache-nifi

I'm writing multiple CSV files to HDFS every minute using Logstash.
I need to get the files written during the past minute, relative to the current time.
I'm using NiFi for this.
For example, if it is now 11:30 AM, I need to get ONLY the files that were saved one minute ago, at 11:29 AM.
What is the best approach using NiFi?
Thank you.

You can use the following flow structure:
ListHDFS-->RouteOnAttribute-->FetchHDFS
ListHDFS lists all the files in the HDFS folder.
Use RouteOnAttribute to check whether the date-time in the filename falls in the previous minute: convert it (for example '08-23-17-11-29-AM') into milliseconds (toDate() with a matching format, then toNumber()).
Then compare that value with the current date-time minus one minute:
${now():toNumber():minus(60000)}
Here 60000 is one minute in milliseconds, subtracted from the current date-time. Because now() also carries seconds and milliseconds, compare the two values at minute precision rather than expecting an exact millisecond match.
If the two match, route the flow file to the FetchHDFS processor, which fetches that particular previous-minute file.
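The comparison itself can be sketched outside NiFi. The Python below is only an illustration of the minute-precision check, not NiFi Expression Language; the function name and the filename timestamp layout (month-day-year-hour-minute-AM/PM, taken from the '08-23-17-11-29-AM' example) are assumptions.

```python
from datetime import datetime, timedelta

# Assumed filename timestamp layout, e.g. "08-23-17-11-29-AM"
# (month-day-year-hour-minute-AM/PM); adjust to the real naming scheme.
STAMP_FORMAT = "%m-%d-%y-%I-%M-%p"


def is_from_previous_minute(stamp: str, now: datetime) -> bool:
    """Return True if `stamp` names the minute just before `now`."""
    file_minute = datetime.strptime(stamp, STAMP_FORMAT)
    # Drop seconds and microseconds so both sides have minute precision.
    previous_minute = now.replace(second=0, microsecond=0) - timedelta(minutes=1)
    return file_minute == previous_minute


if __name__ == "__main__":
    now = datetime(2017, 8, 23, 11, 30, 42)
    print(is_from_previous_minute("08-23-17-11-29-AM", now))  # True
    print(is_from_previous_minute("08-23-17-11-28-AM", now))  # False
```

Truncating the current time to the minute before subtracting is what makes the equality test meaningful; comparing raw millisecond values would almost never match.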
Please let me know if you face any issues.

Related

NiFi: how to get maximum timestamp from first column?

NiFi version 1.5
I have a CSV file that first arrives like this:
datetime,a.DLG,b.DLG,c.DLG
2019/02/04 00:00,86667,98.5,0
2019/02/04 01:00,86567,96.5,0
I used ListFile -> FetchFile to get the CSV file.
Ten minutes later, I get the same CSV file with rows appended:
datetime,a.DLG,b.DLG,c.DLG
2019/02/04 00:00,86667,98.5,0
2019/02/04 01:00,86567,96.5,0
2019/02/04 02:00,86787,99.5,0
2019/02/04 03:00,86117,91.5,0
How can I get only the new records (the last two)? I do not want to process the first two records, which have already been processed.
My thought is to store the maximum datetime in an attribute and use QueryRecord, but I do not know which processor can give me the maximum datetime.
Is there a better solution?
This is currently an open issue (NIFI-6047) but there has been a community contribution to address it, so you may see the DetectDuplicateRecord processor in an upcoming release of NiFi.
There may be a workaround to split up the CSV rows and create a compound key using ExtractText, then using DetectDuplicate.
This doesn't seem to be a job best solved in NiFi, as you need to keep state about what you have already processed. An alternative is to delete what you have already processed; then you can assume that whatever is in the file has not been processed yet.
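To make the "keep state" point concrete, here is a minimal Python sketch of the idea raised in the question: remember the maximum datetime seen so far and keep only rows newer than it. It is not a NiFi processor; the function name is hypothetical, and the time format is assumed from the sample rows.

```python
import csv
import io
from datetime import datetime
from typing import Optional

TIME_FORMAT = "%Y/%m/%d %H:%M"  # assumed from the sample rows, e.g. 2019/02/04 01:00


def new_records(csv_text: str, last_seen: Optional[datetime]):
    """Return (rows newer than last_seen, updated maximum datetime)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    stamps = [datetime.strptime(row["datetime"], TIME_FORMAT) for row in rows]
    fresh = [
        row for row, stamp in zip(rows, stamps)
        if last_seen is None or stamp > last_seen
    ]
    # Never let the stored maximum move backwards.
    candidates = stamps + ([last_seen] if last_seen is not None else [])
    return fresh, max(candidates, default=None)


if __name__ == "__main__":
    header = "datetime,a.DLG,b.DLG,c.DLG\n"
    first = header + "2019/02/04 00:00,86667,98.5,0\n2019/02/04 01:00,86567,96.5,0\n"
    second = first + "2019/02/04 02:00,86787,99.5,0\n2019/02/04 03:00,86117,91.5,0\n"
    _, seen = new_records(first, None)
    fresh, seen = new_records(second, seen)
    print([row["datetime"] for row in fresh])
```

The stored maximum has to survive between runs, which is exactly the state the answer above says is awkward to keep in a flow.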
"here, how do we need to get only new records alone (last two records). i do not want to process first two records that is already been processed."
From my understanding, the actual question is 'how to process/ingest CSV rows as they are written to the file?'.
Description of the 'TailFile' processor from the NiFi documentation:
"Tails" a file, or a list of files, ingesting data from the file as it is written to the file. The file is expected to be textual. Data is ingested only when a new line is encountered (carriage return or new-line character or combination)
This solution is appropriate when you don't want to move or delete the actual file.
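The tailing idea can be illustrated with a small Python sketch: remember the byte offset reached last time and read only complete lines appended after it. This is not how TailFile is implemented, only a simplified picture of the behaviour described; the function name is hypothetical.

```python
import os
import tempfile


def read_new_lines(path: str, offset: int):
    """Return (complete lines appended after `offset`, new offset)."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        chunk = handle.read()
    # Consume only up to the last newline; a partial trailing line waits
    # until the writer finishes it.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode("utf-8").splitlines()
    return lines, offset + end


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "feed.csv")
    with open(path, "wb") as out:
        out.write(b"2019/02/04 00:00,86667,98.5,0\n")
    lines, offset = read_new_lines(path, 0)
    with open(path, "ab") as out:
        out.write(b"2019/02/04 01:00,86567,96.5,0\n")
    lines, offset = read_new_lines(path, offset)
    print(lines)  # only the appended row
```

Note that this only works when the file is appended to in place; if the whole file is rewritten on each arrival, the stored offset no longer lines up with the new content.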

NiFi: how to store maxTimestamp when using ListFile/GetFile processor?

I am using MiNiFi 0.3 and NiFi 1.5.
We have a requirement to pull CSV data from folder 'A' using MiNiFi and send it to NiFi running on Linux.
For instance, if a file arrives with 10 records at 1 AM, we need to move (not copy) the file from folder 'A' to the NiFi hub.
After 10 minutes (1:10 AM), the appended file arrives with the older 10 records plus 10 new records, so it contains 20 records in total.
We need to send only the 10 new records to the NiFi hub.
I tried ListFile -> FetchFile, but since we need to move the data, this does not work.
Then I tried the GetFile processor, but it captures all 20 records.
Is there any way to achieve this?
Thanks in advance.
With FetchFile, you can set its Completion Strategy property to Move File or even Delete File (and then PutFile it whenever you like).

Nifi Job to execute a spark submit command not giving correct results

I have Spark code that appends data from a Hive table to Parquet files partitioned by date. The code runs correctly when executed from the Spark shell, and the Parquet files show exactly the same number of rows as the Hive table for the corresponding date.
However, when the same code is packaged in a jar file, the jar is called by a spark-submit command, and that command is scheduled to run daily at 9 AM via NiFi, the number of rows in the Parquet partition files comes out lower. We are on the P_NO_SLA queue, and below are some of the facts and observations we have:
• Data in the source Hive table is updated by approximately 4 AM.
• Initially our NiFi job was scheduled to start at 4:45 AM, but the number of records did not match. After a manual update from the Spark shell after 6 AM, the data was an exact match.
• Hence, we scheduled the job to run at 7 AM. After this, when the number of records was low (approx. 20000 on weekends) compared to weekdays (in the range of 150000 to >200000 records), the data was updated correctly via the NiFi job. Again, a manual run was done to backfill the missing data.
• Again, we postponed the job to 9 AM. After this, there were 2 days when the number of records matched (between 160000 and 200000); however, since Jul-31 the data hasn't matched at all, irrespective of the number of records on any given day, and we have to do a manual backfill every day.
We are unable to figure out any specific reason why the code runs correctly from the Spark shell at any time but gives incorrect results from NiFi, when NiFi is just scheduled to execute the spark-submit command that runs the jar file containing the same Spark code.
Please help me with understanding why this would be happening and how I can fix this.
P.S.: I have checked the Nifi log files, and could not find any of the scheduled jobs giving an error.

Shell script for hourly run to pull data if exists

I am trying to optimize our batch process to pull and insert data into a database. Currently, we have a data source that we pull our data from, create a text file, and load into our reporting database. We have that on a time schedule in Autosys, since most of the time, data is available by a certain time. However, lately, the data source has been late and we are not able to pull the data during the scheduled time and we have to manually run the shell script.
I want to have a shell script that runs the queries every hour and if the data exists, spools to a file to be loaded into the DB. If the data isn't there, then try again next hour so that we can eliminate any manual intervention.
I know I can set up a file trigger in Autosys to run the load into the database if the file exists, but I am having issues setting up the shell script to pull the data only once it is available and not repeat the next hour if the file has already been spooled. I am new to UNIX, so I am not sure how to proceed. Any help would be great.
You haven't stated your priority clearly. The priorities could be:
• load the data as soon as it is available
• load the data at least once every x minutes or hours
• eliminate any need for manual intervention (which is clear from your question)
This is what you could do, assuming there is no need to load the data as soon as it is available:
• increase the frequency of the Autosys job (instead of hourly, maybe run it every 30 or 15 minutes)
• change the script so that:
  • it attempts to load only if it has been x minutes since the last successful load, and otherwise does nothing and ends in success
  • it stores the last successful load timestamp in a file (which is touched only upon a successful load)
• if data doesn't arrive even after x + some buffer minutes, it might make more sense for the load job to fail so that it gets the required attention.
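The decision logic in that list can be sketched as follows. This is a Python illustration rather than a shell script, and every name in it (the functions, the stamp file, the four outcome labels) is hypothetical; the actual query and spool steps are left out.

```python
from pathlib import Path
from typing import Optional


def read_last_success(stamp_file: Path) -> Optional[float]:
    """Return the epoch time of the last successful load, or None."""
    try:
        return float(stamp_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None


def record_success(stamp_file: Path, now: float) -> None:
    """Touch the stamp file; call this only after a successful load."""
    stamp_file.write_text(str(now))


def decide(last_success: Optional[float], now: float,
           interval_s: float, buffer_s: float, data_available: bool) -> str:
    """Return 'skip', 'load', 'wait', or 'fail' for this scheduled run."""
    if last_success is not None and now - last_success < interval_s:
        return "skip"  # loaded recently: do nothing and end in success
    if data_available:
        return "load"
    if last_success is not None and now - last_success > interval_s + buffer_s:
        return "fail"  # data is overdue: fail so the job gets attention
    return "wait"      # nothing yet: try again on the next scheduled run


if __name__ == "__main__":
    hour, buffer_s = 3600.0, 900.0
    print(decide(None, 0.0, hour, buffer_s, data_available=True))
    print(decide(0.0, 600.0, hour, buffer_s, data_available=True))
    print(decide(0.0, 2 * hour, hour, buffer_s, data_available=False))
```

Keeping the decision in one pure function makes it easy to check the edge cases before wiring it to the scheduler.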

How to keep a state in Hadoop jobs?

I'm working on a Hadoop program which is scheduled to run once a day. It takes a bunch of JSON documents, and each document has a timestamp showing when it was added. My program should only process the documents that were added since its last run. So I need to keep a state: a timestamp showing the last time my Hadoop job ran. I was thinking of storing this state in SQL Server and querying it in the driver program of my job. Is this a good solution, or is there a better one?
P.S. My Hadoop job is running on HDInsight. Given that, is it still possible to query the SQL Server from my driver program?
We solved this problem for our workflows running in AWS (Amazon Web Services), with the data stored in S3.
Our setup:
• Data store: AWS S3
• Data ingestion mechanism: Flume
• Workflow management: Oozie
• Storage for file status: MySQL
Problem:
We were ingesting data into Amazon S3 using Flume. All the ingested data was in the same folder. (S3 is a key/value store and has no concept of a folder; here "folder" means that all the data had the same key prefix. For example, for /tmp/1.txt and /tmp/2.txt, the key prefix is /tmp/.)
We had an ETL workflow that was scheduled to run once an hour. But since all the data was ingested into the same folder, we had to distinguish between processed and unprocessed files.
For example, suppose the data ingested in the first hour is:
/tmp/1.txt
/tmp/2.txt
When the workflow starts for the first time, it should process data from "1.txt" and "2.txt" and mark them as Processed.
If for the second hour, the data ingested is:
/tmp/3.txt
/tmp/4.txt
/tmp/5.txt
Then, the total data in the folder after 2 hours will be:
/tmp/1.txt
/tmp/2.txt
/tmp/3.txt
/tmp/4.txt
/tmp/5.txt
Since "1.txt" and "2.txt" were already processed and marked as Processed, during the second run the job should process just "3.txt", "4.txt" and "5.txt".
Solution:
We developed a library (let's call it FileManager) for managing the list of processed files. We plugged this library into the Oozie workflow as a Java action. This was the first step in the workflow.
This library also took care of ignoring files that were still being written by Flume. While Flume was writing data into a file, the file had a "_current" suffix, so such files were skipped until they were completely written.
The ingested files were generated with a timestamp as a suffix, for example "hourly_feed.1234567", so the file names were in ascending order of their creation.
To get the list of unprocessed files, we used S3's feature of querying with markers (for example, if you have 10,000 files in a folder and you specify the name of the 5,000th file as the marker, S3 returns files 5,001 to 10,000).
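The marker behaviour can be pictured with a small in-memory Python sketch. It does not call S3; a sorted list of key names stands in for the bucket listing, the function name is hypothetical, and the "_current" filter reflects the suffix convention described above.

```python
from bisect import bisect_right
from typing import List


def list_after_marker(keys: List[str], marker: str) -> List[str]:
    """Return finished keys that sort strictly after `marker`."""
    ordered = sorted(keys)
    newer = ordered[bisect_right(ordered, marker):]
    # Skip files that are still being written (assumed "_current" suffix).
    return [key for key in newer if not key.endswith("_current")]


if __name__ == "__main__":
    keys = [
        "hourly_feed.1234567",
        "hourly_feed.1234568",
        "hourly_feed.1234569",
        "hourly_feed.1234570_current",
    ]
    print(list_after_marker(keys, "hourly_feed.1234567"))
    # ['hourly_feed.1234568', 'hourly_feed.1234569']
```

Because the ordering is lexicographic, this only matches creation order when the timestamp suffixes all have the same number of digits.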
We had the following 3 states for each file:
• SUCCESS - Files which were successfully processed
• ERROR - Files which were picked up for processing, but there was an error in processing them; hence, these files need to be picked up again for processing
• IN_PROGRESS - Files which have been picked up for processing and are currently being processed by a job
For each file, we stored the following details in the MySQL DB:
• File Name
• Last Modified Time - We used this to handle some corner cases
• Status of the file (IN_PROGRESS, SUCCESS, ERROR)
The FileManager exposed the following interfaces:
• GetLatestFiles: Return the list of latest unprocessed files
• UpdateFileStatus: After processing the files, update the status of the files
The following steps were used to identify the files that were not yet processed:
1. Query the database (MySQL) for the last file with status SUCCESS (query: order by created desc).
2. If the first step returns a file, query S3 with the marker set to the last successfully processed file. This returns all the files ingested after it.
3. Also query the DB to check whether any files are in ERROR status. These files need to be re-processed because a previous workflow run did not process them successfully.
4. Return the list of files obtained from Steps 2 and 3 (before returning them, mark their status as IN_PROGRESS).
5. After the job completes successfully, update the status of all the processed files to SUCCESS. If there was an error in processing the files, update the status of all the files to ERROR (so that they can be picked up for processing next time).
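A rough Python sketch of those steps is shown below. It is not the original library: an in-memory SQLite table stands in for MySQL, a plain list of names stands in for the S3 listing, and ordering by file name replaces "order by created" on the assumption that names ascend with creation time. The method names mirror GetLatestFiles and UpdateFileStatus but are otherwise hypothetical.

```python
import sqlite3
from bisect import bisect_right
from typing import Iterable, List


class FileManager:
    """Tracks per-file status (IN_PROGRESS, SUCCESS, ERROR)."""

    def __init__(self) -> None:
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE files (name TEXT PRIMARY KEY, status TEXT NOT NULL)"
        )

    def get_latest_files(self, store_keys: Iterable[str]) -> List[str]:
        """Return unprocessed files and mark them IN_PROGRESS."""
        keys = sorted(store_keys)
        # Step 1: the last file that was processed successfully.
        row = self.db.execute(
            "SELECT name FROM files WHERE status = 'SUCCESS' "
            "ORDER BY name DESC LIMIT 1"
        ).fetchone()
        # Step 2: everything after that marker (or everything, if none yet).
        marker = row[0] if row else ""
        newer = keys[bisect_right(keys, marker):]
        # Step 3: files left in ERROR by an earlier run.
        failed = [r[0] for r in self.db.execute(
            "SELECT name FROM files WHERE status = 'ERROR'"
        )]
        # Step 4: merge both lists, mark them IN_PROGRESS, and return them.
        batch = sorted(set(newer) | set(failed))
        self.update_file_status(batch, "IN_PROGRESS")
        return batch

    def update_file_status(self, names: Iterable[str], status: str) -> None:
        """Step 5: record SUCCESS or ERROR (also used for IN_PROGRESS)."""
        self.db.executemany(
            "INSERT OR REPLACE INTO files (name, status) VALUES (?, ?)",
            [(name, status) for name in names],
        )
        self.db.commit()


if __name__ == "__main__":
    manager = FileManager()
    first = manager.get_latest_files(["/tmp/1.txt", "/tmp/2.txt"])
    manager.update_file_status(first, "SUCCESS")
    everything = ["/tmp/%d.txt" % i for i in range(1, 6)]
    print(manager.get_latest_files(everything))
    # ['/tmp/3.txt', '/tmp/4.txt', '/tmp/5.txt']
```

The sketch does not guard against two runs overlapping, so a file already IN_PROGRESS could be handed out twice; that is one way the duplicates mentioned under de-duplication can arise.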
We used Oozie for workflow management. The Oozie workflow had the following steps:
Step 1: Fetch next set of files to be processed, mark each of their state as IN_PROGRESS and pass them to the next stage
Step 2: Process the files
Step 3: Update the status of the processing (SUCCESS or ERROR)
De-duplication:
When you implement such a library, there is a possibility of duplicate records (in some corner cases, the same file may be picked up twice for processing). We implemented de-duplication logic to remove duplicate records.
You can name the result documents using the date-time; then your program can decide which documents to process based on their names.
Having the driver program check the last-run timestamp is a good approach, but for storing the last-run timestamp you can use a temporary file on HDFS.
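As a minimal sketch of that last suggestion, the Python below keeps the last-run timestamp in a small state file and selects only documents added after it. A local file stands in for the HDFS file, and the function names, the `last_run` key, and the `added_at` document field are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def load_last_run(state_path: Path) -> float:
    """Return the stored last-run timestamp, or 0.0 on the first run."""
    if not state_path.exists():
        return 0.0
    return float(json.loads(state_path.read_text())["last_run"])


def save_last_run(state_path: Path, timestamp: float) -> None:
    """Persist the timestamp once the job has finished successfully."""
    state_path.write_text(json.dumps({"last_run": timestamp}))


def select_new(documents: List[Dict], last_run: float) -> List[Dict]:
    """Keep documents whose (assumed) added_at field is after last_run."""
    return [doc for doc in documents if doc["added_at"] > last_run]


if __name__ == "__main__":
    state = Path(tempfile.mkdtemp()) / "last_run.json"
    docs = [{"id": 1, "added_at": 100.0}, {"id": 2, "added_at": 200.0}]
    batch = select_new(docs, load_last_run(state))
    save_last_run(state, max(doc["added_at"] for doc in batch))
    docs.append({"id": 3, "added_at": 300.0})
    print([doc["id"] for doc in select_new(docs, load_last_run(state))])  # [3]
```

Saving the newest document timestamp rather than the wall-clock time of the run avoids skipping documents that arrive while the job is still running.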
