NiFi: how to store maxTimestamp when using ListFile/GetFile processor? - apache-nifi

I am using MiNiFi 0.3 and NiFi 1.5.
We have a requirement to pull CSV data from folder 'A' using MiNiFi and send it to NiFi running on Linux.
For instance, a file arrives with 10 records at 1:00 AM. We need to move (not copy) the file from folder 'A' to the NiFi hub.
Ten minutes later (1:10 AM), the appended file arrives with the older 10 records plus 10 new ones, so it contains 20 records in total.
We need to send only the new 10 records to the NiFi hub.
I tried ListFile -> FetchFile, but since we need to move the data, this does not work.
Then I tried the GetFile processor, but it captures all 20 records.
Is there any way to achieve this scenario?
Thanks in advance.

With FetchFile, you can set the Completion Strategy property to Move File or even Delete File (and then use PutFile to write the file wherever you like).
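A minimal FetchFile configuration sketch, assuming the source file should be moved into an archive directory once it has been fetched (the directory path is only illustrative):
Completion Strategy = Move File
Move Destination Directory = /data/A/processed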

Related

NiFi GetMongo fetches data forever

I have millions of records in MongoDB and I want to use NiFi to move the data. Here is the scenario I want to run:
1) I will set up NiFi.
2) NiFi will automatically fetch records in batches of 100.
3) Once it is done, it will fetch only when a new entry is added.
I tried to apply this scenario to a small MongoDB collection (fetch from Mongo and store as a file) and I saw that NiFi repeats the process forever and duplicates the records.
Here is the flow I created in NiFi:
Are there any suggestions to solve this problem?
Unfortunately, GetMongo doesn't have state-tracking capabilities. There are similar questions where I have explained this. You can find them here:
Apache NIFI Jon is not terminating automatically
Apache Niffi getMongo Processor

Apache Nifi - Consume Kafka + Merge Content + Put HDFS to avoid small files

I have around 2,000,000 messages in a Kafka topic and I want to put these records into HDFS using NiFi, so I am using the PutHDFS processor along with ConsumeKafka_0_10, but it generates small files in HDFS. So I am using the MergeContent processor to merge the records before pushing the file.
This works fine for a small number of messages, but for topics with massive data it writes a single file for every record. Please help if the configuration needs changes.
Thank you!!
The Minimum Number of Entries is set to 1, which means a bin could contain anywhere from 1 up to the Maximum Number of Entries. Try setting it to something higher, like 100k.
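A rough starting point for the MergeContent properties (placeholder values, not tuned for your topic; adjust them to your data volume and latency requirements):
Merge Strategy = Bin-Packing Algorithm
Minimum Number of Entries = 100000
Maximum Number of Entries = 1000000
Minimum Group Size = 128 MB
Max Bin Age = 5 min
Max Bin Age acts as a safety valve so a bin is still flushed after some time even if the minimum thresholds are never reached.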

Creating larger NiFi flow files when using the ConsumeKafka processor

I've created a simple NiFi pipeline that reads a stream of data from a Kafka topic (using ConsumeKafka) and writes it to the HDFS (using PutHDFS). Currently, I'm seeing lots of small files being created on the HDFS. A new file is created about once a second, some with only one or two records.
I want fewer, larger files to be written to the HDFS.
I have the following settings in ConsumeKafka:
Message Demarcator = <new line>
Max Poll Records = 10000
Max Uncommitted Time = 20s
In the past I've used Flume instead of NiFi; it has batchSize and batchDurationMillis, which let me tweak how big the HDFS files are. It seems like ConsumeKafka in NiFi is missing a batchDurationMillis equivalent.
What's the solution in NiFi?
Using the Message Demarcator and Max Poll Records is the correct approach to get multiple messages per flow file. You may also want to slow down the ConsumeKafka processor by adjusting the Run Schedule (on the Scheduling tab) from 0 sec, which means run as fast as possible, to something like 1 second, or whatever makes sense for you, so that each execution grabs more data.
Even with the above, you would likely still want to place a MergeContent processor before PutHDFS and merge flow files together based on size, so that you wait until you have an appropriate amount of data before writing to HDFS.
How to use MergeContent will depend on the type of data you are merging. If you have Avro, there is a specific merge strategy for Avro. If you have JSON, you can merge the documents one after another, or you can wrap them with a header, footer, and demarcator to make a valid JSON array; see the sketch below for the JSON case.
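A minimal MergeContent sketch for the JSON-array case, assuming each flow file holds one JSON document (the size threshold is only a placeholder):
Merge Strategy = Bin-Packing Algorithm
Delimiter Strategy = Text
Header = [
Footer = ]
Demarcator = ,
Minimum Group Size = 128 MB
With Delimiter Strategy set to Text, the Header, Footer, and Demarcator values are used literally, so the merged content becomes a single valid JSON array.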

Fetch file last one minute ago from the current time using nifi

I'm writing multiple CSV files to my HDFS every minute using Logstash.
I need to get the files from the minute before the current time.
I'm using NiFi for this process.
For example, if it is 11:30 AM right now, I need to get ONLY the files that were saved one minute ago, i.e. at 11:29 AM.
What is the best approach here using NiFi?
Thank you.
You can use the following flow structure:
ListHDFS --> RouteOnAttribute --> FetchHDFS
ListHDFS lists all the files in an HDFS folder.
Use RouteOnAttribute to check whether the datetime embedded in the filename belongs to the previous minute, by converting a value such as '08-23-17-11-29-AM' into milliseconds (toNumber()).
Then compare it against the previous minute of the current datetime, i.e.
${now():toNumber():minus(60000)}.
Here we subtract one minute in milliseconds (60000) from the current datetime.
If the two values are equal, route that flow file to the FetchHDFS processor, which will fetch the file for the previous minute; a sketch of the routing rule follows below.
Please let me know if you face any issues.
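A sketch of the RouteOnAttribute rule, assuming the filename looks like '08-23-17-11-29-AM.csv'; the property name previous.minute and the date pattern are only illustrative, so adjust them to your actual filenames:
previous.minute = ${filename:substringBefore('.csv'):equals( ${now():toNumber():minus(60000):format('MM-dd-yy-hh-mm-a')} )}
This formats the current time minus one minute with the same pattern as the filename and compares the two strings; the previous.minute relationship can then be routed to FetchHDFS.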

When to move data to HDFS/Hive?

So I'm developing an application that is expected to deal with large amounts of data, and I've decided to use Hadoop to process it.
My services node and data nodes are separated from the web app, so I'm using HttpFS to let the app communicate with Hadoop.
So, whenever a new row of data is generated in my application, should I immediately call the corresponding HttpFS URL to append the data to an HDFS file? Or should I write the data to a file on the web server and upload it to HDFS with a cron job, for example every hour?
Should I keep the Hive table updated, or should I just load the data into it whenever I need to query it?
I'm pretty new to Hadoop, so any link that could help would also be useful.
I prefer the approach below.
Do not call the HttpFS URL to append data to an HDFS file for every row update. HDFS is efficient when the data file size is more than 128 MB (in Hadoop 2.x) or 64 MB (in Hadoop 1.x).
Write the data on the web server, and use a rolling appender that rolls when the file reaches a certain size, in multiples of 128 MB, e.g. a 1 GB file.
You can have hourly cron jobs, but make sure you are sending a big data file (e.g. 1 GB, or a multiple of 128 MB) instead of just sending whatever log file has accumulated in one hour.
Regarding loading the data, you can use internal or external Hive tables. Have a look at this article.
