TailFile Processor - Apache NiFi

I'm using the TailFile processor to fetch logs from a cluster (3 nodes), scheduled to run every minute. The log file name changes every hour.
I'm confused about which tailing mode I should use. If I use Single File, it does not fetch the new file generated after an hour. If I use Multiple Files, it only fetches the file about three minutes after the file name changes, by which point the file has already grown. What should the rolling filename pattern be for my files, and which mode should I use?
Could you please let me know. Thank you.
My filenames:
retrieve-11.log (generated at 11:00) - this file is removed, but Single File mode keeps checking for it
retrieve-12.log (generated one hour later, at 12:00)
My Processor Configuration:
Tailing mode: Multiple Files
File(s) to Tail: retrieve-${now():format("HH")}.log
Rolling Filename Pattern: ${filename}.*.log
Base Directory: /ext/logs
Initial Start Position: Beginning of File
State Location: Local
Recursive lookup: false
Lookup Frequency: 10 minutes
Maximum age: 24 hours

Sounds like you aren't really doing normal log file rolling. That would be, for example, where you write to logfile.log, then after 1 day move logfile.log to logfile.log.1 and write new logs to a new, empty logfile.log.
Instead, it sounds like you are just writing logs to a different file based on the hour. I assume this means you overwrite each file every 24 hours?
So something like this might work:
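Roughly the following, as a sketch: the wildcard avoids hard-coding the hour, and the rest of your configuration can stay as it is (the Base Directory is taken from your setup):

Tailing mode: Multiple Files
File(s) to Tail: retrieve-*.log
Base Directory: /ext/logs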
EDIT:
So given that you are doing the following:
At 10:00, `retrieve-10.log` is created. Logs are written here.
At 11:00, `retrieve-11.log` is created. Logs are now written here.
At 11:10, `retrieve-10.log` is moved.
TailFile is only run every 10 minutes.
Then targeting a file based on the hour won't work. At 10:00 your TailFile only reads retrieve-10.log; at 11:00 it only reads retrieve-11.log. So in the worst case you miss 10 minutes of logs between 10:50 and 11:00.
Given that another process is cleaning up the old files, there isn't going to be a backlog of old files to worry about. So it sounds like there's no need to set the hour specifically.
tailing mode: multiple files
files to tail: /path/retrieve-*.log
With this, at 10:00 TailFile tails retrieve-9.log and retrieve-10.log. At 10:10, retrieve-9.log is removed and it tails only retrieve-10.log. At 11:00 it tails retrieve-10.log and retrieve-11.log. At 11:10, retrieve-10.log is removed and it tails only retrieve-11.log. And so on.

Related

logstash won’t read single line files

I'm trying to make a pipeline that sends XML documents to Elasticsearch. The problem is that each document is in its own separate file, as a single line without a \n at the end.
Is there any way to tell logstash not to wait for \n, but to read the whole file to EOF and send it?
Can you specify which logstash version you are using, and can you share your configuration?
It may depend on the mode you set: it can be tail or read. It defaults to tail, which means it watches your file and waits (1 hour by default) before closing it and giving up on new lines.
You may have to change this parameter from 1 hour to 1 second if you know the whole file has already been written:
file {
  path => "/path/to/docs/*.xml"   # hypothetical; point this at your input files
  close_older => "1 second"
}
Let me know if that works!
Docs here: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html#plugins-inputs-file-close_older
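Since the plugin also has a read mode, that may fit your use case (whole files, no tailing) even better. A sketch, assuming a plugin version recent enough to have mode and the file_completed_* options:

input {
  file {
    path => "/path/to/docs/*.xml"              # hypothetical glob
    mode => "read"                             # read each file to EOF, then close it
    file_completed_action => "log"             # default is "delete"; log instead
    file_completed_log_path => "/tmp/done.log" # hypothetical path for that log
  }
}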

Issues with mkdbfile in a simple "read a file > Create a hashfile job"

Hello DataStage-savvy people.
Two days in a row, the same single DataStage job failed (it hung and never stopped).
The job tries to create a hashfile using the command /logiciel/iis9.1/Server/DSEngine/bin/mkdbfile /[path to hashfile]/[name of hashfile] 30 1 4 20 50 80 1628
(last trace in the log)
Something to consider (or maybe not?):
The [name of hashfile] directory exists (and was last modified at the time of execution), but the file D_[name of hashfile] does not.
I am trying to understand what happened, to prevent the same incident from happening on the next run (tonight).
Before this, the job had been in production for a long time, and we had no issue with it.
Using Datastage 9.1.0.1
Did you check the job log to see if it captured an error? When a DataStage job executes any system command via a command execution stage or similar methods, the stdout of the called command is captured and then added to a message in the job log. Thus if the mkdbfile command gives any output (success messages, errors, etc.) it should be captured and logged. The event may not be flagged as an error in the job log, depending on the return code, but the output should be there.
If there is no logged message revealing the cause of the failed create, a couple of things to check are:
-- Was the target directory on a disk that was possibly out of space at that time? (A quick check is sketched below.)
-- Do you have any antivirus software that scans directories on your system at certain times? If so, it can interfere with I/O. If it ran a scan at the same time you had the problem, you may wish to update the AV software settings to exclude the directory you were writing the hash file to.
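For the disk-space check, something as simple as this (run on the DataStage server; the hash file path placeholder is from your own command) shows free space on the relevant filesystems:

df -h /logiciel/iis9.1 /[path to hashfile]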

Apache flume rolling out the HDFS files on hourly basis

I'm new to Flume, and I am exploring options to roll over my HDFS files on an hourly basis using Flume.
In my project, Apache Flume reads messages from RabbitMQ and writes them to HDFS.
hdfs.rollInterval - closes the file based on the time interval since it was opened.
A new file is created only when Flume reads a message after the previous file has been closed, so this option does not solve our problem.
hdfs.path = /%y/%m/%d/%H - this option works fine and creates a folder on an hourly basis. But the problem is that the new folder is created only when a new message comes in.
For example: messages keep coming until 11:59, so the file is in the open state. Then no messages arrive until 12:30, but the file is still open. After 12:30 a new message comes in; because of the hdfs.path configuration, the previous file is closed and a new file is created in a new folder.
The previous file cannot be used for computation until it is closed.
We need a way to close the open files exactly on the hour. I'm wondering if there are any options in Flume for doing that.
hdfs.rollInterval is described as
Number of seconds to wait before rolling current file
So this line should cause each file to stay open for an hour at a time
hdfs.rollInterval = 3600
And I would additionally ignore file size and event count, so add these as well
hdfs.rollSize = 0
hdfs.rollCount = 0
hdfs.idleTimeout is described as
Timeout after which inactive files get closed (0 = disable automatic closing of idle files)
For example, you can set this property to 180; a file that has received no events for 180 seconds is then closed even if the roll interval has not elapsed yet.
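Putting it together, a sketch of the relevant sink properties (the agent and sink names a1 and k1 are hypothetical; substitute your own):

a1.sinks.k1.hdfs.path = /%y/%m/%d/%H
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.idleTimeout = 180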

Script cannot pick a file a second before execution

I have a script which runs every 15 minutes (at 00, 15, 30, 45), checking whether a new file is present and emailing it.
Sometimes files are transferred to this location in the very last second (HH:MM:59).
E.g.
Let's say a file's modified timestamp is 16:14:59.
The script's next run is at 16:15:00.
The script is not able to pick up that file, so it is never sent via email.
Part of the code:
check=`find . -mmin -15`
If you schedule your script to start at 16:15:00.0000000, it will not start and finish in that exact nanosecond. Likewise, disk writes are not atomic, and starting to write a file at 16:14:59.9999999 does not mean it'll finish and have a timestamp before 16:15:00.
It will take hundreds or maybe thousands of milliseconds before all the OS processes are executed, initialized and scheduled, and all the disk reads finish. The exact time it takes to finish this is unpredictable.
That means that one of your jobs may run at 16:15:00.13 while the next runs at 16:30:01.22, leaving a second-long gap in which you can lose files.
Instead of checking "modified in the last 15 minutes" you should check for "modified since the last run of the script" (keeping track of the last processed filename or modification date), or at the very least, "modified on or after 16:15:00 but strictly before 16:30:00".
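A minimal sketch of the marker-file approach (the marker path is hypothetical):

#!/bin/sh
marker=/var/tmp/emailer.marker   # hypothetical state file recording the last run
newmark=$marker.new
touch "$newmark"                 # record the start of this run *before* scanning
if [ -e "$marker" ]; then
    check=$(find . -type f -newer "$marker")   # modified since the last run started
else
    check=$(find . -type f)      # first run: pick up everything
fi
mv "$newmark" "$marker"          # the next run scans from this run's start time
# A file written while find is running is caught on the next run,
# because runs overlap instead of leaving gaps.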

Monitor for file greater than 1 hour old

I need to be able to determine if a file is older than one hour and, if so, email, text, or page to notify that our process has started. I need this for any file that ends with .eob.
You can use DateTime date = File.GetCreationTime(str_Path);
where str_Path is the path of the .eob file.
Then compare the date variable with DateTime.Now (note: in C# this is a property, not a method),
and if the difference is greater than 1 hour, send your email or whatever you want to do next.
EDIT
To accomplish this you will have to make a small C#/VB.NET application that, when compiled, produces a .exe. Then you can schedule a job in Windows (e.g. with Task Scheduler) to execute your small app.
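A minimal sketch of that console app (the directory path is hypothetical; plug in your own notification code):

using System;
using System.IO;

class EobMonitor
{
    static void Main()
    {
        var dir = @"C:\data\eob";   // hypothetical folder where the .eob files land
        foreach (var path in Directory.GetFiles(dir, "*.eob"))
        {
            // Age based on creation time, as suggested above
            TimeSpan age = DateTime.Now - File.GetCreationTime(path);
            if (age > TimeSpan.FromHours(1))
            {
                Console.WriteLine($"{path} is {(int)age.TotalMinutes} minutes old");
                // send your email / text / page here
            }
        }
    }
}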
