UNIX file change interval - shell

Is there a way to determine the change frequency of a file?
The situation is that I have a log file which should be rolling all the time; as long as it is, I can say my application is running.
If nothing is being written, I can say there's some problem.
So instead of using tail and manually watching whether the logs are rolling, how can I check programmatically, say by analysing the file for 2 minutes and confirming that logs are being written?
Is there a way to track the change interval using stat in a program?
I mean, I could take 2 minutes as a parameter, store the mtime at the start, check the mtime again after 2 minutes, and confirm it changed. But I need to know the frequency: something like x modifications per unit time, or number of lines written per second.

A better idea would be to have inotify, gamin, or FAM notify you when the file has been modified.

On a Unix system, the stat() family of functions will obtain a file's metadata. The st_mtime member of the struct stat structure will give you the time of last modification.
Also on a Unix system, sending signal 0 to a process (via kill()) will tell you whether the process is still alive, without affecting the process.
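The polling idea from the question can be sketched as follows. This is a minimal sketch, assuming a one-second poll over a two-minute window; `watch_mtime` and `count_changes` are hypothetical names, and modifications faster than the poll interval are collapsed into one observed change.

```python
import os
import time

def count_changes(mtimes):
    """Count how many times the mtime value changed across successive samples."""
    changes = 0
    for prev, cur in zip(mtimes, mtimes[1:]):
        if cur != prev:
            changes += 1
    return changes

def watch_mtime(path, duration_s=120, poll_s=1.0):
    """Sample st_mtime every poll_s seconds for duration_s seconds and
    return the number of observed modifications (a lower bound, since
    changes faster than the poll interval are collapsed)."""
    samples = []
    deadline = time.time() + duration_s
    while time.time() < deadline:
        samples.append(os.stat(path).st_mtime)
        time.sleep(poll_s)
    return count_changes(samples)
```

For the question as asked, `watch_mtime("app.log", duration_s=120)` would report how many times the log's mtime changed over two minutes; dividing by the window gives modifications per second.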

Related

Obtaining Dynamically Changing Log Files

Does the problem I am facing have some kind of fancy name, like the "Dining philosophers problem" or the "Josephus problem"? This is so that I can do some research on it.
I want to retrieve the latest log file on Windows. The log file will change its name to log.2, log.3, log.4, and so on when it is full (50 MB, let's say), and the incoming log entries will be written to log.1.
Now, I have a solution to this. I poll the server intermittently to see whether the latest file (log.1) has any changes.
However, I soon found out that log.1 changes to log.2 at an unpredictable time, causing me to miss the log file (because I only retrieve log.1 when its "Date Modified" property has changed).
I hope there is some analogy I can give to make this easy to understand. The closest thing I can relate it to is a stroboscope freezing a fan with an unknown frequency, giving the illusion that the fan is frozen when it has actually spun many times. You get the gist.
Thanks in advance.
The solution will be to have your program keep track of the last modified dates for both files log.1 and log.2. When you poll, check log.2 for changes and then check log.1 for changes.
Most of the time, log.2 will not have changed. When it does, you read the updated data there, and then read the updated data in log.1. In code, it would look something like this:
DateTime log1ModifiedDate  // saved, and updated whenever it changes
DateTime log2ModifiedDate

if log2.DateModified != log2ModifiedDate
    Read and process data from log.2
    update log2ModifiedDate
if log1.DateModified != log1ModifiedDate
    Read and process data from log.1
    update log1ModifiedDate
I'm assuming that you poll often enough that log.1 won't have rolled over twice such that the file that used to be log.1 is now log.3. If you think that's likely to happen, you'll have to check log.3 as well as log.2 and log.1.
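The pseudocode above can be made concrete. This is a minimal sketch assuming the logs are plain files compared by mtime; `poll_logs` is a hypothetical name, and the injectable `getmtime` parameter exists only to make the logic easy to test.

```python
import os

def poll_logs(paths, last_seen, getmtime=os.path.getmtime):
    """Return the subset of paths whose modification time differs from the
    value recorded in last_seen, updating last_seen in place.
    paths should be ordered oldest-first (e.g. ["log.2", "log.1"]) so a
    rolled-over log.2 is noticed before the fresh data in log.1."""
    changed = []
    for path in paths:
        mtime = getmtime(path)
        if last_seen.get(path) != mtime:
            changed.append(path)
            last_seen[path] = mtime
    return changed
```

Each poll would call `poll_logs(["log.2", "log.1"], seen)` and read the files it returns, in order.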
Another way to handle this in Windows is to implement file change notification, which will tell you whenever a file changes in a directory. Those notifications are delivered to your program asynchronously. So rather than polling, you respond to notifications. In .NET, you'd use FileSystemWatcher. With the Windows API, you'd use FindFirstChangeNotification and associated functions. This CodeProject article gives a decent example.
Get the file list, sort it in descending order, take the first file, and read the log lines!

Checking whether ruby script is running in windows

I need to run a Ruby script for one week and check every hour whether it is still running.
Could you please suggest a way to do this? I need to check this on a Windows machine.
For example: I have a script called one_week_script.rb which will run for one week; during that time I want to check whether the script is running or not. If it is not running, I want to start it again from another script.
A typical solution is to use a "heartbeat" strategy. The process to be monitored notifies a "watchdog" process at a regular interval. A simple way of doing this might be to update the contents of some file every so often; the watchdog simply checks that same file to see whether it has recent data.
The alternative, simply checking whether the process is still loaded, has some weaknesses: the program could be locked up even though it's still apparently running. Using the heartbeat/watchdog style means you know the watched process is operating normally, because you're getting feedback from it.
In a typical scenario, you might just write the current time, and some arbitrary diagnostic data, say the number of bytes processed (whatever that might mean for you).
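The heartbeat pattern described above is language-neutral (the question is about Ruby, but the same shape works anywhere). A minimal sketch in Python, under stated assumptions: `write_heartbeat`, `heartbeat_age`, and the one-hour default are hypothetical names and values, not part of any particular library.

```python
import time

def write_heartbeat(path, now=None, processed=0):
    """Called by the monitored script on a regular interval: records the
    current time and an arbitrary diagnostic number (e.g. items processed)."""
    now = time.time() if now is None else now
    with open(path, "w") as f:
        f.write(f"{now} {processed}\n")

def heartbeat_age(path, now=None):
    """Watchdog side: seconds elapsed since the last heartbeat was written."""
    now = time.time() if now is None else now
    with open(path) as f:
        last, _ = f.read().split()
    return now - float(last)

def is_alive(path, max_age_s=3600, now=None):
    """True if the heartbeat is fresher than max_age_s (one hour by default)."""
    return heartbeat_age(path, now) <= max_age_s
```

The watchdog script would run hourly, call `is_alive`, and relaunch the worker when it returns False.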

Verify whether ftp is complete or not?

I have an application which is polling a folder continuously. Once any file is FTPed into the folder, the application has to move this file to some other folder for processing.
Here, we don't have any way to verify whether the FTP transfer is complete or not.
The command lsof was suggested in technical forums. It has a file-descriptor column which gives the file's status.
Since this is a BSD utility and not present in old versions of Linux, I want to clarify the usage of this command.
Can you tell me about your experience with file verification, and is there any other alternative solution available?
Also, is there any risk in using this utility?
Appreciate your help in advance.
Thanks,
Mathew Liju
We've done this before in a number of different ways.
Method one:
If you can control the process sending the files, have it send the file itself followed by a sentinel file. For example, send the real file "contracts.doc" followed by a one-byte "contracts.doc.sentinel".
Then have your listener process watch out for the sentinel files. When one of them is created, you should process the equivalent data file, then delete both.
Any data file that is more than a day old and has no corresponding sentinel file should be discarded; it was a failed transmission.
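Method one reduces to bookkeeping over a directory listing. A sketch under stated assumptions: `ready_and_stale` is a hypothetical name, modification times are passed in as a plain mapping to keep the logic testable, and the one-day cutoff matches the answer.

```python
def ready_and_stale(names, mtimes, now, max_age_s=86400, suffix=".sentinel"):
    """Split data files into those ready to process (sentinel present) and
    those to discard (no sentinel and older than max_age_s).
    names: iterable of filenames; mtimes: name -> mtime; now: current epoch."""
    names = set(names)
    ready, stale = [], []
    for name in sorted(names):
        if name.endswith(suffix):
            continue  # sentinel files themselves are never processed
        if name + suffix in names:
            ready.append(name)  # companion sentinel exists: transfer finished
        elif now - mtimes[name] > max_age_s:
            stale.append(name)  # no sentinel after a day: failed transmission
    return ready, stale
```

After processing a ready file, the listener would delete both the data file and its sentinel, as the answer describes.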
Method two:
Keep an eye on the files themselves (specifically the last modification date/time). Only process files whose modification time is more than N minutes in the past. That increases the latency of processing, but you can usually be certain that if a file hasn't been written to in, say, five minutes, it's done.
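Method two reduces to a simple age test. A minimal sketch, assuming the five-minute quiet period mentioned in the answer; `is_settled` and `settled_files` are hypothetical names.

```python
import os
import time

def is_settled(mtime, now, quiet_s=300):
    """True once the file has gone quiet_s seconds without modification."""
    return now - mtime >= quiet_s

def settled_files(directory, quiet_s=300):
    """Paths in directory whose last write is at least quiet_s seconds old;
    per the answer, treat those transfers as complete."""
    now = time.time()
    return [os.path.join(directory, name)
            for name in os.listdir(directory)
            if is_settled(os.path.getmtime(os.path.join(directory, name)),
                          now, quiet_s)]
```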
Conclusion:
Both those methods have been used by us successfully in the past. I prefer the first but we had to use the second one once when we were not allowed to change the process sending the files.
The advantage of the first one is that you know the file is ready when the sentinel file appears. With both lsof (I'm assuming you're treating files that aren't open by any process as ready for processing) and the timestamps, it's possible that the FTP crashed in the middle and you may be processing half a file.
There are normally three approaches to this sort of problem.
providing a signal file so that when your file is transferred, an additional file is sent to mark that transfer is complete
add an entry to a log file within that directory to indicate a transfer is complete (this really only works if you have a single peer updating the directory, to avoid concurrency issues)
parsing the file to determine completeness, e.g. does the file start with a length field, or is it obviously incomplete? For example, parsing an incomplete XML file will produce a parse error because the closing element is missing. Depending on your file's size and format, this can be trivial or very time-consuming.
lsof would possibly be an option, although you've identified your Linux portability issue. If you use this, note the -F option, which formats the output suitable for processing by other programs, rather than being human-readable.
EDIT: Pax identified a fourth (!) method I'd forgotten - using the fact that the timestamp of the file hasn't updated in some time.
There is a fifth method. You can also check whether the FTP session is still active. This works if every peer has its own FTP user account: as long as the user is not logged off from FTP, assume the files are not complete.

How can I speed up Perl's readdir for a directory with 250,000 files?

I am using Perl's readdir to get a file listing; however, the directory contains more than 250,000 files, which makes readdir take a long time (longer than 4 minutes) and use over 80 MB of RAM. As this is intended to be a recurring job running every 5 minutes, that lag is not acceptable.
More info:
Another job will fill the directory (once per day) being scanned.
This Perl script is responsible for processing the files. A file count is specified for each script iteration, currently 1000 per run.
The Perl script is to run every 5 min and process (if applicable) up to 1000 files.
The file-count limit is intended to allow downstream processing to keep up as Perl pushes data into a database, which triggers a complex workflow.
Is there another way to obtain filenames from the directory, ideally limited to 1000 (set by a variable), which would greatly increase the speed of this script?
What exactly do you mean when you say readdir is taking minutes and 80 MB? Can you show that specific line of code? Are you using readdir in scalar or list context?
Are you doing something like this:
foreach my $file ( readdir($dir) ) {
#do stuff here
}
If that's the case, you are reading the entire directory listing into memory. No wonder it takes a long time and a lot of memory.
The rest of this post assumes that this is the problem; if you are not using readdir in list context, ignore the rest of the post.
The fix for this is to use a while loop and use readdir in a scalar context.
while (defined(my $file = readdir $dir)) {
    # do stuff
}
Now you only read one item at a time. You can add a counter to keep track of how many files you process, too.
The solution might lie at the other end: in the script that fills the directory.
Why not create a directory tree to store all those files, so that you have many directories, each with a manageable number of files?
Instead of creating "mynicefile.txt", why not "m/my/mynicefile", or something like that?
Your file system would thank you for it (especially if you remove the empty directories when you have finished with them).
This is not exactly an answer to your query, but I think having that many files in the same directory is not a very good thing for overall speed (including, the speed at which your filesystem handles add and delete operations, not just listing as you have seen).
A solution to that design problem is to have sub-directories for each possible first letter of the file names, and have all files beginning with that letter inside that directory. Recurse to the second, third, etc. letter if need be.
You will probably see a definite speed improvement on many operations.
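The prefix-directory scheme from the two answers above can be sketched as a small path helper. `shard_path` is a hypothetical name, and the two-level depth simply mirrors the "m/my/mynicefile" example.

```python
import os

def shard_path(root, name, depth=2):
    """Map 'mynicefile.txt' to root/m/my/mynicefile.txt, using the first
    `depth` prefixes of the name as nested directories, so each directory
    holds a manageable number of files."""
    parts = [name[:i] for i in range(1, depth + 1)]
    return os.path.join(root, *parts, name)
```

The filling script would create `os.path.dirname(shard_path(...))` before writing, and the reader would walk only the shard it needs.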
You're saying that the content gets there by unpacking zip file(s). Why don't you just work on the zip files instead of creating/using 250k files in one directory?
Basically, to speed this up you don't need anything Perl-specific, but rather something at the filesystem level. If you are 100% sure that you have to work with 250k files in one directory (and I can't imagine a situation where that would be required), you're much better off finding a better filesystem to handle it than finding some "magical" Perl module to scan it faster.
Probably not. I would guess most of the time is in reading the directory entry.
However you could preprocess the entire directory listing, creating one file per 1000-entries. Then your process could do one of those listing files each time and not incur the expense of reading the entire directory.
Have you tried just readdir() through the directory without any other processing at all to get a baseline?
You aren't going to be able to speed up readdir, but you can speed up the task of monitoring a directory. You can ask the OS for updates -- Linux has inotify, for example. Here's an article about using it:
http://www.ibm.com/developerworks/linux/library/l-ubuntu-inotify/index.html?ca=drs-
You can use Inotify from Perl:
http://metacpan.org/pod/Linux::Inotify2
The difference is that you will have one long-running app instead of a script that is started by cron. In the app, you'll keep a queue of files that are new (as provided by inotify). Then, you set a timer to go off every 5 minutes, and process 1000 items. After that, control returns to the event loop, and you either wake up in 5 minutes and process 1000 more items, or inotify sends you some more files to add to the queue.
(BTW, you will need an event loop to handle the timers; I recommend EV.)
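The queue-and-timer half of that design can be sketched independently of inotify. A sketch under the assumption that some notification source (Linux::Inotify2 in the Perl case, or any equivalent binding) appends paths to `queue`; `drain` is a hypothetical name for what the 5-minute timer callback would do.

```python
from collections import deque

def drain(queue, batch_size=1000, process=print):
    """One timer tick: process up to batch_size queued files, leaving the
    rest for the next tick. Returns the number actually processed."""
    n = 0
    while queue and n < batch_size:
        process(queue.popleft())
        n += 1
    return n
```

The event loop would call `drain` every 5 minutes; between ticks, the notification callback just does `queue.append(path)`, so no directory scan ever happens.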

Are windows file creation timestamps reliable?

I have a program that uses save files. It needs to load the newest save file, but fall back on the next newest if that one is unavailable or corrupted. Can I use the windows file creation timestamp to tell the order of when they were created, or is this unreliable? I am asking because the "changed" timestamps seem unreliable. I can embed the creation time/date in the name if I have to, but it would be easier to use the file system dates if possible.
If you have a directory full of arbitrary, randomly named files and time is the only ordering factor, it may make more sense to build the timestamp into the filename itself, eliminating the need for tools to inspect it.
2008_12_31_24_60_60_1000
Would be my recommendation for a flatfile system.
Sometimes if you have a lot of files, you may want to group them, ie:
2008/
2008/12/
2008/12/31
2008/12/31/00-12/
2008/12/31/13-24/24_60_60_1000
or something larger
2008/
2008/12_31/
etc etc etc.
(Moreover, if you're not embedding the time, what is your other distinguishing characteristic? You can't have a null file name, and creating monotonically increasing sequences is much harder. More info needed.)
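Embedding the timestamp in the name, as suggested above, makes "newest first" a plain lexicographic sort. A minimal sketch; `save_name` and `newest_first` are hypothetical names, and the `save_YYYYMMDD_HHMMSS.dat` pattern is an assumption chosen so that string order equals chronological order.

```python
import time

def save_name(now=None):
    """Embed the creation time in the filename, e.g. 'save_20081231_235959.dat',
    so lexicographic order equals chronological order."""
    t = time.localtime(time.time() if now is None else now)
    return time.strftime("save_%Y%m%d_%H%M%S.dat", t)

def newest_first(names):
    """Candidate load order: newest save first, falling back to older ones."""
    return sorted(names, reverse=True)
```

The loader would iterate `newest_first(...)` and fall through to the next name whenever a file is missing or fails its corruption check.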
What do you mean by "reliable"? When you create a file, it gets a timestamp, and that works. Now, the resolution of that timestamp is not necessarily high; on FAT16 it was 2 seconds, I think. On FAT32 and NTFS it is probably 1 second. So if you are saving your files at a rate of less than one per second, you should be fine there. Keep in mind that the user can change the timestamp value arbitrarily. If you are concerned about that, you'll have to embed the timestamp in the file itself (although in my opinion that would be overkill).
Of course if the user of the machine is an administrator, they can set the current time to whatever they want it to be, and the system will happily timestamp files with that time.
So it all depends on what you're trying to do with the information.
Windows timestamps are stored in UTC. So if your timezone changes (i.e. when daylight savings starts or ends), the displayed timestamp will move forward or back an hour. Apart from that, and the accuracy of about 2 seconds, there is no reason to think the timestamps are invalid, and it's certainly OK to use them. But I think it's bad practice when you can simply put the timestamp in the name, or even in the file itself.
What if the system time is changed for some reason? It seems handy, but perhaps some other version number counting up would be better.
Added: A similar question, but with databases, here.
I faced some issues with the creation time of a file after deleting and recreating it under the same name.
It is something similar to this comment in the GetFileInfoEx docs:
Problem getting correct Creation Time after file was recreated
I tried to use GetFileAttributesEx and then get ftCreationTime field of
the resulting WIN32_FILE_ATTRIBUTE_DATA structure. It works just fine
at first, but after I delete file and recreate again, it keeps giving
me the original already incorrect value until I restart the process
again. The same problem happens for FindFirstFile API, as well. I use
Windows 2003.
This is said to be related to a feature called file-system tunneling.
Try using something like this when you want to rename the file:
Path.Combine(ArchivedPath, currentDate + " " + fileInfo.Name)
