Why do mtime and atime need to be updated? - alluxio

Does anyone know why the mtime and atime need to be updated when completing the file?
    mInodeTree.updateInode(rpcContext, UpdateInodeEntry.newBuilder()
        .setId(inode.getId())
        .setUfsFingerprint(ufsFingerprint)
        .setLastModificationTimeMs(opTimeMs) // mtime?
        .setLastAccessTimeMs(opTimeMs) // atime?
        .setOverwriteModificationTime(true)
        .build());
    mInodeTree.updateInodeFile(rpcContext, entry.build());

In the early days, when Alluxio was mostly used with Spark or MR to store really large files, we thought completion could take a while, so the completion time might better reflect the mtime and atime. I don't think there is any particular technical reason behind that.

Related

Is there a way to tell if a file has changed in the Windows API other than opening it or timestamps?

I'm writing a program which needs to look at a very large number of files, some of which are very large in size. I'd like to visit a file only once, unless it changes. If it changes I need to revisit it again.
The way I know of to do this is with datestamps. One can look at the modified date to see if it is newer than the last time you looked at the file. Obviously those can be changed programmatically, so I'm wondering if there is a way to determine if a file has changed other than that. (I'm thinking along the lines of a UUID for the file which is changed every time it is modified or an epoch counter, but I'm open to more exotic solutions)
You can monitor these files for changes, assuming your program keeps running the whole time. Look at the FindFirstChangeNotification API. You can take a look at this project as an example. Sysinternals also has a similar tool; I believe it's implemented in a similar way.

Terminal find using download time

I was wondering if there is a way to find files using the find tool in Terminal based on a file's download time. I know there are options for access (-amin), creation (-cmin), and modified (-mmin), but I can't figure out a way to filter files based on the time they were downloaded.
I checked, and the creation time was not the same as the download time. If find can't do it, what's my next best option?
There's no creation time in Unix; ctime is the inode change time.
Your best bet is to use the time of last modification, aka mtime, which gives you the time the download ended. If you must know when the download started, you need to record the date before starting the download. If you need the download duration, subtract the start time from the end time. There are tons of questions on how to compute the interval between two timestamps. Don't ask another :-)
EDIT: It appears your downloader (which one? Why didn't you specify it?) changes the time stamps to match the original. You can read its documentation if it has an option to suppress this. You could also find out if it can write the file to stdout and redirect it (e.g. wget -O - http://file > file) This will always force the mtime to be current.
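For illustration, here is a minimal Python sketch of the mtime-based filter described above, roughly what `find . -type f -mmin -60` would give you. The function name `modified_within` and its `now` parameter are hypothetical, introduced only for this example, and it assumes the downloader leaves mtime at the time the download finished.

```python
import os
import time
from pathlib import Path


def modified_within(root, minutes, now=None):
    """Return files under root whose mtime falls within the last `minutes` minutes."""
    now = time.time() if now is None else now
    cutoff = now - minutes * 60
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                mtime = path.stat().st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime >= cutoff:
                hits.append(path)
    return sorted(hits)
```

Calling `modified_within("Downloads", 60)` would list everything in a (hypothetical) Downloads directory that finished downloading in the last hour, provided the downloader did not reset the timestamps.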

One file database with HDFS and MapReduce

Let's imagine I want to store a large number of URLs with associated metadata
URL => Metadata
in a file
hdfs://db/urls.seq
I would like this file to grow (if new URLs are found) after every run of MapReduce.
Would that work with Hadoop? As I understand MapReduce outputs data to a new directory. Is there any way to take that output and append it to the file?
The only idea that comes to mind is to create a temporary urls.seq and then replace the old one. It works, but it feels wasteful. Also, from my understanding, Hadoop likes the "write once" approach, and this idea seems to conflict with that.
As blackSmith has explained, you can easily append to an existing file in HDFS, but it would hurt your performance because HDFS is designed around a "write once" strategy. My suggestion is to avoid this approach unless there is no other option.
One approach you may consider is to create a new file for every MapReduce output. If each output is large enough, this technique will benefit you most, because writing a new file does not affect performance the way appending does. Also, if you read the output of each MapReduce job in the next job, reading a new file won't affect your performance as much as appending does.
So there is a trade-off; it depends on whether you want performance or simplicity.
( Anyways Merry Christmas !)
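The "one new file per run" idea can be sketched as follows. This is only a local-filesystem stand-in written in Python, not HDFS or MapReduce code: the directory layout, the `urls-NNNNN.tsv` naming, and the helpers `write_run_output` and `read_all` are all assumptions made for illustration. The point is that each run only ever creates a file, and a reader treats the whole directory as the database.

```python
import os


def write_run_output(db_dir, run_id, records):
    """Write one run's URL -> metadata pairs to a new file; never touch old files."""
    os.makedirs(db_dir, exist_ok=True)
    path = os.path.join(db_dir, f"urls-{run_id:05d}.tsv")
    with open(path, "x", encoding="utf-8") as out:  # "x" fails if the file exists
        for url, meta in records.items():
            out.write(f"{url}\t{meta}\n")
    return path


def read_all(db_dir):
    """Merge every run's file; later runs override earlier ones for the same URL."""
    merged = {}
    for name in sorted(os.listdir(db_dir)):
        with open(os.path.join(db_dir, name), encoding="utf-8") as src:
            for line in src:
                url, meta = line.rstrip("\n").split("\t", 1)
                merged[url] = meta
    return merged
```

Zero-padding the run number keeps the files in run order when sorted by name, which is what lets a later run's metadata win for a URL that appears twice.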

Obtaining Dynamically Changing Log Files

Does the problem I am facing have some kind of a fancy name like "Dining philosophers problem" or "Josephus problem" etc etc? This is so that I can do some research on it.
I want to retrieve the latest log file in Windows. When the log file is full (say, at 50 MB), it is renamed to log.2, log.3, log.4 and so on, and incoming log entries are written to log.1.
Now, I have a solution to this: I poll the server intermittently to check whether the latest file (log.1) has changed.
However, I soon found out that log.1 becomes log.2 at unpredictable times, causing me to miss log data (because I only retrieve log.1 when its "Date Modified" property has changed).
I hope there is some kind of analogy that makes this easy to understand. The closest I can think of is a stroboscope freezing a fan of unknown frequency: it gives the illusion that the fan is standing still, but the fan has actually spun many times. You get the gist.
Thanks in advance.
The solution will be to have your program keep track of the last modified dates for both files log.1 and log.2. When you poll, check log.2 for changes and then check log.1 for changes.
Most of the time, log.2 will not have changed. When it does, you read the updated data there, and then read the updated data in log.1. In code, it would look something like this:
DateTime log1ModifiedDate // saved, and updated whenever it changes
DateTime log2ModifiedDate
if log2.DateModified != log2ModifiedDate
    Read and process data from log.2
    update log2ModifiedDate
if log1.DateModified != log1ModifiedDate
    Read and process data from log.1
    update log1ModifiedDate
I'm assuming that you poll often enough that log.1 won't have rolled over twice such that the file that used to be log.1 is now log.3. If you think that's likely to happen, you'll have to check log.3 as well as log.2 and log.1.
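As a rough runnable version of the polling approach above, here is a Python sketch. The class name `RotatingLogPoller` is hypothetical, and it only reports which files changed since the last poll; reading and processing the data is left to the caller. Pass the paths oldest first (log.2, then log.1) so that rolled-over data is handled before the live file.

```python
import os


class RotatingLogPoller:
    """Remember each log file's last seen mtime and report which ones changed."""

    def __init__(self, paths):
        # Order matters: the older file (log.2) first, then the live file (log.1).
        self.paths = list(paths)
        self.last_mtime = {p: None for p in self.paths}

    def poll(self):
        changed = []
        for path in self.paths:
            try:
                mtime = os.stat(path).st_mtime_ns
            except FileNotFoundError:
                continue  # e.g. log.2 does not exist until the first rollover
            if mtime != self.last_mtime[path]:
                self.last_mtime[path] = mtime
                changed.append(path)
        return changed
```

To cover the double-rollover case, the same class could be given log.3 as well, again listed before log.2 and log.1.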
Another way to handle this in Windows is to implement file change notification, which will tell you whenever a file changes in a directory. Those notifications are delivered to your program asynchronously. So rather than polling, you respond to notifications. In .NET, you'd use FileSystemWatcher. With the Windows API, you'd use FindFirstChangeNotification and associated functions. This CodeProject article gives a decent example.
Get the file list, sort it in descending order, take the first file, and read the log lines!

Are windows file creation timestamps reliable?

I have a program that uses save files. It needs to load the newest save file, but fall back on the next newest if that one is unavailable or corrupted. Can I use the Windows file creation timestamp to tell the order in which they were created, or is this unreliable? I am asking because the "changed" timestamps seem unreliable. I can embed the creation time/date in the name if I have to, but it would be easier to use the file system dates if possible.
If you have a directory full of arbitrarily and randomly named files and 'time' is the only factor, it may be more useful to give each file a name that matches its timestamp, which removes the need for tools to view it.
2008_12_31_24_60_60_1000
would be my recommendation for a flat-file system.
Sometimes, if you have a lot of files, you may want to group them, i.e.:
2008/
2008/12/
2008/12/31
2008/12/31/00-12/
2008/12/31/13-24/24_60_60_1000
or something larger
2008/
2008/12_31/
etc etc etc.
(Moreover, if you're not embedding the time, what other distinguishing characteristic would you use? You can't have an empty file name, and creating monotonically increasing sequences is much harder. Need more info.)
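A minimal Python sketch of the timestamp-in-the-name idea, for illustration only: the helper names `timestamped_name` and `newest_first`, the `save` prefix, and the `.sav` extension are assumptions, not anything from the question. It uses UTC and zero-padded fields so that plain string sorting matches chronological order.

```python
from datetime import datetime, timezone


def timestamped_name(prefix, when=None, ext=".sav"):
    """Build a name such as save_2008_12_31_23_59_59_999.sav (UTC, zero-padded)."""
    when = datetime.now(timezone.utc) if when is None else when
    millis = when.microsecond // 1000
    return f"{prefix}_{when:%Y_%m_%d_%H_%M_%S}_{millis:03d}{ext}"


def newest_first(names):
    """Zero-padded timestamps sort chronologically as plain strings."""
    return sorted(names, reverse=True)
```

A loader could then walk `newest_first(...)` and fall back to the next entry whenever a save file turns out to be missing or corrupted.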
What do you mean by "reliable"? When you create a file, it gets a timestamp, and that works. The resolution of that timestamp is not necessarily high, though: on FAT the modification time has a resolution of 2 seconds (the creation time, 10 milliseconds), while NTFS stores times in 100-nanosecond units. So if you are saving your files at a rate of less than one every couple of seconds, you should be fine. Keep in mind that the user can change the timestamp value arbitrarily. If you are concerned about that, you'll have to embed the timestamp in the file itself (although in my opinion that would be overkill).
Of course if the user of the machine is an administrator, they can set the current time to whatever they want it to be, and the system will happily timestamp files with that time.
So it all depends on what you're trying to do with the information.
Windows timestamps are stored in UTC (on NTFS). So if your time zone changes (i.e. when daylight saving time starts or ends), the displayed time will move forward or back an hour. Apart from that, and the accuracy of about 2 seconds on FAT, there is no reason to think that the timestamps are invalid, and it's certainly OK to use them. But I think it's bad practice when you can simply put the timestamp in the name, or even in the file itself.
What if the system time is changed for some reason? It seems handy, but perhaps some other version number counting up would be better.
Added: A similar question, but with databases, here.
I ran into issues with the creation time of a file after deleting it and recreating it under the same name.
It was similar to this comment in the GetFileAttributesEx docs:
Problem getting correct Creation Time after file was recreated
I tried to use GetFileAttributesEx and then get ftCreationTime field of
the resulting WIN32_FILE_ATTRIBUTE_DATA structure. It works just fine
at first, but after I delete file and recreate again, it keeps giving
me the original already incorrect value until I restart the process
again. The same problem happens for FindFirstFile API, as well. I use
Window 2003.
This is said to be related to something called file system tunneling.
Try using this when you want to rename the file:
Path.Combine(ArchivedPath, currentDate + " " + fileInfo.Name)
