How to write files to a NetApp filer with a SnapLock Compliance volume

We have a NetApp FAS2040 device with a SnapLock Compliance volume configured. Image files are written to the volume using IBM FileNet for compliant storage of scanned post.
We want to replace the FileNet element with an in-house solution where we write the images to the volume ourselves. What I would like to know is what is involved in doing this.
Is it just a case of writing a file to the volume and then setting the read-only attribute to true? How would I configure expiration for the file? Can I change the time between it being read-only and then permanently committed?
Thanks
Stuart

Setting a file read only will activate the snaplock. It will default to the default snaplock retention period - which is configured on your filer.
Usually this is set to a short interval, to avoid accidentally snaplocking junk for protracted periods of time. There's really nothing more annoying than accidentally locking a 100TB aggregate for 3 years.
You update the individual file retention by setting an access time. By default, when you set a file read only, the atime will be set to now + the default retention.
You can update this to a longer timespan (but never shorter) via the touch command:
touch -a -t 201411210000 <filename>
(note the timestamp format: YYYYMMDDhhmm, i.e. 2014-11-21 00:00 in this example)
You can use the stat command to show you the retention dates.
The whole process is geared off the SnapLock compliance clock. It's a tamper-resistant clock which should generally match 'real time', but there are a variety of circumstances where it won't, e.g. if you've powered down the filer or taken the volumes offline. You can't reset the clock or forward-date it to mess with SnapLock; it will adjust itself by up to a week per year, but NO FASTER.
Once the atime is less than the value on the snaplock clock, you can then delete the file (but note - not modify it).
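For illustration, here is a minimal sketch of that flow from a Linux client with the volume NFS-mounted; the mount point, file name and retention date below are made up, and it simply mirrors the chmod/touch steps described above:

import os
import stat
import subprocess

path = "/mnt/snaplock_vol/scans/image0001.tif"  # hypothetical NFS mount point and file name

# 1. Write the image as normal.
with open(path, "wb") as f:
    f.write(b"...image data...")  # placeholder payload

# 2. Drop the write bits. This commits the file to WORM, with atime set to
#    now + the volume's default retention period.
os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

# 3. Optionally extend retention by pushing atime further out, exactly like
#    the `touch -a -t` example above (later dates only, never earlier).
subprocess.run(["touch", "-a", "-t", "201511210000", path], check=True)

# 4. Once the compliance clock passes atime, the file can be deleted
#    (but still never modified):
# os.remove(path)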
You can set snaplock.autocommit_period to automatically snaplock files that have been static for the defined period.
I would very strongly recommend thinking hard about snaplock retention periods. Setting this to years will almost certainly cause you grief, when you realise you need to do some essential maintenance and you can't because snaplock is in the way. You can only delete an aggregate once every piece of data in it has had the clock expire. (And if any has been offline ever, the clock will be skewed too). That means years after your system is end of life, it will not be usable for anything other than watching the clock tick.
I would suggest instead that you consider a mechanism to refresh snaplock on a periodic basis - e.g. set the retention for 6 months. Run a monthly cronjob to update it and add another month.
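A rough sketch of what that refresh job could look like, assuming the volume is mounted at a hypothetical path and reusing the same touch -a -t mechanism described above to push each file's retention out by another month:

import os
import subprocess
import time

VOLUME = "/mnt/snaplock_vol"   # hypothetical mount point
MONTH = 30 * 24 * 3600         # roughly one month, in seconds

for root, _dirs, files in os.walk(VOLUME):
    for name in files:
        path = os.path.join(root, name)
        # Push the file's retention (atime) out by another month; SnapLock
        # accepts this because the new date is later than the current one.
        new_expiry = time.strftime("%Y%m%d%H%M", time.localtime(os.stat(path).st_atime + MONTH))
        subprocess.run(["touch", "-a", "-t", new_expiry, path], check=False)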
You cannot change the time between "read only" and "permanently committed" because that's not the way it works. As soon as it's read only, it's snaplocked. But as said - the retention period should hopefully default to a sensible number.

Related

Modifying post-2040 macOS file creation/modification dates

In my Synology NAS, I have an AFP share with files that have been transferred back and forth for decades across different OSes.
original systems: probably ext4 filesystem and Synology-hosted NFS mount, years ago (various systems, Linux/Windows)
current system: EXT4 filesystem, with Synology-hosted AFP mounts (to a macOS 10.15 system, though I doubt that matters)
For files that were copied via NFS originally, and now hosted via AFP, all the file dates seem to be offset by some amount. I can sort of eyeball the datetime offset, but is there a definitive number I can use? (And a simple way to parse/modify the times using something like GetFileInfo?)
For reference, I have a copy of iTerm2-3_2_6.zip, dated "2039-01-22 08:25:17". I would probably map that to 2019-01-21 (release date for 3.2.7), implying a 20-year offset.
The closest thing I can think of is the macOS epoch starting on 2001-01-01 instead of UNIX's 1970-01-01, but that's a 31-year offset.
There's also the "year 2038 problem", and some software might be doing something clever with 32-bit overflows to support post-2038 datetimes, but I have at least one file dated "2031-08-10", so that seems unlikely.
This seems to be something related to 32-bit and 64-bit overflows, somewhere in this complicated storage stack; the way the datetimes + error offsets add up is consistently close to 2^31, though I wasn't able to find any clear pattern.
Also, I noticed strange behavior from my Synology system's use of eCryptFS, which seems to lag metadata updates when done in batch. (In particular, I suspect that some eCryptFS/Synology metadata is getting translated incorrectly, or just never written to disk.)
Anyway, I basically wrote a Python script that does the following (a sketch is included after the caveats below):
check if os.stat() reports an atime/mtime newer than 2030
check that both atime and mtime are newer; error out if they differ
adjust both times back by 632599096 seconds (offset based on comparing copies of the one file I found in common between two systems).
with the following caveats to watch out for:
macOS's GetFileInfo/SetFile utilities only support 32-bit datetimes, so you should generally avoid using them (even just to verify the metadata updates).
something in the Synology/eCryptFS encryption gets very slow, so after a few dozen metadata updates, the updates will no longer be visible (even after calling sync from the shell). But if you give it some time, you'll see the corrected atime/mtime changes.
The OS-specific os.stat field, ctime, does literally track metadata update times. And there doesn't seem to be a way to manually set it (nor a need to, since this isn't visible through any GUI).
The combination of slow metadata updates + GetFileInfo reporting the wrong time made this incredibly frustrating, until I figured both out. In practice, this means you have to test metadata updates on a few files at a time, then your large batch execution can only be verified a few hours later (I waited a day).
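A minimal sketch of that script, reconstructed from the description above (the 2030 cutoff and the 632599096-second offset are as stated; files to fix are passed on the command line):

import os
import sys

OFFSET = 632599096      # seconds; derived from the one file found in common between systems
CUTOFF = 1893456000     # 2030-01-01 UTC; anything newer is assumed to be a bad timestamp

for path in sys.argv[1:]:
    st = os.stat(path)
    atime_bad = st.st_atime >= CUTOFF
    mtime_bad = st.st_mtime >= CUTOFF
    if not atime_bad and not mtime_bad:
        continue                            # timestamps look sane, leave the file alone
    if atime_bad != mtime_bad:
        sys.exit(f"atime/mtime disagree for {path}, refusing to guess")
    # Both timestamps are post-2030: shift them back by the measured offset.
    os.utime(path, (st.st_atime - OFFSET, st.st_mtime - OFFSET))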
This should have been a blog post, good riddance.

KStreams tmp files cleanup

My KStreams consumer stores some checkpoint information under /tmp/kafka-streams/. This folder fills up pretty fast in our case. My KStreams app basically consumes a 1 KB message in a 3-second window and dedups it based on a key. I am looking for suggestions on how to purge this data periodically so the disk doesn't fill up: which files can be deleted and which need to be kept?
If you use a windowed aggregation, by default a retention time of 1 day is used, to allow handling out-of-order data correctly. This means all windows of the last 24h (or actually up to 36h) are stored.
You can try to reduce the retention time to store a shorter history:
.aggregate(..., Materialized.as(...).withRetention(...));
Older versions (pre 2.1.0): TimeWindows#until(...) (or SessionWindows#until(...))

NiFi: ListFile Processor is not detecting file changes sometimes

ListFile processor is not detecting any changes to a previously processed file and reprocessing it. FYI, I have already tried the following options for reprocessing, and only the finally mentioned hack works. This is on a single-node NiFi instance running in my development environment.
Update Scenario: ListFile processor is not detecting file content changes and triggering automatically post-update (i.e. file updates using the VIM editor).
Timestamp modification Scenario: Changing the file timestamp with touch -c command changes the file timestamp but this does not cause auto-trigger of the ListFile processor either.
Stop-start Scenario: Stop-start of the whole process group in NiFi after changing the file as mentioned above also does not cause triggering of ListFile processor.
Waiting Clause: Waiting for long enough after file change also does not help - just in case we assume it will auto-trigger after some delay.
HACK: The only way I am able to trigger re-processing of the file by the ListFile processor is by changing the wildcard expression for "File Filter" in the ListFile processor in a harmless, idempotent manner, for example from .*test.*\.csv to test.*\.csv and back again later (i.e. going back and forth like this for repeated reprocessing).
Reprocessing of files with same old names and with modified data is a requirement for us. Please help!
And sometimes forced reprocessing of even an unmodified file could be required in case of unanticipated data issues upstream/downstream. Please help!
UPDATE
Still facing this sporadic behavior! Only a restart of NiFi helps when the ListFile processor fails to respond to a file change.
This is probably a delayed answer.
The old List processors like ListFile/ListFTP/ListSFTP etc. used only a timestamp-tracking strategy to identify changed files. The processor cached the last-seen timestamp in its processor state and used it to list only files with a greater timestamp.
However, this approach was quite buggy, so a much better strategy called Entity Tracking was introduced. It gives a broader range of monitoring of file changes: it keeps track of the following parameters for each file in the specified directory.
Name
Size
Last modified timestamp
Any change to a file is reflected in these key parameters. Since they are cached, any difference is treated as a change, so changed files appear on the success connection. In newer NiFi releases you enable this behaviour by setting the ListFile processor's Listing Strategy property to Tracking Entities (check that your version supports it).

bash protect HD from excessive use

How do I avoid breaking the HD? I have a bash script running on an Ubuntu machine, with this pseudo-code:
bash1.sh:
#!/bin/bash
while true; do
    ./bash2.sh
    sleep 60
done
bash2.sh:
#!/bin/bash
DIR=/path/to/shared/dir               # placeholder for the network-shared directory
[ -z "$(ls -A "$DIR")" ] && exit 0    # exit if the directory is empty
process_file "$DIR"/*                 # placeholder for the actual processing step
rm -- "$DIR"/*
The directory is network-shared, and the computer is not doing anything else. Once per day a new file arrives and is processed. (I do know that bash1.sh can be replaced by watch.) My concern is that bash1.sh is reading bash2.sh every time - that can presumably be avoided by only having one script - and that bash2.sh is reading the same directory every time. Is the directory really read from the HD, or does Ubuntu somehow cache the dir in RAM so it is only read when something changes? Is it a problem that the same place on the HD is read every time, or does it not matter because the HD is already spinning? If the HD never sleeps, does it matter if I set the loop time down to only one second?
Maybe the directory could be a pure RAM dir - how do I do that? Or is there some simple way to check whether something has arrived over the network without reading the directory?
Reading a file or directory once every sixty seconds is not excessive use.
Seriously, don't worry about it.
If it's really worrying you, you can rethink your strategy for detecting the file.
For example, do you really need to know, within sixty seconds, that the file has arrived? Can it arrive any time during the day? Can some parts of the day be considered unlikely?
Using information like that, you can adjust the timing of checks to suit. If the file is supposed to be delivered after 4pm, don't check for it at all before then.
Check for it every sixty seconds between 4pm and 5pm, then every ten minutes after that.
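For illustration, a schedule-aware polling loop along those lines might look like this (sketched in Python; the directory path and the 4pm/5pm window are placeholders):

import os
import time
from datetime import datetime

WATCH_DIR = "/path/to/share"   # placeholder for the network-shared directory

def poll_interval(now):
    # Before 4pm the file never arrives, so check rarely; 4pm-5pm is the
    # expected delivery window, so check every minute; after that, every 10 min.
    if now.hour < 16:
        return 30 * 60
    if now.hour < 17:
        return 60
    return 10 * 60

while True:
    if os.listdir(WATCH_DIR):
        print("file found, processing...")
        break
    time.sleep(poll_interval(datetime.now()))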
These are all business-related decisions that can be made but I would still suggest that it's unnecessary. Provided you regularly back up your disks (and have standby hardware if you need to be back up in a hurry), you shouldn't lose anything.
In fact, if you were really paranoid, you could dedicate an entire machine for this, whose sole purpose is to receive the file via FTP and, when it arrives, send it across to your real processing box.
Put nothing else on that machine and have a warm standby (exactly same software, IP address and so on but powered down) so that, if it fails, the standby can be activated in minutes.
The real processing machine is then only written to once a day - that's unlikely to affect the disk lifetime.
That's probably too paranoid for my liking but it shows that there are ways to mitigate almost any problem.

Are windows file creation timestamps reliable?

I have a program that uses save files. It needs to load the newest save file, but fall back on the next newest if that one is unavailable or corrupted. Can I use the Windows file creation timestamp to tell the order in which they were created, or is this unreliable? I am asking because the "changed" timestamps seem unreliable. I can embed the creation time/date in the name if I have to, but it would be easier to use the file system dates if possible.
If you have a directory full of arbitrary and randomly named files and 'time' is the only factor, it may be more practical to use a filename that encodes the timestamp, eliminating the need for tools to view it.
2008_12_31_24_60_60_1000
Would be my recommendation for a flatfile system.
Sometimes if you have a lot of files, you may want to group them, ie:
2008/
2008/12/
2008/12/31
2008/12/31/00-12/
2008/12/31/13-24/24_60_60_1000
or something larger
2008/
2008/12_31/
etc etc etc.
(Moreover, if you're not embedding the time, what is your other distinguishing characteristic? You can't have a null file name, and creating monotonically increasing sequences is much harder. Need more info.)
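For illustration, a small sketch of that naming scheme (the base directory, prefix and extension are made up):

from datetime import datetime, timezone
from pathlib import Path

def timestamped_save_path(base_dir, prefix="save"):
    # Names built this way sort lexically in creation order, and the
    # year/month_day grouping keeps any single directory from growing too large.
    now = datetime.now(timezone.utc)
    directory = Path(base_dir) / now.strftime("%Y") / now.strftime("%m_%d")
    directory.mkdir(parents=True, exist_ok=True)
    return directory / (now.strftime(f"{prefix}_%Y_%m_%d_%H_%M_%S_%f") + ".sav")

print(timestamped_save_path("saves"))  # e.g. saves/2008/12_31/save_2008_12_31_23_59_59_123456.sav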
What do you mean by "reliable"? When you create a file, it gets a timestamp, and that works. Now, the resolution of that timestamp is not necessarily high - on FAT16 it was 2 seconds, I think. On FAT32 and NTFS it is probably 1 second. So if you are saving your files at a rate of less than one per second, you should be good there. Keep in mind that the user can change the timestamp value arbitrarily. If you are concerned about that, you'll have to embed the timestamp into the file itself (although in my opinion that would be overkill).
Of course if the user of the machine is an administrator, they can set the current time to whatever they want it to be, and the system will happily timestamp files with that time.
So it all depends on what you're trying to do with the information.
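If you do rely on the file system dates, a minimal sketch of the newest-first-with-fallback logic could look like this (the .sav extension and the integrity check are made up; on Windows, os.path.getctime() returns the creation time):

import os
from glob import glob

def validate(data):
    # Placeholder integrity check; replace with the program's real validation.
    if not data:
        raise ValueError("empty save file")

def load_newest_save(save_dir):
    # Try save files from newest to oldest creation time.
    candidates = sorted(glob(os.path.join(save_dir, "*.sav")),
                        key=os.path.getctime, reverse=True)
    for path in candidates:
        try:
            with open(path, "rb") as f:
                data = f.read()
            validate(data)
            return data
        except (OSError, ValueError):
            continue  # unreadable or corrupted: fall back to the next newest
    raise FileNotFoundError("no usable save file found")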
Windows timestamps are stored in UTC, so if your timezone offset changes (i.e. when daylight saving starts or ends) the displayed timestamp will move forward/back an hour. Apart from that, and the accuracy of about 2 seconds, there is no reason to think that the timestamps are invalid, and it's certainly OK to use them. But I think it's bad practice when you can simply put the timestamp in the name, or even in the file itself.
What if the system time is changed for some reason? It seems handy, but perhaps some other version number that simply counts up would be better.
Added: A similar question, but with databases, here.
I faced some issues with the creation time of a file after deleting and recreating it under the same name.
Something similar to this comment in the GetFileAttributesEx docs:
Problem getting correct Creation Time after file was recreated
I tried to use GetFileAttributesEx and then get the ftCreationTime field of the resulting WIN32_FILE_ATTRIBUTE_DATA structure. It works just fine at first, but after I delete the file and recreate it again, it keeps giving me the original, already incorrect value until I restart the process again. The same problem happens with the FindFirstFile API as well. I use Windows 2003.
This is said to be related to something called tunnelling (Windows file-system tunnelling).
Try using this when you want to rename the file:
Path.Combine(ArchivedPath, currentDate + " " + fileInfo.Name)
