Why is FSCTL_SET_OBJECT_ID dangerous? - windows

NTFS files can have object IDs. These IDs can be set using FSCTL_SET_OBJECT_ID. However, the MSDN article says:
Modifying an object identifier can result in the loss of data from portions of a file, up to and including entire volumes of data.
But it doesn't go into any more detail. How can this result in loss of data? Is it talking about potential object ID collisions in the file system, and does NTFS rely on them in some way?
Side note: I did some experimenting with this before I found that paragraph and set the object IDs of some newly created files; here's hoping that my file system's still intact.
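For reference, setting an ID boils down to a DeviceIoControl call with FSCTL_SET_OBJECT_ID and a 64-byte FILE_OBJECTID_BUFFER. Below is a minimal C# sketch of that call; the P/Invoke declaration and the control-code value are my own reading of the Windows headers, and the file path is just a placeholder.

using System;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

class SetObjectIdSketch
{
    // CTL_CODE(FILE_DEVICE_FILE_SYSTEM, 38, METHOD_BUFFERED, FILE_SPECIAL_ACCESS) from winioctl.h
    const uint FSCTL_SET_OBJECT_ID = 0x00090098;

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool DeviceIoControl(
        SafeFileHandle hDevice, uint dwIoControlCode,
        byte[] lpInBuffer, uint nInBufferSize,
        IntPtr lpOutBuffer, uint nOutBufferSize,
        out uint lpBytesReturned, IntPtr lpOverlapped);

    static void Main()
    {
        // FILE_OBJECTID_BUFFER: a 16-byte ObjectId followed by 48 bytes of birth/extended info.
        byte[] buffer = new byte[64];
        Guid.NewGuid().ToByteArray().CopyTo(buffer, 0);

        using (var file = File.Open(@"C:\temp\test.txt", FileMode.Open, FileAccess.ReadWrite))
        {
            uint returned;
            if (!DeviceIoControl(file.SafeFileHandle, FSCTL_SET_OBJECT_ID,
                                 buffer, (uint)buffer.Length,
                                 IntPtr.Zero, 0, out returned, IntPtr.Zero))
            {
                // Fails if the file already has an object ID (see the answer below).
                throw new IOException("FSCTL_SET_OBJECT_ID failed, error " +
                                      Marshal.GetLastWin32Error());
            }
        }
    }
}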

I really don't think this can directly result in loss of data.
The only way I can imagine it being possible is if e.g. a backup program assumes that (1) every file has an Object Id, and (2) that the program is keeping track of all IDs at all times. In that case it might assume that an ID that is not in its database must refer to a file that should not exist, and it might delete the file.
Yeah, I know it sounds ridiculous, but that's the only way I can think of in which this might happen. I don't think you can lose data just by changing IDs.

They are used by the Distributed Link Tracking service, which enables client applications to track link sources that have moved. The link tracking service maintains its link to an object only by using these object identifiers (IDs).
So coming back to your question,
Is it talking about potential object ID collisions in the file system?
I don't think so. Windows does provide the option to set object IDs using FSCTL_SET_OBJECT_ID, but that doesn't bring the risk of ID collisions:
Attempting to set an object identifier on an object that already has an object identifier will fail.
.. and does NTFS rely on them in some way?
Yes. Object identifiers are used to track files and directories. An index of all object IDs is stored on the volume. Rename, backup, and restore operations preserve object IDs. However, copy operations do not preserve object IDs, because that would violate their uniqueness.
How can this result in loss of data?
You won't get into serious trouble if you change (or rather set) the object ID of user-created files (as you did). However, if a user (knowingly or unknowingly) sets an object ID used by a shared object file/library, the change will not be reflected as is.
Since Windows doesn't want everyone (except developers) to play with crucial library files, it issues a generic warning:
Modifying an object identifier can result in the loss of data from portions of a file, up to and including entire volumes of data.
Bottom line: Change it if you know what you are doing.
There's another MSDN article on distributed link tracking and object identifiers.
Hope it helps!
EDIT:
Thanks to @Mehrdad for pointing this out. I didn't mean the object identifiers of DLLs themselves but the ones they use internally.
OLEACC (a DLL) provides the Active Accessibility runtime and manages requests from Active Accessibility clients [source]. It uses the OBJID_QUERYCLASSNAMEIDX object identifier [source].

Related

GetFileInformationByHandleEx/FileIdInfo vs DeviceIoControl/FSCTL_CREATE_OR_GET_OBJECT_ID for OpenFileById

Recently I stumbled upon the "If you want to use GUIDs to identify your files, then nobody's stopping you" article by Raymond Chen and wanted to implement this method. But then I found that there is another way to get a file ID: GetFileInformationByHandleEx with FILE_INFO_BY_HANDLE_CLASS::FileIdInfo, using the FileId field (128 bit).
I tried both; both methods work as expected, but I have a few questions I cannot find any answers to:
These methods return different IDs (and the ID from GetFileInformationByHandleEx seems to use only the low 64 bits, leaving the high part as zero). What does each of them represent? Are they essentially the same thing or two independent mechanisms to achieve the same goal?
Edit: Actually, I've just found some information. So the object ID from DeviceIoControl is the NTFS object ID, but what is the other ID then? How do they relate (if at all)? Are both methods available only on NTFS, or will at least one of them work on FAT16/32, exFAT, etc.?
The documentation for FILE_INFO_BY_HANDLE_CLASS::FileIdInfo doesn't say that the ID may not exist, unlike FSCTL_CREATE_OR_GET_OBJECT_ID, where I need to explicitly state that I want the ID to be created if there isn't one already. Will it have any bad consequences if I just blindly request creation of object IDs for every file I'll be working with?
I found a comment on this question saying that these IDs remain unchanged if a file is moved to another volume (logical or physical). I only tested the DeviceIoControl method, and the IDs indeed don't change across drives, but if I move the file I'm required to supply OpenFileById with the destination volume handle, otherwise it won't open the file. So, is there a way to make OpenFileById find a file without keeping the volume reference?
I'm thinking of enumerating all connected volumes and trying to open the file by ID on each until it succeeds, but I'm not sure how reliable that is. Could there exist two equal IDs that reference different files on different volumes?
How fast is it to ask the system to get (or create) an ID? Will it hurt performance if I add the ID query to regular file enumeration procedures, or should I do that only on demand when I really need it?
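For comparison, here is a rough C# sketch of the two queries side by side. The P/Invoke signatures, the FSCTL_CREATE_OR_GET_OBJECT_ID control code, and the FileIdInfo enum value are my own reading of the Windows headers, and the path is a placeholder.

using System;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

class FileIdComparisonSketch
{
    const uint FSCTL_CREATE_OR_GET_OBJECT_ID = 0x000900C0; // winioctl.h
    const int FileIdInfo = 18;                              // FILE_INFO_BY_HANDLE_CLASS

    [StructLayout(LayoutKind.Sequential)]
    struct FILE_ID_INFO
    {
        public ulong VolumeSerialNumber;
        [MarshalAs(UnmanagedType.ByValArray, SizeConst = 16)]
        public byte[] FileId;                                // 128-bit, volume-relative
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool DeviceIoControl(SafeFileHandle hDevice, uint dwIoControlCode,
        IntPtr lpInBuffer, uint nInBufferSize, byte[] lpOutBuffer, uint nOutBufferSize,
        out uint lpBytesReturned, IntPtr lpOverlapped);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool GetFileInformationByHandleEx(SafeFileHandle hFile,
        int FileInformationClass, out FILE_ID_INFO lpFileInformation, uint dwBufferSize);

    static void Main()
    {
        using (var file = File.Open(@"C:\temp\test.txt", FileMode.Open, FileAccess.Read))
        {
            // 1) NTFS object ID: a GUID kept in the volume's object ID index,
            //    created on demand by FSCTL_CREATE_OR_GET_OBJECT_ID.
            byte[] objIdBuffer = new byte[64];               // FILE_OBJECTID_BUFFER
            uint returned;
            if (DeviceIoControl(file.SafeFileHandle, FSCTL_CREATE_OR_GET_OBJECT_ID,
                                IntPtr.Zero, 0, objIdBuffer, 64, out returned, IntPtr.Zero))
            {
                byte[] guidBytes = new byte[16];
                Array.Copy(objIdBuffer, guidBytes, 16);
                Console.WriteLine("Object ID: " + new Guid(guidBytes));
            }

            // 2) 128-bit file ID: the volume serial number plus the file's internal ID
            //    (on NTFS essentially the file reference); nothing is "created" here.
            FILE_ID_INFO info;
            if (GetFileInformationByHandleEx(file.SafeFileHandle, FileIdInfo,
                                             out info, (uint)Marshal.SizeOf(typeof(FILE_ID_INFO))))
            {
                Console.WriteLine("Volume serial: " + info.VolumeSerialNumber.ToString("X"));
                Console.WriteLine("File ID: " + BitConverter.ToString(info.FileId));
            }
        }
    }
}

Note that OpenFileById accepts either kind of ID (the FILE_ID_DESCRIPTOR's Type field distinguishes them), but both are only meaningful relative to a handle on the correct volume, which is why the volume handle is required.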

What does data look like when using Event Sourcing?

I'm trying to understand how Event Sourcing changes the data architecture of a service. I've been doing a lot of research, but I can't seem to understand how data is supposed to be properly stored with event sourcing.
Let's say I have a service that keeps track of vehicles transporting packages. The current non relational structure for the data model is that each document represents a vehicle, and has many fields representing origin location, destination location, types of packages, amount of packages, status of the vehicle, etc. Normally this gets queried for information to be read to the front end. When changes are made by the user, the appropriate changes are made to this document in order to update this.
With event sourcing, it seems that a snapshot of every event is stored, but there seem to be a few ways to interpret that:
The first is that multiple versions of the document I described exist, with a new snapshot created every time a change is made. Each event would create a new, altered version of this document. This is the easiest way for me to wrap my head around it, but I believe it to be incorrect.
Another interpretation I have is that each event stores SPECIFIC information about what's been altered in the document. When the vehicle status changes from On Road to Available, for example, an event specifically for vehicle status changes is triggered. Let's say it's called VehicleStatusUpdatedEvent, and contains the Vehicle ID number, the new status, and the timestamp for this event. So this event is stored and is published to a messaging queue. When picked up from the queue, the appropriate changes are made to the current version of the document. I can understand this, but I think I still have some misconceptions here. My understanding is that event sourcing allows us to have a snapshot of data upon each change, so we can know what it looks like at any point. What I just described would keep a log of changes, but still only have one version of the file, as the events only contain specific pieces of the whole file.
Can someone describe how the data flow and architecture works with event sourcing? Using the vehicle data example I provided might help me frame it better. I feel that I am close to understanding this, but I am missing some fundamental pieces that I can't seem to understand by searching online.
The current non-relational structure for the data model is that each document represents a vehicle
OK, let's start from there.
In the data model you've described, storage of a document destroys the earlier copy.
Now imagine that instead we were storing the document in a git repository. Then saving the document would also save metadata, and that metadata would include a pointer to the previous document.
Of course, we've probably got a lot of duplication in that case. So instead of storing the complete document every time, we'll store a patch document (think JSON Patch), and metadata pointing to the original patch.
Take that same idea again, but instead of storing generic patch documents, we use domain specific messages that describe what is going on in terms of the model.
That's what the data model of an event sourced entity looks like: a list of domain specific descriptions of document transformations.
When you need to reconstitute the current state, you start with a state you know (which could be the "null" state of the document before anything happened to it), and replay onto that document all of the patches (events) that have occurred since.
If you want to do a temporal query, the game is the same, you replay the events up to the point in time that you are interested in.
So essentially when referring to an older build, you reconstruct the document using the events, correct?
Yes, that's exactly right.
So is there still a "current status" document or is that considered bad practice?
"It depends". In the general case, there is no current status document; only the write-ordered list of events is "real", and everything else is derived from that.
Conversations about event sourcing often lead to consideration of dedicated message stores for managing persistence of those ordered lists, and it is common that the message stores do not also support document storage. So trying to keep a "current version" around would require commits to two different stores.
At this point, designers typically either decide that "recent version" is good enough, in which case they build eventually consistent representations of documents outside of the transaction boundary... OR they decide current version is important, and look into storage solutions that support storing the current version in the same transaction as the events (ex: using an RDBMS).
what is the procedure used to generate the snapshot you want using the events?
If you want to generate a snapshot, then you'll normally end up using a pattern called a projection to iterate over the events and either fold or reduce them to create the document.
Roughly, you have a function somewhere that looks like
document-with-meta-data = projection(event-history-with-metadata)
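In C#, that projection is essentially a fold over the ordered event stream. Here is a rough sketch using the vehicle example from the question; all type and event names are invented for illustration.

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical events for the vehicle example; each records only what changed.
abstract record VehicleEvent(Guid VehicleId, DateTimeOffset OccurredAt);
record VehicleRegistered(Guid VehicleId, DateTimeOffset OccurredAt,
                         string Origin, string Destination) : VehicleEvent(VehicleId, OccurredAt);
record VehicleStatusUpdated(Guid VehicleId, DateTimeOffset OccurredAt,
                            string NewStatus) : VehicleEvent(VehicleId, OccurredAt);
record PackagesLoaded(Guid VehicleId, DateTimeOffset OccurredAt,
                      string PackageType, int Count) : VehicleEvent(VehicleId, OccurredAt);

// The "document" is never the source of truth; it is derived on demand.
record VehicleDocument(Guid Id, string Origin, string Destination,
                       string Status, int PackageCount);

static class VehicleProjection
{
    // document = projection(event-history): a left fold over the events, optionally
    // stopping at a point in time to answer a temporal query.
    public static VehicleDocument Project(IEnumerable<VehicleEvent> history,
                                          DateTimeOffset? asOf = null) =>
        history
            .Where(e => asOf == null || e.OccurredAt <= asOf)
            .Aggregate(
                new VehicleDocument(Guid.Empty, "", "", "Unknown", 0),
                (doc, e) => e switch
                {
                    VehicleRegistered r    => doc with { Id = r.VehicleId, Origin = r.Origin,
                                                         Destination = r.Destination },
                    VehicleStatusUpdated s => doc with { Status = s.NewStatus },
                    PackagesLoaded p       => doc with { PackageCount = doc.PackageCount + p.Count },
                    _ => doc
                });
}

A read model or snapshot is then just the cached output of Project; the ordered event list remains the source of truth.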

How to handle a legally enforced data delete request in an event-sourced system?

In an event sourced system, historic data in the form of events is never thrown away. Doing so could result in a corrupted state. Now imagine there is a court ruling, stating some data needs to be deleted (for example, search engines had to delete privacy specific data). How would you achieve this?
That's a really good question.
So far, I've learned of two possibilities.
Easy part first: if you are using event sourcing, then all of your views of your data should be derivable from the events in your event store. Therefore, all of the data that you have stored for reading (caches, screens, projections, reports) can be blown away and regenerated after you scrub the tainted data from the event store.
So you only need to figure out that part.
First, if the tainted data never gets into the store, you don't have to worry about scrubbing it out. For instance, sensitive information can be isolated in a key value store; references to that data in the event store are always by surrogate key. When you need to scrub, the data in the key value store is nuked, you have a bunch of events that point to something no longer readable, and you just need to ensure that your read models can continue to function if the referenced data is not available.
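A small sketch of that indirection (all names invented): the event carries only a surrogate key, and the sensitive value lives in a separate store that can be erased independently of the immutable event stream.

using System;
using System.Collections.Generic;

// The event that goes into the event store never contains the sensitive value itself.
record CustomerRegistered(Guid CustomerId, Guid PersonalDataKey, DateTimeOffset OccurredAt);

// Sensitive data lives outside the event store, keyed by the surrogate key.
class PersonalDataStore
{
    private readonly Dictionary<Guid, string> _values = new Dictionary<Guid, string>();

    public Guid Put(string sensitiveValue)
    {
        var key = Guid.NewGuid();
        _values[key] = sensitiveValue;
        return key;
    }

    // Read models must tolerate a missing value after an erasure request.
    public bool TryGet(Guid key, out string value) => _values.TryGetValue(key, out value);

    // "Scrubbing": delete the value; the event stream itself is never rewritten.
    public void Erase(Guid key) => _values.Remove(key);
}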
If the data does need to get into the event store -- because it's needed to maintain the integrity of the domain model -- then the idea of "aggregates" may be able to help.
Aggregates are an idea taken from DDD; the basic idea is that your domain can be decomposed into elements that don't need to share data directly. One aggregate never references data within another directly; instead you use indirect references by ID, the ID itself being another surrogate key.
Since these aggregates are isolated from each other, they can have their own event history. In which case you can scrub the tainted data by simply eliminating any aggregates that have been contaminated. You just delete the event streams.
A response like this doesn't put you in a corrupted state, just an inconsistent one. Everything still runs, there's just a bunch of data missing.
There's also the weapon of a "compensating event" available in the toolkit; you might be able to introduce a new stream of events that brings the system back to a consistent state. For example, if scrubbing a bunch of transactions takes the books out of balance, you may be able to publish an event that creates a charge against iCouldTellYouButThen....
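For illustration, such a compensating event might be nothing more than another domain event appended to the stream (names invented):

using System;

// Hypothetical compensating event: appended after the tainted streams are deleted,
// so the books balance again without rewriting any history.
record LedgerAdjustedForErasure(Guid LedgerId, decimal Amount,
                                string Reason, DateTimeOffset OccurredAt);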

How to determine whether a volume supports resolving bookmarks to renamed or moved files?

- bookmarkDataWithOptions:includingResourceValuesForKeys:relativeToURL:error:
Documentation states:
This method returns bookmark data that can later be resolved into a URL object for a file even if the user moves or renames it (if the volume format on which the file resides supports doing so).
My question is, how can I query if a volume supports this feature?
From trial and error it seems only (internal?) hard drives support it, but I am looking for some kind of sure test like a NSURLVolumeSupports???Key.
NSURLVolumeSupportsPersistentIDsKey looks like a good candidate, but I failed to find any docs or google-info about it. Any hints?
It definitely sounds like the NSURLVolumeSupportsPersistentIDsKey would apply.
Following the hints in this forum thread here (archived version here), the documentation for the VOL_CAP_FMT_PERSISTENTOBJECTIDS volume capability flag (from man getattrlist(2)) says:
If this bit is set the volume format supports persistent object identifiers and can look up file system objects by their IDs. See ATTR_CMN_OBJPERMANENTID for details about how to obtain these identifiers.
and the common attribute ATTR_CMN_OBJPERMANENTID documentation says
An fsobj_id_t structure that uniquely and persistently identifies the file system object within its volume; persistence implies that this attribute is unaffected by mount/unmount operations on the volume.
Some file systems can not return this attribute when the volume is mounted read-only and will fail the request with error EROFS. (e.g. original HFS modifies on disk structures to generate persistent identifiers, and hence cannot do so if the volume is mounted read only.)

LINQ to XML updates - how does it handle multiple concurrent readers/writers?

I have an old system that uses XML for its data storage. I'm going to be using the data for another mini-project and wanted to use LINQ to XML for querying/updating the data, but there are 2 scenarios that I'm not sure whether I need to handle myself or not:
1- If I have something similar to the following code, what happens if 2 people hit Save() at the same time? Does LINQ to XML wait until the file is available again before saving, or will it just throw? I don't want to put locks in unless I need to :)
// I assume the next line doesn't lock the file
XElement doc = XElement.Load("Books.xml");
XElement newBook = new XElement("Book",
    new XAttribute("publisher", "My Publisher"),
    new XElement("author", "Me"));
doc.Add(newBook);
// What happens if two people try this at the same time?
doc.Save("Books.xml");
2- If I Load() a document, add an entry under a particular node, and then hit Save(), what happens if another user has already added a value under that node (since I hit Load()), or even worse, deleted the node?
Obviously I can work around these issues, but I couldn't find any documentation telling me whether I have to or not, and the first one at least would be a bit of a pig to test reliably.
It's not really a LINQ to XML issue, but a basic concurrency issue.
Assuming the two people are hitting Save at the same time, and the backing store is a file, then depending on how you opened the file for saving, you might get an error. If you leave it to the XDocument class (by just passing in a file name), then chances are it is opening it exclusively, and someone else trying to do the same (assuming the same code hitting it) will get an exception. You basically have to synchronize access to any shared resource that you are reading from/writing to.
If another user has already added a value, then assuming you don't have a problem obtaining the resource to write to, your changes will overwrite the resource. This is a frequent issue with databases, typically handled with optimistic concurrency: you need some sort of value to indicate whether a change has occurred between the time you loaded the data and when you save it (most databases will generate timestamp values for you).
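For the file-backed case, one way to make the read-modify-write cycle safe is to hold an exclusive FileStream for the whole operation instead of letting XElement open the file by name. A sketch (error handling and any retry policy omitted):

using System.IO;
using System.Xml.Linq;

// Open the file exclusively for the whole read-modify-write cycle, so a second
// writer gets an IOException (sharing violation) instead of silently racing us.
using (var stream = new FileStream("Books.xml", FileMode.Open,
                                   FileAccess.ReadWrite, FileShare.None))
{
    XElement doc = XElement.Load(stream);

    doc.Add(new XElement("Book",
        new XAttribute("publisher", "My Publisher"),
        new XElement("author", "Me")));

    stream.SetLength(0);      // truncate before writing the updated document back
    stream.Position = 0;
    doc.Save(stream);
}

For the second scenario, a version attribute on the root element (bumped on every save and checked after reloading) gives you roughly the same optimistic-concurrency check that databases implement with timestamps.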

Resources