LINQ to XML updates - how does it handle multiple concurrent readers/writers?

I have an old system that uses XML for its data storage. I'm going to be using the data for another mini-project and wanted to use LINQ to XML for querying/updating the data, but there are two scenarios that I'm not sure whether I need to handle myself or not:
1 - If I have something similar to the following code, what happens when two people hit Save() at the same time? Does LINQ to XML wait until the file is available again before saving, or will it just throw? I don't want to put locks in unless I need to :)
// I assume the next line doesn't lock the file
XElement doc = XElement.Load("Books.xml");
XElement newBook = new XElement("Book",
    new XAttribute("publisher", "My Publisher"),
    new XElement("author", "Me"));
doc.Add(newBook);
// What happens if two people try this at the same time?
doc.Save("Books.xml");
2 - If I Load() a document, add an entry under a particular node, and then hit Save(), what happens if another user has already added a value under that node (since my Load()) or, even worse, deleted the node?
Obviously I can workaround these issues, but I couldn't find any documentation that could tell me whether I have to or not, and the first one at least would be a bit of a pig to test reliably.

It's not really a LINQ to XML issue, but a basic concurrency issue.
Assuming the two people are hitting Save at the same time, and the backing store is a file, then depending on how you opened the file for saving, you might get an error. If you leave it to the XDocument class (by just passing in a file name), then chances are it is opening it exclusively, and someone else trying to do the same (assuming the same code hitting it) will get an exception. You basically have to synchronize access to any shared resource that you are reading from/writing to.
If another user has already added a value, then assuming you don't have a problem obtaining the resource to write to, your changes will overwrite theirs. This is a common problem with databases, usually handled with optimistic concurrency: you need some sort of value to indicate whether a change has occurred between the time you loaded the data and the time you save it (most databases will generate timestamp values for you).
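If you do decide to guard the file yourself, here is a rough sketch of one approach (my own illustration, not part of the answer above): open the file exclusively and hold the handle for the whole read-modify-write, so a concurrent writer in the same process waits on the lock, and one in another process gets an IOException instead of silently overwriting your changes.

// Minimal sketch: serialize access to the shared XML file.
using System;
using System.IO;
using System.Xml.Linq;

class BookStore
{
    // Guards against concurrent writers inside this process.
    private static readonly object FileLock = new object();

    public static void AddBook(string path)
    {
        lock (FileLock)
        {
            // FileShare.None guards against other processes: they get an
            // IOException while we hold the handle, rather than a lost update.
            using (var stream = new FileStream(path, FileMode.Open,
                                               FileAccess.ReadWrite, FileShare.None))
            {
                XElement doc = XElement.Load(stream);
                doc.Add(new XElement("Book",
                    new XAttribute("publisher", "My Publisher"),
                    new XElement("author", "Me")));

                stream.SetLength(0);   // truncate, then rewrite the whole document
                stream.Position = 0;
                doc.Save(stream);
            }
        }
    }
}

For the second scenario, one equivalent of a database timestamp could be a version attribute on the root element: read it at Load() time, check it again just before Save(), and if it has changed, reload and reapply your edit instead of overwriting.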

What does data look like when using Event Sourcing?

I'm trying to understand how Event Sourcing changes the data architecture of a service. I've been doing a lot of research, but I can't seem to understand how data is supposed to be properly stored with event sourcing.
Let's say I have a service that keeps track of vehicles transporting packages. The current non relational structure for the data model is that each document represents a vehicle, and has many fields representing origin location, destination location, types of packages, amount of packages, status of the vehicle, etc. Normally this gets queried for information to be read to the front end. When changes are made by the user, the appropriate changes are made to this document in order to update this.
With event sourcing, it seems that a snapshot of every event is stored, but there seem to be a few ways to interpret that:
The first is that the multiple versions of the document I described exist, each a new snapshot every time a change is made. Each event would create a new version of this document and alter it. This is the easiest way for me to wrap my head around it, but I believe this to be incorrect.
Another interpretation I have is that each event stores SPECIFIC information about what's been altered in the document. When the vehicle status changes from On Road to Available, for example, an event specifically for vehicle status changes is triggered. Let's say it's called VehicleStatusUpdatedEvent, and contains the Vehicle ID number, the new status, and the timestamp for this event. So this event is stored and is published to a messaging queue. When picked up from the queue, the appropriate changes are made to the current version of the document. I can understand this, but I think I still have some misconceptions here. My understanding is that event sourcing allows us to have a snapshot of data upon each change, so we can know what it looks like at any point. What I just described would keep a log of changes, but still only have one version of the file, as the events only contain specific pieces of the whole file.
Can someone describe how the data flow and architecture works with event sourcing? Using the vehicle data example I provided might help me frame it better. I feel that I am close to understanding this, but I am missing some fundamental pieces that I can't seem to understand by searching online.
The current non relational structure for the data model is that each document represents a vehicle
OK, let's start from there.
In the data model you've described, storage of a document destroys the earlier copy.
Now imagine that instead we were storing the document in a git repository. Then saving the document would also save metadata, and that metadata would include a pointer to the previous document.
Of course, we've probably got a lot of duplication in that case. So instead of storing the complete document every time, we'll store a patch document (think JSON Patch), and metadata pointing to the original patch.
Take that same idea again, but instead of storing generic patch documents, we use domain specific messages that describe what is going on in terms of the model.
That's what the data model of an event sourced entity looks like: a list of domain specific descriptions of document transformations.
When you need to reconstitute the current state, you start with a state you know (which could be the "null" state of the document before anything happened to it), and replay onto that document all of the patches (events) that have occurred since.
If you want to do a temporal query, the game is the same: you replay the events up to the point in time that you are interested in.
So essentially when referring to an older build, you reconstruct the document using the events, correct?
Yes, that's exactly right.
So is there still a "current status" document or is that considered bad practice?
"It depends". In the general case, there is no current status document; only the write-ordered list of events is "real", and everything else is derived from that.
Conversations about event sourcing often lead to consideration of dedicated message stores for managing persistence of those ordered lists, and it is common that the message stores do not also support document storage. So trying to keep a "current version" around would require commits to two different stores.
At this point, designers typically either decide that "recent version" is good enough, in which case they build eventually consistent representations of documents outside of the transaction boundary... OR they decide current version is important, and look into storage solutions that support storing the current version in the same transaction as the events (ex: using an RDBMS).
what is the procedure used to generate the snapshot you want using the events?
If you want to generate a snapshot, then you'll normally end up using a pattern called a projection to iterate over the events and either fold or reduce them to create the document.
Roughly, you have a function somewhere that looks like
document-with-meta-data = projection(event-history-with-metadata)
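To make that concrete with the vehicle example from the question, here is a rough C# sketch (every type and property name here is my own invention for the illustration, not something prescribed by event sourcing): the projection folds the event history into the document, and a temporal query is the same fold stopped at an earlier point in time.

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical domain events for the vehicle example.
abstract class VehicleEvent { public DateTime At { get; set; } }
class VehicleRegistered : VehicleEvent { public string VehicleId { get; set; } }
class VehicleStatusUpdated : VehicleEvent { public string NewStatus { get; set; } }

// The "document" is derived state, never the source of truth.
class VehicleDocument
{
    public string VehicleId { get; set; }
    public string Status { get; set; }
}

static class VehicleProjection
{
    // document-with-meta-data = projection(event-history-with-metadata)
    public static VehicleDocument Project(IEnumerable<VehicleEvent> history) =>
        history.Aggregate(new VehicleDocument(), Apply);

    // Temporal query: replay only the events up to the point in time of interest.
    public static VehicleDocument ProjectAsOf(IEnumerable<VehicleEvent> history, DateTime asOf) =>
        Project(history.Where(e => e.At <= asOf));

    private static VehicleDocument Apply(VehicleDocument doc, VehicleEvent e)
    {
        switch (e)
        {
            case VehicleRegistered r: doc.VehicleId = r.VehicleId; break;
            case VehicleStatusUpdated s: doc.Status = s.NewStatus; break;
        }
        return doc;
    }
}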

encodeRestorableState for unsaved documents

The documentation for NSDocument states:
Subclasses can override this method and use it to restore any
information that would be needed to restore the document’s window to
its current state. For example, you could use this method to record
references to the data currently managed by the document and displayed
by the window. (Do not store the actual data itself. Store only
references to the data so that you can load it later from disk.) You
must store enough data to reconfigure the document and its window to
their current state during a subsequent launch of the app.
What does "Do not store the actual data itself." actually mean? Is this a hard and fast rule? Or is it more of a guideline?
In particular, I'm wondering about the case of documents with unsaved changes in them. Is it "permissible" to store the unsaved changes (which may be everything if this is a new document)? Or, do I need to save the data off in a file somewhere... and if so, where is the preferred location?
I don't want to restore a bunch of identical (blank) documents if I had multiple unsaved new documents when the application was shut down.
Thanks for any hints on the proper way to handle this.
Never mind. It hit me in the shower this morning (where I make most of my tech breakthroughs).
I am pretty sure now that the key is to get autosaving working with my application.

(why) is FSCTL_SET_OBJECT_ID dangerous?

NTFS files can have object IDs. These IDs can be set using FSCTL_SET_OBJECT_ID. However, the MSDN article says:
Modifying an object identifier can result in the loss of data from portions of a file, up to and including entire volumes of data.
But it doesn't go into any more detail. How can this result in loss of data? Is it talking about potential object id collisions in the file system, and does NTFS rely on them in some way?
Side note: I did some experimenting with this before I found that paragraph, and set the object IDs of some newly created files; here's hoping that my file system's still intact.
I really don't think this can directly result in loss of data.
The only way I can imagine it being possible is if e.g. a backup program assumes that (1) every file has an object ID, and (2) the program is keeping track of all IDs at all times. In that case it might assume that an ID that is not in its database must refer to a file that should not exist, and it might delete the file.
Yeah, I know it sounds ridiculous, but that's the only way I can think of in which this might happen. I don't think you can lose data just by changing IDs.
They are used by the Distributed Link Tracking service, which enables client applications to track link sources that have moved. The link tracking service maintains its link to an object only by using these object identifiers (IDs).
So coming back to your question,
Is it talking about potential object id collisions in the file system?
I don't think so. Windows does provide the option to set object IDs using FSCTL_SET_OBJECT_ID, but that doesn't bring the risk of ID collision.
Attempting to set an object identifier on an object that already has an object identifier will fail.
.. and does NTFS rely on them in some way?
Yes. Object identifiers are used to track files and directories. An index of all object IDs is stored on the volume. Rename, backup, and restore operations preserve object IDs. However, copy operations do not preserve object IDs, because that would violate their uniqueness.
How can this result in loss of data?
You won't get into serious problems if you change (or rather set) the object ID of user-created files (as you did). However, if a user (knowingly or unknowingly) sets an object ID used by a shared object file/library, the change will not be reflected as is.
Since Windows doesn't want everyone (except developers) to play with crucial library files, it issues a generic warning:
Modifying an object identifier can result in the loss of data from
portions of a file, up to and including entire volumes of data.
Bottom line: Change it if you know what you are doing.
There's another MSDN article on distributed link tracking and object identifiers.
Hope it helps!
EDIT:
Thanks to @Mehrdad for pointing this out. I didn't mean the object identifiers of DLLs themselves, but the ones which they use internally.
OLEACC (a DLL) provides the Active Accessibility runtime and manages requests from Active Accessibility clients [source]. It uses the OBJID_QUERYCLASSNAMEIDX object identifier [source].

How do you RESTfully get a complicated subset of records?

I have a question about getting 'random' chunks of available content from a RESTful service, without duplicating what the client has already cached. How can I do this in a RESTful way?
I'm serving up a very large number of items (little articles with text and urls). Let's pretend it's:
/api/article/
My (software) clients want to get random chunks of what's available. There are too many to load them all onto the client. They do not have a natural order, so it's not a situation where they can just ask for the latest. Instead, there are around 6-10 attributes that the client may give to 'hint' what type of articles they'd like to see (e.g. popular, recent, trending...).
Over time the clients get more and more content, but at the server I have no idea what they have already, and because they're sent randomly, I can't just pass in the 'most recent' one they have.
I could conceivably send up the GUIDs of what's stored locally. The clients only store 50-100 locally. That's small enough to stuff into a POST variable, but not into the GET query string.
What's a clean way to design this?
Key points:
Data has no logical order
Clients must cache the content locally
Each item has a GUID
Want to avoid pulling down duplicates
You'll never be able to make this work satisfactorily if the data is truly kept in a random order (bear in mind the Dilbert RNG Effect); you need to fix the order for a particular client so that they can page through it properly. That's easy to do though; just make that particular ordering be a resource itself; at that point, you've got a natural (if possibly synthetic) ordering and can use normal paging techniques.
The main thing to watch out for is that you'll be creating a resource in response to a GET when you do the initial query: you probably should use a resource name that is a hash of the query parameters (including the client's identity if that matters) so that if someone does the same query twice in a row, they'll get the same resource (so preserving proper idempotency). You can always delete the resource after some timeout rather than requiring manual disposal…
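As a rough illustration of that naming idea (my own sketch; the path and helper names are made up, not from the answer): derive the resource name deterministically from the normalized query parameters, so the same query always maps to the same ordering resource, and then page through that frozen ordering.

using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

static class OrderingResource
{
    // Deterministic name for the "ordering" resource: the same query (and the
    // same client, if identity matters) always hashes to the same resource,
    // which keeps the creating GET effectively idempotent.
    public static string NameFor(string clientId, params (string Key, string Value)[] query)
    {
        string canonical = clientId + "|" + string.Join("&",
            query.OrderBy(p => p.Key).Select(p => p.Key + "=" + p.Value));

        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(canonical));
            string hex = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
            return "/api/article-orderings/" + hex.Substring(0, 16);
        }
    }
}

// Usage: the client requests the ordering once, then pages through it, e.g.
//   GET /api/article-orderings/3fa9c1d0a2b4e688?page=3&per_page=50
// and the server can delete the ordering resource after some idle timeout.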

Key based caching

I'm reading this article:
http://37signals.com/svn/posts/3113-how-key-based-cache-expiration-works
I'm not using rails so I don't really understand their example.
It says in #3:
When the key changes, you simply write the new content to this new
key. So if you update the todo, the key changes from
todos/5-20110218104500 to todos/5-20110218105545, and thus the new
content is written based on the updated object.
How does the view know to read from the new todos/5-20110218105545 instead of the old one?
I was confused about that too at first -- how does this save a trip to the database if you have to read from the database anyway to see if the cache is valid? However, see Jesse's comments (1, 2) from Feb 12th:
How do you know what the cache key is? You would have to fetch it from the database to know the mtime right? If you’re pulling the record from the database already, I would expect that to be the greatest hit, no?
Am I missing something?
and then
Please remove my brain-dead comment. I just realized why this doesn’t matter: the caching is cascaded, so yes a full depth regeneration incurs a DB hit. The next cache hit will incur one DB query for the top-level object—all the descendant objects are not queried because the cache for the parent object includes cached versions for the children (thus, no query necessary).
And Paul Leader's comment 2 below that:
Bingo. That’s why it works soooo well. If you do it right it doesn’t just eliminate the need to generate the HTML but any need to hit the db. With this caching system in place, our data-vis app is almost instantaneous, it’s actually useable and the code is much nicer.
So given the models that DHH lists in step 5 of the article and the views he lists in step 6, and given that you've properly set up your relationships to touch the parent objects on update, and given that your partials access your child data as parent.children (or even child.children in nested partials), then this caching system should be a net gain: as long as the parent's cache key is still valid, the parent.children lookup never happens and is also served from cache, and so on.
However, this method may be pointless if your partials reference lots of instance variables from the controller since those queries will already have been performed by the time Rails sees the calls to cache in the view templates. In that case you would probably be better off using other caching patterns.
Or at least this is my understanding of how it works. HTH
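To answer the original question directly, the key is computed from the object you already have in hand, so there is nothing extra to look up: once the parent object is loaded, its id and updated_at give you the key. Here is the same idea translated out of Rails into a rough C# sketch (Todo, the in-memory cache, and BuildHtmlFor are all made up for the illustration):

using System;
using System.Collections.Concurrent;

class Todo
{
    public int Id { get; set; }
    public string Title { get; set; }
    public DateTime UpdatedAt { get; set; }   // "touched" on every change
}

static class TodoCache
{
    private static readonly ConcurrentDictionary<string, string> Cache =
        new ConcurrentDictionary<string, string>();

    // todos/5-20110218105545 style key: updating the todo changes the key, so the
    // next render is a cache miss and the stale entry is simply never read again.
    private static string KeyFor(Todo todo) =>
        "todos/" + todo.Id + "-" + todo.UpdatedAt.ToString("yyyyMMddHHmmss");

    public static string Render(Todo todo) =>
        Cache.GetOrAdd(KeyFor(todo), _ => BuildHtmlFor(todo));

    // Stand-in for the expensive partial rendering (and any child lookups).
    private static string BuildHtmlFor(Todo todo) => "<li>" + todo.Title + "</li>";
}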
