So I'm currently diving into the CQRS architecture along with the EventStore "pattern".
It opens applications to a new dimension of scalability and flexibility as well as testing.
However I'm still stuck on how to properly handle data migration.
Here is a concrete use case:
Let's say I want to manage a blog with articles and comments.
On the write side I'm using MySQL, and on the read side ElasticSearch. Every time I process a Command, I persist the data on the write side, then dispatch an Event to persist the data on the read side.
Now let's say I have some sort of ViewModel called ArticleSummary, which contains an id and a title.
I get a new feature request to include the article tags in my ArticleSummary, so I would add some dictionary to my model to hold the tags.
Given that the tags already exist in my write layer, I would need to update the existing "table", or use a new one, to properly expose the newly included data.
I'm aware of the EventLog Replay strategy, which consists of replaying all the events to "update" all the ViewModels, but, seriously, is it viable when we have a billion rows?
Are there any proven strategies? Any feedback?
I'm aware of the EventLog Replay strategy which consists in replaying
all the events to "update" all the ViewModel, but, seriously, is it
viable when we do have a billion of rows?
I would say "yes" :)
You are going to write a handler for the new summary feature that would update your query side anyway. So you already have the code. Writing special once-off migration code may not buy you all that much. I would go with migration code when you have to do an initial update of, say, a new system that requires some data transformation once off, but in this case your infrastructure would exist.
You would need to send only the relevant events to the new handler so you also wouldn't replay everything.
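A minimal sketch of that idea, in Python for brevity. The event names (ArticleCreated, ArticleTagged, CommentAdded), the Event shape and the ArticleSummaryProjection class are all hypothetical, made up for illustration. The point is only that the new handler declares which event types it cares about, and the replay skips everything else:

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    type: str
    article_id: int
    payload: dict


@dataclass
class ArticleSummaryProjection:
    """Hypothetical read-model handler that builds ArticleSummary documents with tags."""

    # Event types this handler is interested in (class attribute, not a dataclass field).
    handles = {"ArticleCreated", "ArticleTagged"}
    summaries: dict = field(default_factory=dict)

    def apply(self, event: Event) -> None:
        if event.type == "ArticleCreated":
            self.summaries[event.article_id] = {
                "id": event.article_id,
                "title": event.payload["title"],
                "tags": [],
            }
        elif event.type == "ArticleTagged":
            self.summaries[event.article_id]["tags"].append(event.payload["tag"])


def replay(events, projection):
    """Feed the projection only the event types it handles, in original order."""
    for event in events:
        if event.type in projection.handles:
            projection.apply(event)
    return projection


event_log = [
    Event("ArticleCreated", 1, {"title": "Hello CQRS"}),
    Event("CommentAdded", 1, {"text": "Nice post"}),  # skipped by the replay
    Event("ArticleTagged", 1, {"tag": "architecture"}),
]
projection = replay(event_log, ArticleSummaryProjection())
```

The same apply method would be wired to live events afterwards, so no separate one-off migration code is needed.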
In any event, if you have a billion rows of data your servers would probably be able to handle the load :)
I'm currently using NEventStore by JOliver.
When we started, we were replaying our entire store back through our denormalizers/event handlers when the application started up.
We were initially keeping all our data in memory but knew this approach wouldn't be viable in the long term.
The approach we use currently is that we can replay an individual denormalizer, which makes things a lot faster since you aren't unnecessarily replaying events through denormalizers that haven't changed.
The trick we found though was that we needed another representation of our commits so we could query all the events that we handled by event type - a query that cannot be performed against the normal store.
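To illustrate what such a second representation could look like, here is a rough Python sketch. It is not NEventStore's API. EventTypeIndex and the commit tuples are hypothetical, and a real system would keep this index in a table or document store rather than in memory:

```python
from collections import defaultdict


class EventTypeIndex:
    """Hypothetical secondary index: event type -> positions in the commit log."""

    def __init__(self):
        self._positions = defaultdict(list)

    def record(self, position, event_type):
        self._positions[event_type].append(position)

    def positions_for(self, event_types):
        # Merge and sort so the replay preserves the original commit order.
        merged = []
        for event_type in event_types:
            merged.extend(self._positions.get(event_type, []))
        return sorted(merged)


# Stand-in for the normal store: an ordered list of (event type, body) commits.
commits = [
    ("ArticleCreated", {"id": 1}),
    ("CommentAdded", {"id": 1}),
    ("ArticleTagged", {"id": 1}),
    ("CommentAdded", {"id": 2}),
]

index = EventTypeIndex()
for position, (event_type, _body) in enumerate(commits):
    index.record(position, event_type)

# Replay only what a single (changed) denormalizer handles.
to_replay = [commits[p] for p in index.positions_for({"ArticleCreated", "ArticleTagged"})]
```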
I have a trie-based word detection algorithm for a custom dictionary. Note that regular expressions are too brittle with this dictionary as entries may contain spaces, periods, etc.
I've implemented the algorithm in a local C# app that reads in the dictionary from file and stores the trie in memory (it's compact, so no RAM size issues at all). Now I would like to use this algorithm in an MVC 3 app on a cloud host like AppHarbor, with the added twist that I want a web interface to enable adding/editing words.
It's fast enough that loading the dictionary from file and building the trie every time a user uploads their text would not be an issue (< 1s on my laptop). However, if I want to enable admins to edit the dictionary via the web interface, that would seem tricky since the dictionary would potentially be getting updated while a user is trying to upload text for analysis.
What is the best strategy for storing, loading, and updating the trie in an MVC 3 app?
I'm not sure if you are looking for specific implementation details or more conceptual ideas about how to handle this, but I'll throw some ideas out there for now.
Actual Trie Classes - Here is a good C# example of classes for setting up a Trie. It sounds like you already have this part figured out.
Storing - I would persist the trie data to XML unless you are already using a database and have some need to have it in a DBMS. The XML will be simple to work with in the MVC application, and you don't need to worry about database connectivity issues or the added cost of a database. I would also keep two versions of the trie data on the server: a production copy and a production support copy, the second being the one your admin performs transactions against.
Loading - In the admin module of the application, you may implement a feature for loading the trie data into memory. The frequency of data loading depends on your application's needs; it could be scheduled or available as a manual function. As on WordPress sites, a user who accesses the site during an update would receive a message that the site is undergoing maintenance. You may choose to load into memory on demand only, and keep the trie loaded at all times unless problems occur.
Updating - I'd have a second database (or XML file) that is used for applying updates. The method of applying updates to production would depend partially on the frequency, quantity, and time of updates. One safe method might be to store transactions entered by the admin.
For example:
trie.Put("John", 112);
trie.Put("Doe", 222);
trie.Remove("John");
Then apply these transactions to your production data as needed via an admin function. If needed put your site into "maint" mode. If the updates are few and fast you may be able to code the site so that it will hold all work until transactions are processed, a user might have to wait a few milliseconds longer for a result but you wouldn't have to worry about mutating data issues.
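As a rough illustration of the "store transactions, apply them later" idea, here is a Python sketch (not C#). The Trie class is a tiny stand-in for the real one, and TrieStore, queue and apply_pending are hypothetical names. Queued edits are applied to a copy, and the live reference is swapped only once the copy is complete, so readers never see a half-updated trie:

```python
import copy
import threading

END = ""  # end-of-word marker; can never collide with a single character key


class Trie:
    """Tiny stand-in for the real trie: add, remove and look up words."""

    def __init__(self):
        self.root = {}

    def put(self, word, value):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = value

    def remove(self, word):
        # Only clears the end marker; empty branches are not pruned in this sketch.
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return
        node.pop(END, None)

    def get(self, word):
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return None
        return node.get(END)


class TrieStore:
    """Holds the live trie plus a queue of admin transactions."""

    def __init__(self):
        self._live = Trie()
        self._pending = []
        self._lock = threading.Lock()

    def queue(self, op, word, value=None):
        with self._lock:
            self._pending.append((op, word, value))

    def apply_pending(self):
        with self._lock:
            ops, self._pending = self._pending, []
            draft = copy.deepcopy(self._live)  # edit a copy, not the live trie
            for op, word, value in ops:
                if op == "put":
                    draft.put(word, value)
                elif op == "remove":
                    draft.remove(word)
            self._live = draft  # one reference swap publishes all changes

    def lookup(self, word):
        return self._live.get(word)


store = TrieStore()
store.queue("put", "John", 112)
store.queue("put", "Doe", 222)
store.queue("remove", "John")
before = store.lookup("Doe")  # nothing applied yet
store.apply_pending()
```

Copying the whole trie on every apply is only reasonable because the dictionary is described as small and quick to rebuild. With a large trie, a maintenance window or finer-grained locking would be the more likely choice.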
This is pretty vague but just throwing some ideas out there... if you provide comments I'll try to give more.
1 Store the trie in the cache:
It is not dynamic data, and caching also helps with other tasks (like concurrent access to the trie by admin and user)
2 Make access to the cache explicit:
public class TrieHelper
{
    // Returns the cached trie, loading it from file on first access.
    public Trie MyTrie
    {
        get
        {
            var trie = (Trie)HttpContext.Current.Cache["myTrieKey"];
            if (trie == null)
            {
                trie = LoadTrieFromFile(); // returns a Trie object
                HttpContext.Current.Cache["myTrieKey"] = trie;
            }
            return trie;
        }
    }

    // 3 Lock the trie object while an add operation is in progress
    public void AddWordToTrie(string word)
    {
        var trie = MyTrie;
        lock (trie)
        {
            trie.AddWord(word);
        } // note: the trie does not need to stay locked while the data is written to file
        WriteNewWordToTrieFile(word); // should lock the FileWriter object
    }
}
4 If editing is performed by one admin at a time, store the trie in an XML file. It will then be easy to implement the logic for finding the element after which your word should be added (you can create a function that uses the in-memory MyTrie object), and to add it using LINQ to XML.
I've got kind of the same situation, but 10 times bigger :)
The client designs its own calendar with questions and possible answers, while in the meantime another one is online and being used by normal users.
What I came up with was something like test and deploy. The admin enters the calendar values and sets everything up, and can then use a Preview button to see whether it looks the way he needs/wants. Then, to make the changes visible to all end users, he needs to push Deploy.
He, as an ADMIN, knows that until he pushes the DEPLOY button, all users accessing the calendar will see the old values. As soon as he hits Deploy, everything is saved in the database and the files he uploaded are pushed to Amazon S3 (for faster access).
I then update the cache with the new calendar, and the new Calendar object stays cached until the app pool says otherwise or he hits the Deploy button again.
You could do something like this.
As you are going to run your application in a cloud environment, I'd suggest you take a look at CQRS and durable messaging, and provide some concurrency model (possibly optimistic concurrency and intelligent conflict detection, see http://skillsmatter.com/podcast/design-architecture/cqrs-not-just-for-server-systems at 5:00).
Also, obviously, you need to analyze your business requirements more precisely because, as Udi Dahan mentioned, race conditions are the result of a lack of business analysis.
I have a data model which is sort of like this simplified drawing:
[Image: simplified data model diagram] http://dl.dropbox.com/u/545670/thedatamodel.png
It's a little weird, but the idea is that the app manages multiple accounts/identities a person may have into a single messaging system. Each account is associated with one user on the system, and each message could potentially be seen/sent-to multiple accounts (but they have a globally unique ID hence the messageID property which is used on import to fetch message objects that may have already been downloaded and imported by a prior session).
The app is used from a per-account point of view - what I mean is that you choose which account you want to use, then you see the messages and stuff from that account's point of view in your window. So I have the messages attached to the account so that I can easily get the messages that should be shown using a fetch like this:
fetch.fetchPredicate = [NSPredicate predicateWithFormat:@"%@ IN accounts", theAccount];
fetch.sortDescriptors = [NSArray arrayWithObject:[[NSSortDescriptor alloc] initWithKey:@"date" ascending:NO]];
fetch.fetchLimit = 20;
This seems like the right way to set this up in that the messages are shared between accounts and if a message is marked as read by one, I want it seen as being read by the other and so on.
Anyway, after all this setup, the big problem is that memory usage seems to get a little crazy. When I set up a test case where it's importing hundreds of messages into the system, and periodically re-fetching (using the fetch mentioned above) and showing them in a list (only the last 20 are referenced by the list), memory just keeps climbing. 60MB.. 70MB... 100MB.. etc.
I tracked it down to the many-to-many relation between Account and Message. Even with garbage collection on, the managed objects are still being referenced strongly by the account's messages relationship property. I know this because I put a log in the finalize of my Message instance and never see it - but if I periodically reset the context or do refreshObject:mergeChanges: on the account object, I see the finalize messages and memory usage stays pretty consistent (although still growing somewhat, but considering I'm importing stuff, that's to be expected). The problem is that I can't really reset the context or the account object all the time because that really messes up observers that are observing other attributes of the account object!
I might just be modeling this wrong or thinking about it wrong, but I keep reading over and over that it's important to think of Core Data as an object graph and not a database. I think I've done that here, but it seems to be causing trouble. What should I do?
Use the Object Graph instrument. It'll tell you all of the ownerships keeping an object alive.
Have you read the section of the docs on this topic?
I'm struggling to apply RESTful principles to a new web application I'm working on. In particular, it's the idea that to be RESTful, each HTTP request should carry enough information by itself for its recipient to process it to be in complete harmony with the stateless nature of HTTP.
The application allows users to search for medications. The search accepts filters as input, for example: return discontinued medicines, include complementary therapy, etc. In total there are around 30 filters that can be applied.
Additionally, patient details can be entered, including the patient's age, gender, current medications, etc.
To be Restful, should all this information be included with every request? This seems to place a huge overhead on the network. Also, wouldn't the restrictions on URL length, at least for GET, make this unfeasible?
The "Filter As Resource" is a perfect tack for this.
You can PUT the filter definition to the filter resource, and it can return the filter ID.
PUT is idempotent, so even if the filter is already there, you just need to detect that you've seen the filter before, so you can return the proper ID for the filter.
Then, you can add a filter parameter to your other requests, and they can grab the filter to use for the queries.
GET /medications?filter=1234&page=4&pagesize=20
I would run the raw filters through some sort of canonicalization process, just to have a normalized set, so that, e.g. filter "firstname=Bob lastname=Eubanks" is identical to "lastname=Eubanks firstname=Bob". That's just me though.
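A small Python sketch of what such a canonicalization step might look like. It assumes the simple "key=value key=value" form used in the example above, with no spaces inside values, and the function name canonicalize is made up for illustration:

```python
def canonicalize(raw_filter: str) -> str:
    """Normalize a 'key=value key=value' filter: lowercase the keys, sort by key."""
    pairs = []
    for term in raw_filter.split():
        key, _, value = term.partition("=")
        pairs.append((key.lower(), value))
    return " ".join(f"{key}={value}" for key, value in sorted(pairs))


a = canonicalize("firstname=Bob lastname=Eubanks")
b = canonicalize("lastname=Eubanks firstname=Bob")
# a and b are now the same string, so both map to the same stored filter
```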
The only real concern is that, as time goes on, you may need to obsolete some filters. You can simply error out the request should someone make a request with a missing or obsolete filter.
Edit answering question...
Let's start with the fundamentals.
Simply, you want to specify a filter for use in queries, but these filters are (potentially) involved and complicated. If it was simple /medications/1234, this wouldn't be a problem.
Effectively, you always need to send the filter to the query. The question is how to represent that filter.
The fundamental issue with things like sessions in REST systems is that they're typically managed "out of band". When you, say, go and create a medication, you PUT or POST to the medications resource, and you get a reference back to that medication.
With a session, you would (typically) get back a cookie, or perhaps some other token to represent that session. If your PUT to the medications resource created a session also, then, in truth, your request created two resources: a medication, and a session.
Unfortunately, when you use something like a cookie, and you require that cookie for your request, the resource name is no longer the true representation of the resource. Now it's the resource name (the URL), and the cookie.
So, if I do a GET on the resource named /medications/search, and the cookie represents a session, and that session happens to have a filter in it, you can see how in effect, that resource name, /medications/search, isn't really useful at all. I don't have all of the information I need to make effective use, because of the side effect of the cookie and the session and the filter therein.
Now, you could perhaps rewrite the name: /medications/search?session=ABC123, effectively embedding the cookie in the resource name.
But now you run in to the typical contract of sessions, notably that they're short lived. So, that named resource is less useful, long term, not useless, just less useful. Right now, this query gives me interesting data. Tomorrow? Probably not. I'll get some nasty error about the session being gone.
The other problem is that sessions typically are not managed as a resource. For example, they're usually a side effect, vs explicitly managed via GET/PUT/DELETE. Sessions are also the "garbage heap" of web app state. In this case, we're just kind of hoping that the session is properly populated with what is needed for this request. We actually don't really know. Again, it's a side effect.
Now, let's turn it on its head a little bit. Let's use /medications/search?filter=ABC123.
Obviously, casually, this looks identical. We just changed the name from 'session' to 'filter'. But, as discussed, Filters, in this case, ARE a "first class resource". They need to be created, managed, etc. the same as a medication, a JPEG, or any other resource in your system. This is the key distinction.
Certainly, you could treat "sessions" as a first class resource, creating them, putting stuff in them directly, etc. But you can see how, at least from a clarity point of view, a "first class" session isn't really a good abstraction for this case. Using a session, it's like going to the cleaners and handing over your entire purse or briefcase. "Yea, the ticket is in there somewhere, dig out what you want, give me my clothes", especially compared to something explicit like a filter.
So, you can see how, at 30,000 feet, there's not a lot of difference in the case between a filter and a session. But when you zoom in, they're quite different.
With the filter resource, you can choose to make them a persistent thing forever and ever. You can expire them, you can do whatever you want. Sessions tend to have pre-conceived semantics: short-lived, duration of the connection, etc. Filters can have any semantics you want. They're completely separate from what comes with a session.
If I were doing this, how would I work with filters?
I would assume that I really don't care about the content of a filter. Specifically, I doubt I would ever query for "all filters that search by first name". At this juncture, it seems like uninteresting information, so I won't design around it.
Next, I would normalize the filters, like I mentioned above. Make sure that equivalent filters truly are equivalent. You can do this by sorting the expressions, ensuring fieldnames are all uppercase, or whatever.
Then, I would store the filter as an XML or JSON document, whichever is more comfortable/appropriate for the application. I would give each filter a unique key (naturally), but I would also store a hash for the actual document with the filter.
I would do this to be able to quickly find if the filter is already stored. Since I'm normalizing it, I "know" that the XML (say) for logically equivalent filters would be identical. So, when someone goes to PUT, or insert a new filter, I would do a check on the hash to see if it has been stored before. I may well get back more than one (hashes can collide, of course), so I'll need to check the actual XML payloads to see whether they match.
If the filters match, I return a reference to the existing filter. If not, I'd create a new one and return that.
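Here is a rough Python sketch of the store-and-deduplicate steps just described. FilterStore and its in-memory dictionaries are hypothetical stand-ins for whatever persistence the application uses, JSON with sorted keys is assumed as the normalized document form, and SHA-256 is just one possible hash. It is not thread-safe:

```python
import hashlib
import json
import uuid


class FilterStore:
    """In-memory sketch: immutable filters, deduplicated by hash plus payload comparison."""

    def __init__(self):
        self._by_id = {}    # filter id -> normalized document
        self._by_hash = {}  # document hash -> list of filter ids

    @staticmethod
    def _normalize(filter_def: dict) -> str:
        # Uppercase field names and sort keys so equivalent filters serialize identically.
        normalized = {key.upper(): value for key, value in filter_def.items()}
        return json.dumps(normalized, sort_keys=True, separators=(",", ":"))

    def put(self, filter_def: dict) -> str:
        document = self._normalize(filter_def)
        digest = hashlib.sha256(document.encode("utf-8")).hexdigest()
        # Hashes can collide, so compare the actual payloads before reusing an id.
        for candidate_id in self._by_hash.get(digest, []):
            if self._by_id[candidate_id] == document:
                return candidate_id
        new_id = uuid.uuid4().hex
        self._by_id[new_id] = document
        self._by_hash.setdefault(digest, []).append(new_id)
        return new_id

    def get(self, filter_id: str) -> dict:
        return json.loads(self._by_id[filter_id])


store = FilterStore()
first = store.put({"firstname": "Bob", "lastname": "Eubanks"})
second = store.put({"lastname": "Eubanks", "firstname": "Bob"})  # same filter, same id
other = store.put({"firstname": "Alice"})
```

Because put returns the same id for an equivalent filter, repeating the request is harmless, which is the idempotent behaviour expected of PUT.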
I also would not allow a filter UPDATE/POST. Since I'm handing out references to these filters, I would make them immutable so the references can remain valid. If I wanted a filter by "role", say, the "get all expired medications" filter, then I would create a "named filter" resource that associates a name with a filter instance, so that the actual filter data can change but the name remains the same.
Mind, also, that during creation, you're in a race condition (two requests trying to make the same filter), so you would have to account for that. If your system has a high filter volume, this could be a potential bottleneck.
Hope this clarifies the issue for you.
To be Restful, should all this information be included with every request?
No. If it looks like your server is sending (or receiving) too much information, chances are that there are one or more resources you haven't yet identified.
The first and most important step in designing a RESTful system is to identify and name your resources. How would you do that for your system?
From your description, here's one possible set of resources:
User - a user of the system (maybe a doctor or patient (?) - Role might need to be exposed as a resource here)
Medication - the stuff in the bottle, but it also might represent the kind of bottle (quantity and contents), or it might represent a particular bottle - depending on if you're a pharmacy or just a help desk.
Disease - the condition for which a Patient might want to take a Medication.
Patient - a person who might take a Medication
Recommendation - a Medication that might be beneficial to a Patient based on a Disease they suffer from.
Then you could look for relationships among resources;
User has and belongs to many Roles
Medication has and belongs to many Diseases
Disease has many Recommendations.
Patient has and belongs to many Medications and Diseases (poor chap)
Patient has many Recommendations
Recommendation has one Patient and has one Disease
The specifics are probably not right for your particular problem, but the idea is simple: create a network of relationships among your resources.
At this point it might be helpful to think about URI structure, although keep in mind that REST APIs must be hypertext-driven:
# view all Recommendations for the patient
GET http://server.com/patients/{patient}/recommendations
# view all Recommendations for a Medication
GET http://server.com/medications/{medication}/recommendations
# add a new Recommendation for a Patient
POST http://server.com/patients/{patient}/recommendations
Because this is REST, you'll spend most of your time defining the media types used to transfer representations of your resources between client and server.
By exposing more resources, you can cut down on the amount of data that needs to be transferred during each request. Also notice there are no query parameters in the URIs. The server can be as stateful as it needs to be to keep track of it all, and each request can be fully self-contained.
REST is for APIs, not (typical) applications. Don't try to wedge a fundamentally stateful interaction into a stateless model just because you read about it on wikipedia.
To be Restful, should all this information be included with every request? This seems to place a huge overhead on the network. Also, wouldn't the restrictions on URL length, at least for GET, make this unfeasible?
The size of parameters is usually insignificant compared to the size of resources the server sends. If you're using such large parameters that they are a network burden, place them on the server once and then use them as resources.
There are no significant restrictions on URL length -- if your server has such a limit, upgrade it. It's probably years old and chock-full of security vulnerabilities anyway.
No, all of that does not have to be in every request.
Each resource (medication, patient history, etc) should have a canonical URI that uniquely identifies it. In some applications (eg, Rails-based ones) this will be something like "/patients/1234" or "/drugs/5678" but the URL format is unimportant.
A client that has previously obtained the URI for a resource (such as from a search, or from a link embedded in another resource) can retrieve it using this URI.
Are you working on a RESTful API that other apps will use to search your data? Or are you building an end-user focused web application where users will log in and perform these searches?
If your users are logging in, then you're already stateful as you'll have some type of session cookie to maintain the logged in state. I would go ahead and create a session object that contains all the search filters. If a user hasn't set any filters, then this object will be empty.
Here's a great blog post about using GET vs POST. It mentions a URL length limit set by Internet Explorer of 2,048 characters, so you want to use POST for long requests.
http://carsonified.com/blog/dev/the-definitive-guide-to-get-vs-post/
Hopefully you'll see the problem I'm describing in the scenario below. If it's not clear, please let me know.
You've got an application that's broken into three layers:
- front end UI layer, could be an asp.net webform or a window (used for editing Person data)
- middle tier business service layer, compiled into a dll (PersonServices)
- data access layer, compiled into a dll (PersonRepository)
In my front end, I want to create a new Person object, set some properties, such as FirstName, LastName according to what has been entered in the UI by a user, and call PersonServices.AddPerson, passing the newly created Person. (AddPerson doesn't have to be static, this is just for simplicity, in any case the AddPerson will eventually call the Repository's AddPerson, which will then persist the data.)
Now the part I'd like to hear your opinion on is validation. Somewhere along the line, that newly created Person needs to be validated. You can do it on the client side, which would be simple, but what if I wanted to validate the Person in my PersonServices.AddPerson method? This would ensure any person I want to save would be validated, and it removes any dependency on the UI layer doing the work. Or maybe validate both in the UI and in the business service layer. Sounds good so far, right?
So, for simplicity, I'll update the PersonService.AddPerson method to perform the following validation checks
- Check if FirstName and LastName are not empty
- Ensure this new Person doesn't already exist in my repository
And this method will return True if all validation passes and the Person is persisted, False if Validation fails or if the Person is not persisted.
But this Boolean value that AddPerson returns isn't enough for me at the UI layer to give the user a clear reason why the save process failed. So what's a lonely developer to do? Ultimately, I'd like the AddPerson method to be able to ensure what it's about to save is valid, and if not, be able to communicate the reasons why it's not valid to my UI layer.
Just to get your juices flowing, some ways of solving this could be: (Some of these solutions, in my opinion, suck, but I'm just putting them there so you get an understanding of what I'm trying to solve)
- Instead of AddPerson returning a boolean, it can return an int (i.e. 0 = success, non-zero = failure, with the number indicating the reason why it failed).
- In AddPerson, throw custom exceptions when validation fails. Each type of custom exception would have its own error message. In addition, each custom exception would be unique enough to catch in the UI layer.
- Have AddPerson return some sort of custom class that would have properties indicating whether validation passed or failed, and if it did fail, what the reasons were.
- Not sure if this can be done in VB or C#, but attach some sort of property to the Person and its underlying properties. This "attached" property could contain things like validation info.
Insert your idea or pattern here
And maybe another here
Apologies for the long-winded question, but I'd definitely like to hear your opinion on this.
Thanks!
Multiple layers of validation go well with multi-layer apps.
The UI itself can do the simplest and quickest checks (are all mandatory fields present, are they using the appropriate character sets, etc) to give immediate feedback when the user makes a typo.
However the business logic should have the lion's share of validation responsibilities... and for once it's not a problem if this is "repetitious", i.e., if the business layer re-checks something that should already have been checked in the UI -- the BL should check all the business rules (this double checks on UI's correctness, enables multiple different UI clients that may not all be perfect in their checks -- e.g. a special client on a smart phone which may not have good javascript, and so on -- and, a bit, wards against maliciously hacked clients).
When the business logic saves the "validated" data to the DB, that layer should perform its own checks -- DBs are good at that, and, again, don't worry about some repetition -- it's the DB's job to enforce data integrity (you might want different ways to feed data to it one day, e.g. a "bulk loader" to import a number of Persons from another source, and it's key to ensure that all those ways to load data always respect data integrity rules); some rules such as uniqueness and referential integrity are really best enforced in the DB, in particular, for performance reasons too.
When the DB returns an error message (data not inserted as constraint X would be violated) to the business layer, the latter's job is to reinterpret that error in business terms and feed the results to the UI to inform the user; and of course the BL must similarly provide clear and complete info on business rules violation to the UI, again for display to the user.
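As an illustration of that translation step, here is a small Python sketch using the standard-library sqlite3 module with an in-memory database. The person table, its columns and the wording of the messages are assumptions made up for the example. The uniqueness rule lives in the DB, and the business-layer function turns the resulting constraint error into something a UI could show:

```python
import sqlite3


def add_person(conn, first_name, last_name):
    """Return (ok, message), rewording a DB constraint failure in business terms."""
    try:
        with conn:  # commits on success, rolls back if the insert raises
            conn.execute(
                "INSERT INTO person (first_name, last_name) VALUES (?, ?)",
                (first_name, last_name),
            )
    except sqlite3.IntegrityError as exc:
        if "UNIQUE" in str(exc):
            return False, f"A person named {first_name} {last_name} already exists."
        return False, "The person could not be saved because required data is missing."
    return True, "Person saved."


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    " first_name TEXT NOT NULL,"
    " last_name TEXT NOT NULL,"
    " UNIQUE (first_name, last_name))"
)

first = add_person(conn, "Ada", "Lovelace")   # inserted
second = add_person(conn, "Ada", "Lovelace")  # rejected by the UNIQUE constraint
```

Matching on the error text is brittle and is used here only to keep the sketch short. A real data access layer would more likely map specific constraint names or error codes.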
A "custom object" is thus clearly "the only way to go" (in some scenarios I'd just make that a JSON object, for example). Keeping the Person object around (to maintain its "validation problems" property) when the DB refused to persist it does not look like a sharp and simple technique, so I don't think much of that option; but if you need it (e.g. to enable "tell me again what was wrong" functionality, maybe if the client went away before the response was ready and needs to smoothly restart later; or, a list of such objects for later auditing, &c), then the "custom validation-failure object" could also be appended to that list... but that's a "secondary issue", the main thing is for the BL to respond to the UI with such an object (which could also be used to provide useful non-error info if the insertion did in fact succeed).
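A minimal Python sketch of such a custom result object, assuming the two checks from the question (non-empty names, no duplicate). SaveResult, PersonService and the in-memory set standing in for PersonRepository are hypothetical names for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    first_name: str
    last_name: str


@dataclass
class SaveResult:
    """What the business layer hands back to the UI instead of a bare boolean."""

    errors: list = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.errors


class PersonService:
    def __init__(self):
        self._people = set()  # in-memory stand-in for PersonRepository

    def add_person(self, person: Person) -> SaveResult:
        result = SaveResult()
        if not person.first_name.strip():
            result.errors.append("FirstName is required.")
        if not person.last_name.strip():
            result.errors.append("LastName is required.")
        key = (person.first_name.strip().lower(), person.last_name.strip().lower())
        # Only check for duplicates once the basic field checks have passed.
        if result.ok and key in self._people:
            result.errors.append("This person already exists.")
        if result.ok:
            self._people.add(key)
        return result


service = PersonService()
saved = service.add_person(Person("Ada", "Lovelace"))
duplicate = service.add_person(Person("Ada", "Lovelace"))
empty = service.add_person(Person("", " "))
```

The UI can then check result.ok and display result.errors as they are, without having to know which rule failed.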
Just a quick (and hopefully helpful) comment: when you're wondering where to place validation, try pretending that, soon, you're going to completely recreate your UI layer using a technology you're not yet so familiar with**. Try to keep out of that layer any validation-like business logic that you know for certain you'd have to rewrite in the new technology.
You'll find exceptions - business logic that ends up in your UI layer regardless, but it's a useful consideration nonetheless.
** Mobile dev, Silverlight, Voice XML, whatever - pretending you don't know the technology of your "new" UI layer helps you abstract your concerns and get less mired in implementation details.
The only important points are:
- From the perspective of the front-end(s), the Middle Tier must perform all validation; you never know whether someone is going to try circumventing your front-end validation by talking directly to your Middle Tier (for whatever reason)
- The Middle Tier may elect to delegate some of that validation to the DB layer (e.g. data integrity constraints)
- You may optionally duplicate some validation in the UI, but that should only be for the sake of performance (to avoid round-trips to the Middle Tier for common scenarios, such as missing mandatory fields, incorrectly formatted data, etc.). These checks should never take the place of doing them in the Middle Tier
Validation should be done at all three levels.
When I am on a project I assume I am building a framework, even though most of the time that is not the case. Each layer is separate and must check all of its input before performing an operation.
Each level can have a different way of doing it; it is not necessary that they all use the same approach, but ideally they should all use the same validation with the ability to customize it.
You never want to let bad data into the database. So you can never trust the data you are getting from the business layer. It needs to be checked.
In the business layer you can never trust the UI layer, and you must check it to prevent un-needed calls to the database layer. The UI layer works the same way.
I disagree with David Basarab's comment that the same validations should be present in all layers. For one, this defies the paradigm of layer responsibility. Secondly, though the main intention is to make the layers (or components) loosely coupled, it is also important that a level of responsibility (and hence trust) is endowed on the layers. It might be necessary to duplicate some validations in the UI and the Business Layer (since the UI layer can be bypassed by hacking attempts); however, it is not advisable to repeat the validations in each layer. Each layer should perform only those validations for which it is responsible. The biggest flaw in repeating validations in all layers is code redundancy, which can cause a maintenance nightmare.
A lot of this is more style than substance. I personally favor returning status objects as a flexible and extensible solution. I would say that there are a couple of classes of validation in play, the first being "does this person data conform to the contract of what a person is?" and the second being "does this person data violate constraints in the database?" I think the first validation can, and should, be done at the client. The second should be done at the middle tier. With this division, you may find that the only reasons the save could fail are 1) it violates a uniqueness constraint, or 2) something catastrophic. You could then return false for the first case, and throw an exception for the other.
If tier R is closer to the user (or any input stream you don't control) than tier S then tier S should validate all data received from tier R. This does not mean that tier R shouldn't validate data. It's better for the user if the GUI warns him he's making a mistake before he attempts a new transaction. But no matter how bulletproof the validation in your GUI is, the next tier up should not trust that any validation has taken place.
This assumes your database is completely under your control. If not, you have bigger problems.
Also, you could have the UI pass the data needed to build a Person object through some sort of PersonBuilder object, so that object creation is consolidated in the domain/business layer, and you can keep the Person object in a state that is always consistent. This makes more sense for more complex entities, however even for simple ones, it is good to centralize object creation, just like you centralize persistence, etc.
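A short Python sketch of how such a builder could look. The method names (with_first_name, with_last_name, build) and the required-field rule are assumptions for illustration. The UI only ever talks to the builder, and a Person can only come out of build() in a consistent state:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    """Immutable once built, so it cannot drift into an inconsistent state."""

    first_name: str
    last_name: str


class PersonBuilder:
    """Collects raw UI input and is the only place a Person gets created."""

    def __init__(self):
        self._first_name = ""
        self._last_name = ""

    def with_first_name(self, value: str) -> "PersonBuilder":
        self._first_name = value.strip()
        return self

    def with_last_name(self, value: str) -> "PersonBuilder":
        self._last_name = value.strip()
        return self

    def build(self) -> Person:
        fields = (("FirstName", self._first_name), ("LastName", self._last_name))
        missing = [name for name, value in fields if not value]
        if missing:
            raise ValueError("Missing required fields: " + ", ".join(missing))
        return Person(self._first_name, self._last_name)


person = PersonBuilder().with_first_name(" Ada ").with_last_name("Lovelace").build()
```

Raising an exception from build() is just one option here; the builder could equally return a status object listing the problems.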