At what point should you cache a value for a property?

I've always wondered when and where it is best to cache a property value... Some cases seem pretty simple, like the one below...
public DateTime FirstRequest {
    get {
        // Lazily capture the timestamp the first time the property is read.
        if (this.m_FirstRequest == null) {
            this.m_FirstRequest = DateTime.Now;
        }
        return (DateTime)this.m_FirstRequest;
    }
}
private DateTime? m_FirstRequest;
But what about some more complicated situations?

- A value that comes from a database, but remains true after it's been selected.
- A value that is stored in a built-in cache and might expire from time to time.
- A value that has to be calculated first?
- A value that requires some time to initialize: 0.001s? 0.1s? 1s? 5s?
- A value that is set, but something else may come and set it to null to flag that it should be repopulated.

There seem to be limitless situations.
What do you think is the point at which a property can no longer take care of itself and instead requires something else to populate its value?
[EDIT]
I see suggestions that I'm optimizing too early, etc. But my question is about when it is time to optimize. I'm not asking whether to cache everything; rather, when it is time to cache, whose responsibility should populating the value be?

In general, you should get the code working first, optimize later, and only make optimizations that profiling says will help you.

I think you need to turn your question the other way around lest you fall into a trap of optimizing too early.
When do you think is the point that a property no longer needs recalculating on every call and instead uses some form of caching?
The caching of a value is an optimization and should therefore not be done as the norm. There are some cases where it is clearly relevant, but in most cases you should implement the property to work correctly every time, doing the appropriate work to get the value, and then look to optimize it once you've profiled and shown that optimization is required.
Some reasons why caching is or is not a good idea:
- Don't cache if the value is prone to frequent change
- Do cache if it never changes and you are responsible for it never changing
- Don't cache if you are not responsible for providing the value as you're then relying on someone else's implementation
There are many other reasons for and against caching a value but there is certainly no hard and fast rule on when to cache and when not to cache - each case is different.
If you have to cache...
Assuming that you've determined some form of caching is the way to go, then how you perform that caching depends on what you are caching, why, and how the value is provided to you.
For example, if it is a singleton object or a timestamp as in your example, a simple "is it set?" condition that sets the value once is a valid approach (that or create the instance during construction). However, if it's hitting a database and the database tells you when it changes, you could cache the value based on a dirty flag that gets dirtied whenever the database value says it has changed. Of course, if you have no notification of changes, then you may have to either refresh the value on every call, or introduce a minimum wait time between retrievals (accepting that the value may not always be exact, of course).
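To make that concrete, here is a minimal sketch combining two of those ideas: a dirty flag driven by an external change notification, plus a minimum wait between retrievals for sources that never notify. The class, member names, and the 30-second interval are illustrative assumptions, not a prescription.

using System;

public class CachedValue {
    private string m_Value;
    private bool m_Dirty = true;                       // set when the source reports a change
    private DateTime m_LastFetch = DateTime.MinValue;
    private static readonly TimeSpan MinimumWait = TimeSpan.FromSeconds(30);

    // Wire this to whatever change notification the source provides.
    public void MarkDirty() { m_Dirty = true; }

    public string Value {
        get {
            // Refresh when flagged dirty, or when the minimum wait has elapsed
            // for sources that give no change notification at all.
            if (m_Dirty || DateTime.Now - m_LastFetch > MinimumWait) {
                m_Value = FetchFromSource();           // hypothetical retrieval call
                m_LastFetch = DateTime.Now;
                m_Dirty = false;
            }
            return m_Value;
        }
    }

    private string FetchFromSource() { /* hit the database or other source */ return "..."; }
}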
Whatever the situation, you should always consider the pros and cons of each approach and consider the scenarios under which the value is referenced. Do the consumers always need the most recent value or can they cope with being slightly behind? Are you in control of the value's source? Does the source provide notification of changes (or can you make it provide such notifications)? What is the reliability of the source? There are many factors that can affect your approach.
Considering the scenarios you gave...
Again, assuming that caching is needed.
A value that comes from a database, but remains true after it's been selected.
If the database is guaranteed to retain that behavior, you can just poll the value the first time it is requested and cache it thereafter. If you can't guarantee the behavior of the database, you may want to be more careful.
A value that is stored in a built-in cache and might expire from time to time.
I would use a dirty flag approach where the cache is marked dirty from time to time to indicate that the cached value needs refreshing, assuming that you know when "time to time" is. If you don't, consider a timer that indicates the cache is dirty at regular intervals.
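A rough sketch of the timer variant (the names are illustrative, and timer disposal is omitted for brevity):

using System;
using System.Threading;

public class ExpiringValue {
    private string m_Value;
    private volatile bool m_Dirty = true;
    private readonly Timer m_Timer;

    public ExpiringValue(TimeSpan interval) {
        // Periodically mark the value dirty so the next read refreshes it.
        m_Timer = new Timer(_ => m_Dirty = true, null, interval, interval);
    }

    public string Value {
        get {
            if (m_Dirty) {
                m_Value = Refresh();   // reload from the underlying cache or source
                m_Dirty = false;
            }
            return m_Value;
        }
    }

    private string Refresh() { /* ... */ return "..."; }
}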
A value that has to be calculated first?
I would judge this based on the value. Even if you think caching is needed, the compiler may have optimised it already. However, assuming caching is needed, if the calculation is lengthy, it might be better to perform it during construction or via an interface such as ISupportInitialize.
A value that requires some time to initialize: 0.001s? 0.1s? 1s? 5s?
I'd have the property do no calculation for this and implement an event that indicates when the value changes (this is a useful approach in any situation where the value might change). Once the initialization is complete, the event fires, allowing consumers to get the value. You should also consider that this might not be suited for a property; instead, consider an asynchronous approach such as a method with a callback.
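A sketch of that event-based shape, assuming the expensive work can run on a background task (all names here are illustrative):

using System;
using System.Threading.Tasks;

public class SlowValueProvider {
    public event EventHandler ValueChanged;    // raised once the value is available
    public string Value { get; private set; }  // null until initialization completes

    public void BeginInitialize() {
        Task.Run(() => {
            Value = ExpensiveCalculation();    // the multi-second work
            // Note: this fires on a background thread; marshal to the UI
            // thread if consumers need that.
            ValueChanged?.Invoke(this, EventArgs.Empty);
        });
    }

    private string ExpensiveCalculation() { /* ... */ return "ready"; }
}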
A value that is set, but something else may come and set it to null to flag that it should be repopulated.
This is just a special case of the dirty flag approach discussed for the second scenario above.

Related

How to validate from aggregate

I am trying to understand validation from the aggregate entity on the command side of the CQRS pattern when using event sourcing.
Basically I would like to know what the best practice is in handling validation for:
1. Uniqueness of say a code.
2. The correctness/validation of an external aggregate's id.
My initial thoughts:
I've thought about the constructor passing in a service, but this seems wrong, as the "Create" of the entity should take only the values to assign.
I've thought about validation outside the aggregate, but this seems to put logic somewhere that I assume should be the responsibility of the aggregate itself.
Can anyone give me some guidance here?
Uniqueness of say a code.
Ensuring uniqueness is a specific example of set validation. The problem with set validation is that, in effect, you perform the check by locking the entire set. If the entire set is included within a single "aggregate", that's easily done. But if the set spans aggregates, then it is kind of a mess.
A common solution for uniqueness is to manage it at the database level; RDBMS are really good at set operations, and are effectively serialized. Unfortunately, that locks you into a database solution with good set support -- you can't easily switch to a document database, or an event store.
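As an illustration of the database-level approach, here is a hedged C# sketch assuming SQL Server via System.Data.SqlClient; a unique index enforces the invariant, and the handler translates the duplicate-key errors (2601 and 2627 are SQL Server's codes for these) into a domain error. Table and index names are hypothetical.

using System;
using System.Data.SqlClient;

public class CodeRegistrar {
    // Assumes a one-time schema step such as:
    //   CREATE UNIQUE INDEX UX_Codes ON Codes(Code)
    public void RegisterCode(string code, SqlConnection connection) {
        using (var command = new SqlCommand(
            "INSERT INTO Codes (Code) VALUES (@code)", connection)) {
            command.Parameters.AddWithValue("@code", code);
            try {
                command.ExecuteNonQuery();
            } catch (SqlException ex) when (ex.Number == 2601 || ex.Number == 2627) {
                // The database detected the duplicate; surface it as a domain error.
                throw new InvalidOperationException($"Code '{code}' is already in use.");
            }
        }
    }
}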
Another approach that is sometimes appropriate is to have the single aggregate check for uniqueness against a cached copy of the available codes. That gives you more freedom to choose your storage solution, but it also opens up the possibility that a data race will introduce the duplication you are trying to avoid.
In some cases, you can encode the code uniqueness into the identifier for the aggregate. In effect, every identifier becomes a set of one.
Keep in mind Greg Young's question
What is the business impact of having a failure?
Knowing how expensive a failure is tells you a lot about how much you are permitted to spend to solve the problem.
The correctness/validation of an external aggregate's id.
This normally comes in two parts. The easier one is to validate the data against some agreed upon schema. If our agreement is that the identifier is going to be a URI, then I can validate that the data I receive does satisfy that constraint. Similarly, if the identifier is supposed to be a string representation of a UUID, I can test that the data I receive matches the validation rules described in RFC 4122.
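For instance, a minimal sketch of such schema-level checks in C# (note that Guid.TryParse is lenient about the exact string form, so tighten the check if you need one canonical format):

using System;

public static class IdentifierSchema {
    // Accepts the standard string representations of an RFC 4122 UUID.
    public static bool IsValidUuidId(string candidate) {
        Guid parsed;
        return Guid.TryParse(candidate, out parsed);
    }

    // Checks that the identifier satisfies absolute-URI syntax.
    public static bool IsValidUriId(string candidate) {
        Uri parsed;
        return Uri.TryCreate(candidate, UriKind.Absolute, out parsed);
    }
}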
But if you need to check that the identifier is in use somewhere else? Then you are going to have to ask. The main question in this case is whether you need the answer right away, or whether you can check asynchronously (for instance, by modeling "unverified identifiers" and "verified identifiers" separately).
And of course you once again get to reconcile all of the races inherent in distributed computing.
There is no magic.

What does 'dirty-flag' / 'dirty-values' mean?

I see some variables named 'dirty' in some source code at work and some other code. What does it mean? What is a dirty flag?
Generally, dirty flags are used to indicate that some data has changed and needs to eventually be written to some external destination. It isn't written immediately because adjacent data may also get changed, and writing data in bulk is generally more efficient than writing individual values.
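A minimal illustration of that pattern (the names are mine, not from any particular codebase):

public class Document {
    private string m_Text = "";
    private bool m_Dirty;              // true when m_Text is newer than the saved copy

    public string Text {
        get { return m_Text; }
        set { m_Text = value; m_Dirty = true; }
    }

    public void Flush() {
        if (!m_Dirty) return;          // nothing changed since the last write
        SaveToDestination(m_Text);     // one bulk write instead of one per edit
        m_Dirty = false;
    }

    private void SaveToDestination(string text) { /* write to disk, server, etc. */ }
}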
There's a deeper issue here. Rather than asking what 'dirty' means in the context of code, I think we should really be asking: is 'dirty' an appropriate term for what is generally intended?
'Dirty' is potentially confusing and misleading. To many new programmers it will suggest corrupt or erroneous form data. The word 'dirty' implies that something is wrong and that the data needs to be purged or removed. Something dirty is, after all, undesirable, unclean and unpleasant.
If we mean 'the form has been touched' or 'the form has been amended but the changes haven't yet been written to the server', then why not 'touched' or 'writePending' rather than 'dirty'?
That I think, is a question the programming community needs to address.
Dirty could mean a number of things; you need to provide more context. But in a very general sense, a "dirty flag" is used to indicate whether something has been touched / modified.
For instance, see usage of "dirty bit" in the context of memory management in the wiki for Page Table
"Dirty" is often used in the context of caching, from application-level caching to architectural caching.
In general, there are two kinds of caching mechanisms: (1) write-through and (2) write-back, WT and WB for short.
WT means that the write is done synchronously both to the cache and to the backing store. (In the context of databases, for example, "the cache" and "the backing store" might stand for main memory and the disk, respectively.)
In contrast, for WB, initially, writing is done only to the cache. The write to the backing store is postponed until the cache blocks containing the data are about to be modified/replaced by new content.
The data that has reached the cache but not yet the backing store is the dirty data. When implementing a WB cache, you can set dirty bits to indicate whether a cache block contains dirty data or not.
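To make the WB case concrete, here is a hedged sketch (the types are illustrative; a real cache tracks blocks, eviction policy, and concurrency, all omitted here):

using System.Collections.Generic;

public class WriteBackCache {
    private class Entry { public string Value; public bool Dirty; }

    private readonly Dictionary<string, Entry> m_Entries = new Dictionary<string, Entry>();
    private readonly IDictionary<string, string> m_BackingStore;

    public WriteBackCache(IDictionary<string, string> backingStore) {
        m_BackingStore = backingStore;
    }

    public void Write(string key, string value) {
        // Write to the cache only; the dirty bit defers the store write.
        m_Entries[key] = new Entry { Value = value, Dirty = true };
    }

    public void Evict(string key) {
        Entry entry;
        if (m_Entries.TryGetValue(key, out entry)) {
            if (entry.Dirty)
                m_BackingStore[key] = entry.Value;   // the write back happens only now
            m_Entries.Remove(key);
        }
    }
}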

Implementation of Unsaved Changes Detection

It seems like there are three ways one might approach detecting unsaved changes in a text/image/data file:
1. Update a boolean flag every time the user makes a change or saves, which would result in a lot of unnecessary updates.
2. Keep a cached copy of the original file and diff the two every time a save operation needs to be checked.
3. Keep a stack of all past operations and push/pop operations as needed, resulting in a lot of extra memory usage.
In general, how do commercial applications detect whether unsaved changes exist and what are the advantages/disadvantages of each approach? I ran into this issue while writing a custom application that has special saving behavior and would like to know if there is a known best practice.
As long as you need an undo/redo system, you need that stack of past operations. To detect which state the document is in, one item of the stack is marked as the 'saved state'; if the current stack node is not that item, the document has changed.
You can see an example of this in Qt's QUndoStack (http://doc.qt.nokia.com/stable/qundostack.html) and its isClean() and setClean() methods.
As for proposition 1, updating a boolean is not problematic and takes little time.
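A rough C# sketch of that clean-index idea, modeled loosely on QUndoStack (the interface and names are mine):

using System.Collections.Generic;

public interface IUndoableOperation { void Undo(); void Redo(); }

public class UndoStack {
    private readonly List<IUndoableOperation> m_Operations = new List<IUndoableOperation>();
    private int m_Current = 0;    // number of operations currently applied
    private int m_CleanAt = 0;    // value of m_Current at the last save

    public void Push(IUndoableOperation op) {
        // Truncating the redo history can make the saved state unreachable.
        if (m_CleanAt > m_Current) m_CleanAt = -1;
        m_Operations.RemoveRange(m_Current, m_Operations.Count - m_Current);
        m_Operations.Add(op);
        m_Current++;
    }

    public void Undo() { if (m_Current > 0) m_Operations[--m_Current].Undo(); }
    public void Redo() { if (m_Current < m_Operations.Count) m_Operations[m_Current++].Redo(); }

    public void SetClean() { m_CleanAt = m_Current; }             // call after saving
    public bool IsClean { get { return m_Current == m_CleanAt; } }
}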
This depends on the features you want and the size/format of the files, I guess.
The first option is the simplest, and it gives you just what you want with minimal overhead.
The second option has the advantage that you can detect when changes have been manually reverted, so that there is no real change after all (although that probably doesn't happen all too often). On the other hand, it is much more costly to make a diff just to check whether anything was modified. You probably don't want to do that every time the user presses a key.
The third option gives the ability to provide an undo-history. You could limit the number of items in that history by grouping changes together that were made consecutively (without moving the cursor in between), or something like that.

Is RegNotifyChangeKeyValue as coarse as it seems?

I've been using ReadDirectoryChangesW to monitor a particular portion of the file system. It rather nicely provides a partial pathname to the file or directory which changed along with a clue about the nature of the change. This may have spoiled me.
I also need to monitor a particular portion of the registry, but it looks as if RegNotifyChangeKeyValue is very coarse. It will tell me that something under the given key changed, but it doesn't seem to want to tell me what that something might have been. Bummer!
The portion of the registry in question is arbitrarily deep, so enumerating all the sub-keys and calling RegNotifyChangeKeyValue for each probably isn't a hot idea because I'll eventually end up having to overcome MAXIMUM_WAIT_OBJECTS. Plus I'd have to adjust the set of keys I'd passed to RegNotifyChangeKeyValue, which would be a fair amount of effort to do without enumerating the sub-keys every time, which would defeat a fair amount of the purpose.
Any ideas?
Unfortunately, yes. You probably have to cache all the values of interest to your code and update this cache yourself whenever you get a change trigger, or else set up multiple watchers, one on each of the individual data items of interest. As you noted, the second solution gets unwieldy very quickly.
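The caching approach amounts to snapshot-and-diff: take a snapshot of the subtree, and when the coarse notification fires, take another and compare. A hedged sketch using only the standard Microsoft.Win32 registry API (note that Equals is too simplistic for REG_BINARY values, which would need element-wise comparison):

using System.Collections.Generic;
using Microsoft.Win32;

static class RegistrySnapshot {
    // Recursively read every value under 'key' into a flat dictionary keyed
    // by "subkey\path\valueName".
    public static Dictionary<string, object> Take(RegistryKey key, string prefix = "") {
        var snapshot = new Dictionary<string, object>();
        foreach (string valueName in key.GetValueNames()) {
            snapshot[prefix + valueName] = key.GetValue(valueName);
        }
        foreach (string subKeyName in key.GetSubKeyNames()) {
            using (RegistryKey subKey = key.OpenSubKey(subKeyName)) {
                if (subKey != null) {
                    foreach (var pair in Take(subKey, prefix + subKeyName + "\\")) {
                        snapshot[pair.Key] = pair.Value;
                    }
                }
            }
        }
        return snapshot;
    }

    // Report entries that were added, removed, or modified between snapshots.
    public static IEnumerable<string> Diff(Dictionary<string, object> before,
                                           Dictionary<string, object> after) {
        foreach (var pair in after) {
            if (!before.ContainsKey(pair.Key) || !Equals(before[pair.Key], pair.Value))
                yield return pair.Key;
        }
        foreach (string path in before.Keys) {
            if (!after.ContainsKey(path))
                yield return path;
        }
    }
}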
If you can implement the required code in .NET, you can get the same effect more elegantly via RegistryEvent and its subclasses.
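For reference, the RegistryEvent family lives in WMI (RegistryTreeChangeEvent, RegistryKeyChangeEvent, RegistryValueChangeEvent) and is consumed via System.Management. A sketch with a hypothetical key path; note it is still coarse in the same way, telling you that the tree changed but not what:

using System;
using System.Management;

class RegistryWatcher {
    static void Main() {
        // WQL requires backslashes in RootPath to be escaped.
        var query = new WqlEventQuery(
            @"SELECT * FROM RegistryTreeChangeEvent " +
            @"WHERE Hive = 'HKEY_LOCAL_MACHINE' AND RootPath = 'SOFTWARE\\MyApp'");
        using (var watcher = new ManagementEventWatcher(query)) {
            watcher.EventArrived += (sender, e) =>
                Console.WriteLine("Something under the watched tree changed.");
            watcher.Start();
            Console.ReadLine();   // keep watching until Enter is pressed
            watcher.Stop();
        }
    }
}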

Does soCaseInsensitive greatly impact performance for a TdxMemIndex on a TdxMemDataset?

I am adding some indexes to my DevExpress TdxMemDataset to improve performance. The TdxMemIndex has SortOptions which include the option for soCaseInsensitive. My data is usually a GUID string, so it is not case sensitive. I am wondering if I am better off just forcing all the data to the same case or if the soCaseInsensitive flag and using the loCaseInsensitive flag with the call to Locate has only a minor performance penalty (roughly equal to converting the case of my string every time I need to use the index).
At this point I am leaving soCaseInsensitive off and just converting case.
IMHO, the best approach is to ensure data quality at Post time. Reasoning:
You (usually) know the nature of the data. So, e.g., you can use UpperCase (knowing that GUIDs are all in the ASCII range) instead of the much slower AnsiUpperCase which a general component like TdxMemDataSet is forced to use.
You enter the data only once. Searching/sorting/filtering, which all exercise the internal uppercasing engine of TdxMemDataSet, are repeated actions. Also, there are other chained actions which will trigger this engine without you realizing it (e.g. a TcxGrid which is sorted by default, having GridMode:=True, with a class acting as a broker passing the sort message to the underlying dataset; I assume that you use the DevExpress components).
Usually data entry is done in steps, one or a few records in a batch. The only notable exception is data acquisition applications. But in both cases users tolerate far greater response times there, which gives you room to play with. (IOW, how much would an UpperCase call add to a record post which lasts 0.005 ms?) OTOH, users are very demanding about the speed of data retrieval operations (searching, sorting, filtering etc.). Keep data retrieval as fast as you can.
Having the data in the database ready to expose reduces the risk of processing errors when (or if) you write other modules, since you'd otherwise need to remember to AnsiUpperCase the data in every module in every language you write. A classic example is when you use other external tools to access the data (e.g. a DB manager executing an SQL SELECT over the data).
hth.
Maybe the DevExpress forums (or even a support email, if you have access to it) would be a better place to seek an authoritative answer on that performance question.
Anyway, it is better to guarantee that data is in the format you want, for the reasons plainth already explained, the moment you save it. So, in this specific case, make sure the GUID is written in upper case (or lower; it's a matter of taste). If it is SQL Server or another database server that has a GUID datatype, make sure the SELECT does the work; if applicable and possible, even the sort.
