I'd like to discuss a post on the 37signals blog called "How key-based cache expiration works". I'm a Django developer, not a RoR one, so here is the Django "translation" by Ross Poulton: "Key-based cache expiration with Django".
As you can see, the main idea is as follows: we have a "russian-doll" structure, where one object contains several levels of others.
class A:
    timestamp updated_at
class B:
    A parent
    timestamp updated_at
class C:
    B parent
    timestamp updated_at
The view (for example, HTML) of an object of class A is cached together with all of its related objects. When an instance of class C is updated, we need to:
Update the timestamp in C.
Update the timestamp in B.
Update the timestamp in A.
When we access the view of class A after this, we need to:
Make a SELECT to get the timestamp from A.
See that there is no cached object for this timestamp, so we need to re-cache it.
Make a SELECT to get A's data.
Make a SELECT to get all the timestamps from B.
Fetch the Bs that exist in the cache.
Make a SELECT to get the Bs that do not exist in the cache.
Make a SELECT to get all the timestamps of the Cs related to the Bs that do not exist in the cache.
Fetch the Cs from the cache, where they exist.
Make a SELECT to get the Cs that do not exist in the cache.
So, if I understand this strategy correctly, we need to make 6 queries to the DB, 2 for each class: one to get the timestamps, and a second to get the objects whose cached copies are stale.
Instead, if we simply reset all the data, we only need to make 3 queries:
Get object A.
Get the related B objects.
Get the related C objects.
As far as I know, it's often better to execute 3 queries returning more data than 6 queries returning less. So is this strategy effective?
Of course, we could store the timestamps in the cache too, but then we would face the problem of invalidating those timestamps, and it makes no sense to reintroduce invalidation into a strategy whose whole point is to avoid it.
Please correct me if I misunderstand the scope or the working principle of this algorithm.
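To make sure I'm reading the pattern right, here is a minimal, self-contained sketch of key-based expiration in plain Python, with a dict standing in for the cache and a monotonic counter standing in for updated_at; all names here are mine for illustration, not from the post:

```python
import itertools

_clock = itertools.count()  # monotonic counter standing in for updated_at

class Model:
    """Minimal stand-in for a DB-backed object (illustrative only)."""
    def __init__(self, pk):
        self.pk = pk
        self.updated_at = next(_clock)

    def touch(self):
        # In the real pattern, saving C would also touch its parent B,
        # which in turn touches A.
        self.updated_at = next(_clock)

def cache_key(obj):
    # The timestamp is part of the key, so stale entries are never
    # invalidated explicitly; they simply stop being looked up.
    return "%s:%s:%s" % (type(obj).__name__, obj.pk, obj.updated_at)

cache = {}

def cached_view(obj, render):
    key = cache_key(obj)
    if key not in cache:  # miss: the object changed, or was never cached
        cache[key] = render(obj)
    return cache[key]
```

A touch() yields a new key on the next lookup, which is exactly why every child update has to bump the parent timestamps all the way up.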
TL;DR: Is it enough to call repository.save() on the owning entity to persist the relationship, or do I need to save both entities?
Let's say I have two entities, A and B, with a one-to-many relationship between them: A can have multiple Bs, and a B can have an A. B is the owning side of the relationship (it has the foreign key). If I have two already persisted entities and want to add and persist a relationship between them, I typically do this:
a.addB(b);
b.setA(a);
bRepository.save(b);
My question is: do I also need to call aRepository.save(a)? Thanks in advance; Googling didn't help me find the answer.
If, as you describe, the relationship is owned by B, then A didn't change at all as far as JPA is concerned, so it doesn't need to be persisted.
If you have persisted or loaded A and B in the current session, no save at all is technically necessary: JPA keeps track of the entities, notes that they have changed, and will flush the changes to the database at the end of the transaction.
Good question. Assuming that you have already saved the A entity, the answer is that you do NOT need to save the parent A entity again, since you added the child entity B to A's list of children yourself and A is already persisted.
If you ever reload A and all its children, you should get the same list as you currently have.
Since the collection is lazily loaded, your query should explicitly load the children when you need them; otherwise you might end up assuming that A has all its children when it doesn't, because you reloaded A from the database without fetching them.
In general, though, I have to question why you are keeping A around in the first place. Caching can be a good thing, but your cache should refresh A when its children are updated, and it should fetch all of A's children if that is what is needed. In that case you don't need to add the new child to A yourself, because it will be overwritten anyway. It probably doesn't hurt, but why second-guess the cache?
More generally, the pattern is simply to save B and be done with it. If your code needs A and all its children, it should fetch them from the database when needed.
These thoughts do not cover JPA's entity cache, since I have not attempted to go into specifics about it.
There are times when you store entire object graphs in the cache, or objects with collections, which makes cache invalidation a little tricky.
What techniques are there for knowing when to invalidate a cache?
For simple objects you can invalidate whenever you update/save the object: you simply make an extra call and refresh the cache.
When you have a rich object like, for example:
User
Locations
Sales
History
Now this user object becomes 'dirty' whenever the user's properties or the Locations/Sales/History collection data are mutated.
I think one simple method would be to update the 'modified_date' property of the user object, keep the modified_date as part of the cache key, make a call to get the user row, and then pull the object graph from the cache based on the modified_date in the key:
user_cache_key + user.id + user.modified_date
The only problem with this approach is that you have to make sure you update the 'modified_date' whenever any of the object's dependencies are updated.
Are there any other possible solutions to this problem?
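One way to sketch the modified_date-in-the-key idea; the loader functions here are hypothetical placeholders for your user-row and object-graph queries:

```python
cache = {}

def user_cache_key(user_id, modified_date):
    # modified_date is baked into the key: bumping it on any write to the
    # user or its Locations/Sales/History collections makes the old entry
    # unreachable, so no explicit invalidation is needed.
    return "user:%s:%s" % (user_id, modified_date)

def get_user_graph(user_id, load_modified_date, load_graph):
    # One cheap call to fetch the current modified_date...
    modified = load_modified_date(user_id)
    key = user_cache_key(user_id, modified)
    if key not in cache:
        # ...and the expensive object graph only on a miss.
        cache[key] = load_graph(user_id)
    return cache[key]
```

The cheap per-request lookup is the price; the win is that stale graphs age out of the cache instead of having to be hunted down and invalidated.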
In my data model, I take a statement by a user containing hashtags; each hashtag is turned into a node, and their co-occurrence is the relationship between them. For each relationship I need to take into account:
the user who created it (the rel.user property)
the time it was created (the rel.timestamp property)
the context it was created in (the rel.context property)
the statement it was made in (the rel.statement property)
Now, Neo4j doesn't allow relationship property indexing, so when I run a search that requires retrieving and evaluating those properties, it takes a very long time. Specifically, when I issue a Cypher request of this kind:
MERGE (hashtag1)-[rel:TO {
    context: "deb659c0-a18d-11e3-ace9-1fa4c6cf2894",
    statement: "824acc80-aaa6-11e3-88e3-453baabaa7ed",
    user: "b9745f70-a13f-11e3-98c5-476729c16049"
  }]->(hashtag2)
ON CREATE SET
  rel.uid = "824f6061-aaa6-11e3-88e3-453baabaa7ed",
  rel.timestamp = "13947117878770000";
This request first checks whether there is already a relationship with those properties; if there is, it does nothing, but if there is none, it adds a new one (with a unique ID and timestamp). Because each relationship has to be evaluated, and the properties are not indexed, the request takes a very long time to go through. This is a problem because I'm dealing with about 100 nodes and 300 relationships per query (the one above is only one type; a few other kinds are added to the query as well, but the ones above are the most expensive).
Therefore, the actual question:
Does anybody know of a good way to keep those relationship properties while making them faster to retrieve and evaluate when needed? Or do you think I should use a different kind of request (and if so, which one)?
Thank you!
This almost looks to me as if your relationship should actually be a node, which would then be connected to these nodes:
context
user
statement
tag1
tag2
tagN
Then you have sensible merge options (e.g. merge on the UID).
Currently you lose the power of the graph model for your relationships.
This is also discussed in the Graph Databases book, in the chapter on the email domain.
Do you already have your hashtag1 and hashtag2 nodes available?
And if so, how many rels already exist between these?
What Cypher has to do for this to work is go over each of those relationships and compare all three properties (which I'm not sure will fit into short-string storage), so they have to be loaded if they are not in the cache. You can check your store files: if you have a large string store file, those UIDs might not fit into the property records and would have to be loaded separately.
What is the memory configuration of your server (heap and memory-mapped I/O)?
All of that adds up.
If making things work were the only requirement, we could put all the controller logic and DB handling logic in the views, and it would work. However, that is not the right approach for a reusable design.
Before I ask my real design question, below is my current understanding of the separation of responsibilities with respect to models:
All database-related code, even DB-related logic, should go in the models.
For a table, say 'my_tab', Propel generates 4 classes, of which only 2, 'MyTab.php' and 'MyTabPeer.php', should be edited.
MyTabPeer.php must contain only data fetching.
Any logic required to fetch data should go in 'MyTab.php'.
This is simple, and I hope it is correct; if not, please correct me.
Now I have a special situation. I have 4 tables, say a, b, c, d. For those, Propel generated 8 editable classes (excluding the Base*.php ones):
A.php APeer.php B.php BPeer.php
C.php CPeer.php D.php DPeer.php
One page of my application shows a Mailbox (say). Mailbox is not a table in the database; it gets its data from a complex join query over the above 4 tables, along with a lot of calculations/conditions.
I wrote that query, fetched its data, and displayed it. The Mailbox works as expected. However, I did it in my controller (action class), which I know is not the right place for it.
My question is: where should I put that code? Possible options:
The controller, I think, is not the right place for DB logic/fetching.
I have 8 model classes, but the data doesn't belong to any single one of them; it is a combination of them all.
A separate helper/lib, but I know I'll never reuse that code, as it's a unique page of the site.
Anywhere else?
Please suggest if I'm wrong, but I guess I should put it in the models, as it is fetching data. Since A is the primary table, I should probably put the code in A.php and APeer.php. If that is the correct place, the next question is: what should go in A.php and what in APeer.php? I have the following operations to do:
Some logic to decide which columns I should select.
Like any mailbox, I can show received/sent messages. The controller will tell it what to show, but there is some DB logic to set up the conditions.
Actually fetching the data with the complex join query.
The returned data will have all the rows, but I might need to merge a few rows conditionally.
As per my understanding, point 3 should go in APeer.php and the rest in A.php. Is my understanding correct?
You should create a separate model class, e.g. Mailbox.
A method of this model should perform the complex select and return the data to your action in the controller. This solution will not break the MVC approach.
I have some data being pulled in from an Entity Framework model. It contains attributes of items, let's say car parts with max speed, weight, and size. Since there are a lot of parts and the base attributes never change, I've cached all the records.
Depending on the car these parts are used in, the attributes may be changed, so I set up a new car, copy the values from the cached item "Engine" to the new car object, and then add a "TurboCharger", which boosts the max speed, weight, and size of the Engine.
The problem I'm running into is that the Entity Framework model still seems to track the context back to the cached data, so when the weight is increased by the local method, it increases for all users. I tried adding MergeOption.NoTracking to my context, as this is supposed to remove all entity tracking, but it still seems to track back. If I turn off the cache, it works fine, as it pulls fresh values from the database each time.
If I want to copy a record from my entity model, is there a way to say "copy the object but treat it as a plain object with no history of coming from the Entity Framework", so that once my car has the attributes of an item, it is just a flattened object?
Cheers!
I'm not too sure about MergeOption.NoTracking on the whole context and exactly what it does, but what you can do as an alternative is add .AsNoTracking() to your query against the database. This will definitely return a detached object.
Take a look here for some details on AsNoTracking usage: http://blog.staticvoid.co.nz/2012/04/entity-framework-and-asnotracking.html.
The other thing is to make sure you enumerate your collection before you insert it into the cache, to ensure that you aren't acting on the queryable; i.e., use .ToArray().
The other option is to manually detach the object from the context (using Detach(T entity)).