Ruby implementations of the identity map pattern - ruby

I am planning to implement an identity map for a small project which is not using any ORM tool.
The standard implementation in most examples I have seen is just a hash keyed by object id, but that hash will obviously grow without limit. I was thinking of using memcached or Redis with a cache expiration, but that also means some objects will expire from the cache and their data will be fetched again from the database into a new, different object (the same entity represented by different objects in memory).
Considering that most ORMs do not require a running memcached/Redis, how do they solve this problem? Do they actually solve it? Is it not a concern to have an entity represented by repeated instances?
The only solution I know of is in languages that support smart pointers, storing weak references in the hash. It does not look to me as though that approach could be taken in Ruby, so I am wondering how this pattern is normally implemented by Ruby ORMs.

I think they do use a Hash; DataMapper certainly appears to use one. My presumption is that the identity map is per 'session', which is probably flushed after each request (this also ensures transactions are flushed by the end of the request boundary). Thus it can grow unbounded, but it has a fixed horizon that clears it. If the intention is to have a session that lasts longer and needs periodic cleaning, then WeakRef might be useful. I would be cautious about maintaining an identity map for an extended period, however, particularly if concurrency is involved and there are any expectations of consistent transactional changes. I know ActiveRecord considered adding an IdentityMap and then abandoned that effort. Depending on how rows are fetched there may be duplication, but it's probably less than you would think, or the query should be rethought.

Related

Hashing objects by their address and copying garbage collection

Sometimes it's convenient to create a hash table where the hash function uses the hashed object's address. The hash function is easy to implement and cheap to execute.
But there's a problem if you are working on a system which uses some form of copying garbage collection because after an object has been moved the previously calculated hash value is incorrect. Even rehashing the table during or after garbage collection seems problematic since (in principle) a hash function could consume memory and trigger another garbage collection.
My question is how to work around these issues. I'm not seeing any clean solution without some kind of compromise.
How to work around this issue depends on the tools which are provided by the specific ecosystem of the specific programming language one is using.
For example, in C#, one can use RuntimeHelpers.GetHashCode() to get a hash that is guaranteed not to change over the full lifetime of an object, even when a garbage collection takes place and the physical address of the object in memory changes.
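A common way to obtain that kind of guarantee is to stop deriving the hash from the address at all: assign an identity value the first time it is requested and store it with the object, so that moving the object cannot change it. The sketch below only illustrates that idea in Python (where CPython does not move objects anyway); the `StableIdentity` mixin and the `Node` class are hypothetical names, not part of any runtime.

```python
import itertools

_next_identity = itertools.count(1)    # process-wide source of identity values


class StableIdentity:
    """Mixin: the identity hash is assigned once and stored on the object."""

    def __hash__(self):
        try:
            return self._identity_hash
        except AttributeError:
            # First request: pick a value and remember it for the object's lifetime.
            self._identity_hash = next(_next_identity)
            return self._identity_hash

    def __eq__(self, other):
        return self is other           # identity semantics, matching the hash


class Node(StableIdentity):
    def __init__(self, label):
        self.label = label


n = Node("a")
table = {n: "payload"}                 # hash table keyed by object identity
first = hash(n)
second = hash(n)                       # same value, regardless of where n lives
```

The cost is one extra stored value per object that is ever hashed; in exchange the table never needs rehashing after a collection.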

Boost::UUIDs - When to use them ?

Serializing boost::uuids has a cost. Using them to index into a vector / unordered_map requires additional hashing. What are the use cases where boost::uuids are the ideal data structure to use?
UUIDs are valuable if you want IDs that are stable in time and across storage systems.
Imagine having two databases, each with auto-generated IDs.
Merging them would be a headache if the IDs are generated by incrementing integral values from 0.
Merging them would be a breeze if all relevant IDs are UUID.
Likewise, handing out a lot of data to an external party, who records operations offline, and subsequently applying these operations back on original data is much easier with UUIDs - even if relations between elements have been changed, new elements created etc.
UUID is also handy for universal "identification" (not authentication/authorization!) - like in driver versions, plugin ids etc. Think about detecting that an MSI is an update to a particular installed software package.
In general, I wouldn't rate UUIDs a characteristic of any data structure. I'd rate it a tool in designing your infrastructure. It plays on the level of persistence, exchange, not so much on the level of algorithms and in-memory manipulation.
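The merge scenario can be sketched in a few lines. This uses Python's standard `uuid` module rather than Boost, and the two small dictionaries are made-up stand-ins for the two databases.

```python
import uuid

# Two independent stores that each auto-increment integer ids.
store_a = {1: "alice", 2: "bob"}
store_b = {1: "carol", 2: "dave"}
colliding_keys = store_a.keys() & store_b.keys()   # ids used by both stores

# The same records keyed by random (version 4) UUIDs instead.
uuid_a = {uuid.uuid4(): name for name in store_a.values()}
uuid_b = {uuid.uuid4(): name for name in store_b.values()}

# With UUID keys the merge is a plain union; no id has to be rewritten.
merged = {**uuid_a, **uuid_b}
```

With integer keys, every colliding id would need to be renumbered along with all references to it; with UUID keys the records keep their identity across both stores.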

Performance-wise, is it worth it to rename every mongo key name for production? [duplicate]

As far as I know, every key name is stored "as-is" in the mongo database. It means that a field "name" will be stored using the 4 letters everywhere it is used.
Would it be wise, if I want my app to be ready to store a large amount of data, to rename every key in my mongo documents? For instance, "name" would become "n" and "description" would become "d".
I expect it to significantly reduce the space used by the database as well as the amount of data sent to the client (not to mention that it kind of uglifies the content of the mongo documents). Am I right?
If I undertake the rename of every key in my code (no need to rename the existing data, I can rebuild it from scratch), is there a good practice or any additional advice I should know?
Note: this is mainly speculation, I don't have benchmarking results to back this up
While "minifying" your keys technically would reduce the size of your memory/diskspace footprint, I think the advantages of this are quite minimal if not actually disadvantageous.
The first thing to realize is that data stored in MongoDB is not stored in its raw JSON format; it's stored as binary using a standard known as BSON. This allows Mongo to do all sorts of internal optimizations, such as compression if you're using WiredTiger as your storage engine (thanks for pointing that out #Jpaljasma).
Second, let's say you do minify your keys. Well, then you need to minify your keys. Every time. Forever. That's a lot of work on your application side. Plus you need to unminify your keys when you read (because users won't know what n is). Every time. Forever. All of a sudden your minor memory optimization becomes a major runtime slowdown.
Third, that minifying/unminifying process is kind of complicated. You need to maintain a mapping between the two sets of names, keep it tested and up to date, and never allow any overlap (if you do, that's pretty much the end of all your data). I wouldn't ever work on that.
So overall, I think it's a pretty terrible idea to minify your keys to save a couple of characters. It's important to keep the big picture in mind: the VAST majority of your data will be not in the keys, but in the values. If you want to optimize data size, look there.
The full name of every field is included in every document. So when your field-names are long and your values rather short, you can end up with documents where the majority of the used space is occupied by redundant field names.
This affects the total storage size and decreases the number of documents which can be cached in RAM, which can negatively affect performance. But using descriptive field-names does of course improve readability of the database content and queries, which makes the whole application easier to develop, debug and maintain.
Depending on how flexible your driver is, it might also require quite a lot of boilerplate code to convert between your application field-names and the database field-names.
Whether or not this is worth it depends on how complex your database is and how important performance is to you.
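To get a feel for the numbers before deciding, the uncompressed size of a document can be estimated by hand from the BSON layout. The sketch below only handles flat documents whose values are all strings, and it ignores storage-engine compression, so treat the result as a rough upper bound; `bson_size_of_string_doc` is a hypothetical helper, not a driver function.

```python
def bson_size_of_string_doc(doc):
    """Uncompressed BSON size in bytes of a flat document of string values."""
    size = 4 + 1                                     # int32 length prefix + trailing 0x00
    for key, value in doc.items():
        size += 1                                    # element type byte (0x02 = string)
        size += len(key.encode("utf-8")) + 1         # field name as a C string
        size += 4 + len(value.encode("utf-8")) + 1   # int32 length + bytes + 0x00
    return size


long_keys = {"name": "Ann", "description": "x"}
short_keys = {"n": "Ann", "d": "x"}

long_size = bson_size_of_string_doc(long_keys)
short_size = bson_size_of_string_doc(short_keys)
saved_per_document = long_size - short_size
```

The saving per document is exactly the number of characters removed from the field names, so it matters most when values are short and documents are numerous, which matches the trade-off described above.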

How to inform remote users a cached data is out of date?

Our company has built some new web services. The services provide large amounts of data, so it is best to keep the data in a cache for performance reasons. When new or updated data becomes available in our web services, how can we inform our users? What is the best way to do this?
First thing to do is to include the expiration / valid till date along with the data response.
Second thing to do is to make a separate web-service method to check if the data has been modified after the given date.
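A minimal sketch of those two steps, assuming a single dataset and timestamps passed in explicitly so the behaviour is easy to follow. The `DataService` class, its method names and the `valid_until` / `fetched_at` fields are hypothetical, not part of any framework.

```python
from datetime import datetime, timedelta, timezone


class DataService:
    """Hypothetical service holding one dataset and its last-modified time."""

    def __init__(self, payload, now, ttl=timedelta(hours=1)):
        self._payload = payload
        self._ttl = ttl
        self._last_modified = now

    def get_data(self, now):
        # Step 1: ship the expiration time together with the data.
        return {"data": self._payload, "fetched_at": now, "valid_until": now + self._ttl}

    def update(self, payload, now):
        self._payload = payload
        self._last_modified = now

    def modified_since(self, timestamp):
        # Step 2: cheap check a client can call instead of refetching everything.
        return self._last_modified > timestamp


t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
service = DataService({"rows": 3}, now=t0)

response = service.get_data(now=t0 + timedelta(minutes=5))
unchanged = service.modified_since(response["fetched_at"])   # nothing changed yet

service.update({"rows": 4}, now=t0 + timedelta(minutes=10))
changed = service.modified_since(response["fetched_at"])     # client should refetch
```

A client keeps `fetched_at` next to its cached copy, calls the check when convenient, and refetches either when the check reports a change or when `valid_until` has passed.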
You basically have a trade-off between caching, making sure the data is valid, and storing the entire data on your web service. Finding the right solution is an engineering issue that really depends on your specific case, but here are some pointers and possible approaches:
1. Each entry in the cache must have an expiry date and be wiped once that time has passed. This makes sure you don't store old data and that your cache is not full of unnecessary information.
2. You can send a message to all your users once some entry is invalidated, telling them to take this data out of their cache. This requires your clients to listen to you, and becomes inefficient if data changes often.
3. You can store a hash value of each element and, before using the actual value, check that the hash is still correct. This usually requires much less data transfer than checking the actual value, but you can get a false negative: you think a value has not changed, while in fact it has.
4. In some cases (especially peer-to-peer, but not exclusively) it is wise to use Merkle trees. The idea of Merkle trees is that each leaf holds data and its hash value, and each internal node is a hash of its two children. You can find out very quickly whether any change was made to the cache by checking the value of the root, and finding what was changed is done in O(log N). The downside is that this data structure is probabilistic: there is a small yet real chance that a value was changed and you won't detect it. This approach is an efficient generalization of (3).
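The Merkle tree approach can be sketched as follows. This is a simplified illustration using SHA-256 from Python's standard library; it assumes both sides hold the same number of entries and that at most the first differing entry needs to be located, and the function names are made up for this example.

```python
import hashlib


def _h(data):
    return hashlib.sha256(data).digest()


def merkle_levels(entries):
    """Return every level of the tree: leaf hashes first, root level last."""
    level = [_h(e) for e in entries]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                       # odd count: pair the last node with itself
            level = level + [level[-1]]
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def first_changed_leaf(levels_a, levels_b):
    """Walk from the root toward the leaves, following a differing child."""
    if levels_a[-1] == levels_b[-1]:
        return None                              # equal roots: treat caches as identical
    index = 0
    for depth in range(len(levels_a) - 2, -1, -1):
        left = 2 * index
        if levels_a[depth][left] != levels_b[depth][left]:
            index = left
        else:
            index = left + 1                     # parent differs, so the right child must
    return index


server = merkle_levels([b"a", b"b", b"c", b"d"])
stale_client = merkle_levels([b"a", b"b", b"X", b"d"])
fresh_client = merkle_levels([b"a", b"b", b"c", b"d"])

stale_index = first_changed_leaf(server, stale_client)
fresh_index = first_changed_leaf(server, fresh_client)
```

In practice the client would send only its root hash first; the per-level comparison is needed only when the roots differ, which is where the logarithmic cost comes from.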
Ultimately, there is no silver bullet, and the chosen method should fit your specific case, and depends on a lot of factors, some are:
Size of entry in cache
Rate of changes of cache
Web server availability
Availability to maintain connection with clients
Is a probabilistic approach enough?
etc.

Ok to use memcache in this way? or need a system re-architecture?

I have a "score" I need to calculate for multiple items for multiple users. Each user has many, many scores unique to them, and calculating them can be time/processor intensive (the slowness isn't on the database end). To deal with this, I'm making extensive use of memcached. Without memcache some pages would take 10 seconds to load! Memcache seems to work well because the scores are very small pieces of information but take a while to compute. I'm actually setting the keys to never expire, and then I delete them on the occasional circumstances when the score changes.
I'm entering a new phase on this product, and am considering re-architecting the whole thing. There seems to be a way I can calculate the values iteratively and then store them in a local field. It'll be a bit similar to what's happening now, just the value updates will happen faster, the cache will be in the real database, and managing it will be a bit more work (I think I'd still use memcache on top of that, though).
If it matters, it's all in Python/Django.
Is depending on the cache like this bad practice? Is it OK? Why? Should I try to re-architect things?
If it ain't broke... don't fix it ;^) It seems your method is working, so I'd say stick with it. You might look at memcachedb (or Tokyo Cabinet), which is a persistent version of memcache. This way, when the memcache machine crashes and reboots, it doesn't have to recalculate all values.
You're applying several architectural patterns here, and each of them certainly has a place. There's not enough information here for me to evaluate whether your current solution needs rearchitecting or whether your ideas will work. It does seem likely to me that as your understanding of the user's requirements grows you may want to improve things.
As always, prototype, measure performance, consider the trade off between complexity and performance - you don't need to be as fast as possible, just fast enough.
Caching in various forms is often the key to good performance. The question here is whether there's merit in persisting the calculated, cached values. If they're stable over time then this is often an effective strategy. Whether to persist the cache or make space for the values in your database schema will probably depend upon the access patterns. If there are various query paths then a carefully designed database schema may be appropriate.
Rather than using memcached, try storing the computed score in the same place as your other data; this may be simpler and require fewer boxes.
Memcached is not necessarily the answer to everything; it's intended for systems which need to read-scale very highly. It sounds like in your case, it doesn't need to, it simply needs to be a bit more efficient.
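For reference, the pattern described in the question (compute on a miss, keep the value indefinitely, delete it when the score changes) looks roughly like this. A plain dictionary stands in for memcached so the sketch is self-contained; `ScoreCache` and `slow_score` are hypothetical names, and the formula is a placeholder for the real calculation.

```python
class ScoreCache:
    """Cache-aside with explicit invalidation; a dict stands in for memcached."""

    def __init__(self, compute):
        self._compute = compute          # the slow score calculation
        self._store = {}                 # entries never expire on their own

    def get(self, user_id, item_id):
        key = (user_id, item_id)
        if key not in self._store:       # miss: pay the computation cost once
            self._store[key] = self._compute(user_id, item_id)
        return self._store[key]

    def invalidate(self, user_id, item_id):
        """Call whenever something that feeds the score changes."""
        self._store.pop((user_id, item_id), None)


calls = []


def slow_score(user_id, item_id):
    calls.append((user_id, item_id))     # record each real computation
    return user_id * 100 + item_id       # placeholder formula


cache = ScoreCache(slow_score)
first = cache.get(7, 3)
second = cache.get(7, 3)                 # served from the cache
cache.invalidate(7, 3)
third = cache.get(7, 3)                  # recomputed after invalidation
```

Storing the score in a database column instead, as suggested above, keeps the same read and invalidate logic but moves `_store` into the row itself, so the value survives a cache restart.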

Resources