mongoid embedded 1-n vs referenced 1-n document

I'm designing an API and I'm wondering which of these two queries is more efficient.
the embedded 1-n association:
profile = Profile.where('authenticated_devices.device_token'=> device_token, 'authenticated_devices.access_token'=> access_token).first
or the referenced 1-n association:
device = AuthenticatedDevice.where(device_token: device_token, access_token: access_token).first
profile = device.profile
profile.authenticated_device = device
I've run explain on both: the referenced query uses a BtreeCursor, while the embedded query uses a BasicCursor. Could adding indexes make the embedded 1-n query faster than the referenced 1-n query? Also, what are the pitfalls of this query? If I want absolute speed for my API, is it better to use the embedded 1-n or the referenced 1-n? Let's assume this API is also under heavy load.
Update:
I had this question: "The real decision for referencing to embedding depends on the amount of "related" data and how you intend to access it."
Answer: This is a very simple API. It loads the current_user's Profile, which has all of their info. With basically every API call I'll pass a device id and access token; this is the info embedded in the Profile model, in what I call the Authenticated Device. The API will be under heavy load. I'm trying to determine whether I should stick with the embedded 1-n and add indexes, or move over to the referenced 1-n. Speed is my highest concern. Does this help?

The bottom line here is that for anything other than an embedded document you must make more than one query to the database. So in the referenced scenario you are finding the "child" in one request and then accessing the parent in another request. The code may hide this a bit, but it is actually two round trips to the database.
So the embedded model will be faster. What might be confusing you at the moment is the lack of an index on the authenticated_devices.device_token field within your Profile model and collection; the BasicCursor in your explain output means the embedded query is scanning the collection without an index. With the index in place, these look-ups are optimal.
It is true that another consideration here could be the cost of pulling the document that contains all of the "devices" in the embedded collection, but as long as the information is reasonably light it should still incur less overhead than an additional trip to the database.
As a final point, if the information you access from Profile is actually very light, then even though it might go against your sensibilities, the fastest approach is very likely to replicate that information "per device" and fetch it in a single request, rather than reference another document with another request.
So look at your usage patterns and consider the size of the data, but generally as long as you have indexes in place for your usage patterns there should be nothing wrong with the embedded model.
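To make the round-trip argument concrete, here is a minimal sketch in plain Ruby. It does not use Mongoid or a real database: FakeCollection and find_one are hypothetical stand-ins that only count how many requests each model needs, and the sample data is made up.

```ruby
# In-memory stand-in for a database collection that counts round trips.
# FakeCollection and find_one are hypothetical, not part of Mongoid.
class FakeCollection
  attr_reader :round_trips

  def initialize(docs)
    @docs = docs
    @round_trips = 0
  end

  # One call models one request to the database.
  def find_one(&predicate)
    @round_trips += 1
    @docs.find(&predicate)
  end
end

# Embedded model: devices live inside the profile document.
embedded_profiles = FakeCollection.new([
  { id: 1, name: "alice",
    authenticated_devices: [{ device_token: "d1", access_token: "a1" }] }
])

profile = embedded_profiles.find_one do |p|
  # Both fields must match on the same embedded device.
  p[:authenticated_devices].any? do |d|
    d[:device_token] == "d1" && d[:access_token] == "a1"
  end
end
embedded_trips = embedded_profiles.round_trips

# Referenced model: devices are their own collection and point at a profile.
profiles = FakeCollection.new([{ id: 1, name: "alice" }])
devices = FakeCollection.new([
  { device_token: "d1", access_token: "a1", profile_id: 1 }
])

device = devices.find_one do |d|
  d[:device_token] == "d1" && d[:access_token] == "a1"
end
referenced_profile = profiles.find_one { |p| p[:id] == device[:profile_id] }
referenced_trips = devices.round_trips + profiles.round_trips

puts "embedded: #{embedded_trips} round trip(s), referenced: #{referenced_trips}"
```

One pitfall worth checking on the real embedded query: in MongoDB, two separate dotted conditions on an array field can be satisfied by two different array elements, so a device_token from one device and an access_token from another would still match. The $elemMatch operator is the usual way to require both on the same embedded device, as the sketch does with any?.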

Related

Does this "overly general" type of programming have a name?

Anyone who has experience with the Salesforce platform will know it can essentially be used as a backend for a lot of web applications. They let the end user define custom objects and the fields on those objects. So for instance, rather than having some entity as a strongly-typed class in the code, they have a generic "custom object", whose behaviour and data are defined by the fields you choose and the triggers and rules you apply to it. So they don't have to update the code, recompile and redeploy every time a user adds one (which, given that they are a web service, would be impractical and would cause a lot of serious downtime).
I was thinking about how this could be implemented. Salesforce may do it in a very complex way, but I'm specifically thinking about how I could implement it myself. So far I've come up with this:
An "object definition", which contains all the metadata for a specific record type. Equivalent to a hardcoded class definition.
A generic "record", probably with some sort of dictionary/map tying values to field identifiers that exist in the object definition.
When operating on user data, both the record and the object definition need to be in memory so that the integrity of the data can be checked. Behaviour normally provided by methods can be applied using some kind of trigger system (again, I'm using a Salesforce example here because it's the best example I know of) with defined actions/events.
This whole system seems very clunky, slow (without serious optimisation), and like it would be prone to problems which wouldn't plague 99% of software projects, so I'd like to learn more about it, but I have no idea where to start looking.
Is the idea I've laid out above already an existing paradigm and if so what is it called?

What you have encountered is known as custom fields. The design enables tenant-specific fields on a fixed entity. Multi-tenancy at the highest level demands that a single codebase and database be used for all tenants, with the option of full customization, and this design is the best approach for that. The link below points to a patent that was granted for managing custom fields per tenant.
https://www.google.com/patents/US7779039
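The "object definition" plus generic "record" split described in the question can be sketched in a few lines of Ruby. This is only an illustration of the idea, not how Salesforce or the patent implements it; ObjectDefinition, Record, errors_for and the Invoice fields are all hypothetical names.

```ruby
# ObjectDefinition holds per-type metadata; Record holds values keyed by
# field name. All class, method and field names here are hypothetical.
class ObjectDefinition
  attr_reader :name, :fields

  # fields maps a field name to the Ruby class its values must belong to.
  def initialize(name, fields)
    @name = name
    @fields = fields
  end

  # Returns a list of problems; an empty list means the values are valid.
  def errors_for(values)
    unknown = (values.keys - fields.keys).map { |key| "unknown field #{key}" }
    wrong_type = values.map do |key, value|
      expected = fields[key]
      "#{key} must be #{expected}" if expected && !value.is_a?(expected)
    end.compact
    unknown + wrong_type
  end
end

class Record
  attr_reader :definition, :values

  def initialize(definition, values = {})
    @definition = definition
    @values = {}
    values.each { |key, value| set(key, value) }
  end

  # Every write is checked against the definition held in memory.
  def set(key, value)
    problems = definition.errors_for({ key => value })
    raise ArgumentError, problems.join(", ") unless problems.empty?
    @values[key] = value
  end
end

invoice_def = ObjectDefinition.new("Invoice", { "number" => String, "total" => Numeric })
invoice = Record.new(invoice_def, { "number" => "INV-1", "total" => 99.5 })
```

Adding a field for one tenant then means changing the data in that tenant's ObjectDefinition rather than changing and redeploying code.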

Mongo Db design (embed vs references)

I've read a lot of documents, Q&A, etc. about this topic (whether to embed or to use references).
I understand the arguments for using one approach or the other, but I can't find anyone discussing (or asking about) a case like this:
I have two entities, A and B, in a ONE_TO_MANY relation (one A can belong to many B). I can embed A in B (the denormalization approach), and that's fine (I clearly understand it). But what if I later want to modify a field of an A document that is already used in many B documents? Modifying does not mean replacing A with A'; it means changing that exact A record. In the embedded case, I would have to apply the change to every B document that already holds a copy of A.
Based on the description here: http://docs.mongodb.org/manual/tutorial/model-embedded-one-to-many-relationships-between-documents/#data-modeling-example-one-to-many
What if we later want to change the address:name field that is used in many documents?
What if we need the list of available addresses in the system?
How fast will those operations be in MongoDB?

It depends on which operations you use most. If you are mostly inserting and selecting a lot of documents, and there is a possibility that, say, once a month you will need to modify many nested sub-documents, I think storing A inside B is good practice; it's what MongoDB is designed for. You save a lot of time by selecting a single document without needing to join others, and you can live with a slower update once in a while.

How fast the update ops will be is obviously dependent on volume of data.
Another consideration when choosing between embedded docs and references is whether the volume of data in a single document would exceed 16 MB. That's a lot of embedded documents, mind.
In some cases, however, it simply doesn't make sense to denormalise entire documents, especially where they're used/referenced elsewhere.
Take a User document, for example: you wouldn't usually denormalise all user attributes across each collection that needs to reference a user. Instead you reference the user [with maybe some denormalised user detail].
Obviously each additional denormalised value (unless it was an audit) would need to be updated when the referenced User changes, but you could queue the updates for a background process to deal with - rather than making the caller wait.
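A rough sketch of that pattern in plain Ruby follows: the comment documents reference the user by id and carry one denormalised field, and a rename only enqueues the fan-out work. The collections are in-memory hashes, a plain Array stands in for a real job queue, and every name (users, comments, rename_user, drain) is hypothetical.

```ruby
# Hypothetical in-memory collections. Each comment references its user by id
# and keeps a denormalised copy of the user's name for fast reads.
users = { 1 => { name: "Ann", email: "ann@example.com" } }
comments = [
  { id: 10, user_id: 1, user_name: "Ann", body: "first" },
  { id: 11, user_id: 1, user_name: "Ann", body: "second" }
]

# A plain Array stands in for a real background job queue.
pending = []

# The caller updates the authoritative User and returns immediately.
rename_user = lambda do |user_id, new_name|
  users[user_id][:name] = new_name
  pending << { user_id: user_id, user_name: new_name }
end

# A background process later refreshes the denormalised copies.
drain = lambda do
  until pending.empty?
    job = pending.shift
    comments.each do |comment|
      comment[:user_name] = job[:user_name] if comment[:user_id] == job[:user_id]
    end
  end
end

rename_user.call(1, "Anna")
stale = comments.map { |c| c[:user_name] } # copies are still old here
drain.call
fresh = comments.map { |c| c[:user_name] } # copies now match the User
```

The trade-off is that readers can see the old name between the rename and the drain, which is usually acceptable for display fields.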

I'll throw in some more advice as to speed.
If you have a sub-document called A that is embedded in lots of documents - and you want to change instances of A ...
Be careful that the documents don't grow too much with a change. If A grows too big, it will hurt performance because it can force Mongo to relocate the whole document.
It obviously depends on how many embedded instances you have. The more you have, the slower it will be.
It depends on how you match the sub-document. If you are finding A without an index, it's going to be slow. If you are using range operators to identify it, it will be slow.
Someone already mentioned the size of documents will most likely affect the speed.
The best advice I heard about whether to link or embed was this ... if the entity (A in this case) is mutable ... if it is going to mutate/change often ... then link it, don't embed it.

How do you RESTfully get a complicated subset of records?

I have a question about getting 'random' chunks of available content from a RESTful service, without duplicating what the client has already cached. How can I do this in a RESTful way?
I'm serving up a very large number of items (little articles with text and urls). Let's pretend it's:
/api/article/
My (software) clients want to get random chunks of what's available. There are too many to load them all onto the client. They do not have a natural order, so it's not a situation where they can just ask for the latest. Instead, there are around 6-10 attributes that the client may give to 'hint' what type of articles they'd like to see (e.g. popular, recent, trending...).
Over time the clients get more and more content, but at the server I have no idea what they have already, and because they're sent randomly, I can't just pass in the 'most recent' one they have.
I could conceivably send up the GUIDs of what's stored locally. The clients only store 50-100 locally. That's small enough to stuff into a POST variable, but not into the GET query string.
What's a clean way to design this?
Key points:
Data has no logical order
Clients must cache the content locally
Each item has a GUID
Want to avoid pulling down duplicates

You'll never be able to make this work satisfactorily if the data is truly kept in a random order (bear in mind the Dilbert RNG Effect); you need to fix the order for a particular client so that they can page through it properly. That's easy to do though; just make that particular ordering be a resource itself; at that point, you've got a natural (if possibly synthetic) ordering and can use normal paging techniques.
The main thing to watch out for is that you'll be creating a resource in response to a GET when you do the initial query: you probably should use a resource name that is a hash of the query parameters (including the client's identity if that matters) so that if someone does the same query twice in a row, they'll get the same resource (so preserving proper idempotency). You can always delete the resource after some timeout rather than requiring manual disposal…
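As a rough illustration of both ideas (a resource name derived from a hash of the query, and a fixed ordering that can be paged), here is a small Ruby sketch. The /api/article/selection/ path, the client id format, and the in-memory ORDERINGS store are assumptions made for the example; a real service would persist the ordering and expire it after a timeout.

```ruby
require "digest"

# Stored orderings, keyed by resource path (hypothetical in-memory store).
ORDERINGS = {}

# Builds a stable resource name from the client identity and query hints,
# so the same query asked twice maps to the same resource.
def selection_path(client_id, params)
  # Sort the pairs so parameter order does not change the hash.
  canonical = params.map { |key, value| "#{key}=#{value}" }.sort.join("&")
  digest = Digest::SHA256.hexdigest("#{client_id}|#{canonical}")
  "/api/article/selection/#{digest[0, 16]}"
end

# Fixes a random order the first time the resource is requested,
# then serves ordinary offset/limit pages from that fixed order.
def page(path, article_ids, offset, limit)
  ordering = (ORDERINGS[path] ||= article_ids.shuffle)
  ordering[offset, limit] || []
end

ids = (1..10).map { |n| "guid-#{n}" }
path = selection_path("client-42", { popular: true, recent: true })
same = selection_path("client-42", { recent: true, popular: true })
other = selection_path("client-7", { popular: true, recent: true })

first_page = page(path, ids, 0, 4)
second_page = page(path, ids, 4, 4)
```

Because the ordering is fixed per resource, consecutive pages never overlap, so the client does not need to send up the GUIDs it already holds.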

Cache Management with Numerous Similar Database Queries

I'm trying to introduce caching into an existing server application because the database is starting to become overloaded.
Like many server applications we have the concept of a data layer. This data layer has many different methods that return domain model objects. For example, we have an employee data access object with methods like:
findEmployeesForAccount(long accountId)
findEmployeesWorkingInDepartment(long accountId, long departmentId)
findEmployeesBySearch(long accountId, String search)
Each method queries the database and returns a list of Employee domain objects.
Obviously, we want to try and cache as much as possible to limit the number of queries hitting the database, but how would we go about doing that?
I see a couple possible solutions:
1) We create a cache for each method call. E.g. for findEmployeesForAccount we would add an entry with a key account-employees-accountId. For findEmployeesWorkingInDepartment we could add an entry with a key department-employees-accountId-departmentId and so on. The problem I see with this is when we add a new employee into the system, we need to ensure that we add it to every list where appropriate, which seems hard to maintain and bug-prone.
2) We create a more generic query for findEmployeesForAccount (with more joins and/or queries because more information will be required). For other methods, we use findEmployeesForAccount and remove entries from the list that don't fit the specified criteria.
I'm new to caching so I'm wondering what strategies people use to handle situations like this? Any advice and/or resources on this type of stuff would be greatly appreciated.

I've been struggling with the same question myself for a few weeks now... so consider this a half-answer at best. One bit of advice that has been working out well for me is to use the Decorator Pattern to implement the cache layer. For example, here is an article detailing this in C#:
http://stevesmithblog.com/blog/building-a-cachedrepository-via-strategy-pattern/
This allows you to literally "wrap" your existing data access methods without touching them. It also makes it easy to swap out the cached version of your DAL for the direct access version at runtime (which can be useful for unit testing).
I'm still struggling to manage my cache keys, which seem to spiral out of control when there are numerous parameters involved. Inevitably, something ends up not being properly cleared from the cache and I have to resort to heavy-handed ClearAll() approaches that just wipe out everything. If you find a solution for cache key management, I would be interested, but I hope the decorator pattern layer approach is helpful.
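For illustration, here is a minimal Ruby sketch of the decorator idea applied to the employee methods from the question. Grouping cache entries per account, so that a write drops one bucket instead of patching every list, is just one possible answer to the key management problem and is an assumption of this sketch. EmployeeDao, CachedEmployeeDao, add_employee and the row layout are hypothetical.

```ruby
# Hypothetical direct-access DAO; a real one would query the database.
class EmployeeDao
  attr_reader :queries

  def initialize(rows)
    @rows = rows
    @queries = 0
  end

  def find_employees_for_account(account_id)
    @queries += 1
    @rows.select { |row| row[:account_id] == account_id }
  end

  def find_employees_working_in_department(account_id, department_id)
    @queries += 1
    @rows.select do |row|
      row[:account_id] == account_id && row[:department_id] == department_id
    end
  end

  def add_employee(row)
    @rows << row
  end
end

# Decorator: same interface, with a cache in front of the wrapped DAO.
class CachedEmployeeDao
  def initialize(inner)
    @inner = inner
    # One bucket of cached lists per account id.
    @cache = Hash.new { |hash, account_id| hash[account_id] = {} }
  end

  def find_employees_for_account(account_id)
    fetch(account_id, [:account]) do
      @inner.find_employees_for_account(account_id)
    end
  end

  def find_employees_working_in_department(account_id, department_id)
    fetch(account_id, [:department, department_id]) do
      @inner.find_employees_working_in_department(account_id, department_id)
    end
  end

  # A write drops every cached list for that account in one step.
  def add_employee(row)
    @inner.add_employee(row)
    @cache.delete(row[:account_id])
  end

  private

  # Returns the cached list for the key, or runs the block and stores it.
  def fetch(account_id, key)
    bucket = @cache[account_id]
    bucket.fetch(key) { bucket[key] = yield }
  end
end

dao = EmployeeDao.new([{ id: 1, account_id: 1, department_id: 5 }])
cached = CachedEmployeeDao.new(dao)
```

Calling code can receive either dao or cached, since both respond to the same methods. The cost of this scheme is that one new employee discards all cached lists for that account, which trades some hit rate for simpler invalidation.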
