MongoDB design (embed vs references) - performance

I've read a lot of documents, Q&As, etc. about this topic (embedding vs. using references).
I understand the arguments for each approach, but I haven't seen anyone discuss a case like this:
I have two entities, A and B, and the relation between them is ONE_TO_MANY (A can belong to many B). I could embed A (the denormalization approach) and that part is clear to me. But what if, later, I want to modify a field of an A document that is already embedded in many B documents? Modifying it does not mean replacing A with A'; it means changing something in that exact A record. In the embedded case, that means I have to apply the change in every B document that already contains that version of A.
Based on the description here: http://docs.mongodb.org/manual/tutorial/model-embedded-one-to-many-relationships-between-documents/#data-modeling-example-one-to-many
What if, later, we would like to change the address:name field that is used in many documents?
What if we need the list of available addresses in the system?
How fast will those operations be in MongoDB?

It depends on which operations you use most. If you are inserting and selecting a lot of documents, and there is only a chance that, say, once a month you will need to modify many nested sub-documents, then storing A inside B is good practice; that is what MongoDB is designed for. You save a lot of time by selecting a single document without having to join in other ones, and you can live with the occasional slower update.

How fast the update operations will be obviously depends on the volume of data.
Another consideration when choosing between embedded docs and references is whether the data in a single document could exceed the 16 MB document limit. That's a lot of embedded documents, mind.
In some cases, however, it simply doesn't make sense to denormalise entire documents, especially where they're used/referenced elsewhere.
Take a User document, for example: you wouldn't usually denormalise all user attributes across each collection that needs to reference a user. Instead you reference the user [with maybe some denormalised user detail].
Obviously each additional denormalised value (unless it was an audit) would need to be updated when the referenced User changes, but you could queue the updates for a background process to deal with, rather than making the caller wait.
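To make that concrete, here is a minimal sketch in the mongo shell (3.2+ API) of the reference-plus-denormalised-detail shape and the deferred fan-out update; the comments collection, the user sub-document and the field names are assumptions for illustration, not something prescribed by this answer.

// Reference the user, denormalising only the small detail you actually display.
var userId = ObjectId();
db.comments.insertOne({ text: "Nice post!", user: { _id: userId, name: "alice" } });

// When the referenced User changes, a background job can fan the denormalised
// copies out, rather than making the caller wait for every comment to update.
db.comments.updateMany({ "user._id": userId }, { $set: { "user.name": "alice2" } });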

I'll throw in some more advice as to speed.
If you have a sub-document called A that is embedded in lots of documents - and you want to change instances of A ...
Be careful that the documents don't grow too much with such a change. If updating A makes the containing document outgrow the space allocated for it, MongoDB has to move the document, and that hurts performance.
It obviously depends on how many embedded instances you have. The more you have, the slower it will be.
It depends on how you match the sub-document. If you are finding A without an index, it's going to be slow. If you are using range operators to identify it, it will be slow.
Someone already mentioned the size of documents will most likely affect the speed.
The best advice I heard about whether to link or embed was this ... if the entity (A in this case) is mutable ... if it is going to mutate/change often ... then link it, don't embed it.
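To put rough shape on the original question (changing an embedded address name everywhere, and listing the addresses in the system), a mongo shell sketch could look like the following; the patrons collection and the name field are assumptions loosely based on the linked docs page, and the arrayFilters syntax needs MongoDB 3.6 or newer.

// Rewrite one embedded address name wherever it occurs.
db.patrons.updateMany(
  { "addresses.name": "Home" },
  { $set: { "addresses.$[a].name": "Primary" } },
  { arrayFilters: [ { "a.name": "Home" } ] }
);

// List the distinct address names currently stored in the system.
db.patrons.distinct("addresses.name");

Both operations touch every document holding an embedded copy, which is exactly the cost the answers above are warning about.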

Related

slow-loading persistent store coordinator in core data

I have been developing a Cocoa app with Core Data. Initially everything seemed fine, but as I added data to the application, I found that the initial data window took ages to load. To fix that, I moved to another startup window that didn't have the data, so start-up was snappy. However, no matter what I do, my first fetch AND my first attempt to load a data window (with tables views) are always slow. (That is, if I fetch slowly and then ask for the data window, both will be slow the first time around.) After that, performance is acceptable.
I traced through my application and found that while I can quickly step through the program, no matter what, the step that retrieves the persistent store coordinator is incredibly slow ... 15 - 20 seconds can elapse with a spinning beach ball.
I've read elsewhere that I might want to denormalize the data. I don't think that will be sufficient. An earlier version was far less "interconnected" between the entities, and it still was a slug at startup. Now I'm looking at entities that may have as high as 18,000 managed objects. Some of the relations are essential to having the data work correctly.
I've also read about the option of employing a separate managed object context in the background. The problem with this is that even this background context would take too long to be usable. If the user tries to run a search, he or she will still be waiting forever for that context to load. I might buy myself a few seconds while the user decides what to type in to the search field, but I can't afford to stall for 25 seconds.
I noticed that once data is imported into the persistent store, even searches on a table that is not related to others (and only has 1000 objects) still takes ages to load. The reason seems to be that it's the coordinator retrieval itself that's slow, not the actual fetch or the context.
Can anyone point me in the right direction on how to resolve this? Thanks!
Before you create your data model:
If you’re storing large objects such as photos, audio or video, you need to be very careful with your model design.
The key point to remember is that when you bring a managed object into a context, you’re bringing all of its data into memory.
If large photos are within managed objects cut from the same entity that drives a table-view, performance will suffer. Even if you’re using a fetched results controller, you could still be loading over a dozen high-resolution images at once, which isn’t going to be instant.
To get around this issue, attributes that will hold large objects should be split off into a related entity. This way the large objects can remain in the persistent store and can be represented by a fault instead, until they really are needed.
If you need to display photos in a table view, you should use auto-generated thumbnail images instead.
Read the whole article
You might be getting ahead of yourself in thinking the persistent store coordinator (PSC) is the culprit.
There is more going on behind the scenes with Core Data than is readily obvious -- the PSC is very flexible and must be directed.
A realistic approach for the data size you specified (18K objects) is to focus on modularizing the logic of your fetch request templates and validation for specific size cases (think small, medium, large, extra large, etc.).
The suggestion to denormalize your data does not take into account the overhead of getting your data into a fully denormalized state, plus a (sometimes) unintended side effect of denormalization is sparsity (unless you have a very specific model, of course).
Since you usually do not know beforehand what data will be accessed and modified, make a one-to-many relationship between your central task and any subtasks. This will free up some constraints on your data access.
You can always give your end users the option to choose how they want to handle the larger datasets.

mongoid embedded 1-n vs referenced 1-n document

I'm designing an API and I am wondering which query is more efficient?
the embedded 1-n association:
profile = Profile.where('authenticated_devices.device_token'=> device_token, 'authenticated_devices.access_token'=> access_token).first
or the referenced 1-n association:
device = AuthenticatedDevice.where(device_token: device_token, access_token: access_token).first
profile = device.profile
profile.authenticated_device = device
I've done explains: in the referenced case it uses a BtreeCursor, while in the embedded case it uses a BasicCursor. Could adding indexes make the embedded 1-n query faster than the referenced 1-n? Also, what are the pitfalls of this query? If I want absolute speed for my API, is it better to use the embedded 1-n or the referenced 1-n? Let's imagine this API is also under heavy load.
Update:
I was asked: "The real decision between referencing and embedding depends on the amount of 'related' data and how you intend to access it."
Answer: This is a very simple API. The API loads the current_user's Profile, which has all of their info; basically, with every single API call I'll pass a device id and access token. This is the info embedded in the Profile model, which I call the Authenticated Device. It will be under heavy load. I'm trying to determine whether I should stick with the embedded 1-n and add indexes, or move over to the referenced 1-n. So speed is my highest concern. Does this help?
The bottom line here is that for anything other than an embedded document you must make more than one query to the database. So in the referenced scenario you are finding the "child" in one request and then accessing the parent in another request. The code may hide this a bit, but it is actually two round trips to the database.
So the embedded model will be faster. What is probably confusing you at the moment is the lack of an index on the authenticated_devices.device_token field within your Profile model and collection. With that index in place, these look-ups are optimal.
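As a sketch (not taken from the answer), that index could be created roughly like this in the mongo shell, assuming Mongoid keeps Profile documents in a profiles collection:

// Compound index on the embedded device fields, so the embedded 1-n look-up
// can walk a BtreeCursor instead of scanning with a BasicCursor.
db.profiles.createIndex({
  "authenticated_devices.device_token": 1,
  "authenticated_devices.access_token": 1
});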
It is true that another consideration here is the cost of pulling the document that contains all of the "devices" in the embedded collection, but as long as that information is reasonably light it should still incur less overhead than an additional trip to the database.
As a final point, if the information you access from Profile is actually very light, then even though it might go against your sensibilities, the fastest possible way may well be to just replicate that information "per device" and serve it in a single request, rather than referencing another document with another request.
So look at your usage patterns and consider the size of the data, but generally as long as you have indexes in place for your usage patterns there should be nothing wrong with the embedded model.

How to deactivate safe mode in the mongo shell?

The short question is in the title: I work with the mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
Long question, for those who want the context:
I am working on a huge set of data like
{
_id:ObjectId("azertyuiopqsdfghjkl"),
stringdate:"2008-03-08 06:36:00"
}
plus some other fields. There are about 250M documents like that (the whole database with its indexes weighs 36 GB). I want to convert the date into a real ISODate field. I searched a bit for how to write an update query like
db.data.update({},{$set:{date:new Date("$stringdate")}},{multi:true})
but did not find how to make this work, and resigned myself to writing a script that takes the documents one after the other and runs an update to set a new field whose value is new Date(stringdate). The query uses the _id, so the default index is used.
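For context, that per-document script might look roughly like this in the mongo shell; only the data collection name comes from the update attempt above, the rest is assumed.

// One-document-at-a-time conversion, as described above.
db.data.find({ date: { $exists: false } }).forEach(function (doc) {
    db.data.update(
        { _id: doc._id },                              // hits the default _id index
        { $set: { date: new Date(doc.stringdate) } }   // build the real date client-side
    );
});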
The problem is that it takes a very long time. I have already figured out that if only I had inserted empty date fields when I created the database, I would now get better performance, since adding a new field causes documents to be relocated. I also set an index on a relevant field so I can process the database chunk by chunk. Finally, I ran several concurrent mongo clients on both the server and my workstation to ensure that the limiting factor is database lock availability and not some other factor like CPU or network cost.
I monitored the whole thing with mongotop, mongostat and the web monitoring interface, which confirmed that the write lock is taken 70% of the time. I am a bit disappointed that MongoDB does not have finer-grained write locking; why not allow concurrent write operations on the same collection as long as there is no risk of interference? Now that I think about it, I should have split the collection across a dozen shards, even while staying on the same server, because there would then be an individual lock on each shard.
But since I can't change the current database structure right now, I looked for ways to improve performance so that I spend at least 90% of my time writing to mongo (up from 70% currently). I figured out that since I run my script in the default mongo shell, every update is followed by a getLastError() call, which I don't want: there is a 99.99% chance of success, and even in case of failure I can run an aggregation query after the big process finishes to retrieve the few exceptions.
I don't think I will gain that much performance by deactivating the getLastError calls, but I think it is worth trying.
I took a look at the documentation and found confirmation of the default behaviour, but not the procedure for changing it. Any suggestions?
I work with the mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
You can use db.getLastError({w:0}) ( http://docs.mongodb.org/manual/reference/method/db.getLastError/ ) to do what you want but it won't help.
This is because, for one:
make a script that takes the documents one after the other and runs an update to set a new field whose value is new Date(stringdate).
When the shell is used non-interactively, for example inside a loop like this, it doesn't actually call getLastError() after each operation. As such, lowering your write concern to 0 will do nothing.
I already figured out that if only I had inserted empty date fields when I created the database I would now get better performance since adding a new field causes documents to be relocated.
I did tell people, when they asked about this stuff, to add those fields in case of movement, but instead they listened to the guy who said "leave them out! They use space!".
I shouldn't feel smug, but I do. That's an unfortunate side effect of being right when you were told you were wrong.
mongostat and the web monitoring interface, which confirmed that the write lock is taken 70% of the time
That's because of all the movement in your documents; kinda hard to fix that.
I am a bit disappointed that MongoDB does not have finer-grained write locking
The write lock doesn't, by itself, determine MongoDB's concurrency; this is another common misconception carried over from transactional SQL technologies.
For one thing, write locks in MongoDB are mutexes.
Not only that, but there are numerous rules under which a running operation will yield to queued operations in certain circumstances, one being how many operations are waiting, another being whether the data is in RAM or not, and more.
Unfortunately, I believe you have got yourself stuck between a rock and a hard place, and there is no easy way out. This does happen.

efficient serverside autocomplete

First of all, I know:
Premature optimization is the root of all evil
But I think a badly implemented autocomplete can really blow up your site.
I would like to know if there are any libraries out there which can do autocomplete efficiently (server-side), preferably fitting into RAM for best performance. So no browser-side JavaScript autocomplete (YUI/jQuery/Dojo); there are enough topics about that on Stack Overflow. But I could not find a good thread about server-side autocomplete (maybe I did not look hard enough).
For example, autocompleting names:
names:[alfred, miathe, .., ..]
What I can think of:
A simple SQL LIKE, for example: SELECT name FROM users WHERE name LIKE 'al%'.
I think this implementation will blow up with a lot of simultaneous users or a large data set, but maybe I am wrong, so numbers (of what it can handle) would be cool.
Something like Solr's terms component, for example: http://localhost:8983/solr/terms?terms.fl=name&terms.sort=index&terms.prefix=al&wt=json&omitHeader=true.
I don't know the performance of this, so users with big sites, please tell me.
Maybe something like an in-memory Redis trie, whose performance I also haven't tested.
I also read in this thread how to implement this in Java (Lucene and a library created by shilad).
What I would like to hear about are implementations used by real sites and numbers for how well they handle load, preferably with:
A link to the implementation or code.
Numbers to which you know it can scale.
It would be nice if it could be accessed over HTTP or sockets.
Many thanks,
Alfred
Optimising for Auto-complete
Unfortunately, the resolution of this issue will depend heavily on the data you are hoping to query.
LIKE queries will not put too much strain on your database, as long as you spend time using 'EXPLAIN' or the profiler to show you how the query optimiser plans to perform your query.
Some basics to keep in mind:
Indexes: Ensure that you have indexes set up. (Yes, in many cases LIKE does use indexes. There is an excellent article on the topic at myitforum: SQL Performance - Indexes and the LIKE clause.)
Joins: Ensure your JOINs are in place and are optimized by the query planner. SQL Server Profiler can help with this. Look out for full index or full table scans.
Auto-complete sub-sets
Auto-complete queries are a special case, in that they usually work as ever-decreasing subsets.
'name' LIKE 'a%' (may return 10000 records)
'name' LIKE 'al%' (may return 500 records)
'name' LIKE 'ala%' (may return 75 records)
'name' LIKE 'alan%' (may return 20 records)
If you return the entire result set for query 1, then there is no need to hit the database again for the following result sets, as they are subsets of your original query.
Depending on your data, this may open a further opportunity for optimisation.
I will not fully comply with your requirements, and obviously the scaling numbers will depend on hardware, the size of the DB, the architecture of the app, and several other factors. You must test it yourself.
But I will tell you the method I've used with success:
Use simple SQL, for example: SELECT name FROM users WHERE name LIKE 'al%', but add TOP 100 to limit the number of results.
Cache the results and maintain a list of the terms that are cached.
When a new request comes in, first check the list to see whether you have the term (or a prefix of the term) cached.
Keep in mind that your cached results are limited (TOP 100), so sometimes you will still need to run the SQL query: a cached result set that was cut off at the limit cannot safely be reused for a longer term (there is a small sketch of this below).
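A small sketch of the idea in plain JavaScript (written for this answer rather than taken from it; the names and the 100-row limit are assumptions):

var cache = {};   // term -> cached names for that term (each query limited to 100 rows)

function complete(term, queryDb) {
    for (var cachedTerm in cache) {
        var names = cache[cachedTerm];
        // A cached set can only be reused if it was NOT cut off at the limit;
        // otherwise rows matching the longer term may have been dropped.
        if (term.indexOf(cachedTerm) === 0 && names.length < 100) {
            return names.filter(function (n) { return n.indexOf(term) === 0; });
        }
    }
    var fresh = queryDb(term);   // e.g. SELECT TOP 100 name FROM users WHERE name LIKE term + '%'
    cache[term] = fresh;
    return fresh;
}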
Hope it helps.
Using SQL versus Solr's terms component is really not a comparison. At their core they solve the problem the same way by making an index and then making simple calls to it.
What I would want to know is what you are trying to autocomplete.
Ultimately, the easiest and most surefire way to scale a system is to make a simple solution and then scale it by replicating data. Trying to cache calls or predict results just makes things complicated and doesn't get to the root of the problem (i.e. you can only take them so far; what happens when every request misses the cache?).
Perhaps a little more info about how your data is structured and how you want to see it extracted would be helpful.

Does soCaseInsensitive greatly impact performance for a TdxMemIndex on a TdxMemDataset?

I am adding some indexes to my DevExpress TdxMemDataset to improve performance. TdxMemIndex has SortOptions, which include soCaseInsensitive. My data is usually a GUID string, so it is not case sensitive. I am wondering whether I am better off just forcing all the data to the same case, or whether using the soCaseInsensitive sort option (and the loCaseInsensitive flag in calls to Locate) carries only a minor performance penalty (roughly equal to converting the case of my string every time I need to use the index).
At this point I am leaving soCaseInsensitive off and just converting case.
IMHO, the best approach is to ensure data quality at Post time. My reasoning:
You (usually) know the nature of the data. So, e.g., you can use UpperCase (knowing that GUIDs are all in the ASCII range) instead of the much slower AnsiUpperCase, which a general component like TdxMemDataSet is forced to use.
You enter the data only once, whereas searching/sorting/filtering, which all involve the internal uppercasing engine of TdxMemDataSet, are repeated actions. Also, there are other chained actions which will trigger this engine without you realizing it. (E.g. a TcxGrid which is sorted by default, has GridMode := True (I assume that you use the DevExpress components), and has a class acting like a broker passing the sort message to the underlying dataset.)
Usually data entry is done in steps, one or a few records per batch. The only notable exception is data acquisition applications. But in both cases users tolerate much greater response times, which gives you room to play with. (IOW, how much would an UpperCase call add to a record Post that takes 0.005 ms?) OTOH, users are very demanding about the speed of data retrieval operations (searching, sorting, filtering, etc.). Keep data retrieval as fast as you can.
Having the data in the database already in the form you want to expose reduces the risk of processing errors when (if) you write other modules; otherwise you need to remember to AnsiUpperCase the data in every module, in every language, you write. A classic example is when you use external tools to access the data (e.g. a DB manager executing an SQL SELECT over the data).
hth.
Maybe the DevExpress forums (or even a support email, if you have access to it) would be a better place to seek an authoritative answer on that performance question.
Anyway, it is better to guarantee that data is in the format you want - for the reasons plainth already explained - the moment you save it. So, in this specific case, make sure the GUID is written in upper (or lower, it's a matter of taste) case. If it is SQL Server or another database server that has a GUID datatype, make sure the SELECT does the work - if applicable and possible, even the sort.
