MongoDB embedded vs. reference from performance perspective - performance

I read that embedding is better from a performance point of view:
"If performance is an issue, embed." (http://www.mongodb.org/display/DOCS/Schema+Design), and most guides say that a "contains" relationship should be embedded.
However I am not sure this is the case. Suppose we have two objects: Blog and Post. Blog contains posts.
Now making all posts embedded in blog will have the following issues:
Paging. Since it's not possible to filter embedded objects, we will always get all posts and need to filter them out in the application.
Filtering. Same as before, when searching for word inside posts, it will not be possible to filter the embedded collection from MongoDB.
Insert. I assume inserting into a collection is faster than inserting into an embedded object. Is this correct? Is this documented anywhere?
Update. Same as before: updating a field in place inside a smaller document (Post) might be faster than updating the post in place inside the big Blog document. Is this correct?
Taking all of the above, I would go for having posts in a separate collection referencing Blog. Is this the correct conclusion?
(Note: Please do not factor document size limit in the response, let's assume each blog will have at most 1000 posts)

1. Paging is possible with the $slice operator:
db.blogs.find({}, {posts:{$slice: [10, 10]}}) // skip 10, limit 10
2. Filtering is also possible:
db.blogs.find({"posts.title":"Mongodb!"}, {"posts.$": 1}) // return the first post that matched the filter
3, 4. I guess you are generally talking about a small performance difference. It's not rocket science; it's just a blog with at most 1000 posts.
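The skip/limit arithmetic behind the $slice call in point 1 can be wrapped in a small helper. This is only a minimal sketch: the function name postsPageProjection and the 1-based page numbering are assumptions made for illustration, not something from the answer.

```javascript
// Build a projection that returns one page of the embedded "posts" array.
// pageNumber is 1-based; pageSize is the number of posts per page.
function postsPageProjection(pageNumber, pageSize) {
  const skip = (pageNumber - 1) * pageSize;
  return { posts: { $slice: [skip, pageSize] } };
}

// Page 2 with 10 posts per page: skip 10, limit 10.
const projection = postsPageProjection(2, 10);
console.log(JSON.stringify(projection)); // {"posts":{"$slice":[10,10]}}

// In the mongo shell this would be passed as the second argument:
//   db.blogs.find({}, projection)
```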
You said:
Is this the correct conclusion?
No, not if you care about performance (in general, if the system will be small, you can go with a separate collection).
I've done a small performance test regarding 3 and 4; here are the results:
---------------------------------------------------------------
| Post count | Inserting posts | Adding to nested collection |
|------------|-----------------|-----------------------------|
| 1          | 1 ms            | 28 ms                       |
| 1000       | 81 ms           | 590 ms                      |
| 10000      | 759 ms          | 2723 ms                     |
---------------------------------------------------------------

As for 3 & 4, if you are inserting into a nested document, it is basically an update.
This can be terribly bad for your performance because inserts are generally appended to the end of the data which works fine and fast. Updates, on the other hand, can be much trickier.
If your update does not change the size of a document (meaning that you had a key/value pair and simply changed the value to a new one that takes up the same amount of space), then you will be OK. But when you start modifying documents and adding new data, a problem arises.
The problem is that while MongoDB allots more space than it needs for each document, it may not be enough. If you insert a document that is 1k large, MongoDB may allot 1.5k for the document to ensure that minor changes to the document have enough space to grow. If you use more than the allocated space, MongoDB has to fetch the entire document and re-write it at the tail end of the data.
There is obviously a performance implication in fetching and re-writing the data which will be amplified by the frequency of such an operation. To make matters worse, when this happens you end up leaving holes or pockets of unused space in your data files.
This ultimately gets copied into memory which means that you may end up using 2GB of RAM to store your data set, while in reality the data itself only takes up 1.5GB because there are .5GB worth of pockets. This fragmentation can be avoided by doing inserts as opposed to updates. It can also be fixed by doing a database repair.
In the next version of MongoDB there will be an online compaction function.

You can page with '$slice' on an embedded element.
You can search with "field1.field2": /aRegex/, where aRegex is the word you are searching for. But watch the performance.
About 3 and 4, I have no data to prove it either way.
BTW, 2 collections can be easier to code/use/manage. You can simply store the blogId in each 'post' document and add "blogId":"1234ABCD" to all your queries.

Related

Performance issue while using microstream

I just started learning microstream. After going through the examples published to microstream github repository, I wanted to test its performance with an application that deals with more data.
Application source code is available here.
Instructions to run the application and the problems I faced are available here
To summarize, below are my observations
While loading a file with 2.8+ million records, processing takes 5 minutes
While calculating statistics based on loaded data, application fails with an OutOfMemoryError
Why is microstream trying to load all data (4 GB) into memory? Am I doing something wrong?
MicroStream is not like a traditional database; it starts from the concept that all data is in memory. An object graph is written to disk (or other media) when you store it through the StorageManager.
In your case, all data is in one list, so accessing this list reads all records from disk. The Lazy reference isn't useful the way you have used it, since it only guards access to the one list holding all the data.
Some optimizations that you can introduce:
Split the data based on vendorId or day, using a Map<String, Lazy<List>>.
When a Map value has been processed, remove it from memory again by clearing the lazy reference. https://docs.microstream.one/manual/5.0/storage/loading-data/lazy-loading/clearing-lazy-references.html
Increase the number of channels to optimize reading and writing the data. See https://docs.microstream.one/manual/5.0/storage/configuration/using-channels.html
Don't store the object graph every 10000 lines; store it just once at the end of loading.
Hope this helps you solve the issues you have at the moment

Server-side pagination in mean stack application

I am working on a MEAN stack application with Angular 4. I am worried about server-side pagination; I can't understand the logic. Please help me!
My guess is you are using MongoDB since you mentioned MEAN Stack. For implementing pagination, you can use the find, limit and skip functions.
Example: (pagesize is 10 records)
// Page 1
db.document.find().limit(10);
// Page 2
db.document.find().skip(10).limit(10);
// Page 3
db.document.find().skip(20).limit(10);
This is native to MongoDB. However, this approach has a drawback, as the MongoDB manual states:
The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return results. As the offset (e.g. pageNumber above) increases, cursor.skip() will become slower and more CPU intensive. With larger collections, cursor.skip() may become IO bound.
You can also page on any indexed field (preferably the _id field) to avoid this: instead of skipping, filter for values greater than the last one seen on the previous page.
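Range-based paging on _id could look roughly like the sketch below. The helper name pageQuery and the idea of passing the last _id of the previous page are assumptions for illustration; the collection name document is taken from the example above.

```javascript
// Build the pieces of a range-based page query.
// lastId is the _id of the final document on the previous page,
// or null when requesting the first page.
function pageQuery(lastId, pageSize) {
  return {
    filter: lastId === null ? {} : { _id: { $gt: lastId } },
    sort: { _id: 1 }, // a stable order is required for this to work
    limit: pageSize,
  };
}

// In the mongo shell the result would be used roughly like this:
//   var q = pageQuery(lastId, 10);
//   db.document.find(q.filter).sort(q.sort).limit(q.limit);

const firstPage = pageQuery(null, 10);
const nextPage = pageQuery(42, 10);
console.log(JSON.stringify(firstPage));
console.log(JSON.stringify(nextPage));
```

Because the filter uses the index on _id to jump straight to the starting point, the server does not have to walk over the skipped documents. The trade-off is that the client can only move page by page rather than jump to an arbitrary page number.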

XPages performance - 2 apps on same server, 1 runs and 1 doesn't

We have been having a bit of a nightmare this last week with a business critical XPage application, all of a sudden it has started crawling really badly, to the point where I have to reboot the server daily and even then some pages can take 30 seconds to open.
The server has 12GB RAM, and 2 CPUs, I am waiting for another 2 to be added to see if this helps.
The database has around 100,000 documents in it, with no more than 50,000 displayed in any one view.
The same database, set up as a training application with far fewer documents on the same server, always responds, even when the main copy is crawling.
There are a number of view panels in this application - I have read these are really slow. Should I get rid of them and replace with a Repeat control?
There are also Readers fields on the documents containing roles, and Authors fields, as it's a workflow application.
I removed quite a few unnecessary views from the back end over the weekend to help speed it up but that has done very little.
Any ideas where I can check to see what's causing this massive performance hit? It's only really become unworkable in the last week but as far as I know nothing in the design has changed, apart from me deleting some old views.
Try to get more info about state of your server and application.
Hardware troubleshooting is summarized here: http://www-10.lotus.com/ldd/dominowiki.nsf/dx/Domino_Server_performance_troubleshooting_best_practices
From what you describe, only one of the two applications is slowed down, so it is more likely a code problem. The best thing is to profile your code: http://www.openntf.org/main.nsf/blog.xsp?permaLink=NHEF-84X8MU
To go deeper, you can start looking for semaphore locks: http://www-01.ibm.com/support/docview.wss?uid=swg21094630, or look at javadumps: http://lazynotesguy.net/blog/2013/10/04/peeking-inside-jvms-heap-part-2-usage/ and NSDs: http://www-10.lotus.com/ldd/dominowiki.nsf/dx/Using_NSD_A_Practical_Guide/$file/HND202%20-%20LAB.pdf, and at the garbage collector settings (see "Best setting for HTTPJVMMaxHeapSize in Domino 8.5.3 64 Bit").
This presentation gives a good overview of Domino troubleshooting (among many others on the web).
Ok so we resolved the performance issues by doing a number of things. I'll list the changes we did in order of the improvement gained, starting with the simple tweaks that weren't really noticeable.
Defrag Domino drive - it was showing as 32% fragmented and I thought I was on to a winner, but it was really no better after the defrag, even though IBM docs say even 1% fragmentation can cause performance issues.
Reviewed all the main code in the application and took out a number of needless lookups where they could be replaced with applicationScope variables. For instance, on the search page, one of the drop-downs got its choices by doing an @Unique lookup on all documents in the database. Changed it to a keyword and put that in the applicationScope.
Removed multiple checks on database.queryAccessRole and put the user's roles in a sessionScope.
DB had 103,000 documents - 70,000 of them were tiny little docs with about 5 fields on them. They don't need to be indexed by the FTIndex, so we moved them into a separate database and pointed the data source to that DB when these docs were needed. The FTIndex went from 500 MB to 200 MB = faster indexing and searches, but the overall performance of the app was still rubbish.
The big one - I finally got around to checking the application properties, advanced tab. I set the following options :
Optimize document table map (ran copystyle compact)
Don't overwrite free space
Don't support specialized response hierarchy
Use LZ1 compression (ran copystyle compact with options to change existing attachments -ZU)
Don't allow headline monitoring
Limit entries in $UpdatedBy and $Revisions to 10 (as per Domino documentation)
And also don't allow the use of stored forms.
Now I don't know which one of these options was the biggest gain, and not all of them will be applicable to your own apps, but after doing this the application flies! It's running like there are no documents in there at all, views load super fast, documents open like they should - quickly and everyone is happy.
Until the http threads get locked out - that's another question of mine that I am about to post, so please take a look if you have any idea of what's going on :-)
Thanks to all who have suggested things to try.

Mongo Db design (embed vs references)

I've read a lot of documents, Q&As, etc. about this topic (embedding vs. using references).
I understand the reasons for using one approach or the other, but I can't find anyone discussing (or asking about) a case like this:
I have two entities, A and B, in a ONE_TO_MANY relation (one A can belong to many B documents). I can embed A (the denormalization approach), and that's fine, I understand it clearly. But what if I later want to modify a field of an A document that is used in many B documents? By modify I don't mean replacing A with A'; I mean changing the A record itself. In the embedded case, that means I have to apply the change to every B document that already holds a copy of A.
Based on the description here: http://docs.mongodb.org/manual/tutorial/model-embedded-one-to-many-relationships-between-documents/#data-modeling-example-one-to-many
What if we later want to change the address:name field that is used in many documents?
What if we need a list of the addresses available in the system?
How fast will those operations be in MongoDB?
It depends on which operations are used most. If you are inserting and selecting a lot of documents and there is a chance that, say, once a month you will need to modify many nested sub-documents, I think storing A inside B is good practice; it's what MongoDB is meant for. You will save a lot of time by selecting just one document without needing to join others, and you can live with a slower update once in a while.
How fast the update operations will be obviously depends on the volume of data.
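As a rough sketch of what such an occasional bulk update might look like, the helper below builds the filter and $set arguments for changing fields of an embedded A in every B document that holds a copy of it. The collection name b, the embedded field name a, and the use of a._id to identify the copy are all hypothetical; adapt them to your own schema.

```javascript
// Build the arguments for updating an embedded sub-document "a"
// in every document that contains the copy identified by aId.
// "changes" maps field names inside "a" to their new values.
function embeddedUpdate(aId, changes) {
  const set = {};
  for (const [field, value] of Object.entries(changes)) {
    set["a." + field] = value; // dot notation targets the embedded field
  }
  return { filter: { "a._id": aId }, update: { $set: set } };
}

const u = embeddedUpdate(7, { name: "New name" });
console.log(JSON.stringify(u));

// In the mongo shell this would be applied to all matching B documents with:
//   db.b.updateMany(u.filter, u.update)
```

An index on the field used in the filter (here a._id) keeps the matching step from scanning the whole collection; the number of documents that actually have to be rewritten still grows with the number of embedded copies.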
Another consideration when choosing between embedded docs and references is whether the volume of data in a single document would exceed 16 MB. That's a lot of documents, mind.
In some cases however, it simply doesn't make sense to denormalise entire documents especially where they're used/referenced elsewhere.
Take a User document for example, you wouldn't usually denormalise all user attributes across each collection that needs to reference a user. Instead you reference the user [with maybe some denormalised user detail].
Obviously each additional denormalised value (unless it was an audit) would need to be updated when the referenced User changes, but you could queue the updates for a background process to deal with - rather than making the caller wait.
I'll throw in some more advice as to speed.
If you have a sub-document called A that is embedded in lots of documents - and you want to change instances of A ...
Careful that the documents don't grow too much with a change. That will hurt performance if A grows too big because it will force Mongo to move the document in memory.
It obviously depends on how many embedded instances you have. The more you have, the slower it will be.
It depends on how you match the sub-document. If you are finding A without an index, it's going to be slow. If you are using range operators to identify it, it will be slow.
Someone already mentioned the size of documents will most likely affect the speed.
The best advice I heard about whether to link or embed was this ... if the entity (A in this case) is mutable ... if it is going to mutate/change often ... then link it, don't embed it.

Transferring lots of objects with Guid IDs to the client

I have a web app that uses Guids as the PK in the DB for an Employee object and an Association object.
One page in my app returns a large amount of data showing all Associations all Employees may be a part of.
So right now, I am sending to the client essentially a bunch of objects that look like:
{assocation_id: guid, employees: [guid1, guid2, ..., guidN]}
It turns out that many employees belong to many associations, so I am sending down the same Guids for those employees over and over again in these different objects. For example, it is possible that I am sending down 30,000 total guids across all associations in some cases, of which there are only 500 unique employees.
I am wondering if it is worth me building some kind of lookup index that I also send to the client like
{ 1: Guid1, 2: Guid2 ... }
and replacing all of the Guids in the objects I send down with those ints,
or if simply gzipping the response will compress it enough that this extra effort is not worth it?
Note: please don't get caught up in the details of if I should be sending down 30,000 pieces of data or not -- this is not my choice and there is nothing I can do about it (and I also can't change Guids to ints or longs in the DB).
You wrote the following at the end of your question:
Note: please don't get caught up in the details of if I should be
sending down 30,000 pieces of data or not -- this is not my choice and
there is nothing I can do about it (and I also can't change Guids to
ints or longs in the DB).
I think that is your main problem. If you don't solve it, you may be able to reduce the size of the transferred data by a factor of 10, for example, but the main problem remains. Let us think about the question: why should so much data be sent to the client (to the web browser)?
The data on the client side is needed to display some information to the user. No monitor is large enough to show 30,000 items on one page, and no user is able to grasp so much information. So I am sure that you display only a small part of the information. In that case, you should send only the small part that you display.
You don't describe how the GUIDs will be used on the client side. If you need the information during row editing, for example, you can transfer the data only when the user starts editing. In that case, you need to transfer the data for only one association.
If you need to display the GUIDs directly, then you can't display all the information at once, so you can send the information for one page only. If the user starts to scroll or clicks the "next page" button, you can send the next portion of data. In this way you can dramatically reduce the size of the transferred data.
If you have no possibility of redesigning that part of the application, you can implement your original suggestion: by replacing a GUID like "{7EDBB957-5255-4b83-A4C4-0DF664905735}" or "7EDBB95752554b83A4C40DF664905735" with a number like 123, you reduce the size of each GUID from 34 characters to 3. If you additionally send an array of "guid mapping" elements like
123:"7EDBB95752554b83A4C40DF664905735",
you can reduce the original data size of 30000*34 = 1020000 (1 MB) to 300*39 + 30000*3 = 11700+90000 = 101700 (100 KB), that is, 300 mapping entries of 39 characters each plus 30000 three-character numbers. So you can reduce the size of the data by a factor of 10. Compressing dynamic data on the web server can reduce the size further.
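A minimal sketch of the "guid mapping" idea, assuming each object carries its GUIDs in an employees array as in the question. The function names encodeWithLookup and decodeWithLookup and the sample ids are made up for illustration.

```javascript
// Replace every employee GUID with a small integer and return the
// integer-to-GUID lookup table alongside the rewritten objects.
function encodeWithLookup(associations) {
  const indexByGuid = new Map();
  const lookup = {};
  const encoded = associations.map((assoc) => ({
    ...assoc,
    employees: assoc.employees.map((guid) => {
      if (!indexByGuid.has(guid)) {
        const id = indexByGuid.size + 1; // ids start at 1
        indexByGuid.set(guid, id);
        lookup[id] = guid;
      }
      return indexByGuid.get(guid);
    }),
  }));
  return { lookup, associations: encoded };
}

// Client side: turn the integers back into GUIDs.
function decodeWithLookup(payload) {
  return payload.associations.map((assoc) => ({
    ...assoc,
    employees: assoc.employees.map((id) => payload.lookup[id]),
  }));
}

const sample = [
  { id: "A1", employees: ["7EDBB95752554b83A4C40DF664905735", "0F8FAD5BD9CB469FA16570867728950E"] },
  { id: "A2", employees: ["0F8FAD5BD9CB469FA16570867728950E"] },
];
const payload = encodeWithLookup(sample);
console.log(JSON.stringify(payload));
console.log(JSON.stringify(decodeWithLookup(payload)) === JSON.stringify(sample));
```

The saving depends entirely on how often each GUID repeats; with few repeats the lookup table costs more than it saves.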
In any case, you should examine why your page is so slow. If the program works in a LAN, then transferring even 1 MB of data can be quick enough. Probably the page is slow while placing the data on the web page. I mean the following: if you modify some element on the page, the position of all existing elements has to be recalculated. If you work with disconnected DOM objects first and then place the whole portion of data on the page at once, you can improve the performance dramatically. You didn't say in the question which technology you use in your web application, so I don't include any examples. If you use jQuery, for example, I could give an example that makes clearer what I mean.
The lookup index you propose is nothing else than a "custom" compression scheme. As amdmax stated, this will increase your performance if you have a lot of the same GUIDs, but so will gzip.
IMHO, the extra effort of writing the custom coding will not be worth it.
Oleg states correctly, that it might be worth fetching the data only when the user needs it. But this of course depends on your specific requirements.
if simply gzipping the response will compress it enough that this extra effort is not worth it?
The answer is: Yes, it will.
Compressing the data removes redundancy as well as possible (depending on the algorithm) until it is decompressed.
To be sure, just generate the data both uncompressed and compressed and compare the results. You can count the duplicate GUIDs to calculate how big your data block would be with the dictionary compression method. But I guess gzip will do better, because it can also compress the syntactic elements like braces, colons, etc. inside your data object.
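One way to run that comparison is sketched below with Node's built-in zlib module. The data is synthetic: fakeGuid is a made-up stand-in that produces 32 hex characters, and the 75 associations with 400 of 500 employees each are assumed numbers chosen only to reach the 30,000 GUIDs mentioned in the question. Real data will compress differently, so measure with your own payload.

```javascript
const zlib = require("zlib");

// Deterministic stand-in for a GUID: 32 hex characters derived from n
// (synthetic test data, not real GUIDs).
function fakeGuid(n) {
  let x = (n + 1) >>> 0;
  let hex = "";
  for (let i = 0; i < 4; i++) {
    x = (Math.imul(x, 1664525) + 1013904223) >>> 0; // one 32-bit LCG step
    hex += x.toString(16).padStart(8, "0");
  }
  return hex;
}

// 75 associations, each holding 400 of the 500 employees: 30,000 GUIDs in total.
const employees = Array.from({ length: 500 }, (_, e) => fakeGuid(e));
const data = [];
for (let a = 0; a < 75; a++) {
  data.push({
    id: fakeGuid(1000 + a),
    employees: employees.filter((_, e) => (e + 2 * a) % 5 !== 0),
  });
}

// Size of the response body with and without gzip.
const json = JSON.stringify(data);
const rawBytes = Buffer.byteLength(json);
const gzipBytes = zlib.gzipSync(json).length;

// Count total and unique GUIDs to estimate what a lookup table would save.
const totalGuids = data.reduce((sum, assoc) => sum + assoc.employees.length, 0);
const uniqueGuids = new Set(data.flatMap((assoc) => assoc.employees)).size;

console.log({ rawBytes, gzipBytes, totalGuids, uniqueGuids });
```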
So what you are trying to accomplish is Dictionary compression, right?
http://en.wikibooks.org/wiki/Data_Compression/Dictionary_compression
What you will get instead of GUIDs, which are 16 bytes long, is ints, which are 4 bytes long. And you will get a dictionary full of key-value pairs that associates each GUID with some int value, right?
It will decrease your transfer time when there are many objects using the same ID. But it will cost CPU time before the transfer to compress and after the transfer to decompress. So how much data do you transfer? Is it MB / GB / TB? And is there any good reason to compress it before sending?
I do not know how dynamic your data is, but I would:
on a first call, send two directories/dictionaries mapping short ids to long GUIDs, one for your associations and one for your employees, e.g. {1: AssoGUID1, 2: AssoGUID2,...} and {1: EmpGUID1, 2:EmpGUID2,...}. These directories may also contain additional information on the Association and Employee instances; I suspect you do not simply display GUIDs
on subsequent calls, just send the index of Employees per Association, { 1: [2,4,5], 3:[2,4], ...}, the key being the association short id and the array values being the short ids of the employees. Given your description, building the reverse index (Employee to Associations) may give a better result size-wise (but with more processing)
Then it's all down to associative array manipulation, which is straightforward in JS.
Again, if your data is (very) dynamic server side, the two directories will soon be obsolete and maintaining synchronization may cost you a lot.
I would start by answering the following questions:
What are the performance requirements? Are there size requirements? Speed requirements? What is the minimum performance that is truly needed?
What are the current performance metrics? How far are you from the requirements?
You characterized the data as possibly being mostly repeats. Is that the normal case? If not, what is?
The 2 options you listed above sound reasonable and trivial to implement. Try creating a look-up table and see what performance gains you get on actual queries. Try zipping the results (with look-ups and without), and see what gains you get.
In my experience if you're not TOO far from the goal, performance requirements are often trial and error.
If those options don't get you close to the requirements, I would take a step back and see if the requirements are reasonable in the time you have to solve the problem.
What you do next depends on which performance goals are not being met. If it is size, you're starting to be limited if you're required to send the entire association list every time. Is that truly a requirement? Can you send the entire list once, and then just updates?

Resources