MongoDB why use an embedded list instead of separate collection? - performance

Assuming we have a server with large enough RAM, why should we worry about extra querying required when we use separate collections instead of embedded list of objects? Since queries will be really fast, would it be worth it to store objects as embedded list?

There is a 16MB size limit for BSON documents in MongoDB. So, depending on the data model, embedding imposes an artificial limit on what can be stored.
Depending on your storage engine, frequently increasing the document size can cause the document to be moved within the data files, which is a rather costly operation you really want to avoid.
With complicated data models, queries tend to get more complicated, leading to problems, as you can often see here on SO. Complicated queries are not necessarily faster.
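To get a feeling for the first point, here is a rough back-of-the-envelope sketch. The sizes for the parent document and the embedded objects are made up for illustration, and the per-element overhead of the array itself is ignored, so treat the result as an upper bound rather than an exact figure.

```javascript
// BSON documents in MongoDB are capped at 16 MB (16 * 1024 * 1024 bytes).
const MAX_BSON_BYTES = 16 * 1024 * 1024;

// Hypothetical sizes, chosen only for illustration.
const parentOverheadBytes = 1024; // fields of the parent document itself
const embeddedItemBytes = 512;    // average size of one embedded object

// Upper bound on how many embedded objects fit into a single parent document.
function maxEmbeddedItems(limit, overhead, itemSize) {
  return Math.floor((limit - overhead) / itemSize);
}

const maxItems = maxEmbeddedItems(MAX_BSON_BYTES, parentOverheadBytes, embeddedItemBytes);
console.log(maxItems); // 32766 with the assumed sizes
```

With a separate collection there is no such ceiling on the number of related objects, only on the size of each one.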
Usually, embedded documents stem from the fact that developers are used to SQL JOINs and want their data all within one query. But if you boil it down, usually you have questions like
For a given X, what are the Ys belonging to it?
So usually you already have X. There is no need to prematurely load data you never need in the majority of cases. Think of an overview page of Xs where you select the X whose Ys you want to see. Even with a pagination of 10, 9/10 of the data loaded would be useless if you had all the data embedded. Fun fact: This applies to SQL, too, though nobody seems to care about real optimizations nowadays.
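A minimal sketch of that access pattern, using plain in-memory arrays as stand-ins for two collections. The collection contents, field names (xId, title, body) and function names are hypothetical, and no MongoDB driver is involved.

```javascript
// Hypothetical in-memory stand-ins for two separate collections.
const xs = [
  { _id: 1, title: "X one" },
  { _id: 2, title: "X two" },
];
const ys = [
  { _id: 10, xId: 1, body: "first Y of X one" },
  { _id: 11, xId: 1, body: "second Y of X one" },
  { _id: 12, xId: 2, body: "only Y of X two" },
];

// Overview page: only the Xs are read, no Y is touched.
function listOverview() {
  return xs.map((x) => ({ _id: x._id, title: x.title }));
}

// Detail page: "For a given X, what are the Ys belonging to it?"
function ysFor(xId) {
  return ys.filter((y) => y.xId === xId);
}

console.log(listOverview());
console.log(ysFor(1));
```

The second lookup only happens for the one X the user actually selects, which is the point of keeping the Ys out of the X documents.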
This is the summary of my blog post "The problem with overembedding", where you find an in-detail explanation of the points mentioned above.

Related

How couchdb 1.6 inherently take advantage of Map reduce when it is Single Server Database

I am new to CouchDB. While going through the CouchDB 1.6 documentation, I learned that it is a single-server database, so I was wondering how map reduce can inherently take advantage of it.
If I need to scale this database, do I need to add more RAID hardware, or will it work on commodity hardware like HDFS does?
I also learned that CouchDB 2.0 is planning to bring a clustering feature, but I could not find proper documentation on this.
Can you please help me understand how exactly files are stored and accessed internally?
Really appreciate your help.
I think your question is something like this:
"MapReduce is … a parallel, distributed algorithm on a cluster." [shortened from MapReduce article on Wikipedia]
But CouchDB 1.x is not a clustered database.
So what does CouchDB mean by using the term "map reduce"?
This is a reasonable question.
The historical use of "MapReduce", as described by Google in this paper using that stylized term and as implemented in Hadoop using the same styling, implies parallel processing over a dataset that may be too large for a single machine to handle.
But that's not how CouchDB 1.x works. View index "map" and "reduce" processing happens not just on a single machine, but even on a single thread! As dch (a longtime contributor to the core CouchDB project) explains in his answer at https://stackoverflow.com/a/12725497/179583:
The issue is that eventually, something has to operate in serial to build the B~tree in such a way that range queries across the indexed view are efficient. … It does seem totally wacko the first time you realise that the highly parallelisable map-reduce algorithm is being operated sequentially, wat!
So: what benefit does map/reduce bring to single-server CouchDB? Why were CouchDB 1.x view indexes built around it?
The benefit is that the two functions a developer can provide for each index, "map" and optionally "reduce", form very simple building blocks that are easy to reason about, at least after your indexes are designed.
What I mean is this:
With e.g. the SQL query language, you focus on what data you need — not on how much work it takes to find it. So you might have unexpected performance problems, that may or may not be solved by figuring out the right columns to add indexes on, etc.
With CouchDB, the so-called NoSQL approach is taken to an extreme. You have to think explicitly about how each document or set of documents "should be" found. You say, I want to be able to find all the "employee" documents whose "supervisor" field matches a certain identifier. So now you have to write a map function:
function (doc) {
  if (doc.isEmployeeRecord) emit(doc.supervisor.identifier);
}
And then you have to query it like:
GET http://couchdb.local:5984/personnel/_design/my_indexes/_view/by_supervisor?key=SOME_UUID
In SQL you might simply say something like:
SELECT * FROM personnel WHERE supervisor = ?
So what's the advantage to the CouchDB way? Well, in the SQL case this query could be slow if you don't have an index on the supervisor column. In the CouchDB case, you can't really make an unoptimized query by accident — you always have to figure out a custom view first!
(The "reduce" function that you provide to a CouchDB view is usually used for aggregation, like counting or averaging across multiple documents.)
If you think this is a dubious advantage, you are not alone. Personally I found designing my own indexes via a custom "map function" and sometimes a "reduce function" to be an interesting challenge, and it did pay off in knowing the scaling costs at least of queries (not so much for replications…).
So don't think of a CouchDB view so much as being "MapReduce" (in the stylized sense) but just as providing efficiently accessible storage for the results of running [].map(…).reduce(…) across a set of data. Because the "map" function is applied to only one document at a time, the total set of data can be bigger than fits in memory at once. Because the "reduce" function is limited in its size, it further encourages efficient processing of a large set of data into an efficiently accessed index.
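To make that mental model concrete, here is a small in-memory imitation of a view. This is only a sketch of the idea, not how CouchDB implements it: the names buildView, queryByKey and countReduce are made up, emit is passed in as a parameter instead of being a global, and a sorted array stands in for the on-disk B-tree.

```javascript
// Simplified, in-memory imitation of a view index (hypothetical helper).
function buildView(docs, mapFn) {
  const rows = [];
  for (const doc of docs) {
    // The map function sees one document at a time and may emit any number of rows.
    const emit = (key, value) =>
      rows.push({ id: doc._id, key, value: value === undefined ? null : value });
    mapFn(doc, emit);
  }
  // Rows are kept sorted by key; that ordering is what makes key lookups cheap.
  rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  return rows;
}

// A linear scan is enough for the sketch; a real index would seek by key.
function queryByKey(rows, key) {
  return rows.filter((row) => row.key === key);
}

// A "reduce" that sums the emitted values.
function countReduce(rows) {
  return rows.reduce((sum, row) => sum + row.value, 0);
}

const docs = [
  { _id: "e1", isEmployeeRecord: true, supervisor: { identifier: "boss-a" } },
  { _id: "e2", isEmployeeRecord: true, supervisor: { identifier: "boss-b" } },
  { _id: "e3", isEmployeeRecord: true, supervisor: { identifier: "boss-a" } },
  { _id: "x1", isEmployeeRecord: false },
];

const bySupervisor = buildView(docs, (doc, emit) => {
  if (doc.isEmployeeRecord) emit(doc.supervisor.identifier, 1);
});

const reportsOfA = queryByKey(bySupervisor, "boss-a");
const headcountOfA = countReduce(reportsOfA);
console.log(reportsOfA.map((row) => row.id), headcountOfA);
```

The map step never needs more than one document in memory, and the query only ever touches the stored rows, never the documents themselves.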
If you want to learn a bit more about how the indexes generated in CouchDB are stored, you might find these articles interesting:
The Power of B-trees
CouchDB's File Format is brilliantly simple and speed-efficient (at the cost of disk space).
Technical Details, View Indexes
You may have noticed, and I am sorry, that I do not actually have a clear/solid answer of what the actual advantage and reasons were! I did not design or implement CouchDB; I was only an avid user for many years.
Maybe the bigger advantage is that, in systems like Couchbase and CouchDB 2.x, the "parallel friendliness" of the map/reduce idea may come into play more. So then if you have designed an app to work in CouchDB 1.x it may then scale in the newer version without further intervention on your part.

Performance-wise, is it worth it to rename every mongo key name for production? [duplicate]

As far as I know, every key name is stored "as-is" in the mongo database. It means that a field "name" will be stored using the 4 letters everywhere it is used.
Would it be wise, if I want my app to be ready to store a large amount of data, to rename every key in my mongo documents? For instance, "name" would become "n" and "description" would become "d".
I expect it to significantly reduce the space used by the database as well as the amount of data sent to the client (not to mention that it kind of uglifies the mongo documents' content). Am I right?
If I undertake the rename of every key in my code (no need to rename the existing data, I can rebuild it from scratch), is there a good practice or any additional advice I should know?
Note: this is mainly speculation, I don't have benchmarking results to back this up
While "minifying" your keys technically would reduce the size of your memory/diskspace footprint, I think the advantages of this are quite minimal if not actually disadvantageous.
The first thing to realize is that data stored in MongoDB is not stored in its raw JSON format; it's stored as binary using a standard known as BSON. This allows Mongo to do all sorts of internal optimizations, such as compression if you're using WiredTiger as your storage engine (thanks for pointing that out, @Jpaljasma).
Second, let's say you do minify your keys. Well then you need to minify your keys. Every time. Forever. That's a lot of work on your application side. Plus you need to unminify your keys when you read (because users won't know what n is). Every time. Forever. All of a sudden your minor memory optimization becomes a major runtime slowdown.
Third, that minifying/unminifying process is kinda complicated. You need to maintain the mappings between the two, keep them tested and up to date, and never have any overlap (if you do, that's the end of all your data pretty much). I wouldn't ever work on that.
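To give an idea of what that mapping layer looks like, here is a minimal sketch. FIELD_MAP and all function names are hypothetical, only top-level keys are handled, and keys missing from the mapping are passed through unchanged (which is itself a source of possible collisions).

```javascript
// Hypothetical mapping between readable application field names and short stored names.
const FIELD_MAP = { name: "n", description: "d" };

// Fail fast if two long names map to the same short name (the "overlap" case).
function assertNoOverlap(map) {
  const shortNames = Object.values(map);
  if (new Set(shortNames).size !== shortNames.length) {
    throw new Error("two fields share the same short name");
  }
}

// Build the reverse mapping (short name -> long name).
function invert(map) {
  return Object.fromEntries(
    Object.entries(map).map(([longName, shortName]) => [shortName, longName])
  );
}

// Rename top-level keys; keys that are not in the mapping are kept as they are.
function renameKeys(doc, map) {
  const has = (key) => Object.prototype.hasOwnProperty.call(map, key);
  return Object.fromEntries(
    Object.entries(doc).map(([key, value]) => [has(key) ? map[key] : key, value])
  );
}

assertNoOverlap(FIELD_MAP);
const minify = (doc) => renameKeys(doc, FIELD_MAP);
const unminify = (doc) => renameKeys(doc, invert(FIELD_MAP));

const stored = minify({ name: "Widget", description: "A small part", price: 3 });
const restored = unminify(stored);
console.log(stored, restored);
```

Even this toy version needs an overlap check and a round-trip test, and it does not yet cover nested documents, queries, index definitions or update operators, all of which would have to go through the same translation.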
So overall, I think it's a pretty terrible idea to minify your keys to save a couple of characters. It's important to keep the big picture in mind: the VAST majority of your data will be not in the keys, but in the values. If you want to optimize data size, look there.
The full name of every field is included in every document. So when your field-names are long and your values rather short, you can end up with documents where the majority of the used space is occupied by redundant field names.
This affects the total storage size and decreases the number of documents which can be cached in RAM, which can negatively affect performance. But using descriptive field-names does of course improve readability of the database content and queries, which makes the whole application easier to develop, debug and maintain.
Depending on how flexible your driver is, it might also require quite a lot of boilerplate code to convert between your application field-names and the database field-names.
Whether or not this is worth it depends on how complex your database is and how important performance is to you.
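For a rough idea of the numbers involved: in BSON, every element stores its field name as a null-terminated UTF-8 string, so a name costs its byte length plus one. The sketch below uses the "name"/"description" example from the question; the document count is an assumption, and the result is the uncompressed size, before any storage engine compression.

```javascript
// Bytes a field name occupies in one BSON element: UTF-8 bytes plus a null terminator.
function fieldNameBytes(name) {
  return new TextEncoder().encode(name).length + 1;
}

// Total bytes spent on field names in one document with the given top-level fields.
function keyBytesPerDocument(fieldNames) {
  return fieldNames.reduce((total, name) => total + fieldNameBytes(name), 0);
}

const longKeys = keyBytesPerDocument(["name", "description"]); // 5 + 12 = 17
const shortKeys = keyBytesPerDocument(["n", "d"]);             // 2 + 2 = 4
const savedPerDocument = longKeys - shortKeys;                 // 13

// Hypothetical collection size, chosen only for illustration.
const documentCount = 100 * 1000 * 1000;
const savedBytes = savedPerDocument * documentCount;

console.log(savedPerDocument, savedBytes); // 13 bytes per document, 1.3e9 bytes in total
```

Whether roughly 1.3 GB across a hundred million documents matters depends on how large the values are in comparison, which is the trade-off both answers describe.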

Querying a view using multiple keys

Given the following view for the gamesim-sample example:
function (doc, meta) {
  if (doc.jsonType == "player" && doc.experience) {
    emit([doc.experience, meta.id], doc.id);
  }
}
I would like to query the leaderboard for users who belong only to a specific group (the grouping data is maintained in an external system).
For example, if the view has users "orange", "purple", "green", "blue" and "red", I would like the leaderboard to give me the rankings of only "orange" and "purple" without having to query their respective current experience points.
...view/leaderboard?keys=[[null,"orange"],[null,"purple"]]
The following works fine, but it requires additional queries to find the experience points of "orange" and "purple" beforehand, which does not scale for obvious reasons.
...view/leaderboard?keys=[[1,"orange"],[5,"purple"]]
Thanks in advance!
Some NoSql vs. SQL Background
First, you have to remember that specifically with Couchbase, the advantage is the super-fast storage and retrieval of records. Indices were added later, as a way to make storage a little more useful and less error-prone (think of them more as an automated inventory), and their design really constrains you to move away from SQL-style thinking. Your query above is a perfect example:
select *
from leaderboard
where id in ('orange','purple')
order by experience
This is a retrieval, computation, and filter all in one shot. This is exactly what NoSql databases are optimized not to do (and conversely, SQL databases are, which often makes them hopelessly complex, but that is another topic).
So, this leads to the primary difference between a SQL vs a NoSQL database: NoSql is optimized for storage while SQL is optimized for querying. In conjunction, it causes one to adjust how one thinks about the role of the database, which in my opinion should be more the former than the latter.
The creators of Couchbase originally focused purely on the storage aspect of the database. However, storage makes a lot more sense when you know what it is you have stored, and indices were added later as a feature (originally you had to keep track of your own stuff, and it was not much fun!). They also added in map-reduce in a way that takes advantage of CB's ability to store and retrieve massive quantities of records simultaneously. Neither of these features was really intended to solve complex query problems (even though this query is simple, it is a perfect example because of this). This is the function of your application logic.
Addressing Your Specific Issue
So, now on to your question. The query itself appears to be a simple one, and indeed it is. However,
select * from leaderboard
is not actually simple. It is instead a 2-layer deep query, as your definition of leaderboard implies a sorted list from largest to smallest player experience. Therefore, this query, expanded out, becomes:
select * from players order by experience desc
Couchbase supports the above natively in the index mechanism (remember, it inventories your objects), and you have accurately described in your question how to leverage views to achieve this output. What Couchbase does not support is the third-level query, which represents your where clause. Typically, a where in Couchbase is executed in either the view "map" definition or the index selection parameters. You can't do it in "map" because you don't always want the same selection, and you can't do it in the index selection parameter because the index is sorted on experience level first.
Method 1
Let's assume that you are displaying this to a user on a web page. You can easily implement this filter client-side (or in your web service) by pulling the data as-is and throwing out values that you don't want. Use the limit and skip parameters to ask for more as the user scrolls down (or clicks more pages, or whatever).
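A minimal sketch of Method 1, assuming the view rows arrive sorted by experience, highest first. The rows are hard-coded and fetchPage is a hypothetical stand-in for one view request with skip and limit; a real client would issue the HTTP query instead.

```javascript
// Hypothetical rows as the leaderboard view would return them: key = [experience, id].
const leaderboardRows = [
  { key: [90, "green"] },
  { key: [75, "purple"] },
  { key: [60, "blue"] },
  { key: [40, "orange"] },
  { key: [10, "red"] },
];

// Stand-in for one view request using the skip and limit parameters.
function fetchPage(skip, limit) {
  return leaderboardRows.slice(skip, skip + limit);
}

// Page through the full leaderboard and keep only members of the wanted group.
function filteredLeaderboard(groupIds, wanted, pageSize) {
  const result = [];
  for (let skip = 0; result.length < wanted; skip += pageSize) {
    const page = fetchPage(skip, pageSize);
    if (page.length === 0) break; // no more rows in the view
    for (const row of page) {
      const [experience, id] = row.key;
      if (groupIds.has(id) && result.length < wanted) {
        result.push({ id, experience });
      }
    }
  }
  return result;
}

const groupFromExternalSystem = new Set(["orange", "purple"]);
const top = filteredLeaderboard(groupFromExternalSystem, 2, 2);
console.log(top);
```

The cost is that rows of players outside the group are transferred and thrown away, so this works best when the group is not a tiny fraction of all players.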
Method 2
Reverse the order of your index, and sort by "group" (aka color) first, then experience level. Run separate queries to select the top 'N' users of each color, then merge and sort on the client side. This will take longer to load up-front but will give you a larger in-memory data set to work with if you need it for that reason. This method may not work well if you have a very uneven distribution of categories, in which case 'N' would need to be tailored to match the statistical distribution(s) within the categories.
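A sketch of Method 2 under the assumption that each player document carries its group, so a view can emit keys of the form [group, experience]. The rows, the player ids and the topOfGroup helper are hypothetical; topOfGroup stands in for one range query per group.

```javascript
// Hypothetical rows of the reversed index: key = [group, experience], value = player id.
const rowsByGroup = [
  { key: ["green", 90], value: "p5" },
  { key: ["orange", 40], value: "p1" },
  { key: ["orange", 85], value: "p2" },
  { key: ["purple", 20], value: "p4" },
  { key: ["purple", 75], value: "p3" },
];

// Stand-in for one per-group query that returns the top N rows of that group.
function topOfGroup(group, n) {
  return rowsByGroup
    .filter((row) => row.key[0] === group)
    .sort((a, b) => b.key[1] - a.key[1])
    .slice(0, n);
}

// Run one query per group, then merge and sort on the client side.
function mergedLeaderboard(groups, n) {
  return groups
    .flatMap((group) => topOfGroup(group, n))
    .sort((a, b) => b.key[1] - a.key[1])
    .slice(0, n);
}

const merged = mergedLeaderboard(["orange", "purple"], 2);
console.log(merged.map((row) => row.value));
```

Fetching N rows per group guarantees that the merged top N is correct, at the price of loading up to N times the number of groups.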
Bottom Line
One parting thought is that NoSql databases were designed to deal with highly dynamic data sets. This requires some statistical thinking, because there no longer is a single "right" answer. Some degree of inconsistency and error is to be expected (as there always is in the real world). You can't expect a NoSql database to return a perfect query result - because there is no perfection. You have to settle for "good enough" - which is often much better than what is needed anyway.

Payload performance in Lucene

I know there are several topics on the web, as well as on SO, regarding indexing and query performance within Lucene, but I have yet to find one that discusses whether or not (and if so, how much?) creating payloads will affect query performance...
Here's the scenario ...
Let's say I want to index a collection of documents (anywhere from 100K - 10M), and each document has a subsection that I want to be able to search separately (or perhaps rank higher, depending on whether a match was found within that section).
I'm considering adding a payload (during indexing) to any term that appears within that subsection, so I can efficiently make that determination at query-time.
Does anyone know of any performance issues related to using payloads, or even better, could you point me to any online documentation about this topic?
Thanks!
EDIT: I appreciate the alternative solutions to my scenario, but in case I do need to use payloads in the future, does anyone have any comments regarding the original question about query performance?
The textbook solution to what you want to do is index each original document as two fields: one for the full document, and the other for the subsection. You can boost the subsection field separately either during indexing or during retrieval.
Having said that, you can read about Lucene payloads here: Getting Started with Payloads.
Your use case doesn't fit well with the purpose of payloads -- it looks to me that any payload information would be redundant.
Payloads are attached to individual occurrences of terms in the document, not to document/term pairs. In order to store and access payloads, you have to use the offset of the term occurrence within the document. In your case, if you know the offset, you should be able to calculate which section the term occurrence is in, without using payload data.
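As a sketch of that last point, assuming you recorded at indexing time where the subsection starts and ends (the boundaries and positions below are made up, and this is plain JavaScript rather than Lucene API code):

```javascript
// Hypothetical layout: the subsection covers token positions 120 to 179 inclusive.
const subsection = { start: 120, end: 179 };

// True if a term occurrence at the given position lies inside the section.
function isInSection(position, section) {
  return position >= section.start && position <= section.end;
}

// Count how many occurrences of a term fall inside the section.
function occurrencesInSection(positions, section) {
  return positions.filter((position) => isInSection(position, section)).length;
}

// Hypothetical positions at which some query term occurs in one document.
const termPositions = [3, 121, 150, 400];
const hitsInSubsection = occurrencesInSection(termPositions, subsection);
console.log(hitsInSubsection);
```

The same information a payload would carry ("this occurrence is in the subsection") falls out of the position alone, which is why the payload would be redundant here.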
The broader question is the effect of payloads on performance. My experience is that when properly used, the payload implementation takes up less space and is faster than whatever workaround I was previously using. The biggest impact on disk space will be wherever you currently use Field.setOmitTermFreqAndPositions(true) to reduce index size. You will need to include positions to use payloads, which potentially makes the index much larger.

One complex query vs Multiple simple queries

What is actually better? Having classes with complex queries responsible to load for instance nested objects? Or classes with simple queries responsible to load simple objects?
With complex queries you make fewer trips to the database, but the class has more responsibility.
With simple queries you make more trips to the database, but each class is responsible for loading only one type of object.
The situation I'm in is that the loaded objects will be sent to a Flex application as DTOs.
The general rule of thumb here is that server roundtrips are expensive (relative to how long a typical query takes) so the guiding principle is that you want to minimize them. Basically each one-to-many join will potentially multiply your result set so the way I approach this is to keep joining until the result set gets too large or the query execution time gets too long (roughly 1-5 seconds generally).
Depending on your platform you may or may not be able to execute queries in parallel. This is a key determinant in what you should do because if you can only execute one query at a time the barrier to breaking up a query is that much higher.
Sometimes it's worth keeping certain relatively constant data in memory (country information, for example) or loading it with a separate query, but this is, in my experience, reasonably unusual.
Far more common is having to fix up systems with awful performance due in large part to doing separate queries (particularly correlated queries) instead of joins.
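The difference in roundtrips can be sketched with a fake in-memory database that simply counts calls. Everything here (makeFakeDb, the orders/customers data, the method names) is hypothetical; the separate-query variant is the one often called the "N+1" pattern.

```javascript
// In-memory stand-in for a database that counts roundtrips; no real driver is used.
function makeFakeDb() {
  const orders = [
    { id: 1, customerId: 7 },
    { id: 2, customerId: 8 },
    { id: 3, customerId: 7 },
  ];
  const customers = [
    { id: 7, name: "Ann" },
    { id: 8, name: "Bob" },
  ];
  const findCustomer = (id) => customers.find((c) => c.id === id);
  let roundtrips = 0;
  return {
    roundtrips: () => roundtrips,
    allOrders() { roundtrips++; return orders; },
    customerById(id) { roundtrips++; return findCustomer(id); },
    // Stand-in for a single query that joins orders with their customers.
    ordersWithCustomers() {
      roundtrips++;
      return orders.map((o) => ({ ...o, customer: findCustomer(o.customerId) }));
    },
  };
}

// Separate queries: one for the orders plus one per order.
function loadSeparately(db) {
  return db.allOrders().map((o) => ({ ...o, customer: db.customerById(o.customerId) }));
}

// Joined: everything arrives in a single roundtrip.
function loadJoined(db) {
  return db.ordersWithCustomers();
}

const dbA = makeFakeDb();
const separate = loadSeparately(dbA);
const dbB = makeFakeDb();
const joined = loadJoined(dbB);
console.log(dbA.roundtrips(), dbB.roundtrips()); // 4 versus 1 for three orders
```

Both variants return the same objects; only the number of roundtrips differs, and it grows with the number of rows in the separate-query case.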
I don't think that either option is actually better. It depends on your application specifics, architecture, the DBMS used, and other factors.
E.g. we used multiple simple queries within our standalone solution. But when we evolved our product towards a lightweight internet-accessible solution, we discovered that our framework made a huge number of requests, and that killed performance because of network latency. So we significantly reworked our framework to use aggregated complex queries. Meanwhile, we still maintained our standalone solution and moved from Oracle Light to Apache Derby. And once more we found that some of our new complex queries had to be simplified, as Derby took too long to execute them.
So look at your real problem and solve it appropriately. I think that simple queries are good to begin with if there are no strong objections against them.
From a gut feeling I would say:
Go with the simple way as long as there is no proven reason to optimize for performance. Otherwise I would put the "complex objects and query" approach in the basket of premature optimization.
If you find that there are real performance implications, then you should in the next step optimize the roundtripping between Flex and your backend. But as I said before: This is a gut feeling. You really should start out with a definition of "performant", start simple, and measure the performance.

Resources