I am building a traffic tracking application. I ended up using CouchDB to store all the traffic logs. The application can dynamically create views based on the user's query and custom data.
I want to create thousands (possibly up to millions) of views.
Is there a limit? Would too many views impact CouchDB performance?
There is no hard limit on the number of views. There are a few things I would recommend though:
First, split up your views among many design documents. My first thought is 1 per user, but you could probably sub-divide them further depending on how many views you actually have.
Views are grouped internally by the design document, which affects when they are rebuilt, where they are stored, etc. Thus, keeping things partitioned off will help prevent 1 user's views from impacting the performance of any other user.
In addition, unless you compact your database regularly, each document (including design documents) retains its old copies across writes, which is one of the reasons CouchDB uses so much disk space (it trades disk space for the ability to write quickly).
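To make the one-design-document-per-user idea concrete, here is a minimal sketch that builds such a document as plain JSON. The `_id`/`views`/`map` layout is the standard design document shape; the user name, view name, and document fields (`user`, `page`) are made up for illustration.

```python
import json


def design_doc_for_user(user_id, views):
    """Build one design document that holds every view for a single user."""
    return {
        "_id": f"_design/user_{user_id}",
        "language": "javascript",
        "views": {name: {"map": map_src} for name, map_src in views.items()},
    }


# Hypothetical view for a hypothetical user "alice"; the map function is
# JavaScript source stored as a string, as CouchDB expects.
hits_by_page = "function (doc) { if (doc.user === 'alice') { emit(doc.page, 1); } }"
ddoc = design_doc_for_user("alice", {"hits_by_page": hits_by_page})
print(json.dumps(ddoc, indent=2))
```

Saving this document to the database is left out on purpose; the point is only that each user's views live under their own `_design/...` id, so rebuilding one user's index does not touch anyone else's.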
Second, be very conservative with the values you emit() in your views. Avoid things like emit(key, doc). If you emit the entire document in your view, it will be considered part of the view index (which is stored separately from the primary database index) and creates multiple copies of the document. If you need to access the source document in your view, you should use include_docs=true.
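As a rough illustration of why emitting the whole document is costly, the sketch below models a view index as a list of emitted rows and compares the serialized size of the two styles. It is a toy model in plain Python, not how CouchDB actually stores its index, and the log documents are invented.

```python
import json

# Invented traffic-log documents with a bulky field.
docs = [
    {"_id": f"log-{i}", "page": f"/p/{i % 3}", "agent": "x" * 200, "bytes": i}
    for i in range(100)
]


def map_whole_doc(doc):
    # Python stand-in for emit(doc.page, doc): the document is copied into the index.
    yield doc["page"], doc


def map_lean(doc):
    # Python stand-in for emit(doc.page, null): only the key goes into the index.
    yield doc["page"], None


def index_size(map_fn):
    """Serialized size of all rows a map function emits (toy measure)."""
    rows = [row for doc in docs for row in map_fn(doc)]
    return len(json.dumps(rows))


fat = index_size(map_whole_doc)
lean = index_size(map_lean)
print(fat, lean)
```

With the lean version you would query the view with include_docs=true whenever you actually need the source documents.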
Depending on exactly the situation, you may want to consider partitioning across multiple databases as well. That may not be possible, depending on how you want to write queries and such, but worth mentioning. If you can partition into databases, that will make creating backups a little easier and may scale better in the long run.
The main point is, CouchDB is very flexible, which is one of my favorite things about it, as it puts the power in your hands as a developer.
Related
How many views per bucket is too much, assuming a large amount of data in the bucket (>100GB, >100M documents, >12 document types), and assuming each view applies only to one document type? Or asked another way, at what point should some document types be split into separate buckets to save on the overhead of processing all views on all document types?
I am having a hard time deciding how to split my data into couchbase buckets, and the performance implications of the views required on the data. My data consists of more than a dozen relational DBs, with at least half with hundreds of millions of rows in a number of tables.
The http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-writing-bestpractice.html doc section "using document types" seems to imply having multiple document types in the same bucket is not ideal because views on specific document types are updated for all documents, even those that will never match the view. Indeed, it suggests separating data into buckets to avoid this overhead.
Yet there is a limit of 10 buckets per cluster for performance reasons. My only conclusion therefore is that each cluster can handle a maximum of 10 large collections of documents efficiently. Is this accurate?
Tug's advice was right on and allow me to add some perspective as well.
A bucket can be considered most closely related to (though not exactly) a "database instantiation" within the RDBMS world. There will be multiple tables/schemas within that "database" and those can all be combined within a bucket.
Think about a bucket as a logical grouping of data that all shares some common configuration parameters (RAM quota, replica count, etc) and you should only need to split your data into multiple buckets when you need certain datasets to be controlled separately. Other reasons are related to very different workloads to different datasets or the desire to be able to track the workload to those datasets separately.
Some examples:
-I want to control the caching behavior for one set of data differently than another. For instance, many customers have a "session" bucket that they want always in RAM whereas they may have a larger, "user profile" bucket that doesn't need all the data cached in RAM. Technically these two data sets could reside in one bucket and allow Couchbase to be intelligent about which data to keep in RAM, but you don't have as much guarantee or control that the session data won't get pushed out...so putting it in its own bucket allows you to enforce that. It also gives you the added benefit of being able to monitor that traffic separately.
-I want some data to be replicated more times than others. While we generally recommend only one replica in most clusters, there are times when our users choose certain datasets that they want replicated an extra time. This can be controlled via separate buckets.
-Along the same lines, I only want some data to be replicated to another cluster/datacenter. This is also controlled per-bucket and so that data could be split to a separate bucket.
-When you have fairly extreme differences in workload (especially around the amount of writes) to a given dataset, it does begin to make sense from a view/index perspective to separate the data into a separate bucket. I mention this because it's true, but I also want to be clear that it is not the common case. You should use this approach after you identify a problem, not before because you think you might.
Regarding this last point, yes every write to a bucket will be picked up by the indexing engine but by using document types within the JSON, you can abort the processing for a given document very quickly and it really shouldn't have a detrimental impact to have lots of data coming in that doesn't apply to certain views. If you don't mind, I'm particularly curious at which parts of the documentation imply otherwise since that certainly wasn't our intention.
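Here is a small sketch of the "abort quickly on document type" pattern, written as a Python generator rather than a real JavaScript map function. The `type`, `customer_id`, and `total` field names are assumptions for the example.

```python
def map_orders(doc):
    """Python stand-in for a view map function that only indexes orders."""
    # Bail out immediately for every other document type, so unrelated
    # writes cost almost nothing for this view.
    if doc.get("type") != "order":
        return
    yield doc["customer_id"], doc["total"]


docs = [
    {"type": "order", "customer_id": "c1", "total": 30},
    {"type": "session", "token": "abc"},
    {"type": "order", "customer_id": "c2", "total": 12},
]
rows = [row for doc in docs for row in map_orders(doc)]
print(rows)
```

The session document is still handed to the function, but it is rejected after a single comparison and contributes nothing to the index.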
So in general, we see most deployments with a low number of buckets (2-3) and only a few upwards of 5. Our limit of 10 comes from some known CPU and disk IO overhead of our internal tracking of statistics (the load or lack thereof on a bucket doesn't matter here). We certainly plan to reduce this overhead with future releases, but that still wouldn't change our recommendation of only having a few buckets. The advantages of being able to combine multiple "schemas" into a single logical grouping and apply view/indexes across that still exist regardless.
We are in the process right now of coming up with much more specific guidelines and sizing recommendations (I wrote those first two blogs as a stop-gap until we do).
As an initial approach, you want to try and keep the number of design documents around 4 because by default we process up to 4 in parallel. You can increase this number, but that should be matched by increased CPU and disk IO capacity. You'll then want to keep the number of views within each document relatively low, probably well below 10, since they are each processed in serial.
I recently worked with one user who had a fairly large number of views (around 8 design documents, some with nearly 20 views) and we were able to drastically bring this down by combining multiple views into one. Obviously it's very application dependent, but you should try to generate multiple different "queries" off of one index. Using reductions, key-prefixing (within the views), and collation, all combined with different range and grouping queries can make a single index that may appear crowded at first, but is actually very flexible.
The fewer design documents and views you have, the less disk space, IO and CPU resources you will need. There's never going to be a magic bullet or hard-and-fast guideline number unfortunately. In the end, YMMV and testing on your own dataset is better than any multi-page response I can write ;-)
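To show what "multiple queries off of one index" can look like, here is a toy model in plain Python: one sorted list of composite keys, queried by prefix range. Python tuple ordering only approximates real view collation, and the key layout (type, year, month) and the `float("inf")` high sentinel are assumptions made for this sketch.

```python
from bisect import bisect_left, bisect_right

# One index with prefixed composite keys: (doc type, year, month) -> value.
index = sorted([
    (("order", 2013, 1), 30),
    (("order", 2013, 2), 12),
    (("order", 2014, 1), 7),
    (("visit", 2013, 1), 1),
    (("visit", 2013, 1), 1),
])
keys = [key for key, _ in index]

HIGH = float("inf")  # sentinel that sorts after any year or month


def range_query(startkey, endkey):
    """Return all rows whose key lies in [startkey, endkey]."""
    lo = bisect_left(keys, startkey)
    hi = bisect_right(keys, endkey)
    return index[lo:hi]


# Query 1: all orders in 2013, then a "reduce" that sums the values.
orders_2013 = range_query(("order", 2013), ("order", 2013, HIGH))
total_orders_2013 = sum(value for _, value in orders_2013)

# Query 2: every visit, regardless of date, from the same index.
visits = range_query(("visit",), ("visit", HIGH))
print(total_orders_2013, len(visits))
```

Both questions are answered from the same sorted structure; only the start and end keys change.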
Hope that helps, please don't hesitate to reach out to us directly if you have specific questions about your specific use case that you don't want published.
Perry
As you can see from the Couchbase documentation, it is not really possible to provide a "universal" rule that gives you an exact number.
But based on the best practice document that you have used and some discussion (here) you should be able to design your database/views properly.
Let's start with the last question:
YES, the reason Couchbase advises having a small number of buckets is performance and, more importantly, resource consumption. I invite you to read these blog posts, which help explain what's going on "inside" Couchbase:
Sizing 1: http://blog.couchbase.com/how-many-nodes-part-1-introduction-sizing-couchbase-server-20-cluster
Sizing 2: http://blog.couchbase.com/how-many-nodes-part-2-sizing-couchbase-server-20-cluster
Compaction: http://blog.couchbase.com/compaction-magic-couchbase-server-20
So you will see that most of the "operations" are done by bucket.
So let's now look at the original question:
Yes, most of the time you will organize the design documents and views by type of document.
It is NOT a problem to have all the document "types" in a single bucket (or a few buckets); this is in fact the way you work with Couchbase.
The most important things to look at are the size of your documents (to see how "long" parsing the JSON will take) and how often documents will be created, updated, and deleted, since the JS code of the view is ONLY executed when you create or change a document.
So what you should do:
1 single bucket
how many design documents? (how many types do you have?)
how many views will you have in each design document?
In fact the most expensive part is not indexing or querying; it is rebalancing the data and indices between nodes (adding, removing, or failure of nodes).
Finally, though it looks like you already know it, this chapter is quite good for understanding how views work (how the index is created and used):
http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-operation.html
Do not hesitate to add more information if needed.
If I have two different datasets in CouchDB,
one is infrequently updated (mostly updates to existing documents),
another one is written to very frequently (append-only)
Do I gain any advantage in separating them in separate databases performance-wise? Assume the database is regularly compacted.
From my experience, the performance gains depend very much on the views you use to query the data. I don't see how write performance would increase substantially by separating out the database with frequent writes, but since this affects the size of your database, I would advise keeping them separate. This lets you run compaction at different times, and if you do have an issue with one database, you can isolate and address it faster than if everything were in one single database.
I am creating a database that will store 100.000 (and probably more in the future) users. While this obviously happens in a table with 1 row per user, every user can (and will) store hundreds of items. In programming language this would mean the user has 2 arrays (or one 2-dimensional array) of integers: a column for the itemid's and a column for the amounts.
My instincts tell me to create a table to hold all these items, with rows like (userid, itemid, amount). However this would result in a huge table. 200.000 users with 250 items each... that's 50 million entries in one table. This, plus the fact that the table will undergo continuous and rapid change, frightens me. (How rapid? I estimate up to 100 modifications per second.)
Typically there will be anywhere between 100 and 2000 users, all adding and removing items, and modifying amounts. These actions can and will happen in programming code. It would go as follows:
User starts session, program loads all the users items from the database
User modifies the item list
Every few minutes, the changes are saved into the database
When the user ends the session, it is also saved into the database
It is worth noting that there is a maximum to the number of items a user can store.
Are there any alternatives to using a separate table? Perhaps save the values in a formatted text string? Or is this one of the instances where using a MySQL database is actually a Bad Idea™?
Thank you for your time and insights.
My instincts tell me to create a table to hold all these items
Your instincts are right.
1) avoid premature optimisation
2) don't break the rules of normalization unless you've got a very good and real reason to do so
3) why do you suspect that the multi-table approach will be faster?
that's 50 million entries in one table
So what? Even if you only have an index on userid, the single table will not be noticeably slower than one table per user (in practice, with 200,000 users, it will be much, much faster, since the DBMS cannot comfortably keep an open file handle for each table!).
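For what it's worth, the single-table layout can be tried out in a few lines. This sketch uses Python's built-in sqlite3 instead of MySQL purely so it runs anywhere; the table name `user_items` and the row counts are made up, and `INSERT OR REPLACE` is SQLite syntax (MySQL spells the upsert differently).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_items (
        userid INTEGER NOT NULL,
        itemid INTEGER NOT NULL,
        amount INTEGER NOT NULL,
        PRIMARY KEY (userid, itemid)  -- also serves as the index on userid
    )
""")

# Invented sample data: 1000 users with 50 items each.
rows = [(u, i, 1) for u in range(1000) for i in range(50)]
conn.executemany("INSERT INTO user_items VALUES (?, ?, ?)", rows)

# Session start: load all items for one user via the index prefix.
items = conn.execute(
    "SELECT itemid, amount FROM user_items WHERE userid = ?", (42,)
).fetchall()

# Periodic save: write back a changed amount for one item.
conn.execute("INSERT OR REPLACE INTO user_items VALUES (?, ?, ?)", (42, 7, 5))
conn.commit()
amount = conn.execute(
    "SELECT amount FROM user_items WHERE userid = ? AND itemid = ?", (42, 7)
).fetchone()[0]
print(len(items), amount)
```

The composite primary key keeps each user's rows together in the index, so loading one user's items never scans the other users' rows.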
I estimate up to 100 modifications per second
Should be possible using MySQL and fairly basic hardware, but if it were me, and I wanted a bit of headroom, I'd go with a pair of mirrored SATA disks, tables on one mirror, indexes on the other.
The only issue I'd be concerned about (which applies regardless of which of the 2 models you choose) is supporting 2000 concurrent connections. Do the connections have to be concurrent? Or can each user download a working set (optionally using an optimistic locking strategy) and close off the connection, then push back the changes on a new connection? If not, then you'll probably want a good whack of memory and CPU.
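One common way to do the optimistic locking mentioned above is a version column that every update checks and bumps. The sketch below is only an illustration, again using sqlite3 rather than MySQL and a single connection for brevity; the `version` column and the helper names are assumptions, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_items (
        userid INTEGER NOT NULL,
        itemid INTEGER NOT NULL,
        amount INTEGER NOT NULL,
        version INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (userid, itemid)
    )
""")
conn.execute("INSERT INTO user_items (userid, itemid, amount) VALUES (1, 10, 3)")
conn.commit()


def load(userid, itemid):
    """Fetch the working copy together with the version it was read at."""
    return conn.execute(
        "SELECT amount, version FROM user_items WHERE userid = ? AND itemid = ?",
        (userid, itemid),
    ).fetchone()


def save(userid, itemid, new_amount, seen_version):
    """Write back only if nobody changed the row since it was loaded."""
    cur = conn.execute(
        "UPDATE user_items SET amount = ?, version = version + 1 "
        "WHERE userid = ? AND itemid = ? AND version = ?",
        (new_amount, userid, itemid, seen_version),
    )
    conn.commit()
    return cur.rowcount == 1


amount, version = load(1, 10)
first = save(1, 10, amount + 2, version)  # succeeds, version becomes 1
stale = save(1, 10, 99, version)          # rejected: version 0 is out of date
print(first, stale, load(1, 10))
```

In a real deployment the load and the save would happen on separate short-lived connections, which is exactly what lets you avoid holding 2000 connections open.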
But leaving aside whether to use one big table or lots of little ones, if this is the only use for the data, and access is not concurrent to particular data items, then why bother with a relational database at all? NoSQL or a shared filesystem might work just as well.
Putting data into one field as an array is almost always a mistake. It makes querying the data much harder and much more time-consuming, as well as much less likely to use indexes. It is OK if the values are just text where you would never need to find one or more elements of the array, but it is my experience that this situation is rarely encountered. Modern databases can handle 50 million records without even breaking a sweat. That's a small table in database terms.
It should be OK to do it as you described using two tables. The database should be able to handle millions of records.
The important points to look at:
1- Optimize your queries as much as possible.
2- Create the appropriate index(es) to speed up your queries.
3- Use InnoDB if you have concurrent read/update operations as it supports row-level locking as opposed to MyISAM.
4- Provide good hardware to support the database server.
5- Run the database server on a dedicated server if affordable.
I am planning on using CouchDB on a project. But as the querying mechanism involves writing views (which are a lot like indexes on regular RDBMSs), I was wondering: if the document database keeps getting updated a lot (a write-heavy database), would CouchDB perform well compared to a regular RDBMS? Or do we have to compact/re-index the system occasionally to make it perform faster?
You might think of the pros/cons of the CouchDB view model this way. (CouchDB hackers may disagree but IMO it's accurate enough for users.)
A view function always performs a full "table scan" when it is first created (just like an RDBMS BTW)
As long as they have no side effects, map and reduce functions can be arbitrarily complex
Every document and map/reduce result is cached and never calculated again
If you add or change a document, it (and only it) will be re-computed (and cached) for that view
Given these, you can draw some conclusions about CouchDB performance:
There is never a re-index phase for the entire data set, just incremental per document update
Changing a view function forces re-building the entire index
Since both CouchDB and RDBMS must update the index for new data, it's reasonable to think performance will be similar for heavy update/insert usage.
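The incremental behaviour described in the points above can be modelled in a few lines of Python. This is a toy in-memory model, not CouchDB's actual B-tree implementation; the class name, the `map_calls` counter, and the sample documents are all invented for the sketch.

```python
class IncrementalView:
    """Toy view index that re-runs the map function only for changed documents."""

    def __init__(self, map_fn):
        self.map_fn = map_fn
        self.rows_by_doc = {}  # doc id -> rows emitted for that doc (the "cache")
        self.map_calls = 0

    def update(self, doc):
        # Only this document is re-mapped; every other cached result is kept.
        self.map_calls += 1
        self.rows_by_doc[doc["_id"]] = list(self.map_fn(doc))

    def rows(self):
        return sorted(row for rows in self.rows_by_doc.values() for row in rows)


def by_page(doc):
    yield doc["page"], 1


view = IncrementalView(by_page)
docs = [{"_id": f"d{i}", "page": f"/p/{i % 2}"} for i in range(4)]
for doc in docs:
    view.update(doc)  # initial build: one map call per document ("table scan")

view.update({"_id": "d0", "page": "/new"})  # one changed doc: one extra map call
print(view.map_calls, view.rows())
```

Changing the map function corresponds to creating a new `IncrementalView`, which starts with an empty cache and therefore has to scan every document again.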
Obviously YMMV and the standard cop-out, "you must test your own load" applies. However I will add a few more considerations.
I say RDBMS is flat out superior for exploratory-style querying your data. When you don't even know what questions to ask from your data, you really can't beat a language for querying that is structured.
However, once you define what you want to know, CouchDB (and perhaps Hadoop) provide the richest querying system because you are just writing code.
If your data set is large, NoSQL databases will scale more easily. For example, CouchDB-Lounge allows a cluster of couches for parallel processing. Hadoop does the same so then it would come down to secondary considerations: familiarity, maintainability, CouchDB is a web server but requires a bit more DIY; Hadoop internalizes more cluster management at the cost of complexity, foreignness, etc.
I hope that helps shed some light on your decision!
I am going to start on a new project. I need to deal with hundreds of gigabytes of data in a .NET application. It is too early to give much detail about this project. An overview follows:
Lots of writes and Lots of reads on same tables, very realtime
Scaling is very important as the client insists on expanding the database servers very frequently, and thus the application servers as well
Foreseeing, lots and lots of usage in terms of aggregate queries could be implemented
Each row of data may contain lots of attributes to deal with
I am suggesting the following as a solution:
Use distributed hash table sort of persistence (not S3 but an inhouse one)
Use Hadoop/Hive or the like (any replacement in .NET?) for any analytical process across the nodes
Implement the GUI in ASP.NET/Silverlight (with lots of ajaxification, wherever required)
What do you guys think? Am I making any sense here?
Are your goals performance, maintainability, improving the odds of success, being cutting edge?
Don't give up on relational databases too early. With a $100 external harddrive and sample data generator (RedGate's is good), you can simulate that kind of workload quite easily.
Simulate that workload on a non-relational or cloud database and you might end up writing your own tooling.
"Foreseeing, lots and lots of usage in terms of aggregate queries could be implemented"
This is the hallmark of a data warehouse.
Here's the trick with DW processing.
Data is FLAT. Facts and Dimensions. Minimal structure, since it's mostly loaded and not updated.
To do aggregation, every query must be a simple SELECT SUM() or COUNT() FROM fact JOIN dimension GROUP BY dimension attribute. If you do this properly so that every query has this form, performance can be very, very good.
Data can be stored in flat files until you want to aggregate. You then load the data people actually intend to use and create a "datamart" from the master set of data.
Nothing is faster than simple flat files. You don't need any complexity to handle terabytes of flat files that are (as needed) loaded into RDBMS datamarts for aggregation and reporting.
Simple bulk loads of simple dimension and fact tables can be VERY fast using the RDBMS's tools.
You can trivially pre-assign all PK's and FK's using ultra-high-speed flat file processing. This makes the bulk loads all the simpler.
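As a small illustration of the query shape described above (SUM/COUNT over a fact table joined to a dimension and grouped by a dimension attribute), here is a sketch using Python's built-in sqlite3. The table names `fact_sales` and `dim_region` and the data are invented; any RDBMS would accept essentially the same SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (
        region_id INTEGER REFERENCES dim_region (region_id),
        amount INTEGER
    );
""")

# Bulk-load a flat dimension and a flat fact table with pre-assigned keys.
conn.executemany("INSERT INTO dim_region VALUES (?, ?)", [(1, "north"), (2, "south")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", [(1, 10), (1, 5), (2, 7)])

# Every report has the same form: aggregate facts, grouped by a dimension attribute.
totals = conn.execute("""
    SELECT d.name, SUM(f.amount), COUNT(*)
    FROM fact_sales AS f
    JOIN dim_region AS d ON d.region_id = f.region_id
    GROUP BY d.name
    ORDER BY d.name
""").fetchall()
print(totals)
```

Because the keys are assigned before loading, the load itself is just two plain inserts with no lookups.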
Get Ralph Kimball's Data Warehouse Toolkit books.
Modern databases work very well with gigabytes. It's when you get into terabytes and petabytes that RDBMSes tend to break down. If you are foreseeing that kind of load, something like HBase or Cassandra may be what the doctor ordered. If not, spend some quality time tuning your database, inserting caching layers (memcached), etc.
"lots of reads and writes on the same tables, very realtime" - Is integrity important? Are some of those writes transactional? If so, stick with RDBMS.
Scaling can be tricky, but it doesn't mean you have to go with cloud computing stuff. Replication in DBMS will usually do the trick, along with web application clusters, load balancers, etc.
Give the RDBMS the responsibility to keep the integrity. And treat this project as if it were a data warehouse.
Keep everything clean. You don't need to use a lot of third-party tools: use the RDBMS tools instead.
I mean, use all the tools that the RDBMS has, and write a GUI that extracts all data from the DB using well-written stored procedures on a well-designed physical data model (indexes, partitions, etc).
Teradata can handle a lot of data and is scalable.