I have a CouchDB (v0.10.0) database that is 8.2 GB in size and contains 3890000 documents.
Now, I have the following as the map function of the view:
function(doc) { emit([doc.Status], doc); }
And it takes forever to load (4 hours and still no result).
Here's some extra information that might help describe the situation:
The view is not a temp view. The view was defined before the 3890000 documents were inserted.
There is nothing else running on the server. It is an Ubuntu box with nothing but the defaults installed.
I can see the CPU working hard (it sometimes spikes to 100%). Memory usage fluctuates as well, but it is not growing.
So my questions are:
What is actually happening in the background?
Is this a "one time" thing where I have to wait once and it will somehow work later?
Don't emit the whole doc. It's unnecessary. You can instead run your query with include_docs=true, which will let you access the document via each row's doc attribute.
When you emit the whole doc you make the index as large as, or larger than, your entire database. :)
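For example, here is a minimal sketch (the database, design doc, and view names are made up): emit null as the value,
function(doc) { emit([doc.Status], null); }
and then query with include_docs=true so CouchDB joins each document back in:
GET /mydb/_design/mydesign/_view/by_status?include_docs=true
Each row in the response then carries the full document under its doc attribute, while the index on disk stays small.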
Views are only updated the next time they are read. Upon reading, CouchDB processes all the documents that have been created, updated, or deleted since the last time the view was read.
So even if your view was defined before inserting the 3890000 documents, it will still have to process all 3890000 documents for the view.
From http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views
Note that by default views are not created and updated when a document is saved, but rather, when they are accessed. As a result, the first access might take some time depending on the size of your data while CouchDB creates the view. If preferable the views can also be updated when a document is saved using an external script that calls the views when updates have been made. An example can be found here: RegeneratingViewsOnUpdate
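If you would rather pay that indexing cost at a time of your choosing than on the first read, one common trick (a sketch; the names here are made up) is to hit the view with a cheap query after each bulk load, e.g. from a cron job:
GET /mydb/_design/mydesign/_view/by_status?limit=1
Any read triggers the incremental index update, so the real queries later only have to catch up on documents changed since then.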
Also just came across this tip, which might be useful if you're running on Ubuntu:
http://nosql.mypopescu.com/post/1299848121/couchdb-and-ubuntu-configuration-trick-for
Say I have a document "schema" that includes a show_from field that contains a timestamp as a Unix epoch. I then create a view keyed by this show_from date and only return those documents with a key on or before the current timestamp (per the request). Thus documents will appear in the view "passively", rather than from any update request.
Is it possible to use the CouchDB change API to monitor this change of view state, or would I have to poll the view to watch for changes? (My guess is the latter, because the change API seems only to be triggered by updates, but just for the sake of confirmation!)
The _changes feed can be filtered in a number of ways.
One of the ways of filtering the _changes feed is reusing a view's map function.
GET /[DB]/_changes?filter=_view&view=[DESIGN_DOC]/[VIEW_NAME]
Note:
For every _changes request, CouchDB is going to look at each change and run it through the filter function (or, in this case, the view's map function). Unlike map/reduce views, none of this work is cached for subsequent requests, so it can be quite taxing on resources unless the changeset is small.
For a large dataset (with many changes) it can be useful to bootstrap with the view, and only incrementally keep track of changes.
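A sketch of that bootstrap pattern, assuming your CouchDB version supports the update_seq=true view parameter (the names are made up):
GET /mydb/_design/mydesign/_view/by_show_from?update_seq=true
The response then includes an update_seq field alongside the rows; feed that into _changes so you only filter changes made after the bootstrap:
GET /mydb/_changes?since=1234&filter=_view&view=mydesign/by_show_from&feed=longpoll
where 1234 stands for the update_seq value returned by the view query.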
Additional info:
Using _changes you can poll for changes since a given sequence point, for the latest N changes, etc. You can also use long polling, or a continuous feed. As long as the changeset to consider (and filter through) is small, it makes sense to use _changes.
But if the view is itself ordered chronologically, as seems to be the case here, it may be pointless to use _changes. Just query the view.
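For instance, if the view is keyed by the show_from epoch (hypothetical design doc and view names), each poll is just:
GET /mydb/_design/shows/_view/by_show_from?endkey=1300000000
with endkey set to the current Unix timestamp at request time; every row at or before that key is currently visible, no filtering required.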
I am trying to enter the Couchbase world and am learning about views.
Several times in presentations and demos I have heard that it's bad to return the whole doc from a view:
emit(meta.id, doc);
My question is: why? What should I return instead, and how can I get at the proper values of the document?
It's a bad idea because it's actually counterproductive. Writing a document to the view means it will be stored on disk with the view index itself. You pay the IO price for writing the document to disk again (a duplicate of the original key/value doc), and you pay it again for reading it at query time. Because view queries are served from disk (or the file system cache), you will never take advantage of the integrated cache layer to retrieve the document faster. In short, on average it will be faster to get the document ID from the view and retrieve the document by ID than it is to just read the whole document from the view. This is especially true for operations on multiple documents.
It's bad because it's a large drain on resources: views will often update and overwrite their indices, so if you are writing a whole doc repeatedly, it's going to require a large amount of processor time and disk I/O (along with filesystem cache).
Therefore, it is recommended (and far more efficient) to have the view return the doc.id and then use the standard get procedure to return the whole doc.
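As a minimal sketch of that pattern, the map function emits no value at all:
function (doc, meta) { emit(meta.id, null); }
Each query row still carries its document id, so you fetch only the documents you actually need with ordinary key/value gets, which are served through the integrated cache rather than from the view files on disk.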
I have a CouchDB database containing ~20M documents. It takes ~12h to build a single view.
I have saved 6 views successfully. They returned results quickly, at first.
After 2 days idle, I added another view. It took much longer to build, and it was a "nice-to-have", not a requirement, so I killed it after ~60% completion (restarted the Windows service).
My other views now start rebuilding their indexes when accessed.
Really frustrated.
Additional info: the disk had gotten to within 65 GB of full (1 TB local disk).
Sorry, you have no choice but to wait for the views to rebuild here. However, I will try to explain why this is happening. It won't solve your problem, but perhaps it will help you understand what is happening and how to prevent it in the future.
From the wiki
CouchDB view index filenames are based on the contents of the design document (not its name, ID or revision). This means that two design documents with identical view code will share view index files.
It follows that if you change the contents of a design document, by adding a new view or updating an existing one, CouchDB will rebuild the indexes.
So I think the most obvious solution is to add new views in new design docs. It will prevent re-indexing of existing views, and the new one will take whatever time it needs to index anyway.
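In other words (a sketch with made-up names), give each view its own design document so that its index file is independent:
{ "_id": "_design/by_status", "views": { "by_status": { "map": "function(doc) { emit(doc.status, null); }" } } }
{ "_id": "_design/by_created", "views": { "by_created": { "map": "function(doc) { emit(doc.created, null); }" } } }
Adding or changing _design/by_created now leaves the index files for _design/by_status untouched.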
Here is another helpful answer that throws light on how to effectively use CouchDB design documents and views.
I've got just over 10,000,000 records in the database of my component, and I think getItems/getListQuery is trying to load every single one of them into memory. The search form on the site is extremely slow, or comes back saying PHP is out of memory.
phpMyAdmin seems to be able to handle displaying this data - why not Joomla?
The strange thing is that the items are then displayed correctly using the globally set list limit of 5 to a page.
I've just looked and Joomla's cache is disabled - is that screwing me up here?
Many thanks in advance!
I fixed it in the end by copying getPagination, getTotal, getItems etc. from the library's list.php into my model (to override them). Then, in each method, I made sure the results were returned instead of being sent to the cache.
The getTotal function seems to count the number of fetched rows instead of doing a separate COUNT(*) query. That's OK with a few thousand records, but over half a million is asking for trouble!
I am using GWT 2.4. There are times when I have to show a huge number of records, for example 50,000, on my screen in a grid table or flex table. It takes very long to load that screen, say around 30 minutes or so; ultimately the screen hangs, or IE displays an error saying that the script might take too long, the application will stop working, and asking whether I wish to continue.
Is there any solution to improve GWT performance?
Don't bring all the data at once; you should bring it in pages, as the comments here suggested.
However, paging may not be trivial, since your DB may be filled with more entries while you page, and if you're using some sorting for the results, the new entries might ruin your ordering (for example, when trying to fetch page #2, some entries that should have been on the first page are inserted).
You may decide to create some sort of "cursor" for paging purposes; it will reflect the state of your database at the point you created it, so you will ignore entries that are inserted while traversing between pages.
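As a sketch of that cursor idea in plain JavaScript (the endpoint and parameter names are made up; the exact shape depends on your backend), remember the sort key of the last row received and anchor the next request to it, so newly inserted rows can no longer shift pages you have already fetched:
// Keyset-style paging: each request is anchored to the last key the
// client has actually seen, rather than to a page number.
let lastKey = null; // sort key of the last row received; null for the first page

async function nextPage(pageSize) {
  const params = new URLSearchParams({ limit: String(pageSize) });
  if (lastKey !== null) params.set("after", lastKey); // hypothetical parameter
  const response = await fetch("/records?" + params); // hypothetical endpoint
  const rows = await response.json();
  if (rows.length > 0) lastKey = rows[rows.length - 1].sortKey;
  return rows;
}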
Another option you may consider, as part of paging, is providing only a small version of each record, i.e. only the most important details, and letting the user double-click to see the full details for a record. This can also give you some performance improvement within each page.