My CouchDB view is rebuilding for no reason

I have a CouchDB database containing ~20M documents; it takes ~12h to build a single view.
I had saved 6 views successfully, and at first they returned results quickly.
After 2 days idle, I added another view. It took much longer to build, and since it was a "nice-to-have" rather than a requirement, I killed it at ~60% completion (by restarting the Windows service).
My other views now start rebuilding their indexes when accessed.
Really frustrated.
Additional info: the disk had gotten within 65GB of full (1TB local disk).

Sorry, you have no choice but to wait for the views to rebuild here. However, I will try to explain why this is happening. It won't solve your problem, but perhaps it will help you understand what is going on and how to prevent it in future.
From the wiki:
CouchDB view index filenames are based on the contents of the design document (not its name, ID or revision). This means that two design documents with identical view code will share view index files.
It follows that if you change the contents of a design document, by adding a new view or updating an existing one, CouchDB will rebuild the indexes for every view in that document.
So I think the most obvious solution is to add new views in new design docs (see the sketch below). That will prevent re-indexing of existing views, and the new view will take whatever time it needs to index anyway.
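For example, instead of appending the new view to an existing design doc, give it a design doc of its own. The doc names and view code here are hypothetical placeholders:

{
  "_id": "_design/by_status",
  "views": {
    "by_status": { "map": "function (doc) { emit(doc.Status, null); }" }
  }
}

{
  "_id": "_design/nice_to_have",
  "views": {
    "by_created": { "map": "function (doc) { emit(doc.created_at, null); }" }
  }
}

The first design doc's contents never change, so its index files stay valid; only the second one has to build from scratch.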
Here is another helpful answer that sheds light on how to use CouchDB design documents and views effectively.

Related

DSpace: Items only appear in Discovery after moving to another collection

I moved all the items from one collection to another. However, these items didn't appear in Discovery for the source collection, while after the move the same items did appear in the destination collection. Why didn't these items appear in the source collection before the move?
Still before the move: if I took one of the item's handles and accessed it in a browser, it worked. Could this be a problem with the Discovery index?
There could be 2 causes for this issue:
The items need to be re-indexed. Depending on how the move was performed, the index may not have been updated.
If you are using XMLUI, the Cocoon cache needs to be cleared.
Here is my recommendation.
Since this is quick, first clear the Cocoon cache from the Admin -> Control Panel -> Java Information page.
If that does not resolve the issue, re-build your Discovery index by running [dspace-install]/bin/dspace index-discovery -b
The re-index can take a while to complete. User search results will be impacted during the re-index process.
In addition to what terrywb said in his answer to this question, in order for automatic re-indexing to work, these things also need to be done:
The "discovery" event consumer must be enabled in your dspace.cfg
The solr data directory for the discovery index ([dspace]/solr/search/data) needs to be owned by the same user that tomcat runs under, so that the tomcat user can add/change/delete files and subdirectories
Automatic re-indexing should be triggered whenever you move items through the user interface or via bulk metadata editing.
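For reference, the relevant dspace.cfg lines look roughly like the following; the exact consumer list varies by DSpace version and installation, so treat this as indicative rather than canonical:

# dspace.cfg: "discovery" must appear in the default dispatcher's consumer list
event.dispatcher.default.class = org.dspace.event.BasicDispatcher
event.dispatcher.default.consumers = versioning, discovery, eperson, harvester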
Honestly, we've been through this before -- it would be helpful if you could give us more information on your original question rather than posting a new one.

CouchDB views inaccessible while updating

Sorry I couldn't think of a more descriptive title: we have an issue with updating CouchDB views, since they're inaccessible while the design doc is being reindexed. Is the only solution to allow stale views?
In one scenario, there are several CouchDB nodes which replicate with each other. Updating a view in one will cause all CouchDB nodes to reindex the design doc. Is it not possible to update the view on one node and then replicate out the result? I assume the issue there is that new docs could be inserted into other nodes while that one is reindexing.
In another scenario, we have several CouchDB nodes which are read/write and replicate with each other. For web apps, there's another cluster with read-only CouchDB nodes; they don't replicate out, but are replicated to from the read/write pool. A solution here could be to take a node out of the cluster, update the view, and wait for it to reindex. However, won't that node be missing any documents that were created during reindexing? Is it possible for it to continue receiving document inserts while reindexing?
Are there other possible solutions? We're migrating to the second scenario, so that's what I'm primarily concerned with, but I'm wondering if there's a general solution for either case. Using stale views isn't ideal, since reindexing can take a long time and it's a high-traffic site.
It's great to hear that you are having success with CouchDB.
I suggest you use the staging-and-upgrade technique described in the wiki. It requires a little preparation to get working, but once it is in place it works very well without any human effort. A sketch follows.
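Here is a minimal sketch of that flow, assuming Node 18+ for the built-in fetch; the database, design doc, and view names are placeholders, and error handling (including revision conflicts when re-deploying over an old staging doc) is omitted:

const COUCH = "http://localhost:5984/mydb";

async function deployDesignDoc(newDoc) {
  // 1. Upload the new view code under a staging name.
  await fetch(COUCH + "/_design/app_staging", {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(newDoc),
  });

  // 2. Query any view in it. This request blocks until the staging
  //    index is built, while the live design doc keeps serving reads.
  await fetch(COUCH + "/_design/app_staging/_view/by_status?limit=1");

  // 3. Copy the staging doc over the live one. View index files are
  //    keyed on the design doc's contents, so the live doc immediately
  //    reuses the index built in step 2.
  const live = await fetch(COUCH + "/_design/app");
  const dest = live.ok
    ? "_design/app?rev=" + (await live.json())._rev
    : "_design/app";
  await fetch(COUCH + "/_design/app_staging", {
    method: "COPY",
    headers: { Destination: dest },
  });
}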

Handling passive deletion updates (i.e. archiving instead of deleting)

We are developing an application based on DDD principles. We have encountered a couple of problems so far that we can't solve ourselves, nor can we find answers to them on the Internet.
Our application is intended to be a cloud application serving multiple companies.
One of the demands is that there are no physical deletions from the database. We delete only passively, by setting an entity's Active property to false. That takes care of Select, Insert and Delete operations, but we don't know how to handle Update operations.
An update changes property values, but it also means past values are lost, and there are many reasons we don't want that; the primary one is accounting.
If we implement every update as "archive old values, then create new values", we end up with a great number of duplicates. For example, a Company has Branches, and Company is the aggregate root for Branches. If I change a Company's phone number, I have to archive the old Company and all of its Branches and create a completely new Company with Branches, just for one property. This may look acceptable at first, but over time the duplicated values can clog up the database. A phone number is perhaps a trivial property, but changing the Address (say the street name changed while the company stayed in the same physical location) is a far more serious problem.
Currently we are using ASP.NET MVC with EF Code First for the repository, but one of the demands is that we can easily switch to, or add, another technology like WPF or WCF. We use AutoMapper to map DTOs to domain entities and vice versa, and DTOs are the primary source for views, i.e. we have no view models. The application is layered according to DDD principles, and mapping occurs in the service layer.
Another demand is that we mustn't create an initial entity in the database and then fill in the values; an entire aggregate should be stored as a whole.
Any comments or suggestions are appreciated.
We also welcome any changes to the demands (as this is an internal project, not one for a customer) and to the architecture, but only if absolutely necessary.
Thank you.
Have you ever come across event sourcing? Sounds like it could be of use if you're interested in tracking the complete history of aggregates.
To be honest, I would create another table to act as a change log, inserting the old and deleted records into it before updating the live data. Yes, you are creating a lot of records, but you are keeping this history separate from the live records and keeping the live data as lean as possible.
When it comes to clean-up and backup, you then have your live data on one side and your changed/deleted data on the other, so you can routinely back up and trim the old changed/deleted data, reducing its size according to how long you have agreed to keep it available with the supplier or business you are working with.
I think this is the best way to go, as your core functionality will be working on a leaner dataset, and I'm assuming your users won't want to check revisions and deletions of records all the time. By separating the data, you access the history only when it is needed, instead of all the time because everything is intermingled. A minimal sketch of the idea follows.
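A minimal, in-memory sketch of the archive-before-update idea; the entity shape and names are made up, and in a real system the change log would be a separate table written in the same transaction as the update:

// Live records and their change log, kept separate.
const companies = new Map();
const changeLog = [];

function updateCompany(id, changes, userId) {
  const current = companies.get(id);
  if (!current) throw new Error("no company with id " + id);
  // Snapshot the old version into the log first...
  changeLog.push({ ...current, archivedAt: new Date(), archivedBy: userId });
  // ...then apply the changes to the live record only.
  companies.set(id, { ...current, ...changes });
}

// Usage: only the live row changes; history accumulates in changeLog.
companies.set(1, { id: 1, name: "Acme", phone: "555-0100", active: true });
updateCompany(1, { phone: "555-0199" }, "admin");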

Data synchronization from a relational database to CouchDB

I need to synchronize my relational database (Oracle or MySQL) to CouchDB. Does anyone have any idea how this is possible? If it is possible, how can we notify CouchDB of any changes that happen in the relational DB?
Thanks in advance.
First of all, you need to change the way you think about database modeling. Synchronizing to CouchDB is not just a matter of creating documents from all your tables and pushing them to Couch.
I'm using CouchDB for a site in production, I'll describe what I did, maybe it will help you:
From the start, we had been using MySQL as our primary database, with entities mapped out, including their relations. In an attempt to speed up the front-end, I decided to use CouchDB as a content repository. The benefit was having fully prepared documents containing all the relational data, so data could be fetched with much less overhead.
Because the documents can contain related entities (say, a question document that contains all its answers), I first decided which top-level entities I wanted to push to Couch. In my example, only questions would be pushed, and those documents would contain the answers and possibly some metadata, such as tags, user info, etc. When requesting a question on the frontend, I would only need to fetch one document to have all the information needed at that point.
Now for your second question: how to notify CouchDB of changes. In our case, all changes to our data are made through a CMS. I have a single point in my code which all edit actions call, and that's where I hooked in a function that persists the object being saved to CouchDB. The function determines whether the object needs persisting (i.e., is it a top-level entity), creates a document from it (think of some sort of toArray function), and fetches all its relations, recursively. The complete document is then pushed to CouchDB.
In your case the variables may be completely different, but the basic idea is the same: figure out which documents you want saved and what they look like, then write a function that composes these documents and make sure it is called whenever your relational database changes. A rough sketch follows.
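A rough sketch of such a compose-and-push function; the schema, the "sql" query helper, and the document shape are all hypothetical placeholders:

// Build one self-contained CouchDB document from relational rows and
// push it. sql(query, params) stands in for whatever query interface
// you use; assume it returns one row or a list as appropriate.
async function pushQuestion(sql, couchUrl, questionId) {
  const question = await sql("SELECT * FROM questions WHERE id = ?", [questionId]);
  const answers = await sql("SELECT * FROM answers WHERE question_id = ?", [questionId]);

  const doc = {
    _id: "question:" + questionId,
    type: "question",
    title: question.title,
    body: question.body,
    answers: answers,        // related entities embedded directly
  };

  // Upsert: reuse the current _rev if the document already exists.
  const existing = await fetch(couchUrl + "/" + doc._id);
  if (existing.ok) doc._rev = (await existing.json())._rev;

  await fetch(couchUrl + "/" + doc._id, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(doc),
  });
}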
Notifying CouchDB of a change
CouchDB is very simple. Probably the easiest thing is to directly update an existing document. Two ways to implement this come to mind:
The easiest way is a normal CouchDB update: fetch the current document by id, modify it, then send it back to Couch with an HTTP PUT or POST.
If you have clear application-specific changes (e.g. "the views value was incremented"), then writing an _update function seems prudent (example below). Update functions are very simple: they receive an HTTP request and a document, they modify the document, and then CouchDB stores the new version. You write update functions in JavaScript and they run on the server. It is a great way to "compress" common actions into simpler (and fewer) HTTP queries.
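For instance, a minimal update handler for the "views value was incremented" case; the design doc and handler names here are made up:

{
  "_id": "_design/app",
  "updates": {
    "bump_views": "function (doc, req) { if (!doc) { return [null, 'missing']; } doc.views = (doc.views || 0) + 1; return [doc, 'ok']; }"
  }
}

It is invoked with a single HTTP request, no prior GET required:

POST /mydb/_design/app/_update/bump_views/<docid>

CouchDB passes in the current document, stores the modified version returned as the first element of the array, and sends the second element ('ok' here) back as the response body.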

CouchDB view is extremely slow

I have a CouchDB (v0.10.0) database that is 8.2 GB in size and contains 3890000 documents.
Now, I have the following as the map function of the view:
function (doc) { emit([doc.Status], doc); }
And it takes forever to load (4 hours and still no result).
Here's some extra information that might help describing the situation:
The view is not a temp view. The view was defined before the 3890000 documents were inserted.
There isn't anything on the server. It is a ubuntu box with nothing but the defaults installed.
I can see the CPU working hard (it sometimes shoots to 100%). Memory usage fluctuates as well, but is not increasing.
So my question is:
What is actually happening in the background?
Is this a "one time" thing where I have to wait once and it will somehow works later?
Don't emit the whole doc; it's unnecessary. You can instead run your query with include_docs=true, which will let you access the document via each row's doc attribute (see the example below).
When you emit the whole doc, you make the index as large as or larger than your entire database. :)
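The slimmed-down map function would look like this; only the keys go into the index:

function (doc) {
  emit([doc.Status], null);
}

Then query it with something like the following, assuming the view is named by_status and lives in a design doc named app (placeholders):

GET /mydb/_design/app/_view/by_status?include_docs=true

Each row then carries the full document in its doc field, without it being copied into the index.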
Views are only updated the next time they are read. Upon reading, CouchDB processes all the documents that have been created, updated, or deleted since the last time the view was read.
So even though your view was defined before the 3890000 documents were inserted, the first read still has to process all 3890000 documents for the view.
From http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views
Note that by default views are not created and updated when a document is saved, but rather, when they are accessed. As a result, the first access might take some time depending on the size of your data while CouchDB creates the view. If preferable the views can also be updated when a document is saved using an external script that calls the views when updates have been made. An example can be found here: RegeneratingViewsOnUpdate
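In that spirit, a tiny script that keeps indexes warm by touching each view after a batch of writes might look like the following; the URL and view names are placeholders, limit=0 makes CouchDB update the index without returning rows, and it assumes Node 18+ run as an ES module so top-level await works:

const COUCH = "http://localhost:5984/mydb";
const views = [["app", "by_status"]];   // [designdoc, viewname] pairs

for (const [ddoc, view] of views) {
  await fetch(COUCH + "/_design/" + ddoc + "/_view/" + view + "?limit=0");
}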
Also just came across this tip, which might be useful if you're running on Ubuntu:
http://nosql.mypopescu.com/post/1299848121/couchdb-and-ubuntu-configuration-trick-for