CouchDB views inaccessible while updating

Sorry I couldn't think of a more descriptive title: we have an issue with updating CouchDB views, since a view is inaccessible while its design doc is being reindexed. Is the only solution to allow stale views?
In one scenario, there are several CouchDB nodes which replicate with each other. Updating a view on one node will cause all of the nodes to reindex the design doc. Is it not possible to update the view on one node and then replicate out the result? I assume the issue there is that new docs could be inserted into the other nodes while the first one is reindexing.
In another scenario, we have several CouchDB nodes which are read/write and replicate with each other. For web apps, there's another cluster of read-only CouchDB nodes; they don't replicate out, but are replicated to from the read/write pool. A solution here could be to take a node out of the cluster, update the view, and wait for it to reindex. However, won't that node be missing any documents that were created during reindexing? Is it possible for it to continue receiving document inserts while reindexing?
Are there other possible solutions? We're migrating to the second scenario, so that's what I'm primarily concerned with, but I'm wondering if there's a general solution for either case. Using stale views isn't ideal, since reindexing can take a long time and this is a high-traffic site.

It's great to hear that you are having success with CouchDB.
I suggest you use the staging-and-upgrade technique described in the wiki. It requires a little preparation to get working, but once it is in place it runs very well without any human effort.
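In case the wiki is hard to find, here is a minimal sketch of that technique in Python with the requests library; the database, design doc, and view names are placeholders I made up:

    # A minimal sketch of staging-and-upgrade for CouchDB views.
    # Database name, design doc names, and the view name are made up.
    import requests

    COUCH = "http://localhost:5984"
    DB = "mydb"  # hypothetical database name

    def stage_and_upgrade(new_design_doc):
        # 1. Upload the new code under a staging name, so the live
        #    design doc keeps serving queries while the index builds.
        staging = f"{COUCH}/{DB}/_design/myapp_staging"
        resp = requests.get(staging)
        if resp.ok:
            new_design_doc["_rev"] = resp.json()["_rev"]
        requests.put(staging, json=new_design_doc).raise_for_status()

        # 2. Query one of the staged views; CouchDB builds the whole
        #    index for the design doc on first access, and this call
        #    blocks until indexing finishes.
        requests.get(f"{staging}/_view/by_date", params={"limit": 0})

        # 3. COPY the staged doc over the production one. Because view
        #    index files are keyed on the design doc's contents, the
        #    production doc immediately reuses the freshly built index.
        live = f"{COUCH}/{DB}/_design/myapp"
        rev = requests.get(live).json().get("_rev")
        dest = f"_design/myapp?rev={rev}" if rev else "_design/myapp"
        requests.request("COPY", staging, headers={"Destination": dest}).raise_for_status()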

Related

Is Elasticsearch optimized for inserts?

I develop for a relatively large online store with a PHP backend, and it uses Elasticsearch for a few things (like text search, logging, etc.).
Now I'd like to start storing all kinds of information about user activity in ES. For instance, every page view (e.g. a user entering a product page or a category page).
Is ES optimized for such a heavy load of continuous inserts, or should I consider an alternative, like some sort of buffer layer where I store all of my immediate inserts in memory and then, every minute or so, insert them into ES in bulk?
What is the industry standard? Or am I worrying in vain and ES is optimized for that?
Thanks.
Elasticsearch, when properly sized to handle your load, is definitely a valid choice for such a use case.
You might decide, however, to store that streaming data in a separate cluster from your production cluster, so as not to impact the health of the production cluster too much.
There are a lot of variables involved in arriving at the correct decision, and we don't have enough information here, but it's definitely a valid approach.
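To your buffering question: batching writes through the bulk API is the usual way to absorb a continuous insert stream. Below is a minimal sketch with the official Python client, where the index name, event shape, and flush threshold are all assumptions:

    # A minimal sketch of the buffer-then-bulk approach. The index name,
    # event fields, and the flush threshold are made up for illustration.
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch("http://localhost:9200")
    buffer = []

    def track(event):
        # Accumulate events in memory instead of indexing one at a time.
        buffer.append({"_index": "pageviews", "_source": event})
        if len(buffer) >= 1000:      # flush by size...
            flush()

    def flush():
        global buffer
        if buffer:
            bulk(es, buffer)         # ...one bulk request instead of N singles
            buffer = []

    track({"user_id": 42, "page": "/product/123", "ts": "2024-01-01T00:00:00Z"})
    flush()  # in a real system, also flush on a timer and at shutdown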

Making Elasticsearch and BigQuery work together

I have a web app that displays analysis data in the browser, with Elasticsearch as the backend data store.
Everything was cool while Elasticsearch was handling about 1 TB of data and search queries were blazing fast.
Then came the decision to add data from all services into the app, close to a petabyte, and we switched to BigQuery (yes, we abandoned Elasticsearch and started querying BigQuery directly).
Now the users of my app are complaining that their queries are slow, taking anywhere from 4 to 15 seconds, where they used to return in under a second.
Naturally the huge amount of data is to blame, but I am wondering if there is a way to bring Elasticsearch back into the game and make Elasticsearch and BigQuery play together nicely, so that I can get the petabytes of storage from BigQuery but still retain the lightspeed search of Elasticsearch.
I am sure I am not the first one to face this issue; rather, I believe I am a bit late to the BigQuery party, so I should be able to reap the benefits of a delayed entry by finding all the problems already solved.
Thanks in advance if you can point me in the right direction.
This is a common pattern I see deployed by customers:
Use Elasticsearch to display results from the latest day/week, whatever fits within Elasticsearch's RAM.
Use BigQuery for everything else.
In this way your users will get sub-second results for 90% of their queries, and they can still fall back to BigQuery whenever Elasticsearch can't answer from the data it holds.
I'm not sure what your users' interface for getting the data is, but that's where this routing logic would need to be deployed; a sketch follows below.
(of course, expect improvements in the connections and speed as tech progresses)
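For illustration, here is a rough sketch of that routing logic in Python; the client setup, index and table names, and query shapes are all assumptions, not a drop-in implementation:

    # A minimal sketch: serve recent queries from Elasticsearch, fall back
    # to BigQuery for older data. All names here are hypothetical.
    from datetime import datetime, timedelta, timezone
    from elasticsearch import Elasticsearch
    from google.cloud import bigquery

    es = Elasticsearch("http://localhost:9200")
    bq = bigquery.Client()
    HOT_WINDOW = timedelta(days=7)   # whatever fits in Elasticsearch's RAM

    def search(term, since):
        if datetime.now(timezone.utc) - since <= HOT_WINDOW:
            # Hot path: sub-second full-text search on the recent slice.
            resp = es.search(index="events", query={"match": {"message": term}})
            return [hit["_source"] for hit in resp["hits"]["hits"]]
        # Cold path: scan the full history in BigQuery; slower but complete.
        job = bq.query(
            "SELECT * FROM `myproject.mydataset.events` "
            "WHERE message LIKE @pat AND ts >= @since",
            job_config=bigquery.QueryJobConfig(
                query_parameters=[
                    bigquery.ScalarQueryParameter("pat", "STRING", f"%{term}%"),
                    bigquery.ScalarQueryParameter("since", "TIMESTAMP", since),
                ]
            ),
        )
        return [dict(row) for row in job.result()]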

DSpace: Items only appear in Discovery after moving to another collection

I moved all the items from one collection to another. Before the move, these items didn't appear in Discovery for the source collection. After the move, the same items did appear in Discovery for the destination collection. Why didn't these items appear in the source collection before the move?
Also from before the move: if I take one of these items' handles and access it directly in a browser, it works. Could this be a problem with the Discovery index?
There could be two causes for this issue:
1. The items need to be re-indexed. Depending on how the move was performed, the index may not have been updated.
2. If you are using XMLUI, the Cocoon cache needs to be cleared.
Here is my recommendation.
Since it is quick, first clear the Cocoon cache from the Admin -> Control Panel -> Java Information page.
If that does not resolve the issue, rebuild your Discovery index by running [dspace-install]/bin/dspace index-discovery -b
The re-index can take a while to complete, and user search results will be affected while it runs.
In addition to what terrywb said in his answer to this question, the following things also need to be done for automatic re-indexing to work:
The "discovery" event consumer must be enabled in your dspace.cfg (see the snippet below).
The Solr data directory for the Discovery index ([dspace]/solr/search/data) needs to be owned by the same user that Tomcat runs under, so that the Tomcat user can add/change/delete files and subdirectories.
Automatic re-indexing should then be triggered whenever you move items through the user interface or via bulk metadata editing.
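For reference, the consumer list in dspace.cfg looks roughly like the line below on a stock install; the exact set of consumers varies by DSpace version, so treat this as a sketch rather than your actual config:

    # dspace.cfg: "discovery" must appear in the default consumer list
    event.dispatcher.default.consumers = versioning, discovery, eperson, harvester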
Honestly, we've been through this before; it would be helpful if you could give us more information on your original question rather than posting a new one.

My CouchDB view is rebuilding for no reason

I have a CouchDB database containing ~20M documents. It takes ~12h to build a single view.
I had saved 6 views successfully. They returned results quickly, at first.
After 2 days idle, I added another view. It took much longer to build, and since it was a nice-to-have, not a requirement, I killed it at ~60% completion (by restarting the Windows service).
My other views now start rebuilding their indexes when accessed.
Really frustrated.
Additional info: the disk had gotten to within 65 GB of full (1 TB local disk).
Sorry, you have no choice but to wait for the views to rebuild here. However, I will try to explain why this is happening. It won't solve your problem, but perhaps it will help you understand what is happening and how to prevent it in the future.
From the wiki
CouchDB view index filenames are based on the contents of the design document (not its name, ID or revision). This means that two design documents with identical view code will share view index files.
It follows that if you change the contents of a design document, by adding a new view or updating an existing one, CouchDB will rebuild the indexes for all views in that document.
So I think the most obvious solution is to add new views in new design docs. That prevents re-indexing of the existing views, and the new view will take whatever time it needs to index anyway.
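For example, a minimal sketch with Python's requests library; the database, design doc name, and the map function are made up:

    # Add the new view in its OWN design document, so the existing design
    # docs (and their index files) stay untouched.
    import requests

    COUCH = "http://localhost:5984/mydb"  # hypothetical database

    new_design = {
        "_id": "_design/reports_v2",      # a new doc, not an edit of an old one
        "language": "javascript",
        "views": {
            "by_created": {
                # Map functions are plain JavaScript stored as strings.
                "map": "function (doc) { if (doc.created) emit(doc.created, null); }"
            }
        },
    }
    requests.put(f"{COUCH}/{new_design['_id']}", json=new_design).raise_for_status()

    # Only this new design doc indexes from scratch; views in the existing
    # design docs keep serving from their current index files.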
Here is another helpful answer that throws light on how to use CouchDB design documents and views effectively.

Data synchronization from a relational database to CouchDB

I need to synchronize my relational database (Oracle or MySQL) to CouchDB. Does anyone have any idea how this is possible? If it is possible, how can we notify CouchDB of any changes that happen in the relational DB?
Thanks in advance.
First of all, you need to change the way you think about database modeling. Synchronizing to CouchDB is not just a matter of creating a document for each of your table rows and pushing them to Couch.
I'm using CouchDB for a site in production, I'll describe what I did, maybe it will help you:
From the start, we had been using MySQL as our primary database. I had entities mapped out, including their relations. In an attempt to speed up the front-end, I decided to use CouchDB as a content repository. The benefit was having fully prepared documents that contained all the relational data, so data could be fetched with much less overhead.
Because the documents can contain related entities - say, a question document that contains all its answers - I first decided what top-level entities I wanted to push to Couch. In my example, only questions would be pushed to Couch, and those documents would contain the answers and possibly some metadata, such as tags, user info, etc. When requesting a question on the frontend, I would only need to fetch one document to have all the information I need at that point.
Now for your second question: how to notify CouchDB of changes. In our case, all changes to our data are made through a CMS. I have a single point in my code which all edit actions call. That's the place where I hooked in a function that persists the object being saved to CouchDB. The function determines whether this object needs persisting (i.e. is it a top-level entity), then creates a document from this object (think of some sort of toArray function) and fetches all of its relations, recursively. The complete document is then pushed to CouchDB.
Now, in your case the variables may be completely different, but the basic idea is the same: figure out what documents you want saved and what they should look like, then write a function that composes these documents and make sure it is called whenever changes are made to your relational database.
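To make that concrete, here is a minimal sketch of the compose-and-push idea in Python with the requests library; the schema, field names, and the to_document() shape are invented for illustration:

    # Denormalize a top-level entity plus its relations into one document,
    # then push it to CouchDB. All names here are hypothetical.
    import requests

    COUCH = "http://localhost:5984/content"

    def to_document(question, answers, tags):
        # Fold the related rows into one self-contained document.
        return {
            "_id": f"question:{question['id']}",
            "type": "question",
            "title": question["title"],
            "body": question["body"],
            "tags": tags,                 # e.g. ["couchdb", "mysql"]
            "answers": [
                {"body": a["body"], "author": a["author"]} for a in answers
            ],
        }

    def persist(question, answers, tags):
        # Called from the single edit hook in the CMS after a write commits.
        doc = to_document(question, answers, tags)
        current = requests.get(f"{COUCH}/{doc['_id']}")
        if current.ok:
            doc["_rev"] = current.json()["_rev"]   # update, don't conflict
        requests.put(f"{COUCH}/{doc['_id']}", json=doc).raise_for_status()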
Notifying CouchDB of a change
CouchDB is very simple. Probably the easiest thing is directly updating an existing document. Two ways to implement this come to mind:
The easiest way is a normal CouchDB update: Fetch the current document by id; modify it; then send it back to Couch with HTTP PUT or POST.
If you have clear, application-specific changes (e.g. "the views value was incremented"), then writing an _update function seems prudent. Update functions are very simple: they receive an HTTP request and a document; they modify the document; and then CouchDB stores the new version. You write update functions in JavaScript and they run on the server. It is a great way to "compress" common actions into simpler (and fewer) HTTP requests.
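Here is a minimal sketch of both approaches in Python with the requests library; the database, document id, and handler names are made up:

    import requests

    COUCH = "http://localhost:5984/mydb"  # hypothetical database

    # 1. Normal update: GET the current revision, modify, PUT it back.
    doc = requests.get(f"{COUCH}/some_doc_id").json()
    doc["views"] = doc.get("views", 0) + 1
    requests.put(f"{COUCH}/some_doc_id", json=doc).raise_for_status()

    # 2. Update handler: the JavaScript lives in a design doc on the server
    #    (add a _rev to this PUT if the design doc already exists)...
    design = {
        "_id": "_design/stats",
        "updates": {
            # Signature fixed by CouchDB: function (doc, req) -> [newDoc, response]
            "bump_views": (
                "function (doc, req) {"
                "  doc.views = (doc.views || 0) + 1;"
                "  return [doc, 'bumped'];"
                "}"
            )
        },
    }
    requests.put(f"{COUCH}/_design/stats", json=design).raise_for_status()

    # ...and a single request applies the change, no client-side GET needed.
    requests.put(f"{COUCH}/_design/stats/_update/bump_views/some_doc_id")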
