CouchDB change API on view - view

Say I have a document "schema" that includes a show_from field that contains a timestamp as a Unix epoch. I then create a view keyed by this show_from date and only return those documents with a key on or before the current timestamp (per the request). Thus documents will appear in the view "passively", rather than from any update request.
Is it possible to use the CouchDB change API to monitor this change of view state, or would I have to poll the view to watch for changes? (My guess is the latter, because the change API seems only to be triggered by updates, but just for the sake of confirmation!)

The _changes feed can be filtered in a number of ways.
One of the ways of filtering the _changes feed is reusing a view's map function.
GET /[DB]/_changes?filter=_view&view=[DESIGN_DOC]/[VIEW_NAME]
Note:
For every _changes request, CouchDB is going to look at each change and run it over the filter function (or in this case the view's map function). None of this is cached for subsequent requests (as on mapreduce views). So it can be quite taxing on resources, unless the changeset is small.
For a large dataset (with many changes) it can be useful to bootstrap with the view, and only incrementally keep track of changes.
Additional info:
Using _changes you can poll for changes since a given sequence point, for the latest N changes, etc. You can also use long polling, or a continuous feed. As long as the changeset to consider (and filter through) is small, it makes sense to use _changes.
But if the view is itself ordered chronologically, as it seems to be your case, it may be pointless to use changes. Just query the view.

Related

Transform data before adding to cache system or after when reading it

I have this situation, which I don't know which could fit better.
I have this solution where I search for soccer players, I only have their names and teams, but when a user comes to my website and clicks on the player I will use detailed information from the player that I get from various external providers (usually based by country).
I know which external provider to use when a call is done, and I pay to the external providers each time I grab data, so to mitigate this, I will try to get the less times possible, so I grab once a user clicks on the player info, and next time if it's in my database cache I will show my cache info. After 10 days I will grab again for the specific player form the external provider as I want the info to be somehow updated.
I will need to transform different providers data that come, usually, as JSON in my own structure so I can handle it the right way, I have my own object structure, so the fields coming from the external providers fall/map/transform in my code always with the same naming and structure..
So, my problem is to decide when should I map/transform data coming form the providers.
I grab data from the provider, I transform it to my JSON structure and record/keep it in the database cache system this once with my main structure, and in my solution code all I need is, everytime a user clicks on the soccer player details I get from the database cache this JSON field and convert it directly to object I know how to use.
I grab data from the provider, keep it as is in my database cache system, and in my solution code everytime someone clicks to get the soccer player detail info I get the JSON record from my database cache, I transform it for my naming and structure, and convert it to object
Notes:
- this is a cache database, records won't be kept forever, if in a call I see the record have more then 10 days I will get new data from the apropriate external provider
Deciding the layer to cache data is an art form all its own. The higher the layer you cache data, the more performant it will be (less reprocessing needed), but the lower the re-use potential will be (different parts of the application may use the same cache, and find value if it hasn’t been transformed too much).
Yours is another case of this. If you store it as the provider provides it, and you need to change the way you transform it, you won’t have to pay to re-retrieve it. If on the other hand, you store it as you need it now, you may have to discard it all if you decide to change the transformation method.
Like all architectural design decisions, it's all about trade-offs. You have to decide what is more important to you and your application.

Updating nested documents en masse

We've been using Elasticsearch to deliver the 700,000 or so pieces of content to the readers of our site for a couple of years but some circumstances have changed and we need to work out whether or not the service can adapt with us... (sorry this post is so long, I tried to anticipate all questions!)
We use Elasticsearch to store "snapshots" of our content to avoid duplicating work and slowing down our apps by making them fetch data and resolve all resources from our content APIs. We also take advantage of Elasticsearch's search API to retrieve the content in all sorts of ways.
To maintain content in our cluster we run a service that receives notifications of content changes from our APIs which triggers a content "ingest" (fetching the data, doing any necessary transformation and indexing it). The same service also periodically "reingests" content over time. Typically a new piece of content will be ingested in <30 seconds of publishing and touched every 5 days or so thereafter.
The most common method our applications use to retrieve content is by "tag". We have list pages to view content by tag and our users can subscribe to content updates for a tag. Every piece of content has one or more tags.
Tags have several properties:- ID, name, taxonomy, and it's relationship to the content. They're indexed as nested objects so that we can aggregate on them etc.
This is where it gets interesting... tags used to be immutable but we have recently changed metadata systems and they may now change - names will be updated, IDs may flux as they move taxonomy etc.
We have around 65,000 tags in use, the vast majority of which are used only in relatively small numbers. If and when these tags change we can trigger a reingest of all the associated content without requiring any changes to our infrastructure.
However, we also have some tags which are very common, the most popular of which is used more than 180,000 times. And we've just received warning it, a few others with tens of thousands of documents are due to change! So we need to be able to cope with these updates now and into the future.
Triggering a reingest of all the associated content and queuing it up is not the problem, but this could take quite some time, at least 3-5 hours in some cases, and we would like to try and avoid our list pages becoming orphaned or duplicated while this occurs.
If you've got this far, thank you! I have two questions:
Is there a more optimal mapping we could use for our documents knowing now that nested objects - often duplicated thousands of times - may change? Could a parent/child mapping work with so many relations?
Is there an efficient way to update a large number of nested objects? Hacks are fine, at least to cover us in the short term. Could the update by query API and a script handle it?
Thanks
I've already answered a similar question to your use case of Nested datatype.
Here is the link to the answer of maintaining Parent-Child relation data into ES using Nested datatype.
Try this. Do let me know if this solution helps in solving your problem.

How to design an event receiver which is capable of showing the recent events

The home page of meetup shows information on the recent meetups on the right hand side of the page. What kind of design patterns / tools (pref java based) would you use to implement such an output.
There's a couple of different approaches, which one you use would depend on several factors including the complexity of the business processes, the degree of flexibility desired and load.
Simple Solution
"RSVP Updates" are written directly to some data source during the "RSVP" process; this process is essentially hard-coded.
Have something that reads out the RSVP directly from the whatever data source / table they live in.
This solution will be fine if the load and data volumes are excessive. The key point is that the RSVP UI widget ends up pulling the data out of the same data source as where the updates are written to.
Performance
A few different options, based on the above as a starting point:
Hold the data twice: once in the "master" (Transactional) table of RSVP data, and once in a table built for servicing the UI (Basically OLTP vs OLAP). The second table would include all the relevant data so that there were no look-ups to other tables, and as it's an independent copy of the data you can manage it differently if you want to (for example: purge out old records so the table size is kept small).
Or, instead of a second table just keep all the data in memory. this would require you to pull the data out of the main transactional table whenever the in memory copy is lost.
Flexibility
Same as the original approach but instead of hard-coding in the step that records the RSVP (into a single data source) use a more loosely-coupled approach so that you can add / change / remove as many event processors as you wish. One will write the RSVP data to the main RSVP data source, while a second will do the same/similar but aggregated ready for the
"Recent RSVPs" UI Widget.
Dependency Injection will give you the flexibility - certainly if you deal with a single implementation of the event handler.
The Publish / Subscribe or Chain of Responsibility patterns might give you the basis of an approach.
Is that the kind of info you were after?

Data Synchronization from Relational Database to Couch DB

I need to synchronize my Relational database(Oracle or Mysql) to CouchDb. Do anyone has any idea how its possible. if its possbile than how we can notify the CouchDb for any changes happened on the relational DB.
Thanks in advance.
First of all, you need to change the way you think about database modeling. Synchronizing to CouchDB is not just creating documents of all your tables, and pushing them to Couch.
I'm using CouchDB for a site in production, I'll describe what I did, maybe it will help you:
From the start, we have been using MySQL as our primary database. I had entities mapped out, including their relations. In an attempt to speed up the front-end I decided to use CouchDB as a content repository. The benefit was to have fully prepared documents, that contained all the relational data, so data could be fetched with much less overhead.
Because the documents can contain related entities - say a question document that contains all answers - I first decided what top-level entities I wanted to push to Couch. In my example, only questions would be pushed to Couch, and those documents would contain the answers, and possible some metadata, such as tags, user info, etc. When requesting a question on the frontend, I would only need to fetch one document to have all the information I need at that point.
Now for your second question: how to notify CouchDB of changes. In our case, all the changes in our data are done using a CMS. I have a single point in my code which all edit actions call. That's the place where I hooked in a function that persisted the object being saved to CouchDB. The function determines if this object needs persisting (ie: is it a top level entity), then creates a document of this object (think about some sort of toArray function), and fetches all its relations, recursively. The complete document is then pushed to CouchDB.
Now, in your case, the variables here may be completely different, but the basic idea is the same: figure out what documents you want saved, and how they look like. Then write a function that composes these documents and make sure this is called when changes are made to your relational database.
Notifying CouchDB of a change
CouchDB is very simple. Probably the easiest thing is directly updating an existing document. Two ways to implement this come to mind:
The easiest way is a normal CouchDB update: Fetch the current document by id; modify it; then send it back to Couch with HTTP PUT or POST.
If you have clear application-specific changes (e.g. "the views value was incremented") then writing an _update function seems prudent. Update function are very simple: they receive an HTTP query and a document; they modify the document; and then CouchDB stores the new version. You write update functions in Javascript and they run on the server. It is a great way to "compress" common actions into simpler (and fewer) HTTP queries.

Client-server synchronization pattern / algorithm?

I have a feeling that there must be client-server synchronization patterns out there. But i totally failed to google up one.
Situation is quite simple - server is the central node, that multiple clients connect to and manipulate same data. Data can be split in atoms, in case of conflict, whatever is on server, has priority (to avoid getting user into conflict solving). Partial synchronization is preferred due to potentially large amounts of data.
Are there any patterns / good practices for such situation, or if you don't know of any - what would be your approach?
Below is how i now think to solve it:
Parallel to data, a modification journal will be held, having all transactions timestamped.
When client connects, it receives all changes since last check, in consolidated form (server goes through lists and removes additions that are followed by deletions, merges updates for each atom, etc.).
Et voila, we are up to date.
Alternative would be keeping modification date for each record, and instead of performing data deletes, just mark them as deleted.
Any thoughts?
You should look at how distributed change management works. Look at SVN, CVS and other repositories that manage deltas work.
You have several use cases.
Synchronize changes. Your change-log (or delta history) approach looks good for this. Clients send their deltas to the server; server consolidates and distributes the deltas to the clients. This is the typical case. Databases call this "transaction replication".
Client has lost synchronization. Either through a backup/restore or because of a bug. In this case, the client needs to get the current state from the server without going through the deltas. This is a copy from master to detail, deltas and performance be damned. It's a one-time thing; the client is broken; don't try to optimize this, just implement a reliable copy.
Client is suspicious. In this case, you need to compare client against server to determine if the client is up-to-date and needs any deltas.
You should follow the database (and SVN) design pattern of sequentially numbering every change. That way a client can make a trivial request ("What revision should I have?") before attempting to synchronize. And even then, the query ("All deltas since 2149") is delightfully simple for the client and server to process.
As part of the team, I did quite a lot of projects which involved data syncing, so I should be competent to answer this question.
Data syncing is quite a broad concept and there are way too much to discuss. It covers a range of different approaches with their upsides and downsides. Here is one of the possible classifications based on two perspectives: Synchronous / Asynchronous, Client/Server / Peer-to-Peer. Syncing implementation is severely dependent on these factors, data model complexity, amount of data transferred and stored, and other requirements. So in each particular case the choice should be in favor of the simplest implementation meeting the app requirements.
Based on a review of existing off-the-shelf solutions, we can delineate several major classes of syncing, different in granularity of objects subject to synchronization:
Syncing of a whole document or database is used in cloud-based applications, such as Dropbox, Google Drive or Yandex.Disk. When the user edits and saves a file, the new file version is uploaded to the cloud completely, overwriting the earlier copy. In case of a conflict, both file versions are saved so that the user can choose which version is more relevant.
Syncing of key-value pairs can be used in apps with a simple data structure, where the variables are considered to be atomic, i.e. not divided into logical components. This option is similar to syncing of whole documents, as both the value and the document can be overwritten completely. However, from a user perspective a document is a complex object composed of many parts, but a key-value pair is but a short string or a number. Therefore, in this case we can use a more simple strategy of conflict resolution, considering the value more relevant, if it has been the last to change.
Syncing of data structured as a tree or a graph is used in more sophisticated applications where the amount of data is large enough to send the database in its entirety at every update. In this case, conflicts have to be resolved at the level of individual objects, fields or relationships. We are primarily focused on this option.
So, we grabbed our knowledge into this article which I think might be very useful to everyone interested in the topic => Data Syncing in Core Data Based iOS apps (http://blog.denivip.ru/index.php/2014/04/data-syncing-in-core-data-based-ios-apps/?lang=en)
What you really need is Operational Transform (OT). This can even cater for the conflicts in many cases.
This is still an active area of research, but there are implementations of various OT algorithms around. I've been involved in such research for a number of years now, so let me know if this route interests you and I'll be happy to put you on to relevant resources.
The question is not crystal clear, but I'd look into optimistic locking if I were you.
It can be implemented with a sequence number that the server returns for each record. When a client tries to save the record back, it will include the sequence number it received from the server. If the sequence number matches what's in the database at the time when the update is received, the update is allowed and the sequence number is incremented. If the sequence numbers don't match, the update is disallowed.
I built a system like this for an app about 8 years ago, and I can share a couple ways it has evolved as the app usage has grown.
I started by logging every change (insert, update or delete) from any device into a "history" table. So if, for example, someone changes their phone number in the "contact" table, the system will edit the contact.phone field, and also add a history record with action=update, table=contact, field=phone, record=[contact ID], value=[new phone number]. Then whenever a device syncs, it downloads the history items since the last sync and applies them to its local database. This sounds like the "transaction replication" pattern described above.
One issue is keeping IDs unique when items could be created on different devices. I didn't know about UUIDs when I started this, so I used auto-incrementing IDs and wrote some convoluted code that runs on the central server to check new IDs uploaded from devices, change them to a unique ID if there's a conflict, and tell the source device to change the ID in its local database. Just changing the IDs of new records wasn't that bad, but if I create, for example, a new item in the contact table, then create a new related item in the event table, now I have foreign keys that I also need to check and update.
Eventually I learned that UUIDs could avoid this, but by then my database was getting pretty large and I was afraid a full UUID implementation would create a performance issue. So instead of using full UUIDs, I started using randomly generated, 8 character alphanumeric keys as IDs, and I left my existing code in place to handle conflicts. Somewhere between my current 8-character keys and the 36 characters of a UUID there must be a sweet spot that would eliminate conflicts without unnecessary bloat, but since I already have the conflict resolution code, it hasn't been a priority to experiment with that.
The next problem was that the history table was about 10 times larger than the entire rest of the database. This makes storage expensive, and any maintenance on the history table can be painful. Keeping that entire table allows users to roll back any previous change, but that started to feel like overkill. So I added a routine to the sync process where if the history item that a device last downloaded no longer exists in the history table, the server doesn't give it the recent history items, but instead gives it a file containing all the data for that account. Then I added a cronjob to delete history items older than 90 days. This means users can still roll back changes less than 90 days old, and if they sync at least once every 90 days, the updates will be incremental as before. But if they wait longer than 90 days, the app will replace the entire database.
That change reduced the size of the history table by almost 90%, so now maintaining the history table only makes the database twice as large instead of ten times as large. Another benefit of this system is that syncing could still work without the history table if needed -- like if I needed to do some maintenance that took it offline temporarily. Or I could offer different rollback time periods for accounts at different price points. And if there are more than 90 days of changes to download, the complete file is usually more efficient than the incremental format.
If I were starting over today, I'd skip the ID conflict checking and just aim for a key length that's sufficient to eliminate conflicts, with some kind of error checking just in case. (It looks like YouTube uses 11-character random IDs.) The history table and the combination of incremental downloads for recent updates or a full download when needed has been working well.
For delta (change) sync, you can use pubsub pattern to publish changes back to all subscribed clients, services like pusher can do this.
For database mirror, some web frameworks use a local mini database to sync server side database to local in browser database, partial synchronization is supported. Check meteror.
This page clearly describes mosts scenarios of data synchronization with patterns and example code: Data Synchronization: Patterns, Tools, & Techniques
It is the most comprehensive source I found, considering whole of delta syncs, strategies on how to handle deletions and server-to-client and client-to-server sync. It is a very good starting point, worth a look.

Resources