What are the benefits of using a design document?

While reading the Couchbase documentation, I came across the following:
Because the index for each map/reduce combination within each view
within a given design document is updated at the same time, avoid
declaring too many views within the same design document
http://docs.couchbase.com/admin/admin/Views/views-writing.html
If that's the case, why does it even allow you to add more than one view to a design document? Should I simply create one view per design document?

Grouping multiple views together under one design document is useful when the data being indexed by the views is related and when having the related indexes updated at the same time is desired.
For example, take the 'beer-sample' bucket that is distributed with Couchbase. It includes two types of documents: one type for breweries and one for beers. You can create a number of views on both of these document types. Say you have four views that achieve the following:
List all beers
List all breweries
List all beers under a certain abv
List all breweries in a country
So essentially we have two views that operate on documents related to breweries and another two that operate on beers. At this point it is very useful to group the related views together under one design document, because it means they will both be updated at the same time. If you were to add a new brewery, both the view that lists all breweries and the view that lists all breweries in a country will have updates triggered at the same time. On the other hand, if you were to have these two views in separate design docs you would end up triggering two separate view updates, which will mean increased response time if you are using stale=false, or potentially inconsistent results if you are not.
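As a minimal sketch of such a grouping (the design document and view names are my own, and the map functions assume the beer-sample document shape, where each document carries a `type` field), the two brewery views could live in one design document like this:

```python
# Hypothetical sketch: one design document grouping the two
# brewery-related views, so both indexes update together.
import json

design_doc = {
    "views": {
        # Lists every brewery, keyed by name.
        "all_breweries": {
            "map": """
                function (doc, meta) {
                    if (doc.type == "brewery") {
                        emit(doc.name, null);
                    }
                }"""
        },
        # Lists breweries keyed by country, so a key/range query
        # can select all breweries in one country.
        "by_country": {
            "map": """
                function (doc, meta) {
                    if (doc.type == "brewery") {
                        emit(doc.country, doc.name);
                    }
                }"""
        },
    }
}

# The design document is stored as JSON, e.g. via
# PUT /beer-sample/_design/breweries
payload = json.dumps(design_doc)
```

Because both views sit under the same design document, adding a brewery document triggers a single index update that refreshes both of them.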
Whether this is useful in any given dataset depends on each implementation, how related the documents are and how important response times are. Couchbase gives you the option to tune the view updates to meet your requirements.
An additional reason is that you can control the automated index update triggers on a per design document basis, so you can have some views updated more regularly than others.
The wiki page on Couchbase View Engine Internals may be of interest to you as it explains the design document concept quite well and provides some further insight about how tasks are delegated to design documents by the view engine.

Related

Elasticsearch - Modelling video catalogue information into one index vs multiple indexes

I need to model a video catalogue composed of movies, tv shows, episodes, TV channels and live programs information into elasticsearch. Some of these entities are correlated, some not.
The attributes of these entities are quite different, even if there are some common ones.
Now since I may need to do query cross-entity, imagine the scenario of a customer searching for something that could be a movie, a tv channel or a live event program, is it better to have 1 single index containing a generic entity marked with a logical type attribute, or is it better to have multiple indexes, 1 for each entity (movie, show episode, channel, program) ?
In addition, some of these entities, like movies, can have metadata attributes into multiple languages.
Coming from a relational database background, I would create different indexes, one for every entity, and have a language-variant index for every language. Any suggestions, or a better approach, in order to get great search performance and usability?
Whether to use several indexes or not very much depends on the application, so I cannot provide a definite answer, rather a few thoughts.
From my experience, indexes are more a means to help maintenance and operations than a data-modeling tool. It is, for example, much easier to delete an index than to delete all documents from one source within a bigger index. Likewise, if you support totally separate search applications which do not query across each other's data, different indexes are the way to go.
But when you want to query, as you do, documents across data sources, it makes sense to keep them in one index, if only to have comparable ranking across all items in your index. Make sure to re-use fields across your data that have similar meaning (title, year of production, artists, etc.). For fields unique to a source we usually use prefix-marked field names, e.g. movie_... for movie-only metadata.
As for the languages, you need to use language-specific fields, like title_en, title_es, title_de. Ideally, at query time, you know your user's language (from the browser, or because they selected it explicitly, ...) and can then search in the language-specific fields where available. Be sure to use the language-specific analyzers for these fields, at query time and at index time.
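Under the assumptions above, a mapping sketch could combine shared fields, prefix-marked source-specific fields, and per-language title fields with Elasticsearch's built-in language analyzers (all field names here are illustrative, not from the original question):

```python
# Hypothetical mapping sketch for one cross-entity index.
mapping = {
    "mappings": {
        "properties": {
            # Fields shared by all entity types.
            "entity_type": {"type": "keyword"},
            "year": {"type": "integer"},
            # Per-language title variants with matching
            # built-in language analyzers.
            "title_en": {"type": "text", "analyzer": "english"},
            "title_es": {"type": "text", "analyzer": "spanish"},
            "title_de": {"type": "text", "analyzer": "german"},
            # Movie-only metadata, prefix-marked.
            "movie_director": {"type": "text"},
        }
    }
}

def title_field(lang_code, default="title_en"):
    """Pick the language-specific title field for a query,
    falling back to a default when the language is unsupported."""
    field = "title_" + lang_code
    return field if field in mapping["mappings"]["properties"] else default
```

At query time the application would resolve the user's language to the matching field, e.g. a Spanish user searches `title_es` while an unsupported language falls back to `title_en`.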
I see a search engine a bit as the dual of a database: A database stores data but can also index it. A search engine indexes data but can also store it. A database tends to normalize the schema to remove redundancy, a search engine works best with denormalized data for query performance.

Updating nested documents en masse

We've been using Elasticsearch to deliver the 700,000 or so pieces of content to the readers of our site for a couple of years but some circumstances have changed and we need to work out whether or not the service can adapt with us... (sorry this post is so long, I tried to anticipate all questions!)
We use Elasticsearch to store "snapshots" of our content to avoid duplicating work and slowing down our apps by making them fetch data and resolve all resources from our content APIs. We also take advantage of Elasticsearch's search API to retrieve the content in all sorts of ways.
To maintain content in our cluster we run a service that receives notifications of content changes from our APIs which triggers a content "ingest" (fetching the data, doing any necessary transformation and indexing it). The same service also periodically "reingests" content over time. Typically a new piece of content will be ingested in <30 seconds of publishing and touched every 5 days or so thereafter.
The most common method our applications use to retrieve content is by "tag". We have list pages to view content by tag and our users can subscribe to content updates for a tag. Every piece of content has one or more tags.
Tags have several properties: ID, name, taxonomy, and their relationship to the content. They're indexed as nested objects so that we can aggregate on them, etc.
This is where it gets interesting... tags used to be immutable, but we have recently changed metadata systems and they may now change: names will be updated, IDs may change as tags move within the taxonomy, etc.
We have around 65,000 tags in use, the vast majority of which are used only in relatively small numbers. If and when these tags change we can trigger a reingest of all the associated content without requiring any changes to our infrastructure.
However, we also have some tags which are very common, the most popular of which is used more than 180,000 times. And we've just received warning that it, and a few others used by tens of thousands of documents, are due to change! So we need to be able to cope with these updates now and into the future.
Triggering a reingest of all the associated content and queuing it up is not the problem, but this could take quite some time, at least 3-5 hours in some cases, and we would like to try and avoid our list pages becoming orphaned or duplicated while this occurs.
If you've got this far, thank you! I have two questions:
Is there a more optimal mapping we could use for our documents knowing now that nested objects - often duplicated thousands of times - may change? Could a parent/child mapping work with so many relations?
Is there an efficient way to update a large number of nested objects? Hacks are fine, at least to cover us in the short term. Could the update by query API and a script handle it?
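To the second question: yes, the update by query API with a script is a plausible short-term fix. As a sketch (the field names `tags`, `tags.id`, and `tags.name`, plus the example values, are assumptions about the mapping described above), the request body could rename a nested tag in place across all matching documents, avoiding a full reingest:

```python
# Hypothetical _update_by_query body that renames one nested tag
# in every document that carries it. The script is Painless.
old_tag_id, new_name = "politics", "Politics & Policy"

update_by_query_body = {
    # Match only documents that have the affected nested tag.
    "query": {
        "nested": {
            "path": "tags",
            "query": {"term": {"tags.id": old_tag_id}},
        }
    },
    # Walk the nested tags array and rewrite the matching entry.
    "script": {
        "lang": "painless",
        "source": """
            for (tag in ctx._source.tags) {
                if (tag.id == params.old_id) {
                    tag.name = params.new_name;
                }
            }""",
        "params": {"old_id": old_tag_id, "new_name": new_name},
    },
}

# This body would be sent as POST /<index>/_update_by_query;
# for ~180,000 matching documents, running it with
# conflicts=proceed and throttling is advisable.
```

Note that this rewrites documents in place, so list pages stay consistent while the slower reingest pipeline catches up.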
Thanks
I've already answered a similar question covering your use case of the Nested datatype.
Here is the link to the answer on maintaining parent-child relational data in ES using the Nested datatype.
Try this, and do let me know if this solution helps solve your problem.

Elasticsearch: Use a separate index for each language of the same data record

I have a data record which has a field called title. A record may have different languages for the title at the same time. Such a record has other fields whose values do not vary with languages and so I do not list them in the following two examples:
Record #1:
Title (English): Hello
Record #2:
Title (English): World
Title (Spanish): mundo
Currently there are four possible languages for the title: English, Spanish, French, and Chinese. There will be more languages supported when the system grows.
I am new to Elasticsearch. I am thinking about having a separate index for each language. So for record #2, I would create two Elasticsearch documents (one for each language) and send each document to the index corresponding to its language.
Is this a good/acceptable design with indexing, update, delete, and search in mind? Any problems?
For this design, I believe it has at least these benefits:
I can easily decide how many shards are needed for each language independently
I can decide the number and locations of shards for each index (language)
I can easily add an index for a new language when the system grows, without destroying or re-indexing existing data
The system can maximally take advantage of distributed computing power
Thanks for any input!
Best.
Your solution would likely work fine, but you can run into issues with duplicate documents if you start allowing multi-language searches.
It might be more optimal to have multiple possible values per field, e.g.:
title.english
title.spanish
You can have completely different analysis rules for each language without duplicating the document.
This approach will further allow you to add new title.whatever fields to documents, each with its own analysis rules. Be warned though: last I checked, if you use a completely new custom analyzer you need to close and reopen the index for it to take effect, which will result in a few seconds of downtime.
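As a sketch of this approach (analyzer choices are Elasticsearch's built-in language analyzers; the field layout is my reading of the suggestion above), a single document carries the title once per language under one `title` object, and a multi-language search targets both sub-fields:

```python
# Hypothetical mapping: one "title" object field with per-language
# sub-fields, each analyzed with its own language analyzer.
mapping = {
    "mappings": {
        "properties": {
            "title": {
                "properties": {
                    "english": {"type": "text", "analyzer": "english"},
                    "spanish": {"type": "text", "analyzer": "spanish"},
                }
            }
        }
    }
}

# Record #2 from the question becomes a single document,
# not one document per language:
doc = {"title": {"english": "World", "spanish": "mundo"}}

# A cross-language search can then hit both sub-fields at once:
query = {
    "multi_match": {
        "query": "mundo",
        "fields": ["title.english", "title.spanish"],
    }
}
```

Because each record is a single document, cross-language searches cannot return duplicates, and adding a language means adding one sub-field rather than a whole new index.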
I'll try to find some time to expand this answer with an end to end example.

Solr: more than one entity in DataImportHandler

I need to know the recommended solution when I want to index my Solr data using multiple queries and entities.
I ask because I have to add new fields into the schema.xml configuration, and depending on the entity (query) there should be different field definitions.
query_one = "select * from car"
query_two = "select * from user"
The tables car and user have different fields, so I should account for this fact in my schema.xml config when preparing the field definitions.
Maybe some of you create a new Solr instance for this kind of problem?
I found something called MultiCore. Is it a suitable solution for my problem?
Thanks
Solr does not stop you from hosting multiple entities in a single collection.
You can define the fields for both entities and have them hosted within the collection.
You would need an identifier to distinguish the entities if you want to filter the results per entity.
If your collections are small, or there is a relationship between the users and cars, it might be helpful to host them within the same collection.
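As a sketch of the single-collection approach (the `entity_type` field name and document shapes are illustrative, not prescribed by Solr), each document carries an identifier field, and a filter query restricts searches to one entity:

```python
# Hypothetical sketch: cars and users in one Solr collection,
# distinguished by an "entity_type" field.
docs = [
    {"id": "car-1", "entity_type": "car", "car_model": "Ibiza"},
    {"id": "car-2", "entity_type": "car", "car_model": "Golf"},
    {"id": "user-1", "entity_type": "user", "user_name": "Alice"},
]

# Query parameters for a search limited to car documents:
# the fq (filter query) keeps the entities separate at query time.
params = {
    "q": "car_model:Ibiza",
    "fq": "entity_type:car",
}

# The same filtering idea, expressed locally over the sample docs:
cars_only = [d for d in docs if d["entity_type"] == "car"]
```

The schema.xml then simply declares the union of both entities' fields; unused fields stay empty on documents of the other type.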
For Solr Multicore, check this answer:
Solr Multicore is basically a set up for allowing Solr to host multiple cores.
Each of these cores can host a completely different set of unrelated entities.
You can have a separate Core for each table as well.
For example, if you have collections for documents, people, and stocks, which are completely unrelated entities, you would want to host them in different collections.
A multicore setup would allow you to:
Host unrelated entities separately so that they don't impact each other
Have a different configuration for each core, with different behavior
Perform activities on each core differently (updating data, load, reload, replication)
Keep the size of each core in check and configure caching accordingly
It's more a matter of preference and requirements.
The main question for you is whether people will search for cars and users together. If not (they are different domains), you can setup multiple collections/cores. If they are going to be used together (e.g. a search for something that shows up in both cars and people), you may want to merge them into one index.
If you do use a single collection for both types, you may want to set up dedicated request handlers returning different sets of fields and possibly tuning the searches. You can see an example of doing that (and a bit more) in the multilingual example from my book.

How to select specific view model data to load in a specific view

I'm not sure if I stated my question clearly, but I have two separate pages and a single view model. Originally I only had one page, but I decided to split them up because my pages were getting too large (more specifically, I had too many pivot items on a single page, where two pages would separate the data better for the user). I was wondering if it is possible to load only specific data into a single view from the view model, because as it is right now my application is freezing: my view model attempts to load all the data even though only about half of it needs to be used on the page the user is currently viewing. If so, I'm assuming I would somehow need to let the view model know which data to load. How would I accomplish this? OR, is it good practice to create two separate view models, one for each page, so that only the necessary data for each page will load and keep my application from freezing? I am not sure what the standard is here, or what is most efficient in terms of CPU usage, response times, etc.
Loading more data than you need can definitely be a problem, especially if you're doing it over the Internet. But why do it like that? Why not simply separate the viewmodel into two parts? The definition of a VM basically says (quote from Model-View-ViewModel (MVVM) Explained):
The viewmodel is a key piece of the triad because it introduces Presentation Separation, or the concept of keeping the nuances of the view separate from the model. Instead of making the model aware of the user's view of a date, so that it converts the date to the display format, the model simply holds the data, the view simply holds the formatted date, and the controller acts as the liaison between the two.
If you separated the view, you might as well separate the VM too in order to keep things simple.
Still, if that doesn't do it for you and your data is not exposed as a service of some kind, why not just use the parts of the VM you need? Call only the methods you need according to the page you're viewing, set only the properties you need; don't do it all. And do it on a different thread if the data is really large to process, so that your UI doesn't freeze (and of course, in the meantime show the user that you're getting the data using a progress bar).
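The per-page split can be sketched language-agnostically (the original question concerns a XAML app, so this Python sketch only illustrates the shape; all class and method names are made up): one view model per page, each loading only its own data, and only when the page actually asks for it.

```python
# Hypothetical sketch: one view model per page with lazy loading,
# instead of one monolithic view model loading everything up front.
class PageOneViewModel:
    def __init__(self, repository):
        self._repository = repository
        self._items = None  # nothing is fetched in the constructor

    @property
    def items(self):
        # Fetch page one's data only on first access; in a real app
        # this would run off the UI thread with a progress indicator.
        if self._items is None:
            self._items = self._repository.load_page_one_items()
        return self._items

# Stand-in data source for the sketch.
class FakeRepository:
    def load_page_one_items(self):
        return ["a", "b"]

vm = PageOneViewModel(FakeRepository())
```

Navigating to page two would construct a `PageTwoViewModel` the same way, so neither page pays for the other's data.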
That should be enough for the scenario you described.
