Elasticsearch - Modelling video catalogue information into one index vs multiple indexes

I need to model a video catalogue composed of movies, tv shows, episodes, TV channels and live programs information into elasticsearch. Some of these entities are correlated, some not.
The attributes of these entities are quite different, even if there are some common ones.
Now, since I may need to run cross-entity queries (imagine a customer searching for something that could be a movie, a TV channel, or a live event program), is it better to have a single index containing a generic entity marked with a logical type attribute, or multiple indexes, one for each entity (movie, show, episode, channel, program)?
In addition, some of these entities, like movies, can have metadata attributes in multiple languages.
Coming from a relational data model background, I would create different indexes, one for every entity, and a language-variant index for every language. Any suggestions or a better approach for great search performance and usability?

Whether to use several indexes very much depends on the application, so I cannot give a definite answer, only a few thoughts.
In my experience, indexes are a means of helping maintenance and operations rather than a data-modelling tool. It is, for example, much easier to delete an index than to delete all documents from one source in a bigger index. And if you support totally separate search applications that do not query across each other's data, different indexes are the way to go.
But when you want to query documents across data sources, as you do, it makes sense to keep them in one index, if only to have comparable ranking across all items in your index. Make sure to re-use fields across your data that have a similar meaning (title, year of production, artists, etc.). For fields unique to a source we usually use prefix-marked field names, e.g. movie_... for movie-only metadata.
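To make the shared-fields-plus-prefix idea concrete, here is a minimal sketch of such a single-index mapping and a cross-entity query. All field names below are illustrative, not from the original post; the dicts are the JSON bodies you would send to Elasticsearch's create-index and search APIs.

```python
# Single index: shared fields are re-used across entity types, a "type"
# keyword field tags each document, and movie-only metadata gets a
# "movie_" prefix.
mapping = {
    "mappings": {
        "properties": {
            "type": {"type": "keyword"},          # movie | episode | channel | program
            "title": {"type": "text"},            # shared across all entities
            "year_of_production": {"type": "integer"},
            "artists": {"type": "text"},
            "movie_director": {"type": "text"},   # movie-only field
            "movie_rating": {"type": "keyword"},  # movie-only field
        }
    }
}

# A cross-entity search still ranks everything together, and can narrow
# down to a subset of entity types with a cheap keyword filter:
query = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "star"}}],
            "filter": [{"terms": {"type": ["movie", "channel", "program"]}}],
        }
    }
}
```

Because the filter runs on a `keyword` field it does not affect scoring, so ranking stays comparable across the remaining entity types.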
As for the languages, you need language-specific fields, like title_en, title_es, title_de. Ideally, at query time, you know your user's language (from the browser, because they selected it explicitly, ...) and can then search the language-specific fields where available. Be sure to use the language-specific analyzers for these fields, at query time and at index time.
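A small sketch of that setup, under the assumption that the user's language is known at query time. The analyzer names ("english", "spanish", "german") are Elasticsearch built-ins; the field names and the fallback-to-English choice are illustrative.

```python
# Per-language title fields, each with the matching built-in analyzer
# applied at both index and query time.
mapping = {
    "mappings": {
        "properties": {
            "title_en": {"type": "text", "analyzer": "english"},
            "title_es": {"type": "text", "analyzer": "spanish"},
            "title_de": {"type": "text", "analyzer": "german"},
        }
    }
}

def title_query(text, user_lang):
    """Build a search against the user's language field, falling back
    to English when the user's language is unknown."""
    fields_by_lang = {
        "en": ["title_en"],
        "es": ["title_es", "title_en"],
        "de": ["title_de", "title_en"],
    }
    return {"query": {"multi_match": {
        "query": text,
        "fields": fields_by_lang.get(user_lang, ["title_en"]),
    }}}
```

Because the analyzer is declared in the mapping, `multi_match` automatically analyzes the query text per field, so the Spanish stemmer is applied to `title_es` and the English one to `title_en` within the same query.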
I see a search engine a bit as the dual of a database: A database stores data but can also index it. A search engine indexes data but can also store it. A database tends to normalize the schema to remove redundancy, a search engine works best with denormalized data for query performance.

Related

Is it OK to have multiple merge steps in an Excel Power query?

I have data from multiple sources - a combination of Excel (table and non-table), CSV and, sometimes, even TSV.
I create queries for each data source and then bring them together one step at a time - or, actually, two steps per source: merge, then expand to bring in the fields I want from each data source.
This doesn't feel very efficient and I think that maybe I should be just joining everything together in the Data Model. The problem when I did that was that I couldn't then find a way to write a single query to access all the different fields spread across the different data sources.
If it were Access, I'd have no trouble creating a single query once I'd created all the relationships between my tables.
I feel as though I'm missing something: How can I build a single query out of the data model?
Hoping my question is clear. It feels like something that should be easy to do but I can't home in on it with a Google search.
It is never a good idea to push the heavy lifting downstream in Power Query. If you can, work with database views, not full tables, use a modular approach (several smaller queries that you then connect in the data model), filter early, remove unneeded columns etc.
The more work that has to be performed on data you don't really need, the slower the query will be. Please take a look at this article and this one, the latter having a comprehensive list of best practices (you can also just search for that term; there are plenty).
In terms of creating a query from the data model, conceptually that makes little sense, as you could conceivably create circular references galore.

Cross querying distinct engines on AppSearch

Does AppSearch support searching across distinct engines with the same query (where, for example, two engines have a one-to-many relationship), such that the result set is a combination of both engines, with filters applying to both datasets at the same time?
If this is supported, how would I write a query to do this, and are there special requirements regarding the data structure in the engines?
Or is there perhaps another way to structure the data such that the second engine is not necessary but the additional data remains queryable?

Using AWS Appsync with DynamoDB, should you model relationships by storing "redundant copies" of related data on the same table (denormalization)?

I was recently reading through this section in the Elasticsearch documentation (or the guide, to be more precise). It says that you should use a non-relational database the intended way, meaning you should avoid joins between different tables, because such databases are not designed to handle them well. This also reminds me of the section in the DynamoDB docs stating that most well-designed DynamoDB backends require only one table.
Let's take as an example a recipes database where each recipe is using several ingredients. Every ingredient can be used in many different recipes.
Option 1: The obvious way for me to model this in AppSync and DynamoDB would be to start with an ingredients table which has one item per ingredient storing all the ingredient data, with the ingredient id as partition key. Then I have another recipes table with the partition key recipe id and an ingredients field storing all the ingredient ids in an array. In AppSync I could then query a recipe by doing a GetItem request by recipe id and then resolving the ingredients field with a BatchGetItem on the ingredients table. Let's say a recipe contains 10 ingredients on average, so this would mean 11 GetItem requests sent to the DynamoDB tables.
Option 2: I would consider this a "join-like" operation, which is apparently not the ideal way to use non-relational databases. So, alternatively, I could do the following: make "redundant copies" of all the ingredient data on the recipes table, and not only save the ingredient id there but also all the other data from the ingredients table. This could drastically increase disk space usage, but apparently disk space is cheap, and the increase in performance from doing only 1 GetItem request (instead of 11) could be worth it. As discussed later in the Elasticsearch guide, this would also require some extra work to keep the copies consistent when ingredient data is updated. So I would probably have to use a DynamoDB stream to update all the data in the recipes table as well when an ingredient is updated. This would require an expensive Scan to find all the recipes using the updated ingredient and a BatchWrite to update all those items. (An ingredient update might be rare, though, so the increase in read performance might be worth it.)
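For reference, Option 1's reads can be sketched as request payloads in DynamoDB's low-level format (the table and field names come from the question; the helper functions are illustrative, and a real resolver or client would pass these dicts to boto3's `get_item` / `batch_get_item`). One detail worth noting: BatchGetItem fetches up to 100 items in a single API call, so the "11 GetItem requests" are really only 2 round trips to DynamoDB.

```python
def get_recipe_request(recipe_id):
    # Round trip 1: fetch the recipe item, including its array of
    # ingredient ids.
    return {"TableName": "recipes", "Key": {"id": {"S": recipe_id}}}

def get_ingredients_request(ingredient_ids):
    # Round trip 2: fetch all referenced ingredients in one
    # BatchGetItem call (up to 100 keys per request).
    return {"RequestItems": {"ingredients": {
        "Keys": [{"id": {"S": i}} for i in ingredient_ids]
    }}}
```

So the performance gap between the two options is closer to "2 round trips vs 1" than "11 vs 1", which changes the cost/benefit calculation for denormalizing.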
I would be interested in hearing your thoughts on this:
Which option would you choose and why?
The second, "more non-relational" way to do this seems painful, and I am worried that with more levels/relations appearing (for example, if users can create menus out of recipes), the resulting complexity could get out of hand quickly when I have to save "redundant copies" of the same data multiple times. I don't know much about relational databases, but these things seem much simpler there, where every piece of data has one unique location and that's it (I guess that's what "normalization" means).
Is getRecipe in Option 1 really 11 times more expensive (performance- and cost-wise) than in Option 2? Or do I misunderstand something?
Would Option 1 be cheaper in a relational database (e.g. MySQL) than in DynamoDB? If I understand correctly it's a join there, while in DynamoDB it's just 11 GetItem operations (the "NoSQL intended way"). Could those still be faster than 1 SQL query?
If I have a very relational data structure could a non-relational database like DynamoDB be a bad choice? Or is AppSync/GraphQL a way to still make it a viable choice (by allowing Option 1 which is really easy to build)? I read some opinions that constantly working around the missing join capability when querying NoSQL databases and having to do this on the application side is the main reason why it's not a good fit. But AppSync might be a way to solve this problem. Other opinions (including the DynamoDB docs) mention performance issues as the main reason why you should always query just one table.
This is quite late, I know, but might help someone down the road.
Start with an entity relationship diagram as this will help determine your options. Even in NoSQL, there are standard ways of modeling relationships.
Next, define your access patterns. Go through all the CRUDL operations and make sure that for each operation, you can access the specific data for that operation. For example, in your option 1 where ingredients are stored in an array in a field: think through an access pattern where you might need to delete an ingredient in a recipe. To do this, you need to know the index of the item in the array. Therefore, you have to obtain the entire array, find the index of the item, and then issue another call to update the array, taking into account possible race conditions.
Doing this in your application, while possible, is not efficient. You can also code this up in your resolver, but attempting to do so using velocity template language is not worth the headache, trust me.
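The array-delete access pattern described above (read, find the index, then update while guarding against races) can be expressed as a conditional update. This is a sketch in DynamoDB's low-level request format, with illustrative names; a real call would pass the dict to boto3's `update_item`.

```python
def remove_ingredient_request(recipe_id, index, expected_ingredient_id):
    """Build an UpdateItem request that removes the ingredient at a
    previously looked-up array index, but only if that slot still holds
    the ingredient we expect (guarding the race condition mentioned
    above: another writer may have reordered or changed the array)."""
    return {
        "TableName": "recipes",
        "Key": {"id": {"S": recipe_id}},
        "UpdateExpression": f"REMOVE ingredients[{index}]",
        "ConditionExpression": f"ingredients[{index}] = :expected",
        "ExpressionAttributeValues": {
            ":expected": {"S": expected_ingredient_id}
        },
    }
```

If the condition fails, DynamoDB rejects the write with a ConditionalCheckFailedException and the client re-reads and retries, rather than silently deleting the wrong element.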
The TL;DR is to model your entire application's entity relationship diagram, and think through all the access patterns. If the relationship is one-to-many, you can either denormalize the data, use a composite sort key, or use secondary indexes. If many-to-many, you start getting into adjacency lists and other advanced strategies. Alex DeBrie has some great resources here and here.

What's the benefits of using a design document?

While reading the Couchbase documentation, I came across the following:
Because the index for each map/reduce combination within each view within a given design document is updated at the same time, avoid declaring too many views within the same design document.
http://docs.couchbase.com/admin/admin/Views/views-writing.html
If that's the case, why does it even allow you to add more than one view to a design document? Should I simply create one view per design document?
Grouping multiple views together under one design document is useful when the data being indexed by the views is related and when having the related indexes updated at the same time is desired.
For example take the 'beer sample' bucket that is distributed with Couchbase. It includes two types of documents, one type for breweries and one for beers. You can create a bunch of views on both of these document types. Say you have four views that achieve the following:
List all beers
List all breweries
List all beers under a certain abv
List all breweries in a country
So essentially we have two views that operate on documents related to breweries and another two that operate on beers. At this point it is very useful to group the related views together under one design document, because it means they will both be updated at the same time. If you were to add a new brewery, both the view that lists all breweries and the view that lists breweries in a country would have updates triggered at the same time. On the other hand, if you were to keep these two views in separate design docs, you would trigger two separate view updates, which means increased response time if you are using stale=false, or potentially inconsistent results if you are not.
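A design document is just JSON, with the map functions stored as JavaScript strings inside it. This is a sketch of that grouping for the beer-sample example above (view and field names are illustrative): the two brewery views share one design document and the two beer views share another, so each pair is always indexed together.

```python
# Design document grouping the two beer views: all beers, and beers
# under a certain abv (queried with an endkey on the abv index).
beer_ddoc = {
    "views": {
        "all_beers": {
            "map": "function (doc, meta) { if (doc.type == 'beer') { emit(meta.id, null); } }"
        },
        "beers_by_abv": {
            "map": "function (doc, meta) { if (doc.type == 'beer') { emit(doc.abv, null); } }"
        },
    }
}

# Design document grouping the two brewery views: all breweries, and
# breweries in a country (queried with a key on the country index).
brewery_ddoc = {
    "views": {
        "all_breweries": {
            "map": "function (doc, meta) { if (doc.type == 'brewery') { emit(meta.id, null); } }"
        },
        "breweries_by_country": {
            "map": "function (doc, meta) { if (doc.type == 'brewery') { emit(doc.country, null); } }"
        },
    }
}
```

Adding a brewery document then triggers a single index update for `brewery_ddoc` covering both brewery views, instead of two separate updates.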
Whether this is useful in any given dataset depends on each implementation, how related the documents are and how important response times are. Couchbase gives you the option to tune the view updates to meet your requirements.
An additional reason is that you can control the automated index update triggers on a per design document basis, so you can have some views updated more regularly than others.
The wiki page on Couchbase View Engine Internals may be of interest to you as it explains the design document concept quite well and provides some further insight about how tasks are delegated to design documents by the view engine.

Elasticsearch: Use a separate index for each language of the same data record

I have a data record which has a field called title. A record may have the title in several languages at the same time. Such a record has other fields whose values do not vary with language, so I do not list them in the following two examples:
Record #1:
Title (English): Hello
Record #2:
Title (English): World
Title (Spanish): mundo
Currently there are four possible languages for the title: English, Spanish, French, and Chinese. There will be more languages supported when the system grows.
I am new to Elasticsearch. I am thinking about having a separate index for each language. So for record #2, I would create two Elasticsearch documents (one for each language) and send each document to the index corresponding to its language.
Is this a good/acceptable design with indexing, update, delete, and search in mind? Any problems?
For this design, I believe it has at least these benefits:
I can easily decide how many shards are needed for each language independently
I can decide the number and locations of shards for each index (language)
I can easily add an index for a new language when the system grows, without destroying or re-indexing existing data
The system can maximally take advantage of distributed computing power
Thanks for any input!
Best.
Your solution would likely work fine, but you can run into issues with duplicate documents if you start allowing multi-language searches.
It might be more optimal to have multiple possible values per field, eg:
title.english
title.spanish
You can have completely different analysis rules for each language without duplicating the document.
This approach will also let you add new title.whatever fields to documents, each with their own analysis rules. Be warned, though: last I checked, if you use a completely new custom analyzer you need to close and reopen the index for it to take effect, which results in a few seconds of downtime.
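As a placeholder until that end-to-end example, here is a minimal sketch of this layout (the sub-field names are illustrative; the analyzer names are Elasticsearch built-ins). Each language lives under the same title object with its own analysis rules, so one document carries all translations and multi-language searches cannot return duplicates.

```python
# One document per record; languages are sub-fields with their own
# analyzers instead of separate indexes.
mapping = {
    "mappings": {
        "properties": {
            "title": {
                "properties": {
                    "english": {"type": "text", "analyzer": "english"},
                    "spanish": {"type": "text", "analyzer": "spanish"},
                }
            }
        }
    }
}

# Record #2 from the question becomes a single document:
doc = {"title": {"english": "World", "spanish": "mundo"}}

# A multi-language search hits both sub-fields but can only match this
# one document once:
query = {"query": {"multi_match": {
    "query": "mundo",
    "fields": ["title.english", "title.spanish"],
}}}
```

Adding a language later means adding a sub-field to the mapping; existing documents stay in place and only need the new translation indexed.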
I'll try to find some time to expand this answer with an end to end example.