DynamoDB and ElasticSearch - table design approaches

I'm currently thinking about my DynamoDB table design. The goal is to get the best possible speed with a really large amount of data. DynamoDB seems like a good option for that. Furthermore, I need to connect the table to ElasticSearch because I need geopoint queries (like "show me all posts that are within a specific area").
Would the following approach make sense to you with regard to DynamoDB best practices? It is possible that the sort key, for example 'post', becomes a hot spot, but if I query only through ElasticSearch, there should be no problem, right? What would be the best solution?
My tables look like this:
So my thoughts are:
To query all users just select every row with the sort_key 'user'
To get a post with the creator, query the post_id and the sort_key 'post'
In a relational database, the two tables would look like this:

You can use overloaded attributes (using the same attributes for different kinds of items).
Then query with user_id and post_id = 0 for the user info, and with any other post_id for the posts, as sketched below.
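To make that concrete, here is a minimal sketch of such an overloaded single-table layout using the AWS SDK for JavaScript v3; the table name app_table and the key names user_id / post_id are assumptions based on the description above:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = "app_table"; // hypothetical table name

// User metadata lives in the item whose (overloaded) sort key is post_id = 0.
async function getUserInfo(userId: string) {
  const { Items } = await ddb.send(new QueryCommand({
    TableName: TABLE,
    KeyConditionExpression: "user_id = :u AND post_id = :zero",
    ExpressionAttributeValues: { ":u": userId, ":zero": 0 },
  }));
  return Items?.[0];
}

// The user's posts are all items with post_id > 0 under the same partition key.
async function getUserPosts(userId: string) {
  const { Items } = await ddb.send(new QueryCommand({
    TableName: TABLE,
    KeyConditionExpression: "user_id = :u AND post_id > :zero",
    ExpressionAttributeValues: { ":u": userId, ":zero": 0 },
  }));
  return Items ?? [];
}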

I don't think DynamoDB is a good option for that.
Limitations of DynamoDB are listed below:
You can't get more than 1 MB of data per Query or Scan call, so large reads have to be paginated (see the sketch below)
Wildcard search will give you poor performance
Aggregation will be a nightmare, since you are planning for a huge amount of data
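As an illustration of the 1 MB limit, reading a whole table means following LastEvaluatedKey page by page; a minimal sketch with the AWS SDK for JavaScript v3 (table name passed in, no error handling):

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, ScanCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Each Scan (or Query) response is capped at 1 MB, so a full read has to
// follow LastEvaluatedKey until it is no longer returned.
async function scanAll(tableName: string) {
  const items: Record<string, any>[] = [];
  let lastKey: Record<string, any> | undefined;
  do {
    const page = await ddb.send(new ScanCommand({
      TableName: tableName,
      ExclusiveStartKey: lastKey,
    }));
    items.push(...(page.Items ?? []));
    lastKey = page.LastEvaluatedKey;
  } while (lastKey);
  return items;
}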

Related

Solr - schema per user group

Currently I'm developing a user-search application where users can do a full-text search. It should be extremely fast, and there can be a lot of users, like 100,000. There are also around 10,000 user groups. Now I came across Solr and started to implement this, but it seems like I'm failing at the design level.
The requirements:
There is a default schema which is applied to all user groups
Each user is assigned to exactly one user group
A user group can have additional fields (besides the default schema) which should be displayed in the result set (so they can extend the data with custom data)
The search should be extremely fast
How would you realize that application that suits the requirements?
First, I thought about creating a "master core" for the default schema and creating a core for each user group, so that I could join the necessary cores when a user requests the data. But it seems that joining cores in standalone mode would not work because it does not support sharding. However, even if it did work, I'm concerned about the performance of joining at query time.
SolrCloud does seem to support sharding, but again, I would need to join the queries into one result set, which would impact performance again. Additionally, I came across this post Query multiple collections with different fields in solr which says that I would need a merged schema (share-unification) to be able to query across collections/shards. So this would mean: whenever a user group's schema is changed, I would need to change my share-unification. As all user groups' schemas rely on the share-unification, the search would be unavailable because I would need to re-index at least two schemas.
A simple solution would be to put everything into a single core (standalone) or collection (cloud), but this feels overwhelming.
Has someone done something similar before and can give good advice or even a best practice?
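For what it's worth, a rough sketch of the single-collection option mentioned above: every document carries a group_id, and the per-group extra fields are covered by dynamic fields (e.g. *_s, *_i) declared once in the default schema, so no per-group schema change is needed. The collection name, field names, and host below are made up, and the query uses Node's global fetch:

// Hypothetical single-collection search: a full-text query restricted to the
// caller's user group. Per-group custom fields are stored as dynamic fields
// (e.g. price_i, color_s), so they simply come back with the stored fields.
async function searchForGroup(groupId: string, text: string) {
  const params = new URLSearchParams({
    q: text,                    // full-text query against the default search field
    fq: `group_id:${groupId}`,  // filter query, cached by Solr per group
    rows: "20",
  });
  const res = await fetch(`http://localhost:8983/solr/users/select?${params}`);
  const json = await res.json();
  return json.response.docs;
}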

Apollo GraphQL DataLoader DynamoDb

I'm new to GraphQL and am reading about the N+1 issue and the dataloader pattern to increase performance. I'm looking at starting a new GraphQL project with DynamoDB for the database. I've done some initial research and found a couple of small NPM packages for dataloader and DynamoDB, but they do not seem to be actively supported. So it seems to me, from my initial research, that DynamoDB may not be the best choice for supporting an Apollo GraphQL app.
Is it possible to implement the dataloader pattern against a DynamoDB database?
Dataloader doesn't care what kind of database you have. All that really matters is that there's some way to batch up your operations.
For example, for fetching a single entity by its ID, with SQL you'd have some query that's a bit like this:
select * from product where id = SOME_ID_1
The batch equivalent of this might be an IN query, as follows:
select * from product where id in (SOME_ID_1, SOME_ID_2, SOME_ID_3)
The actual mechanism for single vs. batch querying is going to vary depending on what database you're using; it may not always be possible, but it usually is. A quick search shows that DynamoDB has BatchGetItem, which might be what you need.
Batching up queries that take additional parameters (such as pagination, or complex filtering) can be more challenging and may or may not be worth investing the effort. But batching anything that looks like "get X by ID" is always worth it.
In terms of finding libraries that support Dataloader and DynamoDB in particular, I wouldn't worry about it. You don't need this level of tooling. As long as there's some way of constructing the database query, and you can put it inside a function that takes an array of IDs and returns a result in the right shape, you can do it -- and this usually isn't complicated enough to justify adding another library.
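As a minimal sketch rather than a definitive implementation, a DataLoader batch function backed by BatchGetItem could look like this; the table name product, the key attribute id, and the Product shape are assumptions:

import DataLoader from "dataloader";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, BatchGetCommand } from "@aws-sdk/lib-dynamodb";

interface Product {
  id: string;
  [attr: string]: unknown;
}

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// The batch function receives the IDs collected during one tick and must
// return results in the same order as the requested keys.
const productLoader = new DataLoader<string, Product | null>(async (ids) => {
  const result = await ddb.send(new BatchGetCommand({
    RequestItems: {
      product: { Keys: ids.map((id) => ({ id })) },
    },
  }));
  // A production version would also retry result.UnprocessedKeys.
  const items = (result.Responses?.product ?? []) as Product[];
  const byId = new Map<string, Product>();
  for (const item of items) byId.set(item.id, item);
  return ids.map((id) => byId.get(id) ?? null);
}, { maxBatchSize: 100 }); // BatchGetItem accepts at most 100 keys per call

Resolvers then just call productLoader.load(id), and DataLoader takes care of coalescing the individual loads into one BatchGetItem call.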

Using AWS Appsync with DynamoDB, should you model relationships by storing "redundant copies" of related data on the same table (denormalization)?

I was recently reading through this section in the ElasticSearch documentation (or the guide, to be more precise). It says that you should try to use a non-relational database the way it is intended to be used, meaning you should avoid joins between different tables because these databases are not designed to handle them well. This also reminds me of the section in the DynamoDB docs stating that most well-designed DynamoDB backends only require one table.
Let's take as an example a recipes database where each recipe is using several ingredients. Every ingredient can be used in many different recipes.
Option 1: The obvious way to me to model this in AppSync and DynamoDB would be to start with an ingredients table which has one item per ingredient storing all the ingredient data, with the ingredient id as partition key. Then I have another recipes table with the partition key recipe id and an ingredients field storing all the ingredient ids in an array. In AppSync I could then query a recipe by doing a GetItem request by recipe id and then resolving the ingredients field with a BatchGetItem on the ingredients table. Let's say a recipe contains 10 ingredients on average, so this would mean 11 GetItem requests sent to the DynamoDB tables.
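Outside of AppSync's resolvers, Option 1 in application code might look roughly like this (table and attribute names are assumed from the description; note that the 10 ingredient lookups go out as a single BatchGetItem round trip, even though DynamoDB bills them as 10 reads):

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand, BatchGetCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Option 1: one GetItem for the recipe, then one BatchGetItem for all
// ingredients referenced in its ingredients array.
async function getRecipeWithIngredients(recipeId: string) {
  const recipe = (await ddb.send(new GetCommand({
    TableName: "recipes",
    Key: { recipe_id: recipeId },
  }))).Item;
  if (!recipe) return null;

  const batch = await ddb.send(new BatchGetCommand({
    RequestItems: {
      ingredients: {
        Keys: (recipe.ingredients as string[]).map((id) => ({ ingredient_id: id })),
      },
    },
  }));
  return { ...recipe, ingredients: batch.Responses?.ingredients ?? [] };
}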
Option 2: I would consider this a "join like" operation which is apparently not the ideal way to use non-relational databases. So, alternatively, I could do the following: make "redundant copies" of all the ingredient data on the recipes table and not only save the ingredient id there, but also all the other data from the ingredients table. This could drastically increase disk space usage, but apparently disk space is cheap and the increase in performance by only doing 1 GetItem request (instead of 11) could be worth it. As discussed later in the ElasticSearch guide, this would also require some extra work to handle concurrency when ingredient data is updated. So I would probably have to use a DynamoDB stream to update all the data in the recipes table as well when an ingredient is updated. This would require an expensive Scan to find all the recipes using the updated ingredient and a BatchWrite to update all these items. (An ingredient update might be rare though, so the increase in read performance might be worth that.)
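A hedged sketch of what that Option 2 update path could look like as a Lambda handler on the ingredients table's stream. Table and attribute names are made up, the recipe items are assumed to also keep a plain ingredient_ids list for the contains() filter, and the Scan and per-item updates would be paginated/batched in a real implementation:

import { DynamoDBStreamEvent } from "aws-lambda";
import { unmarshall } from "@aws-sdk/util-dynamodb";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, ScanCommand, UpdateCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Triggered by the DynamoDB stream on the ingredients table: copy the updated
// ingredient data into every recipe that embeds it.
export async function handler(event: DynamoDBStreamEvent) {
  for (const record of event.Records) {
    if (record.eventName !== "MODIFY" || !record.dynamodb?.NewImage) continue;
    const ingredient = unmarshall(record.dynamodb.NewImage as any);

    // The expensive part: a Scan to find every recipe embedding this ingredient
    // (assumes recipes also store a flat ingredient_ids list).
    const recipes = await ddb.send(new ScanCommand({
      TableName: "recipes",
      FilterExpression: "contains(ingredient_ids, :id)",
      ExpressionAttributeValues: { ":id": ingredient.ingredient_id },
    }));

    for (const recipe of recipes.Items ?? []) {
      const copies = (recipe.ingredients as any[]).map((i) =>
        i.ingredient_id === ingredient.ingredient_id ? ingredient : i);
      await ddb.send(new UpdateCommand({
        TableName: "recipes",
        Key: { recipe_id: recipe.recipe_id },
        UpdateExpression: "SET ingredients = :ings",
        ExpressionAttributeValues: { ":ings": copies },
      }));
    }
  }
}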
I would be interested in hearing your thoughts on this:
Which option would you choose and why?
The second "more non-relational way" to do this seems painful and I am worried that with more levels/relations appearing (for example if users can create menus out of recipes), the resulting complexity could get out of hand quickly when I have to save "redundant copies" of the same data multiple times. I don't know much about relational databases, but these things seem much simpler there when every data has its unique location and that's it (I guess that's what "normalization" means).
Is getRecipe in the Option 1 really 11 times more expensive (performance and cost wise) than in Option 2? Or do I misunderstand something?
Would Option 1 be a cheaper operation in a relational database (e.g. MySQL) than in DynamoDB? Even though it would be a join there, if I understand correctly, in DynamoDB it's also just 11 GetItem operations (the "NoSQL intended way"). Could these still be faster than 1 SQL query?
If I have a very relational data structure, could a non-relational database like DynamoDB be a bad choice? Or is AppSync/GraphQL a way to still make it a viable choice (by allowing Option 1, which is really easy to build)? I read some opinions that constantly working around the missing join capability when querying NoSQL databases and having to do this on the application side is the main reason why it's not a good fit. But AppSync might be a way to solve this problem. Other opinions (including the DynamoDB docs) mention performance issues as the main reason why you should always query just one table.
This is quite late, I know, but might help someone down the road.
Start with an entity relationship diagram as this will help determine your options. Even in NoSQL, there are standard ways of modeling relationships.
Next, define your access patterns. Go through all the CRUDL operations and make sure that for each operation, you can access the specific data for that operation. For example, in your option 1 where ingredients are stored in an array in a field: think through an access pattern where you might need to delete an ingredient in a recipe. To do this, you need to know the index of the item in the array. Therefore, you have to obtain the entire array, find the index of the item, and then issue another call to update the array, taking into account possible race conditions.
Doing this in your application, while possible, is not efficient. You can also code this up in your resolver, but attempting to do so in the Velocity Template Language is not worth the headache, trust me.
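To make that access pattern concrete, here is a rough sketch in application code (table and attribute names are assumptions): read the item to find the element's index, then remove that index with a condition that guards against a concurrent change to the list:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand, UpdateCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Remove one ingredient id from a recipe's ingredients list.
async function removeIngredient(recipeId: string, ingredientId: string) {
  // 1. Read the whole list just to find the element's index.
  const recipe = (await ddb.send(new GetCommand({
    TableName: "recipes",
    Key: { recipe_id: recipeId },
  }))).Item;
  const index = (recipe?.ingredients as string[] | undefined)?.indexOf(ingredientId) ?? -1;
  if (index < 0) return;

  // 2. Remove that index, but only if it still holds the expected value,
  //    which protects against a concurrent writer having changed the list.
  await ddb.send(new UpdateCommand({
    TableName: "recipes",
    Key: { recipe_id: recipeId },
    UpdateExpression: `REMOVE ingredients[${index}]`,
    ConditionExpression: `ingredients[${index}] = :expected`,
    ExpressionAttributeValues: { ":expected": ingredientId },
  }));
}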
The TL;DR is to model your entire application's entity relationship diagram, and think through all the access patterns. If the relationship is one-to-many, you can either denormalize the data, use a composite sort key, or use secondary indexes. If many-to-many, you start getting into adjacency lists and other advanced strategies. Alex DeBrie has some great resources here and here.

Joining Tables in Kibana

Suppose I have a huge database (table a) about employees in a certain department which includes the employee name in addition to many other fields. Now in a different database (or a different table, say table b) I have only two fields: the employee name and his ID. But this table (b) contains entries not only for one department but rather for the whole company. The raw format for both tables is text files, so I parse them with Logstash into Elasticsearch and then visualize the results with Kibana.
Now, after I created several visualizations from table (a) in Kibana where the x-axis shows the employee name, I realize it would be nice if we had the employee IDs instead. Since I know I have this information in table (b), I am searching for some way to tell Kibana to translate the employee name in the graphs generated from table (a) to the employee ID based on table (b). My questions are as follows:
1) Is there a way to do this directly in Kibana? If yes, can we do it if each table is saved in a separate index, or do we have to save them both in the same index?
2) If this cannot be done directly in Kibana and has to be done when indexing the data, is there a way to still parse both text files separately with logstash?
I know Elasticsearch is a non-relational database and therefore is not designed for SQL-like functionality such as joins. However, there should be an equivalent or a workaround. This is just a simple use case, but of course the generic question is how to correlate data from different sources. Otherwise Elasticsearch would honestly not be that powerful.
Similar questions have been asked and answered.
Basically the answer is: no, you can't do joins in Kibana; you have to do them at indexing time. Space is cheap and Elasticsearch handles duplicated data nicely, so just create any fields you need to display at indexing time.
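As a rough illustration of "join at indexing time": if the name-to-ID mapping from table (b) fits in memory, the ID can simply be written into every document from table (a) as it is indexed. The sketch below uses a recent (8.x) Elasticsearch JavaScript client and made-up index/field names; with Logstash itself, the translate filter plugin achieves the same thing with a dictionary file:

import { Client } from "@elastic/elasticsearch";

const es = new Client({ node: "http://localhost:9200" });

// employeeIdsByName is built once from table (b), e.g. parsed from its text file.
async function indexDepartmentRow(
  row: Record<string, unknown>,
  employeeIdsByName: Map<string, string>,
) {
  await es.index({
    index: "department-a",
    document: {
      ...row,
      // Denormalize the ID at index time so Kibana can aggregate on it directly.
      employee_id: employeeIdsByName.get(row.employee_name as string) ?? "unknown",
    },
  });
}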
You might want to give Kibi a try.
The only answer I know of, unfortunately, is to either write your own plug-in or, as we have had to do, downgrade to ES 2.4.1 and install Kibi
(https://siren.solutions/new-release-siren-join-2-4-1-compatible-with-es-2-4-1/)
and then install the kibi join plugin
(http://siren.solutions/relational-joins-for-elasticsearch-the-siren-join-plugin/)
This will allow you to get the relational-DB-style joins you are looking for.

How to access data in Dynamics CRM?

What is the best way, in terms of speed of the platform and maintainability, to access data (read only) on Dynamics CRM 4? I've done all three, but I'm interested in the opinions of the crowd.
Via the API
Via the webservices directly
Via DB calls to the views
...and why?
My thoughts normally center around DB calls to the views but I know there are purists out there.
Given both requirements I'd say you want to call the views. Properly crafted SQL queries will fly.
Going through the API is required if you plan to modify data, but it isn't the fastest approach around because it doesn't allow deep loading of entities. For instance, if you want to look at customers and their orders, you'll have to load both up individually and then join them manually, whereas a SQL query will already have the data joined.
Never mind that the TDS stream is a lot more efficient than the SOAP messages being used by the API and web services.
UPDATE
I should point out, in regard to the views and the CRM database in general: CRM does not optimize the indexes on the tables or views for custom entities (how could it?). So if you have a truckload entity that you look up by destination all the time, you'll need to add an index for that property. Depending upon your application, it could make a huge difference in performance.
I'll add to jake's comment by saying that querying against the tables directly instead of the views (*base & *extensionbase) will be even faster.
In order of speed it'd be:
direct table query
view query
filtered view query
api call
Direct table updates:
I disagree with Jake that all updates must go through the API. The correct statement is that going through the API is the only supported way to do updates. There are in fact several instances where directly modifying the tables is the most reasonable option:
One time imports of large volumes of data while the system is not in operation.
Modification of specific fields across large volumes of data.
I agree that this sort of direct modification should only be a last resort when the performance of the API is unacceptable. However, if you want to modify a boolean field on thousands of records, doing a direct SQL update to the table is a great option.
Relative Speed
I agree with XVargas as far as relative speed.
Unfiltered Views vs Tables: I have not found the performance advantage to be worth the hassle of manually joining the base and extension tables.
Unfiltered views vs Filtered views: I recently was working with a complicated query which took about 15 minutes to run using the filtered views. After switching to the unfiltered views this query ran in about 10 seconds. Looking at the respective query plans, the raw query had 8 operations while the query against the filtered views had over 80 operations.
Unfiltered Views vs API: I have never compared querying through the API against querying views, but I have compared the cost of writing data through the API vs inserting directly through SQL. Importing millions of records through the API can take several days, while the same operation using insert statements might take several minutes. I assume the difference isn't as great during reads but it is probably still large.
