How can I model social connections in ElasticSearch?

Given the ElasticSearch NoSQL db, I'm trying to figure out how to best model social relationship data (yes, a graph db would be the best tool for the job, but in my current situation this choice might be forced upon me).
I'm new to ElasticSearch, and am reviewing ways to model relationships, but they don't seem to fit a use case for social connections, or at least it's not apparent to me how these would be modeled.
A greatly simplified version of my requirements is as follows:
People have IDs, names, and work place (they might not have a work place)
People can have friendship relationships with other people (and a date of friendship creation)
People can block other people from talking to them (directionality matters, as only the one who blocked can unblock)
People can work at the same work place
Things we're likely to query:
Give me all the people I'm friends with (given my ID)
Give me all the people I work with (given my ID)
Give me the union of the above 2, and the names and ids of their work places, but not those I've blocked or who have blocked me.
Give me all the friends who have a work place in the city where I work.
While the queries seem like they could be a challenge, I'm more interested in simply modeling people, work places, and the relationships between them in ElasticSearch in such a way that it makes sense, is maintainable, and that could support queries like these.
Documentation tells me ElasticSearch doesn't have joins. It has nested objects, and parent-child relationships, but neither of these seems like a fit for friendship relationships between people; both nested objects and parent-child have an implicit concept of single-ownership...unless I start duplicating people data everywhere, both in other people objects (for friends and for blocked) and in work places. That of course introduces the problem of keeping data consistent, as changing person data needs to change their duplicated data everywhere, and removing a friendship relationship must remove the other side of that relationship with the other person. This also brings up the issue of transactions, as I've heard that transactional support across different documents isn't supported.
Aside from denormalization and duplication, or application-side joining outside of the db, are there any better ways (aside from using a different DB) to model this in a sane way that's easier to query?

Sample simplified json with some explanation afterward:
{
    "type": "person",
    "id": 1,
    "name": "InverseFalcon",
    "workplace": "StackOverflow",
    "friend_ids": [3, 4, 19],
    "blocked_ids": [45, 24],
    "blocked_by_ids": [5]
}
This should be lightning fast: you retrieve the document, work your sets (union, intersection, etc.), and then perform a multi-get (mget) to retrieve the names and work places. Not using a graph database means recursive calls to get friends of friends, etc.
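To make that concrete, here is a rough sketch of that application-side join using the official Python client. The index name "people", the helper function, the assumption that each document's _id equals its "id" field, and the assumption that "workplace" is mapped as a keyword field are all mine, not from the question.

# Sketch only: elasticsearch-py 8.x, hypothetical "people" index.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def connections(person_id):
    # 1. Fetch my own document.
    me = es.get(index="people", id=person_id)["_source"]

    # 2. People I work with: a term query on my workplace
    #    (assumes "workplace" is a keyword field).
    resp = es.search(index="people",
                     query={"term": {"workplace": me["workplace"]}})
    coworker_ids = {hit["_source"]["id"] for hit in resp["hits"]["hits"]}

    # 3. Set arithmetic in the application: friends plus coworkers,
    #    minus anyone I blocked, anyone who blocked me, and myself.
    wanted = (set(me["friend_ids"]) | coworker_ids) \
        - set(me["blocked_ids"]) - set(me["blocked_by_ids"]) - {person_id}

    # 4. One multi-get to pull back names and workplaces.
    docs = es.mget(index="people", ids=sorted(wanted))["docs"]
    return [(d["_source"]["name"], d["_source"].get("workplace"))
            for d in docs if d.get("found")]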

Related

Is it OK to have multiple merge steps in an Excel Power Query?

I have data from multiple sources - a combination of Excel (table and non table), csv and, sometimes, even a tsv.
I create queries for each data source and then bring them together one step at a time, or actually two steps per source: a merge, and then an expand to bring in the fields I want from each data source.
This doesn't feel very efficient and I think that maybe I should be just joining everything together in the Data Model. The problem when I did that was that I couldn't then find a way to write a single query to access all the different fields spread across the different data sources.
If it were Access, I'd have no trouble creating a single query once I'd created all my relationships between my tables.
I feel as though I'm missing something: How can I build a single query out of the data model?
Hoping my question is clear. It feels like something that should be easy to do but I can't home in on it with a Google search.
It is never a good idea to push the heavy lifting downstream in Power Query. If you can, work with database views, not full tables, use a modular approach (several smaller queries that you then connect in the data model), filter early, remove unneeded columns etc.
The more work that has to be performed on data you don't really need, the slower the query will be. Please take a look at this article and this one, the latter one having a comprehensive list for Best Practices (you can also just do a search for that term, there are plenty).
In terms of creating a query from the data model, conceptually that makes little sense, as you could conceivably create circular references galore.

Using AWS Appsync with DynamoDB, should you model relationships by storing "redundant copies" of related data on the same table (denormalization)?

I was recently reading through this section in the ElasticSearch documentation (or the guide, to be more precise). It says that you should try to use a non-relational database the intended way, meaning you should avoid joins between different tables because they are not designed to handle those well. This also reminds me of the section in the DynamoDB docs stating that most well-designed DynamoDB backends only require one table.
Let's take as an example a recipes database where each recipe is using several ingredients. Every ingredient can be used in many different recipes.
Option 1: The obvious way to me to model this in AppSync and DynamoDB would be to start with an ingredients table which has one item per ingredient storing all the ingredient data, with the ingredient id as partition key. Then I have another recipes table with the partition key recipe id and an ingredients field storing all the ingredient ids in an array. In AppSync I could then query a recipe by doing a GetItem request by recipe id and then resolving the ingredients field with a BatchGetItem on the ingredients table. Let's say a recipe contains 10 ingredients on average, so this would mean 11 GetItem requests sent to the DynamoDB tables.
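As a rough illustration of Option 1 with boto3, the AWS SDK for Python (the table and attribute names below follow the description above and are otherwise hypothetical):

# Sketch only: resolve a recipe and its ingredients in 2 requests
# (1 GetItem + 1 BatchGetItem) rather than 11 separate GetItems.
import boto3

dynamodb = boto3.client("dynamodb")

def get_recipe(recipe_id):
    # 1. GetItem for the recipe itself.
    recipe = dynamodb.get_item(
        TableName="recipes",
        Key={"recipe_id": {"S": recipe_id}},
    )["Item"]

    # 2. Resolve the ingredient ids with a single BatchGetItem
    #    (one request, up to 100 keys).
    ingredient_ids = [i["S"] for i in recipe["ingredients"]["L"]]
    batch = dynamodb.batch_get_item(
        RequestItems={
            "ingredients": {
                "Keys": [{"ingredient_id": {"S": i}} for i in ingredient_ids]
            }
        }
    )
    recipe["resolved_ingredients"] = batch["Responses"]["ingredients"]
    return recipe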
Option 2: I would consider this a "join like" operation which is apparently not the ideal way to use non-relational databases. So, alternatively I could do the following: Make "redundant copies" of all the ingredient data on the recipes table and not only save the ingredient id there, but also all the other data from the ingredients table. This could drastically increase disk space usage, but apparently disk space is cheap and the increase in performance by only doing 1 GetItem request (instead of 11) could be worth it. As discussed later in the ElasticSearch guide this would also require some extra work to ensure concurrency when ingredient data is updated. So I would probably have to use a DynamoDB stream to update all the data in the recipes table as well when an ingredient is updated. This would require an expensive Scan to find all the recipes using the updated ingredient and a BatchWrite to update all these items. (An ingredient update might be rare though, so the increase in read performance might be worth that.)
I would be interested in hearing your thoughts on this:
Which option would you choose and why?
The second "more non-relational way" to do this seems painful and I am worried that with more levels/relations appearing (for example if users can create menus out of recipes), the resulting complexity could get out of hand quickly when I have to save "redundant copies" of the same data multiple times. I don't know much about relational databases, but these things seem much simpler there when every data has its unique location and that's it (I guess that's what "normalization" means).
Is getRecipe in the Option 1 really 11 times more expensive (performance and cost wise) than in Option 2? Or do I misunderstand something?
Would Option 1 be a cheaper operation in a relational database (e.g. MySQL) than in DynamoDB? Even though it's a join if I understand correctly, it's also just 11 ("NoSQL intended way") GetItem operations. Could this still be faster than 1 SQL query?
If I have a very relational data structure could a non-relational database like DynamoDB be a bad choice? Or is AppSync/GraphQL a way to still make it a viable choice (by allowing Option 1 which is really easy to build)? I read some opinions that constantly working around the missing join capability when querying NoSQL databases and having to do this on the application side is the main reason why it's not a good fit. But AppSync might be a way to solve this problem. Other opinions (including the DynamoDB docs) mention performance issues as the main reason why you should always query just one table.
This is quite late, I know, but might help someone down the road.
Start with an entity relationship diagram as this will help determine your options. Even in NoSQL, there are standard ways of modeling relationships.
Next, define your access patterns. Go through all the CRUDL operations and make sure that for each operation, you can access the specific data for that operation. For example, in your option 1 where ingredients are stored in an array in a field: think through an access pattern where you might need to delete an ingredient in a recipe. To do this, you need to know the index of the item in the array. Therefore, you have to obtain the entire array, find the index of the item, and then issue another call to update the array, taking into account possible race conditions.
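A sketch of what that read-modify-write looks like with boto3, just to show the moving parts; the table and attribute names are hypothetical and the condition expression is one way to guard against the race condition mentioned above:

# Sketch only: deleting one ingredient from a list attribute requires
# reading the item, finding the index, and a conditional update.
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("recipes")

def remove_ingredient(recipe_id, ingredient_id):
    # 1. Read the whole array to find the element's index.
    item = table.get_item(Key={"recipe_id": recipe_id})["Item"]
    index = item["ingredients"].index(ingredient_id)

    # 2. Remove by index, but only if that slot still holds the same
    #    ingredient (a concurrent writer may have shifted the array).
    try:
        table.update_item(
            Key={"recipe_id": recipe_id},
            UpdateExpression=f"REMOVE ingredients[{index}]",
            ConditionExpression=f"ingredients[{index}] = :ing",
            ExpressionAttributeValues={":ing": ingredient_id},
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            raise RuntimeError("ingredients changed concurrently; retry") from err
        raise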
Doing this in your application, while possible, is not efficient. You can also code this up in your resolver, but attempting to do so using the Velocity Template Language is not worth the headache, trust me.
The TL;DR is to model your entire application's entity relationship diagram, and think through all the access patterns. If the relationship is one-to-many, you can either denormalize the data, use a composite sort key, or use secondary indexes. If many-to-many, you start getting into adjacency lists and other advanced strategies. Alex DeBrie has some great resources here and here.
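And a very rough sketch of the adjacency-list shape for the recipe/ingredient many-to-many (single table, generic PK/SK, plus an inverted index); every name here is illustrative, and the pattern itself is covered in depth in the DeBrie material mentioned above:

# Sketch only: one table holds entity items and "edge" items.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("app")

# Entity items plus one edge item per recipe-ingredient relationship.
table.put_item(Item={"PK": "RECIPE#42", "SK": "RECIPE#42", "title": "Pancakes"})
table.put_item(Item={"PK": "INGREDIENT#7", "SK": "INGREDIENT#7", "name": "Flour"})
table.put_item(Item={"PK": "RECIPE#42", "SK": "INGREDIENT#7", "quantity": "200g"})

# Recipe -> ingredients: one Query on the partition key.
ingredients_of_recipe = table.query(
    KeyConditionExpression=Key("PK").eq("RECIPE#42")
    & Key("SK").begins_with("INGREDIENT#")
)["Items"]

# Ingredient -> recipes: query a GSI that inverts the key pair
# (the "SK-PK-index" would have to be defined on the table itself).
recipes_with_ingredient = table.query(
    IndexName="SK-PK-index",
    KeyConditionExpression=Key("SK").eq("INGREDIENT#7")
    & Key("PK").begins_with("RECIPE#"),
)["Items"]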

Exploring database typing scheme at run time

We have built a database in SQL Server using two patterns we found in Len Silverston's Data Model Resource Book Vol. 3, one for Types and Categories: Classification and another for Hierarchies, Aggregations, and Peer-to-Peer relationships. Essentially, it is implemented as follows:
Classification
[Entity] (M x M) [EntityType] (M x M) [EntityTypeType]
...where the (M x M) is a many-to-many relationship implemented in the database by a Classification table.
Association
The entity and type tables above also have their own typed Rollups:
[Entity] (M x M) [Entity]
[EntityType] (M x M) [EntityType]
[EntityTypeType] (M x M) [EntityTypeType]
...where each (M x M) is a many-to-many relationship implemented in a Rollup table with a foreign key to the type of rollup/association.
The resulting data structure gives us tremendous expressive ability in terms of describing both our entities and their relationships to one another. However, we're having trouble taking advantage of this expressiveness in our applications. The primary issue is that in spite of the advances in EF 4 & 5, M-2-M relationships are still tricky, and beyond that we're trying to access the M-2-M's in at least 2 directions whenever we hit the database. It is especially complicated by:
We subtype both [Entity] and some subtypes of [Entity].
All of the M2M tables - all the classification and rollup/association tables - have a payload that contains at least a From and Thru date. The rollups also contain at least a rollup type.
We don't want to have to load large, distant portions of the typing schema (EntityTypeType tables and their roll-ups) in order to interpret the data at runtime every time we access entities.
Technologies:
SQL Server 2008 R2
Entity Framework 5
.NET 4.5
MVC 4 (in the web app portion, we also have some Console Apps)
Questions about the model itself:
Is this simply an unworkable data model in .NET?
Should we first flatten our database into more .NET friendly views that essentially model our business objects?
Questions about the typing scheme - bear in mind that the types are pretty static:
Should we scaffold the [EntityType] and [EntityTypeType] tables, their classifications, and their rollups into C# classes? This would work similarly to enum scaffolders, only we need more than a name/int since these carry date-range and type payloads. If so, what are some ideas for how to scaffold those files - as static classes? Hard-coded object lists?
Should we instead cache the typing scheme at start-up (this bothers me, because it adds a lot of overhead to starting up the Console Apps)?
Any other ideas - scaffolded XML Files? etc...
Any ideas or experiences are much appreciated!
I tried to answer each question, but I have to admit I'm not sure if you are trying to dynamically create entities on top of a database at run-time - or if you're just trying to create entities dynamically before run-time.
If you're trying to release code that dynamically changes/adjusts when the schema in SQL Server is changed, then I would have some different answers. =)
Is this simply an unworkable data model in .NET?
Some things you mentioned that stood out to me:
Lots of M x M relationships.
Entity/EntityType/EntityTypeType
Rollups
Some questions I have after reading:
Did you guys pick a framework for modeling data in the hopes it would make everything easier?
Did you pick a framework because it seemed like the "right" way to do it?
I have a hard time following how you've modeled the data. What is an EntityTypeType exactly?
Are all the M x M relationships really needed? Just because Entity A and Entity B can be in a M x M relationship, should they?
I don't know your domain, but I know I have a hard time following what you've described. =) In my opinion it has already become somewhat unworkable for two reasons: 1) you're asking on SO about it, and 2) it's not easy to describe without a lot of text.
Should we first flatten our database into more .NET friendly views that essentially model our business objects?
Yes!
At least from my experience I would say yes. =)
I think it's ok to create entities in the database that have complex relationships. Parent/child, peer to peer, hierarchical, etc. All fine and good and normal.
But the application shouldn't have to interpret all of that to make sense of it. The database is good at doing joins, grouping data, creating views, etc. I would advise creating views or stored procedures that get as close to your business objects as possible - at least for the reads.
Also consider that if you push the complexity of relating objects to the application, you might pay some performance penalties.
Should we scaffold the [EntityType] and [EntityTypeType] tables, their classifications, and their rollups into C# classes?
I would advise against this. The only reason is that now you are doing database work in the application layer. If you want to do joins/rollups/etc. you're managing that data in the application - not the database.
I'm sure you already know this, but you want to avoid bringing back everything in the database into the application and then running queries/building objects.
SQL Server is awesome at relating objects and pulling together different entities. It'll be very fast at that layer.
I would create business objects and populate those from SQL.
If you needed to scaffold EntityType/EntityTypeType, just be careful you aren't running into N+1 issues.
What are some ideas for how to scaffold those files - as static classes? Hard-coded object lists?
Options:
Use an ORM to generate the classes for you. You've already seen some of that with EF.
Write your own code generation tool by looking at the SQL Server schemas for your entities.
Write classes by hand that aren't generated. I see this done all the time and it's not as bad as you think. Definitely some grunt work, but also gives flexibility.
Should we instead cache the typing scheme at start-up (this bothers me, because it adds a lot of overhead to starting up the Console Apps)?
You want to generate the business objects based on your entities at application load? Or another way of phrasing the question ... generate proxy classes at runtime?
I am not familiar with Len Silverston's book, and I assume neither is the vast majority of StackOverflow users. Since you didn't get answers to your question within 3 days, I'll now comment on it anyhow.
What strikes me as odd in your database model is that you seem to have multiple nested N:M relationships. There rarely ever are true N:M relationships - most of the time the intermediate table in a N:M relational model actually represents an object that exists in real life, leaving you with two regular 1:N relationships and more than just two columns in the linking table. I've only come across one or two true N:M relations in the last 25 years, so I kind of doubt you've actually got three here - it seems more likely that you're using a pattern that does not apply in the way you think it does.
What you describe smells a bit like you're trying to model a database in a database. Especially the "Should we first flatten our database into more .NET friendly views that essentially model our business objects?" sounds fishy. With very few exceptions, your database tables should match your business objects to begin with. If you need to create a flattened view to see your actual objects, you might be heading entirely in the wrong direction. But then again, it's hard to tell from the information you gave.
Please give a less abstract description of what you're trying to do, what you've done so far.

JOIN/query across separate entities.

What are the current options for querying and joining two different Entity Data Models?
I've seen that it's possible to share a single model schema between multiple mapping and storage schemas, but it seems clunky and not encouraged.
The other option I can think of is to query the entities separately and then join the linq objects, but I'm not sure how I feel about dumping all of that into memory.
Does anyone have any better solutions?
The two options you list are the only ones I'm aware of. The former is harder than using a single model, but I wouldn't say "not encouraged." It falls into the unfortunately broad category of "supported Entity Framework features with no support in the GUI designer." The latter option is actually not so bad if you can retrieve only what you need, but will result in retrieving entities from two separate ObjectContexts, which could be awkward if you update. That said, updating objects in multiple contexts, potentially from different databases, is tricky no matter how you do it.
The Entity Framework team had mentioned working on better solutions for the future, but this is a weak point today, and I don't think it's going to change much in v4.

How to stop thinking "relationally"

At work, we recently started a project using CouchDB (a document-oriented database). I've been having a hard time un-learning all of my relational db knowledge.
I was wondering how some of you overcame this obstacle? How did you stop thinking relationally and start thinking documentally (I apologise for making up that word)?
Any suggestions? Helpful hints?
Edit: If it makes any difference, we're using Ruby & CouchPotato to connect to the database.
Edit 2: SO was hassling me to accept an answer. I chose the one that helped me learn the most, I think. However, there's no real "correct" answer, I suppose.
I think, after perusing a couple of pages on this subject, it all depends on the types of data you are dealing with.
RDBMSes represent a top-down approach, where you, the database designer, assert the structure of all data that will exist in the database. You define that a Person has a First, Last, and Middle Name and a Home Address, etc. You can enforce this using a RDBMS. If you don't have a column for a Person's HomePlanet, tough luck wanna-be-Person that has a different HomePlanet than Earth; you'll have to add a column in at a later date or the data can't be stored in the RDBMS. Most programmers make assumptions like this in their apps anyway, so this isn't a dumb thing to assume and enforce. Defining things can be good. But if you need to log additional attributes in the future, you'll have to add them in. The relational model assumes that your data attributes won't change much.
"Cloud" type databases using something like MapReduce, in your case CouchDB, do not make the above assumption, and instead look at data from the bottom-up. Data is input in documents, which could have any number of varying attributes. It assumes that your data, by its very definition, is diverse in the types of attributes it could have. It says, "I just know that I have this document in database Person that has a HomePlanet attribute of "Eternium" and a FirstName of "Lord Nibbler" but no LastName." This model fits webpages: all webpages are a document, but the actual contents/tags/keys of the document vary soo widely that you can't fit them into the rigid structure that the DBMS pontificates from upon high. This is why Google thinks the MapReduce model roxors soxors, because Google's data set is so diverse it needs to build in for ambiguity from the get-go, and due to the massive data sets be able to utilize parallel processing (which MapReduce makes trivial). The document-database model assumes that your data's attributes may/will change a lot or be very diverse with "gaps" and lots of sparsely populated columns that one might find if the data was stored in a relational database. While you could use an RDBMS to store data like this, it would get ugly really fast.
To answer your question then: you can't think "relationally" at all when looking at a database that uses the MapReduce paradigm. Because, it doesn't actually have an enforced relation. It's a conceptual hump you'll just have to get over.
A good article I ran into that compares and contrasts the two databases pretty well is MapReduce: A Major Step Backwards, which argues that MapReduce paradigm databases are a technological step backwards, and are inferior to RDBMSes. I have to disagree with the thesis of the author and would submit that the database designer would simply have to select the right one for his/her situation.
It's all about the data. If you have data which makes most sense relationally, a document store may not be useful. A typical document-based system is a search server: you have a huge data set and want to find a specific item/document, and the document is static, or versioned.
In an archive type situation, the documents might literally be documents that don't change and have very flexible structures. It doesn't make sense to store their metadata in a relational database, since they are all very different and very few documents share the same tags. Document-based systems don't store null values.
Non-relational/document-like data makes sense when denormalized. It doesn't change much or you don't care as much about consistency.
If your use case fits a relational model well then it's probably not worth squeezing it into a document model.
Here's a good article about non relational databases.
Another way of thinking about it is, a document is a row. Everything about a document is in that row and it is specific to that document. Rows are easy to split on, so scaling is easier.
In CouchDB, like Lotus Notes, you really shouldn't think about a Document as being analogous to a row.
Instead, a Document is a relation (table).
Each document has a number of rows--the field values:
ValueID(PK) Document ID(FK) Field Name Field Value
========================================================
92834756293 MyDocument First Name Richard
92834756294 MyDocument States Lived In TX
92834756295 MyDocument States Lived In KY
Each View is a cross-tab query that selects across a massive UNION ALL of every Document.
So, it's still relational, but not in the most intuitive sense, and not in the sense that matters most: good data management practices.
Document-oriented databases do not reject the concept of relations, they just sometimes let applications dereference the links (CouchDB) or even have direct support for relations between documents (MongoDB). What's more important is that DODBs are schema-less. In table-based storages this property can be achieved with significant overhead (see answer by richardtallent), but here it's done more efficiently. What we really should learn when switching from a RDBMS to a DODB is to forget about tables and to start thinking about data. That's what sheepsimulator calls the "bottom-up" approach. It's an ever-evolving schema, not a predefined Procrustean bed. Of course this does not mean that schemata should be completely abandoned in any form. Your application must interpret the data, somehow constrain its form -- this can be done by organizing documents into collections, by making models with validation methods -- but this is now the application's job.
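As a small illustration of pushing that responsibility into the application layer, here is a minimal sketch of a model with a validation method wrapped around an otherwise schema-less document; the Person model and field names are made up for the example, not taken from any library:

# Sketch only: the document store accepts anything, the application
# model enforces only what the application actually needs.
from dataclasses import dataclass, field

@dataclass
class Person:
    first_name: str
    home_planet: str = "Earth"                    # default instead of a rigid column
    extra: dict = field(default_factory=dict)     # whatever else the document carries

    @classmethod
    def from_doc(cls, doc: dict) -> "Person":
        if "first_name" not in doc:
            raise ValueError("document is missing first_name")  # app-level constraint
        known = {"first_name", "home_planet"}
        return cls(
            first_name=doc["first_name"],
            home_planet=doc.get("home_planet", "Earth"),
            extra={k: v for k, v in doc.items() if k not in known},
        )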
Maybe you should read this:
http://books.couchdb.org/relax/getting-started
I've only just heard of it myself, and it's interesting, but I have no idea how to implement it in a real-world application ;)
One thing you can try is getting a copy of Firefox and Firebug, and playing with the map and reduce functions in JavaScript. They're actually quite cool and fun, and appear to be the basis of how to get things done in CouchDB.
Here's Joel's little article on the subject: http://www.joelonsoftware.com/items/2006/08/01.html
