Best practice for handling many-to-many relationships in Elasticsearch? - elasticsearch

I'm pretty sure I know the answer to this question but am looking for confirmation from someone with more Elasticsearch experience than me.
Let's say I've got a database containing Authors and Books. An author can be associated with 0 or more books, and a book can be associated with 1 or more authors. We want users to be able to search on author name to find the author and all his/her books, and we also want them to be able to search on book title to get back its author(s). We know there will be plenty of multi-author books.
Because Elasticsearch only directly supports one level of parent-child relationships, and because children can only have one parent, it seems to me that we need to denormalize the data and use nested objects to establish this relationship. If we modify properties of an author who has published 23 books, we will need to reindex the author record and all 23 of his/her book records.
In my fantasy world, I'd love to have those 23 books each contain an array of author IDs so that I don't have to reindex books when I reindex authors. It seems like this would definitely be possible using Elasticsearch's parent-child support if a book could only have one author, but because of the many-to-many requirement, I have to use nested objects and reindex any related objects whenever anything changes.
Is this correct? It certainly seems like more work (and certainly more updates), but I want to do this the right way, not the "clever" way that introduces complexity and bugs and madness.
Any guidance would be appreciated.

From your question I can safely assume that ES will not be your primary data store. So the key to deciding how to denormalise your many-to-many relationship is working out how, and for what, you will use ES; that is, which queries you expect to build.
Design around those queries and denormalise accordingly. Here are a few pointers:
Denormalising author IDs into the book: would you expect a user to run a search such as "all books for userId=XYZ"? If not, you would rather need the author's name as a multi-field in your Book document.
Duplicate, duplicate and duplicate. Figure out which data will be updated heavily (authors, as books generally do not gain authors after publication). Denormalise the author into books (names, most likely). Also duplicate, into another document type, something like "author_books", which would be a child of authors and can be updated fairly often (again, denormalise the title and other relevant fields so you can search from the author's perspective).
Hope this makes some sense ;)
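To make the duplication concrete, here is a minimal sketch in plain Python of what the two denormalised document shapes might look like, and of which documents have to be rebuilt when an author changes. The sample rows, the field names and the helper functions are all hypothetical, and nothing here talks to Elasticsearch.

```python
# Hypothetical source-of-truth rows, as they might come from the primary database.
authors = {
    1: {"name": "Ann Author"},
    2: {"name": "Bob Writer"},
}
books = {
    10: {"title": "First Book", "author_ids": [1]},
    11: {"title": "Joint Work", "author_ids": [1, 2]},
}


def build_book_doc(book_id):
    """Book document with author names copied in (searchable from the book side)."""
    book = books[book_id]
    return {
        "title": book["title"],
        "authors": [
            {"id": a_id, "name": authors[a_id]["name"]} for a_id in book["author_ids"]
        ],
    }


def build_author_doc(author_id):
    """Author document with book titles copied in (searchable from the author side)."""
    return {
        "name": authors[author_id]["name"],
        "books": [
            {"id": b_id, "title": b["title"]}
            for b_id, b in books.items()
            if author_id in b["author_ids"]
        ],
    }


def docs_to_reindex_for_author(author_id):
    """An author change touches the author document and every book that embeds it."""
    book_ids = [b_id for b_id, b in books.items() if author_id in b["author_ids"]]
    return {
        "author": build_author_doc(author_id),
        "books": {b_id: build_book_doc(b_id) for b_id in book_ids},
    }


# Renaming an author means rebuilding that author's document and all related books.
authors[1]["name"] = "Ann B. Author"
changed = docs_to_reindex_for_author(1)
print(sorted(changed["books"]))  # ids of the book documents that must be rebuilt
```

The cost the question worries about is visible here: one author edit fans out to every book that embeds that author, which is the price paid for being able to search from either side without a join.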

Related

How to index two document types in parent-child relationship in Elasticsearch

I am building a search functionality for two types of related documents, let's call them "blogs" and "posts", respectively a blog website (with a bunch of posts) and the specific posts written in that blog. I'd like to be able to search against both of them. In a relational database (which ES is not), I would have two main tables which would be linked against a foreign key, and I could search the two tables separately or with a join. In Elasticsearch, I am considering a parent-child relationship where "blog" is the parent document, and there are potentially many "post" documents associated with it as the child.
EDIT: I should explain why I want to index them this way. Basically, I want people to be able to search for blogs (the overall series of posts written by the same author), and the search terms might not be in the blog's description alone, but rather in the posts; for instance, a blog about Python might have a general description that talks about python, but the blog posts might talk about django, so if someone searches for "django" I'd like the python blog to come up. Also, I want people to be able to search for specific posts. I also think (prove me wrong!) these need to be separate types of documents because they would have different fields, e.g. a post might have a date field, while a blog would not have that field.
In any case: Ideally, I would like to be able to offer a search function against "blog" which would also search against the "post" text (as the relevant text might be in the post); additionally, I'd like to allow users to search all posts regardless of what blog they are associated with.
What are the best practices for setting this up? From what I can tell, Elasticsearch has removed the ability to have two types of documents on the same index, and parent-child relationships need to be on the same index. With this constraint, it seems like parent-child relationships would only be for relationships between documents of the same type, e.g. if you are indexing people and you can indicate who is a parent and child (literally).
The other option would be to create two indexes, one for blogs (which would include the posts' texts) and a second index which would include only the posts. But my instinct is that this would duplicate a tremendous amount of data, and also a lot more work to keep it updated and in sync with my main relational data store.
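For reference, here is a rough sketch of how the parent-child option might look with a join field, which lets blog and post documents live in one index even though each index has a single mapping. The request bodies are built as plain Python dicts; the field names (relation, body, published) and the id "blog-1" are assumptions made for illustration, and the child document would have to be indexed with its routing set to the parent's id.

```python
import json

# Index body: one mapping holding both blog and post fields, plus a join field.
INDEX_BODY = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "description": {"type": "text"},  # only set on blogs
            "body": {"type": "text"},         # only set on posts
            "published": {"type": "date"},    # only set on posts
            "relation": {"type": "join", "relations": {"blog": "post"}},
        }
    }
}

# A parent document (hypothetical id "blog-1") and one of its children.
blog_doc = {
    "title": "Python notes",
    "description": "All about Python",
    "relation": "blog",
}
post_doc = {
    "title": "Getting started with Django",
    "body": "Django is a web framework for Python.",
    "published": "2020-01-01",
    "relation": {"name": "post", "parent": "blog-1"},
}


def blogs_matching_posts(text):
    """Search body returning blogs that have at least one post matching `text`."""
    return {
        "query": {
            "has_child": {"type": "post", "query": {"match": {"body": text}}}
        }
    }


print(json.dumps(blogs_matching_posts("django"), indent=2))
```

Fields that apply to only one kind of document are simply left unset on the other, so differing fields do not by themselves force separate indexes.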

Are Doctrine relations affecting application performance?

I am working on a Symfony project with a new team, and they decided to stop using Doctrine relations as much as they can because of performance issues.
For instance, I have to store the id of my "relation" instead of using a ManyToOne relation.
But I am wondering if it is a real problem?
The thing is, it changes the way of coding to retrieve information and so on.
The performance issue most likely comes from the fact that queries are not optimised.
If you let Doctrine (the ORM that handles the queries in a Symfony project) build the queries itself (by using findBy(), findAll(), findOneBy(), etc.), it will first fetch what you asked for, then run more queries whenever data from other tables is needed.
Let's take the most common example, a library.
Entities
Book
Author
Shelf
Relations
One Book has one Author, but one Author can have many Books (Book <= ManyToOne => Author)
One Book is stored in one Shelf (Book <= OneToOne => Shelf)
Now if you query a Book, Doctrine will also fetch Shelf as it's a OneToOne relation.
But it won't fetch Author. In your object, you will only have access to book.author.id, as this information is in the Book itself.
Thus, if in your Twig view you do something like {{ book.author.name }}, the information wasn't fetched in the initial query, so Doctrine will run an extra query to fetch the author's data.
To prevent this, you have to customise your query so it gets the required data in one go, like this:
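public function getBookFullData(Book $book)
{
    // Join and select both relations so they are hydrated in the same query.
    $qb = $this->createQueryBuilder('book');
    $qb->addSelect('shelf')
        ->addSelect('author')
        ->join('book.shelf', 'shelf')
        ->join('book.author', 'author')
        // Restrict the result to the book that was passed in.
        ->where('book = :book')
        ->setParameter('book', $book);

    return $qb->getQuery()->getOneOrNullResult();
}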
With this custom query, you can get all the data of one book in one go, thus, Doctrine won't have to do an extra query.
So, while the example is rather simple, I'm sure you can see that in big projects, giving Doctrine free rein will just increase the number of extra queries.
One of my projects, before optimisation, reached 1500 queries per page load...
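The effect is not specific to Doctrine. As a rough illustration, here is a small self-contained sketch using Python's sqlite3 module that counts the statements issued when the author is loaded per book versus joined in up front. The schema, the sample rows and the run() helper are made up for the example; it only mimics what an ORM does when relations are loaded lazily.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (
        id INTEGER PRIMARY KEY,
        title TEXT,
        author_id INTEGER REFERENCES author(id)
    );
    INSERT INTO author VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO book VALUES (1, 'A', 1), (2, 'B', 1), (3, 'C', 2);
    """
)

query_count = 0


def run(sql, params=()):
    """Execute one statement and count it, to mimic a per-page query counter."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


def titles_with_authors_lazy():
    """One query for the books, then one more per book for its author."""
    rows = run("SELECT title, author_id FROM book ORDER BY id")
    return [
        (title, run("SELECT name FROM author WHERE id = ?", (author_id,))[0][0])
        for title, author_id in rows
    ]


def titles_with_authors_joined():
    """A single query that joins the author in, like the custom query above."""
    return run(
        "SELECT book.title, author.name FROM book "
        "JOIN author ON author.id = book.author_id ORDER BY book.id"
    )


query_count = 0
lazy = titles_with_authors_lazy()
lazy_queries = query_count

query_count = 0
joined = titles_with_authors_joined()
joined_queries = query_count

print(lazy_queries, joined_queries)  # 1 + number of books, versus 1
```

Both functions return the same rows; only the number of round trips differs, and that difference grows with the number of books on the page.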
On the other hand, it's not good to ignore relations in a database.
In fact, with foreign keys and the indexes that go with them, the database can usually resolve joins and lookups faster than without.
If you want your app to be as fast as possible, you have to use relations to optimise your database query speed, and optimise your Doctrine queries to avoid a huge number of extra queries.
Lastly, order matters.
Using ORDER BY to fetch parents before children can also greatly reduce the number of queries Doctrine runs on its own.
[SIDE NOTE]
You can also change the fetch mode on your entity annotation to "optimise" Doctrine's pre-made queries.
fetch="EXTRA_LAZY"
fetch="LAZY"
fetch="EAGER"
But it's not smart, and it often doesn't provide what we actually need.
Thus, custom queries are the best choice.

HippoCMS translated documents with shared fields

I am evaluating HippoCMS and am trying to model a schema of Venues. I want to model a document that has non-translatable features such as telephoneNumber and emailAddress, plus translatable features such as description.
How do I model this in HippoCMS? How do I ensure that the non-translated fields are shared between the different translations, to avoid each translated document having its own copy of a value. Obviously no matter which language you are reading a site in, the telephoneNumber shouldn't change.
The only way I have found for the moment is to create a document called Venue and another document called VenueTranslation. Venue would contain the telephoneNumber and VenueTranslation would contain its description and a link back to the Venue document. There would then be VenueTranslation documents for each language.
Is this the correct approach?
That could work, but you will run into usability issues. I'd say it depends on how many venues you plan to enter into the system, how many languages you are targeting, and, in the end, how keen your CMS users are to pick the right Venue document for every VenueTranslation corresponding to a language. I can see how this will quickly become error-prone and cumbersome, but I don't have the numbers.
Regarding the final question, it's neither correct nor incorrect: since the granularity of translations in Hippo is at the document level and not at the field level, you have to do it this way. Your model makes sense but is not well supported in the CMS. This use case is trivial in a CMS that supports the notion of a translatable field.

Parse - How to include array of pointers

I have two tables, RecCategory and Recommendation.
How would I construct an HTTP request to retrieve all the RecCategory entries with their respective recommendations?
https://api.parse.com/1/classes/RecCategory/?include=recommendations results in an error:
{
  "code": 102,
  "error": "field recommendations cannot be included because it is not a pointer to another object"
}
Thanks!
I do not believe this is possible in Parse, and I think it would probably be considered bad database design.
If each Recommendation belongs to only one category, then this is what is known in database terms as a many-to-one scenario, and what you want to do is store the recommendation's category in the table row with it. Then, when you want to list the recommendations of a specific category, you retrieve all recommendations whose category field points to the category you are after.
In other words, remove the "Recommendations" field from the categories table, and then add a "Category" field (of type pointer to category) to the recommendations table. Because each recommendation has only one category, no array is needed.
If, however, you have a many-to-many relationship, where recommendations can come under many categories, then you want to create an intermediate table which pairs up recommendation pointers and category pointers.
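As a sketch of the many-to-one layout described above, the request for one category's recommendations could be built like this. It assumes the pointer field on Recommendation is named category and uses a made-up objectId; the code only constructs the URL and sends nothing, and a real request would also need the usual Parse authentication headers.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse

BASE_URL = "https://api.parse.com/1/classes/Recommendation"


def recommendations_url(category_object_id):
    """URL listing the recommendations whose `category` pointer targets one RecCategory."""
    where = {
        "category": {
            "__type": "Pointer",
            "className": "RecCategory",
            "objectId": category_object_id,
        }
    }
    # `include` works here because `category` is a single pointer, not an array.
    params = {"where": json.dumps(where), "include": "category"}
    return BASE_URL + "?" + urlencode(params)


url = recommendations_url("abc123")  # "abc123" is a placeholder objectId
print(url)
```

Because the constraint is passed as URL-encoded JSON in the where parameter, building it with json.dumps and urlencode avoids hand-escaping the braces and quotes.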
This isn't possible, but there are ways to work around the problem.
You can read more about it here:
https://www.parse.com/questions/can-i-use-include-in-a-query-to-include-all-members-of-a-parserelation-error-102
You might be better off by pulling all the Recommendation objects and organizing them locally by their category.
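Organising them locally could look roughly like the following, assuming each Recommendation carries a category pointer as suggested in the other answer. The sample objects and field names are invented for illustration and stand in for the results list returned by the API.

```python
from collections import defaultdict

# Hypothetical objects shaped like the "results" of a query on Recommendation.
recommendations = [
    {
        "objectId": "r1",
        "title": "Try the soup",
        "category": {"__type": "Pointer", "className": "RecCategory", "objectId": "c1"},
    },
    {
        "objectId": "r2",
        "title": "Visit the museum",
        "category": {"__type": "Pointer", "className": "RecCategory", "objectId": "c2"},
    },
    {
        "objectId": "r3",
        "title": "Order dessert",
        "category": {"__type": "Pointer", "className": "RecCategory", "objectId": "c1"},
    },
]


def group_by_category(recs):
    """Map each category objectId to the list of recommendations pointing at it."""
    grouped = defaultdict(list)
    for rec in recs:
        grouped[rec["category"]["objectId"]].append(rec)
    return dict(grouped)


by_category = group_by_category(recommendations)
for category_id, recs in by_category.items():
    print(category_id, [rec["title"] for rec in recs])
```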

design of car booking application using elasticsearch

I need some help designing a car booking application.
There is a document with information about a car (title, model, brand, info, etc.)
Problems I'm stuck with are:
How to store available booking days? (I suppose I could use nested free date range objects in an array.)
How to store the price per day? (It's possible to have an individual price per day.)
Booking days and prices could change often. So the third question is: how to update them cleverly (partially), so that I don't have to read the document and then store it again? I'm looking at a script solution using the update API (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-update.html), but it looks ugly. Maybe there are other approaches?
Thanks,
Alex
With the introduction of the range datatypes, there is no need to use a real nested object, if that is what you meant.
That might also help you with storing the prices, but those could just be any object, I suppose (it depends on whether you want to search on them as well).
The Update API was made for exactly that use case, so that you do not need to fetch the whole document; that sounds like a plan.
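A minimal sketch of how this might fit together, written as the request bodies you would send. The field name available and the yyyy-MM-dd date format are assumptions, and the bodies are only built as Python dicts, not sent anywhere.

```python
import json

# Index body: availability stored as a date_range field (it may hold several ranges).
CAR_INDEX_BODY = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "available": {"type": "date_range", "format": "yyyy-MM-dd"},
        }
    }
}


def availability_update(ranges):
    """Body for the Update API: a partial document that replaces only `available`."""
    return {"doc": {"available": [{"gte": start, "lte": end} for start, end in ranges]}}


def cars_free_between(start, end):
    """Search body for cars with an availability range covering the whole period."""
    return {
        "query": {
            "range": {
                "available": {"gte": start, "lte": end, "relation": "contains"}
            }
        }
    }


update_body = availability_update([("2024-07-01", "2024-07-10"), ("2024-07-20", "2024-07-31")])
search_body = cars_free_between("2024-07-03", "2024-07-05")
print(json.dumps(update_body))
print(json.dumps(search_body))
```

Note that a partial doc update replaces an array field as a whole, so the caller sends the complete new list of ranges; changing a single range in place is where the scripted form of the Update API comes in.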

Resources