How to correctly denormalize one-to-many indexes coming from multiple sources - elasticsearch

How can I restructure below elastic indexes to be able to search for registrations that had certain mailing events?
In our application we have the Profile entity which can have one to multiple Registration entities.
The registrations index is used in the majority of searches and contains the data we want to return.
Then we have multiple *Events indexes that contain events that relate to profiles.
A simplified version would look like this:
Registrations
- RegistrationId
- ProfileId
- Location
MailEvents
- ProfileId
- Template
- Actions
A simplified search might be: all the registrations in a certain location with any mailevent action for templates starting with "Solar".
Joining like in a classical RDB is an anti-pattern in elastic Db.
We are considering de-normalizing by adding all the various events for profiles to the registrations index? This wil result in an explosion of data in the registrations index.
Nested objects are also bad for searching, so we should somehow make them into arrays. But how?
We have 100's of rows in the events for every related row in registration. The change rates on the event indexes is way higher then the ones on the registration index.
We are considering doing two requests. One for all the *Events indexes, gathering all the profileIds, unique-ing them, then doing one for the registration part with the result of the first one.
It feels wrong and introduces complicated edge cases where there are more results then the max returned rows in the first request or max Terms values in the second.
By searching around I see many people struggling with this and looking for a way to do join queries.
It feels like de-normalizing is the way to go, but what would be the recommended approach?
What other approaches am I missing?

One approach to consider is using Elasticsearch's parent-child relationship, which allows for denormalization in a way that makes it efficient for search. With parent-child, you would make the Registrations index the parent and the MailEvents index the child. This would allow you to store all the MailEvents data directly under each Registration document and would allow for efficient search and retrieval.
Additionally, you could consider using the has_child query to find all Registrations documents that have a certain MailEvent criteria. For example, to find all Registrations with a MailEvent action for templates starting with "Solar", you could write a query like this:
GET /registrations/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"Location": "some_location"
}
},
{
"has_child": {
"type": "mailevents",
"query": {
"bool": {
"must": [
{
"prefix": {
"Template": "Solar"
}
},
{
"exists": {
"field": "Actions"
}
}
]
}
}
}
}
]
}
}
}
This approach would give you the best of both worlds - you'd have denormalized data that's efficient for search and retrieval, while also avoiding the complexities of multiple requests and potential edge cases.
Another approach is to use Elasticsearch's aggregation feature. In this approach, you would perform a single search query on the Registrations index, filtered by the desired location. Then, you would use the ProfileId field to aggregate the data and retrieve the related MailEvents information. You can achieve this by using a nested aggregation, where you group by ProfileId and retrieve the relevant MailEvents data for each profile.
Here's an example query that performs this aggregation:
GET /registrations/_search
{
"query": {
"bool": {
"filter": [
{
"term": {
"Location": "some_location"
}
}
]
}
},
"aggs": {
"profiles": {
"terms": {
"field": "ProfileId"
},
"aggs": {
"mail_events": {
"nested": {
"path": "MailEvents"
},
"aggs": {
"filtered_mail_events": {
"filter": {
"bool": {
"must": [
{
"prefix": {
"MailEvents.Template": "Solar"
}
},
{
"exists": {
"field": "MailEvents.Actions"
}
}
]
}
},
"aggs": {
"actions": {
"terms": {
"field": "MailEvents.Actions"
}
}
}
}
}
}
}
}
}
}
This query will return the Registrations documents that match the desired location, and also provide aggregated information about the related MailEvents data. You can further manipulate the aggregated data to get the information that you need.
Note that this approach can be more complex than the parent-child relationship approach and may have performance implications if your data is large and complex. However, it may be a good solution if you need to perform complex aggregations on the MailEvents data.

As far as I know, the Elasticsearch aggregation function might be another way to do this. You can run search across multiple indices and aggregate the list of profileId from MailEvents and use them to filter Registrations.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
https://discuss.elastic.co/t/aggregation-across-multiple-indices/271350

Related

Single query for terms aggregation with or without nested objects

I want to get an aggregation of all the unique values for a field.
I found that this query works well for a "simple" field with no nested type in the json path.
GET /_search
{
"aggs": {
"myResult": {
"terms": { "field": "school.student.name" }
}
},
"size": 0
}
However, when one of the objects in the path is of nested type, I have to make a different request. Let say student is nested just for the example, then I change the query like this:
GET /_search
{
"aggs": {
"myResult": {
"nested": {
"path": "school.student"
}
"aggs": {
"myResult2": {
"terms": { "field": "school.student.name" }
}
}
}
},
"size": 0
}
However, in my app, I only have the json path such as "school.student.name" but I do not know if there is a nested object, and who is the nested.
I did not find any solution to make a single request that could work to handle both case, and I end up with a configuration file where I have to explicitly list all the nested path that I have in my elastic database, to build the correct request. However, if I have hundreds of fields, this is not maintainable.
Does elastic really do not implement such a simple and basic feature?
I just want to get all unique values corresponding to a json path in my elastic database.
Would you have a better solution than the config file?
No need for configuration file, you should directly retrieve the mapping as it is currently stored in ES, parse it and you'd always have the right fields.

ES: How do quasi-join queries using global aggregation compare to parent-child / nested queries?

At my work, I came across the following pattern for doing quasi-joins in Elasticsearch. I wonder whether this is a good idea, performance-wise.
The pattern:
Connects docs in one index in one-to-many relationship.
Somewhat like ES parent-child, but implemented without it.
Child docs need to be indexed with a field called e.g. "my_parent_id", with value being the parent ID.
Can be used when querying for parent, knowing its ID in advance, to also get the children in the same query.
The query with quasi-join (assume 123 is parent ID):
GET /my-index/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"id": {
"value": 123
}
}
}
]
}
},
"aggs": {
"my-global-agg" : {
"global" : {},
"aggs" : {
"my-filtering-all-but-children": {
"filter": {
"term": {
"my_parent_id": 123
}
},
"aggs": {
"my-returning-children": {
"top_hits": {
"_source": {
"includes": [
"my_child_field1_to_return",
"my_child_field2_to_return"
]
},
"size": 1000
}
}
}
}
}
}
}
}
This query returns:
the parent (as search query result), and
its children (as the aggregation result).
Performance-wise, is the above:
definitively a good idea,
definitively a bad idea,
hard to tell / it depends?
It depends ;-) The idea is good, however, by default the maximum number of hits you can return in a top_hits aggregation is 100, if you try 1000 you'll get an error like this:
Top hits result window is too large, the top hits aggregator [hits]'s from + size must be less than or equal to: [100] but was [1000]. This limit can be set by changing the [index.max_inner_result_window] index level setting.
As the error states, you can increase this limit by changing the index.max_inner_result_window index setting. But, if there's a default, there's usually a good reason. I would take that as a hint that it might not be that great an idea to increase it too much.
So, if your parent documents have less than 100 children, why not, otherwise I'd seriously consider going another approach.

How to specify the execution order of filter and query in an Elasticsearch query

Consider the following query in Elasticsearch:
GET nyc_visionzero/_search
{
"query": {
"bool": {
"must": [{
"fuzzy": {
"on_street_name": "AVENUE"
}
}
],
"filter": {
"term": {
"borough": "MANHATTAN"
}
}
}
}
}
Is the filter part executed first and then fuzzy or its the other way around? What if I want to change the order of their execution! How can I do that?
This question relates to the query vs. filter context topic. Everything in the query context (here query.bool.must) counts to the score of a document whereas the conditions in the filter context (query.filter) are a yes/no decision.
So from a performance perspective, filters are faster and can be cached. On the other side queries allow for some fuzziness.
There is a much more detailed explanation on this in the elasticsearch docs on query and filter context.

Elastic search paginating on multiple types in an index

I have an index with multiple types like below :
songs
books
movies
I am building an API for suggesting indexed items grouped by their type, The problem is that I want a size functionality inside each aggregation, Just like the completion suggester approach which returns an exact number of items for each type. I ended up with multi index query approach to query each type separately, Is there any better approach to handle this ?
Each aggregation you specify can have a filter associated with it, so you could
reduce the context of an aggregation to a specific type that way. Additionally,
you can use the filters aggregation to create buckets for each filter, and run
an aggregation with a certain size on each sub-bucket, like this:
GET /_search
{
"aggs": {
"alltypes": {
"filters": {
"filters": {
"songs": {"term": {"_type": "songs"}},
"books": {"term": {"_type": "books"}},
"movies": {"term": {"_type": "movies"}}
}
},
"aggs": {
... your aggregation for each individual type here ...
}
}
}
}
More info about the filters aggregation can be found at
http://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-filters-aggregation.html
Hopefully that helps, let me know if I misunderstood your question (it was a
little uncertain whether you were talking about suggestors or aggregations since
both were mentioned in the question).

Elasticsearch and aggregation of subqueries

I know that elasticsearch allows sub-aggregations (ie. nested aggregation), however I would like to apply aggregation on the result of "first" aggregation (or in generic any query - aggregation or not).
Concrete example: I log events about user actions (for simplicity I have documents with user_id and action). I can make a query that counts number of actions executed by each user. However I would like to find out percentage (or count) of "active users" (e.g. users that have executed more than 10 actions). Ideal result would be a histogram over all users showing how active the users are.
Is there a way how to create such query? Or is there any other approach I can take other than store aggregated results of subquery and compute the histogram out of that?
Note: I have seen Elastic Search and "sub queries" question, but it was about something else and it is over one and half year old and elasticsearch is being actively developed.
Additionally it seems that in version 1.4 there will be available scripted metric aggregation, but anyway that would require to store counter for every user until reduce phase. And some "approximate solution" is good for me - similar to what ES uses internally for its aggregations.
Here is the query I have used, notice the "min_doc_count" in the aggregation.
{
"query": {
"filtered": {
"filter": {
"and": [
{ "term" : { "name": "did x" } },
{ "range": { "created_at": { "gte": "now-7d", "lte": "now" } } }
]
}
}
},
"aggregations": {
"my_agg": {
"terms": {
"field": "user_id",
"min_doc_count": 10,
"size": 0
}
}
}
}
This query returns the list of buckets (users) with more than 9 events in the specified time period. Just 'count' results to get the number of active users.
I have tested this approach with thousands of events and it works well. At a certain scale you will have to use Hadoop.

Resources