Elasticsearch: paginating on multiple types in an index

I have an index with multiple types, like below:
songs
books
movies
I am building an API for suggesting indexed items grouped by their type. The problem is that I want size functionality inside each aggregation, just like the completion suggester approach, which returns an exact number of items for each type. I ended up using a multi-index query approach to query each type separately. Is there a better approach to handle this?

Each aggregation you specify can have a filter associated with it, so you could
reduce the context of an aggregation to a specific type that way. Additionally,
you can use the filters aggregation to create buckets for each filter, and run
an aggregation with a certain size on each sub-bucket, like this:
GET /_search
{
  "aggs": {
    "alltypes": {
      "filters": {
        "filters": {
          "songs": {"term": {"_type": "songs"}},
          "books": {"term": {"_type": "books"}},
          "movies": {"term": {"_type": "movies"}}
        }
      },
      "aggs": {
        ... your aggregation for each individual type here ...
      }
    }
  }
}
More info about the filters aggregation can be found at
http://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-filters-aggregation.html
Hopefully that helps. Let me know if I misunderstood your question (it was a
little unclear whether you were talking about suggesters or aggregations, since
both were mentioned in the question).
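As a minimal sketch of the idea above, the request body could be built like this in Python, with a top_hits sub-aggregation as one possible way to cap the number of items returned per type. The helper name build_grouped_suggest_body and the per_type_size parameter are hypothetical, and filtering on _type assumes an older Elasticsearch version in which mapping types still exist.

```python
def build_grouped_suggest_body(types, per_type_size):
    """Build a search body with one filter bucket per type and a
    top_hits sub-aggregation limited to per_type_size hits per bucket."""
    return {
        "size": 0,  # only the aggregation results are needed
        "aggs": {
            "alltypes": {
                "filters": {
                    # one named bucket per type (assumes _type is queryable)
                    "filters": {t: {"term": {"_type": t}} for t in types}
                },
                "aggs": {
                    # hypothetical sub-aggregation name; top_hits caps hits per bucket
                    "top_items": {"top_hits": {"size": per_type_size}}
                },
            }
        },
    }


body = build_grouped_suggest_body(["songs", "books", "movies"], per_type_size=5)
```

The resulting dict can be sent as the JSON body of the search request shown above.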

Related

How to correctly denormalize one-to-many indexes coming from multiple sources

How can I restructure below elastic indexes to be able to search for registrations that had certain mailing events?
In our application we have the Profile entity which can have one to multiple Registration entities.
The registrations index is used in the majority of searches and contains the data we want to return.
Then we have multiple *Events indexes that contain events that relate to profiles.
A simplified version would look like this:
Registrations
- RegistrationId
- ProfileId
- Location
MailEvents
- ProfileId
- Template
- Actions
A simplified search might be: all the registrations in a certain location with any mailevent action for templates starting with "Solar".
Joining like in a classical RDB is an anti-pattern in Elasticsearch.
We are considering de-normalizing by adding all the various events for profiles to the registrations index. This will result in an explosion of data in the registrations index.
Nested objects are also bad for searching, so we should somehow make them into arrays. But how?
We have hundreds of rows in the events for every related row in registration. The change rate on the event indexes is much higher than the one on the registration index.
We are considering doing two requests. One for all the *Events indexes, gathering all the profileIds, unique-ing them, then doing one for the registration part with the result of the first one.
It feels wrong and introduces complicated edge cases where there are more results than the max returned rows in the first request or the max terms values in the second.
By searching around I see many people struggling with this and looking for a way to do join queries.
It feels like de-normalizing is the way to go, but what would be the recommended approach?
What other approaches am I missing?
One approach to consider is Elasticsearch's parent-child relationship (the join field type). It is not denormalization: events stay separate documents, so they can be added or updated without reindexing the registration. The constraint is that parent and child documents must live in the same index and on the same shard, so Registrations and MailEvents would have to share one index, with each Registration as the parent and its MailEvents as children routed to it. Because events belong to a profile rather than to a registration, each event would need to be attached to every registration of that profile, or the Profile would have to become the parent.
With that in place, you could use the has_child query to find all Registrations documents that have a child matching certain MailEvent criteria. For example, to find all Registrations in a location with any MailEvent action for templates starting with "Solar", you could write a query like this:
GET /registrations/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "Location": "some_location" } },
        {
          "has_child": {
            "type": "mailevents",
            "query": {
              "bool": {
                "must": [
                  { "prefix": { "Template": "Solar" } },
                  { "exists": { "field": "Actions" } }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
This approach avoids the second request and its edge cases, and events can change without touching the registration documents. The trade-off is that has_child queries are slower than queries against a single flat document, so it is worth measuring at your event volume.
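For the has_child query to work, parent and child documents need a join field in a shared index. The following is a rough sketch of what that could look like, written as Python dicts. The join field name "relation" and the helper child_event are hypothetical; the relation names are chosen to match the "type" used in the query above.

```python
# Hypothetical mapping: one index holding both registrations and mail events.
JOIN_MAPPING = {
    "mappings": {
        "properties": {
            "relation": {
                "type": "join",
                # parent relation name -> child relation name
                "relations": {"registrations": "mailevents"},
            }
        }
    }
}


def child_event(parent_id, template, actions):
    """Return (routing, document) for a mail event stored as a child
    of the registration with id parent_id."""
    doc = {
        "Template": template,
        "Actions": actions,
        "relation": {"name": "mailevents", "parent": parent_id},
    }
    # A child must be indexed with routing equal to its parent id so that
    # both documents end up on the same shard.
    return parent_id, doc


routing, doc = child_event("reg-1", "SolarPromo", ["open"])
```

The routing value would be passed as the routing parameter of the index request for the child document.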
Another approach is to use Elasticsearch's aggregations. This assumes the mail events have been denormalized into each registration document as a nested field named MailEvents. You then perform a single search on the Registrations index, filtered by the desired location, group by ProfileId with a terms aggregation, and use a nested aggregation to retrieve the relevant MailEvents data for each profile.
Here's an example query that performs this aggregation:
GET /registrations/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "Location": "some_location" } }
      ]
    }
  },
  "aggs": {
    "profiles": {
      "terms": { "field": "ProfileId" },
      "aggs": {
        "mail_events": {
          "nested": { "path": "MailEvents" },
          "aggs": {
            "filtered_mail_events": {
              "filter": {
                "bool": {
                  "must": [
                    { "prefix": { "MailEvents.Template": "Solar" } },
                    { "exists": { "field": "MailEvents.Actions" } }
                  ]
                }
              },
              "aggs": {
                "actions": {
                  "terms": { "field": "MailEvents.Actions" }
                }
              }
            }
          }
        }
      }
    }
  }
}
This query will return the Registrations documents that match the desired location, and also provide aggregated information about the related MailEvents data. You can further manipulate the aggregated data to get the information that you need.
Note that this approach can be more complex than the parent-child relationship approach and may have performance implications if your data is large and complex. However, it may be a good solution if you need to perform complex aggregations on the MailEvents data.
As far as I know, Elasticsearch aggregations might be another way to do this. You can run a search across multiple indices, aggregate the list of ProfileId values from MailEvents, and use them to filter Registrations.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
https://discuss.elastic.co/t/aggregation-across-multiple-indices/271350
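A rough sketch of that two-step idea in Python, building only the request bodies. It assumes a composite aggregation is used to page through the distinct ProfileId values (so the first step is not limited by a single response size) and that the ids are sent to the registrations index in batches. The function names are hypothetical, and 65536 is the default value of the index.max_terms_count setting, which your cluster may override.

```python
def profile_id_page(after_key=None, page_size=1000):
    """Body for one page of distinct ProfileId values from the events indexes."""
    composite = {
        "size": page_size,
        "sources": [{"profile": {"terms": {"field": "ProfileId"}}}],
    }
    if after_key is not None:
        # resume after the last key of the previous page
        composite["after"] = after_key
    return {"size": 0, "aggs": {"profiles": {"composite": composite}}}


def registration_queries(profile_ids, location, max_terms=65536):
    """Yield one registrations query per batch of at most max_terms ids."""
    ids = sorted(set(profile_ids))  # de-duplicate before batching
    for start in range(0, len(ids), max_terms):
        batch = ids[start:start + max_terms]
        yield {
            "query": {
                "bool": {
                    "filter": [
                        {"term": {"Location": location}},
                        {"terms": {"ProfileId": batch}},
                    ]
                }
            }
        }
```

Batching keeps each terms query under the limit, at the cost of merging the registration hits from several responses on the client.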

elasticsearch: count appearance of terms aggregation on other fields

I want to count how many times the unique values (the result of a terms aggregation) have appeared in other fields in the same query. Let's say:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "unique_products": {
      "terms": {
        "field": "products.name.keyword",
        "min_doc_count": 10
      }
    }
  }
}
What I want is to count how many times each of the keys returned in the buckets appeared in another field.
My ideal output is:
"aggregations": {
  "product_stat": {
    "key": "<product_name>",
    "sold": "<#>",    # how many times the key appears in another field, such as sold
    "bought": "<#>"
  }
}
Elasticsearch cannot run a single terms aggregation over multiple fields. If it did, aggregations would not be nearly as fast.
As the documentation suggests, there are two options:
- use a script in the terms aggregation (with a performance penalty),
- change how the documents are indexed so that a normal terms aggregation can be used.
Depending on the structure of your data and your use-cases, you might get by with a complex aggregation + some processing on the client side. This can be done with sub aggregations like here, for example.
Hope that helps!
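One way the "aggregation plus client-side processing" idea might look, as a sketch in Python: request one terms aggregation per field in a single search, then merge the buckets by key on the client. The aggregation names sold and bought and the fields they point at are assumptions, since the question does not show the mapping. Counts for keys that fall outside the requested bucket size of the other aggregations would be reported as 0.

```python
def build_body(key_field, other_fields, size=1000):
    """One terms aggregation for the keys plus one per other field.

    other_fields maps an aggregation name to a (hypothetical) field name.
    """
    aggs = {
        "unique_products": {
            "terms": {"field": key_field, "min_doc_count": 10, "size": size}
        }
    }
    for name, field in other_fields.items():
        aggs[name] = {"terms": {"field": field, "size": size}}
    return {"size": 0, "aggs": aggs}


def merge_counts(aggregations, other_names):
    """For each key in unique_products, look up its doc_count in the other aggs."""
    lookups = {
        name: {b["key"]: b["doc_count"] for b in aggregations[name]["buckets"]}
        for name in other_names
    }
    return [
        {"key": b["key"], **{n: lookups[n].get(b["key"], 0) for n in other_names}}
        for b in aggregations["unique_products"]["buckets"]
    ]
```

merge_counts takes the "aggregations" section of the search response and returns one record per product key.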

How to find all duplicate documents in ElasticSearch

We have a need to walk over all of the documents in our AWS ElasticSearch cluster, version 6.0, and gather a count of all the duplicate user ids.
I have tried using a Data Visualization to aggregate counts on the user ids and export them, but the numbers don't match another source of our data that is searchable via traditional SQL.
What we would like to see is like this:
USER ID COUNT
userid1 4
userid22 3
...
I am not an advanced Lucene query person and have yet to find an answer to this question. If anyone can provide some insight into how to do this, I would be appreciative.
The following query counts the documents for each user id and drops the ids that occur fewer than 2 times, so you'll get something like:
id:2, count:2
id:4, count:15
GET /index/_search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "user_id": {
      "terms": {
        "field": "user_id",
        "size": 100000,
        "min_doc_count": 2
      }
    }
  }
}
More here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
If you want to get all duplicate userids with their counts, you first need to know how large the aggregation has to be.
Find the number of distinct userids with a cardinality aggregation:
GET index/type/_search
{
  "size": 0,
  "aggs": {
    "maximum_match_counts": {
      "cardinality": {
        "field": "userid",
        "precision_threshold": 100
      }
    }
  }
}
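Take the value of the maximum_match_counts aggregation from the response. Cardinality is an approximate count, so consider leaving some headroom.
Now you can get all duplicate userids, using that value as the size: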
GET index/type/_search
{
  "size": 0,
  "aggs": {
    "userIds": {
      "terms": {
        "field": "userid",
        "size": maximum_match_counts,
        "min_doc_count": 2
      }
    }
  }
}
When you use the terms aggregation (Bharat's suggestion) with an aggregation size above 10K, you will get a warning that this approach will throw an error in future releases.
Instead of the terms aggregation, use the composite aggregation to scan all of your documents, paginating with the after_key method.
The composite aggregation can be used to paginate all buckets from a multi-level aggregation efficiently. This aggregation provides a way to stream all buckets of a specific aggregation, similarly to what scroll does for documents.
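The pagination loop could be sketched as follows in Python. The search argument is a stand-in for whatever client call sends a request body and returns the response as a dict; the function name, the aggregation name ids, and the source name uid are all hypothetical. Buckets with fewer than 2 documents are dropped on the client side here.

```python
def find_duplicates(search, field="user_id", page_size=1000):
    """Page through all values of `field`, returning {value: count} for count >= 2.

    `search` is any callable that takes a request body and returns the
    response as a dict (a placeholder for a real client call).
    """
    duplicates = {}
    after_key = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"uid": {"terms": {"field": field}}}],
        }
        if after_key is not None:
            composite["after"] = after_key  # continue after the previous page
        response = search({"size": 0, "aggs": {"ids": {"composite": composite}}})
        agg = response["aggregations"]["ids"]
        for bucket in agg["buckets"]:
            if bucket["doc_count"] >= 2:
                duplicates[bucket["key"]["uid"]] = bucket["doc_count"]
        after_key = agg.get("after_key")
        # stop once a page comes back empty or without a key to resume from
        if not agg["buckets"] or after_key is None:
            return duplicates
```

Because each page is bounded by page_size, this avoids asking for one very large terms aggregation, at the cost of several round trips.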

Elasticsearch and aggregation of subqueries

I know that Elasticsearch allows sub-aggregations (i.e. nested aggregations); however, I would like to apply an aggregation to the result of the "first" aggregation (or, more generally, of any query, aggregation or not).
Concrete example: I log events about user actions (for simplicity I have documents with user_id and action). I can make a query that counts number of actions executed by each user. However I would like to find out percentage (or count) of "active users" (e.g. users that have executed more than 10 actions). Ideal result would be a histogram over all users showing how active the users are.
Is there a way to create such a query? Or is there any other approach I can take, other than storing the aggregated results of the subquery and computing the histogram from that?
Note: I have seen the Elastic Search and "sub queries" question, but it was about something else, and it is over a year and a half old while Elasticsearch is being actively developed.
Additionally it seems that in version 1.4 there will be available scripted metric aggregation, but anyway that would require to store counter for every user until reduce phase. And some "approximate solution" is good for me - similar to what ES uses internally for its aggregations.
Here is the query I have used, notice the "min_doc_count" in the aggregation.
{
  "query": {
    "filtered": {
      "filter": {
        "and": [
          { "term": { "name": "did x" } },
          { "range": { "created_at": { "gte": "now-7d", "lte": "now" } } }
        ]
      }
    }
  },
  "aggregations": {
    "my_agg": {
      "terms": {
        "field": "user_id",
        "min_doc_count": 10,
        "size": 0
      }
    }
  }
}
This query returns the list of buckets (users) with at least 10 events in the specified time period. Count the returned buckets to get the number of active users.
I have tested this approach with thousands of events and it works well. At a certain scale you will have to use Hadoop.
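For the histogram the question asks about, one option is to compute it on the client from the per-user buckets of a terms aggregation like the one above (with min_doc_count lowered if inactive users should be included). A minimal Python sketch, with a hypothetical function name and a bucket width chosen arbitrarily:

```python
from collections import Counter


def activity_histogram(buckets, interval=10):
    """Count users per activity range, keyed by the lower bound of each range.

    `buckets` is the list of terms-aggregation buckets, one per user,
    where doc_count is the number of actions for that user.
    """
    histogram = Counter()
    for bucket in buckets:
        lower = (bucket["doc_count"] // interval) * interval
        histogram[lower] += 1
    return dict(sorted(histogram.items()))
```

Each key is the start of a range of width interval, and each value is the number of users whose action count falls in that range.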

ElasticSearch query referencing document

I read some time ago that there was a way to build a query that references another document in your index. At the time, this wasn't helpful to me, but I now have very large GIS areas that I need to query against and sending this data to ElasticSearch in the query body every time seems wasteful.
While my specific use-case relates to GIS, geo_shape, etc, it's a general issue that can be applied to other types of queries.
I have a document type areas that holds all of the predefined search areas (these are things like suburbs, states, etc) and entities that hold all of my search data, including a geo_point type field with lat/lon.
I need to be able to construct a geo_shape query for entities documents that references the mpoly attribute (which is a GeoShape type) on an areas document for its shape coordinates.
Unfortunately, neither Google nor reading the ElasticSearch docs have proved useful in this case, because generally nested documents (related, but not what I'm looking for) is what people seem to be more interested in.
Finally found the answer myself while looking for something different. Unfortunately, the information about the GeoShape filter is not in the GeoShape query manual pages:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-geo-shape-filter.html#_pre_indexed_shape
{
  "filtered": {
    "query": {
      "match_all": {}
    },
    "filter": {
      "geo_shape": {
        "location": {
          "indexed_shape": {
            "id": "DEU",
            "type": "countries",
            "index": "shapes",
            "path": "location"
          }
        }
      }
    }
  }
}
If anyone has better information about how to do this generically, I will happily accept their answer instead.
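On newer Elasticsearch versions, where the filtered query is no longer available, the same pre-indexed shape reference can be placed in a bool filter. The following Python sketch builds such a body; the function name is hypothetical, and the index name areas and the document id are assumptions based on the question, not something the original confirms. The type key is left out on the assumption that mapping types are no longer in use.

```python
def indexed_shape_query(field, shape_index, shape_id, shape_path):
    """Build a bool query that filters `field` by a shape stored in another document."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_shape": {
                        field: {
                            "indexed_shape": {
                                "index": shape_index,  # index holding the shape document
                                "id": shape_id,        # id of the shape document
                                "path": shape_path,    # field in that document with the shape
                            }
                        }
                    }
                }
            }
        }
    }


# Hypothetical values: entities are filtered by the mpoly shape of one area.
body = indexed_shape_query("location", "areas", "some-area-id", "mpoly")
```

Only the shape document's id travels with each request, so the large polygon itself never has to be sent in the query body.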
