ElasticSearch performance when querying by element type

Assume we have a dataset containing a collection of domains { domain.com, domain2.com } and a collection of users { user#domain.com, angryuser#domain2.com, elastic#domain3.com }.
Assume further that domains and users share several attributes, such as "domain", and that when an attribute name matches, its mapping and possible values match as well.
We then load all collections into our Elasticsearch index, separating them by type: domain and user.
Since our system obviously has many more users than domains, when querying for domain-related data the expectation is that filtering the query by type would make it much faster, right?
My question is: with around 5 million users and 200k domains, why do queries run much faster when the index contains only domain data (users deleted) than when we filter objects by their type? Shouldn't the performance be at least similar? Currently we can match 20 domains per second when there are no users in the index, but that drops to 4 once we load the users, even though we still filter by type.
Maybe I'm missing something, as I'm new to Elasticsearch.
UPDATE:
This is the query, basically:
"query": {
    "flt_field": {
        "domain_address": {
            "like_text": "chroma",
            "fuzziness": 0.3
        }
    }
}
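For context, on the Elasticsearch 1.x versions where flt_field (fuzzy_like_this_field) existed, "filtering by type" would typically be expressed with a filtered query wrapping the same match; a sketch, assuming the type is named domain:

```
"query": {
    "filtered": {
        "query": {
            "flt_field": {
                "domain_address": {
                    "like_text": "chroma",
                    "fuzziness": 0.3
                }
            }
        },
        "filter": {
            "type": {
                "value": "domain"
            }
        }
    }
}
```

Note that the type filter only excludes non-matching documents from the result; all documents still live in the same underlying Lucene index, which is consistent with the slowdown observed here.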
And the mapping is something like this:
"user": {
    "properties": {
        ...,
        "domain_address": {
            "type": "string",
            "boost": 2.4,
            "similarity": "linear"
        }
    }
},
"domain": {
    "properties": {
        ...,
        "domain_address": {
            "type": "string",
            "boost": 2.4,
            "similarity": "linear"
        }
    }
}
Other fields go where the "..." is, but their mappings should not influence the outcome, should they?

Related

How to correctly denormalize one-to-many indexes coming from multiple sources

How can I restructure below elastic indexes to be able to search for registrations that had certain mailing events?
In our application we have the Profile entity, which can have one or more Registration entities.
The registrations index is used in the majority of searches and contains the data we want to return.
Then we have multiple *Events indexes that contain events that relate to profiles.
A simplified version would look like this:
Registrations
- RegistrationId
- ProfileId
- Location
MailEvents
- ProfileId
- Template
- Actions
A simplified search might be: all the registrations in a certain location with any mailevent action for templates starting with "Solar".
Joining as in a classical RDBMS is an anti-pattern in Elasticsearch.
We are considering de-normalizing by adding all the various events for profiles to the registrations index. This will result in an explosion of data in the registrations index.
Nested objects are also bad for searching, so we would somehow have to turn them into arrays. But how?
We have hundreds of event rows for every related row in registrations. The change rate on the event indexes is way higher than on the registration index.
We are considering doing two requests: one across all the *Events indexes, gathering all the profileIds and de-duplicating them, then one for the registrations part using the result of the first.
It feels wrong, and it introduces complicated edge cases when there are more results than the max returned rows in the first request, or more than the max number of Terms values in the second.
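A sketch of the second request in that two-step approach (the location value and profile IDs are hypothetical, gathered from the first request):

```
GET /registrations/_search
{
    "query": {
        "bool": {
            "filter": [
                { "term": { "Location": "some_location" } },
                { "terms": { "ProfileId": [101, 102, 103] } }
            ]
        }
    }
}
```

This is where the edge cases bite: the terms list is bounded by the index.max_terms_count setting, so very large profile ID sets from the first request won't fit.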
By searching around I see many people struggling with this and looking for a way to do join queries.
It feels like de-normalizing is the way to go, but what would be the recommended approach?
What other approaches am I missing?
One approach to consider is Elasticsearch's parent-child relationship, which keeps related documents separate while still making them efficient to search together. With parent-child, you would model Registrations as the parent and MailEvents as the children (in recent Elasticsearch versions both live in the same index, connected by a join field). This lets you associate all the MailEvents data directly with each Registration document while allowing efficient search and retrieval.
Additionally, you could consider using the has_child query to find all Registrations documents that have a certain MailEvent criteria. For example, to find all Registrations with a MailEvent action for templates starting with "Solar", you could write a query like this:
GET /registrations/_search
{
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "Location": "some_location"
                    }
                },
                {
                    "has_child": {
                        "type": "mailevents",
                        "query": {
                            "bool": {
                                "must": [
                                    {
                                        "prefix": {
                                            "Template": "Solar"
                                        }
                                    },
                                    {
                                        "exists": {
                                            "field": "Actions"
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            ]
        }
    }
}
This approach would give you the best of both worlds - you'd have denormalized data that's efficient for search and retrieval, while also avoiding the complexities of multiple requests and potential edge cases.
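Note that in recent Elasticsearch versions, parent-child is modelled with a join field inside a single index rather than across two indexes; a sketch of what the mapping might look like (index and field names are assumptions):

```
PUT /registrations
{
    "mappings": {
        "properties": {
            "relation": {
                "type": "join",
                "relations": {
                    "registration": "mailevents"
                }
            }
        }
    }
}
```

Child documents then have to be indexed with routing set to their parent's ID, so that a parent and its children land on the same shard.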
Another approach is to use Elasticsearch's aggregation feature. In this approach, you would perform a single search query on the Registrations index, filtered by the desired location. Then, you would use the ProfileId field to aggregate the data and retrieve the related MailEvents information. You can achieve this by using a nested aggregation, where you group by ProfileId and retrieve the relevant MailEvents data for each profile.
Here's an example query that performs this aggregation:
GET /registrations/_search
{
    "query": {
        "bool": {
            "filter": [
                {
                    "term": {
                        "Location": "some_location"
                    }
                }
            ]
        }
    },
    "aggs": {
        "profiles": {
            "terms": {
                "field": "ProfileId"
            },
            "aggs": {
                "mail_events": {
                    "nested": {
                        "path": "MailEvents"
                    },
                    "aggs": {
                        "filtered_mail_events": {
                            "filter": {
                                "bool": {
                                    "must": [
                                        {
                                            "prefix": {
                                                "MailEvents.Template": "Solar"
                                            }
                                        },
                                        {
                                            "exists": {
                                                "field": "MailEvents.Actions"
                                            }
                                        }
                                    ]
                                }
                            },
                            "aggs": {
                                "actions": {
                                    "terms": {
                                        "field": "MailEvents.Actions"
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
This query will return the Registrations documents that match the desired location, and also provide aggregated information about the related MailEvents data. You can further manipulate the aggregated data to get the information that you need.
Note that this approach can be more complex than the parent-child relationship approach and may have performance implications if your data is large and complex. However, it may be a good solution if you need to perform complex aggregations on the MailEvents data.
As far as I know, Elasticsearch's aggregation feature might be another way to do this. You can run a search across multiple indices, aggregate the list of profileIds from MailEvents, and use them to filter Registrations.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
https://discuss.elastic.co/t/aggregation-across-multiple-indices/271350

Solution for runtime price calc for price sorting problem

We use the Nuxt.js framework for our frontend and built an API with Elasticsearch for searching hotels/accommodations.
When a user searches, we make two API calls: the first for availability and the second for price. The price we fetch is a per-night price; the total price is then calculated at runtime on the client:
2 nights searched = 2 * night_price = total_price
This works OK, but we cannot sort on the total_price value because it is a runtime value.
Any ideas to solve this issue?
Our idea is to store ALL the possible combinations a user could search for in ES, but those would be 100+ million documents.
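As an illustration, the client-side runtime calculation described in the question amounts to something like this (hotel data and field names are hypothetical):

```python
# Sketch of the client-side total-price calculation: the API returns a
# per-night price, and the client multiplies by the number of nights searched.

def total_price(night_price: float, nights: int) -> float:
    """total_price = nights * night_price, computed at runtime on the client."""
    return night_price * nights

hotels = [
    {"name": "Hotel A", "night_price": 80.0},
    {"name": "Hotel B", "night_price": 65.0},
    {"name": "Hotel C", "night_price": 120.0},
]

nights = 2  # e.g. a 2-night search

# Sorting has to happen client-side, because total_price never exists in the index.
by_total = sorted(hotels, key=lambda h: total_price(h["night_price"], nights))
```

This is exactly the sorting that cannot be pushed down to Elasticsearch when the field is only computed on the client.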
remco
Did you try using Runtime Fields? The benefits are:
saving storage costs and increasing ingestion speed
immediately usable in search requests, aggregations, filtering, and sorting
doesn't increase the index size
So, you can define the field during the search:
GET my-index-000001/_search
{
    "runtime_mappings": {
        "total_price": {
            "type": "double",
            "script": {
                "source": "emit(doc['night_price'].value * params['multiplier'])",
                "params": {
                    "multiplier": 2
                }
            }
        }
    },
    "sort": [
        {
            "total_price": {
                "order": "desc"
            }
        }
    ]
}
When sending the query, you set the value of the multiplier parameter to the number of nights searched.

Elastic Search indexing for many different mapping types

I have implemented something like Class and Instance logic in my application, where I create an object named Category which is a blueprint for its instances.
Users are free to create as many categories as they like, with whatever fields, so I used to create one new TYPE for each category in my Elasticsearch index mapping, until mapping types were deprecated in the latest upgrades.
With the latest upgrades of ES, I can think of only these 2 approaches:
creating one index for each category
keeping one object-type field named after the TYPE that holds the fields for each category, and updating this one mapping every time.
I am trying to decide which approach to take for the ES upgrade from version 5 to 7, to keep this dynamic nature of my data modelling. Searches would be governed by the TYPE string (a system-generated ID for each category), hence the need to group fields based on the category they belong to.
OLD MAPPINGS - NOW DEPRECATED
first one - one for each TYPE (category)
{
    "type_cat1": {
        "dynamic": "strict",
        "mapping": {
            "field11": {...}
        }
    }
}
and the second one, and so on:
{
    "type_cat2": {
        "dynamic": "strict",
        "mapping": {
            "field21": {...}
        }
    }
}
NEW MAPPING WITH OBJECTS FOR EACH OLD TYPE
{
    "mapping": {
        "properties": {
            "type_cat1": {
                "properties": {
                    "field11": {...}
                }
            },
            "type_cat2": {
                "properties": {
                    "field11": {...}
                }
            }
        }
    }
}
ALTERNATIVE NEW MAPPING - ONE INDEX PER CATEGORY (not more than 500)
One index would be created separately for each category...
Please advise if a better approach is out there, or which one to choose among these.
I have a similar use-case at my workplace where the user can create an object with any number of fields, each field can be of any datatype.
Our approach is similar to one of yours:
All categories will be mapped to a single index.
Whenever a new object is created, the index mappings are updated to accommodate the new object (a category in your case).
This is what our mappings look like when molded to your needs:
{
    "mappings": {
        "properties": {
            "category": { // this is a field present in all documents
                "type": "keyword"
            },
            "createdTime": { // this is a field present in all documents
                "type": "date"
            },
            "id": { // this is a field present in all documents
                "type": "long"
            },
            "fields": {
                "properties": {
                    "type_cat1": {
                        "properties": {
                            "field1": {...},
                            "field2": {...}
                        }
                    },
                    "type_cat2": {
                        "properties": {
                            "field1": {...},
                            "field2": {...}
                        }
                    }
                    // {...} one object per category
                }
            }
        }
    }
}
Get all records of a certain category:
"category": "cat1"
Get all records of cat1 where field2 == "dummy_value":
"category": "cat1" AND "fields.type_cat1.field2.keyword": "dummy_value"
When a new category is created, the fields part of our mappings gets updated.
Extracting out the common fields (category, createdTime, id) eliminates redundancy in the mappings.
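Expressed as an actual query, the second lookup above might look like this (index name and field path are illustrative):

```
GET my-index/_search
{
    "query": {
        "bool": {
            "filter": [
                { "term": { "category": "cat1" } },
                { "term": { "fields.type_cat1.field2.keyword": "dummy_value" } }
            ]
        }
    }
}
```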
Some worthy points:
As the number of unique categories is only 500, you can also go with a separate index per category. This is more beneficial if there are going to be many records (> 100,000) per category.
If the categories are sparse in nature (each category has a small number of records), then ES can easily handle everything in a single index.
If we assume 50 fields per category on average, then the total number of fields in the single-index approach will be 50 * 500 = 25,000. This is a manageable number.
Of course, in the end, much will depend on the resources allocated to the cluster.

How to store info regarding which notifications have been read by user?

I have a set of notification or information items stored in Elasticsearch. Once a user has seen a notification, I need to mark it as seen by that user. A user can filter documents by read/unread status. Notifications will be viewed by lots of users, and the seen status will constantly get updated. What is the best way to store this data? Shall I store the list of users who have seen a notification in the same document itself, or shall I create a parent-child relationship?
You should definitely avoid parent-child and nested types, because they are computationally costly. The best way to model this relationship with a lot of data is to denormalize your data and put it in different indices. Please read here and here. Example:
PUT notification
{
    "mappings": {
        "properties": {
            "content": {
                "type": "text"
            },
            "id_notification": {
                "type": "keyword"
            }
        }
    }
}
Then the user index:
PUT user
{
    "mappings": {
        "properties": {
            "general_information": {
                "type": "text"
            },
            "id_user": {
                "type": "keyword"
            }
        }
    }
}
And another index for the relationship:
PUT seen
{
    "mappings": {
        "properties": {
            "seen": {
                "properties": {
                    "notification_id": {
                        "type": "keyword",
                        "fields": {
                            "user_id": {
                                "type": "keyword"
                            }
                        }
                    }
                }
            },
            "unseen": {
                "properties": {
                    "notification_id": {
                        "type": "keyword",
                        "fields": {
                            "user_id": {
                                "type": "keyword"
                            }
                        }
                    }
                }
            }
        }
    }
}
Sorry for the text format, I don't have Kibana at hand right now. Note that to go from the information indices (user, notification) to the support index (seen), you have to make a multi-index query (doc here). This works because the names and the values of the fields (user_id, notification_id) are the same across the different indices. The user_id subfields in the seen index are arrays of keywords. Alternatively, you could make user_id a single keyword and the parent of a notification_id keyword array field. Either way they keep the one-to-many relationship; the best choice depends on your data.
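With that layout, filtering by read/unread status for a user becomes a two-step lookup: first query the seen index for that user's notification ids, then use them to filter the notification index. A sketch of the second step (the ids are hypothetical values returned by the first query):

```
GET notification/_search
{
    "query": {
        "bool": {
            "must_not": {
                "terms": {
                    "id_notification": ["n1", "n42"]
                }
            }
        }
    }
}
```

Dropping must_not (or turning it into a filter) gives the read rather than the unread set.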

ES: How do quasi-join queries using global aggregation compare to parent-child / nested queries?

At my work, I came across the following pattern for doing quasi-joins in Elasticsearch. I wonder whether this is a good idea, performance-wise.
The pattern:
Connects docs in one index in one-to-many relationship.
Somewhat like ES parent-child, but implemented without it.
Child docs need to be indexed with a field called e.g. "my_parent_id", with value being the parent ID.
Can be used when querying for parent, knowing its ID in advance, to also get the children in the same query.
The query with the quasi-join (assume 123 is the parent ID):
GET /my-index/_search
{
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "id": {
                            "value": 123
                        }
                    }
                }
            ]
        }
    },
    "aggs": {
        "my-global-agg": {
            "global": {},
            "aggs": {
                "my-filtering-all-but-children": {
                    "filter": {
                        "term": {
                            "my_parent_id": 123
                        }
                    },
                    "aggs": {
                        "my-returning-children": {
                            "top_hits": {
                                "_source": {
                                    "includes": [
                                        "my_child_field1_to_return",
                                        "my_child_field2_to_return"
                                    ]
                                },
                                "size": 1000
                            }
                        }
                    }
                }
            }
        }
    }
}
This query returns:
the parent (as search query result), and
its children (as the aggregation result).
Performance-wise, is the above:
definitively a good idea,
definitively a bad idea,
hard to tell / it depends?
It depends ;-) The idea is good. However, by default the maximum number of hits you can return in a top_hits aggregation is 100; if you try 1000 you'll get an error like this:
Top hits result window is too large, the top hits aggregator [hits]'s from + size must be less than or equal to: [100] but was [1000]. This limit can be set by changing the [index.max_inner_result_window] index level setting.
As the error states, you can increase this limit by changing the index.max_inner_result_window index setting. But if there's a default, there's usually a good reason for it; I would take that as a hint that it might not be a great idea to increase it too much.
So, if your parent documents have fewer than 100 children, why not; otherwise I'd seriously consider another approach.
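For reference, raising the limit mentioned in the error is a dynamic index settings update; a sketch (the index name is assumed, and 1000 matches the size used in the query above):

```
PUT /my-index/_settings
{
    "index.max_inner_result_window": 1000
}
```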