Elasticsearch, matching via large array values - performance

Version: v7.6.0
My app is similar to Tinder in that it shows users people they have swiped on.
A user can swipe on an unlimited number of other users, so someone could potentially have 100k+ UIDs they have swiped on.
I use a query like the one below (the +20,534 stands for 20,534 more 'match' objects):
{
  "query": {
    "bool": {
      "must_not": [
        { "match": { "uid": "876123c4-7b63-4a90-843b-a0c61f175524" } },
        { "match": { "uid": "a5db9040-0704-49d8-95b5-7441263a6c5c" } },
        +20,534 more
      ]
    }
  }
}
This works; the problem is that it's slow (14+ seconds).
How else could I do a query like this?
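A hedged sketch of one common rewrite, assuming uid is mapped as keyword: collapse the long list of match clauses into a single terms clause, the idiomatic form for exact-value lists (capped at 65,536 values by default via the index.max_terms_count setting):

{
  "query": {
    "bool": {
      "must_not": [
        {
          "terms": {
            "uid": [
              "876123c4-7b63-4a90-843b-a0c61f175524",
              "a5db9040-0704-49d8-95b5-7441263a6c5c"
            ]
          }
        }
      ]
    }
  }
}

For lists that keep growing past that cap, the terms query's lookup variant (reading the value list from another document) is a documented alternative.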

Related

How to correctly denormalize one-to-many indexes coming from multiple sources

How can I restructure the Elasticsearch indexes below to be able to search for registrations that had certain mailing events?
In our application we have the Profile entity, which can have one or more Registration entities.
The registrations index is used in the majority of searches and contains the data we want to return.
Then we have multiple *Events indexes that contain events that relate to profiles.
A simplified version would look like this:
Registrations
- RegistrationId
- ProfileId
- Location
MailEvents
- ProfileId
- Template
- Actions
A simplified search might be: all the registrations in a certain location with any mailevent action for templates starting with "Solar".
Joining like in a classical relational database is an anti-pattern in Elasticsearch.
We are considering de-normalizing by adding all the various events for profiles to the registrations index, but this will result in an explosion of data in the registrations index.
Nested objects are also bad for searching, so we should somehow turn them into arrays. But how?
We have hundreds of event rows for every related registration row, and the change rate on the event indexes is much higher than on the registration index.
We are considering doing two requests: one against all the *Events indexes, gathering all the profileIds and de-duplicating them, then one for the registration part using the result of the first.
It feels wrong and introduces complicated edge cases where there are more results than the maximum number of returned rows in the first request, or more values than the terms limit in the second.
By searching around I see many people struggling with this and looking for a way to do join queries.
It feels like de-normalizing is the way to go, but what would be the recommended approach?
What other approaches am I missing?
One approach to consider is Elasticsearch's parent-child relationship (the join field), which gives you a join-like structure that is still efficient for search. Note that since Elasticsearch 6.x, parent and child documents must live in the same index: each registration would be a parent document and each mail event a child document routed to the same shard. Because the events remain separate documents, their high change rate does not force you to reindex the registrations.
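A minimal mapping sketch, assuming a join field named doc_relation (the field and relation names are illustrative, not from the answer):

PUT /registrations
{
  "mappings": {
    "properties": {
      "ProfileId": { "type": "keyword" },
      "Location":  { "type": "keyword" },
      "Template":  { "type": "keyword" },
      "Actions":   { "type": "keyword" },
      "doc_relation": {
        "type": "join",
        "relations": { "registration": "mailevents" }
      }
    }
  }
}

Each mail event is then indexed with doc_relation set to { "name": "mailevents", "parent": "<registration id>" } and with the parent id as its routing value, so parent and child land on the same shard.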
Additionally, you can use the has_child query to find all registration documents whose child mail events match certain criteria. For example, to find all registrations with a mail event action for templates starting with "Solar", you could write a query like this:
GET /registrations/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "Location": "some_location"
          }
        },
        {
          "has_child": {
            "type": "mailevents",
            "query": {
              "bool": {
                "must": [
                  { "prefix": { "Template": "Solar" } },
                  { "exists": { "field": "Actions" } }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
This approach would give you the best of both worlds: data that is searchable and retrievable as if joined, while avoiding the complexities of multiple requests and their edge cases.
Another approach is to use Elasticsearch's aggregations. Here you would perform a single search on the registrations index, filtered by the desired location, and then aggregate on the ProfileId field to retrieve the related MailEvents information. This assumes the events have been denormalized into the registrations index as a nested MailEvents field: you group by ProfileId and retrieve the relevant MailEvents data for each profile.
Here's an example query that performs this aggregation:
GET /registrations/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "Location": "some_location"
          }
        }
      ]
    }
  },
  "aggs": {
    "profiles": {
      "terms": {
        "field": "ProfileId"
      },
      "aggs": {
        "mail_events": {
          "nested": {
            "path": "MailEvents"
          },
          "aggs": {
            "filtered_mail_events": {
              "filter": {
                "bool": {
                  "must": [
                    { "prefix": { "MailEvents.Template": "Solar" } },
                    { "exists": { "field": "MailEvents.Actions" } }
                  ]
                }
              },
              "aggs": {
                "actions": {
                  "terms": {
                    "field": "MailEvents.Actions"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
This query will return the registration documents that match the desired location, along with aggregated information about the related MailEvents data, which you can post-process to extract what you need.
Note that this approach can be more complex than the parent-child approach and may have performance implications if your data is large and complex. However, it can be a good fit if you need to perform complex aggregations on the MailEvents data.
As far as I know, Elasticsearch aggregations might be another way to do this: you can run a search across multiple indices, aggregate the list of ProfileId values from MailEvents, and use them to filter Registrations.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
https://discuss.elastic.co/t/aggregation-across-multiple-indices/271350
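A minimal sketch of that two-step approach, using the field names from the question (the bucket size of 10000 is an assumed cap, subject to the same terms-limit caveat the question raises):

GET /mailevents/_search
{
  "size": 0,
  "query": {
    "prefix": { "Template": "Solar" }
  },
  "aggs": {
    "profile_ids": {
      "terms": {
        "field": "ProfileId",
        "size": 10000
      }
    }
  }
}

The returned bucket keys would then be fed into a terms filter on the registrations index.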

Elastic Search Query the number of new vs returning users

Given a time range, I want to know how many users are new users and how many users are returning users.
My Elasticsearch index mapping contains the fields user_id, event_time, etc.
E.g. given a record (user_id: jack, event_time: 2019-10-31 00:00:00:000): if the user_id does not appear in the previous two weeks (from 2019-10-17 00:00:00:000 to 2019-10-31 00:00:00:000), then we consider the record with user_id 'jack' a new user; otherwise it's a returning user.
I was wondering whether Elasticsearch supports this kind of query, which could tell me the number of new users and returning users?
Thanks in advance!
I think you could use a range aggregation for that.
{
  "aggs": {
    "activity": {
      "range": {
        "field": "event_time",
        "ranges": [
          { "to": "now-2w/d" },
          { "from": "now-2w/d" }
        ]
      }
    }
  }
}
This should go through your whole index and create two buckets based on event time: one for anything older than two weeks, and another for everything newer than two weeks.
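The buckets above count events rather than distinct users. If distinct user counts are needed, a hedged extension is a cardinality sub-aggregation per bucket (approximate by design; field name taken from the question):

{
  "aggs": {
    "activity": {
      "range": {
        "field": "event_time",
        "ranges": [
          { "to": "now-2w/d" },
          { "from": "now-2w/d" }
        ]
      },
      "aggs": {
        "distinct_users": {
          "cardinality": { "field": "user_id" }
        }
      }
    }
  }
}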

Elasticsearch: Get sort index of the record

Let me describe my scenario with a real example.
I have a page where I need to show a list of companies sorted by the field "overallRank", with a few filters (like companyType and employeeSize).
Now, it's easy to get the results from the ES index for those filters and sort them by overallRank. But I also want to know each company's rank among all companies, not only within the filtered result.
For example, Amazon is the 3rd company for location US and companyType=Private, but the 5th company in the US if we remove the companyType filter. While showing the results with the companyType filter, I want to know this overall ranking (i.e. 5th). Is it possible to include this field in the result somehow?
What I am currently doing is first getting the filtered result by companyType and location US, then getting the result sorted by location only. This second query returns the overall ranking within the location (where Amazon comes in 5th place). I then iterate over the first result and look up each company's position in the second result to determine its overall ranking.
The problem with this approach is that the second query, which determines the overall ranking across all company data, is very expensive because it has to retrieve around 60k results. With a batch size of 1000, that means around 60 round trips to ES to get all the results into memory, which is costly in both time and space.
Can somebody please suggest a better way of doing this?
I think you can solve this using a filter aggregation combined with a top_hits aggregation. As an example, you can do something like:
{
  "aggs": {
    "filtered_companies_by_us": {
      "filter": {
        "term": {
          "location": "US"
        }
      },
      "aggs": {
        "top_companies": {
          "top_hits": {
            "sort": [
              {
                "overallRank": {
                  "order": "desc"
                }
              }
            ],
            "size": 5
          }
        }
      }
    }
  }
}
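If the overall (unfiltered) ranking is needed in the same response, a hedged variant of the same idea is two sibling filter aggregations, one with and one without the companyType term (the field values here are illustrative):

{
  "size": 0,
  "aggs": {
    "us_private": {
      "filter": {
        "bool": {
          "filter": [
            { "term": { "location": "US" } },
            { "term": { "companyType": "Private" } }
          ]
        }
      },
      "aggs": {
        "top_companies": {
          "top_hits": {
            "sort": [ { "overallRank": { "order": "desc" } } ],
            "size": 5
          }
        }
      }
    },
    "us_all": {
      "filter": { "term": { "location": "US" } },
      "aggs": {
        "top_companies": {
          "top_hits": {
            "sort": [ { "overallRank": { "order": "desc" } } ],
            "size": 5
          }
        }
      }
    }
  }
}

Comparing a company's position in the two top_hits lists gives both the filtered and the overall rank from a single request.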

How do I architect my Elasticsearch indexes for rapidly changing data?

I have two models in my MySQL database: Users and Posts
users have geolocation attributes (lat/long)
posts simply have a body of text.
I want to use Elasticsearch to find all posts that match a string of text plus use the user's location as a filter. The problem is -- the user's location always changes (as people walk around the city). I will be frequently updating the lat/long of each user.
This is my current solution:
Index the posts and include a geolocation attribute in each document. When a user changes location, run an Elasticsearch batch update on all that user's posts, modifying the geolocation attribute on those documents.
Obviously this is not a scalable solution -- what if the user has 2000 posts and walks around the city? I'd have to update 2000 documents every minute.
Is there a way to "relationally map" the posts to the user's object and use it as a filter, so that when the location changes, I only need to update that user's object instead of all his posts?
Updating 2000 posts per minute is not a big deal, either with the update-by-query plugin or with the (at the time upcoming) reindex API. However, if you have many users with many posts and you need to update them at short intervals (e.g. every minute), it might indeed not scale: say it takes 500 milliseconds to update all posts from one user, then you'd start to lag behind at around 120 users.
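For reference, update-by-query has since been built into Elasticsearch as the _update_by_query API. A minimal sketch, with the field names user_id and location assumed from the setup above:

POST /posts/_update_by_query
{
  "query": {
    "term": { "user_id": 42 }
  },
  "script": {
    "source": "ctx._source.location = params.loc",
    "lang": "painless",
    "params": {
      "loc": { "lat": 32.5362723, "lon": -80.3654783 }
    }
  }
}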
Clearly, since the users' posts need to "follow" the user rather than keep the location the user had when she posted them, I would first query the users around a given location to get their IDs, and then run a second query on posts, filtered by those user IDs and matching the body text.
It is perfectly OK to keep both of your indices simple and to only update the location in a single user document every minute. The two queries I'm suggesting should be quite fast, and you should not be worried about running them. People are often worried when they need to run two or more queries to find their results, but sometimes trying to tie the documents together too tightly is not the solution; simply running two queries over two indices can work perfectly well.
The query to retrieve users would look like the first one below, where you only retrieve the _id property of each user. I'm assuming your user documents use the user's id as their ES doc _id, so you do not have to retrieve the _source at all (i.e. "_source": false), which is even faster, and you can return just the _id with response filtering:
POST /users/_search?filter_path=hits.hits._id
{
  "size": 1000,
  "_source": false,
  "query": {
    "bool": {
      "filter": [
        {
          "geo_distance": {
            "distance": "100m",
            "location": {
              "lat": 32.5362723,
              "lon": -80.3654783
            }
          }
        }
      ]
    }
  }
}
You'll get the _id values of all users who are currently within 100 meters of the desired geographic location. The next query then filters the posts by those ids while matching the body text.
POST /posts/_search
{
  "size": 50,
  "query": {
    "bool": {
      "must": {
        "match": {
          "body": "some text"
        }
      },
      "filter": [
        {
          "terms": {
            "user_id": [ 1, 2, 3, 4 ]
          }
        }
      ]
    }
  }
}

Is it possible to make elasticsearch highlights linkable?

I'm successfully using ES for indexing documents and highlighting searched text. But now I have a new requirement: make all the yellow highlights linkable, i.e. the user has to be able to jump into the page with the selected occurrence.
I haven't implemented the page preview of documents yet, but I'm sure there exists software that takes a page number or byte offset and returns the docx or pdf page as an image. So I want Elasticsearch to return the index of the occurrence (most likely the byte offset from the beginning). After that I can probably use such index-to-image software to show the occurrence's page to the user. Even if such software does not exist, I can open a RandomAccessFile, read the occurrence's page, and somehow show it to the user. But either way I need the occurrence index. Is it possible to get it from Elasticsearch?
My search request looks like:
http://localhost:9200/mongofilesindex/_search?pretty&source={
  "_source": ["filename", "metadata"],
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "*test*"
        }
      }
    }
  },
  "highlight": {
    "pre_tags": ["<mark>"],
    "post_tags": ["</mark>"],
    "fields": {
      "content": {
        "fragment_size": 200,
        "number_of_fragments": 10
      }
    }
  }
}&size=10&from=0
Of course, I could use ES just for extracting the matching documents and then manually apply KMP (Knuth-Morris-Pratt) to the input stream, which works in linear time. But I want something better than linear, because I know that suffix automata and other data structures can return occurrences in O(search_string_len + occurrences_count), which is much better than O(doc_len).
I'm sure Elasticsearch uses such data structures internally, and I'm probably just missing some API for getting the occurrence indices.
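One direction worth checking (a hedged sketch, not a confirmed answer): if the content field is indexed with term_vector: with_positions_offsets, the term vectors API returns the character (not byte) start and end offsets of every occurrence of a term in a document. In recent Elasticsearch versions the request looks like this, with the document id 1 being illustrative:

GET /mongofilesindex/_termvectors/1
{
  "fields": ["content"],
  "offsets": true,
  "positions": true
}

Each term in the response carries a tokens array whose start_offset and end_offset values can be mapped back to a position in the original file.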
