Elasticsearch | Efficient pagination with more than 10k documents

I have a microservice with Elasticsearch as the backend store. I have multiple indices with hundreds of thousands of documents in them.
Now I need to expose GET APIs for those indices, e.g. GET /employees/get.
I have gone through ES pagination using scroll and search_after, but both of them require meta information (a scroll_id or a search_after key) to paginate.
The concern is that my microservice shouldn't expose these scroll_ids or search_after keys. With the current approach I can list up to 10k docs, but not beyond that. And I don't want users of the microservice to know anything about the backend store. So how can I achieve this with Elasticsearch?
I have below approach in mind:
Store the scroll_id in memory and use it to retrieve results for subsequent queries. The GET query would look like this:
GET /employees/get?page=1. By default each page would have 10k documents (a rough sketch of this idea is after this list of approaches).
Implement the scroll API internally behind the GET API and return all matching documents to the user. But this increases latency and memory usage, because at times I may end up returning 100k docs to a user in a single call.
Expose the GET API with a search string. By default return 10k documents, and then refine the results with the search string as explained here:
Let's say GET /employees/get returns 10k documents and also accepts a query_string to narrow those 10k down, like autosuggestion using n-grams. Then we show the most relevant 10k docs every time. I know this is not actual pagination, but it somehow solves the problem in a hacky way. This is my Plan B.
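To make the first option concrete, here is roughly what I have in mind, using search_after keys cached server-side instead of scroll IDs (the index name, sort fields, page size and cache structure are just placeholders):

# Rough sketch: the client only ever sees ?page=N, the search_after
# cursors stay inside the service. A real implementation would need
# cache expiry and handling for multiple service instances.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

CURSOR_CACHE = {}   # (query_key, page) -> sort values of the last hit on that page
PAGE_SIZE = 1000    # placeholder page size

def list_employees(query_key: str, page: int):
    """Serve GET /employees/get?page=N without exposing ES internals."""
    body = {
        "size": PAGE_SIZE,
        "query": {"match_all": {}},
        # search_after needs a deterministic sort with a unique tiebreaker field
        "sort": [{"hire_date": "asc"}, {"employee_id": "asc"}],
    }
    if page > 1:
        cursor = CURSOR_CACHE.get((query_key, page - 1))
        if cursor is None:
            raise ValueError("previous page has not been fetched yet")
        body["search_after"] = cursor

    resp = es.search(index="employees", body=body)
    hits = resp["hits"]["hits"]
    if hits:
        # remember where this page ended so page+1 can continue from there
        CURSOR_CACHE[(query_key, page)] = hits[-1]["sort"]
    return [h["_source"] for h in hits]

Would something like this be a reasonable direction, or is there a cleaner pattern?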
Edited:
This is my use case:
Return the list of employees of a company. There are more than 100k employees, so I have to return the results in pages: GET /employees/get?from=0&size=1000, then GET /employees/get?from=1000&size=1000, and so on.
But once from + size goes past 10k (the default index.max_result_window), ES rejects the query.
Please suggest what the ideal way would be to implement pagination in a microservice with ES as the backend store, without letting users know about the internals of ES.

Related

Filter results in memory to search across multiple indexes in Elasticsearch

I have 2 indexes and they both have one common field (basically a relationship).
Since Elasticsearch doesn't give us filters across multiple indexes, should we store the results in memory in a variable and filter them in Node.js (which basically means that my application itself is now working as a database server)?
We were previously using MongoDB, which is also a NoSQL DB, but there we were able to manage this through aggregation queries; Elasticsearch doesn't seem to provide that.
So even if we use both databases together, we have to store their results somewhere to filter the data further, because we give users advanced search functionality where they can filter data from multiple collections.
So should we store results in memory to filter the data further? We currently offer advanced search over 100 million records to customers, but without the advanced text search that Elasticsearch provides; now we are planning to offer Elasticsearch text search to customers.
What approach do you suggest for making MongoDB and Elasticsearch work together? We are using Node.js to serve data.
Or which of these options should we choose:
Denormalizing: Flatten your data
Application-side joins: Run multiple queries on normalized data
Nested objects: Store arrays of objects
Parent-child relationships: Store multiple documents through joins
https://blog.mimacom.com/parent-child-elasticsearch/
https://spoon-elastic.com/all-elastic-search-post/simple-elastic-usage/denormalize-index-elasticsearch/
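To make sure I understand the first option (denormalizing) correctly, this is roughly what I imagine it means; the field names here are invented:

# Normalized: two documents in two indexes, linked by customer_id
order_doc = {"order_id": 1, "customer_id": 42, "total": 99.5}
customer_doc = {"customer_id": 42, "name": "Jane", "country": "DE"}

# Denormalized ("flattened"): the customer fields are copied onto the
# order document, so a single query on one index can filter on both
order_doc_denormalized = {
    "order_id": 1,
    "total": 99.5,
    "customer_id": 42,
    "customer_name": "Jane",
    "customer_country": "DE",
}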
Storing things client-side in memory is not the solution.
First of all, the simplest way to solve this problem is to simply make one combined index. It's very trivial to do: just insert all the documents from index-2 into index-1, prefixing every field coming from index-2 with something like "idx2" so that you don't overwrite any similarly named fields. You can use an ingest pipeline to do this, or just do it client-side. You will only ever do this once.
After that you can perform aggregations on the single index, since you have all the data in one index.
If you are using something other than ES as your primary data store, you need to reconfigure the indexing operation so that everything that was previously going into index-2 also goes into index-1 (with the prefixed fields).
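A rough sketch of that one-off, client-side merge, using the Python client just for illustration (the index names and the "idx2_" prefix are placeholders):

# Copy every document from index-2 into index-1, prefixing its fields
# with "idx2_" so nothing already in index-1 gets overwritten.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def merged_docs():
    # scan() streams all documents from index-2 via the scroll API
    for hit in helpers.scan(es, index="index-2", query={"query": {"match_all": {}}}):
        source = {f"idx2_{field}": value for field, value in hit["_source"].items()}
        yield {
            "_op_type": "index",
            "_index": "index-1",
            "_id": f"idx2-{hit['_id']}",  # keep IDs distinct from index-1's own docs
            "_source": source,
        }

# bulk() writes the re-keyed documents into index-1 in batches
helpers.bulk(es, merged_docs())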
100 million records is trivial for something like Elasticsearch. Doing any kind of "join" client-side is NOT RECOMMENDED, as it defeats the entire purpose of using ES.
If you need any further help executing this, feel free to contact me. I have 11 years of experience with ES, and I have seen people struggle with "joins" 99% of the time. :)
The first thing to do when coming from MySQL/Postgres or even MongoDB is to restructure the indices to suit your querying needs. Never try to work with multiple indices; ES is not built for that.
HTH.

What can be an alternative to Elasticsearch + DynamoDB being used in combination?

I am new to DynamoDB and I am looking for suggestions/recommendations. There's a use case where we have a paginated API and we have to search for multiple values of an indexed attribute. Since DynamoDB allows only one value of an indexed attribute to be searched in a single query, a batch call would be needed, and a batch call would make the pagination complicated. So currently the required IDs are fetched from Elasticsearch for those multiple values (in a paginated way), after which the complete documents are fetched from DynamoDB by the IDs obtained from Elasticsearch. Is this the correct approach, or is there a better alternative?
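For context, the current flow looks roughly like this; the index, table, key and attribute names are placeholders:

# Paginated ID lookup in Elasticsearch, then BatchGetItem in DynamoDB
import boto3
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
dynamodb = boto3.resource("dynamodb")

def fetch_page(statuses, page, page_size=50):
    # 1) page of matching IDs from ES for multiple values of the indexed attribute
    resp = es.search(
        index="orders",
        body={
            "from": page * page_size,
            "size": page_size,
            "_source": False,  # only the IDs are needed at this step
            "query": {"terms": {"status": statuses}},
        },
    )
    ids = [hit["_id"] for hit in resp["hits"]["hits"]]
    if not ids:
        return []

    # 2) fetch the full items from DynamoDB by those IDs (max 100 keys per batch)
    result = dynamodb.batch_get_item(
        RequestItems={"orders": {"Keys": [{"order_id": i} for i in ids]}}
    )
    return result["Responses"].get("orders", [])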

Check if document is part of Elasticsearch query?

Curious if there is some way to check whether a document ID is part of a large (million+ results) Elasticsearch query/filter.
Essentially I'll have a group of related document IDs and only want to return them if they are part of a larger query. Hoping to do this on the database side. It seemed theoretically possible since ES has to cache things related to large scrolls.
It's an interesting use case, but you need to understand that Elasticsearch (ES) doesn't return all the matching document IDs in the search result; by default it returns only 10 documents in the response, which can be changed with the size parameter.
If you increase the size param and have millions of matching docs for your query, ES query performance will be very bad, and it might even bring the entire cluster down if you fire such queries frequently (in the absence of a circuit breaker), so be cautious about it.
You are right that ES caches things, but if you try to cache a huge amount of data that gets invalidated very frequently, you will not get the expected performance benefit, so benchmark it.
You are already on the right path using the scroll API to iterate over millions of search results; see the points below to improve things further, and the rough sketch after them.
First, get the count of search results. This is included in the default search response (as an exact or "greater than or equal to" value), which tells you how many results you have and lets you choose the size param for subsequent calls to check whether your ID is present or not.
Make sure you effectively use filter context in your query, which ES caches by default.
Benchmark some of your heavy scroll API calls against your data.
Refer to this thread to fine-tune your cluster and index configuration and optimize ES response times further.
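A rough sketch of such a scroll-based check with the heavy query kept in filter context (the index name, the example filter and the field names are made up):

# Check whether a handful of known document IDs fall inside a large
# filtered result set by scrolling through it.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def ids_in_query(candidate_ids, index="events"):
    remaining = set(candidate_ids)
    query = {
        "query": {
            "bool": {
                # filter context: no scoring, and cacheable by ES
                "filter": [{"range": {"timestamp": {"gte": "now-30d"}}}]
            }
        },
        "_source": False,  # the IDs are enough, keep responses small
    }
    # helpers.scan drives the scroll API under the hood
    for hit in helpers.scan(es, index=index, query=query, size=5000):
        remaining.discard(hit["_id"])
        if not remaining:
            break
    return {doc_id: doc_id not in remaining for doc_id in candidate_ids}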

How to show only results that the user has access to

The database structure for my Python application is very similar to Instagram's. I have users and posts, and users can follow each other. There are public and private accounts.
I am indexing this data in Elasticsearch and search works fine so far. However, there is a problem: search returns all posts, without filtering on whether the current user has access to them (e.g. a post created by another user with a private account whom the current user isn't following).
My data in Elasticsearch is indexed in a simple flat format across several indexes, one index for users and one for posts.
I can post-process the results that Elasticsearch returns and remove posts the current user doesn't have access to, but this introduces an additional database query to retrieve that user's followers list, and possibly a blocklist (I don't want to show posts between users who block each other either).
I can also add a list of follower IDs for each user to Elasticsearch upon indexing and then match against them, but in cases where a user has thousands of followers these lists will be huge, and I am not sure how practical it will be to keep them in Elasticsearch.
How can I do this efficiently? My stack is a Python + Flask backend, a PostgreSQL database, and Elasticsearch as the search index.
Maybe you already found a solution...
Using Elasticsearch's "terms lookup" can solve this problem if you have an index with the list of followers you can filter on, as you said here:
I can also add a list of follower IDs for each user to Elasticsearch upon indexing and then match against them, but in cases where a user has thousands of followers these lists will be huge, and I am not sure how practical it will be to keep them in Elasticsearch.
More details in the doc:
https://www.elastic.co/guide/en/elasticsearch/reference/7.5/query-dsl-terms-query.html#query-dsl-terms-lookup
Note that there's a limit of 65,536 terms (it can be raised via the index.max_terms_count setting), so as long as follower lists don't run into millions of entries the default limit will be fine.
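A minimal sketch of what the terms lookup could look like here, assuming each document in a users index has a following field with the IDs of the accounts that user follows (the index and field names, and that direction of the relationship, are assumptions):

# Only return posts whose author appears in the list stored on the
# current user's document in the "users" index (terms lookup).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def visible_posts(current_user_id, text):
    return es.search(
        index="posts",
        body={
            "query": {
                "bool": {
                    "must": [{"match": {"caption": text}}],
                    "filter": [
                        {
                            "terms": {
                                "author_id": {
                                    # terms lookup: fetch the allowed values from
                                    # another document instead of sending them inline
                                    "index": "users",
                                    "id": str(current_user_id),
                                    "path": "following",
                                }
                            }
                        }
                    ],
                }
            }
        },
    )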

Elasticsearch slow performance for huge data retrieval with source field

I'm using Elasticsearch to search across more than 10 million records; most records contain 1 to 25 words. I want to retrieve data from it, but the method I'm using now, reading the data from the source field, is drastically slow for large retrievals. I want a method that can make this process faster. I'm free to use another database or anything else alongside Elasticsearch. Can anyone suggest some good ideas and examples for this?
I've tried searching for a solution on Google, and one solution I found was pagination. I've already applied it wherever possible, but pagination is not an option when I want to retrieve many (5000+) hits in one query.
Thanks in advance.
Try using scroll
While a search request returns a single “page” of results, the scroll
API can be used to retrieve large numbers of results (or even all
results) from a single search request, in much the same way as you
would use a cursor on a traditional database.
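Something along these lines, for example (the index name, field list and batch size are just examples); asking only for the _source fields you actually need also helps with the slow retrieval you describe:

# Pull a large number of hits in batches with the scroll API, fetching
# only the source fields that are actually needed.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="records",
    scroll="2m",                       # keep the scroll context alive for 2 minutes
    size=1000,                         # hits per batch
    body={
        "_source": ["title", "text"],  # fetch only the needed source fields
        "query": {"match": {"text": "example phrase"}},
    },
)

scroll_id = resp["_scroll_id"]
hits = resp["hits"]["hits"]
while hits:
    for hit in hits:
        record = hit["_source"]        # handle each record here
    resp = es.scroll(scroll_id=scroll_id, scroll="2m")
    scroll_id = resp["_scroll_id"]
    hits = resp["hits"]["hits"]

es.clear_scroll(scroll_id=scroll_id)   # free the scroll context when done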
