I am trying to fetch about 2.5 million records from elastic search using elastic search's Java High Level Client. Which is taking too much time (15 to 22 minutes based on number of records) to fetch all the record using scroll API as it has a limitation of fetch 10,000 record in one request. I tried sliced scroll also but that is taking more time than normal scroll. Following is my assumption about sliced scroll API:
I divided my scroll request into five slices. Which creates 5 requests.
I send 5 request in different threads.
Because every sliced scroll request is an individual request. I guess for each sliced scroll request first it fetches all the records (2.5 million) then filters out the records which belongs to that particular slice.
Which is resulting in more time.
Can anyone tell me more efficient way to fetch all the records.
Related
I'm new at Grafana and I'm trying to create a graph that shows the requests count together with the average response time for the requests, I was able to create my requests count but now I'm struggling to add the information with the requests time, there is an option to show both information inside a panel? Or do I need to create two panels, one with the request count and another with the average time?
And another question, there is an option to show the average time in milliseconds?
I have a microservice with elasticsearch as a backend store. Now I have multiple indices with hundreds of thousands of documents inserted in those indices.
Now I need to expose GET APIs for those indices. GET /employees/get.
I have gone through ES pagination using scroll and search_after. But both of them require meta information like scroll_id and search_after(key) to do pagination.
Now the concern is my microservice shouldn't expose these scroll_ids or search_after. With current approach, I can list up to 10k docs but not after that. And I don't want users of microservice to know about the backend store or anything about it. So How can I achieve this in elasticservice?
I have below approach in mind:
Store the scroll_id in-memory and retrieve the results based on that for subsequent queries. Get query will be as below:
GET /employees/get?page=1 By default each page will have 10k documents.
Implement scroll API internally over GET API and return all matching documents to users. But this increases the latency and memory. Because at times I may end up returning 100k docs to user in a single call.
Expose GET API with search string. By default return 10k documents and further the results will be refreshed with searchstring as explained:
Lets say GET /employees/get return 10k documents. And accept query_string to enrich the 10k like auto suggestion using n gram. Then we show most valid 10k docs everytime. I know this is not actual pagination but somehow this too solves the problem in a hacky way. This is my Plan-B.
Edited:
This is my usecase:
Return list of employees of a company. There are more than 100k employees. So I have to return the results in pages. GET /employees/get?from=0&size=1000 and GET /employees/get?from=1001&size=1000
But once I reach from+size to 10k, ES rejects the query.
Please suggest what would be the ideal way to implement pagination in microservice with ES as backend store and not letting user to know about internals of ES.
I am working on a project using ElasticSearch and querying it to fetch the member information. It has 3 million records.
I am running a campaign for 2 million users and the user data is present on elasticsearch6.2. I query the ES and fetches the records in batches (50 records at a time) using the scroll. Also, I want to keep the SEARCH context for 1 day because if the campaign running process fails due to any reason, I can resume the campaign from where it was stopped. In this way, I will escape from starting the campaign again from starting. I am also saving the scrollID and will use it to resume campaign.
While testing I found CPU Utilization increased by 50% (ES config: 2 nodes with 4 shards running on aws, Instance Type:i3.xlarge.elasticsearch) and its CPU Utilization remains consistent to 50%.
Is there any relation between CPU Utilization and keeping the search context for 1day. BTW campaigns take 6 hours to finish.
From the documentation
Normally, the background merge process optimizes the index by merging
together smaller segments to create new bigger segments, at which time
the smaller segments are deleted. This process continues during
scrolling, but an open search context prevents the old segments from
being deleted while they are still in use. This is how Elasticsearch
is able to return the results of the initial search request,
regardless of subsequent changes to documents.
So with your scroll cursor expiration to 24h it seems you forbid Lucene to merge your segments, increasing to load of your shards.
Later in the documentation there is an explanation on how to clear your scroll cursor :
Search context are automatically removed when the scroll timeout has
been exceeded. However keeping scrolls open has a cost, as discussed
in the previous section so scrolls should be explicitly cleared as
soon as the scroll is not being used anymore using the clear-scroll
API:
You should try to clear your cursor after a campaign is completed.
When we query ES for records it returns 10 records by default. How can I get all the records in the same query without using any scroll API.
There is an option to specify the size, but size is not known in advance.
You can retrieve up to 10k results in one request (setting "size": 10000). If you have less than 10k matching documents then you can paginate over them using a combination of from/size parameters. If it is more, then you would have to use other methods:
Note that from + size can not be more than the index.max_result_window index setting which defaults to 10,000. See the Scroll or Search After API for more efficient ways to do deep scrolling.
To be able to do pagination over unknown number of documents you will have to get the total count from the first returned query.
Note that if there are concurrent changes in the data, results of paginated retrieval may not be consistent (for example, if one document gets inserted or deleted while you are paginating). Scroll is consistent since it is created from a "snapshot" of the index created at query start time.
I have an Elasticsearch/kibana stack that stores every request the application receives. It stores gereneral information about the request (RequestTimestamp, IP, Headers, HttpStatus, Route etc), and there's at least some requests per minute.
I would like to know if there's some way to query Kibana/Elastic to know the points in time that the application didn't receive any request for, let's say, 3 minutes.
I know it can be done programmatically, but it needs to be purely done with querys (so I can show it on the Dashboard).
You could do date histogram aggregation.
You could specify 3m interval and query for a specified day.
So you would get 24*60/3 = 480 values for each day.
You could plot it on the chart and see the gaps.
If you are an expert ES user you could try filtering the aggregations using bucket selector pipeline aggregation or create a moving average using moving average aggregation.