I am using the Scroll API to get more than 10,000 documents from our Elasticsearch; however, whenever the code tries to query past 10k, I get the error below:
Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]
This is my code:
try {
// 1. Build Search Request
final Scroll scroll = new Scroll(TimeValue.timeValueMinutes(1L));
SearchRequest searchRequest = new SearchRequest(eventId);
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
searchSourceBuilder.query(queryBuilder);
searchSourceBuilder.size(limit);
searchSourceBuilder.profile(true); // used to profile the execution of queries and aggregations for a specific search
searchSourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS)); // optional parameter that controls how long the search is allowed to take
if(CollectionUtils.isNotEmpty(sortBy)){
for (int i = 0; i < sortBy.size(); i++) {
String sortByField = sortBy.get(i);
String orderByField = orderBy.get(i < orderBy.size() ? i : orderBy.size() - 1);
SortOrder sortOrder = (orderByField != null && orderByField.trim().equalsIgnoreCase("asc")) ? SortOrder.ASC : SortOrder.DESC;
if(keywordFields.contains(sortByField)) {
sortByField = sortByField + ".keyword";
} else if(rawFields.contains(sortByField)) {
sortByField = sortByField + ".raw";
}
searchSourceBuilder.sort(new FieldSortBuilder(sortByField).order(sortOrder));
}
}
searchSourceBuilder.sort(new FieldSortBuilder("_id").order(SortOrder.ASC));
if (includes != null) {
String[] excludes = {""};
searchSourceBuilder.fetchSource(includes, excludes);
}
if (CollectionUtils.isNotEmpty(aggregations)) {
aggregations.forEach(searchSourceBuilder::aggregation);
}
searchRequest.scroll(scroll);
searchRequest.source(searchSourceBuilder);
SearchResponse resp = null;
try {
resp = client.search(searchRequest, RequestOptions.DEFAULT);
String scrollId = resp.getScrollId();
SearchHit[] searchHits = resp.getHits().getHits();
// Pagination - will continue to call ES until there are no more pages
while(searchHits != null && searchHits.length > 0){
SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
scrollRequest.scroll(scroll);
resp = client.scroll(scrollRequest, RequestOptions.DEFAULT);
scrollId = resp.getScrollId();
searchHits = resp.getHits().getHits();
}
// Clear scroll request to release the search context
ClearScrollRequest clearScrollRequest = new ClearScrollRequest();
clearScrollRequest.addScrollId(scrollId);
client.clearScroll(clearScrollRequest, RequestOptions.DEFAULT);
} catch (Exception e) {
String msg = "Could not get search result. Exception=" + ExceptionUtilsEx.getExceptionInformation(e);
throw new Exception(msg);
}
I am implementing the solution from this link: https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-high-search-scroll.html
Can anyone tell me what I am doing wrong and what I need to do to get past 10,000 with the Scroll API?
If an iteration of your loop takes longer than the scroll keep-alive (1 minute in your code), the scroll context disappears and the next scroll call fails. Change this line to make sure the context survives slow iterations:
final Scroll scroll = new Scroll(TimeValue.timeValueMinutes(10L));
And remove this one:
searchSourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS)); // optional parameter that controls how long the search is allowed to take
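For clarity, here is a minimal sketch of the resulting loop (the 10-minute keep-alive is illustrative; size it to your slowest iteration). Each scroll call renews the context, so the keep-alive only has to outlive a single iteration, not the whole export:
final Scroll scroll = new Scroll(TimeValue.timeValueMinutes(10L));
searchRequest.scroll(scroll);
SearchResponse resp = client.search(searchRequest, RequestOptions.DEFAULT);
String scrollId = resp.getScrollId();
SearchHit[] hits = resp.getHits().getHits();
while (hits != null && hits.length > 0) {
    // process the current page; this work must finish within the keep-alive
    SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
    scrollRequest.scroll(scroll); // renews the context for another 10 minutes
    resp = client.scroll(scrollRequest, RequestOptions.DEFAULT);
    scrollId = resp.getScrollId();
    hits = resp.getHits().getHits();
}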
I am trying to get all the documents from multiple indexes with the Scroll API, but it doesn't return all of them. I found a similar question, but the OP there was clearly missing the first set of documents. Link to the question: Elasticsearch Search Scroll API doesn't retrieve all the documents from an index
Here is my code:
//Code to get indexes
for (String indexName : indexNames) {
final Scroll scroll = new Scroll(TimeValue.timeValueSeconds(45L));
SearchRequest searchRequest = new SearchRequest(indexName);
searchRequest.scroll(scroll);
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
QueryBuilder query = QueryBuilders.boolQuery()
.filter(QueryBuilders.termQuery(sourceId, 2))
.filter(QueryBuilders.rangeQuery(date).gte("01-05-2021").lte("31-05-2021")); // dates must be strings; a bare 01-05-2021 is integer arithmetic
searchSourceBuilder.query(query);
searchSourceBuilder.size(10000);
searchRequest.source(searchSourceBuilder);
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
String scrollId = searchResponse.getScrollId();
SearchHit[] searchHits = searchResponse.getHits().getHits();
List<Model> model = new ArrayList<>();
while(searchHits != null && searchHits.length > 0) {
for (SearchHit document : searchHits){
//add document to model list created above
} //end of for loop
// insert model list to database
SearchScrollRequest searchScrollRequest = new SearchScrollRequest(scrollId);
searchScrollRequest.scroll(scroll);
searchResponse = client.scroll(searchScrollRequest, RequestOptions.DEFAULT);
scrollId = searchResponse.getScrollId();
searchHits = searchResponse.getHits().getHits();
} //end of while loop
ClearScrollRequest clear = new ClearScrollRequest();
clear.addScrollId(scrollId);
client.clearScroll(clear, RequestOptions.DEFAULT); // actually execute the request to release the context
} //end of for loop at the top
The total number of documents I should get is 115 million, but I am missing more than 2 million. I have repeatedly checked my code but have no idea what I am missing.
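One way to narrow this down is to tally the hits you actually iterate per index and compare them with the total the first response reports. A hedged debugging sketch against the code above (on the 6.x client getTotalHits() returns a long; on 7.x it returns a TotalHits object and may need track_total_hits enabled for an exact count):
long expected = searchResponse.getHits().getTotalHits();
long seen = 0;
while (searchHits != null && searchHits.length > 0) {
    seen += searchHits.length;
    SearchScrollRequest searchScrollRequest = new SearchScrollRequest(scrollId);
    searchScrollRequest.scroll(scroll);
    searchResponse = client.scroll(searchScrollRequest, RequestOptions.DEFAULT);
    scrollId = searchResponse.getScrollId();
    searchHits = searchResponse.getHits().getHits();
}
System.out.println(indexName + ": expected=" + expected + ", seen=" + seen);
// If expected != seen, the scroll context may be expiring mid-export (45s can be
// too short while inserting into a database); if expected is already low, the
// query itself is filtering documents out.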
We started using the Elasticsearch high-level client recently, and we use the scroll API to fetch large sets of data from ES. We see a pattern of high CPU utilization that repeats every 30 minutes. No clue what's going on. We see an exception in Elasticsearch too:
[2021-05-12T04:19:29,516][DEBUG][o.e.a.s.TransportSearchScrollAction]
[node-2] [93486247] Failed to execute query phase
org.elasticsearch.transport.RemoteTransportException:
[node-3][10.160.86.222:7550][indices:data/read/search[phase/query/scroll]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No
search context found for id [93486247]
at org.elasticsearch.search.SearchService.getExecutor(SearchService.java:496)
~[elasticsearch-6.8.9.jar:6.8.9]
at org.elasticsearch.search.SearchService.runAsync(SearchService.java:373)
~[elasticsearch-6.8.9.jar:6.8.9]
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:435)
~[elasticsearch-6.8.9.jar:6.8.9]
at org.elasticsearch.action.search.SearchTransportService$8.messageReceived(SearchTransportService.java:376)
~[elasticsearch-6.8.9.jar:6.8.9]
at org.elasticsearch.action.search.SearchTransportService$8.messageReceived(SearchTransportService.java:373)
~[elasticsearch-6.8.9.jar:6.8.9]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:250)
~[?:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
~[elasticsearch-6.8.9.jar:6.8.9]
The high-level client code being used is the usual code given in the official documentation:
final Scroll scroll = new Scroll(TimeValue.timeValueMinutes(1L));
SearchRequest searchRequest = new SearchRequest(INDEX_NAME);
searchRequest.scroll(scroll);
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
if (StringUtils.isNotBlank(keyword)) {
LOG.info("Searching for keyword: {}", keyword);
boolQueryBuilder.must(QueryBuilders.multiMatchQuery(keyword, INDEXED_FIELDS));
}
if(StringUtils.isNotBlank(param1)) {
boolQueryBuilder.filter(QueryBuilders.termQuery("param1", param1));
}
if(Objects.nonNull(param1)) {
boolQueryBuilder.filter(QueryBuilders.termsQuery("param1", param1));
}
searchSourceBuilder.query(boolQueryBuilder);
searchRequest.source(searchSourceBuilder);
List<Object1> statuses = new ArrayList<>();
try {
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
String scrollId = searchResponse.getScrollId();
SearchHit[] searchHits = searchResponse.getHits().getHits();
while (searchHits != null && searchHits.length > 0) {
for (SearchHit hit : searchHits) {
Object1 agent = JsonUtil.parseJson(hit.getSourceAsString(),
Object1.class);
statuses.add(agent);
}
SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
scrollRequest.scroll(scroll);
searchResponse = client.scroll(scrollRequest, RequestOptions.DEFAULT);
scrollId = searchResponse.getScrollId();
searchHits = searchResponse.getHits().getHits();
}
ClearScrollRequest clearScrollRequest = new ClearScrollRequest();
clearScrollRequest.addScrollId(scrollId);
ClearScrollResponse clearScrollResponse = client.clearScroll(clearScrollRequest, RequestOptions.DEFAULT);
boolean succeeded = clearScrollResponse.isSucceeded();
I have an LDAP method that returns all the users in it (almost 1,300 users), and I want to return them by page, similar to what PagingAndSortingRepository does in Spring Boot:
If I have the endpoint (users/?page=0&size=1), I want to return just 1 entry on page 0.
Is there any way to do that?
Currently I have this, but it doesn't work:
SearchRequest searchRequest = new SearchRequest(ldapConfig.getBaseDn(), SearchScope.SUB,
Filter.createEqualityFilter("objectClass", "person"));
ASN1OctetString resumeCookie = null;
while (true) {
searchRequest.setControls(new SimplePagedResultsControl(pageable.getPageSize(), resumeCookie));
SearchResult searchResult = ldapConnection.search(searchRequest);
numSearches++;
totalEntriesReturned += searchResult.getEntryCount();
for (SearchResultEntry e : searchResult.getSearchEntries()) {
String[] completeDN = UaaUtils.searchCnInDn(e.getDN());
String[] username = completeDN[0].split("=");
UserEntity u = new UserEntity(username[1]);
list.add(u);
System.out.println("TESTE");
}
SimplePagedResultsControl responseControl = SimplePagedResultsControl.get(searchResult);
if (responseControl.moreResultsToReturn()) {
// The resume cookie can be included in the simple paged results
// control included in the next search to get the next page of results.
System.out.println("Antes "+resumeCookie);
resumeCookie = responseControl.getCookie();
System.out.println("Depois "+resumeCookie);
} else {
break;
}
Page<UserEntity> newPage = new PageImpl<>(list, pageable, totalEntriesReturned);
System.out.println("content " + newPage.getContent());
System.out.println("total elements " + newPage.getTotalElements());
System.out.println(totalEntriesReturned);
}
I'm unsure if this is the proper way, but here's how I went about it:
public PaginatedLookup getAll(String page, String perPage) {
PagedResultsCookie cookie = null;
List<LdapUser> results;
try {
if ( page != null ) {
cookie = new PagedResultsCookie(Hex.decode(page));
} // end if
Integer pageSize = perPage != null ? Integer.parseInt(perPage) : PROCESSOR_PAGE_SIZE;
PagedResultsDirContextProcessor processor = new PagedResultsDirContextProcessor(pageSize, cookie);
LdapName base = LdapUtils.emptyLdapName();
SearchControls sc = new SearchControls();
sc.setSearchScope(SearchControls.SUBTREE_SCOPE);
sc.setTimeLimit(THREE_SECONDS);
sc.setCountLimit(pageSize);
sc.setReturningAttributes(new String[]{"cn", "title"});
results = ldapTemplate.search(base, filter.encode(), sc, new PersonAttributesMapper(), processor);
cookie = processor.getCookie();
} catch ( Exception e ) {
log.error(e.getMessage());
return null;
} // end try-catch
String nextPage = null;
if ( cookie != null && cookie.getCookie() != null ) {
nextPage = new String(Hex.encode(cookie.getCookie()));
} // end if
return new PaginatedLookup(nextPage, results);
}
The main issue I kept on hitting was trying to get the cookie as something that could be sent to the client, which is where my Hex.decode and Hex.encode came in handy.
PersonAttributesMapper is a private mapper that I have to make the fields more human readable, and PaginatedLookup is a custom class I use for API responses.
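For reference, here is a minimal sketch of that cookie round trip, assuming Spring Security's org.springframework.security.crypto.codec.Hex codec (any hex or Base64 codec would work the same way):
// Server -> client: serialize the opaque LDAP paging cookie as a hex page token.
byte[] rawCookie = processor.getCookie().getCookie();
String pageToken = new String(Hex.encode(rawCookie));
// Client -> server: rebuild the cookie from the token for the next page request.
PagedResultsCookie next = new PagedResultsCookie(Hex.decode(pageToken));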
I am new to ElasticsearchTemplate. I want to get 1000 documents from Elasticsearch based on my query.
I have used QueryBuilder to create my query, and it is working perfectly.
I have gone through the following links, which state that it is possible to retrieve big data sets using scan and scroll.
link one
link two
I am trying to implement this functionality in the following section of code, which I have copy-pasted from one of the links mentioned above.
But I am getting the following error:
The type ResultsMapper is not generic; it cannot be parameterized with arguments <myInputDto>.
MyInputDto is a class with the @Document annotation in my project.
At the end of the day, I just want to retrieve 1000 documents from Elasticsearch.
I tried to find a size parameter, but I think it is not supported.
String scrollId = esTemplate.scan(searchQuery, 1000, false);
List<MyInputDto> sampleEntities = new ArrayList<MyInputDto>();
boolean hasRecords = true;
while (hasRecords) {
Page<MyInputDto> page = esTemplate.scroll(scrollId, 5000L,
new ResultsMapper<MyInputDto>() {
@Override
public Page<MyInputDto> mapResults(SearchResponse response) {
List<MyInputDto> chunk = new ArrayList<MyInputDto>();
for (SearchHit searchHit : response.getHits()) {
if (response.getHits().getHits().length <= 0) {
return null;
}
MyInputDto user = new MyInputDto();
user.setId(searchHit.getId());
user.setMessage((String) searchHit.getSource().get("message"));
chunk.add(user);
}
return new PageImpl<MyInputDto>(chunk);
}
});
if (page != null) {
sampleEntities.addAll(page.getContent());
hasRecords = page.hasNextPage();
} else {
hasRecords = false;
}
}
What is the issue here ?
Is there any other alternative to achieve this?
I would be thankful if somebody could tell me how this (code) works in the back end.
Solution 1
If you want to use ElasticsearchTemplate, it is much simpler and more readable to use CriteriaQuery, as it allows you to set the page size with the setPageable method. With scrolling, you can then fetch the next sets of data:
CriteriaQuery criteriaQuery = new CriteriaQuery(Criteria.where("productName").is("something"));
criteriaQuery.addIndices("prods");
criteriaQuery.addTypes("prod");
criteriaQuery.setPageable(PageRequest.of(0, 1000));
ScrolledPage<TestDto> scroll = (ScrolledPage<TestDto>) esTemplate.startScroll(3000, criteriaQuery, TestDto.class);
while (scroll.hasContent()) {
LOG.info("Next page with 1000 elem: " + scroll.getContent());
scroll = (ScrolledPage<TestDto>) esTemplate.continueScroll(scroll.getScrollId(), 3000, TestDto.class);
}
esTemplate.clearScroll(scroll.getScrollId());
Solution 2
If you'd like to use org.elasticsearch.client.Client instead of ElasticsearchTemplate, the search request builder lets you set the number of search hits to return per page:
QueryBuilder prodBuilder = ...;
SearchResponse scrollResp = client.prepareSearch("prods")
.setScroll(new TimeValue(60000))
.setSize(1000)
.setTypes("prod")
.setQuery(prodBuilder)
.execute().actionGet();
ObjectMapper mapper = new ObjectMapper();
List<TestDto> products = new ArrayList<>();
try {
do {
for (SearchHit hit : scrollResp.getHits().getHits()) {
products.add(mapper.readValue(hit.getSourceAsString(), TestDto.class));
}
LOG.info("Next page with 1000 elem: " + products);
products.clear();
scrollResp = client.prepareSearchScroll(scrollResp.getScrollId())
.setScroll(new TimeValue(60000))
.execute()
.actionGet();
} while (scrollResp.getHits().getHits().length != 0);
} catch (IOException e) {
LOG.error("Exception while executing query", e);
}
I am using the .From() and .Size() methods to retrieve all documents from Elasticsearch search results.
Below is a sample:
ISearchResponse<dynamic> bResponse = ObjElasticClient.Search<dynamic>(s => s.From(0).Size(25000).Index("accounts").AllTypes().Query(Query));
Recently I came across the scroll feature of Elasticsearch. It looks like a better approach than the From() and Size() methods, specifically for fetching large amounts of data.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
I am looking for an example of the scroll feature in the NEST API.
Can someone please provide a NEST example?
Thanks,
Sameer
Here's an example of using scroll with NEST and C#. It works with 5.x and 6.x:
public IEnumerable<T> GetAllDocumentsInIndex<T>(string indexName, string scrollTimeout = "2m", int scrollSize = 1000) where T : class
{
ISearchResponse<T> initialResponse = this.ElasticClient.Search<T>
(scr => scr.Index(indexName)
.From(0)
.Take(scrollSize)
.MatchAll()
.Scroll(scrollTimeout));
List<T> results = new List<T>();
if (!initialResponse.IsValid || string.IsNullOrEmpty(initialResponse.ScrollId))
throw new Exception(initialResponse.ServerError.Error.Reason);
if (initialResponse.Documents.Any())
results.AddRange(initialResponse.Documents);
string scrollid = initialResponse.ScrollId;
bool isScrollSetHasData = true;
while (isScrollSetHasData)
{
ISearchResponse<T> loopingResponse = this.ElasticClient.Scroll<T>(scrollTimeout, scrollid);
if (loopingResponse.IsValid)
{
results.AddRange(loopingResponse.Documents);
scrollid = loopingResponse.ScrollId;
}
isScrollSetHasData = loopingResponse.Documents.Any();
}
this.ElasticClient.ClearScroll(new ClearScrollRequest(scrollid));
return results;
}
It's from: http://telegraphrepaircompany.com/elasticsearch-nest-scroll-api-c/
The internal implementation of NEST's Reindex uses scroll to move documents from one index to another, so it should be a good starting point.
Below is the relevant code from GitHub:
var page = 0;
var searchResult = this.CurrentClient.Search<T>(
s => s
.Index(fromIndex)
.AllTypes()
.From(0)
.Size(size)
.Query(this._reindexDescriptor._QuerySelector ?? (q=>q.MatchAll()))
.SearchType(SearchType.Scan)
.Scroll(scroll)
);
if (searchResult.Total <= 0)
throw new ReindexException(searchResult.ConnectionStatus, "index " + fromIndex + " has no documents!");
IBulkResponse indexResult = null;
do
{
var result = searchResult;
searchResult = this.CurrentClient.Scroll<T>(s => s
.Scroll(scroll)
.ScrollId(result.ScrollId)
);
if (searchResult.Documents.HasAny())
indexResult = this.IndexSearchResults(searchResult, observer, toIndex, page);
page++;
} while (searchResult.IsValid && indexResult != null && indexResult.IsValid && searchResult.Documents.HasAny());
You can also take a look at the integration test for scroll:
[Test]
public void SearchTypeScan()
{
var scanResults = this.Client.Search<ElasticsearchProject>(s => s
.From(0)
.Size(1)
.MatchAll()
.Fields(f => f.Name)
.SearchType(SearchType.Scan)
.Scroll("2s")
);
Assert.True(scanResults.IsValid);
Assert.False(scanResults.FieldSelections.Any());
Assert.IsNotNullOrEmpty(scanResults.ScrollId);
var results = this.Client.Scroll<ElasticsearchProject>(s=>s
.Scroll("4s")
.ScrollId(scanResults.ScrollId)
);
var hitCount = results.Hits.Count();
while (results.FieldSelections.Any())
{
Assert.True(results.IsValid);
Assert.True(results.FieldSelections.Any());
Assert.IsNotNullOrEmpty(results.ScrollId);
var localResults = results;
results = this.Client.Scroll<ElasticsearchProject>(s=>s
.Scroll("4s")
.ScrollId(localResults.ScrollId));
hitCount += results.Hits.Count();
}
Assert.AreEqual(scanResults.Total, hitCount);
}
I took the liberty of rewriting Michael's fine answer to be async and a bit less verbose (NEST 6.x):
public async Task<IList<T>> RockAndScroll<T>(
string indexName,
string scrollTimeoutMinutes = "2m",
int scrollPageSize = 1000
) where T : class
{
var searchResponse = await this.ElasticClient.SearchAsync<T>(sd => sd
.Index(indexName)
.From(0)
.Take(scrollPageSize)
.MatchAll()
.Scroll(scrollTimeoutMinutes));
var results = new List<T>();
while (true)
{
if (!searchResponse.IsValid || string.IsNullOrEmpty(searchResponse.ScrollId))
throw new Exception($"Search error: {searchResponse.ServerError.Error.Reason}");
if (!searchResponse.Documents.Any())
break;
results.AddRange(searchResponse.Documents);
searchResponse = await ElasticClient.ScrollAsync<T>(scrollTimeoutMinutes, searchResponse.ScrollId);
}
await this.ElasticClient.ClearScrollAsync(new ClearScrollRequest(searchResponse.ScrollId));
return results;
}
I took Frederick's answer and made it an extension method that leverages IAsyncEnumerable:
// Adopted from https://stackoverflow.com/a/56261657/1072030
public static async IAsyncEnumerable<T> ScrollAllAsync<T>(
this IElasticClient elasticClient,
string indexName,
string scrollTimeoutMinutes = "2m",
int scrollPageSize = 1000,
[EnumeratorCancellation] CancellationToken ct = default
) where T : class
{
var searchResponse = await elasticClient.SearchAsync<T>(
sd => sd
.Index(indexName)
.From(0)
.Take(scrollPageSize)
.MatchAll()
.Scroll(scrollTimeoutMinutes),
ct);
try
{
while (true)
{
if (!searchResponse.IsValid || string.IsNullOrEmpty(searchResponse.ScrollId))
throw new Exception($"Search error: {searchResponse.ServerError.Error.Reason}");
if (!searchResponse.Documents.Any())
break;
foreach(var item in searchResponse.Documents)
{
yield return item;
}
searchResponse = await elasticClient.ScrollAsync<T>(scrollTimeoutMinutes, searchResponse.ScrollId, ct: ct);
}
}
finally
{
await elasticClient.ClearScrollAsync(new ClearScrollRequest(searchResponse.ScrollId), ct: ct);
}
}
I'm guessing we should really be using search_after instead of the scroll API, but meh.
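For anyone exploring that route, here is a rough sketch of search_after with the Java High Level REST Client used earlier in this thread (the index name, page size, and sort fields are illustrative; search_after requires a deterministic sort, ideally ending with a unique tiebreaker field):
SearchSourceBuilder source = new SearchSourceBuilder()
    .size(1000)
    .sort(new FieldSortBuilder("timestamp").order(SortOrder.ASC)) // hypothetical field
    .sort(new FieldSortBuilder("_id").order(SortOrder.ASC));      // tiebreaker
SearchRequest request = new SearchRequest("my-index").source(source);
SearchResponse response = client.search(request, RequestOptions.DEFAULT);
SearchHit[] hits = response.getHits().getHits();
while (hits.length > 0) {
    // process the current page here
    source.searchAfter(hits[hits.length - 1].getSortValues()); // resume after the last hit
    response = client.search(request, RequestOptions.DEFAULT);
    hits = response.getHits().getHits();
}
Unlike scroll, there is no server-side context to keep alive or clear, which sidesteps the SearchContextMissingException problem described earlier in this thread.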