I am new to ElasticsearchTemplate. I want to get 1000 documents from Elasticsearch based on my query.
I have used QueryBuilder to create my query, and it is working perfectly.
I have gone through the following links, which state that it is possible to retrieve big data sets using scan and scroll.
link one
link two
I am trying to implement this functionality in the following section of code, which I have copy-pasted from one of the links mentioned above.
But I am getting the following error:
The type ResultsMapper is not generic; it cannot be parameterized with arguments <myInputDto>.
MyInputDto is a class with the @Document annotation in my project.
At the end of the day, I just want to retrieve 1000 documents from Elasticsearch.
I tried to find a size parameter, but I think it is not supported.
String scrollId = esTemplate.scan(searchQuery, 1000, false);
List<MyInputDto> sampleEntities = new ArrayList<MyInputDto>();
boolean hasRecords = true;
while (hasRecords) {
    Page<MyInputDto> page = esTemplate.scroll(scrollId, 5000L,
        new ResultsMapper<MyInputDto>() {
            @Override
            public Page<MyInputDto> mapResults(SearchResponse response) {
                List<MyInputDto> chunk = new ArrayList<MyInputDto>();
                for (SearchHit searchHit : response.getHits()) {
                    if (response.getHits().getHits().length <= 0) {
                        return null;
                    }
                    MyInputDto user = new MyInputDto();
                    user.setId(searchHit.getId());
                    user.setMessage((String) searchHit.getSource().get("message"));
                    chunk.add(user);
                }
                return new PageImpl<MyInputDto>(chunk);
            }
        });
    if (page != null) {
        sampleEntities.addAll(page.getContent());
        hasRecords = page.hasNextPage();
    } else {
        hasRecords = false;
    }
}
What is the issue here?
Is there any other alternative to achieve this?
I would be thankful if somebody could tell me how this code works in the back end.
Solution 1
If you want to use ElasticsearchTemplate, it would be much simpler and more readable to use CriteriaQuery, as it allows you to set the page size with the setPageable method. With scrolling, you can then fetch the subsequent pages of data:
CriteriaQuery criteriaQuery = new CriteriaQuery(Criteria.where("productName").is("something"));
criteriaQuery.addIndices("prods");
criteriaQuery.addTypes("prod");
criteriaQuery.setPageable(PageRequest.of(0, 1000));
ScrolledPage<TestDto> scroll = (ScrolledPage<TestDto>) esTemplate.startScroll(3000, criteriaQuery, TestDto.class);
while (scroll.hasContent()) {
    LOG.info("Next page with 1000 elem: " + scroll.getContent());
    scroll = (ScrolledPage<TestDto>) esTemplate.continueScroll(scroll.getScrollId(), 3000, TestDto.class);
}
esTemplate.clearScroll(scroll.getScrollId());
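If you literally only need the first 1000 hits and already have a QueryBuilder, you may not even need scrolling. A minimal sketch, assuming an older Spring Data Elasticsearch API where ElasticsearchTemplate still exposes queryForList and NativeSearchQueryBuilder (index/type names reused from above, MyInputDto from the question):
SearchQuery searchQuery = new NativeSearchQueryBuilder()
        .withQuery(queryBuilder)               // the QueryBuilder you already have
        .withIndices("prods")
        .withTypes("prod")
        .withPageable(PageRequest.of(0, 1000)) // first page, 1000 hits
        .build();
List<MyInputDto> docs = esTemplate.queryForList(searchQuery, MyInputDto.class);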
Solution 2
If you'd like to use org.elasticsearch.client.Client instead of ElasticsearchTemplate, then the search request lets you set the number of search hits to return per scroll batch:
QueryBuilder prodBuilder = ...;
SearchResponse scrollResp = client
        .prepareSearch("prods")
        .setScroll(new TimeValue(60000))
        .setSize(1000)
        .setTypes("prod")
        .setQuery(prodBuilder)
        .execute().actionGet();

ObjectMapper mapper = new ObjectMapper();
List<TestDto> products = new ArrayList<>();
try {
    do {
        for (SearchHit hit : scrollResp.getHits().getHits()) {
            products.add(mapper.readValue(hit.getSourceAsString(), TestDto.class));
        }
        LOG.info("Next page with 1000 elem: " + products);
        products.clear();
        scrollResp = client.prepareSearchScroll(scrollResp.getScrollId())
                .setScroll(new TimeValue(60000))
                .execute()
                .actionGet();
    } while (scrollResp.getHits().getHits().length != 0);
} catch (IOException e) {
    LOG.error("Exception while executing query {}", e);
}
Related
I am using the Scroll API to get more than 10,000 documents from our Elasticsearch, but whenever the code tries to query past 10k, I get the below error:
Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]
This is my code:
try {
    // 1. Build Search Request
    final Scroll scroll = new Scroll(TimeValue.timeValueMinutes(1L));
    SearchRequest searchRequest = new SearchRequest(eventId);
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.query(queryBuilder);
    searchSourceBuilder.size(limit);
    searchSourceBuilder.profile(true); // used to profile the execution of queries and aggregations for a specific search
    searchSourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS)); // optional parameter that controls how long the search is allowed to take
    if (CollectionUtils.isNotEmpty(sortBy)) {
        for (int i = 0; i < sortBy.size(); i++) {
            String sortByField = sortBy.get(i);
            String orderByField = orderBy.get(i < orderBy.size() ? i : orderBy.size() - 1);
            SortOrder sortOrder = (orderByField != null && orderByField.trim().equalsIgnoreCase("asc")) ? SortOrder.ASC : SortOrder.DESC;
            if (keywordFields.contains(sortByField)) {
                sortByField = sortByField + ".keyword";
            } else if (rawFields.contains(sortByField)) {
                sortByField = sortByField + ".raw";
            }
            searchSourceBuilder.sort(new FieldSortBuilder(sortByField).order(sortOrder));
        }
    }
    searchSourceBuilder.sort(new FieldSortBuilder("_id").order(SortOrder.ASC));
    if (includes != null) {
        String[] excludes = {""};
        searchSourceBuilder.fetchSource(includes, excludes);
    }
    if (CollectionUtils.isNotEmpty(aggregations)) {
        aggregations.forEach(searchSourceBuilder::aggregation);
    }
    searchRequest.scroll(scroll);
    searchRequest.source(searchSourceBuilder);
    SearchResponse resp = null;
    try {
        resp = client.search(searchRequest, RequestOptions.DEFAULT);
        String scrollId = resp.getScrollId();
        SearchHit[] searchHits = resp.getHits().getHits();
        // Pagination - will continue to call ES until there are no more pages
        while (searchHits != null && searchHits.length > 0) {
            SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
            scrollRequest.scroll(scroll);
            resp = client.scroll(scrollRequest, RequestOptions.DEFAULT);
            scrollId = resp.getScrollId();
            searchHits = resp.getHits().getHits();
        }
        // Clear scroll request to release the search context
        ClearScrollRequest clearScrollRequest = new ClearScrollRequest();
        clearScrollRequest.addScrollId(scrollId);
        client.clearScroll(clearScrollRequest, RequestOptions.DEFAULT);
    } catch (Exception e) {
        String msg = "Could not get search result. Exception=" + ExceptionUtilsEx.getExceptionInformation(e);
        throw new Exception(msg);
I am implementing the solution from this link: https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-high-search-scroll.html
Can anyone tell me what I am doing wrong and what I need to do to get past 10,000 with the scroll api?
If your iterations take more than 5 minutes, you need to adapt the scroll time. Change this line to make sure the scroll context doesn't disappear after 1 minute:
final Scroll scroll = new Scroll(TimeValue.timeValueMinutes(10L));
And remove this one:
searchSourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS)); // optional parameter that controls how long the search is allowed to take
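The key point is that the keep-alive you pass is renewed on every scroll call, so it only has to cover the processing of a single batch, not the whole export. A sketch of the relevant lines, reusing the variable names from the code above:
final Scroll scroll = new Scroll(TimeValue.timeValueMinutes(10L)); // longer keep-alive
searchRequest.scroll(scroll);  // applies to the initial search
scrollRequest.scroll(scroll);  // each scroll call renews the context for another 10 minutes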
I need to do a search across several documents indexed in Elasticsearch. The search works, but I need to know the type of object that the search returns.
public List search(String terms) {
    FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager(entityManager);
    QueryBuilder authorQB = fullTextEntityManager.getSearchFactory().buildQueryBuilder()
            .forEntity(Author.class).get();
    QueryBuilder postQB = fullTextEntityManager.getSearchFactory().buildQueryBuilder()
            .forEntity(Post.class).get();
    QueryBuilder commentQB = fullTextEntityManager.getSearchFactory().buildQueryBuilder()
            .forEntity(Comment.class).get();
    Query authorLQ = authorQB
            .keyword().fuzzy().withEditDistanceUpTo(1).withPrefixLength(1)
            .onFields(AUTHOR_FIELDS).matching(terms)
            .createQuery();
    Query postLQ = postQB
            .keyword().fuzzy().withEditDistanceUpTo(1).withPrefixLength(1)
            .onFields(POST_FIELDS).matching(terms)
            .createQuery();
    Query commentLQ = commentQB
            .keyword().fuzzy().withEditDistanceUpTo(1).withPrefixLength(1)
            .onFields(COMMENT_FIELDS).matching(terms)
            .createQuery();
    Query luceneQuery = authorQB.bool()
            .should(authorLQ)
            .should(postLQ)
            .should(commentLQ)
            .createQuery();
    javax.persistence.Query jpaQuery = fullTextEntityManager
            .createFullTextQuery(luceneQuery, Author.class, Post.class, Comment.class);
    List<Object> result; // need to know object type
    try {
        result = jpaQuery.getResultList();
    } catch (NoResultException nre) {
        throw new NoResultException("The search for " + terms + " did not get any results");
    }
    return result;
}
That gives me a list of all the objects, but I need to know exactly what type each one is (Author, Post or Comment). Is it possible to do this? Thanks.
You could just use instanceof... but if you really want Hibernate Search to return that, you can use projections:
FullTextQuery jpaQuery = fullTextEntityManager
        .createFullTextQuery(luceneQuery, Author.class, Post.class, Comment.class);
jpaQuery.setProjection( ProjectionConstants.OBJECT_CLASS, ProjectionConstants.THIS );

List<Object[]> results = jpaQuery.getResultList();
for ( Object[] result : results ) {
    Class<?> resultClass = (Class<?>) result[0];
    Object resultObject = result[1];
    // ... do stuff ...
}
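For completeness, the instanceof approach mentioned above works on the untyped result list of the original query (i.e. without calling setProjection). A minimal sketch, assuming the entity classes from the question:
for (Object entity : jpaQuery.getResultList()) {
    if (entity instanceof Author) {
        Author author = (Author) entity;
        // handle author hit
    } else if (entity instanceof Post) {
        Post post = (Post) entity;
        // handle post hit
    } else if (entity instanceof Comment) {
        Comment comment = (Comment) entity;
        // handle comment hit
    }
}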
I have an LDAP method that returns all the users in it (almost 1300 users), and I want to return them page by page, similar to what PagingAndSortingRepository does in Spring Boot:
If I have this endpoint (users/?page=0&size=1), I want to return just 1 entry on page 0.
Is there any way to do that?
Currently I have this, but it doesn't work:
SearchRequest searchRequest = new SearchRequest(ldapConfig.getBaseDn(), SearchScope.SUB,
        Filter.createEqualityFilter("objectClass", "person"));
ASN1OctetString resumeCookie = null;
while (true) {
    searchRequest.setControls(new SimplePagedResultsControl(pageable.getPageSize(), resumeCookie));
    SearchResult searchResult = ldapConnection.search(searchRequest);
    numSearches++;
    totalEntriesReturned += searchResult.getEntryCount();
    for (SearchResultEntry e : searchResult.getSearchEntries()) {
        String[] completeDN = UaaUtils.searchCnInDn(e.getDN());
        String[] username = completeDN[0].split("=");
        UserEntity u = new UserEntity(username[1]);
        list.add(u);
        System.out.println("TESTE");
    }
    SimplePagedResultsControl responseControl = SimplePagedResultsControl.get(searchResult);
    if (responseControl.moreResultsToReturn()) {
        // The resume cookie can be included in the simple paged results
        // control included in the next search to get the next page of results.
        System.out.println("Antes " + resumeCookie);
        resumeCookie = responseControl.getCookie();
        System.out.println("Depois " + resumeCookie);
    } else {
        break;
    }
    Page<UserEntity> newPage = new PageImpl<>(list, pageable, totalEntriesReturned);
    System.out.println("content " + newPage.getContent());
    System.out.println("total elements " + newPage.getTotalElements());
    System.out.println(totalEntriesReturned);
}
I'm unsure if this is the proper way, but here's how I went about it:
public PaginatedLookup getAll(String page, String perPage) {
    PagedResultsCookie cookie = null;
    List<LdapUser> results;
    try {
        if ( page != null ) {
            cookie = new PagedResultsCookie(Hex.decode(page));
        } // end if
        Integer pageSize = perPage != null ? Integer.parseInt(perPage) : PROCESSOR_PAGE_SIZE;
        PagedResultsDirContextProcessor processor = new PagedResultsDirContextProcessor(pageSize, cookie);
        LdapName base = LdapUtils.emptyLdapName();
        SearchControls sc = new SearchControls();
        sc.setSearchScope(SearchControls.SUBTREE_SCOPE);
        sc.setTimeLimit(THREE_SECONDS);
        sc.setCountLimit(pageSize);
        sc.setReturningAttributes(new String[]{"cn", "title"});
        results = ldapTemplate.search(base, filter.encode(), sc, new PersonAttributesMapper(), processor);
        cookie = processor.getCookie();
    } catch ( Exception e ) {
        log.error(e.getMessage());
        return null;
    } // end try-catch
    String nextPage = null;
    if ( cookie != null && cookie.getCookie() != null ) {
        nextPage = new String(Hex.encode(cookie.getCookie()));
    } // end if
    return new PaginatedLookup(nextPage, results);
}
The main issue I kept on hitting was trying to get the cookie as something that could be sent to the client, which is where my Hex.decode and Hex.encode came in handy.
PersonAttributesMapper is a private mapper that I have to make the fields more human readable, and PaginatedLookup is a custom class I use for API responses.
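For illustration, the hex-encoded cookie simply round-trips through the caller between requests. A hypothetical usage sketch (lookupService and the getNextPage accessor on PaginatedLookup are assumed, since that class is custom):
// First request: no page token yet
PaginatedLookup firstPage = lookupService.getAll(null, "50");
// Follow-up request: pass the token from the previous response back in
PaginatedLookup secondPage = lookupService.getAll(firstPage.getNextPage(), "50");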
My objective is to reindex an index with 10 million documents for the purpose of changing field mappings to facilitate significant terms analysis.
My problem is that I am having trouble using the NEST library to perform a re-index, and the documentation is (very) limited. If possible I need an example of the following in use:
http://nest.azurewebsites.net/nest/search/scroll.html
http://nest.azurewebsites.net/nest/core/bulk.html
NEST provides a nice Reindex method you can use, although the documentation is lacking. I've used it in a very rough-and-ready fashion with this ad-hoc WinForms code.
private ElasticClient client;
private double count;

private void reindex_Completed()
{
    MessageBox.Show("Done!");
}

private void reindex_Next(IReindexResponse<object> obj)
{
    count += obj.BulkResponse.Items.Count();
    var progress = 100 * count / (double)obj.SearchResponse.Total;
    progressBar1.Value = (int)progress;
}

private void reindex_Error(Exception ex)
{
    MessageBox.Show(ex.ToString());
}

private void button1_Click(object sender, EventArgs e)
{
    count = 0;
    var reindex = client.Reindex<object>(r => r.FromIndex(fromIndex.Text).NewIndexName(toIndex.Text).Scroll("10s"));
    var o = new ReindexObserver<object>(onError: reindex_Error, onNext: reindex_Next, completed: reindex_Completed);
    reindex.Subscribe(o);
}
And I've just found the blog post that showed me how to do it: http://thomasardal.com/elasticsearch-migrations-with-c-and-nest/
Unfortunately the NEST implementation is not quite what I expected. In my opinion it's a bit over-engineered for possibly the most common use case.
A lot of people just want to update their mappings with zero downtime...
In my case - I had already taken care of creating the index with all its settings and mappings, but NEST insists that it must create a new index when reindexing. That among many other things. Too many other things.
I found it much less complicated to just implement it directly, since NEST already has Search, Scroll, and Bulk methods (this is adapted from NEST's implementation):
// Assuming you have already created and setup the index yourself
public void Reindex(ElasticClient client, string aliasName, string currentIndexName, string nextIndexName)
{
    Console.WriteLine("Reindexing documents to new index...");
    var searchResult = client.Search<object>(s => s.Index(currentIndexName).AllTypes().From(0).Size(100).Query(q => q.MatchAll()).SearchType(SearchType.Scan).Scroll("2m"));
    if (searchResult.Total <= 0)
    {
        Console.WriteLine("Existing index has no documents, nothing to reindex.");
    }
    else
    {
        var page = 0;
        IBulkResponse bulkResponse = null;
        do
        {
            var result = searchResult;
            searchResult = client.Scroll<object>(s => s.Scroll("2m").ScrollId(result.ScrollId));
            if (searchResult.Documents != null && searchResult.Documents.Any())
            {
                searchResult.ThrowOnError("reindex scroll " + page);
                bulkResponse = client.Bulk(b =>
                {
                    foreach (var hit in searchResult.Hits)
                    {
                        b.Index<object>(bi => bi.Document(hit.Source).Type(hit.Type).Index(nextIndexName).Id(hit.Id));
                    }
                    return b;
                }).ThrowOnError("reindex page " + page);
                Console.WriteLine("Reindexing progress: " + (page + 1) * 100);
            }
            ++page;
        }
        while (searchResult.IsValid && bulkResponse != null && bulkResponse.IsValid && searchResult.Documents != null && searchResult.Documents.Any());
        Console.WriteLine("Reindexing complete!");
    }

    Console.WriteLine("Updating alias to point to new index...");
    client.Alias(a => a
        .Add(aa => aa.Alias(aliasName).Index(nextIndexName))
        .Remove(aa => aa.Alias(aliasName).Index(currentIndexName)));

    // TODO: Don't forget to delete the old index if you want
}
And the ThrowOnError extension method in case you want it:
public static T ThrowOnError<T>(this T response, string actionDescription = null) where T : IResponse
{
    if (!response.IsValid)
    {
        throw new CustomExceptionOfYourChoice(actionDescription == null ? string.Empty : "Failed to " + actionDescription + ": " + response.ServerError.Error);
    }
    return response;
}
I second Ben Wilde's answer above. Better to have full control over index creation and the re-index process.
What's missing from Ben's code is support for parent/child relationship. Here is my code to fix that:
Replace the following lines:
foreach (var hit in searchResult.Hits)
{
    b.Index<object>(bi => bi.Document(hit.Source).Type(hit.Type).Index(nextIndexName).Id(hit.Id));
}
With this:
foreach (var hit in searchResult.Hits)
{
    var jo = hit.Source as JObject;
    JToken jt;
    if (jo != null && jo.TryGetValue("parentId", out jt))
    {
        // Document is child-document => add parent reference
        string parentId = (string)jt;
        b.Index<object>(bi => bi.Document(hit.Source).Type(hit.Type).Index(nextIndexName).Id(hit.Id).Parent(parentId));
    }
    else
    {
        b.Index<object>(bi => bi.Document(hit.Source).Type(hit.Type).Index(nextIndexName).Id(hit.Id));
    }
}
I'm having some trouble doing aggregation on a particular column in HBase.
This is the snippet of code I tried:
Configuration config = HBaseConfiguration.create();
AggregationClient aggregationClient = new AggregationClient(config);
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("drs"), Bytes.toBytes("count"));
ColumnInterpreter<Long, Long> ci = new LongColumnInterpreter();
Long sum = aggregationClient.sum(Bytes.toBytes("DEMO_CALCULATIONS"), ci , scan);
System.out.println(sum);
sum returns a value of null.
The aggregationClient API works fine if I do a rowcount.
I was trying to follow the directions in http://michaelmorello.blogspot.in/2012/01/row-count-hbase-aggregation-example.html
Could there be a problem with me using a LongColumnInterpreter when the 'count' field was written as an int? What am I missing here?
With the default setting, you can only sum values stored as longs (8 bytes).
That's because AggregateImplementation's getSum method handles all the returned KeyValues as longs:
List<KeyValue> results = new ArrayList<KeyValue>();
try {
    boolean hasMoreRows = false;
    do {
        hasMoreRows = scanner.next(results);
        for (KeyValue kv : results) {
            temp = ci.getValue(colFamily, qualifier, kv);
            if (temp != null)
                sumVal = ci.add(sumVal, ci.castToReturnType(temp));
        }
        results.clear();
    } while (hasMoreRows);
} finally {
    scanner.close();
}
And in LongColumnInterpreter:
public Long getValue(byte[] colFamily, byte[] colQualifier, KeyValue kv)
        throws IOException {
    if (kv == null || kv.getValueLength() != Bytes.SIZEOF_LONG)
        return null;
    return Bytes.toLong(kv.getBuffer(), kv.getValueOffset());
}
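So if the count column was written as a 4-byte int, getValue returns null for every cell and the sum ends up null. A minimal sketch of writing the value as an 8-byte long instead (using the older HBase Put API from that era; rowKey and table are assumed):
// Store the counter as 8 bytes so LongColumnInterpreter can interpret it
Put put = new Put(Bytes.toBytes(rowKey));
put.add(Bytes.toBytes("drs"), Bytes.toBytes("count"), Bytes.toBytes(5L)); // Bytes.toBytes(long) writes 8 bytes
table.put(put);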