Why are Elasticsearch Scan/Scroll results so different?

I'm performing a scan/scroll to remap an index in my cluster (v2.4.3) and I'm having trouble understanding the results. In the head plugin my original index has this size/doc count:
size: 1.74Gi (3.49Gi)
docs: 708,108 (1,416,216)
If I perform a _reindex command on this index I get a new index with the same number of docs and the same size.
But if I perform a scan/scroll to copy the index I end up with many more records in my new index. I'm in the middle of the process right now and here is the current state of the new index:
size: 1.81Gi (3.61Gi)
docs: 6,492,180 (12,981,180)
Why are there so many more documents in the new index versus the old one? The mapping file declares 13 nested objects but I did not change the number of nested objects between the two indices.
Here is my scan/scroll code:
SearchResponse response = client.prepareSearch("nas")
        .addSort(SortParseElement.DOC_FIELD_NAME, SortOrder.ASC)
        .setScroll(new TimeValue(120000))
        .setQuery(matchAllQuery())
        .setSize(pageable.getPageSize())
        .execute().actionGet();

while (true) {
    // stop when the scroll returns no more hits
    if (response.getHits().getHits().length <= 0) break;

    long startTime = System.currentTimeMillis();
    List<IndexQuery> indexQueries = new ArrayList<>();
    Arrays.stream(response.getHits().getHits()).forEach(hit -> {
        NasProduct nasProduct = null;
        try {
            nasProduct = objectMapper.readValue(hit.getSourceAsString(), NasProduct.class);
        } catch (IOException e) {
            logger.error("Problem parsing nasProductJson json: || " + hit.getSourceAsString() + " ||", e);
        }
        if (nasProduct != null) {
            IndexQuery indexQuery = new IndexQueryBuilder()
                    .withObject(nasProduct)
                    .withId(nasProduct.getProductKey())
                    .withIndexName(name)
                    .withType("product")
                    .build();
            indexQueries.add(indexQuery);
        }
    });
    elasticsearchTemplate.bulkIndex(indexQueries);
    logger.info("Index updated, update count: " + indexQueries.size() + " duration: " + (System.currentTimeMillis() - startTime) + " ms");

    response = client.prepareSearchScroll(response.getScrollId())
            .setScroll(new TimeValue(120000))
            .execute().actionGet();
}

Related

Save millions of rows from CSV to Oracle DB using Spring Boot JPA

On a regular basis, another application dumps a CSV that contains more than 7-8 million rows. I have a cron job that loads the data from the CSV and saves it into my Oracle DB. Here's my code snippet:
String line = "";
int count = 0;
LocalDate localDateTime;
Instant from = Instant.now();
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("dd-MMM-yy");
List<ItemizedBill> itemizedBills = new ArrayList<>();
try {
    BufferedReader br = new BufferedReader(new FileReader("/u01/CDR_20210325.csv"));
    while ((line = br.readLine()) != null) {
        if (count >= 1) {
            String[] data = line.split("\\|");
            ItemizedBill customer = new ItemizedBill();
            customer.setEventType(data[0]);
            String date = data[1].substring(0, 2);
            String month = data[1].substring(3, 6);
            String year = data[1].substring(7, 9);
            month = WordUtils.capitalizeFully(month);
            String modifiedDate = date + "-" + month + "-" + year;
            localDateTime = LocalDate.parse(modifiedDate, formatter);
            customer.setEventDate(localDateTime.atStartOfDay(ZoneId.systemDefault()).toInstant());
            customer.setaPartyNumber(data[2]);
            customer.setbPartyNumber(data[3]);
            customer.setVolume(Long.valueOf(data[4]));
            customer.setMode(data[5]);
            if (data[6].contains("0")) { customer.setFnfNum("Other"); }
            else { customer.setFnfNum("FNF Number"); }
            itemizedBills.add(customer);
        }
        count++;
    }
    itemizedBillRepository.saveAll(itemizedBills);
} catch (IOException e) {
    e.printStackTrace();
}
This works, but it takes a lot of time to process. How can I make it more efficient and speed the process up?
There are a couple of things you should do to your code.
String.split, while convenient, is relatively slow as it will recompile the regexp each time. Better to use Pattern and the split method on that to reduce overhead.
Use proper JPA batching strategies as explained in this blog.
First enable batch processing in your Spring application.properties. We will use a batch size of 50 (you will need to experiment to find a proper batch size for your case).
spring.jpa.properties.hibernate.jdbc.batch_size=50
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true
Then save entities directly to the database and, every 50 items, do a flush and clear. This flushes the pending state to the database and clears the first-level cache (which prevents excessive dirty checking).
With all of the above, your code should look something like this.
int count = 0;
String line;
Instant from = Instant.now();
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("dd-MMM-yy");
Pattern splitter = Pattern.compile("\\|");
try {
    BufferedReader br = new BufferedReader(new FileReader("/u01/CDR_20210325.csv"));
    while ((line = br.readLine()) != null) {
        if (count >= 1) {
            String[] data = splitter.split(line);
            ItemizedBill customer = new ItemizedBill();
            customer.setEventType(data[0]);
            String date = data[1].substring(0, 2);
            String month = data[1].substring(3, 6);
            String year = data[1].substring(7, 9);
            month = WordUtils.capitalizeFully(month);
            String modifiedDate = date + "-" + month + "-" + year;
            LocalDate localDate = LocalDate.parse(modifiedDate, formatter);
            customer.setEventDate(localDate.atStartOfDay(ZoneId.systemDefault()).toInstant());
            customer.setaPartyNumber(data[2]);
            customer.setbPartyNumber(data[3]);
            customer.setVolume(Long.valueOf(data[4]));
            customer.setMode(data[5]);
            if (data[6].contains("0")) {
                customer.setFnfNum("Other");
            } else {
                customer.setFnfNum("FNF Number");
            }
            itemizedBillRepository.save(customer);
        }
        count++;
        if ((count % 50) == 0) {
            this.entityManager.flush(); // sync pending inserts with the database
            this.entityManager.clear(); // clear the first-level cache
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}
Two other optimizations you could do:
If your volume property is a long rather than a Long, use Long.parseLong(data[4]); instead. It saves the Long creation and unboxing. With just 10 rows this might not be an issue, but with millions of rows those milliseconds add up.
Parse data[1] directly and drop the substring/capitalize steps. MMM text parsing is case-sensitive by default, so build the formatter with parseCaseInsensitive(); assuming the raw value looks like 25-MAR-21, the pattern stays dd-MMM-yy. Then LocalDate.parse(data[1], formatter) gives the same result without the overhead of 5 additional String objects.
int count = 0;
String line;
Instant from = Instant.now();
DateTimeFormatter formatter = new DateTimeFormatterBuilder()
        .parseCaseInsensitive()          // accept MAR, Mar or mar
        .appendPattern("dd-MMM-yy")      // assumes the raw value looks like 25-MAR-21
        .toFormatter();
Pattern splitter = Pattern.compile("\\|");
try {
    BufferedReader br = new BufferedReader(new FileReader("/u01/CDR_20210325.csv"));
    while ((line = br.readLine()) != null) {
        if (count >= 1) {
            String[] data = splitter.split(line);
            ItemizedBill customer = new ItemizedBill();
            customer.setEventType(data[0]);
            LocalDate localDate = LocalDate.parse(data[1], formatter);
            customer.setEventDate(localDate.atStartOfDay(ZoneId.systemDefault()).toInstant());
            customer.setaPartyNumber(data[2]);
            customer.setbPartyNumber(data[3]);
            customer.setVolume(Long.parseLong(data[4]));
            customer.setMode(data[5]);
            if (data[6].contains("0")) {
                customer.setFnfNum("Other");
            } else {
                customer.setFnfNum("FNF Number");
            }
            itemizedBillRepository.save(customer);
        }
        count++;
        if ((count % 50) == 0) {
            this.entityManager.flush(); // sync pending inserts with the database
            this.entityManager.clear(); // clear the first-level cache
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}
You can use Spring Data batch inserts. This link explains how to do it: https://www.baeldung.com/spring-data-jpa-batch-inserts
You can try streaming MySQL results using Java 8 Streams and Spring Data JPA. The link below explains it in detail:
http://knes1.github.io/blog/2015/2015-10-19-streaming-mysql-results-using-java8-streams-and-spring-data.html
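For illustration, a minimal sketch of what that streaming approach could look like for the ItemizedBill entity from the question. The repository and service below, the streamAll query, and the fetch-size value are assumptions made for this example, not code from the linked post:
import java.util.stream.Stream;

import javax.persistence.QueryHint;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.jpa.repository.QueryHints;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

interface ItemizedBillStreamingRepository extends JpaRepository<ItemizedBill, Long> {

    // Stream rows lazily instead of materializing millions of entities in one List.
    // "org.hibernate.fetchSize" is the standard Hibernate query hint; 1000 is an illustrative value.
    @QueryHints(@QueryHint(name = "org.hibernate.fetchSize", value = "1000"))
    @Query("select b from ItemizedBill b")
    Stream<ItemizedBill> streamAll();
}

@Service
class ItemizedBillStreamingService {

    private final ItemizedBillStreamingRepository repository;

    ItemizedBillStreamingService(ItemizedBillStreamingRepository repository) {
        this.repository = repository;
    }

    // The stream must be consumed inside a transaction and closed when done.
    @Transactional(readOnly = true)
    public void processAll() {
        try (Stream<ItemizedBill> bills = repository.streamAll()) {
            bills.forEach(bill -> {
                // process each row here without holding all of them in memory
            });
        }
    }
}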

ElasticSearch Scroll API not going past 10000 limit

I am using the Scroll API to get more than 10,000 documents from our Elasticsearch. However, whenever the code tries to query past 10k, I get the below error:
Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]
This is my code:
try {
    // 1. Build Search Request
    final Scroll scroll = new Scroll(TimeValue.timeValueMinutes(1L));
    SearchRequest searchRequest = new SearchRequest(eventId);
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.query(queryBuilder);
    searchSourceBuilder.size(limit);
    searchSourceBuilder.profile(true); // used to profile the execution of queries and aggregations for a specific search
    searchSourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS)); // optional parameter that controls how long the search is allowed to take
    if (CollectionUtils.isNotEmpty(sortBy)) {
        for (int i = 0; i < sortBy.size(); i++) {
            String sortByField = sortBy.get(i);
            String orderByField = orderBy.get(i < orderBy.size() ? i : orderBy.size() - 1);
            SortOrder sortOrder = (orderByField != null && orderByField.trim().equalsIgnoreCase("asc")) ? SortOrder.ASC : SortOrder.DESC;
            if (keywordFields.contains(sortByField)) {
                sortByField = sortByField + ".keyword";
            } else if (rawFields.contains(sortByField)) {
                sortByField = sortByField + ".raw";
            }
            searchSourceBuilder.sort(new FieldSortBuilder(sortByField).order(sortOrder));
        }
    }
    searchSourceBuilder.sort(new FieldSortBuilder("_id").order(SortOrder.ASC));
    if (includes != null) {
        String[] excludes = {""};
        searchSourceBuilder.fetchSource(includes, excludes);
    }
    if (CollectionUtils.isNotEmpty(aggregations)) {
        aggregations.forEach(searchSourceBuilder::aggregation);
    }
    searchRequest.scroll(scroll);
    searchRequest.source(searchSourceBuilder);
    SearchResponse resp = null;
    try {
        resp = client.search(searchRequest, RequestOptions.DEFAULT);
        String scrollId = resp.getScrollId();
        SearchHit[] searchHits = resp.getHits().getHits();
        // Pagination - will continue to call ES until there are no more pages
        while (searchHits != null && searchHits.length > 0) {
            SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
            scrollRequest.scroll(scroll);
            resp = client.scroll(scrollRequest, RequestOptions.DEFAULT);
            scrollId = resp.getScrollId();
            searchHits = resp.getHits().getHits();
        }
        // Clear scroll request to release the search context
        ClearScrollRequest clearScrollRequest = new ClearScrollRequest();
        clearScrollRequest.addScrollId(scrollId);
        client.clearScroll(clearScrollRequest, RequestOptions.DEFAULT);
    } catch (Exception e) {
        String msg = "Could not get search result. Exception=" + ExceptionUtilsEx.getExceptionInformation(e);
        throw new Exception(msg);
I am implementing the solution from this link: https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-high-search-scroll.html
Can anyone tell me what I am doing wrong and what I need to do to get past 10,000 with the scroll api?
If your iterations take longer than the scroll timeout, you need to adapt the scroll time. Change this line so that the scroll context doesn't disappear after 1 minute:
final Scroll scroll = new Scroll(TimeValue.timeValueMinutes(10L));
And remove this one:
searchSourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS)); // optional parameter that controls how long the search is allowed to take

ElasticsearchTemplate retrieve big data sets

I am new to ElasticsearchTemplate. I want to get 1000 documents from Elasticsearch based on my query.
I have used QueryBuilder to create my query, and it is working perfectly.
I have gone through the following links, which state that it is possible to retrieve big data sets using scan and scroll.
link one
link two
I am trying to implement this functionality in the following section of code, which I have copy-pasted from one of the links mentioned above.
But I am getting the following error:
The type ResultsMapper is not generic; it cannot be parameterized with arguments <myInputDto>.
MyInputDto is a class with the @Document annotation in my project.
At the end of the day, I just want to retrieve 1000 documents from Elasticsearch.
I tried to find a size parameter, but I don't think it is supported.
String scrollId = esTemplate.scan(searchQuery, 1000, false);
List<MyInputDto> sampleEntities = new ArrayList<MyInputDto>();
boolean hasRecords = true;
while (hasRecords) {
    Page<MyInputDto> page = esTemplate.scroll(scrollId, 5000L,
        new ResultsMapper<MyInputDto>() {
            @Override
            public Page<MyInputDto> mapResults(SearchResponse response) {
                List<MyInputDto> chunk = new ArrayList<MyInputDto>();
                for (SearchHit searchHit : response.getHits()) {
                    if (response.getHits().getHits().length <= 0) {
                        return null;
                    }
                    MyInputDto user = new MyInputDto();
                    user.setId(searchHit.getId());
                    user.setMessage((String) searchHit.getSource().get("message"));
                    chunk.add(user);
                }
                return new PageImpl<MyInputDto>(chunk);
            }
        });
    if (page != null) {
        sampleEntities.addAll(page.getContent());
        hasRecords = page.hasNextPage();
    } else {
        hasRecords = false;
    }
}
What is the issue here?
Is there any other alternative to achieve this?
I would be thankful if somebody could tell me how this code works in the back end.
Solution 1
If you want to use ElasticsearchTemplate, it would be much simpler and more readable to use CriteriaQuery, as it allows you to set the page size with the setPageable method. With scrolling, you can then fetch the next sets of data:
CriteriaQuery criteriaQuery = new CriteriaQuery(Criteria.where("productName").is("something"));
criteriaQuery.addIndices("prods");
criteriaQuery.addTypes("prod");
criteriaQuery.setPageable(PageRequest.of(0, 1000));

ScrolledPage<TestDto> scroll = (ScrolledPage<TestDto>) esTemplate.startScroll(3000, criteriaQuery, TestDto.class);
while (scroll.hasContent()) {
    LOG.info("Next page with 1000 elem: " + scroll.getContent());
    scroll = (ScrolledPage<TestDto>) esTemplate.continueScroll(scroll.getScrollId(), 3000, TestDto.class);
}
esTemplate.clearScroll(scroll.getScrollId());
Solution 2
If you'd like to use org.elasticsearch.client.Client instead of ElasticsearchTemplate, the search request lets you set the number of search hits to return per scroll page:
QueryBuilder prodBuilder = ...;
SearchResponse scrollResp = client
        .prepareSearch("prods")
        .setScroll(new TimeValue(60000))
        .setSize(1000)
        .setTypes("prod")
        .setQuery(prodBuilder)
        .execute().actionGet();

ObjectMapper mapper = new ObjectMapper();
List<TestDto> products = new ArrayList<>();
try {
    do {
        for (SearchHit hit : scrollResp.getHits().getHits()) {
            products.add(mapper.readValue(hit.getSourceAsString(), TestDto.class));
        }
        LOG.info("Next page with 1000 elem: " + products);
        products.clear();
        scrollResp = client.prepareSearchScroll(scrollResp.getScrollId())
                .setScroll(new TimeValue(60000))
                .execute()
                .actionGet();
    } while (scrollResp.getHits().getHits().length != 0);
} catch (IOException e) {
    LOG.error("Exception while executing query {}", e);
}

How to obtain a specific person's tweets in a specific time interval (Twitter4J)?

I want to collect specific people's tweets from the most recent year. I'm using Twitter4J, like this:
Paging paging = new Paging(i, 200);
try {
    statuses = twitter.getUserTimeline("martinsuchan", paging);
} catch (TwitterException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
But how can I filter that user's tweets for a certain time interval?
Any answer appreciated.
You can filter the statuses locally, based on status.getCreatedAt(). Example:
try {
    int statusesPerPage = 200;
    int page = 1;
    String username = "username";
    Calendar cal = Calendar.getInstance();
    cal.add(Calendar.YEAR, -1);
    Twitter twitter = new TwitterFactory().getInstance();
    Paging paging = new Paging(page, statusesPerPage);
    List<Status> statuses = twitter.getUserTimeline(username, paging);

    page_loop:
    while (statuses.size() > 0) {
        System.out.println("Showing @" + username + "'s home timeline, page " + page);
        for (Status status : statuses) {
            if (status.getCreatedAt().before(cal.getTime())) {
                break page_loop;
            }
            System.out.println(status.getCreatedAt() + " - " + status.getText());
        }
        paging = new Paging(++page, statusesPerPage);
        statuses = twitter.getUserTimeline(username, paging);
    }
} catch (TwitterException te) {
    te.printStackTrace();
}

Elasticsearch indexing with BulkRequestBuilder slowing down

Hi all Elasticsearch masters.
I have millions of documents to index through the Elasticsearch Java API.
The cluster has three nodes (1 master + 2 nodes).
My code snippet is below.
Settings settings = ImmutableSettings.settingsBuilder()
        .put("cluster.name", "MyClusterName").build();
TransportClient client = new TransportClient(settings);
String hostname = "myhost ip";
int port = 9300;
client.addTransportAddress(new InetSocketTransportAddress(hostname, port));

BulkRequestBuilder bulkBuilder = client.prepareBulk();
BufferedReader br = new BufferedReader(new InputStreamReader(new DataInputStream(new FileInputStream("my_file_path"))));

long bulkBuilderLength = 0;
String readLine = "";
String index = "my_index_name";
String type = "my_type_name";
String id = "";

while ((readLine = br.readLine()) != null) {
    id = somefunction(readLine);
    String json = new ObjectMapper().writeValueAsString(readLine);
    bulkBuilder.add(client.prepareIndex(index, type, id)
            .setSource(json));
    bulkBuilderLength++;
    if (bulkBuilderLength % 1000 == 0) {
        logger.info("##### " + bulkBuilderLength + " data indexed.");
        BulkResponse bulkRes = bulkBuilder.execute().actionGet();
        if (bulkRes.hasFailures()) {
            logger.error("##### Bulk Request failure with error: " + bulkRes.buildFailureMessage());
        }
    }
}
br.close();

if (bulkBuilder.numberOfActions() > 0) {
    logger.info("##### " + bulkBuilderLength + " data indexed.");
    BulkResponse bulkRes = bulkBuilder.execute().actionGet();
    if (bulkRes.hasFailures()) {
        logger.error("##### Bulk Request failure with error: " + bulkRes.buildFailureMessage());
    }
    bulkBuilder = client.prepareBulk();
}
It works fine, but the performance slows down rapidly after a few thousand documents.
I've already tried setting "refresh_interval" to -1 and "number_of_replicas" to 0 (see the sketch below for how these can be set through the Java API).
However, the performance degradation stays the same.
If I monitor the cluster with bigdesk, the GC value reaches 1 every second, as in the screenshot (not reproduced here).
Can anyone help me?
Thanks in advance.
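For reference, a minimal sketch of how those two settings can be changed with the same transport client used above; the index name is a placeholder and the restored values are only examples:
// Relax refresh and replication while bulk indexing (illustrative; restore afterwards).
client.admin().indices().prepareUpdateSettings("my_index_name")
        .setSettings(ImmutableSettings.settingsBuilder()
                .put("index.refresh_interval", "-1")
                .put("index.number_of_replicas", 0))
        .execute().actionGet();

// After the load is finished, switch them back, e.g.:
client.admin().indices().prepareUpdateSettings("my_index_name")
        .setSettings(ImmutableSettings.settingsBuilder()
                .put("index.refresh_interval", "1s")
                .put("index.number_of_replicas", 1))
        .execute().actionGet();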
=================== UPDATED ===========================
Finally, I've solved this problem (see the answer below).
The cause was that I forgot to recreate a new BulkRequestBuilder after each bulk execution.
The performance degradation no longer occurs after changing my code as shown below.
Thank you very much.
Settings settings = ImmutableSettings.settingsBuilder()
        .put("cluster.name", "MyClusterName").build();
TransportClient client = new TransportClient(settings);
String hostname = "myhost ip";
int port = 9300;
client.addTransportAddress(new InetSocketTransportAddress(hostname, port));

BulkRequestBuilder bulkBuilder = client.prepareBulk();
BufferedReader br = new BufferedReader(new InputStreamReader(new DataInputStream(new FileInputStream("my_file_path"))));

long bulkBuilderLength = 0;
String readLine = "";
String index = "my_index_name";
String type = "my_type_name";
String id = "";

while ((readLine = br.readLine()) != null) {
    id = somefunction(readLine);
    String json = new ObjectMapper().writeValueAsString(readLine);
    bulkBuilder.add(client.prepareIndex(index, type, id)
            .setSource(json));
    bulkBuilderLength++;
    if (bulkBuilderLength % 1000 == 0) {
        logger.info("##### " + bulkBuilderLength + " data indexed.");
        BulkResponse bulkRes = bulkBuilder.execute().actionGet();
        if (bulkRes.hasFailures()) {
            logger.error("##### Bulk Request failure with error: " + bulkRes.buildFailureMessage());
        }
        bulkBuilder = client.prepareBulk(); // This line is my mistake and the solution !!!
    }
}
br.close();

if (bulkBuilder.numberOfActions() > 0) {
    logger.info("##### " + bulkBuilderLength + " data indexed.");
    BulkResponse bulkRes = bulkBuilder.execute().actionGet();
    if (bulkRes.hasFailures()) {
        logger.error("##### Bulk Request failure with error: " + bulkRes.buildFailureMessage());
    }
    bulkBuilder = client.prepareBulk();
}
The problem here is that you don't recreate a new bulk request after each bulk execution.
That means you are re-indexing the same first batch of data again and again.
BTW, look at the BulkProcessor class. It is definitely better to use.
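For illustration, a minimal sketch of what a BulkProcessor-based loop could look like, reusing the client, logger, reader and id variables from the question's snippet; the flush thresholds are placeholder values and the exact builder options vary a bit between Elasticsearch versions:
BulkProcessor bulkProcessor = BulkProcessor.builder(client, new BulkProcessor.Listener() {
    @Override
    public void beforeBulk(long executionId, BulkRequest request) {
        // called just before each batch is sent
    }

    @Override
    public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
        if (response.hasFailures()) {
            logger.error("Bulk failure: " + response.buildFailureMessage());
        }
    }

    @Override
    public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
        logger.error("Bulk request failed", failure);
    }
})
        .setBulkActions(1000)                             // flush every 1000 requests
        .setFlushInterval(TimeValue.timeValueSeconds(5))  // or every 5 seconds
        .setConcurrentRequests(1)                         // one bulk in flight while the next is built
        .build();

// Feed it documents; it batches and resets internally, so the
// "forgot to recreate the builder" mistake cannot happen.
while ((readLine = br.readLine()) != null) {
    id = somefunction(readLine);
    String json = new ObjectMapper().writeValueAsString(readLine);
    bulkProcessor.add(new IndexRequest(index, type, id).source(json));
}
bulkProcessor.close(); // flushes any remaining requests
BulkProcessor flushes whenever the action count, request size, or flush interval is reached, and it rebuilds its internal bulk request after every flush, which is exactly the step that was missing in the original loop.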
