What we are trying to do is to index a bunch of documents in batches i.e.
foreach (var batch in props.ChunkBy(100))
{
var result = await client.IndexManyAsync<Type>(batch, indexName);
}
We would like to STOP Elasticsearch REFRESHING the Index until we have finished indexing all the batches. Then enable and refresh the index.
How can we achieve this with the NEST library?
Many thanks
You can effectively disable the index refresh by setting the refresh interval to -1. Below is a code sample that shows how to do this with the NEST client. Then you can do your bulk operations and afterwards set the refresh interval back to the default of 1 second (and optionally force an immediate refresh).
// Set the refresh interval to -1, essentially disabling automatic refresh for the index
var updateDisableIndexRefresh = new UpdateIndexSettingsRequest(indexName)
{
    IndexSettings = new DynamicIndexSettings { RefreshInterval = Time.MinusOne }
};
client.UpdateIndexSettings(updateDisableIndexRefresh);

// Do your bulk operations here...

// Reset the refresh interval back to 1 second, the default setting
var updateEnableIndexRefresh = new UpdateIndexSettingsRequest(indexName)
{
    IndexSettings = new DynamicIndexSettings { RefreshInterval = new Time(1, TimeUnit.Second) }
};
client.UpdateIndexSettings(updateEnableIndexRefresh);

// Optionally force an immediate refresh so the newly indexed documents become searchable right away
client.Refresh(indexName);
I have a Spring Boot 2.x project with a big Table in my Cassandra Database. In my Liquibase Migration Class, I need to replace a value from one column in all rows.
For me it's a big performance hit when I try to solve this with
SELECT * FROM BOOKING
forEach Row
Update Row
Because of the total number of rows. Even when I select only 1 Column.
Is it possible to do something like a "partwise"/paginated loop?
Pseudocode:
Take first 1000 rows
do Update
Take next 1000 rows
do Update
loop.
I'm also happy to hear about any other solution approaches you have.
Must know:
Make sure there is a way to group the updates by partition. If you try a batch update on 1000 rows that are not in the same partition, the coordinator of the request will suffer: you are moving the load from your client to the coordinator, and you want to parallelize the writes instead. A batch update in Cassandra has nothing to do with the one in relational databases.
For fine-grained operations like this you want to go back to using the drivers directly, with CassandraOperations and CqlSession, for maximum control.
There is a way to paginate with Spring Data Cassandra using Slice, but you do not have control over how the operations are implemented.
Spring Data Cassandra core:
int size = 1000;      // page size
int targetPage = 5;   // the page you want to reach (example value)
int currentPage = 0;
Slice<MyEntity> slice = myEntityRepo.findAll(CassandraPageRequest.first(size));
while (slice.hasNext() && currentPage < targetPage) {
    slice = myEntityRepo.findAll(slice.nextPageable());
    currentPage++;
}
List<MyEntity> content = slice.getContent();
Drivers:
// Prepare statements once to speed up repeated queries
PreparedStatement selectPS = session.prepare(QueryBuilder
        .selectFrom("myEntity").all()
        .build()
        .setPageSize(1000)                      // 1000 rows per page
        .setTimeout(Duration.ofSeconds(10)));   // 10s timeout
PreparedStatement updatePS = session.prepare(QueryBuilder
        .update("mytable")
        .setColumn("myColumn", QueryBuilder.bindMarker())
        .whereColumn("myPK").isEqualTo(QueryBuilder.bindMarker())
        .build()
        .setConsistencyLevel(ConsistencyLevel.ONE)); // fast writes

// Paginate: process the first page, updating each row asynchronously
ResultSet page1 = session.execute(selectPS.bind());
Iterator<Row> page1Iter = page1.iterator();
while (page1.getAvailableWithoutFetching() > 0) {
    Row row = page1Iter.next();
    session.executeAsync(updatePS.bind(...)); // bind the new column value and the row's primary key
}

// Carry the paging state over to fetch the next page
ByteBuffer pagingStateAsBytes = page1.getExecutionInfo().getPagingState();
ResultSet page2 = session.execute(selectPS.bind().setPagingState(pagingStateAsBytes));
You could of course include this pagination in a loop and track progress.
In my elasticsearch query I have following:
"from":0,
"size":100,
I have thousands of records in database which I want to fetch in batches of 100.
I process one batch, and then fetch next batch of 100 and so on. I know how many records are to be fetched in total.
So value for 'from' needs to be changed dynamically.
How can I modify "from" in code?
Edit: I am programming in groovy.
There are two ways to do this, depending on what you need it for:
1) The first is simple pagination: keep increasing the "from" value by the page size in a loop until you have retrieved all the results (assuming you have the total count at the start). The problem with this approach is that it works fine while from + size stays at or below 10,000, but beyond that you get this size restriction error:
"Result window is too large, from + size must be less than or equal to: [10000] but was [100000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting"
which can be countered, as mentioned in the error, by changing the index.max_result_window index setting. However, if you are planning to use this call as a one-time operation (for example, for re-indexing), it is better to use the scroll API as described in the next point. (Reference: How to retrieve all documents (size greater than 10000) in an elasticsearch index)
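For the first approach, here is a rough sketch of such a loop in Java, using the same RestTemplate and org.json style as the scroll example below. The index and type names are placeholders, and hits.total is assumed to be a plain number as in Elasticsearch 6.x:
// Approach 1: plain from/size pagination (works while from + size <= 10,000)
int size = 100;
int total = Integer.MAX_VALUE; // replaced by the real total after the first response
for (int from = 0; from < total; from += size) {
    String url = "http://localhost:9200/myindex/mytype/_search?from=" + from + "&size=" + size;
    ResponseEntity<String> response = restTemplate.getForEntity(url, String.class);
    JSONObject hits = new JSONObject(response.getBody()).getJSONObject("hits");
    total = hits.getInt("total");                // total number of matching documents
    JSONArray hitsArr = hits.getJSONArray("hits");
    for (int i = 0; i < hitsArr.length(); i++) {
        // process hitsArr.getJSONObject(i)
    }
}
Remember that this only works while from + size stays at or below index.max_result_window (10,000 by default).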
2) You can use the scroll API, something like this in Java:
public String getJSONResponse() throws IOException {
    String res = "";
    int docParsed = 0;
    String fooResourceUrl
            = "http://localhost:9200/myindex/mytype/_search?scroll=5m&size=100";
    ResponseEntity<String> response
            = restTemplate.getForEntity(fooResourceUrl, String.class);
    JSONObject fulMappingOuter = new JSONObject(response.getBody());
    String scroll_id = fulMappingOuter.getString("_scroll_id");
    JSONObject fulMapping = fulMappingOuter.getJSONObject("hits");
    int totDocCount = fulMapping.getInt("total");
    JSONArray hitsArr = fulMapping.getJSONArray("hits");
    System.out.println("total hits: " + hitsArr.length());
    while (docParsed < totDocCount) {
        for (int i = 0; i < hitsArr.length(); i++) {
            docParsed++;
            // do your stuff with hitsArr.getJSONObject(i)
        }
        String uri = "http://localhost:9200/_search/scroll";
        // set headers
        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON);
        JSONObject searchBody = new JSONObject();
        searchBody.put("scroll", "5m");
        searchBody.put("scroll_id", scroll_id);
        HttpEntity<String> entity = new HttpEntity<>(searchBody.toString(), headers);
        // send the scroll request and parse the next batch of hits
        ResponseEntity<String> responseScroll = restTemplate
                .exchange(uri, HttpMethod.POST, entity, String.class);
        JSONObject scrollOuter = new JSONObject(responseScroll.getBody());
        scroll_id = scrollOuter.getString("_scroll_id"); // the scroll id may change between calls
        fulMapping = scrollOuter.getJSONObject("hits");
        hitsArr = fulMapping.getJSONArray("hits");
    }
    return res;
}
Calling the scroll API initialises a 'scroller'. This returns the first set of results along with a scroll_id, the number of results being 100 as set when creating the scroller in the first call. Notice the 5m in the first URL's parameters? That sets the scroll time, i.e. how long Elasticsearch will keep the search context alive. Once this time has expired, no further results can be fetched using this scroll id. (It is also good practice to remove the scroll context once your job has finished, before the scroll time expires, as keeping the scroll context alive is quite resource intensive.)
For each subsequent scroll request, the updated scroll_id is sent and next batch of results is returned.
Note: Here I have used Spring's RestTemplate client to make the calls and then parsed the response JSON with a JSON parser. However, the same can be achieved using Elasticsearch's own high-level REST client from Groovy. Here are references for the scroll API:
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-scroll.html
https://www.elastic.co/guide/en/elasticsearch/client/java-rest/master/java-rest-high-search-scroll.html
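Since keeping scroll contexts alive is resource intensive (see the note above), here is a rough sketch of clearing the scroll context once all batches are processed, reusing the restTemplate, headers and scroll_id from the example above. Note that the default RestTemplate request factory may not send a request body with DELETE, so you may need to configure it with the Apache HttpComponents-based factory:
// Clear the scroll context when you are done with the scroller
JSONObject clearBody = new JSONObject();
clearBody.put("scroll_id", scroll_id);
HttpEntity<String> clearEntity = new HttpEntity<>(clearBody.toString(), headers);
restTemplate.exchange("http://localhost:9200/_search/scroll",
        HttpMethod.DELETE, clearEntity, String.class);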
I'm using DataTables 1.10.5. My table uses server side processing via ajax.
$('#' + id).dataTable({
processing: true,
serverSide: true,
ajax: 'server-side-php-script-url',
"pagingType": "simple_incremental_bootstrap"
});
Everything works properly if I send 'recordsTotal' in the server response. But I don't want to count the total entries because of performance issues. So I tried to use the pagination plugin simple_incremental_bootstrap. However, it is not working as expected: the Next button always returns the first page itself. If I give 'recordsTotal' in the server response, the plugin works properly. I found out that if we don't give 'recordsTotal', the 'start' param sent by DataTables to the server-side script is always 0, so my server-side script always returns the first page.
According to this discussion, server side processing without calculating total count is not possible because “DataTables uses the record count that is passed back to it to deal with the paging controls”. The suggested workaround is “So the display records are needed, but it would be possible to just pass back a static number (like 1'000'000 or whatever) which would make DataTables think there are a million rows. You could hide the information element if this information is totally bogus!”
I wonder if anybody have a solution for this. Basically I want to have a simple pagination in my datatable with ajax without sending total count from server.
A workaround worth trying...
If we don't send recordsTotal from the server, the pagination won't work properly. If we send a high static number as recordsTotal, the table will show an active Next button even if there is no data in the next page.
So I ended up with a solution that utilizes two parameters received in the ajax script: 'start' and 'length'.
If the number of rows in the current page is less than 'length', there is no data in the next page. So the total count will be 'start' + the current page count. This disables the Next button on the last page.
If the number of rows in the current page is equal to 'length', there may be more data in the next pages. Then I fetch the data for the next page. If there is at least one row in the next page, I send a recordsTotal somewhat larger than 'start' + 'length'. This displays an active Next button.
Sample code:
$limit  = require_param('length');
$offset = require_param('start');
$draw   = require_param('draw');

$current_page_data  = fn_to_calculate_data($limit, $offset); // in my case, a mysqli result
$current_page_count = mysqli_num_rows($current_page_data);

// Build the row array that DataTables expects from the current page's result set
$data = array();
while ($row = mysqli_fetch_assoc($current_page_data)) {
    $data[] = $row;
}

if ($current_page_count >= $limit) {
    $next_page_data  = fn_to_calculate_data($limit, $offset + $limit);
    $next_page_count = mysqli_num_rows($next_page_data);
    if ($next_page_count >= $limit) {
        // Not the exact count, just indicates that we have more pages to show.
        $total_count = $offset + (2 * $limit);
    } else {
        $total_count = $offset + $limit + $next_page_count;
    }
} else {
    $total_count = $offset + $current_page_count;
}
$filtered_count = $total_count;

send_json(array(
    'draw'            => $draw,
    'recordsTotal'    => $total_count,
    'recordsFiltered' => $filtered_count,
    'data'            => $data
));
However, this solution adds some load to the server, as it additionally fetches the next page to count its rows. Still, that is a small load compared to calculating the total row count.
We also need to hide the count information from the table footer and use simple pagination.
dtOptions = {};
dtOptions.pagingType = "simple";
dtOptions.fnDrawCallback = function() {
$('#'+table_id+"_info").hide();
};
$('#' + table_id).dataTable(dtOptions);
I'm trying to create fast pagination with ElasticSearch. I have read this doc page about search_after operator. I understand how to create a "forward" pagination. But I can't realize how to move to the previous page in this case.
In a project we are working on, we're simply going to reverse the sort direction and then use search_after as if it were a search_before.
This is a late answer, but it's a little better than having to keep track of results in the application. For that specific scenario the Scroll API (which I don't know was available at the time) should be better suited.
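As a rough sketch of that idea with the Java high-level REST client (the field names, the id tie-breaker and the variables holding the current page's first sort values are placeholder assumptions; the answer below uses JavaScript instead):
// Fetch the "previous" page by reversing the sort direction and using
// search_after with the sort values of the first hit of the current page.
SearchSourceBuilder source = new SearchSourceBuilder()
        .size(10)
        .sort(new FieldSortBuilder("timestamp").order(SortOrder.DESC)) // forward order is ASC
        .sort(new FieldSortBuilder("id").order(SortOrder.DESC))        // tie-breaker field
        .searchAfter(new Object[]{ currentPageFirstTimestamp, currentPageFirstId });

SearchRequest request = new SearchRequest("myindex").source(source);
SearchResponse response = client.search(request, RequestOptions.DEFAULT);
// The hits come back in reverse order, so reverse them again before rendering the page.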
Although the API doesn't have a search previous, you have this workaround.
It's easy to move backward and I had to do it as well. Just keep track of it in variables in whatever language you're using.
I keep an object of the searches, indexed by page offset, to keep track of where I am in the data. For example:
$scope.search.display = 0;
$scope.searchIndex = {};
var data = getElasticSearchQuery(); // this is your data from the elastic query
if (!$scope.searchIndex[$scope.search.display + 10] && data.hits.length > 0) {
    // remember the sort values of the last hit so the next page can search_after it
    $scope.searchIndex[$scope.search.display + 10] = data.hits[data.hits.length - 1].sort;
}
If you have 'next' and 'previous' buttons, then in your POST request for elastic just assign the search_after parameter with the correct index:
$scope.prevButton = function() {
    $scope.search.display -= 10;
    if ($scope.search.display < 10) {
        $scope.search.searchAfter = null;
    }
    if ($scope.searchIndex[$scope.search.display]) {
        $scope.search.searchAfter = $scope.searchIndex[$scope.search.display];
    }
    $scope.sendResults(); // send the post in an elastic search query
};
$scope.nextButton = function() {
    $scope.search.display += 10;
    if ($scope.searchIndex[$scope.search.display]) {
        $scope.search.searchAfter = $scope.searchIndex[$scope.search.display];
    }
    $scope.sendResults(); // send the post in an elastic search query
};
That should get you on your feet. The 10 is my size, meaning I have a pagination of 10 results.
I just want to get some ideas from anyone who has encountered similar problems and how you came up with a solution.
Basically, we have around 10K documents stored in RavenDB. And we need the ability to allow users to perform filter and search against those documents. I am aware that there is a maximum of 1024 page size within RavenDb. So in order for the filter and search to work, I need to do my own paging. But my solution gives me the following error:
The maximum number of requests (30) allowed for this session has been reached.
I have tried many different ways of disposing the session, by wrapping it in a using block and also by explicitly calling Dispose after every call to RavenDB, with no success.
Does anyone know how to get around this issue? what's the best practice for this kind of scenario?
var pageSize = 1024;
var skipSize = 0;
var maxSize = 0;
using (_documentSession)
{
maxSize = _documentSession.Query<LogEvent>().Count();
}
while (skipSize < maxSize)
{
using (_documentSession)
{
var events = _documentSession.Query<LogEvent>().Skip(skipSize).Take(pageSize).ToList();
_documentSession.Dispose();
//building finalPredicate codes..... which i am not providing here....
results.AddRange(events.Where(finalPredicate.Compile()).ToList());
skipSize += pageSize;
}
}
Raven limits the number of requests (Load, Query, ...) to 30 per session. This behavior is documented.
I can see that you dispose the session in your code, but I don't see where you recreate it. Anyway, loading data the way you intend to is not a good idea.
We're using indexes and paging and never load more than 1024 documents at once.
If you're expecting thousands of documents, or your predicate logic doesn't work as an index and you don't care how long your query will take, use the unbounded results (streaming) API:
var results = new List<LogEvent>();
var query = session.Query<LogEvent>();
using (var enumerator = session.Advanced.Stream(query))
{
while (enumerator.MoveNext())
{
if (predicate(enumerator.Current.Document)) {
results.Add(enumerator.Current.Document);
}
}
}
Depending on the number of documents, this can use a lot of RAM.