Get the only failed document response in Bulk API Elasticsearch - elasticsearch

I am struggling with Bulk API. I am sending 100 request (index, update) in every Bulk request. It gives me response with status of each request. Suppose my 97th request got fail, I have to make it loop to find the particular error document. I think its not optimize way. If i am sending more number of Bulk request, It makes my process slow. Is there any way where i will get only failed document or count of fail/success document in response? I am using php-elasticsearch SDK .

for count of fail/success you can use this method:
get count of index before bulk action
you can ignore if if there is no index
$parameters = ["index" => "your_index","type" => "your_type"];
$response = $esclient->count($params);
$old_count = $response['count'];
use refresh key with true value in parameters that you send as bulk
this refresh this index after performing bulk action
$params['refresh'] = true;
$params['body'] = ...;
$total_count = count($params['body']) / 2; //get the total request count
$esclient->bulk($params);
after that you can use count method to find out how many index exist
$response = $esclient->count($parameters);
$new_count = $response['count'];
get the total success
$total_success = $new_count - $old_count;
get the total failed
$total_fail = $total_count- $total_success;

Related

Power Query sends multiple requests instead of one. How to deal with it?

My code for requesting data is simple. Server returns 404 (and simple message "Broken data") every request except every 10th request. On 10th request server returns 200 (With other simple text "Data from server")
So in power query I found this part of code:
producer = (val) =>
let
result = Web.Contents(url, [ManualStatusHandling = {404}]) // (1)
status = Value.Metadata(result)[Response.Status] // (2)
actualResult = if status = 404 then null else result // (3)
in
Text.FromBinary(actualResult)
`
So, when I run request result (1) is okay. And status (2) is okay. But when line (3) executed - instead of using (1) it resends request and gets wrong result.
I've tried to convert (1) to Binary.Buffer but in this case Value.Metadata gives empty record.
How can I force PQ to send only one request or somehow make it manually. Thanks!

Web.Content calling API service and merging pages with List.Transform started to fail

I created PowerBI report which which is connecting to data source via API service. Returning json contains thousands of entities. API service is called via Web.Content function. API service returns always total record count and so we are able to calculate nr. of pages which has to be called to obtain whole dataset. This report is displaying data from our servicedesk app, which is deployed on many servers and for many customers and use Query parameters to connect to any of these servers.
Detail of Power query is below.
Why am I writing here. This report was working without any issue more than 1,5 year but on August 17th one of servers start causing erros in step Pages where are some random lines (pages) with errors - see attached picture labeled "Errors in step Pages". and this is reason that next step Entities (List.Union) in query is stopping refresh and generate errors with message:
Expression.Error: We cannot apply field access to the type List. Details: Value=[List] Key=requests
What is notable
API service si returning records in the same order but faulty lists are random when calling with same parameters
some times is refresh without any error
The same power query called on another server is working correctly , problem is only with one specific server.
This problem started without notice on the most important server after 1,5 year without any problem.
Here is full text power of query for this main source, which is used later in other queries to extract all necessary data. Json is really complicated and I extract from it list of requests, list of solvers, list of solver groups,.... and this base query and its output is input for many referenced queries.
Errors in step Pages
let
BaseAPIUrl = apiurl&"apiservice?", /*apiurl is parameter - name of server e.g. https://xxxx.xxxxxx.sk/ */
EntitiesPerPage = RecordsPerPage, /*RecordsPerPage is parameter and defines nr. of record per page - we used as optimum 200-400 record per pages, but is working also with 4000 record per page*/
ApiToken = FnApiToken(), /*this function is returning apitoken value which is returning value of another api service apiurl&"api/auth/login", which use username and password in body of call to get apitoken */
GetJson = (QParm) => /*definiton general function to get data from data source*/
let
Options =
[ Query= QParm,
Headers=
[
Accept="application/json",
ApiKeyName="apitoken",
Authorization=ApiToken
]
],
RawData = Web.Contents(BaseAPIUrl, Options),
Json = Json.Document(RawData)
in Json,
GetEntityCount = () => /*one times called function to get nr of records using GetJson, which is returned as a part of each call*/
let
QParm = [pp="1", pg="1" ],
Json = GetJson(QParm),
Count = Json[totalRecord]
in
Count,
GetPage = (Index) => /*repeatadly called function to get each page of json using GetJson*/
let
PageNr = Text.From(Index+1),
PerPage = Text.From(EntitiesPerPage),
QParm = [pg = PageNr, pp=PerPage],
Json = GetJson(QParm),
Value = Json[data][requests]
in Value,
EntityCount = List.Max({ EntitiesPerPage, GetEntityCount() }), /*setup of nr. of records to variable*/
PageCount = Number.RoundUp(EntityCount / EntitiesPerPage), /*setup of nr. of pages */
PageIndices = { 0 .. PageCount - 1 },
Pages = List.Transform(PageIndices, each GetPage(_) /*Function.InvokeAfter(()=>GetPage(_),#duration(0,0,0,1))*/), /*here we call for each page GetJson function to get whole dataset - there is in comment test with delay between getpages but was not neccessary*/
Entities = List.Union(Pages),
Table = Table.FromList(Entities, Splitter.SplitByNothing(), null, null, ExtraValues.Error)
I also tried another way of appending pages to list using List.Generate. This is also bringing random errors in list but
it is bringing possibility to transform to table in contrast with original way with using List.Transform, but other referenced queries are failing and contains on the last row errors
When I am exploring content of faulty page/list extracting it via Add as New Query there are always all record without any fail.....
Source = List.Generate( /*another way to generate list of all pages*/
() => [Page = 0, ReqPageData = GetPage(0) ],
each [Page] < PageCount,
each [ReqPageData = GetPage( [Page] ),
Page = [Page] + 1 ],
each [ReqPageData]
),
#"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error), /*here i am able to generate table from list in contrast when is used List.Generate*/
#"Expanded Column1" = Table.ExpandListColumn(#"Converted to Table", "Column1"), /*here aj can expand list to column*/
#"Removed Errors" = Table.RemoveRowsWithErrors(#"Expanded Column1", {"Column1"}) /*here i try to exclude errors, but i dont know what happend and which records (if any) are excluded*/
Extracting errored page
and finnaly I am tottaly clueless not able to find the cause of this behavior on this specific server. I tested to call pages which are errored via POSTMAN, I discused this issue with author of API service and He also tried to call this API service with all parameters but server is returning every page OK, only Power query is not able to List.Transform ...
I will be grateful and appreciate any tips or advice or if somebody solved the same issue in the past ....
Kuby
No, each error line of list in step List.Transform coud by extracted as new query and there are all records from one page OK. hmmmm
Finnaly, problem described in this issue was caused by "corrupted" content of returning json. The provider of core system informed me that they found bug and after fixing on the side of servisdesk is everything OK again. I tried to find problem in Power query and problem was in servisdesk. :(

Changing "from" value in elasticsearch query dynamically

In my elasticsearch query I have following:
"from":0,
"size":100,
I have thousands of records in database which I want to fetch in batches of 100.
I process one batch, and then fetch next batch of 100 and so on. I know how many records are to be fetched in total.
So value for 'from' needs to be changed dynamically.
How can I modify "from" in code?
Edit: I am programming in groovy.
There are two ways to do this depending on what do you need it for-
1) First one is simply using pagination and you can keep updating the "from" variable by the desired result size in a loop till you have retrieved all the results (considering you have the total count at the start) , but the problem with this approach is - till 'from' is < 9000 it works fine, but after it exceeds 9000 you get this size restriction error-
"Result window is too large, from + size must be less than or equal to: [10000] but was [100000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting"
which can be countered, as mentioned in the error by changing the index.max_result_window setting.However if you are instead planning to use this call as a one time operation(example for re-indexing) its is better to use to the scroll api as mentioned in the next point. (reference - How to retrieve all documents(size greater than 10000) in an elasticsearch index )
2) You can use the scroll api, something like this in java :
public String getJSONResponse() throws IOException {
String res = "";
int docParsed = 0;
String fooResourceUrl
= "http://localhost:9200/myindex/mytype/_search?scroll=5m&size=100";
ResponseEntity<String> response
= restTemplate.getForEntity(fooResourceUrl, String.class);
JSONObject fulMappingOuter = (JSONObject) new JSONObject(response.getBody());
String scroll_id = fulMappingOuter.getString("_scroll_id");
JSONObject fulMapping = fulMappingOuter.getJSONObject("hits");
int totDocCount = fulMapping.getInt("total");
JSONArray hitsArr = (JSONArray) fulMapping.getJSONArray("hits");
System.out.println("total hits:" + hitsArr.length());
while (docParsed < totDocCount) {
for (int i = 0; i < hitsArr.length(); i++) {
docParsed++;
//do your stuff
}
String uri
= "http://localhost:9200/_search/scroll";
// set headers
HttpHeaders headers = new HttpHeaders();
headers.setContentType(MediaType.APPLICATION_JSON);
JSONObject searchBody = new JSONObject();
searchBody.put("scroll", "5m");
searchBody.put("scroll_id", scroll_id);
HttpEntity<String> entity = new HttpEntity<>(searchBody.toString(), headers);
// // send request and parse result
ResponseEntity<String> responseScroll = restTemplate
.exchange(uri, HttpMethod.POST, entity, String.class);
fulMapping = (JSONObject) new JSONObject(responseScroll.getBody()).get("hits");
hitsArr = (JSONArray) fulMapping.getJSONArray("hits");
// System.out.println("response when trying to upload to local: "+response.getBody());
}
return res;
}
Calling the scroll api initialises a 'Scroller' . This returns the first set of results along with a scroll_id the number of results being 100 as set when creating the scroller in the first call. Notice the 5m in the first url's parameter? That is for setting the scroll time, that is the time in minutes for which ElasticSearch will keep the search context alive,if this time is expired, no results can be further fetched using this scroll id(also its a good practice to remove the scroll context if your job has finished before the scroll time expires, as keeping the scroll context alive is quite resource intensive)
For each subsequent scroll request, the updated scroll_id is sent and next batch of results is returned.
Note: Here I have used Springboot's RestTemplate Client to make the calls and then parsed the response JSONs by using JSON parsers. However the same can be achieved by using elastic-search's own high level REST client for Groovy . here's a reference to the scroll api -
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-scroll.html
https://www.elastic.co/guide/en/elasticsearch/client/java-rest/master/java-rest-high-search-scroll.html

Simple pagination in datatable using ajax without sending total count from server

I'm using DataTables 1.10.5. My table uses server side processing via ajax.
$('#' + id).dataTable({
processing: true,
serverSide: true,
ajax: 'server-side-php-script-url',
"pagingType": "simple_incremental_bootstrap"
});
Everything will work properly if I send 'recordsTotal' in the server response. But I don't want to count the total entries because of performance issues. So I tried to use the pagination plugin simple_incremental_bootstrap. However it is not working as expected. The next button always return the first page itself. If I give 'recordsTotal' in server response this plugin will work properly. I found out that If we don't give 'recordsTotal', the 'start' param sent by datatable to server side script is always 0. So my server side script will always return the first page.
According to this discussion, server side processing without calculating total count is not possible because “DataTables uses the record count that is passed back to it to deal with the paging controls”. The suggested workaround is “So the display records are needed, but it would be possible to just pass back a static number (like 1'000'000 or whatever) which would make DataTables think there are a million rows. You could hide the information element if this information is totally bogus!”
I wonder if anybody have a solution for this. Basically I want to have a simple pagination in my datatable with ajax without sending total count from server.
A workaround worth to try..
If we don't send recordsTotal from server, the pagination won't work properly. If we send a high static number as recordsTotal, table will show an active Next button even if there is no data in next page.
So I ended up in a solution which utilizes two parameters received in ajax script - 'start' and 'length'.
If rows in current page is less than 'limit' there is no data in next page. So total count will be 'start' + 'current page count'. This will disable Next button in the last page.
If rows in current page is equal to or greater than 'limit' there is more data in next pages. Then I will fetch data for next page. If there is at least one row in next page, send recordsTotal something larger than 'start + limit'. This will display an active Next button.
Sample code:
$limit = require_param('length');
$offset = require_param('start');
$current_page_data = fn_to_calculate_data($limit, $offset); // in my case, mysqli result.
$data = “fetch data $current_page_data”;
$current_page_count = mysqli_num_rows($current_page_data);
if($current_page_count >= $limit) {
$next_page_data = fn_to_calculate_data($limit, $offset+$limit);
$next_page_count = mysqli_num_rows($next_page_data);
if($next_page_count >= $limit) {
// Not the exact count, just indicate that we have more pages to show.
$total_count = $offset+(2*$limit);
} else {
$total_count = $offset+$limit+$next_page_count;
}
} else {
$total_count = $offset+$current_page_count;
}
$filtered_count = $total_count;
send_json(array(
'draw' => $params['draw'],
'recordsTotal' => $total_count,
'recordsFiltered' => $filtered_count,
'data' => $data)
);
However this solution adds some load to server as it additionally calculate count of rows in next page. Anyway it is a small load as compared to the calculation total rows.
We need to hide the count information from table footer and use simple pagination.
dtOptions = {};
dtOptions.pagingType = "simple";
dtOptions.fnDrawCallback = function() {
$('#'+table_id+"_info").hide();
};
$('#' + table_id).dataTable(dtOptions);

Elasticsearch NEST client library

What we are trying to do is to index a bunch of documents in batches i.e.
foreach (var batch in props.ChunkBy(100))
{
var result = await client.IndexManyAsync<Type>(batch, indexName);
}
We would like to STOP Elasticsearch REFRESHING the Index until we have finished indexing all the batches. Then enable and refresh the index.
How can we achieve this with the NEST library
Many thanks
You can effectively disable the index refresh by setting the interval value to -1. Below is a code sample that shows how to set the refresh interval to -1 using the Nest client. Then you can do your bulk operations and afterwards set the refresh interval back to the default of 1 second.
//Set Index Refresh Interval to -1, essentially disabling the refresh
var updateDisableIndexRefresh = new UpdateIndexSettingsRequest();
updateDisableIndexRefresh.IndexSettings.RefreshInterval = Time.MinusOne;
client.UpdateIndexSettings(updateDisableIndexRefresh);
//Do your bulk operations here...
//Reset the Index Refresh Interval back to 1 second, the default setting.
var updateEnableIndexRefresh = new UpdateIndexSettingsRequest();
updateEnableIndexRefresh.IndexSettings.RefreshInterval = new Time(1, TimeUnit.Second);
client.UpdateIndexSettings(updateEnableIndexRefresh);

Resources