Changing "from" value in elasticsearch query dynamically

My Elasticsearch query contains the following:
"from":0,
"size":100,
I have thousands of records in the database that I want to fetch in batches of 100.
I process one batch, then fetch the next batch of 100, and so on. I know how many records are to be fetched in total.
So the value of "from" needs to change dynamically.
How can I modify "from" in code?
Edit: I am programming in Groovy.

There are two ways to do this, depending on what you need it for:
1) The first is plain pagination: keep increasing "from" by the batch size in a loop until you have retrieved all the results (assuming you know the total count at the start). The problem with this approach is that it only works while from + size stays at or below 10000 (with a size of 100, that means up to from = 9900). Beyond that you get this result-window error:
"Result window is too large, from + size must be less than or equal to: [10000] but was [100000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting"
This can be countered, as the error mentions, by raising the index.max_result_window setting. However, if you plan to use this call as a one-time operation (for example, for re-indexing), it is better to use the scroll API described in the next point. (Reference: How to retrieve all documents (size greater than 10000) in an elasticsearch index)
2) You can use the scroll API, something like this in Java:
// Imports needed at the top of the file; restTemplate is assumed to be a RestTemplate field of the class.
import java.io.IOException;
import org.json.JSONArray;
import org.json.JSONObject;
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpMethod;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;

public String getJSONResponse() throws IOException {
    String res = "";
    int docParsed = 0;
    // The first request opens the scroll context: size=100 is the batch size, scroll=5m the keep-alive time.
    String fooResourceUrl = "http://localhost:9200/myindex/mytype/_search?scroll=5m&size=100";
    ResponseEntity<String> response = restTemplate.getForEntity(fooResourceUrl, String.class);
    JSONObject fulMappingOuter = new JSONObject(response.getBody());
    String scroll_id = fulMappingOuter.getString("_scroll_id");
    JSONObject fulMapping = fulMappingOuter.getJSONObject("hits");
    // In Elasticsearch 6.x hits.total is a plain number; from 7.0 it is an object with a "value" field.
    int totDocCount = fulMapping.getInt("total");
    JSONArray hitsArr = fulMapping.getJSONArray("hits");
    System.out.println("total hits:" + totDocCount);
    // Stop when every document is processed, or when a batch comes back empty.
    while (docParsed < totDocCount && hitsArr.length() > 0) {
        for (int i = 0; i < hitsArr.length(); i++) {
            docParsed++;
            // do your stuff
        }
        String uri = "http://localhost:9200/_search/scroll";
        // set headers
        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON);
        JSONObject searchBody = new JSONObject();
        searchBody.put("scroll", "5m");
        searchBody.put("scroll_id", scroll_id);
        HttpEntity<String> entity = new HttpEntity<>(searchBody.toString(), headers);
        // send request and parse result
        ResponseEntity<String> responseScroll = restTemplate.exchange(uri, HttpMethod.POST, entity, String.class);
        JSONObject scrollOuter = new JSONObject(responseScroll.getBody());
        // The scroll id can change between calls, so always send the most recent one.
        scroll_id = scrollOuter.getString("_scroll_id");
        fulMapping = scrollOuter.getJSONObject("hits");
        hitsArr = fulMapping.getJSONArray("hits");
    }
    return res;
}
Calling the scroll API creates a search context (a "scroller"). The first call returns the first set of results along with a scroll_id; the number of results is 100, as set by the size parameter in that first call. Notice the 5m in the first URL's parameters? That sets the scroll time, i.e. how long Elasticsearch keeps the search context alive. Once this time has expired, no further results can be fetched using this scroll id. It is also good practice to clear the scroll context once your job has finished, rather than waiting for the scroll time to expire, because keeping a scroll context alive is quite resource intensive.
For each subsequent scroll request, the most recent scroll_id is sent and the next batch of results is returned.
Note: Here I have used Spring's RestTemplate client to make the calls and then parsed the response JSON with a JSON parser. The same can be achieved with Elasticsearch's own Java High Level REST Client, which can also be called from Groovy. Here are references for the scroll API:
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-scroll.html
https://www.elastic.co/guide/en/elasticsearch/client/java-rest/master/java-rest-high-search-scroll.html
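For the plain pagination approach (point 1), the sequence of "from" values can be computed up front from the known total. Below is a minimal sketch in plain Java, which should also be usable from Groovy. The class name FromOffsets, its helper methods, the total of 250 and the match_all query are illustrative assumptions, not part of any Elasticsearch client.

```java
import java.util.ArrayList;
import java.util.List;

class FromOffsets {
    // Returns every "from" value needed to cover `total` records in batches of `size`.
    static List<Integer> offsets(int total, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<Integer> result = new ArrayList<>();
        for (int from = 0; from < total; from += size) {
            result.add(from);
        }
        return result;
    }

    // Builds a search body for one batch; match_all is only a placeholder query.
    static String searchBody(int from, int size) {
        return "{\"from\":" + from + ",\"size\":" + size + ",\"query\":{\"match_all\":{}}}";
    }

    public static void main(String[] args) {
        int total = 250; // hypothetical total; in practice the record count known up front
        for (int from : offsets(total, 100)) {
            // Each body would be sent to the _search endpoint, one request per batch.
            System.out.println(searchBody(from, 100));
        }
    }
}
```

Each generated body is one request; this only stays valid while from + size does not exceed index.max_result_window.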

Related

Excessive Metadata when getting all states in Test Corda vault

I'm testing a Corda 4 CorDapp and have set up a Spring web server to make API calls to my CorDapps. I have one API call named `get-all-contract1-states` which does exactly what it says: it gets all of my contract1 states in the vault.
When I call this function, it does return the states, but it also returns an excessive amount of repetitive metadata, making the output for one state more than 600k lines long.
@GetMapping(value = "/get-contract1-states", produces = arrayOf(MediaType.APPLICATION_JSON_VALUE))
fun getContract1s() = rpcOps.vaultQueryBy(criteria = VaultQueryCriteria(status = Vault.StateStatus.ALL), paging = PageSpecification(DEFAULT_PAGE_NUM, 200), sorting = Sort(emptySet()), contractStateType = contract1State::class.java).states
Most of the repetitive metadata (about 85% of the 600k lines) is at the end of the JSON and concerns "zero":false,"one":false,"fieldSize":256,"fieldName":"SecP256R1Field". Are there any flags, options, or any other way to get back a clean version of the contract state without so much excess data? I only care about the variables from the contract, nothing more.

The vault query you currently have returns a Page:
data class Page<out T : ContractState>(val states: List<StateAndRef<T>>,
                                       val statesMetadata: List<StateMetadata>,
                                       val totalStatesAvailable: Long,
                                       val stateTypes: StateStatus,
                                       val otherResults: List<Any>)
That is why you're getting all the metadata. What you're after in this data object is states (which is actually a list of StateAndRef), and then just state.data within each element.
The following code should get you what you're after:
@GetMapping(value = "/get-contract1-states", produces = arrayOf(MediaType.APPLICATION_JSON_VALUE))
fun getContract1s() = proxy.vaultQueryBy(criteria = QueryCriteria.VaultQueryCriteria(status = Vault.StateStatus.ALL),
        paging = PageSpecification(DEFAULT_PAGE_NUM, 200),
        sorting = Sort(emptySet()),
        contractStateType = IOUState::class.java).states.map { it.state.data }
Note: the key bit here is the mapping to state.data.

Get the only failed document response in Bulk API Elasticsearch

I am struggling with the Bulk API. I send 100 requests (index, update) in every bulk request, and the response gives me the status of each request. Suppose my 97th request fails: I have to loop over the whole response to find the document with the error, which I don't think is an optimal approach. As I send more bulk requests, this slows my process down. Is there any way to get only the failed documents, or a count of failed/successful documents, in the response? I am using the php-elasticsearch SDK.

For a count of failures/successes you can use this method.
Get the document count of the index before the bulk action (if the index does not exist yet, skip this step and use 0):
$parameters = ["index" => "your_index", "type" => "your_type"];
$response = $esclient->count($parameters);
$old_count = $response['count'];
Set the refresh key to true in the parameters you send with the bulk request, so that the index is refreshed after the bulk action is performed:
$params['refresh'] = true;
$params['body'] = ...;
$total_count = count($params['body']) / 2; // total request count, assuming every action line is followed by a document line
$esclient->bulk($params);
After that, use the count method again to find out how many documents now exist:
$response = $esclient->count($parameters);
$new_count = $response['count'];
Get the total number of successes:
$total_success = $new_count - $old_count;
Get the total number of failures:
$total_fail = $total_count - $total_success;
Note that the count only grows when a new document is created, so this arithmetic holds for index requests that add documents; updates to existing documents do not change the count.
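A different technique from the counting approach above is to read the failures out of the bulk response itself. The response carries a top-level errors flag and an items array in which each entry is keyed by its action (index, update, ...) and contains an error object only when that action failed. Elasticsearch also accepts a filter_path request parameter (for example filter_path=items.*.error) to trim the response down to the failed entries. The sketch below is plain Java over an already-parsed items list; the class name BulkFailures, the sample data and the Map-based representation are assumptions for illustration, not the php-elasticsearch API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class BulkFailures {
    // Returns the positions of the bulk items that carry an "error" entry.
    static List<Integer> failedPositions(List<Map<String, Map<String, Object>>> items) {
        List<Integer> failed = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            // Each item has a single key naming the action (index, update, ...).
            for (Map<String, Object> result : items.get(i).values()) {
                if (result.containsKey("error")) {
                    failed.add(i);
                }
            }
        }
        return failed;
    }

    // Hypothetical parsed "items" array: one successful index, one failed update.
    static List<Map<String, Map<String, Object>>> sample() {
        return List.of(
            Map.of("index", Map.<String, Object>of("_id", "1", "status", 201)),
            Map.of("update", Map.<String, Object>of(
                "_id", "2",
                "status", 404,
                "error", Map.of("type", "document_missing_exception"))));
    }

    public static void main(String[] args) {
        List<Integer> failed = failedPositions(sample());
        System.out.println("failed positions: " + failed);
        System.out.println("success count: " + (sample().size() - failed.size()));
    }
}
```

Positions refer to the order of the actions in the request, so this has to run on the unfiltered response; with filter_path applied, only the failed entries come back and their original positions are no longer visible.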

Elasticsearch NEST client library

We are trying to index a bunch of documents in batches, i.e.
foreach (var batch in props.ChunkBy(100))
{
    var result = await client.IndexManyAsync<Type>(batch, indexName);
}
We would like to stop Elasticsearch from refreshing the index until we have finished indexing all the batches, and then re-enable refreshing and refresh the index.
How can we achieve this with the NEST library?
Many thanks

You can effectively disable the index refresh by setting the interval value to -1. Below is a code sample that shows how to set the refresh interval to -1 using the Nest client. Then you can do your bulk operations and afterwards set the refresh interval back to the default of 1 second.
//Set Index Refresh Interval to -1, essentially disabling the refresh
var updateDisableIndexRefresh = new UpdateIndexSettingsRequest();
updateDisableIndexRefresh.IndexSettings.RefreshInterval = Time.MinusOne;
client.UpdateIndexSettings(updateDisableIndexRefresh);
//Do your bulk operations here...
//Reset the Index Refresh Interval back to 1 second, the default setting.
var updateEnableIndexRefresh = new UpdateIndexSettingsRequest();
updateEnableIndexRefresh.IndexSettings.RefreshInterval = new Time(1, TimeUnit.Second);
client.UpdateIndexSettings(updateEnableIndexRefresh);

Load test randomization: How to set up WCAT to use different scenario for each virtual client?

I would like to run a load test of one of the POST actions in my web application. The problem is that the action can only complete if it receives a unique email address in the POST data. I generated a WCAT script with a few thousand requests, each with a unique email, like:
transaction
{
    id = "1";
    weight = 1;
    request
    {
        verb = POST; postdata = "Email=test546546546546%40loadtest.com&...";
        setheader { name="Content-Length"; value="..."; }
    }
    // more requests like that
}
My UBR settings file is like:
settings
{
counters
{
interval = 10;
counter = "Processor(_Total)\\% Processor Time";
counter = "Processor(_Total)\\% Privileged Time";
counter = "Processor(_Total)\\% User Time";
counter = "Processor(_Total)\\Interrupts/sec";
}
clientfile = "<above-wcat-script>";
server = "<host name>";
clients = 3;
virtualclients = 100;
}
When I run the test, 3x100 = 300 clients start sending requests, but they all do it in the same order, so the first request from the first client is processed and then the same request from the other 299 clients is no longer unique. Then the second request from some client is processed, and the 299 identical requests from the other clients are not unique either.
I need a way to randomize the requests, run them in a different order, or set up a separate scenario script for each virtual client, so that each request carries a unique email address.
Is it possible to do that with WCAT?
Or is there some other tool that can do such a test?

Have you considered using the rand(x,y) WCAT internal function to add a randomized integer to the email address? That way you could have a single transaction with a single request that uses a randomized email address. So instead of manually creating (say) 1000 requests with unique email addresses, you can use the single randomized transaction 1000 times.
Your new randomized transaction might look something like this:
transaction
{
    id = "1";
    weight = 1;
    request
    {
        verb = POST;
        postdata = "Email=" + rand("100000", "1000000") + "%40loadtest.com&...";
        setheader { name="Content-Length"; value="..."; }
    }
}
If rand(x,y) alone doesn't make it random enough, you could experiment with additional functions to make the data more varied. Perhaps something like this:
postdata = "Email=" + rand("100000", "1000000") + "%40loadtest" + clientindex() + vclientindex() + ".com&...";
You can find the WCAT 6.3 documentation here, including a list of the internal functions that are available. If the built-in functions don't suffice, you can even build your own.

How to synchronize HttpRequest or WebClient in Wp7?

I know I can only download a string asynchronously on Windows Phone 7, but in my app I want to know which request has completed.
Here is the scenario:
I make a download request using WebClient, with the following code to handle completion:
WebClient stringGrab = new WebClient();
stringGrab.DownloadStringCompleted += ClientDownloadStringCompleted;
stringGrab.DownloadStringAsync(new Uri(<some http string>, UriKind.Absolute));
I give the user the option of issuing another download request if this one takes too long for the user's liking.
My problem is that when/if the two requests return, I have no way of knowing which is which, i.e. which was the first request and which was the second.
Is there a way of identifying or synchronizing the requests?
I can't change the requests to return to different DownloadStringCompleted methods!
Thanks in advance!

Why not do something like this:
void DownloadAsync(string url, int sequence)
{
    var stringGrab = new WebClient();
    stringGrab.DownloadStringCompleted += (s, e) => HandleDownloadCompleted(e, sequence);
    stringGrab.DownloadStringAsync(new Uri(url, UriKind.Absolute));
}
void HandleDownloadCompleted(DownloadStringCompletedEventArgs e, int sequence)
{
    // The sequence param tells you which request was completed
}

It is an interesting question because by default WebClient doesn't carry any unique identifier. However, you can get the hash code of each instance, which in practice differs between live instances (although .NET does not guarantee that hash codes are unique).
So, for example:
WebClient client = new WebClient();
client.DownloadStringCompleted += new DownloadStringCompletedEventHandler(client_DownloadStringCompleted);
client.DownloadStringAsync(new Uri("http://www.microsoft.com", UriKind.Absolute));
WebClient client2 = new WebClient();
client2.DownloadStringCompleted += new DownloadStringCompletedEventHandler(client_DownloadStringCompleted);
client2.DownloadStringAsync(new Uri("http://www.microsoft.com", UriKind.Absolute));
Each instance has its own hash code, and you can store it before actually invoking the DownloadStringAsync method:
int FirstHash = client.GetHashCode();
int SecondHash = client2.GetHashCode();
Inside the completion event handler you can have this:
if (sender.GetHashCode() == FirstHash)
{
    // First completed
}
else
{
    // Second completed
}
REMEMBER: A new hash code is given for every re-instantiation. Because hash codes can collide, comparing the sender reference itself with the stored client (for example with ReferenceEquals(sender, client)) is the more robust check.

If the requests are essentially the same, then rather than keeping track of which request is being returned, why not just keep track of whether one has previously been returned? Or of how long it has been since the last one returned?
If you're only interested in getting this data once, but want to allow the user to reissue the request if it takes a long time, you can simply ignore all but the first successfully returned result. This way it doesn't matter how many additional requests the user makes, and you don't need to track anything unique to each request.
Similarly, if the user can request/update data from the remote service at any point, you could keep track of how long it has been since you last got successful data back and not bother updating the model/UI if you get another response shortly after that. It would be preferable not to make requests at all in this scenario, but if you have to deal with long delays and race conditions in responses, you could use this technique and still keep the UI/data up to date within a threshold of a few minutes (or however long you specify).
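The "ignore all but the first successful result" idea can be sketched as a small guard that every completion callback goes through. This is an illustration in plain Java rather than Windows Phone code; the class name FirstResultWins and its methods are hypothetical, and in the original scenario the call to onCompleted would sit inside the shared DownloadStringCompleted handler.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

class FirstResultWins {
    private final AtomicBoolean delivered = new AtomicBoolean(false);
    private final AtomicReference<String> result = new AtomicReference<>();

    // Called from every completion callback; returns true only for the
    // first successful response, which is the one that gets stored.
    boolean onCompleted(String body, boolean success) {
        if (success && delivered.compareAndSet(false, true)) {
            result.set(body);
            return true;
        }
        return false; // a failed request, or a later duplicate response
    }

    // The stored response body, or null if nothing has succeeded yet.
    String result() {
        return result.get();
    }

    public static void main(String[] args) {
        FirstResultWins guard = new FirstResultWins();
        System.out.println(guard.onCompleted("timeout", false));
        System.out.println(guard.onCompleted("payload from request 2", true));
        System.out.println(guard.onCompleted("payload from request 1", true));
        System.out.println(guard.result());
    }
}
```

The compareAndSet call makes the check-and-mark step atomic, so the guard stays correct even if two callbacks arrive on different threads at the same time.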
