Offset calculation in Spring Pagination

I have a service whose pagination index starts from 1. I get the list of entities, and after some logic I return them (responses) as below:
totalCount = responses.size();
return new PageImpl<>(responses, pageable, totalCount);
and when I request the 1st page as
new PageRequest(1, 100)
I get back the response as
{"content": [
{
"id": "e1",
}{
"id": "2",
}
],
"last": false,
"totalElements": 102,
"totalPages": 2,
"size": 100,
"number": 1,
"sort": null,
"first": true,
"numberOfElements": 2
}
Here, even though I have "numberOfElements": 2, I get back "totalElements": 102.
The issue I found is in the pageable.getOffset() calculation in PageImpl:
this.total = !content.isEmpty() && pageable != null
        && pageable.getOffset() + pageable.getPageSize() > total
        ? pageable.getOffset() + content.size()
        : total;
In my scenario, for the 1st page I am getting an offset of 100 (1 * 100). How do I resolve this?
Note: I use a third-party service to get the responses, and it is 1-indexed, so I am trying to align my service with it so that the entire logic follows the same indexing.

The result you get is correct, since PageRequest uses zero-based pages, as stated in the API docs:
Parameters:
page - zero-based page index.
size - the size of the page to be returned.
So that means you're retrieving the second page (not the first one), and since you have a limit of 100 records and a total of 102 records, you'll only retrieve the last two of them.
You can still expose a 1-based number though:
new PageRequest(page-1, 100);
Alternatively, you can customize this by implementing Pageable. This allows you to override the actual offset being used by Spring Data.
Nonetheless, this doesn't change the fact that Spring Data expects getPageNumber() to be a zero-based number. You cannot change that; you can only add an abstraction layer on top of it to make it meet your requirements.
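As an illustration only, a minimal sketch of such an abstraction layer (assuming the Spring Data version used in the question, where the PageRequest(int, int) constructor is available; the class name is made up):

import org.springframework.data.domain.PageRequest;

// Hypothetical wrapper, not part of Spring Data: callers pass a 1-based page
// number, and it is translated to the zero-based index Spring Data expects.
public class OneBasedPageRequest extends PageRequest {

    public OneBasedPageRequest(int oneBasedPage, int size) {
        super(oneBasedPage - 1, size);
    }
}

With this, new OneBasedPageRequest(1, 100) requests the first page (offset 0), while Spring Data internally still sees page number 0.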

And what's wrong with that? totalElements tells you how many elements are stored within the data source. numberOfElements tells you how many elements the current page contains.
When you have 102 elements in total and you request page 2 with size 100, you should get exactly the response you received.
What probably confuses you:
With new PageRequest(1, 100) you are requesting the 2nd page as the index starts at 0.

Related

Data Operation - Select (Json Array)

I have a JSON Array with the following structure:
{
  "InvoiceNumber": "11111",
  "AccountName": "Hospital",
  "items": {
    "item": [
      {
        "Quantity": "48.000000",
        "Rate": "0.330667",
        "Total": "15.87"
      },
      {
        "Quantity": "1.000000",
        "Rate": "25.000000",
        "Total": "25.00"
      }
    ]
  }
}
I would like to use Data Operation "Select" to select invoice numbers with invoice details.
Select:
From body('Parse_Json')?['invoices']?['invoice']
Key: Invoice Number; Map: item()['InvoiceNumber'] - this line works
Key: Rate; Map: item()['InvoiceNumber']?['items']?['item']?['Rate'] - this line doesn't work.
The error message says "Array elements can only be selected using an integer index". Is it possible to select the Invoice Number AND all the invoice details such as rate etc.? Thank you in advance! Also, I am trying not to use "Apply to each".
You have to use a loop in some form; the data resides in an array. The only way you can avoid looping is if you know that the number of items in the array will always be of a certain length.
Without looping, you can't be sure that you've processed each item.
To answer your question though, if you want to select a specific item in an array, as the error describes, you need to provide the index.
This is the sort of expression you need. In this one, I am selecting the item at position 1 (arrays start at 0) ...
body('Parse_JSON')?['items']?['item'][1]['rate']
You can always extract just the items object individually, but you'll still need to loop to process each item IF the length is never a static two items (for example).
To extract the items, you select the object from the dynamic content.
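If the array were guaranteed to have a fixed, known length, the Select could reference items by index directly. A rough sketch of such a mapping (the key names are invented for illustration; only the expressions follow the pattern from the question):
From: body('Parse_Json')?['invoices']?['invoice']
Key: InvoiceNumber; Map: item()?['InvoiceNumber']
Key: FirstItemRate; Map: item()?['items']?['item'][0]['Rate']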

Elasticsearch number of results changes with pagination

I'm using Elasticsearch 7.6.0 and have paginated one of my queries. It seems to work well, and I can vary the number of results per page and the selected page using the search from and size parameters.
query = 'sample query'
items_per_page = 12
page = 0

es_query = {
    'query': {
        'bool': {
            'must': [{
                'multi_match': {
                    'query': query,
                    'fuzziness': 'AUTO',
                    'operator': 'and',
                    'fields': ['title^2', 'description']
                },
            }]
        }
    },
    'min_score': 5.0
}

res = es.search(index='my-index', body=es_query, size=items_per_page, from_=items_per_page * page)
hits = sorted(res['hits']['hits'], key=lambda x: x['_score'], reverse=True)
print(res['hits']['total']['value'])  # This changes depending on the page provided
I've noticed that the number of results returned depends on the page provided, which makes no sense to me! The number of results also oscillates which further confuses me: Page 0, 233 items. Page 1, 157 items. Page 2, 157 items. Page 3, 233 items...
Why does res['hits']['total']['value'] depend on the size and from parameters?
The search is distributed and sent to all the nodes holding shards of the searched indices. All the results are then merged and returned. Sometimes, not all shards can be searched. This happens when:
The cluster is very busy
The specific shard is not available due to recovery process
The search has been optimized and the shard has been omitted.
In the response, there is a _shards section like this:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {...}
}
Check if there is any value other than 0 for failed shards. If so, check the logs and cluster and index status.
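As a quick sketch building on the Python snippet from the question, you can inspect that shard summary programmatically:

# Look at the shard summary of the response; a non-zero 'failed' or 'skipped'
# count means not every shard contributed to the totals.
shards = res['_shards']
if shards['failed'] > 0 or shards['skipped'] > 0:
    print('Not all shards were searched:', shards)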
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-track-total-hits
Generally the total hit count can’t be computed accurately without visiting all matches, which is costly for queries that match lots of documents. The track_total_hits parameter allows you to control how the total number of hits should be tracked. Given that it is often enough to have a lower bound of the number of hits, such as "there are at least 10000 hits", the default is set to 10,000. This means that requests will count the total hit accurately up to 10,000 hits. It’s a good trade-off to speed up searches if you don’t need the accurate number of hits after a certain threshold.
When set to true the search response will always track the number of hits that match the query accurately (e.g. total.relation will always be equal to "eq" when track_total_hits is set to true). Otherwise the "total.relation" returned in the "total" object in the search response determines how the "total.value" should be interpreted. A value of "gte" means that the "total.value" is a lower bound of the total hits that match the query and a value of "eq" indicates that "total.value" is the accurate count.
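A minimal sketch of how that could look with the snippet from the question (assuming elasticsearch-py 7.x, where track_total_hits can be passed in the request body):

es_query['track_total_hits'] = True  # ask ES to count every matching document exactly
res = es.search(index='my-index', body=es_query,
                size=items_per_page, from_=items_per_page * page)
print(res['hits']['total']['value'], res['hits']['total']['relation'])  # relation should be 'eq'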
len(res['hits']['hits']) will always return the same number as specified in items_per_page (i.e. 12 in your case), except for the last page, where it might return a number smaller than or equal to 12.
However, res['hits']['total']['value'] is the total number of documents that match your query (subject to the track_total_hits behaviour described above), not the number of results returned on the current page. If that total genuinely increases, it means new matching documents were indexed between one query and the next.

Detecting changes when comparing documents within an index in ElasticSearch

I'm using elastic search to store website crawl data in one index. Docs look something like this:
{"crawl_id": 1, url": "http://www.example.com", "status": 200}
{"crawl_id": 1, url": "http://www.example.com/test", "status": 200}
{"crawl_id": 2, url": "http://www.example.com", "status": 200}
{"crawl_id": 2, url": "http://www.example.com/test", "status": 500}
How would I compare 2 different crawls? For instance
I want to know which pages have changed their status code from 200 to 500, in crawl_id 2 when I compare crawl_id 2 with crawl_id 1.
I'd like to get the list of documents, but also aggregate on those results.
For instance 1 page changed from 200 to 500.
Any ideas?
I would use parent/child documents for that: parents representing each URL, children representing each crawl event. Then I'd select parents by searching the children (I don't know whether this feature is still maintained or whether it has been renamed to join data types).
I'd also have a look at document versions and see which one fits your requirements better.
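For illustration only, a rough sketch of what a join-field mapping for this idea could look like (field and relation names are invented, not from the answer), written as a Python dict in line with the earlier Elasticsearch snippets:

# Hypothetical mapping: 'url' documents are parents, 'crawl' documents are children.
mapping = {
    "mappings": {
        "properties": {
            "url":      {"type": "keyword"},
            "crawl_id": {"type": "integer"},
            "status":   {"type": "integer"},
            "relation": {
                "type": "join",
                "relations": {"url": "crawl"}
            }
        }
    }
}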

Elastic - Search across object without key specification

I have an index with hundreds of millions of docs, and each of them has an object "histogram" with values for each day:
"_source": {
"proxy": {
"histogram": {
"2017-11-20": 411,
"2017-11-21": 34,
"2017-11-22": 0,
"2017-11-23": 2,
"2017-11-24": 1,
"2017-11-25": 2692,
"2017-11-26": 11673
}
}
}
And I need one of two solutions:
Find docs where any value inside the histogram object is greater than XX
Find docs where the average of the values in the histogram object is greater than XX
For point 1 I can use a range query, but I must specify the exact field name (i.e. proxy.histogram.2017-11-20), and the wildcard version (proxy.histogram.*) does not work; a sketch of that query is shown below.
For point 2 I found only the average aggregation in ES, but I don't want to aggregate these fields after the query (because of the large amount of data); I want only to search for these docs.
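For reference, a rough sketch of the point-1 range query (the threshold 100 stands in for XX), which only works when the exact date-keyed field name is spelled out:

# Works only for one explicit day; a wildcard field name is not accepted here.
es_query = {
    "query": {
        "range": {
            "proxy.histogram.2017-11-20": {"gt": 100}
        }
    }
}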

couchdb view/reduce. sometimes you can return values, sometimes you cant..?

This is on a recent version of couchbase server.
The end goal is for the reduce/groupby to aggregate the values of the duplicate keys in to a single row with an array value.
view result with no reduce/grouping (in reality there are maybe 50 rows like this emitted):
{
  "total_rows": 3,
  "offset": 0,
  "rows": [
    {
      "id": "1806a62a75b82aa6071a8a7a95d1741d",
      "key": "064b6b4b-8e08-4806-b095-9e59495ac050",
      "value": "1806a62a75b82aa6071a8a7a95d1741d"
    },
    {
      "id": "47abb54bf31d39946117f6bfd1b088af",
      "key": "064b6b4b-8e08-4806-b095-9e59495ac050",
      "value": "47abb54bf31d39946117f6bfd1b088af"
    },
    {
      "id": "ed6a3dd3-27f9-4845-ac21-f8a5767ae90f",
      "key": "064b6b4b-8e08-4806-b095-9e59495ac050",
      "value": "ed6a3dd3-27f9-4845-ac21-f8a5767ae90f"
    }
  ]
}
with reduce + group_level=1:
function (keys, values, rereduce) {
  return values;
}
yields an error from Couch with the actual 50 or so rows from the real view (it even fails with fewer view rows). Couch says something about the data not shrinking rapidly enough. However, this same type of thing works JUST FINE when the view keys are integers and there is a small amount of data.
Can someone please explain the difference to me?
Reduce values need to remain as small as possible, due to the nature of how they are stored in the internal b-tree data format. There's a little bit of information in the wiki about why this is.
If you want to identify unique values, this needs to be done in your map function. This section on the same wiki page shows you one method you can use to do so. (I'm sure there are others)
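To make the "keep reduce values small" point concrete, a sketch of the usual pattern (not from the answer): have the reduce return only a count, and collect the actual ids in the app from the unreduced rows:

function (keys, values, rereduce) {
  // On rereduce the incoming values are partial counts; otherwise they are
  // the mapped values for one key, so just report how many there are.
  if (rereduce) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
      total += values[i];
    }
    return total;
  }
  return values.length;
}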
I am almost always going to be querying this view with a "key" parameter, so there really is no need to aggregate values via Couch; it can be easily and efficiently done in the app.
