Elasticsearch Scroll

I am a little bit confused about Elasticsearch's scroll functionality.
In Elasticsearch, is it possible to call the search API every time the user scrolls through the result set?
From the documentation:
"search_type" => "scan", // use search_type=scan
"scroll" => "30s", // how long between scroll requests. should be small!
"size" => 50, // how many results *per shard* you want back
Does that mean it will perform a search every 30 seconds and return result sets until there are no more records?
For example, my ES returns 500 records in total, and I am getting the data from ES as two sets of 250 records each. Is there any way I can display the first set of 250 records, and then, when the user scrolls, the second set of 250 records? Please suggest.

What you are looking for is pagination.
You can achieve your objective by querying for a fixed size and setting the from parameter. Since you want to display results in batches of 250, you can set size = 250 and, with each consecutive query, increment the value of from by 250.
GET /_search?size=250 ---- return first 250 results
GET /_search?size=250&from=250 ---- next 250 results
GET /_search?size=250&from=500 ---- next 250 results
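With the Python client, the same from/size pagination could look like this (a minimal sketch; the index name is a placeholder, and from_ maps to the from parameter):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# second page: results 250-499 of the hypothetical index
page2 = es.search(index='yourIndex', size=250, from_=250)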
In contrast, scan & scroll lets you retrieve a large set of results with a single search and is ideally meant for operations like re-indexing data into a new index. Using it for displaying search results in real time is not recommended.
To explain scan & scroll briefly: it scans the index for the query provided with the scan request and returns a scroll_id. This scroll_id can be passed to the next scroll request to return the next batch of results.
Consider the following example (client setup added; index/type names are placeholders):
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Initialize the scroll
page = es.search(
    index='yourIndex',
    doc_type='yourType',
    scroll='2m',
    search_type='scan',
    size=1000,
    body={
        # Your query's body
    }
)
sid = page['_scroll_id']
scroll_size = page['hits']['total']

# Start scrolling
while scroll_size > 0:
    print("Scrolling...")
    page = es.scroll(scroll_id=sid, scroll='2m')
    # Update the scroll ID
    sid = page['_scroll_id']
    # Get the number of results returned by the last scroll
    scroll_size = len(page['hits']['hits'])
    print("scroll size: " + str(scroll_size))
    # Do something with the obtained page
In the above example, the following events happen:
The scroller is initialized. This returns the first batch of results along with the scroll_id.
For each subsequent scroll request, the updated scroll_id (received in the previous scroll response) is sent, and the next batch of results is returned.
The scroll time is essentially the time for which the search context is kept alive. If the next scroll request is not sent within the set timeframe, the search context is lost and results will not be returned. This is why it should not be used for real-time result display on indices with a huge number of docs.

You are misunderstanding the purpose of the scroll property. It does not mean that Elasticsearch will fetch the next page of data after 30 seconds. When you make the first scroll request, you need to specify when the scroll context should be closed; the scroll parameter tells Elasticsearch to close the scroll context after 30 seconds.
After the first scroll request you will get back a scroll_id parameter in the response. For the next pages, you need to pass that value to get the next page of the scroll response. If you do not make the next scroll request within 30 seconds, the scroll context will be closed and you will not be able to get the next pages for that scroll request.
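For example, with the Python client (a minimal sketch; the index name and query are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# the first request opens a scroll context that stays alive for 30 seconds
resp = es.search(index='yourIndex', scroll='30s', size=250,
                 body={'query': {'match_all': {}}})
sid = resp['_scroll_id']
# each subsequent request must arrive within 30s; passing scroll='30s' renews the context
next_page = es.scroll(scroll_id=sid, scroll='30s')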

What you described as an example use case is actually search-result pagination, which is available for any search query and is limited to 10k results. Scroll requests are needed for cases where you need to go over that 10k limit; with a scroll query you can fetch even the entire collection of documents.
Probably the source of confusion here is that the term scroll is ambiguous: it is the type of the query, and it is also the name of a parameter of such a query (as mentioned in other comments, it is the time ES will keep waiting for you to fetch the next chunk of the scroll).
Scroll queries are heavy and should be avoided unless absolutely necessary. In fact, the docs say:
Scrolling is not intended for real time user requests, but rather for processing large amounts of data, ...
Now regarding your other question:
In Elasticsearch, is it possible to call the search API every time the user scrolls through the result set?
Yes, even several parallel scroll requests are possible:
Each scroll is independent and can be processed in parallel like any scroll request.
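The parallelism comes from sliced scrolling: each slice is an independent scroll over a disjoint part of the result set. A minimal sketch with the Python client (index name and slice count are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# slice 0 of 2; a second worker would request {'id': 1, 'max': 2}
resp = es.search(index='yourIndex', scroll='2m', size=1000,
                 body={'slice': {'id': 0, 'max': 2},
                       'query': {'match_all': {}}})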

The documentation of the Scroll API at elastic explains this behaviour as well.
The result size of 10k is a default value and can be overridden at runtime if necessary:
PUT /my_index/_settings
{ "index" : { "max_result_window" : 500000 } }
The lifetime of the scroll id is defined in each scroll request with the parameter "scroll", e.g.
..
"scroll" : "5m"
..

In recent versions of Elasticsearch, you'll use search_after, typically together with a point in time (PIT). The keep_alive you set there, much like the timeout of a scroll, only needs to be the time required for you to process one page.
That's because Elasticsearch keeps your "search context" alive for that amount of time and then removes it. Also, Elasticsearch won't fetch the next page for you automatically; you have to do that yourself by sending requests with the ID from the last response.
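A minimal sketch with the Python client (index name, sort field, and page size are placeholders; assumes a client and cluster recent enough to support point-in-time):

from elasticsearch import Elasticsearch

es = Elasticsearch()

pit = es.open_point_in_time(index='yourIndex', keep_alive='1m')
resp = es.search(body={
    'size': 250,
    'query': {'match_all': {}},
    'pit': {'id': pit['id'], 'keep_alive': '1m'},
    'sort': [{'timestamp': 'asc'}],  # hypothetical sort field
})
last_sort = resp['hits']['hits'][-1]['sort']
# next page: send the same request again, adding 'search_after': last_sort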

It is wise to use the scroll API, as one cannot get more than 10k records at a time in Elasticsearch.

Related

How to perform a "from" in Elasticsearch scroll context?

I have a large dataset to query and display on a website as an array.
I made a pagination system with a scroll, but I can only display a maximum of 100 items at a time, so I'm facing an issue when I want to display the data of page 200 and beyond, because I have to scroll all the way to it and that takes too long.
I have checked other parts of my code and didn't find any other perf issue; it's just the scroll queries that make my API call too long. I tried raising the request size from 100 to 10000, but it doesn't change anything.
I don't think sliced scroll can be a solution, or else I didn't understand the functionality.
I'm desperately searching for a way to skip the scroll queries that come before the data I'm looking for, even if it's not a precise method.
Hoping someone has a solution, or at least a clue.
Edit:
More details about what I'm trying to achieve.
I log some of my users' actions, like calls, in Elasticsearch indices. They perform millions of actions per month, so Elasticsearch seems like a good option to store them, knowing that I don't have to update them after they are stored.
I'm creating a page where my users can search for actions they've performed, but they're building the "query" themselves. I mean, they can select the period and many other parameters, order by many parameters, etc. The number of results can be 1 or 100,000 items, but I can't show 100,000 items on my page for UI reasons, so I have to manage pagination and send only part of the result to the page.
For now I do this with a scroll query with a size of 1000, and I scroll until I reach the current page of my pagination. I tried varying the size, but it's not really conclusive because I can't know the number of results before the query is made.
And the deeper my users go into the pagination, the longer the query takes.
I could increase index.max_result_window to an unreachable number (but I don't know what that implies), make a simple query with a from, and keep a second scroll query for the export case, but I wonder: is there a way to skip some steps in a scroll when I know I'm going to take 100 items after the 1,000,000th one?
Edit: I looked at how Google designs its pagination and noticed that if you want to go deep into the search results, you can only do it step by step. You can't jump directly to the 500th page.
This is how I did mine: I just redesigned my pagination to do the same as Google and force my users to use more precise filters to get fewer results. Thank you @Val for getting me to ask the right questions :)

Displaying data dynamically from SQL database with Golang, JSON, and JavaScript

I have a Golang server that fetches all the rows from a database table.
SELECT * FROM people;
The fetched data is 'marshaled' into JSON:
json.NewEncoder(w).Encode(people)
JavaScript can fetch the rows through its Fetch API.
Now let's say the database table has 10,000 rows but I only want to display as many rows as fit on the screen.
As I scroll the page I'd like more rows to be fetched dynamically from the database.
Does the client need to send data to the server telling the server to fetch the JSON again with more data?
I would be grateful for any suggestion. Thank you!
Assuming what you're looking for is pagination, then the answer is quite simple, but it requires changes both on the client and the server side:
Getting the data: You'll want the client-side to tell the server how big the batches of data should be (typically 10, 20, 30, 40, or 50 results per call, with a default value of 10).
The second parameter you'll want from the client is to indicate how many results the client has already loaded on their end.
With these two values, you'll be able to enrich the query to include a limit and offset value. The default offset being 0, and default limit being 10.
That means the first query will look something like:
SELECT * FROM people LIMIT 10 OFFSET 0;
This is often shortened to LIMIT 10; or LIMIT 0, 10;.
Now if the client scrolls down, to load the next 10 records, you'll want them to perform an AJAX call providing you the batch size, and the offset value (how many records are already displayed), then just plug in these values, and you'll get:
SELECT * FROM people LIMIT 10 OFFSET 10;
This query tells the DB to skip the first 10 results, and return the next 10.
If you're using actual pages, another common way to handle this is to have the client provide the page size value, and the page number. For example: I want to see 20 people per page, and I want to jump directly from page 1 to page 5, the parameters passed to the server would be something like: page_size=20&page=5
That means that I need a query that skips the first 80 records (20 times 4 pages), a trivial computation:
offset := pageNr * pageSize - pageSize // or (pageNr - 1) * pageSize
For the query to be:
SELECT * FROM people LIMIT 20 OFFSET 80;
Some general tips:
Avoid SELECT * as much as possible; always be explicit about the fields you select. Adding fields to a table is common, and using SELECT * can result in exposing data you don't want people to see, or in your code breaking because it can't handle new fields.
Specify the order of your data. Have a created_at field in your table and sort by that value, or sort by name or whatever. If you don't sort the results in a particular way, how can you guarantee that the first 10 results will always be the same? And if you can't guarantee that, why wouldn't it be possible for you to skip some records and display others twice?
Create a People type server-side that represents the records in the DB, if you haven't already.
It's been quite a number of years since I've done any front-end work, but to know when you need to load more records, you'll want to write some JS that handles the scroll event. There are a number of plugins out there already, and my JS is most likely outdated, but here's an attempt (untested, and to be treated as pseudo-code):
window.addEventListener('scroll', function () {
    // scroll reached the bottom of the page
    if ((window.innerHeight + window.scrollY) >= document.body.offsetHeight) {
        var loaded = document.querySelectorAll('.person_item');
        var last = loaded[loaded.length - 1]; // last person already loaded
        var offset = loaded.length;           // records already displayed
        var batchSize = 10;
        // make an AJAX call with batchSize + offset,
        // then append the new elements after `last`
    }
});

Infinite/paginated scrolling with caching

I have a requirement where I need to display a long table. It doesn't have to be displayed all at once, so AJAX-loading it is (load the first 50 records, then get another 50 rows every time the user scrolls to/past the tenth row from the end).
But I'm not sure which of the two, pagination or infinite scrolling, is better. I'd like the user to be able to return to the last scrolled-to point when coming back to the page (through the Back button, definitely; if I can do that however the user visits the page, even better!) with the previous rows visible as well. At the same time, for performance, I want to keep the number of AJAX calls as low as I can.
Any thoughts?
To implement such a scenario, first consume an API that takes the page number and the number of records as request params.
For example: 'www.abc.com/v1/tableData?pageId=1&noOfRecords=50'
Then you will get the first 50 records. The response should also give you the total number of records available in the database after the first API call.
When you scroll down, increase the pageId by 1.
For example: 'www.abc.com/v1/tableData?pageId=2&noOfRecords=50'
In the same way, keep increasing the pageId until the number of records you have fetched so far equals the total number of records reported by the API.
That is how you can implement it.
Talking about performance, it's up to you whether you use pagination or scrolling; it does not matter, since you are restricting the number of records displayed.
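The fetch-until-done logic, independent of what triggers each call, could look like this in Python (the endpoint and field names follow the example above and are hypothetical):

import requests

page_id, fetched, total = 1, 0, None
while total is None or fetched < total:
    resp = requests.get('https://www.abc.com/v1/tableData',
                        params={'pageId': page_id, 'noOfRecords': 50}).json()
    total = resp['totalRecords']      # hypothetical field name
    fetched += len(resp['records'])   # hypothetical field name
    page_id += 1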

Smart pagination algorithm that works with local data cache

This is a problem I have been thinking about for a long time but I haven't written any code yet because I first want to solve some general problems I am struggling with. This is the main one.
Background
A single page web application makes requests for data to some remote API (which is under our control). It then stores this data in a local cache and serves pages from there. Ideally, the app remains fully functional when offline, including the ability to create new objects.
Constraints
Assume a server-side database of products containing ±50,000 products (50 MB)
Assume no db type, we interact with it via REST/GraphQL interface
Assume a single product record is < 1kB
Assume a max payload for a resultset of 256kB
Assume max 5MB storage on the client
Assume search result sets ranging between 0 ... 5000 items per search
Challenge
The challenge is to define a stateless but (network-)efficient way to fetch pages from a result set, so that it is deterministic which results we will get.
Example
In traditional paging, when getting the next 100 results for some query using this url:
https://example.com/products?category=shoes&firstResult=100&pageSize=100
the search result may look like this:
{
    "totalResults": 2458,
    "firstResult": 100,
    "pageSize": 100,
    "results": [
        {"some": "item"},
        {"some": "other item"},
        // 98 more ...
    ]
}
The problem with this is that there is no way, based on this information, to get exactly the objects that are on a certain page. By the time we request the next page, the result set may have changed (due to changes in the DB), influencing which items are part of the result set. Even a small change can have a big impact: one item removed from the DB that happened to be on page 0 of the result set will change the results we get when requesting all subsequent pages.
Goal
I am looking for a mechanism that makes the definition of the result set independent of future database changes, so that if someone was looking for shoes and got a result set of 2458 items, he could reliably fetch all pages of that result set even if it was influenced by later changes in the DB (for this purpose, I plan to not really delete items, but to set a removed flag on them).
Ideas so far
I have seen a solution where the result set included a "pages" property, which was an array with the first and last id of the items in that page. Assuming your IDs keep going up in number and you don't really delete items from the DB ever, the number of items between two IDs is constant. Meaning the app could get all items between those two IDs and always get the exact same items back. The problem with this solution is that it only works if the list is sorted in ID order... I need custom sorting options.
The only way I have come up with so far is to just send a list of all IDs in the result set... That way, pages can be fetched by doing a SELECT * FROM products WHERE id IN (3,4,6,9,...)... but this feels rather inelegant...
Anyway, I hope this is not too broad or theoretical. I have a web-based DB, just no good idea of how to do paging with it. I am looking for answers that point me in a direction to learn, not full solutions.
Versioning the DB is the answer for result-set consistency.
Each record has a primary id, a modification counter (version number), and a timestamp of modification/creation. Instead of modifying record r, you add a new record with the same id, version number + 1, and sysdate as the modification time.
In the fetch response you add the DB request_time (do not use a client timestamp, due to possible time differences between client and server). The first page is served normally, but you return sysdate as request_time. Other pages are served differently: you add a condition like modification_time <= request_time for each versioned table.
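Under this scheme, a page query could look like the sketch below (table and column names are hypothetical; request_time is the value returned with the first page):

# latest version of each product as of request_time, then paginate
page_sql = """
SELECT p.*
FROM products p
JOIN (
    SELECT id, MAX(version) AS version
    FROM products
    WHERE modification_time <= %(request_time)s
    GROUP BY id
) latest ON latest.id = p.id AND latest.version = p.version
ORDER BY p.name
LIMIT %(page_size)s OFFSET %(first_result)s
"""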
You can cache the result set of IDs on the server side when a query arrives for the first time and return a unique ID to the frontend. This unique ID corresponds to the result set for that query. The frontend can then request something like next_page with the unique ID it got the first time it made the query. You should still go ahead with your approach of changing the DELETE operation to a removed flag, because it makes sure that no entry from the result set gets deleted. You can discard the result set of the query from the cache when the frontend reaches the end of the result set, or you can set a time limit on the lifetime of the cache entry.
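A minimal sketch of such a server-side cache (an in-memory dict here; all names are hypothetical):

import uuid

result_set_cache = {}  # result_set_id -> ordered list of matching product IDs

def start_query(matching_ids):
    # called when a query arrives for the first time
    result_set_id = str(uuid.uuid4())
    result_set_cache[result_set_id] = matching_ids
    return result_set_id

def next_page(result_set_id, first_result, page_size):
    ids = result_set_cache[result_set_id]
    # fetch these records by primary key, e.g. WHERE id IN (...)
    return ids[first_result:first_result + page_size]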

KendoUI filters and paging functionality - How they work with large JSON data

I have 100,000 records right now (this will grow in the future). I have a JSON API call (a remote URL, although on the same server) to get those records. If I use KendoUI with paging turned on (say 50 per page), will the KendoUI DataSource fetch all those records into the client and apply paging there? Or is the page size something I need to pass to the server so that I only get the data needed for display? If I need to pass it, do I have to write custom data source query methods?
The same question goes for using the filter input boxes in the toolbar within KendoUI.
There are two (efficient) ways of loading that amount of data:
Setting serverPaging to true in the DataSource definition.
Using serverPaging plus (as @bobosov534 and @gitsitgo suggest) virtual scrolling.
In both cases the server receives two parameters: top, indicating the number of records to retrieve (what you have defined as pageSize), and skip, the number of records to ignore (no skip means the first top records).
The difference is that with the first you see a pagination bar at the bottom of the grid, while with the second you see additional records as you scroll down.
In DataSource.serverPaging you will find detailed information on the fields sent to the server for managing pagination.
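On the server side, the handler only needs to honour those two parameters. A minimal sketch (a hypothetical Flask endpoint; fetch_records is a stand-in for a real data-access layer):

from flask import Flask, jsonify, request

app = Flask(__name__)

def fetch_records(limit, offset):
    # stand-in for a real DB query, e.g. SELECT ... LIMIT %s OFFSET %s
    data = [{'id': i} for i in range(1000)]
    return data[offset:offset + limit], len(data)

@app.route('/records')
def records():
    top = int(request.args.get('top', 50))    # page size
    skip = int(request.args.get('skip', 0))   # records to ignore
    rows, total = fetch_records(limit=top, offset=skip)
    return jsonify({'data': rows, 'total': total})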
