Youtube API returns 1,000,000 results everytime , its an estimated result and not the exact one. How to implement pagination for this? - youtube-data-api

I would like to add pagination to the youtube API results.
The results returned are 1,000,000 results always.
I have read about this number of results returned, It seems its an estimated number and not the exact count.
In this scenario how to implement pagination.

pageInfo.totalResults
integer
The total number of results in the result set.Please note that the value is an approximation and may not represent an exact value. In addition, the maximum value is 1,000,000.
Source: Search: list
Implementation: Pagination will help you to understand how pagination works with YouTube Data API v3. The 1,000,000 value isn't a limit, however there may be other limitations. So let us know if you are still stuck.

Related

Elasticsearch Track total hits alternative with approximation

Based on this article - link there are some serious performance implications with having track_total_hits property set to true.
We currently use it to get the number of documents matching after users search. Then user can use pagination to scroll through the results. The number of documents for such a search usually ranges from 10k - 5M.
Example of a user work flow:
User performs a search which matches 150.000 documents
We show him the first 200 results which he can scroll through but we also show him the total number of documents found in the search.
Since we always show the number of document searches and often those numbers can be quite high we need some kind of a way to get that count. I'm not sure but if we almost always perform paginated searches I would assume a lot of the things would be in memory ? Maybe then this actually effects us less then how it's shown in the provided article?
Some kind of an approximation and not an exact count would be ok for us if it would improve performance.
Is there such an option in Elasticsearch where we can get approximated count on search requests ?
There is no option to get an approximate count, but you may want to consider assigning track_total_hits a lower bound instead of true , which is a good compromise from a performance standpoint ( https://www.elastic.co/guide/en/elasticsearch/reference/master/search-your-data.html#track-total-hits)
That way, you can show users that there are at least k results - but there could be more.
Also, try using search_after (if you are not using it already) for pagination.

How to implement Elasticsearch pagination with large dataset

Environment
.Net 5
Elasticsearch.Net.Aws 7.1.0
Problem
Even with pagination, Elasticsearch's query API does not support more than 10_000 records by default. I.e. if the sum of from and size > 10_000 the API throws an error.
Potential solutions
Increase size
I can increase the index's max_result_window as described here. However I am expecting a large dataset in production - probably less than 10_000_000 records at one time, but for obvious reasons I don't believe that simply increasing the window size is a good idea. My use-case does not require over-the-top performance, but it has to be reasonable for both the end-user and the AWS bill.
What do you think? What leeway do I have regarding to max_result_window setting?
Track total hits
I've read about track_total_hits parameter - It only returns the correct amount of total hits on each request, but still does not allow records after the 10_000th to be fetched
Scroll API
I've read about the Scroll-API - it's being deprecated currently, so I'd like to avoid it.
Search after
I've read about the search_after parameter - the concept is to define a consistent sort criteria and call exact query for each page, the only difference being is the value of search_after, which for every subsequent search should be the sort value returned of the last hit in the previous search.
As far as I can tell this is the recommended solution, but while it may work for large page sizes, I'm having difficulty understanding how it solves the basic paging case:
Lets say we have 20_000 records total, page size is 10, hense 2_000 pages. How can I return the last page, containing records 19_990-20_000? Unless I misunderstand, search_after does not help, because I've skipped pages and I don't have the sort value of record number 19_989.
Further more, per the docs:
If provided, the from argument must be 0 (default) or -1
This means that I cannot use a combination of both:
Perform one search with "from": "990"
Use the last record's sort value to perform a second search, again using a "from": "990"
Return the results of the second search.
Beyond that I cannot figure out another way to use it. Could you tell me where I'm getting it wrong?

Can you apply a limit on promQL query?

For example:
Avg by (server) (HttpStatusCodes{category = 'Api.ResponseStatus'}) limit 10
Is this valid in promQl? I can not find anything about it in the documentation. Thanks
The provided query is valid MetricsQL query for VictoriaMetrics, but unfortunately it doesn't work in the original PromQL.
Prometheus provides topk
and bottomk operators, which can be used for limiting the number of returned time series. Unfortunately, these operators limit the number of returned time series on a per-point basis (considering points on the graph). This means that the total number of returned time series may exceed the requested limit. MetricsQL solve this issue with a family of topk_* and bottomk_* functions. See MetricsQL docs for details.

Suggestion for limiting fuzzy search suggestion results

I've implemented a fuzzy search algorithm based on a N closest neighbors query for given search terms. Each query returns a pre-set number of raw results, in my case a max. of 200 hits / query, sorted descending by score, highest score first.
The raw search already produces good results, but in some rather rare cases not good enough so I've added another post-processing layer or better said another metric to the raw search results based on Levenshtein-Damerau algorithm that measures the word / phrase distance between query term(s) and raw results. The lower the resulting score the better, 0.0 would be an exact match.
Using the Levenshtein-Damerau post-processing algorithm I sort the results ascending, from the lowest to the highest.
The quality of matches is amazingly good and all relevant hits are ranked to the top. Still I have the bulk of 200 hits from the core search and I am looking for a smart way to limit the final result set down to a maximum of 10-20 hits. I could just add a static limit as it is basically done. But I wonder if there is a better way to do this based on the individual metrics I get with each search result set.
I have the following result metrics:
The result score of the fuzzy core search search, a value of type float/double. The higher the better
The Levenshtein-Damerau post processing weight, another value of type float/double. The lower the better
And finally each result set knows its minimum and maximum score limits. Using the Levenshtein-Damerau post processing algorithm on the raw results I take the min/max values from there.
The only ideas I have is to take a sub-range out of the result set, something like the top 20% results which is simple to achieve. More interesting would be to analyse the top result scores/metrics and find some indication where it gets too fuzzy. I could use the metrics I gather inside my Levenshtein-Damerau algorithm layer, respectively the word- and phrase-distance parameters - these values along with 2 other parameters make up the final distance score. For example if the word- and/or phrase distance exceed a certain threshold, then skip the result. This way is a bit more complicated but possible.
Well, I wonder if there are more opportunities I could use and just not obviously see. Once again, I would like to omit a static limit and make it more flexible on each individual result set.
Any hints or further ideas are greatly appreciated.

News Search API V5 paging results with offset and count

From the documentation here: https://msdn.microsoft.com/en-us/library/dn760793.aspx
It says:
totalEstimatedMatches:
The estimated number of news articles that are relevant to the query. Use this number along with the count and offset query parameters to page the results.
However, there are some serious issues.
1.The returned number of results is ALWAYS less than the requested number in the "count" variable. For example, setting a count=100 results in only 75 results.
2.What's more, even skipping the difference and sending another query to the API with an offset (in this example, offset=100), the API returns a new totalEstimatedMatches!! (first query was 70k results, second time was 138)
What is going on here? How do we fully get the totalEstimatedMatches returned from the first query? Or is that a bogus inflated number?
We did some investigation on this issue. Basically, search engine index does not support an accurate estimation of total match, the same behavior could be observed on Bing.com. the 217M results in the screen shot provided in the image tab above which is not very accurate either.
And, news has backend mechanism that any query output should be less than 100. So the total estimated matches number is not used properly in this example. Normally we do not allow user to download too many results of each query in news. The number of documents you could get from certain query actually capped at a certain number, in most of the case it is around 100.

Resources