End of search results using search_after parameter from Elastic Search API - elasticsearch

For a given date range in the query, and using the search_after parameter, I am able to extract the relevant results successfully. How do I figure out that I have reached the end of the search results for that date range, so that I don't have to keep querying with search_after?

There is a pretty cool "trick" that does not involve any additional queries or knowledge of the total number of results:
Say you have a page size of 20. Instead of asking elasticsearch for 20 results, ask it for 21.
If you get 21 results back, use only the first 20 of them. You now know that the next query will return at least one more result (provided you use the sort values of the 20th result for the search_after parameter, not those of the 21st).
If you get 20 results or fewer, there will be no additional results.
This GitHub issue gives more detail on why elasticsearch does not have this feature out of the box: https://github.com/elastic/elasticsearch/issues/22364
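The trick above can be sketched in Python roughly as follows. This is only an illustration: `fetch_page` and `fake_search` are hypothetical names, and `fake_search` is an in-memory stand-in for a real Elasticsearch request that returns hits carrying a `sort` array.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Hit = Dict[str, Any]


def fetch_page(search: Callable[[int, Optional[list]], List[Hit]],
               page_size: int,
               search_after: Optional[list] = None) -> Tuple[List[Hit], Optional[list]]:
    """Return (hits for this page, search_after value for the next page or None)."""
    hits = search(page_size + 1, search_after)  # ask for one hit more than needed
    if len(hits) > page_size:
        page = hits[:page_size]
        # Use the sort values of the last hit on the page, not of the extra hit.
        return page, page[-1]["sort"]
    return hits, None  # page_size hits or fewer: nothing left to fetch


def fake_search(size: int, search_after: Optional[list]) -> List[Hit]:
    """Hypothetical stand-in for Elasticsearch: 45 documents sorted by an integer."""
    start = 0 if search_after is None else search_after[0] + 1
    return [{"_id": str(i), "sort": [i]} for i in range(start, min(start + size, 45))]


pages = []
cursor = None
while True:
    page, cursor = fetch_page(fake_search, 20, cursor)
    pages.append(page)
    if cursor is None:
        break

print([len(p) for p in pages])  # [20, 20, 5]
```

The loop stops after the short page without sending an extra request that would come back empty.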

You can either keep querying until a request returns zero results, or use the total that the response returns: keep track of how many results you have already retrieved and stop searching once you have reached the total. (I do a combination of both.)
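A minimal sketch of that combination, assuming the total reported with each response is exact. `collect_all` and `fake_search` are hypothetical names; `fake_search` stands in for a real request and returns the total together with the hits.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Hit = Dict[str, Any]
SearchFn = Callable[[int, Optional[list]], Tuple[int, List[Hit]]]


def collect_all(search: SearchFn, page_size: int) -> List[Hit]:
    """Page with search_after until a page is empty or the total is reached."""
    collected: List[Hit] = []
    cursor = None
    while True:
        total, hits = search(page_size, cursor)
        if not hits:                   # stop condition 1: zero results
            break
        collected.extend(hits)
        if len(collected) >= total:    # stop condition 2: total reached
            break
        cursor = hits[-1]["sort"]      # continue after the last hit seen
    return collected


calls = []


def fake_search(size: int, search_after: Optional[list]) -> Tuple[int, List[Hit]]:
    """Hypothetical stand-in for Elasticsearch: 45 documents sorted by an integer."""
    calls.append(search_after)
    start = 0 if search_after is None else search_after[0] + 1
    hits = [{"_id": str(i), "sort": [i]} for i in range(start, min(start + size, 45))]
    return 45, hits


docs = collect_all(fake_search, 20)
print(len(docs), len(calls))  # 45 3
```

Checking the total saves the final empty request; the zero-results check remains as a fallback in case the total changes while paging.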

Related

Pagination with multi match query

I'm trying to figure out how to accomplish pagination with a multi match query using elasticsearch.
The scroll and search_after APIs seem like they won't work. According to the documentation, scroll isn't meant for real-time user requests, and search_after requires a field with a unique value per document that you sort on; but with a multi-match query you are basically sorting by score.
So, the only thing I've thought of so far is to do the following:
Send back the last document id + score and use the score as the sort field. But this could potentially return duplicate documents if other documents were added between the two queries.
If you want to paginate, the first option is to use the from and size parameters in your query. From the documentation:
Pagination of results can be done by using the from and size
parameters. The from parameter defines the offset from the first
result you want to fetch. The size parameter allows you to configure
the maximum amount of hits to be returned.
Though from and size can be set as request parameters, they can also
be set within the search body. from defaults to 0, and size defaults
to 10.
Note that from + size can not be more than the index.max_result_window
index setting which defaults to 10,000. See the Scroll or Search After
API for more efficient ways to do deep scrolling.
If you don't need to paginate over more than 10k results, this is your best choice. The max_result_window can be modified, but performance degrades as the selected page number increases.
Of course, if documents are added while a user is paginating, they will show up in the results, so the pagination can be slightly inaccurate.
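As a rough illustration of from/size paging for a multi_match query, the request body for a given page could be built like this. `page_params` and `build_request` are hypothetical helper names, the field names are made up, and the limit assumes the default index.max_result_window of 10,000.

```python
from typing import Any, Dict, List

MAX_RESULT_WINDOW = 10_000  # default value of index.max_result_window


def page_params(page: int, size: int = 10,
                max_result_window: int = MAX_RESULT_WINDOW) -> Dict[str, int]:
    """Translate a 1-based page number into from/size, enforcing the window limit."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    offset = (page - 1) * size
    if offset + size > max_result_window:
        raise ValueError("from + size exceeds index.max_result_window")
    return {"from": offset, "size": size}


def build_request(text: str, fields: List[str], page: int, size: int = 10) -> Dict[str, Any]:
    """Search body: a multi_match query plus the paging parameters."""
    body: Dict[str, Any] = {"query": {"multi_match": {"query": text, "fields": fields}}}
    body.update(page_params(page, size))
    return body


print(build_request("quick fox", ["title", "body"], page=3, size=20))
```

Page 3 with a page size of 20 yields from=40 and size=20; a page that would reach past the window raises an error instead of sending a request that would be rejected.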

Elasticsearch: group into buckets, reduce to one document per bucket, group these documents

I'm looking for a way to compute the bounce rate of webpages with Elasticsearch.
We collect data in the following simplified structure
{"id":"1", "timestamp":"2017-01-25:15:23", "sessionid":"s1", "page":"index"}
{"id":"2", "timestamp":"2017-01-25:15:24", "sessionid":"s1", "page":"checkout"}
{"id":"3", "timestamp":"2017-01-25:15:25", "sessionid":"s1", "page":"confirm"}
{"id":"4", "timestamp":"2017-01-25:15:26", "sessionid":"s2", "page":"index"}
{"id":"5", "timestamp":"2017-01-25:15:27", "sessionid":"s2", "page":"checkout"}
{"id":"6", "timestamp":"2017-01-25:15:26", "sessionid":"s3", "page":"product_a"}
{"id":"7", "timestamp":"2017-01-25:15:28", "sessionid":"s3", "page":"checkout"}
For this sample the result of the analysis should be:
2/3 of the users get lost at the checkout page.
1/3 of the users get lost at the confirm page
More formally, I'm looking for a generic approach to implementing the following algorithm in an Elasticsearch query:
1. group documents by a field
2. sort each group (bucket) by a second field and reduce to the topmost document
3. group all these remaining documents by a third field
4. sort groups by number of documents
My first attempt was to solve this with a terms aggregation followed by a top_hits aggregation and finally use a terms_pipeline aggregation to group the pages.
(simplified aggregation structure)
aggs
  terms
    field: sessionid
    aggs
      top_hits
        sort: timestamp desc
        size: 1
  terms_pipeline
    bucket_path: terms>top_hits
    field: page
... but unfortunately there is no such thing as a terms_pipeline aggregation. My bad.
Any ideas for an alternative approach?
Maybe I misunderstood something, but if you want to know where your users are bouncing, then since all pages are in a sequence you could simply use a terms aggregation on the page field (to know which pages were visited) and a cardinality one on the sessionid field (to know how many unique sessions you have). In this case, cardinality(sessionid) would yield 3.
Then again, since all pages are in a sequence, I think you don't really need to know what happened within a given session.
In your example, from the terms(page) aggregation, you'd know that 3 users landed on the checkout page but only one went to the confirm one. Using the cardinality of the sessions, this implicitly means that 2 users (3 total sessions - 1 confirm page hit) bounced on the checkout page.
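To check the numbers on the sample data, here is a small client-side Python sketch. It is not an Elasticsearch query: it runs the four steps from the question in memory (last page per session, then a count per page) and then repeats the arithmetic from this answer. The variable names are only illustrative.

```python
from collections import Counter

events = [
    {"id": "1", "timestamp": "2017-01-25:15:23", "sessionid": "s1", "page": "index"},
    {"id": "2", "timestamp": "2017-01-25:15:24", "sessionid": "s1", "page": "checkout"},
    {"id": "3", "timestamp": "2017-01-25:15:25", "sessionid": "s1", "page": "confirm"},
    {"id": "4", "timestamp": "2017-01-25:15:26", "sessionid": "s2", "page": "index"},
    {"id": "5", "timestamp": "2017-01-25:15:27", "sessionid": "s2", "page": "checkout"},
    {"id": "6", "timestamp": "2017-01-25:15:26", "sessionid": "s3", "page": "product_a"},
    {"id": "7", "timestamp": "2017-01-25:15:28", "sessionid": "s3", "page": "checkout"},
]

# Steps 1 and 2: group by session and keep only the latest page of each session.
# The timestamps share one fixed-width format, so string order equals time order.
last_page = {}
for event in sorted(events, key=lambda e: e["timestamp"]):
    last_page[event["sessionid"]] = event["page"]

# Steps 3 and 4: group the remaining documents by page, largest group first.
exits = Counter(last_page.values()).most_common()
print(exits, "of", len(last_page), "sessions")

# The arithmetic from the answer: terms(page) counts plus cardinality(sessionid).
page_hits = Counter(e["page"] for e in events)
session_count = len({e["sessionid"] for e in events})
bounced_at_checkout = session_count - page_hits["confirm"]
print(bounced_at_checkout)
```

Both routes agree on this sample: 2 of 3 sessions end at checkout and 1 of 3 at confirm.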

In Elasticsearch, how can I get results from each type in an index when the query is limited to 10 results?

I have four types in my index. I am searching for a keyword, and the result is limited to 10. I need to get records from all types. Is that possible?
If you mean getting the first 10 docs per type, I'd use the multisearch API.
See https://www.elastic.co/guide/en/elasticsearch/reference/2.3/search-multi-search.html
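A hedged sketch of what the multi search request body could look like, built in Python. The body is newline-delimited JSON with one header line and one search line per type. The index name, the type names, the helper name `build_msearch_body`, and the choice of a query_string query are all assumptions made for illustration.

```python
import json
from typing import List


def build_msearch_body(index: str, types: List[str], keyword: str, size: int = 10) -> str:
    """Build an _msearch body: one header line and one search line per type."""
    lines = []
    for doc_type in types:
        lines.append(json.dumps({"index": index, "type": doc_type}))  # header line
        lines.append(json.dumps({
            "query": {"query_string": {"query": keyword}},
            "size": size,  # up to `size` hits for this type
        }))
    return "\n".join(lines) + "\n"  # the body must end with a newline


# Hypothetical index and type names.
body = build_msearch_body("myindex", ["type_a", "type_b", "type_c", "type_d"], "keyword")
print(body)
```

The response then contains one entry per search, in the same order, so each type gets its own list of up to 10 hits.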

Google Search Appliance: Need to understand how sorting works

I want to understand how sorting works in the GSA in the situation below:
1) I am executing the query "Jayesh Bhoyar Autobiography" and I receive 2000 records. In the query I have also specified sort by Date. So my understanding is that the GSA will pick the top 1000 records from that list based on relevance and then sort them by date?
However, I want the GSA to return only the top 100 results for "Jayesh Bhoyar Autobiography" by relevance, and then sort those top 100 records by date. Is this possible?
If yes, how?
Regards,
Jayesh Bhoyar
The GSA can't do this by itself. If you want to do this, you can easily build a simple application that fetches the first hundred results sorted by relevance, then sorts those results by date. Use the simple XML API to fetch unformatted results from the GSA.
Search Protocol Reference
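The post-processing step of such an application might look like the Python sketch below. It assumes the results were already fetched from the GSA in relevance order and parsed into dictionaries with a `date` field; the function name `top_n_by_date`, the field names, and the sample data are hypothetical.

```python
from datetime import date
from typing import Any, Dict, List


def top_n_by_date(results_by_relevance: List[Dict[str, Any]], n: int = 100) -> List[Dict[str, Any]]:
    """Keep the n most relevant results, then order those by date, newest first."""
    top = results_by_relevance[:n]  # input is assumed to be in relevance order
    return sorted(top, key=lambda r: r["date"], reverse=True)


# Made-up results in relevance order (most relevant first).
results = [
    {"title": "a", "date": date(2015, 1, 1)},
    {"title": "b", "date": date(2017, 6, 1)},
    {"title": "c", "date": date(2016, 3, 1)},
    {"title": "d", "date": date(2018, 1, 1)},
]
print([r["title"] for r in top_n_by_date(results, n=3)])  # ['b', 'c', 'a']
```

Result "d" is the newest but falls outside the top 3 by relevance, so it is dropped before the date sort.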

Elastic Search limit results

In MySQL I can do something like:
SELECT id FROM table WHERE field = 'foo' LIMIT 5
If the table has 10,000 rows, then this query is way way faster than if I left out the LIMIT part.
In ElasticSearch, I've got the following:
{
  "query": {
    "fuzzy_like_this_field": {
      "body": {
        "like_text": "REALLY LONG (snip) TEXT HERE",
        "max_query_terms": 1,
        "min_similarity": 0.95,
        "ignore_tf": true
      }
    }
  }
}
When I run this search, it takes a few seconds, whereas mysql can return results for the same query in far, far less time.
If I pass in the size parameter (set to 1), it successfully only returns 1 result, but the query itself isn't any faster than if I had set the size to unlimited and returned all the results. I suspect the query is being run in its entirety and only 1 result is being returned after the query is done processing. This means the "size" attribute is useless for my purposes.
Is there any way to have my search stop searching as soon as it finds a single record that matches the fuzzy search, rather than processing every record in the index before returning a response? Am I misunderstanding something more fundamental about this?
Thanks in advance.
You are correct: the query is being run in its entirety. Queries return data sorted by score by default, so your query is going to score every document. The docs state that the fuzzy query isn't going to scale well, so you might want to consider other queries.
A limit filter might give you behavior similar to what you're looking for:
A limit filter limits the number of documents (per shard) to execute on
To replicate MySQL's field = 'foo', try using a term filter. You should use filters when you don't care about scoring; they are faster and cacheable.
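One way to express the term filter as a request body is sketched below in Python. It wraps the filter in a constant_score query, which is an assumption on my part about how to attach the filter; the exact syntax can differ between Elasticsearch versions. The helper name `exact_match_request` is hypothetical, and a term filter only matches the exact indexed token, so the field should not be analyzed.

```python
import json
from typing import Any, Dict


def exact_match_request(field: str, value: str, size: int = 5) -> Dict[str, Any]:
    """Search body roughly equivalent to: WHERE field = value LIMIT size."""
    return {
        "size": size,
        "query": {
            # constant_score skips relevance scoring and only applies the filter
            "constant_score": {
                "filter": {"term": {field: value}}
            }
        },
    }


print(json.dumps(exact_match_request("field", "foo"), indent=2))
```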

Resources