Fetch data less than a score in Elasticsearch

I am trying to build an Instagram-like Explore page using Elasticsearch. The content is scored based on time as well as number of likes. Since the like counts are frequently updated, pagination is difficult with From/Size and Search After. Suppose I fetched the first 10 posts using From 0, Size 10. By the time I try to fetch the second page, another 10 posts have scored more likes, so the posts I fetched on the first page now sit at positions 10 to 20. This will create a lot of duplicates in my explore page.
I am more concerned about avoiding duplicates in pagination than about missing some content, because if the user refreshes the explore page, the top content will be displayed again. The best way I can think of is to fetch all posts below a particular score. Is there anything like a max_score API? If not, how can I solve this problem?
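There is no max_score parameter on a search request (min_score exists, but it only cuts off the low end). One common workaround is to sort on _score with a unique tiebreaker field and page with search_after, so the cursor is the last score seen rather than an absolute offset. A minimal sketch, assuming a posts index with a likes field and a unique id field (all names here are placeholders):
GET /posts/_search
{
  "size": 10,
  "query": {
    "function_score": {
      "field_value_factor": { "field": "likes" }
    }
  },
  "sort": [
    { "_score": "desc" },
    { "id": "asc" }
  ],
  "search_after": [14.2, "post_9"]
}
The search_after values are the sort values of the last hit of the previous page. Note that a post whose score changes enough to cross the cursor can still be skipped or repeated; this only removes the position-shift duplicates that from/size suffers from.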

Related

Pagination with multi match query

I'm trying to figure out how to accomplish pagination with a multi_match query in Elasticsearch.
The scroll and search_after APIs seem like they won't work. scroll isn't meant for real-time user requests, per the documentation. search_after requires a unique field per document and requires you to sort on that field, per the documentation, but when using a multi_match query you're basically sorting by the score.
So, the only thing I've thought of so far is the following:
Send back the last document's id and score, and use the score as the sort field. But this could potentially return duplicate documents if other documents were added between two queries.
If you want to paginate, the first option is to use the from and size parameters in your query. From the documentation:
Pagination of results can be done by using the from and size parameters. The from parameter defines the offset from the first result you want to fetch. The size parameter allows you to configure the maximum amount of hits to be returned.
Though from and size can be set as request parameters, they can also be set within the search body. from defaults to 0, and size defaults to 10.
Note that from + size can not be more than the index.max_result_window index setting which defaults to 10,000. See the Scroll or Search After API for more efficient ways to do deep scrolling.
If you don't need to paginate past 10k results, it's your best choice. The max_result_window setting can be raised, but performance decreases as the selected page number increases.
Of course, if documents are added while your user is paginating, they will appear in the result set and your pagination can be slightly inaccurate.
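For illustration, a from/size request for the third page of a multi_match query might look like this (the index and field names are placeholders):
GET /articles/_search
{
  "from": 20,
  "size": 10,
  "query": {
    "multi_match": {
      "query": "summer trip",
      "fields": ["title", "description"]
    }
  }
}
The from value is simply page_number * page_size, which is why it is cheap to implement but exposed to the shifting-results problem described above.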

Store website URL hit count in elastic search

I want to keep a record of pages requested by users. Additionally, I have to store the count of each page requested. I am currently storing my users' page visits by updating an index in Elasticsearch.
I do this by updating a document similar to:
{
  "userid": "id1234",
  "url": "website.com/url-1",
  "count": 23
}
Here, the count of 23 is the total number of times the URL was requested by the user with id 'id1234'.
To achieve this, I retrieve the document, increment the current count, and push it back again. My question is: is it possible to do this with a single query?
I saw a similar approach using scripts here.
Can we do this without scripts?
Elasticsearch is not well suited to updates. Even when an update like this is possible, under the hood Elasticsearch deletes the old record and then indexes the whole document again.
Probably the closest thing here is the partial update feature. Here is an example from the documentation:
POST /metrics/users/1/_update
{
  "script": "ctx._source.count += 1"
}
But you've already mentioned that in the question (the link to the relevant documentation is there).
And if you do use scripts, the remaining problem is that they are relatively slow.
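As a side note, the same _update endpoint also accepts an upsert document, which removes the need to pre-create the record. A minimal sketch following the path style of the example above (the index, type, and field names are assumptions):
POST /metrics/users/1/_update
{
  "script": {
    "source": "ctx._source.count += 1"
  },
  "upsert": {
    "userid": "id1234",
    "url": "website.com/url-1",
    "count": 1
  }
}
If the document does not exist yet, the upsert body is indexed as-is; otherwise the script increments the existing count, all in a single request.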

Elasticsearch : search for sets of items instead of items

I created a website where I log user actions: visit page, download document, log in, etc. Each action is timestamped, attached to a user, and indexed in Elasticsearch.
I would like to recognize predefined patterns in those actions, e.g.:
find users who visited this page, this other page, and downloaded 2 documents in the last 3 weeks
find users who logged in and visited at least 5 pages in the same day
The problem is that I have always used ES to find items that match criteria, but never to find sets of items.
How would you start to solve this problem?
Thank you for your help.
For the second query I would suggest aggregations (like SQL GROUP BY): count the number of page visits aggregated per user and day.
Then add conditions on these aggregated results (like SQL HAVING).
To filter on aggregation results I found the bucket_selector pipeline aggregation (not tested), sketched below:
https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-pipeline-bucket-selector-aggregation.html
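A minimal sketch of that idea, assuming each action document has userid, action, and timestamp fields (all names are assumptions, and calendar_interval requires a recent Elasticsearch version):
GET /actions/_search
{
  "size": 0,
  "query": {
    "term": { "action": "visit_page" }
  },
  "aggs": {
    "per_user": {
      "terms": { "field": "userid" },
      "aggs": {
        "per_day": {
          "date_histogram": {
            "field": "timestamp",
            "calendar_interval": "day"
          },
          "aggs": {
            "at_least_5_visits": {
              "bucket_selector": {
                "buckets_path": { "visits": "_count" },
                "script": "params.visits >= 5"
              }
            }
          }
        }
      }
    }
  }
}
The bucket_selector drops every user/day bucket with fewer than 5 page visits, which plays the role of the HAVING clause.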
Hope it helps

Elasticsearch: group into buckets, reduce to one document per bucket, group these documents

I'm looking for a way to compute the bounce rate of webpages with Elasticsearch.
We collect data in the following simplified structure:
{"id":"1", "timestamp"="2017-01-25:15:23", "sessionid"="s1", "page"="index"}
{"id":"2", "timestamp"="2017-01-25:15:24", "sessionid"="s1", "page"="checkout"}
{"id":"3", "timestamp"="2017-01-25:15:25", "sessionid"="s1", "page"="confirm"}
{"id":"4", "timestamp"="2017-01-25:15:26", "sessionid"="s2", "page"="index"}
{"id":"5", "timestamp"="2017-01-25:15:27", "sessionid"="s2", "page"="checkout"}
{"id":"6", "timestamp"="2017-01-25:15:26", "sessionid"="s3", "page"="product_a"}
{"id":"7", "timestamp"="2017-01-25:15:28", "sessionid"="s3", "page"="checkout"}
For this sample the result of the analysis should be:
2/3 of the users get lost at the checkout page.
1/3 of the users get lost at the confirm page.
More formally, I'm looking for a generic approach to implement the following algorithm in an Elasticsearch query:
group documents by a field
sort each group (bucket) by a second field and reduce to the topmost document
group all these remaining documents by a third field
sort groups by number of documents
My first attempt was to solve this with a terms aggregation followed by a top_hits aggregation, and finally use a terms_pipeline aggregation to group the pages (simplified aggregation structure):
aggs
  terms
    field: sessionid
  aggs
    top_hits
      sort: timestamp desc
      size: 1
    terms_pipeline
      bucket_path: terms>top_hits
      field: page
... but unfortunately there is no such thing as a terms_pipeline aggregation. My bad.
Any ideas for an alternative approach?
Maybe I misunderstood something, but if you want to know where your users are bouncing, since all pages are in a sequence, you could simply have a terms aggregation on the page field (to know which pages were visited) and a cardinality one on the sessionid field (to know how many distinct sessions you have). In this case, cardinality(sessionid) would yield 3.
Then again, since all pages are in a sequence, I think you don't really need to know what happened within a given session.
In your example, from the terms(page) aggregation, you'd know that 3 users landed on the checkout page but only one went on to the confirm one. Using the cardinality of the sessions, this implicitly means that 2 users (3 total sessions - 1 confirm page hit) bounced on the checkout page.
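A minimal sketch of that combination, assuming the documents above live in a pageviews index with page and sessionid mapped as keyword fields (the index name and mapping are assumptions):
GET /pageviews/_search
{
  "size": 0,
  "aggs": {
    "visited_pages": {
      "terms": { "field": "page" }
    },
    "unique_sessions": {
      "cardinality": { "field": "sessionid" }
    }
  }
}
On the sample data, visited_pages returns checkout: 3, index: 2, product_a: 1, confirm: 1, and unique_sessions returns 3, from which the per-page bounce counts can be derived.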

Google Search Appliance: Need to understand how sorting works

I want to understand how sorting works in GSA in the following situations:
1) I execute the query "Jayesh Bhoyar Autobiography", receive 2000 records, and in the query I have also specified sort by date. My understanding is that the GSA will pick the top 1000 records from that list based on relevance and then sort them by date?
However, I want the GSA to return only the top 100 results for "Jayesh Bhoyar Autobiography" by relevance and then sort those top 100 records by date. Is this possible?
If yes, how?
Regards,
Jayesh Bhoyar
The GSA can't do this by itself. If you want to do this, you can easily build a simple application that fetches the first hundred results sorted by relevance, then sorts those results by date. Use the simple XML API to fetch unformatted results from the GSA.
Search Protocol Reference
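For reference, a raw request against the GSA's XML interface might look like the following (the hostname, collection, and frontend values are placeholders; see the Search Protocol Reference for the parameters available on your appliance):
http://gsa.example.com/search?q=Jayesh+Bhoyar+Autobiography&site=default_collection&client=default_frontend&output=xml_no_dtd&num=100
Your application would then parse the result entries out of the XML response and re-sort them by their date values before rendering the page.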
