Elasticsearch: search for sets of items instead of items - elasticsearch

I created a website where I log user actions: page visit, document download, login, etc. Each action is timestamped, attached to a user, and indexed in Elasticsearch.
I would like to recognize predefined patterns in those actions, e.g.:
find users who visited this page, this other page and downloaded 2 documents in the last 3 weeks
find users who logged in and visited at least 5 pages in the same day
The problem is that I have always used ES to find items that match criteria, never to find sets of items.
How would you start to solve this problem?
Thank you for your help.

For the second query I would suggest aggregations (like SQL GROUP BY): count the number of page visits aggregated per user and day.
Then add conditions on these aggregated results (like SQL HAVING).
To filter on aggregation results I found this (I have not tested it or tried to understand it):
https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-pipeline-bucket-selector-aggregation.html
Hope it helps.
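The GROUP BY / HAVING idea from this answer could be sketched as a request body like the one below. This is a minimal, untested sketch: the field names `user_id`, `action` and `timestamp`, the action value `page_visit`, and the helper name are assumptions that do not come from the question, and `calendar_interval` assumes Elasticsearch 7 or later (older versions use `interval`).

```python
def build_daily_visits_query(min_visits=5, max_users=1000):
    """Build a search body that keeps only (user, day) buckets with enough page visits.

    Field names and the "page_visit" value are hypothetical; adapt them to your mapping.
    """
    return {
        "size": 0,  # only the aggregations are of interest, not the hits
        "query": {"term": {"action": "page_visit"}},
        "aggs": {
            # GROUP BY user
            "per_user": {
                "terms": {"field": "user_id", "size": max_users},
                "aggs": {
                    # GROUP BY day, inside each user bucket
                    "per_day": {
                        "date_histogram": {
                            "field": "timestamp",
                            "calendar_interval": "day",
                        },
                        "aggs": {
                            # HAVING count >= min_visits: drops the day buckets below the threshold
                            "enough_visits": {
                                "bucket_selector": {
                                    "buckets_path": {"visits": "_count"},
                                    "script": "params.visits >= " + str(int(min_visits)),
                                }
                            }
                        },
                    }
                },
            }
        },
    }
```

Users without any qualifying day would still appear in `per_user`, only with an empty `per_day` bucket list, so the client has to skip them. The "logged in" condition from the question is not covered by this sketch.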

Related

Laravel elastic search display relevant data in top order

This is about ordering Elasticsearch results in a custom order.
I have city IDs (integers) in my Elasticsearch index, and the search should take the user's city selection into account.
For example:
Consider that the ID of Chennai is 1 and the ID of Mumbai is 2.
Suppose we have 10 records for Chennai and 20 records for Mumbai in the index. If the user chooses Chennai, we should display the 10 records belonging to Chennai at the top and then display the remaining items.
If the user chooses Mumbai, we should display the 20 records belonging to Mumbai at the top and then display the remaining items.
I am using the sleimanx2/plastic Laravel package for search. I would appreciate it if anyone could help me achieve this.
Is there any specific reason that you wish to achieve this with Elasticsearch?
The case you describe seems to me like something I would achieve with two queries: one for the promoted results, let's call them, and one that matches everything else, excluding the documents that belong to the first query.
Then I would display each set of results in its respective area.
There might be a way to merge those queries and get your results as buckets that you can later use to build your markup, but honestly I am not sure there is a reason to do it like that.
I hope I have not misunderstood your question.
Best Regards.
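The two-query approach described in this answer might look like the sketch below, written as plain request bodies rather than with the plastic package. The field name `city_id` and the helper name are assumptions, not taken from the question.

```python
def build_city_queries(selected_city_id, size=10):
    """Return (promoted, remaining) search bodies for the selected city.

    `city_id` is a hypothetical field name holding the integer city ID.
    """
    city_filter = {"term": {"city_id": selected_city_id}}
    # First query: only records of the selected city, shown at the top.
    promoted = {"size": size, "query": city_filter}
    # Second query: everything except the records matched by the first query.
    remaining = {"size": size, "query": {"bool": {"must_not": [city_filter]}}}
    return promoted, remaining
```

For Chennai (ID 1), `build_city_queries(1)` gives one body that matches only Chennai records and one that matches every other record; the application would render the first result set above the second.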

Fetch data less than a score in Elasticsearch

I am trying to build an Instagram-like Explore page using Elasticsearch. The content is scored based on time as well as the number of likes. Since the likes are updated frequently, pagination is difficult using From/Size and Search After. Suppose I fetched the first 10 posts using From 0, Size 10. By the time I fetch the second page, another 10 posts have gained more likes and moved ahead. Now the posts I fetched on the first page sit at positions 10 to 20, which creates a lot of duplicates in my Explore page.
I am more concerned about avoiding duplicates in pagination than about missing some content, because if the user refreshes the Explore page, the top content will be displayed again. The best way, I think, is to fetch all posts below a particular score. Is there anything like a max_score API? If not, how can I solve this problem?

Store website URL hit count in elastic search

I want to keep a record of pages requested by users. Additionally, I have to store the count of each page requested. I currently store users' page visits by updating an index in Elasticsearch.
I do this by updating a document similar to
{
  "userid": "id1234",
  "url": "website.com/url-1",
  "count": 23
}
Here, the count of 23 is the total number of times the URL was requested by the user with ID 'id1234'.
To achieve this, I retrieve the document, increment the current count, and push it back. My question is: is it possible to do this with a single query?
I saw a similar approach using scripts here.
Can we do this without scripts?
Elasticsearch is not well suited to updates. Even when an update like this is possible, internally it deletes the old record, then adds the whole document again and reindexes it.
Probably the closest thing here is the partial update feature.
Here is an example from the documentation:
POST /metrics/users/1/_update
{
"script" : "ctx._source.count+=1"
}
But you have already mentioned it in the question (the link to the relevant documentation is available here).
Even if you use scripts, the problem remains that it is relatively slow.
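For reference, a scripted increment can be sent as a single update request whose body also creates the document on the first hit. The sketch below only builds that body; the helper name and the `step` parameter are assumptions, and the endpoint differs between versions (`POST /<index>/_update/<id>` in Elasticsearch 7 and later, versus the typed path shown in the example above).

```python
def build_increment_update(user_id, url, step=1):
    """Build an update body that increments `count`, or creates the document if missing."""
    return {
        "script": {
            "lang": "painless",
            # Passing the step as a parameter lets Elasticsearch reuse the compiled script.
            "source": "ctx._source.count += params.step",
            "params": {"step": step},
        },
        # Indexed as-is when the document does not exist yet (the script is then not run).
        "upsert": {"userid": user_id, "url": url, "count": step},
    }
```

This removes the client-side read and re-push, although it is still a script and internally still reindexes the whole document.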

Elasticsearch: group into buckets, reduce to one document per bucket, group these documents

I'm looking for a way to compute the bounce rate of web pages with Elasticsearch.
We collect data in the following simplified structure:
{"id": "1", "timestamp": "2017-01-25:15:23", "sessionid": "s1", "page": "index"}
{"id": "2", "timestamp": "2017-01-25:15:24", "sessionid": "s1", "page": "checkout"}
{"id": "3", "timestamp": "2017-01-25:15:25", "sessionid": "s1", "page": "confirm"}
{"id": "4", "timestamp": "2017-01-25:15:26", "sessionid": "s2", "page": "index"}
{"id": "5", "timestamp": "2017-01-25:15:27", "sessionid": "s2", "page": "checkout"}
{"id": "6", "timestamp": "2017-01-25:15:26", "sessionid": "s3", "page": "product_a"}
{"id": "7", "timestamp": "2017-01-25:15:28", "sessionid": "s3", "page": "checkout"}
For this sample the result of the analysis should be:
2/3 of the users get lost at the checkout page.
1/3 of the users get lost at the confirm page.
More formally, I'm looking for a generic approach to implementing the following algorithm in an Elasticsearch query:
group documents by a field
sort each group (bucket) by a second field and reduce to the topmost document
group all these remaining documents by a third field
sort groups by number of documents
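To make the four steps concrete, here is a minimal client-side sketch that runs them over the sample documents in plain Python. It is not an Elasticsearch query, and the function name is an assumption; it only illustrates the expected result.

```python
from collections import Counter

EVENTS = [
    {"id": "1", "timestamp": "2017-01-25:15:23", "sessionid": "s1", "page": "index"},
    {"id": "2", "timestamp": "2017-01-25:15:24", "sessionid": "s1", "page": "checkout"},
    {"id": "3", "timestamp": "2017-01-25:15:25", "sessionid": "s1", "page": "confirm"},
    {"id": "4", "timestamp": "2017-01-25:15:26", "sessionid": "s2", "page": "index"},
    {"id": "5", "timestamp": "2017-01-25:15:27", "sessionid": "s2", "page": "checkout"},
    {"id": "6", "timestamp": "2017-01-25:15:26", "sessionid": "s3", "page": "product_a"},
    {"id": "7", "timestamp": "2017-01-25:15:28", "sessionid": "s3", "page": "checkout"},
]

def last_page_counts(events):
    """Count, per page, the sessions whose latest event was on that page."""
    last_by_session = {}
    for event in events:
        # Steps 1 and 2: group by session and keep only the latest document.
        # The timestamps share one fixed-width format, so string comparison orders them.
        current = last_by_session.get(event["sessionid"])
        if current is None or event["timestamp"] > current["timestamp"]:
            last_by_session[event["sessionid"]] = event
    # Steps 3 and 4: group the remaining documents by page, sorted by count.
    return Counter(e["page"] for e in last_by_session.values()).most_common()

print(last_page_counts(EVENTS))  # [('checkout', 2), ('confirm', 1)]
```

Two of the three sessions end on checkout and one on confirm, which matches the 2/3 and 1/3 figures above.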
My first attempt was to solve this with a terms aggregation followed by a top_hits aggregation, and finally a terms_pipeline aggregation to group the pages.
(simplified aggregation structure)
aggs
  terms
    field: sessionid
    aggs
      top_hits
        sort: timestamp desc
        size: 1
  terms_pipeline
    bucket_path: terms>top_hits
    field: page
... but unfortunately there is no such thing as a terms_pipeline aggregation. My bad.
Any ideas for an alternative approach?
Maybe I misunderstood something, but if you want to know where your users are bouncing, then since all pages are in a sequence, you could simply use a terms aggregation on the page field (to know which pages were visited) and a cardinality aggregation on the sessionid field (to know how many unique sessions you have). In this case, cardinality(sessionid) would yield 3.
Then again, since all pages are in a sequence, I think you don't really need to know what happened within a given session.
In your example, from the terms(page) aggregation, you'd know that 3 users landed on the checkout page but only one went on to the confirm page. Using the cardinality of the sessions, this implicitly means that 2 users (3 total sessions - 1 confirm page hit) bounced on the checkout page.
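A minimal sketch of this suggestion follows: one request body with the two aggregations, plus a small helper that turns per-page hit counts into drop-offs along a known page sequence. The aggregation names, the helper, and the `funnel` ordering are assumptions, and the terms aggregation presumes `page` is mapped as a keyword field.

```python
def build_bounce_query():
    """Request body with terms(page) and cardinality(sessionid)."""
    return {
        "size": 0,
        "aggs": {
            "pages": {"terms": {"field": "page"}},
            "sessions": {"cardinality": {"field": "sessionid"}},
        },
    }

def drop_offs(page_counts, funnel):
    """For each page in the ordered funnel, hits on it minus hits on the next page."""
    lost = {}
    for i, page in enumerate(funnel):
        next_hits = page_counts.get(funnel[i + 1], 0) if i + 1 < len(funnel) else 0
        lost[page] = page_counts.get(page, 0) - next_hits
    return lost

# Per-page hit counts of the sample data, as terms(page) would report them.
sample_counts = {"index": 2, "checkout": 3, "confirm": 1, "product_a": 1}
print(drop_offs(sample_counts, ["checkout", "confirm"]))  # {'checkout': 2, 'confirm': 1}
```

This only works because the pages are assumed to form a fixed sequence; it does not look at what happened inside an individual session.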

GSA report with 'Searches that returned results' and 'Searches that did not return results'

I am using GSA 7.2. In the GSA Search Report there are two report types:
Searches that returned results
Searches that did not return results
What is the difference between these types?
I ran last week's Search Report with both types, and a few of the same keywords and queries appear in both report types with different occurrence counts. My question is: if GSA shows results for some keywords and queries, then they should not appear in the no-results report. Maybe my understanding is wrong; please correct me.
Thank you for your help.
It looks like this issue has been around for a long time.
Check this out. I am not sure whether it has been fixed or not, so it is better to check with Google support.
BTW, do you have user-specific search (role-based search)? If so, try searching with the same term for every user/role and see whether any user/role gets zero results.
When you run the reports, are you restricting the search to a specific collection and/or time range?
If you run a report for "All Collections" then you might see items show up in both reports because users are searching against a collection that does not have the documents.
What you'd want to do is run a report for a single day. If you see the same behavior for a single collection then download the search logs for that day and look for searches for that key term and see if the search query parameters are the same. If they are different then there could be some malformed search queries being executed. If not then it could be a transient issue with the GSA.
