Elastic Search - Find Data Common To Multiple Queries - elasticsearch

In Elasticsearch I have an index that contains users and the URLs that they've visited. I want to be able to search for multiple users and find the URLs that all of them have visited.
I can grab the URLs for a single user:
{
"size": 0,
"query": {
"filtered": {
"query": {
"query_string": {
"query": "user:bob"
}
},
"filter": {
"bool": {
"must": [{
"range": {
"#timestamp": {
"gte": 1430456930549,
"lte": 1430666630549
}
}
}],
"must_not": []
}
}
}
},
"aggs": {
"1": {
"terms": {
"field": "url",
"size": 0,
"order": {
"_count": "desc"
}
}
}
}
}
But how do I combine the results from each user (doing some sort of intersection)? I can do this programmatically, but can Elasticsearch do it with some sort of aggregation?

You can use sub-aggregations: a terms aggregation on url nested inside a terms aggregation on user:
{
"query": {
"match_all": {}
},
"aggs": {
"users": {
"terms": {
"field": "user"
},
"aggs": {
"urls": {
"terms": {
"field": "url"
}
}
}
}
}
}
This will give you buckets of users, each containing buckets of urls.
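If the intersection is done on the client side, the nested buckets are enough to work with. Here is a minimal Python sketch, assuming the response has the shape produced by the users/urls query above and that both terms aggregations were given a size large enough to return every bucket (the default only returns the top 10). The function name common_urls is made up for illustration.

```python
def common_urls(response, users):
    """Return the URLs that appear in the "urls" buckets of every requested user."""
    # Map each user key to the set of URL keys found in its sub-aggregation.
    urls_by_user = {
        bucket["key"]: {url["key"] for url in bucket["urls"]["buckets"]}
        for bucket in response["aggregations"]["users"]["buckets"]
    }
    # A user with no bucket contributes an empty set, so the intersection is empty.
    url_sets = [urls_by_user.get(user, set()) for user in users]
    if not url_sets:
        return set()
    return set.intersection(*url_sets)
```

Calling common_urls(response, ["alice", "jack"]) on a parsed search response would then return only the URLs present under both users.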
UPD: I misunderstood your question at first. I'm not aware of an aggregation that does exactly what you're looking for. However, you may be able to take advantage of the significant terms aggregation:
{
"query": {
"filtered": {
"filter": {
"terms": {
"user": ["alice", "jack"]
}
}
}
},
"aggs": {
"urls": {
"significant_terms": {
"field": "url",
"size": 5
}
}
}
}
This will give you buckets of the URLs that stand out within the given set of users. Note that in any case it is not a strict intersection, but rather a list whose top elements are URLs that are more frequent in the so-called foreground group (the query scope) than they are in the background group (all documents in the index).
URLs that are common to the selected users are very likely to score high in this aggregation.
But if each of the two requested users visits her own favourite site far more than other sites and never visits the other user's favourite, both of those URLs will still appear, and they will score higher than the URLs the users have in common.
In general I recommend exploring this aggregation; it can give some interesting insights into the data. For instance, a more fitting use of it on your dataset would be finding sites that visitors of some other site have in common.
You can read more about it here and here.
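A different approach that is sometimes used for this kind of intersection, offered here only as a sketch and not as something from the answer above: bucket by url, count the distinct users per bucket with a cardinality sub-aggregation, and keep the buckets where that count equals the number of requested users. The function names and the max_urls parameter are hypothetical, the bool/filter query form assumes a newer Elasticsearch than the filtered queries shown here, and cardinality is an approximate count (it is expected to be accurate for a handful of users).

```python
def common_urls_request(users, max_urls=1000):
    """Build a search body that counts distinct users per URL (assumed field names)."""
    return {
        "size": 0,
        "query": {"bool": {"filter": [{"terms": {"user": users}}]}},
        "aggs": {
            "urls": {
                "terms": {"field": "url", "size": max_urls},
                "aggs": {"distinct_users": {"cardinality": {"field": "user"}}},
            }
        },
    }


def urls_visited_by_all(response, users):
    """Keep only the URL buckets whose distinct-user count covers every requested user."""
    wanted = len(set(users))
    return [
        bucket["key"]
        for bucket in response["aggregations"]["urls"]["buckets"]
        if bucket["distinct_users"]["value"] >= wanted
    ]
```

The filtering step runs on the client here; whether it can be pushed into the request depends on the Elasticsearch version in use.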

Related

Is it possible to paginate term aggregation result with search term?

Is it possible to use pagination in term aggregation query with a search term?
I need to paginate the result of the following query, but I am not able to find any solution.
{
"sort": [{
"create_date": {
"order": "desc"
}
}],
"query": {
"bool": {
"must": []
}
},
"aggs": {
"genres": {
"terms": {
"field": "mentions.keyword",
"include": "insta.*"
}
}
}
}
You could use size and from to tell the engine to return the documents in that range each time you come back for the next page (these two parameters page through the returned documents, not through the aggregation buckets). Keep two variables in your service design, and have whoever calls the service pass their values (basically the offset to start from and the limit).
{
"from": from,
"size": limit,
"sort": [{
"create_date": {
"order": "desc"
}
}],
"query": {
"bool": {
"must": []
}
},
"aggs": {
"genres": {
"terms": {
"field": "mentions.keyword",
"include": "insta.*"
}
}
}
}
If you expose this query through a service, for example mysearch, then call the service like this:
mysearch?searchTerm=theWord&from=0&limit=15
In the next call you do the same but with a different from value. Since from is a zero-based offset, the second page of 15 starts at 15:
mysearch?searchTerm=theWord&from=15&limit=15
If this information is not enough, then post some sample documents to play with.
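The offset arithmetic behind those calls can be sketched in Python. This assumes a zero-based page number supplied by the caller; the helper name page_params is hypothetical.

```python
def page_params(page, limit):
    """Translate a zero-based page number into Elasticsearch "from"/"size" values."""
    if page < 0 or limit <= 0:
        raise ValueError("page must be >= 0 and limit must be > 0")
    # "from" is the number of hits to skip, so page 1 with limit 15 starts at 15.
    return {"from": page * limit, "size": limit}
```

The returned dict can be merged into the search body in place of the from and limit placeholders.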
If you are trying to paginate the results inside the terms aggregation, you can use either of two options:
1. In a terms aggregation you can use partitions to paginate the data. Refer to the documentation here.
2. You can use a composite aggregation. In a composite aggregation you can only access data sequentially, using the after key; you won't be able to jump between pages.
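The sequential paging with the after key can be sketched as a Python generator. This is a sketch under assumptions: search stands for any callable that sends a request body to Elasticsearch and returns the parsed response, the aggregation name paged and source name value are made up, and the include pattern from the question is left out and would have to be applied in the query or on the client.

```python
def iter_composite_buckets(search, field, page_size=100):
    """Yield every bucket of a composite aggregation, one page at a time."""
    after_key = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"value": {"terms": {"field": field}}}],
        }
        if after_key is not None:
            # Resume right after the last bucket of the previous page.
            composite["after"] = after_key
        body = {"size": 0, "aggs": {"paged": {"composite": composite}}}
        agg = search(body)["aggregations"]["paged"]
        buckets = agg["buckets"]
        if not buckets:
            return
        yield from buckets
        after_key = agg.get("after_key")
        if after_key is None:
            return
```

Because each request depends on the after_key of the previous one, pages can only be walked in order, which matches the limitation described above.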

Elasticsearch, counting not included terms

I'm trying to get a single, or a couple, of ES requests to count the terms I have not included in my current search.
Let me elaborate.... My front-end looks like this:
I have Closed currently selected, so the other items should show how many items they would add if I were to include that term.
Assume that closed == 500 and Rejected == 100;
While I have closed selected the rejected field should have the number 100 appended to it. If I deselect closed , it should show the number 500. If I select rejected and not select closed it should also show 500.
Easy enough huh? We just add a bucket counting the status field and that will return a bucket for each of these items, we then get the value from it and display it.
That part I got :) However, when I actually add a term (for example one that filters on NoOffer), the buckets no longer include the other statuses.
This is what my query looks like (global buckets by: ChintanShah25)
{
"size": 50,
"from": 1,
"sort": [
{
"createdAt": "desc"
}
],
"query": {
"bool": {
"must": [
{
"bool": {
"should": [
{
"wildcard": {
"fromPlace": "*rotter*"
}
}
]
}
},
{
"bool": {
"should": [
{
"wildcard": {
"status": "closed"
}
}
]
}
}
]
}
},
"aggs": {
"status": {
"global": {},
"aggs": {
"all_status": {
"terms": {
"field": "status.raw",
"size": 10
}
}
}
}
}
}
The global aggregation now shows all the different status codes, but it doesn't take the rest of the query into account: the "fromPlace" filter doesn't get applied.
I guess you are looking for the global aggregation, which will include all documents regardless of the query. You could also use a filter aggregation for selective stats if you want.
{
"query": {
"term": {
"status": {
"value": "closed"
}
}
},
"size": 0,
"aggs": {
"everything": {
"global": {},
"aggs": {
"all_status": {
"terms": {
"field": "status.raw",
"size": 10
}
}
}
}
}
}
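One possible way to combine the two ideas, so that the status counts ignore the status filter but still respect the other filters such as fromPlace, is to nest a filter aggregation under the global one. The Python sketch below only builds the aggs part of the request. The function name, the other_filters aggregation name and the filters_by_field mapping are hypothetical, and passing a bool query directly to a filter aggregation assumes an Elasticsearch version that accepts queries there.

```python
def facet_counts_aggs(filters_by_field, facet_field, raw_field, size=10):
    """Build a global aggregation that re-applies every filter except the facet's own.

    filters_by_field maps a field name to the query clause that filters on it.
    """
    other_clauses = [
        clause for field, clause in filters_by_field.items() if field != facet_field
    ]
    return {
        facet_field: {
            "global": {},
            "aggs": {
                "other_filters": {
                    "filter": {"bool": {"must": other_clauses}},
                    "aggs": {
                        "all_" + facet_field: {
                            "terms": {"field": raw_field, "size": size}
                        }
                    },
                }
            },
        }
    }
```

With the filters from the question, facet_counts_aggs(filters, "status", "status.raw") would count statuses over all documents matching the fromPlace wildcard, whichever status is currently selected.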

Filter/Query support in Elasticsearch Top hits Aggregation

The Elasticsearch documentation states: "The top_hits aggregation returns regular search hits, because of this many per hit features can be supported." Crucially, the list includes "Named filters and queries".
But trying to add any filter or query throws SearchParseException: Unknown key for a START_OBJECT
Use case: I have items which have list of nested comments
items{id} -> comments {date, rating}
I want to get top rated comment for each item in the last week.
{
"query": {
"match_all": {}
},
"aggs": {
"items": {
"terms": {
"field": "id",
"size": 10
},
"aggs": {
"comment": {
"nested": {
"path": "comments"
},
"aggs": {
"top_comment": {
"top_hits": {
"size": 1,
//need filter here to select only comments of last week
"sort": {
"comments.rating": {
"order": "desc"
}
}
}
}
}
}
}
}
}
}
So is the documentation wrong, or is there any way to add a filter?
https://www.elastic.co/guide/en/elasticsearch/reference/2.1/search-aggregations-metrics-top-hits-aggregation.html
Are you sure you have mapped them as nested? I've just tried to execute such a query on my data and it worked fine.
If so, you could simply add a filter aggregation, right after nested aggregation (hopefully I haven't messed up curly brackets):
POST data/_search
{
"size": 0,
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"nested": {
"path": "comments",
"query": {
"range": {
"comments.date": {
"gte": "now-1w",
"lte": "now"
}
}
}
}
}
}
},
"aggs": {
"items": {
"terms": {
"field": "id",
"size": 10
},
"aggs": {
"nested": {
"nested": {
"path": "comments"
},
"aggs": {
"filterComments": {
"filter": {
"range": {
"comments.date": {
"gte": "now-1w",
"lte": "now"
}
}
},
"aggs": {
"topComments": {
"top_hits": {
"size": 1,
"sort": {
"comments.rating": "desc"
}
}
}
}
}
}
}
}
}
}
}
P.S. Always include the FULL path for fields of nested objects.
So this query will:
1. Filter the documents down to those that actually have comments newer than one week, which narrows the set of documents for the aggregation (filtered query)
2. Do a terms aggregation on the id field
3. Open the nested sub-documents (comments)
4. Filter them by date
5. Return the most badass one (highest rated)
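Reading the result back out of the response can be sketched in Python. This assumes the aggregation names used in the query above (items, nested, filterComments, topComments) and that each hit's _source carries the comment; the function name is made up for illustration.

```python
def top_comment_per_item(response):
    """Map each item id to its top-rated recent comment, or None if it has none."""
    result = {}
    for bucket in response["aggregations"]["items"]["buckets"]:
        hits = bucket["nested"]["filterComments"]["topComments"]["hits"]["hits"]
        # top_hits was requested with size 1, so at most one hit is present.
        result[bucket["key"]] = hits[0]["_source"] if hits else None
    return result
```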

Elasticsearch: how to extract possible pattern over multiple events with same session id

Summary:
I use Elasticsearch for my weblogs. I want to get an answer to the question: how many clients requested page A and page B within one session?
Details:
My Elasticsearch node contains the events that are logged on my website. Each event contains, among other things, a timestamp, url, referrer and session id. At the moment I know how to find, for example, how many sessions requested url xyz. But I don't know how to find cases where both page A and page B are requested within one session (and, of course, not cases where page A or B only appears as the referrer).
Is this something that is somehow supported within elasticsearch?
The query should look something like this (assuming your url and session_id are not_analyzed):
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"url": "[Page A URL]"
}
},
{
"term": {
"url": "[Page B URL]"
}
}
]
}
}
}
},
"aggs": {
"requested_both_pages": {
"terms": {
"field": "session_id"
}
}
}
}
The doc_count in the response will be the number you're looking for.
Keep in mind that if your url is analyzed and you need to do fuzzy matching, then you'll have to use a match query instead of the term filter. I generally wouldn't recommend an analyzed referrer. Instead I would break it down into its parts, create a nested url object with each string not_analyzed, and then use a term filter. You can still run a wildcard query against not_analyzed fields if you need some fuzziness.
I figured out a query that at least returns how many times url A and url B are requested per session. I was not aware that I could use this style of aggregation. It is still not the perfect solution, as it can return sessions where url A has counts and url B has none. So I will not mark the answer as accepted, unless some expert can tell me that my request is just not possible at all.
{
"query": {
"filtered": {
"filter": {
"bool": {
"should": [
{
"term": {
"Url": "[Page A URL]"
}
},
{
"term": {
"Url": "[Page B URL]"
}
}
]
}
}
}
},
"aggs": {
"sessions_all": {
"terms": {
"field": "session_id",
"size": 100
},
"aggs": {
"Page_A_URL": {
"filter": {
"term": {
"Url": "[Page A URL]"
}
}
},
"Page_B_URL": {
"filter": {
"term": {
"Url": "[Page B URL]"
}
}
}
}
}
}
}
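The remaining step, dropping sessions where only one of the two pages has hits, can be done on the client. Here is a minimal Python sketch, assuming the response shape of the query above and that the Page_B_URL filter matches the page B URL. The function name is hypothetical, and only the sessions returned by the terms aggregation (100 in the query above) are considered.

```python
def sessions_with_both_pages(response):
    """Return the session ids whose buckets have hits for both page A and page B."""
    return [
        bucket["key"]
        for bucket in response["aggregations"]["sessions_all"]["buckets"]
        if bucket["Page_A_URL"]["doc_count"] > 0
        and bucket["Page_B_URL"]["doc_count"] > 0
    ]
```

The length of the returned list is then the number of sessions that requested both pages.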

ElasticSearch filtering by field1 THEN field2 THEN take max of field3

I am struggling to get the information that I need from ElasticSearch.
My log statements are like this:
field1: Example
field2: Example2
field3: Example3
I would like to search a timeframe (using last 24 hours) to find all data that has this in field1 and that in field2.
There then may be multiple this.that.[field3] entries, so I want to only return the maximum of that field.
In fact, in my data, field3 is actually the key of the entry.
What is the best way of retrieving the information I need? I have managed to get the results returned using aggs, but the data is in buckets, and I am only interested in the data with the max value of field3.
I have added an example of the query that I am looking to do: https://jsonblob.com/54535d49e4b0d117eeaf6bb4
{
"size": 0,
"aggs": {
"agg_129": {
"filters": {
"filters": {
"CarName: Toyota": {
"query": {
"query_string": {
"query": "CarName: Toyota"
}
}
}
}
},
"aggs": {
"agg_130": {
"filters": {
"filters": {
"Attribute: TimeUsed": {
"query": {
"query_string": {
"query": "Attribute: TimeUsed"
}
}
}
}
},
"aggs": {
"agg_131": {
"terms": {
"field": "#timestamp",
"size": 0,
"order": {
"_count": "desc"
}
}
}
}
}
}
}
},
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{
"range": {
"#timestamp": {
"gte": "2014-10-27T00:00:00.000Z",
"lte": "2014-10-28T23:59:59.999Z"
}
}
}
],
"must_not": []
}
}
}
}
}
So, that example above is showing only those that have CarName = Toyota and Attribute = TimeUsed.
My data is as follows:
There are x number of cars CarName and each car has y number of Attributes and each of those Attributes have a document with a timestamp.
To begin with, I was looking for a query for CarName.Attribute.timestamp (latest), however, if I am able to use just ONE query to get the latest timestamp for EVERY attribute for EVERY CarName, then that would decrease query calls from ~50 to one.
If you are using Elasticsearch v1.3+, you can add a top_hits aggregation with the parameter size: 1 and a descending sort on the field3 value.
This will return the whole document with the maximum value of that field, as you wish.
This example in the documentation might do the trick.
Edit:
OK, it seems you don't need the whole document, but only the maximum timestamp value. You can use a max aggregation instead of a top_hits one.
The following query (not tested) should give you, in a single request, the maximum timestamp value for each of the top 10 Attribute values within each of the top 10 CarName values.
A terms aggregation is like a GROUP BY clause, so you should not have to query 50 times to retrieve the values for each CarName/Attribute combination: this is the point of nesting a terms aggregation on Attribute inside the CarName aggregation.
Note that, to work properly, the CarName and Attribute fields should be not_analyzed. If that's not the case, you will get "funny" results in your buckets. The problem (and a possible solution) is very well described here.
Feel free to change the size parameter of the terms aggregations to fit your case.
{
"size": 0,
"aggs": {
"by_carnames": {
"terms": {
"field": "CarName",
"size": 10
},
"aggs": {
"by_attribute": {
"terms": {
"field": "Attribute",
"size": 10
},
"aggs": {
"max_timestamp": {
"max": {
"field": "#timestamp"
}
}
}
}
}
}
},
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"range": {
"#timestamp": {
"gte": "2014-10-27T00:00:00.000Z",
"lte": "2014-10-28T23:59:59.999Z"
}
}
}
]
}
}
}
}
}
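Flattening the nested buckets of that response into one row per CarName/Attribute pair can be sketched in Python. This assumes the aggregation names from the query above and that the max aggregation reports its result under value; the function name is made up for illustration.

```python
def latest_timestamps(response):
    """Return (car_name, attribute, max_timestamp) rows from the nested buckets."""
    rows = []
    for car in response["aggregations"]["by_carnames"]["buckets"]:
        for attribute in car["by_attribute"]["buckets"]:
            rows.append(
                (car["key"], attribute["key"], attribute["max_timestamp"]["value"])
            )
    return rows
```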

Resources