Elasticsearch: how to extract possible pattern over multiple events with same session id

Summary:
I use Elasticsearch for my weblogs. I want to get an answer to the question: how many clients requested page A and page B within one session?
Details:
My Elasticsearch node contains the events that are logged on my website. Each event contains, among other things, a timestamp, url, referrer and session id. At the moment I know how to find, for example, how many sessions requested url xyz. But I don't know how to find cases where both page A and page B are requested within one session. And of course, cases where page A or B only appears in the referrer should not count.
Is this something that is somehow supported within elasticsearch?

The query should look something like this (assuming your url and session_id are not_analyzed):
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"url": "[Page A URL]"
}
},
{
"term": {
"url": "[Page B URL]"
}
}
]
}
}
}
},
"aggs": {
"requested_both_pages": {
"terms": {
"field": "session_id"
}
}
}
}
The doc_count in the response will be the number you're looking for.
Keep in mind that if your url is analyzed and you need fuzzy matching, you'll have to use a match query instead of the term filter. I generally wouldn't recommend an analyzed referrer. Instead, I would break it down into its parts, create a nested url object with each string not_analyzed, and then use a terms filter. You can still do a wildcard query on not_analyzed fields if you need some fuzziness.

I figured out a query that at least returns how many times url A and url B are requested per session. I was not aware that I could use this style of aggregation. It is still not the perfect solution, as it can return sessions where url A has counts and url B has none. So I will not mark the answer as solved, unless some expert can tell me that my request is just not possible at all.
{
"query": {
"filtered": {
"filter": {
"bool": {
"should": [
{
"term": {
"Url": "[Page A URL]"
}
},
{
"term": {
"Url": "[Page B URL]"
}
}
]
}
}
}
},
"aggs": {
"sessions_all": {
"terms": {
"field": "session_id",
"size": 100
},
"aggs": {
"Page_A_URL": {
"filter": {
"term": {
"Url": "[Page A URL]"
}
}
},
"Page_B_URL": {
"filter": {
"term": {
"Url": "[Page B URL]"
}
}
}
}
}
}
}
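One way to narrow this down to sessions that hit both pages is a bucket_selector pipeline aggregation on the two filter counts, with a client-side filter over the buckets as a fallback. The following is a minimal sketch in Python, not a tested solution: it assumes Elasticsearch 5.x or later (Painless scripts and the bool/filter syntax rather than the `filtered` query used above) and the field names `Url` and `session_id` from the query above. The function names and the aggregation names `page_a`, `page_b` and `both_pages` are made up for illustration.

```python
def both_pages_query(page_a, page_b, size=100):
    """Build a search body that keeps only session buckets containing both URLs."""
    return {
        "size": 0,
        # Only events for page A or page B are relevant.
        "query": {"bool": {"filter": {"terms": {"Url": [page_a, page_b]}}}},
        "aggs": {
            "sessions_all": {
                "terms": {"field": "session_id", "size": size},
                "aggs": {
                    "page_a": {"filter": {"term": {"Url": page_a}}},
                    "page_b": {"filter": {"term": {"Url": page_b}}},
                    # Assumption: drop buckets where either count is zero.
                    "both_pages": {
                        "bucket_selector": {
                            "buckets_path": {
                                "a": "page_a>_count",
                                "b": "page_b>_count",
                            },
                            "script": "params.a > 0 && params.b > 0",
                        }
                    },
                },
            }
        },
    }


def sessions_with_both(response):
    """Client-side fallback: session ids whose bucket counts both pages."""
    buckets = response["aggregations"]["sessions_all"]["buckets"]
    return [
        b["key"]
        for b in buckets
        if b["page_a"]["doc_count"] > 0 and b["page_b"]["doc_count"] > 0
    ]
```

The number of sessions is then the length of the returned bucket list. Note that the terms aggregation only looks at the top `size` sessions, so with many sessions the count can be incomplete.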

Related

Update first document via Update By Query API

I'm trying to get Elasticsearch to do the same thing that MongoDB does with the findOneAndUpdate method, but this doesn't seem to be possible.
The use case is that multiple servers and threads will look into the specific index for the next task to complete.
Therefore my best bet would be to update the "next" task/document with a unique ID and then retrieve the document afterwards.
This query will give me the next document to retrieve:
GET /test_index/_search
{
"query": {
"bool": {
"must_not": {
"exists": {
"field": "next_id"
}
}
}
},
"sort": {
"next_update": {"order": "asc"}
},
"size": 1
}
But I can't seem to figure out how to use the Update By Query API to update only a single row. I've been trying this query, but it updates every found document:
POST /test_index/_update_by_query
{
"query": {
"bool": {
"must_not": {
"exists": {
"field": "next_id"
}
}
}
},
"sort": {
"next_update": {"order": "asc"}
},
"script": {
"source": "ctx._source['next_id'] = params.next_id",
"params": {
"next_id": "xxxx"
}
}
}
How can I solve this?
You can use the max_docs parameter in _update_by_query and set it to 1, so that the update is executed for only one document.
See the Update By Query API documentation for details.
POST /test_index/_update_by_query
{
"query": {
"bool": {
"must_not": {
"exists": {
"field": "next_id"
}
}
}
},
"max_docs": 1,
"sort": {
"next_update": {"order": "asc"}
},
"script": {
"source": "ctx._source['next_id'] = params.next_id",
"params": {
"next_id": "xxxx"
}
}
}
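To get something close to findOneAndUpdate, the claim can be followed by a lookup on the id that was just written. Below is a rough sketch in Python that only builds the two request bodies. It assumes the intent is to store the worker's unique id in `next_id`, and that `next_id` is mapped so a term query matches it exactly. The function names are hypothetical, and the sort clause is simply carried over from the query above without verifying how a given Elasticsearch version treats it.

```python
def claim_next_task_body(worker_id):
    """Body for POST /test_index/_update_by_query: tag at most one unclaimed task."""
    return {
        "query": {"bool": {"must_not": {"exists": {"field": "next_id"}}}},
        "max_docs": 1,  # stop after a single document
        "sort": {"next_update": {"order": "asc"}},  # copied from the answer above
        "script": {
            "source": "ctx._source['next_id'] = params.next_id",
            "params": {"next_id": worker_id},
        },
    }


def fetch_claimed_task_body(worker_id):
    """Body for GET /test_index/_search: retrieve the task tagged with worker_id."""
    return {"query": {"term": {"next_id": worker_id}}, "size": 1}
```

With several workers racing for the same document, one of them can run into a version conflict, so each worker should check that the follow-up search actually returns a document before treating the task as its own.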

Terms aggregation across two fields in Elasticsearch

I'm not sure what I want to do is possible. I have data that looks like this:
{
"Actor1Name": "PERSON",
"Actor2Name": "OTHERPERSON"
}
I use copy_to in order to populate a secondary field, ActorNames, with both values.
I am trying to build a typeahead capability where a user can start to type a name and it will populate with the top hits for that prefix. I want it to search across both actor fields. The only problem is when I search across ActorNames, I get both values even if only one matches. That means if I'm searching for prefix O that I will get both OTHERPERSON (desired) and PERSON (undesired) in my results based on the above document.
My current solution is to run 2 aggregations and combine them client side, but is it possible to do this purely in ES?
Current query:
{
"query": {
"prefix": {
"ActorNames": "O"
}
},
"aggs": {
"actor1": {
"filter": {
"prefix": {
"Actor1Name": "O"
}
},
"aggs": {
"actor1": {
"terms": {
"field": "Actor1Name"
}
}
}
},
"actor2": {
"filter": {
"prefix": {
"Actor2Name": "O"
}
},
"aggs": {
"actor2": {
"terms": {
"field": "Actor2Name"
}
}
}
}
}
}
If you want to check the prefix condition on both fields, why not AND the prefix queries on both fields? For example:
GET /my_index/_search
{
"query": {
"bool": {
"must": [
{
"prefix": {
"Actor1Name": "O"
}
},
{
"prefix": {
"Actor2Name": "O"
}
}
]
}
}
}
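Another option that may keep this in a single request is the `include` parameter of the terms aggregation, which restricts the returned buckets to terms matching a regular expression. Here is a minimal sketch in Python that builds such a request body. It assumes `ActorNames` is indexed as exact (not analyzed) values, so that each bucket is a full name. The helper names are made up, and the escaping is a conservative guess that backslash-escapes every non-alphanumeric character.

```python
import re


def escape_for_include(text):
    """Backslash-escape every character that is not an ASCII letter or digit."""
    return re.sub(r"([^0-9A-Za-z])", r"\\\1", text)


def typeahead_query(prefix, size=10):
    """Build a search body whose buckets only contain names starting with prefix."""
    return {
        "size": 0,
        "query": {"prefix": {"ActorNames": prefix}},
        "aggs": {
            "actors": {
                "terms": {
                    "field": "ActorNames",
                    # Only keep terms that themselves start with the prefix.
                    "include": escape_for_include(prefix) + ".*",
                    "size": size,
                }
            }
        },
    }
```

For the example document, `typeahead_query("O")` would still match the document through the prefix query, but the `include` pattern `O.*` should leave only OTHERPERSON in the buckets.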

Elasticsearch, counting not included terms

I'm trying to get a single, or a couple, of ES requests to count the terms I have not included in my current search.
Let me elaborate.... My front-end looks like this:
I have Closed currently selected, so the other items should show how many items they would add if I were to include that term.
Assume that closed == 500 and Rejected == 100;
While I have closed selected the rejected field should have the number 100 appended to it. If I deselect closed , it should show the number 500. If I select rejected and not select closed it should also show 500.
Easy enough huh? We just add a bucket counting the status field and that will return a bucket for each of these items, we then get the value from it and display it.
That part I got :) However, when I actually add a term (for example one that filters on NoOffer), the buckets no longer include the other statuses.
This is what my query looks like (global buckets by: ChintanShah25)
{
"size": 50,
"from": 1,
"sort": [
{
"createdAt": "desc"
}
],
"query": {
"bool": {
"must": [
{
"bool": {
"should": [
{
"wildcard": {
"fromPlace": "*rotter*"
}
}
]
}
},
{
"bool": {
"should": [
{
"wildcard": {
"status": "closed"
}
}
]
}
}
]
}
},
"aggs": {
"status": {
"global": {},
"aggs": {
"all_status": {
"terms": {
"field": "status.raw",
"size": 10
}
}
}
}
}
}
The global now shows all the different status codes, but it doesn't take into regard the rest of the statement. The "fromPlace" filter doesn't get applied.
I guess you are looking for the global aggregation, which includes all documents regardless of the query. You could also use a filter aggregation for selective stats if you want.
{
"query": {
"term": {
"status": {
"value": "closed"
}
}
},
"size": 0,
"aggs": {
"everything": {
"global": {},
"aggs": {
"all_status": {
"terms": {
"field": "status.raw",
"size": 10
}
}
}
}
}
}
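If the counts should respect the other filters (such as fromPlace) but ignore the status selection, one common pattern is to move the status condition into `post_filter`, which is applied to the hits after the aggregations have been computed. A minimal sketch in Python that builds such a body follows. It assumes the selected statuses can be matched exactly against `status.raw`, which is not confirmed by the question, and the function name is hypothetical.

```python
def status_facet_query(place_pattern, selected_statuses, page_size=50):
    """Build a body whose status counts ignore the status selection itself."""
    body = {
        "size": page_size,
        "sort": [{"createdAt": "desc"}],
        # Filters that should affect both the hits and the counts.
        "query": {"bool": {"must": [{"wildcard": {"fromPlace": place_pattern}}]}},
        # Counts per status, computed before post_filter is applied.
        "aggs": {"all_status": {"terms": {"field": "status.raw", "size": 10}}},
    }
    if selected_statuses:
        # Narrows the hits only, not the aggregation above.
        body["post_filter"] = {"terms": {"status.raw": list(selected_statuses)}}
    return body
```

With this shape no global aggregation is needed, because the main query no longer contains the status condition.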

ElasticSearch multi_match if field exists apply filter otherwise dont worry about it?

So we got an elasticsearch instance, but a job is requiring a "combo search" (A single search field, with checkboxes for types across a specific index)
This is fine, I simply apply this kind of search to my index (for brevity: /posts):
{
"query": {
"multi_match": {
"query": querystring,
"type": "cross_fields",
"fields": ["title", "name"]
}
}
}
As you may guess from the need for the multi_match here, the schemas to each of these types differs in one way or another. And that's my challenge right now.
In one of the types, just one, there is a field that doesn't exist in the other types; it's called active and it's a basic boolean, 0 or 1.
We want to index inactive items in the type for administration search purposes, but we don't want inactive items in this type to be exposed to the public when searching.
To my knowledge and understanding, I want to use a filter. But when I supply a filter asking for active to be 1, I only ever now get results from that type and nothing else. Because now it's explicitly looking for items with that field and equal to one.
How can I do a conditional "if field exists, make sure it equals 1, otherwise ignore this condition"? Can this even be achieved?
"if field exists, make sure it equals 1, otherwise ignore this condition"
I think it can be implemented like this:
{
"query": {
"filtered": {
"filter": {
"bool": {
"should": [
{
"bool": {
"must": [
{
"exists": {
"field": "active"
}
},
{
"term": {
"active": 1
}
}
]
}
},
{
"missing": {
"field": "active"
}
}
]
}
}
}
}
}
and the complete query:
{
"query": {
"filtered": {
"query": {
"multi_match": {
"query": "whatever",
"type": "cross_fields",
"fields": [
"title",
"name"
]
}
},
"filter": {
"bool": {
"should": [
{
"bool": {
"must": [
{
"exists": {
"field": "active"
}
},
{
"term": {
"active": 1
}
}
]
}
},
{
"missing": {
"field": "active"
}
}
]
}
}
}
}
}
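The `filtered` query and the `missing` filter were removed in Elasticsearch 5.0, so on newer versions the same condition would have to be written with a bool query and a negated `exists`. Here is a minimal sketch in Python that builds such a body. It assumes the field names `title`, `name` and `active` from the question, and the function name is made up.

```python
def public_search_query(text):
    """Build a search body: match text, and require active == 1 only where the field exists."""
    return {
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": text,
                        "type": "cross_fields",
                        "fields": ["title", "name"],
                    }
                },
                "filter": {
                    "bool": {
                        "should": [
                            # Documents that have the field must have it set to 1.
                            {"term": {"active": 1}},
                            # Documents without the field pass unconditionally.
                            {"bool": {"must_not": {"exists": {"field": "active"}}}},
                        ],
                        "minimum_should_match": 1,
                    }
                },
            }
        }
    }
```

The explicit `exists` clause from the answer above is left out here, since a term match on `active` already implies that the field is present.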

Elastic Search - Find Data Common To Multiple Queries

In Elastic Search I have an index that contains users and the URLs that they've visited. I want to be able to search multiple users and find the common URLs that they've visited.
I can grab the URLs for a single user:
{
"size": 0,
"query": {
"filtered": {
"query": {
"query_string": {
"query": "user:bob"
}
},
"filter": {
"bool": {
"must": [{
"range": {
"@timestamp": {
"gte": 1430456930549,
"lte": 1430666630549
}
}
}],
"must_not": []
}
}
}
},
"aggs": {
"1": {
"terms": {
"field": "url",
"size": 0,
"order": {
"_count": "desc"
}
}
}
}
}
But how do I combine the results from each user (doing some sort of intersection)? I can do this programmatically, but can Elastic Search do this with some sort of aggregation?
You may use sub-aggregations, terms by urls inside terms by users:
{
"query": {
"match_all": {}
},
"aggs": {
"users": {
"terms": {
"field": "user"
},
"aggs": {
"urls": {
"terms": {
"field": "url"
}
}
}
}
}
}
This will give you buckets of users, each containing buckets of urls.
Update: I misunderstood your question at first. I'm not aware of an aggregation that does what you're looking for. However, you may be able to take advantage of the significant terms aggregation:
{
"query": {
"filtered": {
"filter": {
"terms": {
"user": ["alice", "jack"]
}
}
}
},
"aggs": {
"urls": {
"significant_terms": {
"field": "url",
"size": 5
}
}
}
}
This will give you buckets with the most popular urls within the given set of users. Note that in any case it is not a strict intersection, but rather a list whose top elements are urls that are more frequent in the so-called foreground group (the query scope) than they are in the background group (all documents of the index).
Urls that are common to the selected users are very likely to score high on this aggregation.
But if each of the 2 requested users visits their own favourite site a lot more than other sites and doesn't visit the other user's favourite one at all, both urls will still appear, and will score higher than those in common.
Generally I recommend exploring this aggregation; it can give some interesting insights from data. For instance, a more relevant use of this aggregation on your dataset would be finding sites that are common among visitors of some other site.
You can read more about it in the significant terms aggregation documentation.
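If a strict intersection is needed, it can be computed client-side from the response of the users/urls sub-aggregation shown earlier. A minimal sketch in Python is below. It assumes the response has the aggregation names `users` and `urls` from that query, that the query was restricted to the users of interest, and that the terms sizes were raised enough to return every url per user. The function name is hypothetical.

```python
def common_urls(response):
    """Return the set of urls that appear in every user bucket of the response."""
    user_buckets = response["aggregations"]["users"]["buckets"]
    # One set of urls per user bucket.
    url_sets = [
        {url_bucket["key"] for url_bucket in user["urls"]["buckets"]}
        for user in user_buckets
    ]
    if not url_sets:
        return set()
    return set.intersection(*url_sets)
```

Because the terms aggregation only returns the top buckets by default, urls beyond the configured size would be missed by this approach.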
