Elasticsearch - get (unfiltered) aggregates for a (filtered) subset

I have an elasticsearch index containing "hit" documents (with fields like ip/timestamp/uri etc) which are populated from my nginx access logs.
I'm looking for a way to get the total number of hits per IP, but only for a subset of IPs, namely the ones that made a request today.
I know I can have a filtered aggregation by doing:
/search?size=0
{
  'query': { 'bool': { 'must': [
    {'range': { 'timestamp': { 'gte': $today}}},
    {'query_string': {'query': 'status:200 OR status:404'}},
  ]}},
  'aggregations': {'c': {'terms': {'field': 'ip', 'size': 99999}}}
}
but this counts only the hits made today. I want the total number of hits in the index, but only from IPs that have hits today. Is this possible?
-edit-
I've tried the global option but while
'aggregations': {'c': {'global': {}, 'aggs': {'c2': {'terms': {'field': 'remote_user', 'size': 99999}}}}}
returns counts for all IPs, it ignores my filter on timestamp (e.g. it includes IPs that only made requests a couple of days ago).

There is a way to achieve what you want in a single query, but it involves scripting, and performance might suffer depending on the volume of data you run it on.
The idea is to leverage the scripted_metric aggregation in order to build your own aggregation logic over the whole document set.
What we do below is pretty simple:
- we don't give any query, so we consider the full document set
- Map phase: we build a map of all IPs and, for each one:
  - we count the total number of hits
  - we flag it if it had hits today AND with one of the given statuses (same as what you do in your query)
- Reduce phase: we return the total hits count for each IP that was flagged as having hits today
Here is what the query looks like:
POST my-index/_search
{
  "size": 0,
  "aggs": {
    "all_time_hits": {
      "scripted_metric": {
        "init_script": "state.ips = [:]",
        "map_script": """
          // initialize total hits count for each IP and increment
          def ip = doc['ip.keyword'].value;
          if (state.ips[ip] == null) {
            state.ips[ip] = [
              'total_hits': 0,
              'hits_today': false
            ]
          }
          state.ips[ip].total_hits++;
          // flag IP if:
          // 1. it has hits today
          // 2. the hit had one of the given statuses
          def today = Instant.ofEpochMilli(new Date().getTime()).truncatedTo(ChronoUnit.DAYS);
          def hitDate = doc['timestamp'].value.toInstant().truncatedTo(ChronoUnit.DAYS);
          def hitToday = today.equals(hitDate);
          def statusOk = params.statuses.indexOf((int) doc['status'].value) >= 0;
          state.ips[ip].hits_today = state.ips[ip].hits_today || (hitToday && statusOk);
        """,
        "combine_script": "return state.ips;",
        "reduce_script": """
          def ips = [:];
          for (state in states) {
            for (ip in state.keySet()) {
              // only consider IPs that had hits today
              if (state[ip].hits_today) {
                if (ips[ip] == null) {
                  ips[ip] = 0;
                }
                ips[ip] += state[ip].total_hits;
              }
            }
          }
          return ips;
        """,
        "params": {
          "statuses": [200, 404]
        }
      }
    }
  }
}
And here is what the response looks like:
"aggregations" : {
"all_time_hits" : {
"value" : {
"123.123.123.125" : 1,
"123.123.123.123" : 4
}
}
}
I think that pretty much does what you expect.
The other option (more performant, because there is no script) requires two queries. First, run a query with the date range and status check plus a terms aggregation to retrieve all IPs that have hits today (like you do now); then run a second query that filters on those IPs (using a terms query) over the whole index (no date range or status check) and gets the hits count for each of them using a terms aggregation.
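A rough sketch of that two-query approach, assuming the field names ip, timestamp and status from the question and using now/d for the start of today (adjust to ip.keyword etc. depending on your mapping):

First query: collect the IPs that have hits today
GET my-index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "range": { "timestamp": { "gte": "now/d" } } },
        { "terms": { "status": [200, 404] } }
      ]
    }
  },
  "aggs": {
    "ips_today": { "terms": { "field": "ip", "size": 99999 } }
  }
}

Second query: total hits per IP over the whole index, restricted to the IPs returned above
GET my-index/_search
{
  "size": 0,
  "query": {
    "terms": { "ip": ["123.123.123.123", "123.123.123.125"] }
  },
  "aggs": {
    "all_time_hits_per_ip": { "terms": { "field": "ip", "size": 99999 } }
  }
}

Note that a terms query accepts at most 65,536 values by default, so this only works if the number of IPs seen today stays below that limit.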

In the example you have shared, you have a query and your documents are filtered according to it. But you want your aggregation to consider all documents, regardless of the query.
This is why the global option exists: it defines a single bucket of all the documents within the search execution context. This context is defined by the indices and the document types you're searching on, but is not influenced by the search query itself.
Sample query:
{
  "query": {
    "match": { "type": "t-shirt" }
  },
  "aggs": {
    "all_products": {
      "global": {},
      "aggs": {
        "avg_price": { "avg": { "field": "price" } }
      }
    }
  }
}

Related

Use query result as parameter for another query in Elasticsearch DSL

I'm using Elasticsearch DSL and I'm trying to use a query result as a parameter for another query, like below:
{
  "query": {
    "bool": {
      "must_not": {
        "terms": {
          "request_id": {
            "query": {
              "match": {
                "processing.message": "OUT Followup Synthesis"
              }
            },
            "fields": [
              "request_id"
            ],
            "_source": false
          }
        }
      }
    }
  }
}
As you can see above, I'm trying to search for documents whose request_id is not one of the request_ids with processing.message equal to OUT Followup Synthesis.
I'm getting an error with this query:
Error loading data [x_content_parse_exception] [1:1660] [terms_lookup] unknown field [query]
How can I achieve my goal using Elasticsearch DSL?
Original question extracted from the comments
I'm trying to fetch data with processing.message equal to 'IN Followup Sythesis' whose request_id doesn't appear in data with processing.message equal to 'OUT Followup Sythesis'. In SQL terms:
SELECT d FROM data d
WHERE d.processing.message = 'IN Followup Sythesis'
AND d.request_id NOT IN (SELECT request_id FROM data WHERE processing.message = 'OUT Followup Sythesis');
Answer: generally speaking, joins and subqueries are not supported in Elasticsearch.
So you'll have to do an application-side join: run your first query, take the retrieved IDs, and put them into a second query, ideally a terms query.
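A minimal sketch of that two-step approach, assuming request_id is indexed as a keyword field (it may need to be request_id.keyword depending on your mapping):

Step 1: collect the request_ids to exclude
POST reqs/_search
{
  "size": 0,
  "query": {
    "match": { "processing.message": "OUT Followup Synthesis" }
  },
  "aggs": {
    "ids_to_exclude": { "terms": { "field": "request_id", "size": 10000 } }
  }
}

Step 2: plug the collected keys into the real query
POST reqs/_search
{
  "query": {
    "bool": {
      "must": { "match": { "processing.message": "IN Followup Sythesis" } },
      "must_not": { "terms": { "request_id": ["abc"] } }
    }
  }
}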
Of course, this limitation can be overcome by "hijacking" a scripted metric aggregation.
Taking these 3 documents as examples:
POST reqs/_doc
{"request_id":"abc","processing":{"message":"OUT Followup Synthesis"}}
POST reqs/_doc
{"request_id":"abc","processing":{"message":"IN Followup Sythesis"}}
POST reqs/_doc
{"request_id":"xyz","processing":{"message":"IN Followup Sythesis"}}
you could run
POST reqs/_search
{
  "size": 0,
  "query": {
    "match": {
      "processing.message": "IN Followup Sythesis"
    }
  },
  "aggs": {
    "subquery_mock": {
      "scripted_metric": {
        "params": {
          "disallowed_msg": "OUT Followup Synthesis"
        },
        "init_script": "state.by_request_ids = [:]; state.disallowed_request_ids = [];",
        "map_script": """
          def req_id = params._source.request_id;
          def msg = params._source.processing.message;
          if (msg.contains(params.disallowed_msg)) {
            state.disallowed_request_ids.add(req_id);
            // won't need this particular doc so continue looping
            return;
          }
          if (state.by_request_ids.containsKey(req_id)) {
            // there may be multiple docs under the same ID
            // so concatenate them
            state.by_request_ids[req_id].add(params._source);
          } else {
            // initialize an appendable arraylist
            state.by_request_ids[req_id] = [params._source];
          }
        """,
        "combine_script": """
          state.by_request_ids.entrySet()
            .removeIf(entry -> state.disallowed_request_ids.contains(entry.getKey()));
          return state.by_request_ids
        """,
        "reduce_script": "return states"
      }
    }
  }
}
which'd return only the correct request:
"aggregations" : {
"subquery_mock" : {
"value" : [
{
"xyz" : [
{
"processing" : { "message" : "IN Followup Sythesis" },
"request_id" : "xyz"
}
]
}
]
}
}
⚠️ This is almost guaranteed to be slow and goes against the suggested guidance of not accessing the _source field. But it also goes to show that subqueries can be "emulated".
💡 I'd recommend to test this script on a smaller set of documents before letting it target your whole index — maybe restrict it through a date range query or similar.
FYI Elasticsearch exposes an SQL API, though it's only offered through X-Pack, a paid offering.

Elasticsearch: get documents only when value changes

I have an ES index with such kind of documents:
from_1,to_1,timestamp_1
from_1,to_1,timestamp_2
from_1,to_2,timestamp_3
from_2,to_3,timestamp_4
from_1,to_2,timestamp_5
from_2,to_3,timestamp_6
from_1,to_1,timestamp_7
from_2,to_4,timestamp_8
I need a query that returns a document only if its combination of from and to values is different from that of the previously seen document with the same from value.
So with the provided sample above:
document with timestamp_1 should be in the result because there is no earlier document with from_1+to_1 combination
document with timestamp_2 must be skipped because its from+to combination is exactly the same as the last seen document with from = from_1
document with timestamp_3 should be in the result because its to field (to_2) is different from the value of the last seen document with the same from (to_1 in the document with timestamp_1)
document with timestamp_4 should be in the result
document with timestamp_5 must not be in the result because it has the same combination of from+to as the last seen with from_1 (document with timestamp_3)
document with timestamp_6 must not be in the result because it has the same combination of from+to as the last seen with from_2 (document with timestamp_4)
document with timestamp_7 should be in the result because it has a different combination of from+to than the last seen with from_1 (document with timestamp_3)
document with timestamp_8 should be in the result because its combination is completely new so far
I need to fetch all such "semi-unique" documents from the index, so it would be nice if it were possible to use a scroll request, or after_key if an aggregation is used.
Any ideas how to approach it?
The closest thing I could come up with is the following (let me know if it does not work with your data).
{
  "size": 0,
  "aggs": {
    "from_and_to": {
      "composite": {
        "size": 5,
        "sources": [
          {
            "from_to_collected": {
              "terms": {
                "script": {
                  "lang": "painless",
                  "source": "doc['from'].value + '_' + doc['to'].value"
                }
              }
            }
          }
        ]
      },
      "aggs": {
        "top_from_and_to_hits": {
          "top_hits": {
            "size": 1,
            "sort": [{ "timestamp": { "order": "asc" } }],
            "_source": { "includes": ["_id"] }
          }
        }
      }
    }
  }
}
Keep in mind that the terms aggregation is probabilistic.
This will allow you to page to the next set of buckets over the from_to_collected key, using the after_key returned in each response.
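For example, to fetch the next page you would resubmit the same aggregation with an after parameter set to the after_key of the previous response (the value below is just illustrative):
{
  "size": 0,
  "aggs": {
    "from_and_to": {
      "composite": {
        "size": 5,
        "sources": [ ... same from_to_collected source as above ... ],
        "after": { "from_to_collected": "from_1_to_2" }
      },
      "aggs": {
        ... same top_from_and_to_hits aggregation as above ...
      }
    }
  }
}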

Elasticsearch Mutually Exclusive results

I have an Elasticsearch query with a condition that checks whether issoldout = false, and based on it I have a few Sum and Count aggregation fields.
However, if issoldout = false fetches no results, I would like to get the aggregation values for issoldout = true instead. Is there any way to get this done without a second search with issoldout = true?
You could literally submit two queries using _msearch as noted, but you could also just run them in parallel within the same request:
You can do this with the filter aggregation, which computes its sub-aggregations only over the documents matching the filter. Similarly, you could just use a terms aggregation on issoldout, but you would then also get the bucket for false (see the sketch after the query below).
{
  "query": {
    ... normal query ...
  },
  "aggs": {
    "group_by_soldout": {
      "filter": {
        "term": {
          "issoldout": true
        }
      },
      "aggs": {
        "stats_for_field": {
          "stats": {
            "field": "your_field"
          }
        }
      }
    }
  }
}
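A sketch of that terms variant, which returns the stats for both the true and false buckets in a single request (stats_for_field and your_field are placeholders, as above):
{
  "query": {
    ... normal query ...
  },
  "aggs": {
    "group_by_soldout": {
      "terms": {
        "field": "issoldout"
      },
      "aggs": {
        "stats_for_field": {
          "stats": {
            "field": "your_field"
          }
        }
      }
    }
  }
}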

How to do a minus operation on time-stamps in elasticsearch?

I have some server logs dumped into elasticsearch. The logs contain entries like 'action_id':'AU11nP1mYXS3pt6INMtU','action':'start','time':'March 31st 2015, 19:42:07.121' and 'action_id':'AU11nP1mYXS3pt6INMtU','action':'complete','time':'March 31st 2015, 23:06:00.271'. An identical action_id refers to a single action, and I'm interested in how long it took to complete an action.
I don't really know the Elasticsearch way of framing my question, but I'll try my best: how do I make an aggregation on 'action_id' based on a custom metric defined by the time span it took to go from 'action':'start' to 'action':'complete'?
I'm using kibana for visualization if that helps.
I looked at the example given for scripted metric aggregation and modified it for this problem:
{
  "aggs": {
    "actions": {
      "terms": {
        "field": "action_id"
      },
      "aggs": {
        "duration": {
          "scripted_metric": {
            "init_script": "_agg['delta'] = 0",
            "map_script": "if (doc['action'].value == \"complete\"){ _agg.delta += doc['time'].value } else {_agg.delta -= doc['time'].value}",
            "combine_script": "return _agg.delta",
            "reduce_script": "duration = 0; for (d in _aggs) { duration += d }; return duration"
          }
        }
      }
    }
  }
}
First it creates buckets for each action_id with a terms aggregation.
Then for each bucket it calculates a scripted metric.
In the map step it takes 'complete' timestamps as positive values and the others (i.e. 'start' ones) as negative values, per shard. In the combine step it just returns them. And in the reduce step it sums the durations for an action over all shards (as 'start' and 'complete' events could be on different shards) to get the actual duration.
I'm not sure about the performance of this aggregation, but you can try it out on your dataset. And please note that it is still marked as experimental functionality.
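Note that the _agg/_aggs syntax above is from older Elasticsearch versions; recent versions use state and states in scripted metrics, so the same idea would look roughly like this (a sketch, assuming action is a keyword field and time is a date field):
{
  "aggs": {
    "actions": {
      "terms": { "field": "action_id" },
      "aggs": {
        "duration": {
          "scripted_metric": {
            "init_script": "state.delta = 0L",
            "map_script": "long t = doc['time'].value.toInstant().toEpochMilli(); if (doc['action'].value == 'complete') { state.delta += t; } else { state.delta -= t; }",
            "combine_script": "return state.delta",
            "reduce_script": "long duration = 0; for (d in states) { duration += d; } return duration;"
          }
        }
      }
    }
  }
}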
It looks like Elasticsearch is not designed to calculate time durations directly; this kind of task is usually handled with Logstash at ingest time instead.
https://www.elastic.co/guide/en/logstash/current/plugins-filters-elasticsearch.html
if [action] == "complete" {
elasticsearch {
hosts => ["es-server"]
query => "action:start AND action_id:%{[action_id]}"
fields => ["time", "started"]
}
date {
match => ["[started]", "ISO8601"]
target => "[started]"
}
ruby {
code => "event['duration_hrs'] = (event['#timestamp'] - event['started']) / 3600 rescue nil"
}
}

Elastic Search filter with aggregate like Max or Min

I have simple documents with a scheduleId. I would like to get the count of documents for the most recent ScheduleId. Assuming the max ScheduleId is the most recent, how would we write that query? I have been searching and reading for a few hours and couldn't get it to work.
{
  "aggs": {
    "max_schedule": {
      "max": {
        "field": "ScheduleId"
      }
    }
  }
}
That is getting me the max ScheduleId and the total count of documents outside of that aggregate.
I would appreciate it if someone could help me with how to take this aggregate value and apply it as a filter (like a subquery in SQL!).
This should do it:
{
  "aggs": {
    "max_ScheduleId": {
      "terms": {
        "field": "ScheduleId",
        "order": { "_term": "desc" },
        "size": 1
      }
    }
  }
}
The terms aggregation will give you document counts for each term, and it works for integers. You just need to order the results by the term instead of by the count (the default). And since you only want the highest ScheduleID, "size":1 is adequate.
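The relevant part of the response then looks roughly like this (the key and counts are illustrative):
"aggregations": {
  "max_ScheduleId": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 42,
    "buckets": [
      {
        "key": 1017,
        "doc_count": 96
      }
    ]
  }
}
The doc_count of the single bucket is the number of documents for the highest ScheduleId.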
Here is the code I used to test it:
http://sense.qbox.io/gist/93fb979393754b8bd9b19cb903a64027cba40ece
