Elasticsearch - find IPs from which only anonymous requests came - elasticsearch

I have network logs in my Elasticsearch. Each log has a username and an IP field. Something like this:
{"username":"user1", "ip": "1.2.3.4"}
{"username":"anonymous", "ip": "1.2.3.4"}
{"username":"anonymous", "ip": "2.3.4.5"}
{"username":"user2", "ip": "3.4.5.6"}
I have a seemingly simple task: list all IPs from which only anonymous requests came. The problem is that I cannot simply filter for anonymous, because then I'd also list IPs that appear with anonymous but not exclusively. Manually I can do this with a three-step process:
List all unique IPs
List unique IPs that appear with a username other than anonymous
Exclude the items of the second list from the first.
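The three-step set logic above can be sketched in plain Python (with dicts standing in for the indexed documents) to make the intended result concrete:

```python
# Sketch of the manual three-step process using Python sets,
# with plain dicts standing in for the indexed documents.
logs = [
    {"username": "user1", "ip": "1.2.3.4"},
    {"username": "anonymous", "ip": "1.2.3.4"},
    {"username": "anonymous", "ip": "2.3.4.5"},
    {"username": "user2", "ip": "3.4.5.6"},
]

all_ips = {log["ip"] for log in logs}                     # step 1: all unique IPs
non_anon_ips = {log["ip"] for log in logs
                if log["username"] != "anonymous"}        # step 2: IPs with a non-anonymous user
anonymous_only = all_ips - non_anon_ips                   # step 3: set difference

print(anonymous_only)  # {'2.3.4.5'}
```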
But is there a way to do this with a single ES query? My first instinct was to use a bool query. My current approach is this:
GET /sample1/_search
{
  "query": {
    "bool": {
      "must": {
        "wildcard": {
          "ip": "*"
        }
      },
      "must_not": {
        "term": {
          "username": "-anonymous"
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "ips": {
      "terms": {
        "field": "ip.keyword"
      }
    }
  }
}
I expect "2.3.4.5", but it returns all 3 unique IPs. I searched the web and tried different query types for hours. Any ideas?

Please find below the mapping, sample documents, the query for your scenario, and the response:
Mapping:
PUT my_ip_index
{
  "mappings": {
    "properties": {
      "user": {
        "type": "keyword"
      },
      "ip": {
        "type": "ip"
      }
    }
  }
}
Documents:
POST my_ip_index/_doc/1
{
  "user": "user1",
  "ip": "1.2.3.4"
}
POST my_ip_index/_doc/2
{
  "user": "anonymous",
  "ip": "1.2.3.4"
}
POST my_ip_index/_doc/3
{
  "user": "anonymous",
  "ip": "2.3.4.5"
}
POST my_ip_index/_doc/4
{
  "user": "user2",
  "ip": "3.4.5.6"
}
Aggregation Query:
POST my_ip_index/_search
{
  "size": 0,
  "aggs": {
    "my_valid_ips": {
      "terms": {
        "field": "ip",
        "size": 10
      },
      "aggs": {
        "valid_users": {
          "terms": {
            "field": "user",
            "size": 10,
            "include": "anonymous"
          }
        },
        "min_bucket_selector": {
          "bucket_selector": {
            "buckets_path": {
              "valid_users_count": "valid_users._bucket_count",
              "my_valid_ips_count": "_count"
            },
            "script": {
              "source": "params.valid_users_count == 1 && params.my_valid_ips_count == 1"
            }
          }
        }
      }
    }
  }
}
Note how I've made use of a Terms Aggregation and a Bucket Selector Aggregation in the above query.
The include parameter in the terms sub-aggregation restricts it to the anonymous user. The logic in the bucket_selector then keeps an IP bucket only if it has a single document in the top-level terms aggregation (e.g. 2.3.4.5) and a single bucket in the second-level terms aggregation.
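To see what the bucket_selector keeps, here is a plain-Python simulation (an illustrative sketch, not Elasticsearch itself) of the two-level terms aggregation plus the selector script, using the sample documents above:

```python
# Simulate the two-level terms aggregation and the bucket_selector script.
from collections import defaultdict

docs = [
    {"user": "user1", "ip": "1.2.3.4"},
    {"user": "anonymous", "ip": "1.2.3.4"},
    {"user": "anonymous", "ip": "2.3.4.5"},
    {"user": "user2", "ip": "3.4.5.6"},
]

buckets = defaultdict(list)          # top-level terms agg on "ip"
for d in docs:
    buckets[d["ip"]].append(d["user"])

selected = []
for ip, users in buckets.items():
    doc_count = len(users)                          # "_count" (my_valid_ips_count)
    # "include": "anonymous" keeps only the anonymous bucket, so the
    # sub-aggregation bucket count is 1 exactly when an anonymous doc exists.
    valid_users_count = 1 if "anonymous" in users else 0
    if valid_users_count == 1 and doc_count == 1:   # the bucket_selector script
        selected.append(ip)

print(selected)  # ['2.3.4.5']
```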
Response:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_valid_ips" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "2.3.4.5", <---- Expected IP/Answer
          "doc_count" : 1,
          "valid_users" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "anonymous",
                "doc_count" : 1
              }
            ]
          }
        }
      ]
    }
  }
}
Hope it helps!

Related

Elasticsearch query for finding and grouping objects by a field, and returning the unique values of another field

I have this kind of data (irrelevant fields omitted for simplicity's sake):
{
  "endpoint": "endpoint_1",
  "user_id": 1,
  "session": "value2",
  ...
}
{
  "endpoint": "endpoint_2",
  "user_id": 1,
  "session": "value3",
  ...
}
{
  "endpoint": "endpoint_2",
  "user_id": 2,
  "session": "value2",
  ...
}
{
  "endpoint": "endpoint_3",
  "user_id": 3,
  "session": "value3",
  ...
}
I want to find all users sharing at least one session, BUT only if they're in specific endpoints. I'm struggling to build a query that finds what I want, because the documentation really sucks.
This is what I have so far, created after painstakingly trying to figure out what the hell the documentation was talking about, but it seems overly complex and wrong:
{
  "query": {
    "bool": {
      "should": [
        {"match_phrase": {"endpoint": "endpoint_1"}},
        {"match_phrase": {"endpoint": "endpoint_2"}},
        {"match_phrase": {"endpoint": "endpoint_4"}},
        {"match_phrase": {"endpoint": "endpoint_6"}},
        {"match_phrase": {"endpoint": "endpoint_11"}}
      ],
      "minimum_should_match": 1
    }
  },
  "size": 0,
  "aggregations": {
    "shared_sessions": {
      "terms": {
        "size": 1000,
        "field": "session",
        "order": {
          "users": "desc"
        }
      },
      "aggregations": {
        "users": {
          "cardinality": {
            "field": "user_id",
            "precision_threshold": 100
          }
        },
        "minimum": {
          "bucket_selector": {
            "buckets_path": {
              "var1": "users"
            },
            "script": "params.var1 > 1"
          }
        },
        "aggregations": {
          "terms": {
            "field": "user_id",
            "size": 1000
          }
        }
      }
    }
  }
}
This manages to find the users; however, it returns hard-to-parse results and makes further alterations prohibitive. It returns something like this:
{
"aggregations" : {
"shared_sessions" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 29740,
"buckets" : [
{
"key" : "abcdefg123456", # session
"doc_count" : 6,
"aggregations" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 1, # user_id
"doc_count" : 4
},
{
"key" : 2, # user_id
"doc_count" : 1
},
{
"key" : 3, # user_id
"doc_count" : 1
}
]
},
"users" : {
"value" : 3
}
}
}
}
So, is there a better way to do what I want?
Ideally I'd like to return arbitrary results from the object (including ones I've excluded from this answer for simplicity's sake).

Elasticsearch: Tricky aggregation with sum and comparison

I am trying to pull statistics from my Elasticsearch cluster that I cannot figure out.
In the end what I want to achieve is a count of streams (field: status) over time (field: timestamp) for a specific item (field: media).
The data are logs from nginx with anonymized IPs (field: ip_hash) and user agents (field: http_user_agent). To get a valid count I need to sum up the bytes transferred (field: bytes_sent) and compare that sum to a minimum threshold (an integer), considering the same IP and user agent. A stream is only valid (only counts) if at least XY bytes of it have been transferred in total.
"_source": {
  "media": "my-stream.001",
  "http_user_agent": "Spotify/8.4.44 Android/29 (SM-T535)",
  "ip_hash": "fcd2653c44c1d8e33ef5d58ac5a33c2599b68f05d55270a8946166373d79a8212a49f75bcf3f71a62b9c71d3206c6343430a9ebec9c062a0b308a48838161ce8",
  "timestamp": "2022-02-05 01:32:23.941",
  "bytes_sent": 4893480,
  "status": 206
}
Where I am having trouble is summing up the transferred bytes per unique user-agent/IP-hash combination and comparing that sum to the threshold.
Any pointers on how I could solve this are appreciated. Thank you!
So far I got this:
GET /logdata_*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "timestamp": {
              "gte": "now-1w/d",
              "lt": "now/d"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "status206": {
      "filter": {
        "term": {
          "status": "206"
        }
      },
      "aggs": {
        "medias": {
          "terms": {
            "field": "media",
            "size": 10
          },
          "aggs": {
            "ips": {
              "terms": {
                "field": "ip_hash",
                "size": 10
              },
              "aggs": {
                "clients": {
                  "terms": {
                    "field": "http_user_agent",
                    "size": 10
                  },
                  "aggs": {
                    "transferred": {
                      "sum": {
                        "field": "bytes_sent"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
Which gives something like this:
{
"took" : 1563,
"timed_out" : false,
"_shards" : {
"total" : 12,
"successful" : 12,
"skipped" : 8,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"status206" : {
"doc_count" : 1307130,
"medias" : {
"doc_count_error_upper_bound" : 7612,
"sum_other_doc_count" : 1163149,
"buckets" : [
{
"key" : "20220402_ETD_Podcast_2234_Eliten_-_VD_Hanson.mp3",
"doc_count" : 21772,
"ips" : {
"doc_count_error_upper_bound" : 12,
"sum_other_doc_count" : 21574,
"buckets" : [
{
"key" : "ae55a10beda61afd3641fe2a6ca8470262d5a0c07040d3b9b8285ea1a4dba661a0502a7974dc5a4fecbfbbe5b7c81544cdcea126271533e724feb3d7750913a5",
"doc_count" : 38,
"clients" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Deezer/7.0.0.xxx (Android; 10; Mobile; de) samsung SM-G960F",
"doc_count" : 38,
"transferred" : {
"value" : 7582635.0
}
}
]
}
},
{
"key" : "60082e96eb57c4a8b7962dc623ef7446fbc08cea676e75c4ff94ab5324dec93a6db1848d45f6dcc6e7acbcb700bb891cf6bee66e1aa98fc228107104176734ff",
"doc_count" : 37,
"clients" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Deezer/7.0.0.xxx (Android; 12; Mobile; de) samsung SM-N770F",
"doc_count" : 36,
"transferred" : {
"value" : 7252448.0
}
},
{
"key" : "Mozilla/5.0 (Linux; Android 11; RMX2063) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.58 Mobile Safari/537.36",
"doc_count" : 1,
"transferred" : {
"value" : 843367.0
}
}
]
}
},
Now I would need to check that "transferred" is greater than or equal to the threshold, and each combination that passes would count as one stream. In the end I need the count of all applicable streams.
You can try the following:
GET _search?filter_path=aggregations.valid_streams.count
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "timestamp": {
              "gte": "now-1w/d",
              "lt": "now/d"
            }
          }
        },
        {
          "match": {
            "status": "206"
          }
        }
      ]
    }
  },
  "aggs": {
    "streams": {
      "multi_terms": {
        "size": 65536,
        "terms": [
          { "field": "media" },
          { "field": "ip_hash" },
          { "field": "http_user_agent" }
        ]
      },
      "aggs": {
        "transferred": {
          "sum": {
            "field": "bytes_sent"
          }
        },
        "threshold": {
          "bucket_selector": {
            "buckets_path": {
              "total": "transferred"
            },
            "script": "params.total > 12345"
          }
        }
      }
    },
    "valid_streams": {
      "stats_bucket": {
        "buckets_path": "streams>transferred"
      }
    }
  }
}
Explanation:
streams - A combined terms aggregation (multi_terms), since every change in one of these fields should be counted as a new stream. This is mainly for readability; change it if it doesn't fit your logic.
transferred - A sum aggregation that sums up the sent bytes.
threshold - A bucket_selector aggregation that filters out the streams that didn't reach the XY threshold.
valid_streams - A stats_bucket aggregation that returns a count field containing the number of buckets, i.e. the number of valid streams. As a bonus, it also gives you stats about your valid streams (e.g. average bytes).
The filter_path query parameter reduces the response to just the aggregation output.
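The grouping, summing, and threshold filtering can be sketched in plain Python (not Elasticsearch; the log entries and the 12345 threshold below are made-up illustrative values):

```python
# Sketch of the multi_terms + sum + bucket_selector logic:
# group by (media, ip_hash, http_user_agent), sum bytes_sent per group,
# keep groups above the threshold, then count the surviving streams.
from collections import defaultdict

THRESHOLD = 12345  # placeholder value, as in the example script above

logs = [  # hypothetical log entries for illustration
    {"media": "m1", "ip_hash": "a", "http_user_agent": "ua1", "bytes_sent": 7000},
    {"media": "m1", "ip_hash": "a", "http_user_agent": "ua1", "bytes_sent": 6000},
    {"media": "m1", "ip_hash": "b", "http_user_agent": "ua2", "bytes_sent": 900},
]

transferred = defaultdict(int)
for log in logs:
    key = (log["media"], log["ip_hash"], log["http_user_agent"])
    transferred[key] += log["bytes_sent"]

valid_streams = [k for k, total in transferred.items() if total > THRESHOLD]
print(len(valid_streams))  # 1 valid stream (7000 + 6000 > 12345)
```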

Count number of inner elements of array property (Including repeated values)

Given I have the following records.
[
  {
    "profile": "123",
    "inner": [
      { "name": "John" }
    ]
  },
  {
    "profile": "456",
    "inner": [
      { "name": "John" },
      { "name": "John" },
      { "name": "James" }
    ]
  }
]
I want to get something like:
"aggregations": {
  "name": {
    "buckets": [
      {
        "key": "John",
        "doc_count": 3
      },
      {
        "key": "James",
        "doc_count": 1
      }
    ]
  }
}
I'm a beginner with Elasticsearch, and this seems like a pretty simple operation, but I can't find how to achieve it.
If I try a simple terms aggregation, it returns 2 for John instead of 3.
Example request I'm trying:
{
  "size": 0,
  "aggs": {
    "name": {
      "terms": {
        "field": "inner.name"
      }
    }
  }
}
How can I possibly achieve this?
Additional Info: It will be used on Kibana later.
I can change mapping to whatever I want, but AFAIK Kibana doesn't like the "Nested" type. :(
You need to add a value_count aggregation: by default, terms only reports doc_count (documents containing the term), whereas value_count counts the number of times the field's values occur.
So, for your purposes:
{
  "size": 0,
  "aggs": {
    "name": {
      "terms": {
        "field": "inner.name"
      },
      "aggs": {
        "total": {
          "value_count": {
            "field": "inner.name"
          }
        }
      }
    }
  }
}
Which returns:
"aggregations" : {
  "name" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : "John",
        "doc_count" : 2,
        "total" : {
          "value" : 3
        }
      },
      {
        "key" : "James",
        "doc_count" : 1,
        "total" : {
          "value" : 2
        }
      }
    ]
  }
}
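The doc_count/value_count distinction can be sanity-checked with a plain-Python sketch (not Elasticsearch) over the sample records from the question:

```python
# doc_count counts documents containing the term at least once;
# value_count counts every occurrence of the field across documents.
docs = [
    {"profile": "123", "inner": [{"name": "John"}]},
    {"profile": "456", "inner": [{"name": "John"}, {"name": "John"}, {"name": "James"}]},
]

doc_count = {}
value_count = {}
for doc in docs:
    names = [i["name"] for i in doc["inner"]]
    for name in set(names):      # per-document presence -> doc_count
        doc_count[name] = doc_count.get(name, 0) + 1
    for name in names:           # every occurrence -> value_count
        value_count[name] = value_count.get(name, 0) + 1

print(doc_count)    # {'John': 2, 'James': 1}
print(value_count)  # {'John': 3, 'James': 1}
```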

How to get multiple fields returned in elasticsearch query?

How can I get multiple unique fields returned in a single Elasticsearch query?
Many of my documents have duplicate name and job fields. I would like a query that returns all the unique values, including both name and job in the same response so they stay tied together.
[
  {
    "name": "albert",
    "job": "teacher",
    "dob": "11/22/91"
  },
  {
    "name": "albert",
    "job": "teacher",
    "dob": "11/22/91"
  },
  {
    "name": "albert",
    "job": "teacher",
    "dob": "11/22/91"
  },
  {
    "name": "justin",
    "job": "engineer",
    "dob": "1/2/93"
  },
  {
    "name": "justin",
    "job": "engineer",
    "dob": "1/2/93"
  },
  {
    "name": "luffy",
    "job": "rubber man",
    "dob": "1/2/99"
  }
]
Expected result (in any format); I was trying to use aggs, but I only get one field:
[
  {
    "name": "albert",
    "job": "teacher"
  },
  {
    "name": "justin",
    "job": "engineer"
  },
  {
    "name": "luffy",
    "job": "rubber man"
  }
]
This is what I tried so far
GET name.test.index/_search
{
  "size": 0,
  "aggs": {
    "name": {
      "terms": {
        "field": "name.keyword"
      }
    }
  }
}
Using the above query gets me this, which is good since it's unique:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 95,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "name" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Justin",
          "doc_count" : 56
        },
        {
          "key" : "Luffy",
          "doc_count" : 31
        },
        {
          "key" : "Albert",
          "doc_count" : 8
        }
      ]
    }
  }
}
I tried a nested aggregation, but that did not work. Is there an alternative solution for getting multiple unique values, or am I missing something?
That's a good start! There are a few ways to achieve what you want; each provides a different response format, so you can pick the one you prefer.
The first option is to leverage the top_hits sub-aggregation and return the two fields for each name bucket:
GET name.test.index/_search
{
  "size": 0,
  "aggs": {
    "name": {
      "terms": {
        "field": "name.keyword"
      },
      "aggs": {
        "top": {
          "top_hits": {
            "_source": [
              "name",
              "job"
            ],
            "size": 1
          }
        }
      }
    }
  }
}
The second option is to use a script in your terms aggregation instead of a field to return a compound value:
GET name.test.index/_search
{
  "size": 0,
  "aggs": {
    "name": {
      "terms": {
        "script": "doc['name'].value + ' - ' + doc['job'].value"
      }
    }
  }
}
The third option is to use two levels of field collapsing:
GET name.test.index/_search
{
  "collapse": {
    "field": "name",
    "inner_hits": {
      "name": "by_job",
      "collapse": {
        "field": "job"
      },
      "size": 1
    }
  }
}
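For comparison, the unique (name, job) pairs that all three options aim to surface can be computed with a quick Python sketch (plain Python, not Elasticsearch), mirroring option 2's compound-key idea:

```python
# Mirrors option 2's script: build a compound key so each distinct
# name/job combination forms one bucket.
docs = [
    {"name": "albert", "job": "teacher", "dob": "11/22/91"},
    {"name": "albert", "job": "teacher", "dob": "11/22/91"},
    {"name": "justin", "job": "engineer", "dob": "1/2/93"},
    {"name": "luffy", "job": "rubber man", "dob": "1/2/99"},
]

unique = sorted({(d["name"], d["job"]) for d in docs})
print(unique)  # [('albert', 'teacher'), ('justin', 'engineer'), ('luffy', 'rubber man')]
```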

Elasticsearch aggregations: how to get bucket with 'other' results of terms aggregation?

I use an aggregation to collect data from a nested field and got stuck a little.
Example of document:
{
  ...
  "rectangle": {
    "attributes": [
      { "_id": "some_id", ... }
    ]
  }
}
ES allows grouping data by rectangle.attributes._id, but is there any way to get an 'other' bucket that collects the documents that were not added to any of the groups? Or maybe there is a way to write a query that creates a bucket for documents matching {"rectangle.attributes._id": {$ne: "{currentDoc}.rectangle.attributes._id"}}?
I think a bucket would be perfect because I need to run further aggregations on the 'other' docs.
Or maybe there's some cool workaround.
I use a query like this for the aggregation:
"aggs": {
  "attributes": {
    "nested": {
      "path": "rectangle.attributes"
    },
    "aggs": {
      "attributesCount": {
        "cardinality": {
          "field": "rectangle.attributes._id.keyword"
        }
      },
      "entries": {
        "terms": {
          "field": "rectangle.attributes._id.keyword"
        }
      }
    }
  }
}
And get this result
"buckets" : [
  {
    "key" : "some_parent_id",
    "doc_count" : 27616,
    "attributes" : {
      "doc_count" : 45,
      "entries" : {
        "doc_count_error_upper_bound" : 0,
        "sum_other_doc_count" : 0,
        "buckets" : [
          {
            "key" : "some_id",
            "doc_count" : 45,
            "attributeOptionsCount" : {
              "value" : 2
            }
          }
        ]
      }
    }
  }
]
A result like this would be perfect:
"buckets" : [
  {
    "key" : "some_parent_id",
    "doc_count" : 1000,
    "attributes" : {
      "doc_count" : 145,
      "entries" : {
        "doc_count_error_upper_bound" : 0,
        "sum_other_doc_count" : 0,
        "buckets" : [
          {
            "key" : "some_id",
            "doc_count" : 45
          },
          {
            "key" : "other",
            "doc_count" : 100
          }
        ]
      }
    }
  }
]
You can make use of the missing value parameter. Update the aggregation as below:
"aggs": {
  "attributes": {
    "nested": {
      "path": "rectangle.attributes"
    },
    "aggs": {
      "attributesCount": {
        "cardinality": {
          "field": "rectangle.attributes._id.keyword"
        }
      },
      "entries": {
        "terms": {
          "field": "rectangle.attributes._id.keyword",
          "missing": "other"
        }
      }
    }
  }
}
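What the missing parameter does can be sketched in plain Python (not Elasticsearch; the sample attribute entries below are illustrative):

```python
# Sketch of the "missing" parameter: documents whose _id field is absent
# fall into an "other" bucket instead of being dropped from the terms agg.
from collections import Counter

attributes = [  # hypothetical nested attribute entries
    {"_id": "some_id"},
    {"_id": "some_id"},
    {},             # no _id -> counted under "other"
    {},
]

buckets = Counter(a.get("_id", "other") for a in attributes)
print(dict(buckets))  # {'some_id': 2, 'other': 2}
```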
