Aggregate objects in Elasticsearch by IP Prefix

I have an Elasticsearch index where I store internet traffic flow objects, with each object containing an IP address. I want to aggregate the data so that all objects with the same IP prefix are collected in the same bucket (without specifying the prefixes up front), something like a histogram aggregation. Is this possible?
I have tried this:
GET flows/_search
{
  "size": 0,
  "aggs": {
    "ip_ranges": {
      "histogram": {
        "field": "ipAddress",
        "interval": 256
      }
    }
  }
}
But this doesn't work, probably because histogram aggregations aren't supported for ip type fields. How would you go about doing this?

Firstly, as suggested here, the best approach would be to:
categorize the IP address at index time and then use a simple keyword field to store the class C information, and then use a terms aggregation on that field to do the count.
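For illustration, a minimal sketch of that index-time categorization using an ingest pipeline (the pipeline name ip-prefix and the target field ipPrefix are made up for this example; ipPrefix would be mapped as keyword):
PUT _ingest/pipeline/ip-prefix
{
  "processors": [
    {
      "script": {
        "source": "String ip = ctx.ipAddress; ctx.ipPrefix = ip.substring(0, ip.lastIndexOf('.'));"
      }
    }
  ]
}
Index your flows with ?pipeline=ip-prefix (or set it as the index's default_pipeline), and the count becomes a plain terms aggregation:
GET flows/_search
{
  "size": 0,
  "aggs": {
    "by_prefix": {
      "terms": {
        "field": "ipPrefix",      <-- made-up keyword field populated by the pipeline
        "size": 100
      }
    }
  }
}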
Alternatively, you could simply add a multi-field keyword mapping:
PUT myindex
{
  "mappings": {
    "properties": {
      "ipAddress": {
        "type": "ip",
        "fields": {
          "keyword": {            <-- the added multi-field
            "type": "keyword"
          }
        }
      }
    }
  }
}
and then extract the prefix at query time (⚠️ highly inefficient!):
GET myindex/_search
{
  "size": 0,
  "aggs": {
    "my_prefixes": {
      "terms": {
        "script": "/\\./.split(doc['ipAddress.keyword'].value)[0]",
        "size": 10
      }
    }
  }
}
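Note that the script above buckets on the first octet only. If you're after the class C (/24) prefix, the same (equally inefficient) trick would look like this, as a sketch, still assuming the keyword multi-field from above:
GET myindex/_search
{
  "size": 0,
  "aggs": {
    "my_prefixes": {
      "terms": {
        "script": "String ip = doc['ipAddress.keyword'].value; return ip.substring(0, ip.lastIndexOf('.'));",
        "size": 10
      }
    }
  }
}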
As a final option, you could define the intervals of interest in advance and use an ip_range aggregation:
{
  "size": 0,
  "aggs": {
    "my_ip_ranges": {
      "ip_range": {
        "field": "ipAddress",
        "ranges": [
          { "to": "192.168.1.1" },
          { "from": "192.168.1.1" }
        ]
      }
    }
  }
}
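The ranges also accept CIDR masks, which read more naturally when your intervals are actual prefixes:
{
  "size": 0,
  "aggs": {
    "my_ip_ranges": {
      "ip_range": {
        "field": "ipAddress",
        "ranges": [
          { "mask": "192.168.0.0/24" },
          { "mask": "192.168.1.0/24" }
        ]
      }
    }
  }
}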

Related

Composite and Terms Aggregations on a field with a high cardinality

I am facing a huge performance problem with ES, with responses taking more than 2 minutes.
I have an index with more than 25M documents (files) which includes, among others, the following 4 fields:
...
"group_write": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"user_write": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"group_read": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"user_read": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
...
I have something like 100K unique users and groups and each field is a list of users/groups that holds ~100 values. For example:
"user_read": ["user_1", "group_1", ...],
"user_write": ["user_1", "group_2", ...]
...
I am using two kinds of aggregations: composite and terms. Composite aggregations to get only the first X results to display, and terms aggregations for prefix search.
Composite aggregation:
{
  "size": 0,
  "aggs": {
    "Group_Read_Permissions": {
      "composite": {
        "sources": [
          {
            "Group Read": {
              "terms": {
                "field": "group_read.raw"
              }
            }
          }
        ],
        "size": 10
      }
    },
    "Group_Write_Permissions": {
      "composite": {
        "sources": [
          {
            "Group Write": {
              "terms": {
                "field": "group_write.raw"
              }
            }
          }
        ]
      }
    },
    "User_Write_Permissions": {
      "composite": {
        "sources": [
          {
            "User Write": {
              "terms": {
                "field": "user_write.raw"
              }
            }
          }
        ]
      }
    },
    "User_Read_Permissions": {
      "composite": {
        "sources": [
          {
            "User Read": {
              "terms": {
                "field": "user_read.raw"
              }
            }
          }
        ]
      }
    }
  }
}
Terms aggregation:
{
  "size": 0,
  "aggs": {
    "Group_Read_Permissions": {
      "terms": {
        "field": "group_read.raw",
        "include": ".*[Ss].*"
      }
    },
    "Group Write Permissions": {
      "terms": {
        "field": "group_write.raw",
        "include": ".*[Ss].*"
      }
    },
    "User Read Permissions": {
      "terms": {
        "field": "user_read.raw",
        "include": ".*[Ss].*"
      }
    },
    "User Write Permissions": {
      "terms": {
        "field": "user_write.raw",
        "include": ".*[Ss].*"
      }
    }
  }
}
Composite aggregation returns results within 1 min and the terms aggregation can take up to 5 min.
What I have tried so far:
Adding a new field user_group_permissions and adding "copy_to": "user_group_permissions" to the above 4 fields
Adding the property "eager_global_ordinals": true to the above 4 fields and to the field "user_group_permissions"
Increasing the refresh_interval up to 200s
** I reindexed for the first 2 suggestions [took something like 6 hours]
All of the above did help a little with the retrieval time, but still: the composite aggregation takes up to 20s and the terms aggregation takes up to 3 min.
[The best results were on the field user_group_permissions created in the first suggestion, with eager_global_ordinals = true and refresh_interval = 120s.]
Please, if someone has any idea how to improve the retrieval times I will be grateful.
First of all, if you only need the first 10 results, you don't need to use the composite aggregation, which is meant to be used only if you need to paginate over all results. Simply use the terms aggregation with the default size of 10; that'll do the job.
Second, what you're doing with the terms aggregation is not prefix filtering but infix filtering, which is completely different in terms of performance. While it's easy to search for prefixes, searching for infixes requires the equivalent of a "full table scan", because each and every term must be visited.
A first optimization I would suggest is that in your second query you do the regex in the query part (a bool/should with one regexp query per field), so as to reduce the document set on which the terms aggregations need to run. That might help a bit.
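A sketch of what that could look like (only one aggregation shown; the other three fields follow the same pattern, and the include stays on the aggregations so the buckets themselves remain filtered):
{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        { "regexp": { "group_read.raw": ".*[Ss].*" } },
        { "regexp": { "group_write.raw": ".*[Ss].*" } },
        { "regexp": { "user_read.raw": ".*[Ss].*" } },
        { "regexp": { "user_write.raw": ".*[Ss].*" } }
      ],
      "minimum_should_match": 1
    }
  },
  "aggs": {
    "Group_Read_Permissions": {
      "terms": {
        "field": "group_read.raw",
        "include": ".*[Ss].*"
      }
    }
  }
}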
A second optimization is to leverage the wildcard field type which is a specialized field type made specially for grep-like wildcard and regexp queries.
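A sketch of such a mapping, shown for one field only (the wildcard type ships with the default distribution from 7.9 on, and switching to it requires a reindex):
"group_read": {
  "type": "text",
  "fields": {
    "raw": {
      "type": "keyword"
    },
    "wildcard": {
      "type": "wildcard"
    }
  }
}
Your regexp/wildcard queries would then target group_read.wildcard instead of group_read.raw.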
Another possible optimization is to lowercase all your permissions, so that you only need to search for .*s.* instead of the uppercase variant.
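That lowercasing can even happen at index time via a keyword normalizer, so the client doesn't have to do it. A sketch with a made-up normalizer name, again shown for one field only:
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_norm": {            <-- made-up name
          "type": "custom",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "group_read": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword",
            "normalizer": "lowercase_norm"
          }
        }
      }
    }
  }
}
With that in place, the include regex becomes .*s.* and the query-part regexps shrink accordingly.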
Depending on your comments, I'll add more optimizations as the discussion goes on.

Nested Fields, Wildcard Queries and Aggregations in Elasticsearch

I have an index that collects web redirects data for various sites. I am using a nested field to collect the data as shown in the mapping below:
"chain": {
"type": "nested",
"properties": {
"url.position": {
"type": "long"
},
"url.full": {
"type": "text"
},
"url.domain": {
"type": "keyword"
},
"url.path": {
"type": "keyword"
},
"url.query": {
"type": "text"
}
}
}
As you can imagine, each document contains an array of url chains, the size of the array being equal to the number of web redirects. I want to get aggregations based on wildcard/regexp matches on the url.query field. Here is a sample query:
GET push_url_chain/_search
{
  "query": {
    "nested": {
      "path": "chain",
      "query": {
        "regexp": {
          "chain.url.query": "aff_c.*"
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "dataFields": {
      "nested": {
        "path": "chain"
      },
      "aggs": {
        "offers": {
          "terms": {
            "field": "chain.url.domain",
            "size": 30
          }
        }
      }
    }
  }
}
The above query does produce aggregated results but not the way I want.
I want to see chain.url.domain aggregations only for the urls that contain the aff_c.* phrase. Right now it is looking at all the urls in the chain and then aggregating the buckets by doc_count, regardless of whether that url/domain matches the phrase. I hope I have been able to explain this clearly. How do I get the buckets to contain only domains whose url.query field matches aff_c.*?
I would also like to know how I can use = or / in my wildcard or regexp queries. I get no results if I use those symbols in my queries.
A nested query returns all documents where at least one nested document matches the condition; you get the matched nested docs only in inner_hits.
The aggregation is applied on top of those documents, so all domains in the chain end up in the terms buckets.
You need a filter inside the nested aggregation to get only the matching terms:
{
  "size": 0,
  "aggs": {
    "Name": {
      "nested": {
        "path": "chain"
      },
      "aggs": {
        "matched_doc": {
          "filter": {                          <-- filter for the url
            "match_phrase_prefix": {
              "chain.url.query": "abc"
            }
          },
          "aggs": {
            "domain": {
              "terms": {
                "field": "chain.url.domain",   <-- terms for matched urls
                "size": 10
              }
            }
          }
        }
      }
    }
  }
}
You can use match_phrase_prefix instead of regexp; it has better performance.
The standard analyzer removes "/" and "=" while generating tokens. So if you want to use regexp or wildcard queries and look for these characters, you need to use a keyword field, not a text field.
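For example, assuming you add a keyword sub-field to chain.url.query (the mapping above only defines text, so this is an assumption), a wildcard query can then match a literal = character:
GET push_url_chain/_search
{
  "query": {
    "nested": {
      "path": "chain",
      "query": {
        "wildcard": {
          "chain.url.query.keyword": "*aff_c=*"      <-- assumes a keyword sub-field exists
        }
      }
    }
  },
  "size": 0
}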

Search and aggregation on two indices

Two indices are created, each holding a date field.
First index mapping:
PUT /index_one
{
  "mappings": {
    "properties": {
      "date_start": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss.SSSZZ||yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}
Second index mapping:
PUT /index_two
{
  "mappings": {
    "properties": {
      "date_end": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss.SSSZZ||yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}
I need to find dates within a certain range and compute the average difference between the dates.
I tried a request like this:
GET /index_one,index_two/_search?scroll=1m&q=[2021-01-01+TO+2021-12-31]&filter_path=aggregations,hits.total.value,hits.hits
{
  "aggs": {
    "filtered_dates": {
      "filter": {
        "bool": {
          "must": [
            {
              "exists": {
                "field": "date_start"
              }
            },
            {
              "exists": {
                "field": "date_end"
              }
            }
          ]
        }
      },
      "aggs": {
        "avg_date": {
          "avg": {
            "script": {
              "lang": "painless",
              "source": "doc['date_end'].value.toInstant().toEpochMilli() - doc['date_begin'].value.toInstant().toEpochMilli()"
            }
          }
        }
      }
    }
  }
}
I get the following response to the request:
{
  "hits": {
    "total": {
      "value": 16508
    },
    "hits": [
      {
        "_index": "index_one",
        "_type": "_doc",
        "_id": "93a34c5b-101b-45ea-9965-96a2e0446a28",
        "_score": 1.0,
        "_source": {
          "date_begin": "2021-02-26 07:26:29.732+0300"
        }
      }
    ]
  },
  "aggregations": {
    "filtered_dates": {
      "meta": {},
      "doc_count": 0,
      "avg_date": {
        "value": null
      }
    }
  }
}
Can you please tell me if it is possible to make a query with search and aggregation over two indices in Elasticsearch? If so, how?
If you stored date_start on the document which contains date_end, it'd be much easier to figure out the average — check my answer to Store time related data in ElasticSearch.
Now, the script context operates on one single document at a time and has "no clue" about the other, potentially related docs. So if you don't store both dates at the same time in at least one doc, you'd need to somehow connect the docs nonetheless.
One option would be to use their ids:
POST index_one/_doc
{ "id":1, "date_start": "2021-01-01" }
POST index_two/_doc
{ "id":1, "date_end": "2021-12-31" }
POST index_one/_doc/2
{ "id":2, "date_start": "2021-01-01" }
POST index_two/_doc/2
{ "id":2, "date_end": "2021-01-31" }
After that, it's possible to:
Target multiple indices — as you already do.
Group the docs by their IDs and keep only the groups that include at least 2 docs (assuming one doc holds the start and the other one the end).
Obtain the min & max dates — essentially cherry-picking the date_start and date_end to be used later down the line.
Use a bucket_script aggregation to calculate their difference (in milliseconds).
Leverage a top-level average bucket aggregation to run over all the difference buckets and ... average them.
In concrete terms:
GET /index_one,index_two/_search?scroll=1m&q=[2021-01-01+TO+2021-12-31]&filter_path=aggregations,hits.total.value,hits.hits
{
  "aggs": {
    "grouped_by_id": {
      "terms": {
        "field": "id",
        "min_doc_count": 2,
        "size": 10
      },
      "aggs": {
        "min_date": {
          "min": {
            "field": "date_start"
          }
        },
        "max_date": {
          "max": {
            "field": "date_end"
          }
        },
        "diff": {
          "bucket_script": {
            "buckets_path": {
              "min": "min_date",
              "max": "max_date"
            },
            "script": "params.max - params.min"
          }
        }
      }
    },
    "avg_duration_across_the_board": {
      "avg_bucket": {
        "buckets_path": "grouped_by_id>diff",
        "gap_policy": "skip"
      }
    }
  }
}
If everything goes right, you'll end up with:
...
"aggregations" : {
  "grouped_by_id" : {
    ...
  },
  "avg_duration_across_the_board" : {
    "value" : 1.70208E10      <-- 17,020,800,000 milliseconds ~ 4,728 hrs
  }
}
⚠️ Caveat: note that the 2nd level terms aggregation has an adjustable size. You'll probably need to increase it to cover more docs. But there are theoretical and practical limits as to how far it makes sense to increase it.
📖 Shameless plug: this was inspired in part by the chapter Aggregations & Buckets in my recently published Elasticsearch Handbook — containing lots of other real-world, non-trivial examples 🙌

How to count the number of objects in a nested field in Elasticsearch?

How do I count the number of objects in a nested field in Elasticsearch?
Sample mapping :
"base_keywords": {
"type": "nested",
"properties": {
"base_key": {
"type": "text"
},
"category": {
"type": "text"
},
"created_at": {
"type": "date"
},
"date": {
"type": "date"
},
"rank": {
"type": "integer"
}
}
}
I would like to count the number of objects in the nested field 'base_keywords'.
You would need to do this with an inline script. This is what worked for me (using ES 6.x):
GET your-indices/_search
{
  "aggs": {
    "whatever": {
      "sum": {
        "script": {
          "inline": "params._source.base_keywords.size()"
        }
      }
    }
  }
}
Aggs are normally good for counting and grouping; for nested documents you can use nested aggs:
"aggs": {
"MyAggregation1": {
"terms": {
"field": "FieldA",
"size": 0
},
"aggs": {
"BaseKeyWords": {
"nested": { "path": "base_keywords" },
"aggs": {
"BaseKeys": {
"terms": {
"field": "base_keywords.base_key.keyword",
"size": 0
}
}
}
}
}
}
}
You don't specify what you want to count, but aggs are quite flexible for grouping and counting data.
The "doc_count" and "key" behave similar to an sql group by + count()
Updated (This assumes you have a .keyword field create the "keys" values, since a property of type "text" can't be aggregated or counted:
{
  "aggs": {
    "MyKeywords1Agg": {
      "nested": { "path": "keywords1" },
      "aggs": {
        "NestedKeywords": {
          "terms": {
            "field": "keywords1.keys.keyword",
            "size": 0
          }
        }
      }
    }
  }
}
For simply counting the number of nested keys, you could do this:
{
  "aggs": {
    "MyKeywords1Agg": {
      "nested": { "path": "keywords1" }
    }
  }
}
If you want to get some grouping on the field values on the "main" document or the nested documents, you will have to extend your mapping / data model to include terms that are aggregatable, which includes most data types in elasticsearch except "text", ex.: dates, numbers, geolocations, keywords.
Edit:
An example of aggregating on a unique identifier for each top-level document, assuming it has an integer property called "WordMappingId":
{
  "aggs": {
    "word_maping_agg": {
      "terms": {
        "field": "WordMappingId",
        "size": 0,
        "missing": -1
      },
      "aggs": {
        "Keywords1Agg": {
          "nested": { "path": "keywords1" }
        }
      }
    }
  }
}
If you don't add any properties to the "word_maping" document at the top level, there is no way to do an aggregation for each unique document. The built-in _id field is not aggregatable by default, and I suggest you include a unique identifier from the source data at the top level to aggregate on.
Note: the "missing" parameter puts all documents that don't have the WordMappingId property set into a bucket with the supplied value; this makes sure you're not missing any documents in the search results.
Aggs can support behaviour similar to a GROUP BY in SQL, but you need something to actually group by, and according to the mapping you supplied there are no such fields currently in your index.
I was trying to do something similar to understand the distribution of production data.
The following query helped me find the top 5:
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "n_base_keywords": {
      "nested": { "path": "base_keywords" },
      "aggs": {
        "top_count": { "terms": { "field": "_id", "size": 5 } }
      }
    }
  }
}

Elasticsearch: how to scope aggregations to your query and filter?

I have been playing around with Elasticsearch queries and filters for some time now, but I have never worked with aggregations before. The idea that we can scope the aggregations with our query seems quite amazing to me, but I want to understand how to do it properly so that I do not make any mistakes. Currently all my search queries are designed this way:
{
  "query": {
  },
  "filter": {
  },
  "from": 0,
  "size": 60
}
Now, when I added some aggregation buckets, the structure became this:
{
  "aggs": {
    "all_colors": {
      "terms": {
        "field": "color.name"
      }
    },
    "all_brands": {
      "terms": {
        "field": "brand_slug"
      }
    },
    "all_sizes": {
      "terms": {
        "field": "sizes"
      }
    }
  },
  "query": {
  },
  "filter": {
  },
  "from": 0,
  "size": 60
}
However, the results of the aggregations are always the same, irrespective of what I provide in the filter.
Now, when I changed the query structure to something like this, it started showing different results:
{
  "aggs": {
    "all_colors": {
      "terms": {
        "field": "color.name"
      }
    },
    "all_brands": {
      "terms": {
        "field": "brand_slug"
      }
    },
    "all_sizes": {
      "terms": {
        "field": "sizes"
      }
    }
  },
  "query": {
    "filtered": {
      "query": {
      },
      "filter": {
      }
    }
  },
  "from": 0,
  "size": 60
}
Does this mean I will have to change the structure of my search queries everywhere to this new filtered structure? Is there any other workaround that lets me achieve the desired results without having to change that much code?
Also, another thing I observed is that if my brand_slug field contains multiple keywords like "peter england", then both of these are returned in separate buckets like this:
{
  "buckets": [
    {
      "key": "england",
      "doc_count": 368
    },
    {
      "key": "peter",
      "doc_count": 368
    }
  ]
}
How can I ensure that both these end up in a same bucket like this:
{
  "buckets": [
    {
      "key": "peter england",
      "doc_count": 368
    }
  ]
}
UPDATE: I have been able to accomplish this second part by indexing brand, color and sizes differently, like this:
"sizes": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
What you've noticed is by design. Have a look at my answer to a similar question on SO. Basically, the input to both the aggregation and filter sections is the output of the query section. A filtered query, as you've suggested, would be the best way to achieve the results you desire. There is another way too: you can use a Filter Aggregation. Then you would not need to change your query and filter sections but simply copy the filter section inside each aggregation section; that, in my opinion, would be overkill and a violation of the DRY principle in general.
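For reference, the filter-aggregation variant would look something like this (the term filter here is a made-up example; in practice you'd copy your real filter into each aggregation, hence the duplication):
{
  "aggs": {
    "all_colors_filtered": {
      "filter": {
        "term": { "brand_slug": "peter-england" }      <-- stand-in for your real filter
      },
      "aggs": {
        "all_colors": {
          "terms": {
            "field": "color.name"
          }
        }
      }
    }
  },
  "query": {
  },
  "from": 0,
  "size": 60
}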
