Nested Objects aggregations (with Kibana) - elasticsearch

We have an Elasticsearch index containing documents with a set of arbitrary nested objects called devices. Each of those devices has a key called "aw".
What I am trying to accomplish is to get the average of the aw key for each device type.
When trying to aggregate and visualize this average, I don't get the average of the aw per device type, but the average over all devices within the documents that contain the specific device.
So instead of fetching all documents where device.id=7 and aggregating the aw per device.id, Elasticsearch / Kibana fetches all documents containing device.id=7 but then builds its average using all devices within those documents.
Our index mapping looks like this (only the important parts):
"mappings" : {
"devdocs" : {
"_all": { "enabled": false },
"properties" : {
"cycle": {
"type": "object",
"properties": {
"t": {
"type": "date",
"format": "dateOptionalTime||epoch_second"
}
}
},
"devices": {
"type": "nested",
"include_in_parent": true,
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
},
"aw": {
"type": "long"
}
"t": {
"type": "date",
"format": "dateOptionalTime||epoch_second"
},
}
}
}
}
Kibana generates the following query:
{
"size": 0,
"query": {
"filtered": {
"query": {
"query_string": {
"analyze_wildcard": true,
"query": "*"
}
},
"filter": {
"bool": {
"must": [
{
"range": {
"cycle.t": {
"gte": 1290760324744,
"lte": 1448526724744,
"format": "epoch_millis"
}
}
}
],
"must_not": []
}
}
}
},
"aggs": {
"2": {
"terms": {
"field": "devices.name",
"size": 35,
"order": {
"1": "desc"
}
},
"aggs": {
"1": {
"avg": {
"field": "devices.aw"
}
}
}
}
}
}
Is there a way to aggregate the average aw at the device level, or what am I doing wrong?

Kibana doesn't support nested aggregations yet; see the Nested Aggregations issue.
I had the same issue and solved it by building Kibana from source using this fork by user ppadovani (branch: nestedAggregations).
See the instructions to build Kibana from source here.
After building, when you run Kibana it will contain a Nested Path text box and a reverse nested checkbox in the advanced options for buckets and metrics.
Here is an example of a nested terms aggregation on lines.category_1, lines.category_2 and lines.category_3, with lines being of nested type, using the above with three buckets.
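Roughly, what the fork enables boils down to the following raw query, where the terms and avg aggregations are wrapped in a nested aggregation so they run in the context of the individual device objects rather than of the whole document. This is only a sketch against the mapping above; the aggregation names all_devices, per_device and avg_aw are illustrative:
{
  "size": 0,
  "aggs": {
    "all_devices": {
      "nested": {
        "path": "devices"
      },
      "aggs": {
        "per_device": {
          "terms": {
            "field": "devices.name",
            "size": 35
          },
          "aggs": {
            "avg_aw": {
              "avg": {
                "field": "devices.aw"
              }
            }
          }
        }
      }
    }
  }
}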

I would suggest adding a filter aggregation so that only the aw of device 7 is left.
Defines a single bucket of all the documents in the current document
set context that match a specified filter. Often this will be used to
narrow down the current aggregation context to a specific set of
documents.
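Combined with a nested aggregation so that it runs at the device level, a sketch of that suggestion for a single device could look like the following. It assumes the devices also carry an id field, as the question implies; that field is not part of the mapping excerpt above:
{
  "size": 0,
  "aggs": {
    "all_devices": {
      "nested": {
        "path": "devices"
      },
      "aggs": {
        "device_7": {
          "filter": {
            "term": { "devices.id": 7 }
          },
          "aggs": {
            "avg_aw": {
              "avg": { "field": "devices.aw" }
            }
          }
        }
      }
    }
  }
}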

Kibana does not support nested JSON.

Related

Composite and Terms Aggregations on a field with a high cardinality

I am facing a huge performance problem with ES which results in response times of more than 2 minutes.
I have an index that holds more than 25M files and is composed of the following 4 fields (among others):
...
"group_write": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"user_write": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"group_read": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"user_read": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
...
I have something like 100K unique users and groups and each field is a list of users/groups that holds ~100 values. For example:
"user_read": ["user_1", "group_1", ...],
"user_write": ["user_1", "group_2", ...]
...
I am using 2 kinds of aggregations, composite and terms: composite aggregations for getting only the first X results to display, and terms aggregations for prefix search.
Composite aggregation:
{
"size": 0,
"aggs": {
"Group_Read_Permissions": {
"composite": {
"sources": [
{
"Group Read": {
"terms": {
"field": "group_read.raw"
}
}
}
],
"size": 10
}
},
"Group_Write_Permissions": {
"composite": {
"sources": [
{
"Group Write": {
"terms": {
"field": "group_write.raw"
}
}
}
]
}
},
"User_Write_Permissions": {
"composite": {
"sources": [
{
"User Write": {
"terms": {
"field": "user_write.raw"
}
}
}
]
}
},
"User_Read_Permissions": {
"composite": {
"sources": [
{
"User Read": {
"terms": {
"field": "user_read.raw"
}
}
}
]
}
}
}
}
Terms aggregation:
{
"size": 0,
"aggs": {
"Group_Read_Permissions": {
"terms": {
"field": "group_read.raw",
"include": ".*[Ss].*"
}
},
"Group Write Permissions": {
"terms": {
"field": "group_write.raw",
"include": ".*[Ss].*"
}
},
"User Read Permissions": {
"terms": {
"field": "user_read.raw",
"include": ".*[Ss].*"
}
},
"User Write Permissions": {
"terms": {
"field": "user_write.raw",
"include": ".*[Ss].*"
}
}
}
}
Composite aggregation returns results within 1 min and the terms aggregation can take up to 5 min.
What I have tried so far:
Adding a new field user_group_permissions and adding "copy_to": "user_group_permissions" to the above 4 fields
Adding the property "eager_global_ordinals": true to the above 4 fields and to the field "user_group_permissions"
Increasing the refresh_interval up to 200s
** I reindexed for the first 2 suggestions [took something like 6 hours]
All of the above did help a little with the retrieval time, but still: the composite aggregation takes up to 20s and the terms aggregation takes up to 3 min.
[The best results were on the field user_group_permissions created in the first suggestion, with eager_global_ordinals = true and refresh_interval = 120s.]
Please, if someone has any idea how to improve the retrieval times I will be grateful.
First of all, if you only need the first 10 results, you don't need to use the composite aggregation, which is meant to be used only if you need to paginate over all results. Simply use the terms aggregation with the default size of 10; that'll do the job.
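For example, the first composite source above could be rewritten as a plain terms aggregation along these lines (10 is already the default size and is shown only for clarity):
{
  "size": 0,
  "aggs": {
    "Group_Read_Permissions": {
      "terms": {
        "field": "group_read.raw",
        "size": 10
      }
    }
  }
}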
Second, what you're doing with the terms aggregation is not prefix filtering but infix filtering, which is completely different in terms of performance. While it's easy to search for prefixes, searching for infixes requires the equivalent of a "full table scan", because each and every term must be visited.
A first optimization I would suggest is that in your second query you should do your regex in the query part (bool/should with one regex query per field), so as to reduce the document set on which the terms aggregations need to run. That might help a bit.
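A sketch of that idea, pre-filtering the document set with one regexp query per field before the aggregations run (only one of the four aggregations is shown for brevity):
{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        { "regexp": { "group_read.raw": ".*[Ss].*" } },
        { "regexp": { "group_write.raw": ".*[Ss].*" } },
        { "regexp": { "user_read.raw": ".*[Ss].*" } },
        { "regexp": { "user_write.raw": ".*[Ss].*" } }
      ],
      "minimum_should_match": 1
    }
  },
  "aggs": {
    "Group_Read_Permissions": {
      "terms": {
        "field": "group_read.raw",
        "include": ".*[Ss].*"
      }
    }
  }
}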
A second optimization is to leverage the wildcard field type, which is a specialized field type made specifically for grep-like wildcard and regexp queries.
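A minimal mapping sketch of that, assuming you are on Elasticsearch 7.9 or later where the wildcard type is available; the wc sub-field name is illustrative and adding it requires a reindex:
"group_read": {
  "type": "text",
  "fields": {
    "raw": { "type": "keyword" },
    "wc": { "type": "wildcard" }
  }
}
The infix searches would then target group_read.wc with a query such as { "wildcard": { "group_read.wc": { "value": "*s*" } } }.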
Another possible optimization is to lowercase all your permissions, so that you only need to search for .*s.* instead of the uppercase variant.
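A sketch of the lowercasing approach, using a custom normalizer on the keyword sub-fields (the index name, normalizer name and field shown are illustrative, and this change also requires a reindex):
PUT my_index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "group_read": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword",
            "normalizer": "lowercase_normalizer"
          }
        }
      }
    }
  }
}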
Depending on your comments, I'll add more optimizations as the discussion goes on.

Nested Fields, Wildcard Queries and Aggregations in Elasticsearch

I have an index that collects web redirects data for various sites. I am using a nested field to collect the data as shown in the mapping below:
"chain": {
"type": "nested",
"properties": {
"url.position": {
"type": "long"
},
"url.full": {
"type": "text"
},
"url.domain": {
"type": "keyword"
},
"url.path": {
"type": "keyword"
},
"url.query": {
"type": "text"
}
}
}
As you can imagine, each document contains an array of URL chains, the size of the array being equal to the number of web redirects. I want to get aggregations based on wildcard/regexp matches on the url.query field. Here is a sample query:
GET push_url_chain/_search
{
"query": {
"nested": {
"path": "chain",
"query": {
"regexp": {
"chain.url.query": "aff_c.*"
}
}
}
},
"size": 0,
"aggs": {
"dataFields": {
"nested": {
"path": "chain"
},
"aggs": {
"offers": {
"terms": {
"field": "chain.url.domain",
"size": 30
}
}
}
}
}
}
The above query does produce aggregated results, but not the way I want.
I want to see chain.url.domain aggregations for the URLs that contain the aff_c.* phrase. Right now it is looking at all the URLs in the chain and then aggregating the buckets by doc_count, regardless of whether that URL/domain matches the particular phrase. I hope I have been able to explain this clearly. How do I get my results to show bucket aggregations containing only domains whose url.query field matches aff_c.*?
I would also like to know how I can use = or / in my wildcard or regexp queries. They produce no results when I use those symbols in my queries.
Thanks
A nested query returns all documents in which a nested document matches the condition; you only get the matching nested docs in inner_hits.
The aggregation is applied on top of these documents, so all domains show up in the terms.
You need to add a filter inside the nested aggregation to get only the matching terms.
{
"size": 0,
"aggs": {
"Name": {
"nested": {
"path": "chain"
},
"aggs": {
"matched_doc": {
"filter": { --> filter for url
"match_phrase_prefix": {
"chain.url.query": "abc"
}
},
"aggs": {
"domain": {
"terms": {
"field": "chain.url.domain", -- terms for matched url
"size": 10
}
}
}
}
}
}
}
}
You can use match_phrase_prefix instead of regex. It has better performance.
The standard analyzer removes "/" and "=" while generating tokens. So if you want to use regex or wildcard queries and look for these characters, you need to use a keyword field, not a text field.
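A sketch of such a mapping change for the query part of the URL (the keyword sub-field name is illustrative and would require a reindex):
"url.query": {
  "type": "text",
  "fields": {
    "keyword": { "type": "keyword" }
  }
}
Inside the existing nested query, a wildcard query such as { "wildcard": { "chain.url.query.keyword": "*aff_c=*" } } would then see the raw value, with = and / preserved.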

Elasticsearch nested significant terms aggregation with background filter

I am having a hard time applying a background filter to a nested significant_terms aggregation; the bg_count is always 0.
I'm indexing article views that have ids and timestamps, and I have multiple applications on a single index. I want the foreground and background sets to relate to the same application, so I'm trying to apply a term filter on the app_id field both in the bool query and in the background filter. article_views is a nested object since I also want to be able to query on views with a range filter on the timestamp, but I haven't gotten to that yet.
Mapping:
{
"article_views": {
"type": "nested",
"properties": {
"id": {
"type": "string",
"index": "not_analyzed"
},
"timestamp": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
}
}
},
"app_id": {
"type": "string",
"index": "not_analyzed"
}
}
Query:
{
"aggregations": {
"articles": {
"nested": {
"path": "article_views"
},
"aggs": {
"articles": {
"significant_terms": {
"field": "article_views.id",
"size": 5,
"background_filter": {
"term": {
"app_id": "17"
}
}
}
}
}
}
},
"query": {
"bool": {
"must": [
{
"term": {
"app_id": "17"
}
},
{
"nested": {
"path": "article_views",
"query": {
"terms": {
"article_views.id": [
"1",
"2"
]
}
}
}
}
]
}
}
}
As I said, in my results the bg_count is always 0, which had me worried. If the significant_terms aggregation is on other fields which are not nested, the background_filter works fine.
Elasticsearch version is 2.2.
Thanks
You seem to be hitting a known issue: in your background filter you'd need to "go back" to the parent context in order to define the background filter based on a field of the parent document.
You'd need a reverse_nested query at that point, but that doesn't exist.
One way to circumvent this is to add the app_id field to your nested documents so that you can simply use it in the background filter context.
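A sketch of that workaround, assuming app_id is also indexed inside each nested view (the nested field article_views.app_id is not part of the current mapping):
{
  "aggregations": {
    "articles": {
      "nested": {
        "path": "article_views"
      },
      "aggs": {
        "articles": {
          "significant_terms": {
            "field": "article_views.id",
            "size": 5,
            "background_filter": {
              "term": {
                "article_views.app_id": "17"
              }
            }
          }
        }
      }
    }
  }
}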

Elasticsearch aggregation performance takes a hit on relatively small dataset

We have a cluster of 3 Linux VMs (each machine has 2 cores and 8GB of RAM per core) where we have deployed an Elasticsearch 2.1.1 cluster with the default configuration. Store size is ~50GB for ~3M documents, so arguably fairly modest. We index documents ranging in size from tweets to blog posts. For each document, we extract "entities" (e.g., if the string "Barack Obama" appears in a document, we locate its character position and classify it into an entity type, in this case "person" or "statesman") from the text before indexing the document alongside its array of extracted entities.
Our mapping is as follows:
{
"mappings": {
"_default_": {
"_all": { "enabled": "false" },
"dynamic": false
},
"document": {
"properties": {
"body": { "type": "string", "index": "analyzed", "analyzer": "english" },
"timestamp": { "type": "date", "index":"not_analyzed" },
"author": {
"properties": {
"name": { "type": "string", "index": "not_analyzed" }
}
},
"entities": {
"type": "nested",
"include_in_parent": true,
"properties": {
"text": { "type": "string", "index": "not_analyzed" },
"type": { "type": "string", "index": "analyzed", "analyzer": "path" },
"start": { "type": "integer", "index":"not_analyzed", "doc_values": false },
"stop": { "type": "integer", "index":"not_analyzed", "doc_values": false }
}
}
}
}
}
}
The path analyzer is used on the entity type field (entity types are based on a hierarchical taxonomy, so the type is represented as a path-like string). The only other analyzed field is the body of the document. For reasons I could expand on if necessary, we have to index the entities as nested types, though we are still including them in the parent document.
There are on average ~10 entities extracted per document, so ~30M entities in total. The cardinality for the entities field is thus fairly high (~2M unique values).
Our problem is that some of the aggregations we are doing are very slow (>30s). In particular, the following two aggregations:
{
"query": {
"bool": {
"must": {
"query": {
// Some query
}
},
"filter": {
// Some filter
}
}
},
"aggs": {
"aggData": {
"terms": { "field": "entities.text", "size": 50 }
}
}
}
And the same one, just replacing 'terms' aggregation with 'significant_terms':
{
"query": {
"bool": {
"must": {
"query": {
// Some query
}
},
"filter": {
// Some filter
}
}
},
"aggs": {
"aggData": {
"significant_terms": { "field": "entities.text", "size": 50 }
}
}
}
My questions:
Why are these aggregations prohibitively slow?
Is there something stupid/inefficient in the mapping strategy?
Does indexing the entities as a nested document while still keeping them in the parent document have an impact?
Is it simply that the cardinality of the entities field is just too big and Elasticsearch is not magic?

Elasticsearch getting the last nested or most recent nested element

We have this mapping:
{
"product_achievement": {
"type": "nested",
"properties": {
"id": {
"type": "long"
},
"last_purchase": {
"type": "long"
},
"products": {
"type": "long"
}
}
}
}
As you can see this is nested, and the last_purchase field is a unix timestamp value. We would like to query, across all nested elements, the most recent entry as defined by the last_purchase field AND see whether some product id is present in that entry's products.
You can achieve this using a nested query with inner_hits. In the query part you specify the product id you want to match, and then using inner_hits you sort by decreasing last_purchase timestamp and only take the first one using size: 1:
{
"query": {
"nested": {
"path": "product_achievement",
"query": {
"term": {
"product_achievement.products": 1
}
},
"inner_hits": {
"size": 1,
"sort": {
"product_achievement.last_purchase": "desc"
}
}
}
}
}
