Syntax for referencing sub-buckets in Vega - elasticsearch

I have an Elasticsearch (6.2) query that returns the following JSON:
"aggregations": {
"per_buyer": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Example-buyer",
"doc_count": 45,
"per_cart": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 6,
"buckets": [
{
"key": "397105FB",
"doc_count": 8,
"net": {
"value": 10
} } ] } } ] } }
What is the correct syntax for the Vega "format" field to display data per "per_cart" bucket? Anything deeper than aggregations.per_buyer.buckets returns the error _.aggregations.per_buyer.buckets.per_cart is undefined. VEGA_DEBUG.view.data shows that some aggregations.per_buyer.buckets entries have a "per_cart" object, which in turn has buckets. (Filtering so that all buckets have per_cart objects does not change anything.)
I previously asked this question without success on the Elastic forums.

Sorry, I missed the request on the Discuss forum. Cross-posting my answer here:
@Steven_Ensslen your format must be aggregations.per_buyer.buckets, simply because that's the list of data that you need. Each data element in that list may contain a sub-list of buckets, which you have to access from Vega itself.

If the sub-list will always have just a single bucket (e.g. if you are summing up the total per main bucket), you can access it either directly from a mark, e.g. datum.per_cart.buckets[0].net.value, or you can create a formula transform that copies that value into a top-level field, e.g. {type: 'formula', as: 'net_value', expr: 'datum.per_cart.buckets[0].net.value'}, and use the net_value field in the mark.

If, on the other hand, you have multiple items in the sub-list, you can use the flatten transform to flatten the sub-buckets into a non-hierarchical list of items, and then use further transforms to get the data into the format you need.
P.S. "flatten" transform may not be available until 6.3 or 6.4.

Related

Terms aggregation on first three octets of IP

I'm doing a faceted search UI, and one of the facets I want to add is for the first three octets of an IP field.
So for example, given documents with IPs "192.168.1.1", "192.168.1.2", "192.168.2.1", I would want to display the facets "192.168.1 (2)" and "192.168.2 (1)".
Is there an aggregation I can use for this? As far as I can tell, range aggregations require me to predefine the ranges, and term aggregations only take a field.
Obviously the alternative is for me to index the first three octets as a separate field, but of course I would prefer to avoid that.
Thanks!
You can add a path hierarchy tokenizer with a delimiter of '.' and a custom analyzer that uses that tokenizer.
See this question for the syntax:
Elasticsearch - using the path hierarchy tokenizer to access different level of categories
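For illustration, a minimal sketch of the index settings (the tokenizer and analyzer names here are made up):

PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "ip_octet_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "."
        }
      },
      "analyzer": {
        "ip_octet_analyzer": {
          "type": "custom",
          "tokenizer": "ip_octet_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "your_ip_addr": {
          "type": "string",
          "analyzer": "ip_octet_analyzer"
        }
      }
    }
  }
}

With this analyzer, "192.168.1.1" is indexed as the tokens 192, 192.168, 192.168.1 and 192.168.1.1.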
Then you can run a terms aggregation, and the results will be grouped by each octet prefix:
{
  "key": "192",
  "doc_count": 10
},
{
  "key": "192.168",
  "doc_count": 10
},
...
In the linked answer there is a way to exclude certain aggregation levels. The following should exclude all results except ones that have three levels of numbers:
"aggs": {
"ipaddr": {
"terms": {
"field": "your_ip_addr",
"exclude": ".*",
"include": ".*\\..*\\..*"
}
}
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pathhierarchy-tokenizer.html

String range query in Elasticsearch

I'm trying to query data in an Elasticsearch cluster (2.3) using the following range query. To clarify, I'm searching on a field that contains an array of values that were derived by concatenating two ids together with a count. For example:
Schema:
{
  "id1": 111,
  "id2": 222,
  "count": 5
}
The query I'm using looks like the following:
Query:
{
  "query": {
    "bool": {
      "must": {
        "range": {
          "myfield": {
            "from": "111_222_1",
            "to": "111_222_2147483647",
            "include_lower": true,
            "include_upper": true
          }
        }
      }
    }
  }
}
The to field uses Integer.MAX_VALUE (2147483647).
This works alright but doesn't exactly match the underlying data. Querying through other means produces more results than this method.
More strangely, trying 111_222_5 in the from field produces 0 results, while trying 111_222_10 does produce results.
How is ES (and/or Lucene) interpreting this range query and why is it producing such strange results? My initial guess is that it's not looking at the full value of the last portion of the String and possibly only looking at the first digit.
Is there a way to specify a format for the TermRange? I understand date ranging allows formatting.
A look here provides the answer.
The range is evaluated lexicographically: "5" comes before "50", which comes before "6", and so on.
To get around this, I reindexed using a fixed length string for the count.
0000000001
0000000100
0001000101
...
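With zero-padded counts, the same range query becomes (a sketch based on the ten-digit padding above):

"range": {
  "myfield": {
    "from": "111_222_0000000001",
    "to": "111_222_2147483647",
    "include_lower": true,
    "include_upper": true
  }
}

Because every count is now exactly ten digits, lexicographic order agrees with numeric order and the range matches the underlying data.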

Elasticsearch count terms ignoring spaces

Using ES 1.2.1
My aggregation
{
  "size": 0,
  "aggs": {
    "cities": {
      "terms": {
        "field": "city",
        "size": 300000
      }
    }
  }
}
The issue is that some city names have spaces in them and aggregate separately.
For instance, "Los Angeles" aggregates as:
{
"key": "Los",
"doc_count": 2230
},
{
"key": "Angeles",
"doc_count": 2230
},
I assume it has to do with the analyzer? Which one would I use to not split on spaces?
For fields that you want to perform aggregations on, I would recommend either the keyword analyzer or not analyzing the field at all. From the keyword analyzer documentation:
An analyzer of type keyword that "tokenizes" an entire stream as a single token. This is useful for data like zip codes, ids and so on. Note, when using mapping definitions, it might make more sense to simply mark the field as not_analyzed.
However, if you still want to perform analysis on the field for other searches, consider using the fields setting of ES 1.x, as described in the fields/multi_field documentation. This will give you one value of the field for searching and one for aggregations.
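For illustration, a minimal sketch of such a multi-field mapping (the sub-field name raw is just a convention):

"city": {
  "type": "string",
  "fields": {
    "raw": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}

The cities aggregation would then target city.raw instead of city, while full-text searches keep using the analyzed city field.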
There are two approaches to solve this:
The not_analyzed way - but this won't treat different capitalizations of the same name as one term.
The keyword tokenizer way - here terms that differ only in case can be mapped to one.
These two concepts, with working code examples, are illustrated in this blog.

ElasticSearch custom score script does not preserve array ordering

I am using ElasticSearch with a function_score property to retrieve documents sorted by createdOn. The createdOn field is stored as an Array representing date values, i.e.
"createdOn": [ 2014, 4, 24, 22, 11, 47 ]
Where createdOn[0] is the year, createdOn[1] is the month, createdOn[2] is the day, etc. I am testing the following query, which should return documents scored by year. However, the doc['createdOn'] array does not preserve the order of the elements. In this query, doc['createdOn'].values[0] returns 4, not 2014.
POST /example/1
{
  "name": "apple",
  "createdOn": [2014, 8, 22, 5, 12, 32]
}

POST /example/2
{
  "name": "apple",
  "createdOn": [2011, 8, 22, 5, 12, 32]
}

POST /example/3
{
  "name": "apple",
  "createdOn": [2013, 8, 22, 5, 12, 32]
}

POST /example/_search
{
  "query": {
    "function_score": {
      "boost_mode": "replace",
      "query": {
        "match_all": {}
      },
      "script_score": {
        "script": "doc['createdOn'].values[0]"
      }
    }
  }
}
It appears that this is due to the way ElasticSearch caches data (http://elasticsearch-users.115913.n3.nabble.com/Accessing-array-field-within-Native-Plugin-td4042848.html):

The docFieldDoubles method gets its values from the in-memory structures of the field data cache. This is done for performance. The field data cache is not loaded from the source of the document (because this would be slow) but from the Lucene index, where the values are sorted (for lookup speed). The get API does work based on the original document source, which is why you see those values in order (note: ES doesn't parse the source for the get API, it just gives you back what you put in it).

You can access the original document (which will be parsed) using the SourceLookup (available from the source method), but it will be slow as it needs to go to disk for every document.

I'm not sure about the exact semantics of what you are trying to achieve, but did you try looking at nested objects? Those allow you to store a list of objects in a way that keeps values together, like [{"key": "k1", "value": "v1"}, ...].

The only apparent solution, other than using the source method (which is slow), is to use nested queries. Any ideas on how I could rewrite my query using nested queries? It seems like the only efficient way to sort this query by year.

couchdb view/reduce. sometimes you can return values, sometimes you can't..?

This is on a recent version of couchbase server.
The end goal is for the reduce/groupby to aggregate the values of the duplicate keys in to a single row with an array value.
view result with no reduce/grouping (in reality there are maybe 50 rows like this emitted):
{
  "total_rows": 3,
  "offset": 0,
  "rows": [
    {
      "id": "1806a62a75b82aa6071a8a7a95d1741d",
      "key": "064b6b4b-8e08-4806-b095-9e59495ac050",
      "value": "1806a62a75b82aa6071a8a7a95d1741d"
    },
    {
      "id": "47abb54bf31d39946117f6bfd1b088af",
      "key": "064b6b4b-8e08-4806-b095-9e59495ac050",
      "value": "47abb54bf31d39946117f6bfd1b088af"
    },
    {
      "id": "ed6a3dd3-27f9-4845-ac21-f8a5767ae90f",
      "key": "064b6b4b-8e08-4806-b095-9e59495ac050",
      "value": "ed6a3dd3-27f9-4845-ac21-f8a5767ae90f"
    }
  ]
}
with reduce + group_level=1:
function(keys, values, rereduce) {
  return values;
}
yields an error from CouchDB with the actual 50 or so rows from the real view (it even fails with fewer rows). CouchDB complains that the reduce output does not shrink rapidly enough. However, this same kind of thing works just fine when the view keys are integers and there is a small amount of data.
Can someone please explain the difference to me?
Reduce values need to remain as small as possible, due to the nature of how they are stored in the internal b-tree data format. There's a little bit of information in the wiki about why this is.
If you want to identify unique values, this needs to be done in your map function. This section on the same wiki page shows you one method you can use to do so. (I'm sure there are others)
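For illustration, a minimal sketch of that map-side approach (the field name some_uuid is hypothetical; uniqueness comes from the emitted key itself):

function (doc) {
  // emit one tiny row per document; the key carries the grouping value
  if (doc.some_uuid) {              // hypothetical field to group on
    emit(doc.some_uuid, doc._id);   // small value, so no reduce-overflow issue
  }
}

Querying the view with ?key="..." then returns the matching rows directly, and any aggregation can happen in the app, as noted below.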
I am almost always going to be querying this view with a "key" parameter, so there really is no need to aggregate values via couch, it can be easily and efficiently done in the app.
