Elasticsearch: group every X values

Using Elasticsearch, I query for a particular field. Is there a way to then aggregate every X values of the result?
For instance, let's say I query 10 documents for field "myField", returning 10 values,
1, 4, 2, 4, 5, 3, 3, 2, 1, 4.
Is there a way to aggregate such that every 2 values are averaged, yielding
2.5, 3, 4, 2.5, 2.5 ?

You can do some interesting--and perhaps inadvisable--stuff with scripted metric aggregations. They let you define map-reduce scripts that run against your documents. You can get yourself in trouble with this, of course.
But just to see if I could do it, I set up a simple, single-shard index with the data you provided:
PUT /test_index
{"settings": {"number_of_shards": 1}}
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"num":1}
{"index":{"_id":2}}
{"num":4}
{"index":{"_id":3}}
{"num":2}
{"index":{"_id":4}}
{"num":4}
{"index":{"_id":5}}
{"num":5}
{"index":{"_id":6}}
{"num":3}
{"index":{"_id":7}}
{"num":3}
{"index":{"_id":8}}
{"num":2}
{"index":{"_id":9}}
{"num":1}
{"index":{"_id":10}}
{"num":4}
Then I can average every two documents like this:
POST /test_index/_search
{
"size": 0,
"aggs": {
"profit": {
"scripted_metric": {
"init_script" : "_agg['nums'] = []; _agg['avgs'] = [];",
"map_script" : "_agg.nums.add(doc['num'].value); if(_agg.nums.size() == 2){ _agg.avgs.add((_agg.nums[0] + _agg.nums[1])/2.0); _agg['nums'] = [];}",
"combine_script" : "return _agg.avgs",
"reduce_script" : "return _aggs"
}
}
}
}
...
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 10,
"max_score": 0,
"hits": []
},
"aggregations": {
"profit": {
"value": [
[
2.5,
3,
4,
2.5,
2.5
]
]
}
}
}
It doesn't seem to respect sorting order in the query, though the outcome is deterministic as far as I can tell.
What I did here only works with a single shard, since the map script runs separately on each shard and so pairs up only that shard's documents; you could probably generalize it somehow if you wanted to tinker with it long enough.
Also, big fat disclaimer: doing this in production might be a bad idea. You'd want to test this sort of thing out on small data sets first before you potentially crash your cluster with out-of-memory errors. Also only use scripting if your cluster isn't open to the big bad Internet.
Here is some code I used to play around with it:
http://sense.qbox.io/gist/c31f089e63200127fd9ca09992004db8bb11b890
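To make the flow of the four scripts easier to follow, here is a minimal Python sketch that imitates them outside Elasticsearch. It is only a simulation: the function names (init_state, map_doc, combine, reduce_all, run_scripted_metric) are made up for illustration, and the two-shard split at the end is a hypothetical example, since real shard assignment depends on routing.

```python
def init_state():
    # Mirrors init_script: a pending buffer and a result list, per shard.
    return {"nums": [], "avgs": []}


def map_doc(state, value, group_size=2):
    # Mirrors map_script: buffer values, emit an average when the buffer is full.
    state["nums"].append(value)
    if len(state["nums"]) == group_size:
        state["avgs"].append(sum(state["nums"]) / group_size)
        state["nums"] = []


def combine(state):
    # Mirrors combine_script: each shard returns only its averages.
    return state["avgs"]


def reduce_all(shard_results):
    # Mirrors reduce_script: the per-shard lists are returned unchanged.
    return list(shard_results)


def run_scripted_metric(shards, group_size=2):
    results = []
    for docs in shards:  # every shard gets its own state
        state = init_state()
        for value in docs:
            map_doc(state, value, group_size)
        results.append(combine(state))
    return reduce_all(results)


values = [1, 4, 2, 4, 5, 3, 3, 2, 1, 4]
print(run_scripted_metric([values]))                  # [[2.5, 3.0, 4.0, 2.5, 2.5]]
print(run_scripted_metric([values[:5], values[5:]]))  # [[2.5, 3.0], [3.0, 1.5]]
```

The second call shows why the approach is tied to one shard: with the values split across two shards, each shard pairs up its own documents and silently drops its leftover value.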

Related

How to get a distinct list of document fields using NEST?

I have just started with Elasticsearch and am using the NEST API for my .NET application. I have an index with some records inserted, and I am now trying to get a distinct list of the values of one document field. I have this working in Postman, but I do not know how to port the JSON aggregation body to a NEST call. Here is the call I am trying to port to the NEST C# API:
{
"size": 0,
"aggs": {
"hosts": {
"terms": {
"field": "host"
}
}
}
}
Here is the result, which leads to my next question: how would I parse the result or map it to a POCO? I am only interested in the distinct list of values of the field, in this case 'host'. I really just want an enumerable of strings back. I do not care about the count at this point.
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"hosts": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "hoyt",
"doc_count": 3
}
]
}
}
}
I was able to get the results I am after with the following code:
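// Run a terms aggregation on Host; Size(0) suppresses the hits themselves.
// Note: a terms aggregation returns only the top 10 buckets by default.
var result = await client.SearchAsync<SyslogEntryIndex>(s => s
    .Size(0)
    .Aggregations(a => a.Terms("hosts", t => t.Field(f => f.Host))));
// Collect every bucket key; each key is one distinct host value.
List<string> hosts = new List<string>();
foreach (BucketAggregate v in result.Aggregations.Values)
{
    foreach (KeyedBucket<object> item in v.Items)
    {
        hosts.Add((string)item.Key);
    }
}
return hosts;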

No results from search when passing more than one parameter in user metadata

I want to apply document level security in elastic, but once I provide more than one value in user metadata I get no matches.
I am creating a role and a user in Elasticsearch. The role's query reads values from the user metadata, and those values should restrict what a search returns. It works fine if I give one value.
For creating role:
PUT _xpack/security/role/my_policy
{
"indices": [{
"names": ["my_index"],
"privileges": ["read"],
"query": {
"template": {
"source": "{\"bool\": {\"filter\": [{\"terms_set\": {\"country_name\": {\"terms\": {{#toJson}}_user.metadata.country_name{{/toJson}},\"minimum_should_match_script\":{\"source\":\"params.num_terms\"}}}}]}}"
}
}
}]
}
And for user:
PUT _xpack/security/user/jack_black
{
"username": "jack_black",
"password":"testtest",
"roles": ["my_policy"],
"full_name": "Jack Black",
"email": "jb@tenaciousd.com",
"metadata": {
"country_name": ["india" , "japan"]
}
}
I expect the output to be results for india and japan only. If the user searches for anything else they should get no results.
However, I do not see any results at all:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
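One possible reading of the empty result, offered as a sketch rather than a confirmed diagnosis: in a terms_set query, params.num_terms is the number of terms supplied, so the role query above only matches documents whose country_name contains every country in the user's metadata. The Python below merely imitates that matching rule. The sample documents and the names terms_set_matches and visible_docs are hypothetical, and it assumes each document stores a single country.

```python
def terms_set_matches(doc_values, query_terms, minimum_should_match):
    """Imitate terms_set: count how many query terms the document contains."""
    matched = len(set(doc_values) & set(query_terms))
    return matched >= minimum_should_match


# Hypothetical documents, each holding one country (an assumption).
DOCS = [
    {"country_name": ["india"]},
    {"country_name": ["japan"]},
    {"country_name": ["france"]},
]


def visible_docs(user_countries, minimum_should_match=None):
    # By default mimic "params.num_terms": every supplied term must match.
    if minimum_should_match is None:
        minimum_should_match = len(user_countries)
    return [
        doc for doc in DOCS
        if terms_set_matches(doc["country_name"], user_countries, minimum_should_match)
    ]


print(visible_docs(["india"]))           # one term required: the india document
print(visible_docs(["india", "japan"]))  # two terms required: no document has both
print(visible_docs(["india", "japan"], minimum_should_match=1))  # india and japan documents
```

If that assumption about the data holds, requiring a single match instead of params.num_terms, or using a plain terms query, would be the first thing to try.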

Elastic search storage: How to get the list of field names under _source?

I am very new to Elasticsearch and am looking for a way to list all the fields under _source. So far, I have found ways to get the values of the different fields defined under _source, but not a way to list the field names themselves. For example, I have the document below:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "my_product",
"_type": "_doc",
"_id": "B2LcemUBCkYSNbJBl-G_",
"_score": 1,
"_source": {
"email": "123@abc.com",
"product_0": "iWLKHmUBCkYSNbJB3NZR",
"product_price_0": "10",
"link_0": ""
}
}
]
}
}
So, from the above example, I would like to get the field names under _source: email, product_0, product_price_0 and link_0. I have been retrieving the values by parsing the array returned from the ESS API, but what should go in place of the ? to get the field names: $result['hits']['hits'][0]['_source'][?]
Note: I am using PHP to insert data into ESS and retrieve data from it.
If I understood correctly, you need array_keys, which returns the keys of an array:
array_keys($result['hits']['hits'][0]['_source'])
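The call above covers the first hit only. As a rough illustration of the same idea applied to every hit, here is a Python sketch (the question itself uses PHP, so this shows the logic rather than drop-in code). The function name source_field_names is made up, and the sample response is a trimmed copy of the one in the question.

```python
def source_field_names(response):
    """Collect the distinct _source field names across all hits, in first-seen order."""
    names = []
    for hit in response["hits"]["hits"]:
        for key in hit.get("_source", {}):
            if key not in names:
                names.append(key)
    return names


response = {
    "hits": {
        "total": 2,
        "hits": [
            {
                "_id": "B2LcemUBCkYSNbJBl-G_",
                "_source": {
                    "email": "123@abc.com",
                    "product_0": "iWLKHmUBCkYSNbJB3NZR",
                    "product_price_0": "10",
                    "link_0": "",
                },
            }
        ],
    }
}

print(source_field_names(response))
# ['email', 'product_0', 'product_price_0', 'link_0']
```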

Convert any Elasticsearch response to simple field value format

In Elasticsearch, when running a simple query like:
GET miindex-*/mytype/_search
{
"query": {
"query_string": {
"analyze_wildcard": true,
"query": "*"
}
}
}
It returns a format like:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 28,
"max_score": 1,
"hits": [
...
So I read response.hits.hits to get the actual records.
However, if you run another type of query, e.g. an aggregation, the response is totally different:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 28,
"max_score": 0,
"hits": []
},
"aggregations": {
"myfield": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
...
and I actually need to look in another JSON property, response.aggregations.myfield.buckets, which gets even more complicated if you have more than one aggregation.
So my question is very simple: isn't there a way to get Elasticsearch to always respond with just the fields I want, in an SQL-like format?
E.g.
SELECT author, bookid FROM books
Would return:
{"author":"rogers", "bookid":099991}
{"author":"peter", "bookid":099992}
SELECT COUNT(author) As count_author, author, count(bookid) As count_bookid, bookid FROM books GROUP BY author, bookid
Would return:
{"count_author":4, "author":"rogers", "count_bookid":9, "bookid":099991}
{"count_author":8, "author":"peter", "count_bookid":9, "bookid":099992}
Is there a way to show only the fields I want and nothing else (without having to look inside nested JSON objects and all that stuff)? I want this because I'm producing many reports and I want a simple function that parses each response easily in a uniform way.
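The kind of uniform parsing function described above can at least be sketched on the client side. The Python below is a minimal example under stated assumptions: the name flatten_response and the "count_" prefix are made up, and it only handles plain hits and single-level bucket aggregations whose buckets come back as a list (such as terms). Nested or keyed aggregations would need more work.

```python
def flatten_response(response):
    """Turn a search response into a flat list of row dictionaries."""
    rows = []
    aggregations = response.get("aggregations")
    if aggregations:
        # One row per bucket: {<agg name>: <bucket key>, count_<agg name>: <doc_count>}
        for name, agg in aggregations.items():
            buckets = agg.get("buckets", [])
            if not isinstance(buckets, list):
                continue  # keyed buckets are out of scope for this sketch
            for bucket in buckets:
                rows.append({name: bucket["key"], "count_" + name: bucket["doc_count"]})
        return rows
    # No aggregations: one row per hit, taken from its _source.
    for hit in response.get("hits", {}).get("hits", []):
        rows.append(dict(hit.get("_source", {})))
    return rows


search_response = {
    "hits": {"total": 2, "hits": [
        {"_source": {"author": "rogers", "bookid": 99991}},
        {"_source": {"author": "peter", "bookid": 99992}},
    ]}
}
agg_response = {
    "hits": {"total": 28, "hits": []},
    "aggregations": {"myfield": {"buckets": [
        {"key": "rogers", "doc_count": 4},
        {"key": "peter", "doc_count": 8},
    ]}},
}

print(flatten_response(search_response))
print(flatten_response(agg_response))
```

Both calls return a list of flat dictionaries, so a report only has to loop over rows regardless of which kind of query produced them.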

Timeout in Elastic search query

I have the following Elasticsearch query and want to apply a timeout to it, so I used the "timeout" parameter:
GET testdata-2016.04.14/_search
{
"size": 10000,
"timeout": "1ms"
}
I have set the timeout to 1ms, but I observed that the query takes more than 5000ms. I have also tried the query as below:
GET testdata-2016.04.14/_search?timeout=1ms
{
"size": 10000
}
In both cases, I get the response below after approx. 5000ms.
{
"took": 126,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 26536,
"max_score": 1,
"hits": [
{
...................
...................
}
}
}
I am not sure what is happening here. Is anything missing in the queries above?
Please help.
I have searched Google for a solution but didn't find one that works.
