Indexed documents are like:
{
id: 1,
title: 'Blah',
...
platform: {id: 84, url: 'http://facebook.com', title: 'Facebook'}
...
}
What I want is to count and output stats by platform.
For counting, I can use a terms aggregation with platform.id as the field to count:
aggs: {
platforms: {
terms: {field: 'platform.id'}
}
}
This way I receive stats as multiple buckets looking like {key: 8, doc_count: 162511}, as expected.
Now, can I somehow also add platform.name and platform.url to those buckets (for pretty output of the stats)? The best I've come up with looks like:
aggs: {
platforms: {
terms: {field: 'platform.id'},
aggs: {
name: {terms: {field: 'platform.name'}},
url: {terms: {field: 'platform.url'}}
}
}
}
Which, in fact, works, and returns a pretty complicated structure in each bucket:
{key: 7,
doc_count: 528568,
url:
{doc_count_error_upper_bound: 0,
sum_other_doc_count: 0,
buckets: [{key: "http://facebook.com", doc_count: 528568}]},
name:
{doc_count_error_upper_bound: 0,
sum_other_doc_count: 0,
buckets: [{key: "Facebook", doc_count: 528568}]}},
Of course, the platform's name and url could be extracted from this structure (like bucket.url.buckets.first.key), but is there a cleaner and simpler way to do the task?
It seems the best way to express the intent is the top_hits aggregation: "from each aggregated group select only one document", and then extract the platform from it:
aggs: {
platforms: {
terms: {field: 'platform.id'},
aggs: {
platform: {top_hits: {size: 1, _source: {include: ['platform']}}}
}
}
}
This way, each bucket will look like:
{"key": 7,
"doc_count": 529939,
"platform": {
"hits": {
"hits": [{
"_source": {
"platform":
{"id": 7, "name": "Facebook", "url": "http://facebook.com"}
}
}]
}
},
}
Which is kinda too deep (as usual with ES), but clean: bucket.platform.hits.hits.first._source.platform
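For convenience, that unwrapping can live in a small helper; here's a sketch in plain JavaScript, assuming the response shape shown above (the helper name is my own):

```javascript
// Flatten one terms bucket produced by the top_hits sub-aggregation
// into {id, name, url, count}. Assumes size: 1, so hits.hits[0] exists.
function bucketToPlatformStats(bucket) {
  const platform = bucket.platform.hits.hits[0]._source.platform;
  return {
    id: platform.id,
    name: platform.name,
    url: platform.url,
    count: bucket.doc_count,
  };
}

// Example with the bucket shape from the response above:
const bucket = {
  key: 7,
  doc_count: 529939,
  platform: {
    hits: {
      hits: [{ _source: { platform: { id: 7, name: 'Facebook', url: 'http://facebook.com' } } }],
    },
  },
};
console.log(bucketToPlatformStats(bucket));
// → { id: 7, name: 'Facebook', url: 'http://facebook.com', count: 529939 }
```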
If you don't necessarily need to get the value of platform.id, you could get away with a single aggregation instead using a script that concatenates the two fields name and url:
aggs: {
platforms: {
terms: {script: 'doc["platform.name"].value + "," + doc["platform.url"].value'}
}
}
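The composite key then has to be split apart client-side; a quick sketch, assuming the name itself never contains the separator:

```javascript
// Split a "name,url" composite bucket key back into its parts.
// Fragile if the name contains a comma; a rarer separator (e.g. '|')
// would be safer in practice.
function splitPlatformKey(bucket) {
  const [name, url] = bucket.key.split(',');
  return { name, url, count: bucket.doc_count };
}

console.log(splitPlatformKey({ key: 'Facebook,http://facebook.com', doc_count: 162511 }));
// → { name: 'Facebook', url: 'http://facebook.com', count: 162511 }
```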
I have an index, invoices, that I need to aggregate into yearly buckets then sort.
I have succeeded in using Bucket Sort to sort my buckets by simple sum values (revenue and tax). However, I am struggling to sort by more deeply nested doc_count values (status).
I want to order my buckets not only by revenue, but also by the number of docs with a status field equal to 1, 2, 3 etc...
The documents in my index look like this:
"_source": {
"created_at": "2018-07-07T03:11:34.327Z",
"status": 3,
"revenue": 68.474,
"tax": 6.85,
}
I request my aggregations like this:
const params = {
index: 'invoices',
size: 0,
body: {
aggs: {
sales: {
date_histogram: {
field: 'created_at',
interval: 'year',
},
aggs: {
total_revenue: { sum: { field: 'revenue' } },
total_tax: { sum: { field: 'tax' } },
statuses: {
terms: {
field: 'status',
},
},
sales_bucket_sort: {
bucket_sort: {
sort: [{ total_revenue: { order: 'desc' } }],
},
},
},
},
},
},
}
The response (truncated) looks like this:
"aggregations": {
"sales": {
"buckets": [
{
"key_as_string": "2016-01-01T00:00:00.000Z",
"key": 1451606400000,
"doc_count": 254,
"total_tax": {
"value": 735.53
},
"statuses": {
"sum_other_doc_count": 0,
"buckets": [
{
"key": 2,
"doc_count": 59
},
{
"key": 1,
"doc_count": 58
},
{
"key": 5,
"doc_count": 57
},
{
"key": 3,
"doc_count": 40
},
{
"key": 4,
"doc_count": 40
}
]
},
"total_revenue": {
"value": 7355.376005351543
}
},
]
}
}
I want to sort by key: 1, for example. Order the buckets according to which one has the greatest number of docs with a status value of 1. I tried to order my terms aggregation, then specify the desired key like this:
statuses: {
terms: {
field: 'status',
order: { _key: 'asc' },
},
},
sales_bucket_sort: {
bucket_sort: {
sort: [{ 'statuses.buckets[0]._doc_count': { order: 'desc' } }],
},
},
However, this did not work. It didn't error; it just didn't seem to have any effect.
I noticed someone else on SO had a similar question many years ago, but I was hoping a better answer had emerged since then: Elasticsearch aggregation. Order by nested bucket doc_count
Thanks!
Never mind, I figured it out. I added a separate filter aggregation like this:
aggs: {
total_revamnt: { sum: { field: 'revamnt' } },
total_purchamnt: { sum: { field: 'purchamnt' } },
approved_invoices: {
filter: {
term: {
status: 1,
},
},
},
}
Then I was able to bucket sort that value like this:
sales_bucket_sort: {
bucket_sort: {
sort: [{ 'approved_invoices>_count': { order: 'asc' } }],
},
},
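Putting the pieces together with the field names from the original question, the whole aggs block would look roughly like this (the 'approved_invoices>_count' string is buckets_path syntax reaching into the sibling filter aggregation):

```javascript
// Sort yearly buckets by the number of invoices with status === 1,
// via a filter sub-aggregation addressed with the buckets_path
// syntax 'approved_invoices>_count'.
const aggs = {
  sales: {
    date_histogram: { field: 'created_at', interval: 'year' },
    aggs: {
      total_revenue: { sum: { field: 'revenue' } },
      total_tax: { sum: { field: 'tax' } },
      approved_invoices: { filter: { term: { status: 1 } } },
      sales_bucket_sort: {
        bucket_sort: {
          sort: [{ 'approved_invoices>_count': { order: 'desc' } }],
        },
      },
    },
  },
};
```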
In case anyone comes across this issue again: as a late update, tried with Elasticsearch 7.10, it can work this way:
sales_bucket_sort: {
bucket_sort: {
sort: [{ '_count': { order: 'asc' } }],
},
}
With only _count specified, it will automatically take the doc_count and sort accordingly.
I believe this answer will just sort by the doc_count of the date_histogram buckets, not by the nested status doc_count.
JP's answer works: create a filter with the target field: value then sort by it.
I'm having difficulties with Elasticsearch.
Here is what I want to do:
Let's say unit of my index looks like this:
{
transacId: "qwerty",
amount: 150,
userId: "adsf",
client: "mobile",
goal: "purchase"
}
I want to build different types of statistics from this data, and Elasticsearch does it really fast. The problem I have is that in my system a user can add a new field to a transaction on demand. Let's say we have another document in the same index:
{
transacId: "qrerty",
amount: 200,
userId: "adsf",
client: "mobile",
goal: "purchase",
token_1: "game"
}
So now I want to group by token_1.
{
query: {
match: {userId: "asdf"}
},
aggs: {
token_1: {
terms: {field: "token_1"},
aggs: {sumAmt: {sum: {field: "amount"}}}
}
}
}
The problem here is that it will aggregate only documents with the field token_1. I know there is the missing aggregation, and I can do something like this:
{
query: {
match: {userId: "asdf"}
},
aggs: {
token_1: {
missing: {field: "token_1"},
aggs: {sumAmt: {sum: {field: "amount"}}}
}
}
}
But in this case it will aggregate only documents without the field token_1, while what I want is to aggregate both types of documents in one query. I tried this, but it also didn't work for me:
{
query: {
match: {userId: "asdf"}
},
aggs: {
token_1: {
missing: {field: "token_1"},
aggs: {sumAmt: {sum: {field: "amount"}}}
},
aggs: {
token_1: {
missing: {field: "token_1"},
aggs: {sumAmt: {sum: {field: "amount"}}}
}
}
}
}
I think maybe there is something like an OR operator for aggregations, but I couldn't find anything. Help me, please.
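For what it's worth, one workaround (my own suggestion, so treat it as a sketch) is to run the terms and missing aggregations side by side under distinct names, so a single request covers both kinds of documents:

```javascript
// Sibling aggregations in one request: 'with_token_1' buckets the
// documents that have token_1; 'without_token_1' catches those that
// lack it. Both compute the same sum over amount.
const body = {
  query: { match: { userId: 'asdf' } },
  aggs: {
    with_token_1: {
      terms: { field: 'token_1' },
      aggs: { sumAmt: { sum: { field: 'amount' } } },
    },
    without_token_1: {
      missing: { field: 'token_1' },
      aggs: { sumAmt: { sum: { field: 'amount' } } },
    },
  },
};
```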
I am really new to the Elasticsearch world.
Let's say I have a nested aggregation on two fields : field1 and field2 :
{
...
aggs: {
field1: {
terms: {
field: 'field1'
},
aggs: {
field2: {
terms: {
field: 'field2'
}
}
}
}
}
}
This piece of code works perfectly and gives me something like this :
aggregations: {
field1: {
buckets: [{
key: "foo",
doc_count: 123456,
field2: {
buckets: [{
key: "bar",
doc_count: 34323
},{
key: "baz",
doc_count: 10
},{
key: "foobar",
doc_count: 36785
},
...
]
},{
key: "fooOO",
doc_count: 423424,
field2: {
buckets: [{
key: "bar",
doc_count: 35
},{
key: "baz",
doc_count: 2435453
},
...
]
},
...
]
}
}
Now, my need is to exclude all aggregation results where doc_count is less than 1000, for instance, and get this instead:
aggregations: {
field1: {
buckets: [{
key: "foo",
doc_count: 123456,
field2: {
buckets: [{
key: "bar",
doc_count: 34323
},{
key: "foobar",
doc_count: 36785
},
...
]
},{
key: "fooOO",
doc_count: 423424,
field2: {
buckets: [{
key: "baz",
doc_count: 2435453
},
...
]
},
...
]
}
}
Is it possible to express this need in the query body, or do I have to perform the filtering in the calling layer (in JavaScript, in my case)?
Thanks in advance
Next time, M'sieur Toph': RTFM!!!
I feel really dumb: I found the answer in the manual, 30 seconds after asking.
I'm not removing my question because it can help someone, who knows...
Here is the answer:
You can specify the min_doc_count property in the terms aggregation.
It gives you :
{
...
aggs: {
field1: {
terms: {
field: 'field1',
min_doc_count: 1000
},
aggs: {
field2: {
terms: {
field: 'field2',
min_doc_count: 1000
}
}
}
}
}
}
You can also specify a different minimum count for each level of your aggregation.
What else? :)
I am looking for a way to do this. I need to show all experts inside the users mapping (experts are documents with the field role equal to 3). But while showing the experts, I need to show experts having "Linkedin" inside their social medias (social_medias is an array field in the users mapping) first, and those without "Army" afterwards. For example, I have these documents:
[
{
role: 3,
name: "David",
social_medias: ["Twitter", "Facebook"]
},
{
role: 3,
name: "James",
social_medias: ["Facebook", "Linkedin"]
},
{
role: 3,
name: "Michael",
social_medias: ["Linkedin", "Facebook"]
},
{
role: 3,
name: "Peter",
social_medias: ["Facebook"]
},
{
role: 3,
name: "John",
social_medias: ["Facebook", "Twitter"]
},
{
role: 2,
name: "Babu",
social_medias: ["Linkedin", "Facebook"]
}
]
So, I want to get documents with role 3, and while fetching them, documents having "Linkedin" in social_medias should come first. So, the output after the query should be in this order:
[
{
role: 3,
name: "James",
social_medias: ["Facebook", "Linkedin"]
},
{
role: 3,
name: "Michael",
social_medias: ["Linkedin", "Facebook"]
},
{
role: 3,
name: "David",
social_medias: ["Twitter", "Facebook"]
},
{
role: 3,
name: "Peter",
social_medias: ["Facebook"]
},
{
role: 3,
name: "John",
social_medias: ["Facebook", "Twitter"]
}
]
I am trying function_score now. I can give a column more priority in function_score, but I can't figure out how to specify condition-based priority.
Why not let the default sorting in ES (sort by score) do the job for you, without custom ordering or custom scoring:
GET /my_index/media/_search
{
"query": {
"filtered": {
"query": {
"bool": {
"should": [
{"match": {"social_medias": "Linkedin"}},
{"match_all": {}},
{"query_string": {
"default_field": "social_medias",
"query": "NOT Army"
}}
]
}
},
"filter": {
"term": {
"role": "3"
}
}
}
}
}
The query above filters for "role": "3", and then in a should clause it basically says: if a document matches the social_medias field with the value Linkedin, give it a score based on that match. To also include all the other documents that don't match Linkedin, add another should clause with match_all. Now everything that matches match_all gets a score; if a document also matches Linkedin, it gets an additional score, making it score higher and appear first in the list of results.
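The scoring logic of that should clause can be paraphrased in plain JavaScript (a toy illustration of the ranking effect, not how Lucene actually computes scores):

```javascript
// Toy restatement of the scoring: a base score for everyone (the
// match_all clause), plus a bonus when social_medias contains 'Linkedin'
// (the should-match clause). Higher total score sorts first.
function score(doc) {
  const base = 1;                                               // match_all contribution
  const bonus = doc.social_medias.includes('Linkedin') ? 1 : 0; // should-match contribution
  return base + bonus;
}

const experts = [
  { name: 'David', social_medias: ['Twitter', 'Facebook'] },
  { name: 'James', social_medias: ['Facebook', 'Linkedin'] },
  { name: 'Peter', social_medias: ['Facebook'] },
];
const ordered = experts.slice().sort((a, b) => score(b) - score(a)).map(d => d.name);
console.log(ordered); // → ['James', 'David', 'Peter']
```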
I have a collection of documents which belong to a few authors:
[
{ id: 1, author_id: 'mark', content: [...] },
{ id: 2, author_id: 'pierre', content: [...] },
{ id: 3, author_id: 'pierre', content: [...] },
{ id: 4, author_id: 'mark', content: [...] },
{ id: 5, author_id: 'william', content: [...] },
...
]
I'd like to retrieve and paginate a distinct selection of the best-matching documents based upon the author's id:
[
{ id: 1, author_id: 'mark', content: [...], _score: 100 },
{ id: 3, author_id: 'pierre', content: [...], _score: 90 },
{ id: 5, author_id: 'william', content: [...], _score: 80 },
...
]
Here's what I'm currently doing (pseudo-code):
unique_docs = res.results.to_a.uniq{ |doc| doc.author_id }
The problem is pagination: how do I select 20 "distinct" documents?
Some people point to terms facets, but I'm not actually building a tag cloud:
Distinct selection with CouchDB and elasticsearch
http://elasticsearch-users.115913.n3.nabble.com/Getting-Distinct-Values-td3830953.html
Thanks,
Adit
As ElasticSearch does not at present provide a group_by equivalent, here's my attempt to do it manually.
While the ES community works on a direct solution to this problem (probably a plugin), here's a basic approach that works for my needs.
Assumptions:
- I'm looking for relevant content.
- I've assumed the first 300 docs are relevant, so I restrict my search to this selection, regardless of how many of them come from the same few authors.
- For my needs I didn't really need full pagination; a "show more" button updated through Ajax was enough.
Drawbacks:
- Results are not precise: since we take 300 docs at a time, we don't know how many unique docs will come out (they could even be 300 docs from the same author!). You should check whether this fits your average number of docs per author, and probably consider a limit.
- You need to do 2 queries (paying the remote-call cost twice): the first query asks for the 300 most relevant docs with just these fields: id & author_id; the second retrieves the full docs for the paginated ids.
Here's some ruby pseudo-code: https://gist.github.com/saxxi/6495116
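For reference, the dedupe-then-paginate step from that pseudo-code could look like this in JavaScript (function name and page size are my own choices, assuming the first query returned id/author_id pairs already sorted by relevance):

```javascript
// Keep only the first (best-scoring) doc per author, then slice a page.
function uniqueByAuthor(docs, page = 0, perPage = 20) {
  const seen = new Set();
  const unique = [];
  for (const doc of docs) {
    if (!seen.has(doc.author_id)) {
      seen.add(doc.author_id);
      unique.push(doc);
    }
  }
  return unique.slice(page * perPage, (page + 1) * perPage);
}

const docs = [
  { id: 1, author_id: 'mark' },
  { id: 3, author_id: 'pierre' },
  { id: 4, author_id: 'mark' },
  { id: 5, author_id: 'william' },
];
console.log(uniqueByAuthor(docs).map(d => d.id)); // → [1, 3, 5]
```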
The 'group_by' issue has since been addressed; you can use this feature as of Elasticsearch 1.3.0 (#6124).
If you search with the following query,
{
"aggs": {
"user_count": {
"terms": {
"field": "author_id",
"size": 0
}
}
}
}
you will get this result:
{
"took" : 123,
"timed_out" : false,
"_shards" : { ... },
"hits" : { ... },
"aggregations" : {
"user_count" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "mark",
"doc_count" : 87350
}, {
"key" : "pierre",
"doc_count" : 41809
}, {
"key" : "william",
"doc_count" : 24476
} ]
}
}
}