Unique value from each bucket in Elasticsearch

I have a sample database of 1000 bank accounts.
{"account_number":1,"balance":39225,...,"state":"IL"}
What I want is a list of the highest-balance accounts in each state. Using a terms aggregation, I only get the count of accounts per state,
e.g.
"aggregations" : {
"states" : {
"buckets" : [ {
"key" : "tx",
"doc_count" : 30
}, ....
But this doesn't return the required list. Any suggestions?

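Use a max aggregation, nested under the terms aggregation so that the maximum balance is computed per state:
{
"aggs" : {
"states" : {
"terms" : { "field" : "state" },
"aggs" : {
"max_balance" : { "max" : { "field" : "balance" } }
}
}
}
}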
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-max-aggregation.html
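A max aggregation reports only a number. If the account document itself is needed, one option is a top_hits sub-aggregation sorted by balance. The following is a minimal sketch, assuming the fields are named state and balance as in the sample document. The aggregation names (states, top_account), the helper top_account_per_state and the sample response are made up for illustration and are not real Elasticsearch output.

```python
# Request body: one bucket per state, each keeping its single highest-balance hit.
# Assumption: "state" can be aggregated on and "balance" is numeric.
QUERY = {
    "size": 0,  # no top-level hits, only aggregations
    "aggs": {
        "states": {
            # terms returns 10 buckets by default; raise it to cover all states
            "terms": {"field": "state", "size": 100},
            "aggs": {
                "top_account": {
                    "top_hits": {
                        "size": 1,
                        "sort": [{"balance": {"order": "desc"}}],
                    }
                }
            },
        }
    },
}


def top_account_per_state(response):
    """Map each state key to the _source of its highest-balance account."""
    result = {}
    for bucket in response["aggregations"]["states"]["buckets"]:
        hits = bucket["top_account"]["hits"]["hits"]
        if hits:
            result[bucket["key"]] = hits[0]["_source"]
    return result


# Hand-written response in the shape a terms + top_hits request returns.
sample_response = {
    "aggregations": {
        "states": {
            "buckets": [
                {
                    "key": "tx",
                    "doc_count": 30,
                    "top_account": {
                        "hits": {"hits": [
                            {"_source": {"account_number": 7, "balance": 49000}}
                        ]}
                    },
                },
            ]
        }
    }
}

print(top_account_per_state(sample_response))
```

The body in QUERY would be sent to the search endpoint of the accounts index; only the parsing helper runs locally here.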

You could also look at the significant terms aggregation: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html
It generates buckets for terms that are unusually frequent in the matched documents, so it may be worth exploring.

Related

ElasticSearch search, get unique categories of returned products

In an eshop with thousands of products we have a searchbar at the top. The expected output of the search is a list of categories in which there are products matching the query.
For example searching for 'iphone' should return a list of categories where there are products with that keyword.
e.g.
- Mobile phones
- Batteries for phones
- Case for phones
- etc.
What I did is search through the products index for the keyword, then get the results, pluck the category_id of each product, remove duplicates and do a /_mget in the categories index with the ids I should display.
This seems inefficient, however, since the first search might return 10k results (if the query is too generic), which I then loop through to collect each category_id.
Any ideas on how to make this more efficient?
Take a look into Elasticsearch Aggregations. https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
A good place to start is the Terms Aggregation, which is a bucket aggregation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
An example:
GET /_search
{
"query": {...},
"aggs" : {
"categories" : {
"terms" : { "field" : "category_name" }
}
}
}
The response should look something like this, with each field value and its document count placed in a bucket:
{
...
"aggregations" : {
"categories" : {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets" : [
{
"key" : "Mobile phones",
"doc_count" : 6
},
{
"key" : "Batteries for phones",
"doc_count" : 3
},
{
"key" : "Cases for phones",
"doc_count" : 2
}
]
}
}
}
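To replace the pluck-and-deduplicate loop from the question, the aggregation can run on category_id and its bucket keys can feed the /_mget call directly. This is a rough sketch under assumptions: the product field searched ("name") is a guess, and build_category_query, category_ids and mget_body are hypothetical helper names.

```python
def build_category_query(keyword, max_categories=100):
    """Search body that returns no product hits, only one bucket per category_id."""
    return {
        "size": 0,  # skip the (possibly 10k) product hits
        "query": {"match": {"name": keyword}},  # "name" is an assumed field
        "aggs": {
            "categories": {
                "terms": {"field": "category_id", "size": max_categories}
            }
        },
    }


def category_ids(response):
    """Bucket keys are already unique, so no client-side deduplication is needed."""
    buckets = response["aggregations"]["categories"]["buckets"]
    return [bucket["key"] for bucket in buckets]


def mget_body(ids):
    """Body for an _mget request against the categories index."""
    return {"ids": ids}


# Hand-written response for illustration only.
sample_response = {
    "aggregations": {
        "categories": {
            "buckets": [
                {"key": 12, "doc_count": 6},
                {"key": 40, "doc_count": 3},
            ]
        }
    }
}

print(mget_body(category_ids(sample_response)))
```

Setting size to 0 keeps the response small, because only the buckets travel over the wire.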

Elasticsearch plugin to classify documents

Is there an elasticsearch plugin out there that would allow me to classify the documents that I enter in an index?
The best solution for me would be a classification of the most frequent terms (or concepts), displayed as a kind of tag cloud that the user can navigate.
Is there a way to achieve this? Any suggestions?
Thanks
The basic idea is to use a terms aggregation, which will yield one bucket per term.
POST /_search
{
"aggs" : {
"genres" : {
"terms" : { "field" : "genre" }
}
}
}
The response you'll get will be ordered by decreasing document count per term:
{
...
"aggregations" : {
"genres" : {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets" : [
{
"key" : "jazz",
"doc_count" : 10
},
{
"key" : "rock",
"doc_count" : 5
},
{
"key" : "electronic",
"doc_count" : 2
}
]
}
}
}
If you're using Kibana, you can directly create a tag cloud visualization based on those terms.
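Outside Kibana, the bucket counts can be mapped to tag sizes on the client. The sketch below assumes a simple linear scale; the function name tag_sizes and the size range are illustrative choices, not part of any Elasticsearch or Kibana API.

```python
def tag_sizes(buckets, min_size=10.0, max_size=30.0):
    """Scale each bucket's doc_count linearly into a font size."""
    if not buckets:
        return {}
    counts = [bucket["doc_count"] for bucket in buckets]
    low, high = min(counts), max(counts)
    span = high - low
    sizes = {}
    for bucket in buckets:
        if span == 0:
            # All terms are equally frequent: give them the same size.
            sizes[bucket["key"]] = min_size
        else:
            ratio = (bucket["doc_count"] - low) / span
            sizes[bucket["key"]] = min_size + ratio * (max_size - min_size)
    return sizes


# Buckets copied from the example response above.
genre_buckets = [
    {"key": "jazz", "doc_count": 10},
    {"key": "rock", "doc_count": 5},
    {"key": "electronic", "doc_count": 2},
]

print(tag_sizes(genre_buckets))
```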

How to get aggregated term vector in elasticsearch?

I am new to Elasticsearch. I am trying to get the total word frequency count across a set of documents, but I cannot figure out how. I know an aggregation can give me a document count per term, and with a term vector I can find the frequency of a term in a single document, but how do I find the total frequency of terms across a set of documents?
Term vector for a single document:
GET /test/product/3/_termvector
Aggregated document count:
GET /test/product/_search?pretty=true
{
"size" : 0,
"query" : {
"match_all" : {}
},
"aggs" : {
"phrases" : {
"terms" : {
"field" : "title",
"size" : 10000
}
}
}
}
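One client-side possibility is to fetch the term vector of each document in the set and sum the per-document term_freq values. The sketch below assumes each response has the usual term vector shape (term_vectors, then the field name, then terms). The helper name total_term_frequencies and the sample responses are made up for illustration. Fetching one term vector per document may be slow for large sets.

```python
from collections import Counter


def total_term_frequencies(termvector_responses, field="title"):
    """Sum term_freq for every term across several term vector responses."""
    totals = Counter()
    for response in termvector_responses:
        field_vector = response.get("term_vectors", {}).get(field, {})
        for term, stats in field_vector.get("terms", {}).items():
            totals[term] += stats["term_freq"]
    return totals


# Hand-written responses for two documents, for illustration only.
responses = [
    {"term_vectors": {"title": {"terms": {
        "red": {"term_freq": 2}, "shoe": {"term_freq": 1}}}}},
    {"term_vectors": {"title": {"terms": {
        "red": {"term_freq": 1}, "hat": {"term_freq": 3}}}}},
]

print(total_term_frequencies(responses).most_common())
```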

Elasticsearch: filter or get value out of aggregation bucket

In Elasticsearch I have an index containing documents with a timestamp and the number of observed requests to a webservice.
I would like to perform an aggregation to get, for each day, the hour where the maximum number of requests were observed (peak hour).
I managed to get the result with the following request:
{
"aggregations" : {
"week_summary" : {
"filter" : {"range": {"@timestamp": {"gte": "2015-01-20||-7d","lte": "2015-01-20"}}},
"aggregations" : {
"oneday_interval" : {
"date_histogram" : {"field" : "@timestamp", "interval" : "1d","order" : { "_key" : "desc" }},
"aggregations" : {
"peak_hour_histogram" : {
"date_histogram" : {"field" : "@timestamp", "interval" : "1h","order" : { "peak_request_count.value" : "desc" }},
"aggregations" : {
"peak_request_count" : {
"sum" : { "field" : "request_count"}
}
}
}
}
}
}
}
},
"size" : 0
}
This works in a sense: because a date histogram can be sorted on a sub-aggregation value, the first item in the peak_hour_histogram buckets array is indeed the peak hour.
However, I don't need the other bucket items (i.e. the other 23 hours of the day) and would like to receive only the first one. I tried using top_hits, without success.
Do you know a way to perform this filtering?
NB: In the real use case my aggregation returns about 3MB of data, so filtering out those useless values matters.
Thanks for your answers.
I think this is the feature that would answer your requirement: https://github.com/elasticsearch/elasticsearch/issues/6704 (it started from this one: https://github.com/elasticsearch/elasticsearch/issues/7103)
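On Elasticsearch versions that ship pipeline aggregations, a max_bucket sibling aggregation may get close to this: it reports the key and value of the hourly bucket with the largest sum, and the filter_path request parameter can then trim the hourly buckets from the response. The following is only a sketch under those assumptions. The name peak_hour and the helpers day_aggs and peak_hours are invented here, and the sample response is hand-written.

```python
def day_aggs(timestamp_field):
    """Sub-aggregations to place under the daily date_histogram."""
    return {
        "peak_hour_histogram": {
            "date_histogram": {"field": timestamp_field, "interval": "1h"},
            "aggs": {"peak_request_count": {"sum": {"field": "request_count"}}},
        },
        # Sibling pipeline aggregation: picks the hourly bucket with the largest sum.
        "peak_hour": {
            "max_bucket": {
                "buckets_path": "peak_hour_histogram>peak_request_count"
            }
        },
    }


# Value for the filter_path query-string parameter: keep only each day's key
# and its peak_hour result, so the 24 hourly buckets are not returned.
FILTER_PATH = ",".join([
    "aggregations.week_summary.oneday_interval.buckets.key_as_string",
    "aggregations.week_summary.oneday_interval.buckets.peak_hour",
])


def peak_hours(response):
    """Map each day to (peak hour key, summed request count)."""
    days = response["aggregations"]["week_summary"]["oneday_interval"]["buckets"]
    result = {}
    for day in days:
        peak = day["peak_hour"]
        if peak.get("keys"):
            result[day["key_as_string"]] = (peak["keys"][0], peak["value"])
    return result


# Hand-written, already filtered response for illustration only.
sample_response = {"aggregations": {"week_summary": {"oneday_interval": {"buckets": [
    {"key_as_string": "2015-01-20T00:00:00.000Z",
     "peak_hour": {"keys": ["2015-01-20T14:00:00.000Z"], "value": 420.0}},
]}}}}

print(peak_hours(sample_response))
```

With max_bucket in place, the order clause on the hourly histogram is no longer needed, since the peak is computed separately.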

elasticsearch offset and limit facets

I'm trying to make a search that both limits and "offsets" (the "from" keyword in Elasticsearch) the facet result set, so something like:
'{
"query" : {
"nested" : {
"_scope" : "my_scope",
"path" : "related_award_vendors",
"score_mode" : "avg",
"query" : {
"bool" : {
"must" : {
"text" : {"related_award_vendors.title" : "inc"}
}
}
}
}
},
"facets" : {
"facet1" : {
"terms_stats" : {
"key_field" : "related_award_vendors.django_id",
"value_field" : "related_award_vendors.award_amount",
"order":"term",
"size": 5,
"from":2
},
"scope" : "my_scope" }
}
}'
In the above, it returns IDs 1,2,3,4,5, and if I remove "from" it still returns 1,2,3,5 in the result set.
The "size" parameter is working correctly. In this case, it returns five items in the result set.
My understanding is that Solr can do this. Can this be done in Elasticsearch?
The terms stats facet doesn't support the from parameter. The only way to achieve what you want is to set size to size + offset and ignore the first offset entries on the client side. In your example, that means requesting 7 entries and ignoring the first 2.
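The client-side part of that workaround is small. A minimal sketch follows, with hypothetical helper names (request_size, page_of_entries) and a hand-written list standing in for the facet entries:

```python
def request_size(offset, size):
    """Number of entries to ask the facet for: the skipped ones plus the page."""
    return offset + size


def page_of_entries(entries, offset, size):
    """Drop the first `offset` entries and keep at most `size` of the rest."""
    return entries[offset:offset + size]


# Stand-in for the entries returned when "size" is set to request_size(2, 5).
entries = [{"term": term_id} for term_id in range(1, 8)]

page = page_of_entries(entries, offset=2, size=5)
print([entry["term"] for entry in page])
```

The cost grows with the offset, since every skipped entry is still computed and transferred.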
