How to get aggregated term vector in elasticsearch?

I am new to elasticsearch. I am trying to get the total word frequency count for a set of documents, but I cannot seem to figure out how to do it in elasticsearch. I know there is document-count functionality using aggregations, and with a term vector I can find the frequency of a term in a single document, but what about finding the total frequency of terms across a set of documents?
Term vector for a single document:
GET /test/product/3/_termvector
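For context, that call can also take a body asking for explicit term statistics (including ttf, the total term frequency across the index); a minimal sketch, assuming the title field used in the aggregation below:
GET /test/product/3/_termvector
{
  "fields" : ["title"],
  "term_statistics" : true,
  "field_statistics" : true,
  "positions" : false,
  "offsets" : false
}
This still reports statistics per term of a single document, though, not an aggregate over a set of documents.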
Aggregated document count:
GET /test/product/_search?pretty=true
{
  "size" : 0,
  "query" : {
    "match_all" : {}
  },
  "aggs" : {
    "phrases" : {
      "terms" : {
        "field" : "title",
        "size" : 10000
      }
    }
  }
}

Related

Total count of all the tokens

Is it possible to get the ttf (total term frequency) for all the tokens from a field in all the shards for a given index?
e.g. I have:
PUT /index/type/1
{
  "sentence": "delicious cake"
}
PUT /index/type/2
{
  "sentence": "horrible cake"
}
I want to get:
cake 2
horrible 1
delicious 1
Also, is it possible to do it for multiple fields (let's say I'd have sentence1 and sentence2 and I'd like to run such a count on the concatenation of them)?
I know termvectors give the ttf and that mtermvectors can do it for multiple documents (see the sketch below), but then I'd have to go through all the documents and handle the results myself somehow.
Actually, only the top K terms would be sufficient for me, as long as I can control K.
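(For reference, the mtermvectors call mentioned above would look roughly like the following; the ids are just the two example documents, and the ttf values in the response would still have to be collected and merged client-side.)
POST /index/type/_mtermvectors
{
  "ids" : ["1", "2"],
  "parameters" : {
    "fields" : ["sentence"],
    "term_statistics" : true
  }
}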
If your 'sentence' field is analyzed, you can get the TTF with a terms facet:
POST /index/type/_search
{
  "query" : {
    "match_all" : {}
  },
  "facets" : {
    "sentence" : {
      "terms" : {
        "field" : "sentence",
        "size" : 10
      }
    }
  }
}
The TTF will be in the facets part of the response.
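For the two example documents above, the relevant part of the response would look roughly like this (a sketch; bookkeeping fields are omitted):
"facets" : {
  "sentence" : {
    "_type" : "terms",
    "terms" : [
      { "term" : "cake", "count" : 2 },
      { "term" : "horrible", "count" : 1 },
      { "term" : "delicious", "count" : 1 }
    ]
  }
}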
You can also pass an array of fields (["sentence", "sentence2"]) to count the TTF across multiple fields:
POST /index/type/_search
{
  "query" : {
    "match_all" : {}
  },
  "facets" : {
    "multiple_sentence" : {
      "terms" : {
        "fields" : ["sentence", "sentence2"],
        "size" : 10
      }
    }
  }
}

Elasticsearch: difference between include & filter in aggregation query

My gender data contains male, female, and unknown.
I want to know the difference between the following two queries and how each is computed.
{
  "aggs" : {
    "data" : {
      "filter" : { "term" : { "gender" : "male" } },
      "aggs" : {
        "data_aggs" : {
          "terms" : {
            "field" : "gender"
          }
        }
      }
    }
  }
}
And
{
  "aggs" : {
    "data" : {
      "terms" : {
        "field" : "gender",
        "include" : "male"
      }
    }
  }
}
In your first aggregation, the filter will select only the subset of documents whose gender field is exactly male. Your data aggregation will then be run only on the selected documents.
Your second aggregation will be run on all the documents matched by your query and then the terms aggregation will only return buckets whose key matches male.
In the first case, the aggregation pre-filters the data before running. In the second case, the aggregation filters the data on the fly, but it still works on all documents, since it has to retrieve the gender field in every document to know whether the value needs to be aggregated or not. It goes without saying that the first aggregation should be more performant than the second, especially if your document base is massive.
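As an aside, the include value is treated as a regular expression, so it can keep more than one bucket key at once; a small sketch, not taken from the original question:
{
  "aggs" : {
    "data" : {
      "terms" : {
        "field" : "gender",
        "include" : "male|female"
      }
    }
  }
}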

Elasticsearch: filter or get value out of aggregation bucket

In Elasticsearch I have an index containing documents with a timestamp and the number of observed requests to a webservice.
I would like to perform an aggregation to get, for each day, the hour in which the maximum number of requests was observed (the peak hour).
I succeeded in getting the result with the following request:
{
  "aggregations" : {
    "week_summary" : {
      "filter" : { "range" : { "#timestamp" : { "gte" : "2015-01-20||-7d", "lte" : "2015-01-20" } } },
      "aggregations" : {
        "oneday_interval" : {
          "date_histogram" : { "field" : "#timestamp", "interval" : "1d", "order" : { "_key" : "desc" } },
          "aggregations" : {
            "peak_hour_histogram" : {
              "date_histogram" : { "field" : "#timestamp", "interval" : "1h", "order" : { "peak_request_count.value" : "desc" } },
              "aggregations" : {
                "peak_request_count" : {
                  "sum" : { "field" : "request_count" }
                }
              }
            }
          }
        }
      }
    }
  },
  "size" : 0
}
This works, in a sense: the first item in the peak_hour_histogram buckets array does indeed correspond to the peak hour, thanks to the ability to sort a date histogram on a sub-aggregation value.
Nevertheless, I don't need all the other bucket items (i.e. the other 23 hours of the day), and I'd like to receive only the first item. I tried to play with top_hits without any success.
Do you know a way to perform this filtering?
NB: in the real use case my aggregation returns about 3 MB of data, so filtering out all those useless values becomes important.
Thanks for your answers.
I think this is the feature that should answer your requirement: https://github.com/elasticsearch/elasticsearch/issues/6704. It started from this one: https://github.com/elasticsearch/elasticsearch/issues/7103
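As an illustration of where those issues ended up, later Elasticsearch versions offer pipeline aggregations; a sketch of how a max_bucket sibling aggregation could pick the peak hour per day, assuming such a version is available (field names taken from the question):
"aggregations" : {
  "oneday_interval" : {
    "date_histogram" : { "field" : "#timestamp", "interval" : "1d" },
    "aggregations" : {
      "hourly" : {
        "date_histogram" : { "field" : "#timestamp", "interval" : "1h" },
        "aggregations" : {
          "requests" : { "sum" : { "field" : "request_count" } }
        }
      },
      "peak_hour" : {
        "max_bucket" : { "buckets_path" : "hourly>requests" }
      }
    }
  }
}
The hourly buckets are still returned alongside the max_bucket result, but the response can usually be trimmed further with the filter_path response-filtering parameter.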

unique value from each bucket in elastic search

I have a sample database of 1000 bank accounts.
{"account_number":1,"balance":39225,...,"state":"IL"}
What I want is a list of the highest-balance accounts in each state. Using a terms aggregation I received the count of accounts in each state.
eg.
"aggregations" : {
"states" : {
"buckets" : [ {
"key" : "tx",
"doc_count" : 30
}, ....
But this doesn't return the required list. Any suggestions?
Use a max aggregation:
{
  "aggs" : {
    "max_price" : { "max" : { "field" : "price" } }
  }
}
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-max-aggregation.html
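To tie this to the per-state buckets from the question, the max metric would typically be nested under the states terms aggregation; a sketch, assuming the index is named bank and the documents look like the sample above:
POST /bank/account/_search
{
  "size" : 0,
  "aggs" : {
    "states" : {
      "terms" : { "field" : "state" },
      "aggs" : {
        "max_balance" : { "max" : { "field" : "balance" } }
      }
    }
  }
}
If the whole account document is needed rather than just the value, a top_hits sub-aggregation sorted by balance descending with size 1 can be used in place of the max metric.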
You should also look at the significant terms aggregation (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html), which generates buckets of related terms. Explore it.

elasticsearch offset and limit facets

I'm trying to make a search that both limits and "offsets" (the from keyword in elasticsearch) the facet result set, something like:
'{
  "query" : {
    "nested" : {
      "_scope" : "my_scope",
      "path" : "related_award_vendors",
      "score_mode" : "avg",
      "query" : {
        "bool" : {
          "must" : {
            "text" : { "related_award_vendors.title" : "inc" }
          }
        }
      }
    }
  },
  "facets" : {
    "facet1" : {
      "terms_stats" : {
        "key_field" : "related_award_vendors.django_id",
        "value_field" : "related_award_vendors.award_amount",
        "order" : "term",
        "size" : 5,
        "from" : 2
      },
      "scope" : "my_scope"
    }
  }
}'
In the above, it returns ids 1, 2, 3, 4, 5, and if I remove "from" it still returns 1, 2, 3, 4, 5 in the result set.
The "size" is working correctly. In this case, it's returning five items in the result set.
My understanding is that solr can do this. Can this be done in elasticsearch?
The terms_stats facet doesn't support the from parameter. The only way to achieve what you want is to set size to size + offset and ignore the first offset entries on the client side. In your example that would mean requesting 7 entries and ignoring the first 2.
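Concretely, the facet part of the request above would become something like this (a sketch of the workaround, not a new feature), and the first 2 entries would then be dropped in the client code:
"facets" : {
  "facet1" : {
    "terms_stats" : {
      "key_field" : "related_award_vendors.django_id",
      "value_field" : "related_award_vendors.award_amount",
      "order" : "term",
      "size" : 7
    },
    "scope" : "my_scope"
  }
}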
