How to get the most used words in Elasticsearch? - elasticsearch

I am using a terms aggregation in Elasticsearch to get the most used words in an index with 380607390 (about 380 million) documents, and my application receives a timeout.
The aggregated field is a text field with the simple analyzer (it holds post content).
My question is:
Is the terms aggregation the correct aggregation for this, given such a large content field?
{
  "aggs" : {
    "keywords" : {
      "terms" : { "field" : "post_content" }
    }
  }
}

You can try this using min_doc_count. You would of course not want the words that have been used only once, twice or thrice.
You can set min_doc_count as per your requirement. This should reduce the time.
{
  "aggs" : {
    "keywords" : {
      "terms" : {
        "field" : "post_content",
        "min_doc_count": 5 // set it as per your need
      }
    }
  }
}
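To make the effect of min_doc_count concrete, here is a minimal Python sketch of what the terms aggregation counts: the number of documents that contain each term, not the total number of occurrences. It does not talk to a cluster. The function name and sample posts are made up for illustration, and the tokenizer is only a rough stand-in for the simple analyzer (lowercase, split on non-letters, ASCII only).

```python
import re
from collections import Counter


def top_terms(docs, size=10, min_doc_count=1):
    """Return (term, doc_count) pairs, most frequent first."""
    doc_counts = Counter()
    for text in docs:
        # Rough stand-in for the simple analyzer: lowercase, split on non-letters.
        tokens = re.findall(r"[a-z]+", text.lower())
        # A term counts once per document, however often it repeats inside it.
        doc_counts.update(set(tokens))
    # Drop terms that appear in fewer than min_doc_count documents.
    kept = [(t, c) for t, c in doc_counts.items() if c >= min_doc_count]
    # Highest doc_count first; ties broken alphabetically for a stable result.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept[:size]


posts = ["Hello hello world", "hello Elasticsearch", "world of search"]
print(top_terms(posts, min_doc_count=2))  # [('hello', 2), ('world', 2)]
```

Note that "hello" appears three times in the sample posts but only in two documents, so its count is 2.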

Related

How can we do a case-insensitive cardinality aggregation?

We can use cardinality to get a distinct count on a field, but the cardinality aggregation is case-sensitive: if we have emails like user#x.com, User#x.com and USER#x.com, these count as 3 emails, whereas I need them to count as a single email.
This is the aggregation I am using:
"aggs" : {
  "emails" : {
    "cardinality" : {
      "field" : "emails.keyword"
    }
  }
}
I would need something like:
"aggs" : {
  "emails" : {
    "cardinality" : {
      "field" : "emails.keyword",
      "casesensitive": false ????
    }
  }
}
How can we make a cardinality aggregation case-insensitive?
Although I would go with Val's suggestion, here is a query that may be useful if you do not have control of the mapping. It uses a custom script in the cardinality aggregation.
Aggregation Query:
POST <your_index_name>/_search
{
  "size":0,
  "aggs":{
    "email_count":{
      "cardinality":{
        "script":{
          "source":"doc['emails.keyword'].toString().toLowerCase()"
        }
      }
    }
  }
}
Note that you would find more details on Scripting in the aforementioned link.
Hope this helps!
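As a rough illustration of what the script changes, the Python sketch below lowercases each value before counting distinct entries, and builds the same request body for an arbitrary field. Both function names are hypothetical, and the set-based count is exact, whereas the cardinality aggregation returns an approximate count.

```python
def distinct_case_insensitive(values):
    """Exact distinct count after lowercasing every value."""
    return len({v.lower() for v in values})


def cardinality_request(field):
    """Search body with a scripted cardinality aggregation on `field`."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "aggs": {
            "email_count": {
                "cardinality": {
                    "script": {
                        # Lowercase the value so casing differences collapse.
                        "source": f"doc['{field}'].toString().toLowerCase()"
                    }
                }
            }
        },
    }


emails = ["user#x.com", "User#x.com", "USER#x.com", "other#x.com"]
print(distinct_case_insensitive(emails))  # 2
print(cardinality_request("emails.keyword"))
```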

How to apply aggregations on grouped fields in Elasticsearch?

On my eCommerce store I want to only include the first item in each group (grouped by item_id) in the final results. At the same time I don't want to lose my aggregations (little numbers next to attributes that indicate how many items with that attribute are found).
Here is a little example:
Suppose I make a search for items and only 25 show up. This is the result for the color aggregation that I currently get:
black (65)
green (32)
white (13)
And I want it to be:
black (14)
green (6)
white (5)
The numbers should amount to the total number the user actually sees on the page.
How could I achieve that with Elasticsearch? I have tried both Grouping (Top Hits) and Field Collapsing and both don't seem to fit my use case. Solr does it almost by default with its Grouping functionality.
It should be rather easy. When you ask for an aggregation, you simply send a request to the _search endpoint. Example:
POST /exams/_search
{
  "aggs" : {
    "avg_grade" : { "avg" : { "field" : "grade" } }
  }
}
In the above example you will get the aggregation over all the documents.
If you want the aggregation over specific documents only, add a query to the request body, like:
POST /exams/_search
{
  "query": {
    "bool" : {
      "must" : {
        "query_string" : {
          "query" : "some query string here"
        }
      },
      "filter" : {
        "term" : { "user" : "kimchy" }
      }
    }
  },
  "aggs" : {
    "avg_grade" : { "avg" : { "field" : "grade" } }
  }
}
You can send the size and from parameters as well.
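To illustrate how the query, the aggregation, and paging interact, here is a small in-memory Python sketch (no cluster involved; the function and sample data are made up). The aggregation is computed over every document that matches the query, while from and size only decide which hits are returned.

```python
def search(docs, predicate, from_=0, size=10):
    """Mimic a search with a query, an avg aggregation, and paging."""
    # The query selects the documents the aggregation runs over.
    matched = [d for d in docs if predicate(d)]
    # from/size only slice the returned hits.
    hits = matched[from_:from_ + size]
    grades = [d["grade"] for d in matched]
    avg_grade = sum(grades) / len(grades) if grades else None
    return {"hits": hits, "avg_grade": avg_grade}


exams = [
    {"user": "kimchy", "grade": 80},
    {"user": "kimchy", "grade": 100},
    {"user": "other", "grade": 10},
]
result = search(exams, lambda d: d["user"] == "kimchy", size=1)
print(result)  # one hit returned, average taken over both kimchy documents
```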

Elasticsearch Switch between previous/next record from search result

I am getting results from Elasticsearch based on various filters, with pagination.
Now, when a record from the search results is opened, I need to navigate to the previous and next records in those results.
Is there a way to achieve this through Elasticsearch?
You could use the from and size parameters of the Search API.
GET /_search
{
  "from" : 0, "size" : 10,
  "query" : {
    "term" : { "user" : "kimchy" }
  }
}
or
GET /_search?from=0&size=10
{
  "query" : {
    "term" : { "user" : "kimchy" }
  }
}
Note the default value for size is 10.
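For previous/next navigation specifically, one possible approach is to remember the position of the opened record in the result list and request a single hit on either side of it. The Python sketch below only builds the request bodies. The helper name is hypothetical, and it assumes the application knows the zero-based position and the total hit count, and that the query and sort order stay the same between requests.

```python
def neighbor_requests(query, position, total_hits):
    """Build search bodies for the records before and after `position`.

    `position` is the zero-based index of the open record in the result list.
    Returns (previous_body, next_body); either is None at the list's edge.
    """
    def body(offset):
        # size=1 fetches exactly one record at the given offset.
        return {"from": offset, "size": 1, "query": query}

    previous_body = body(position - 1) if position > 0 else None
    next_body = body(position + 1) if position + 1 < total_hits else None
    return previous_body, next_body


query = {"term": {"user": "kimchy"}}
previous_body, next_body = neighbor_requests(query, position=3, total_hits=25)
print(previous_body)
print(next_body)
```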

Ordering terms aggregation buckets by sub-aggregation result values

I have two questions about the query seen on this capture:
How do I order by value in the sum_category field in the results?
I use respsize again in the query but it's not correct as you can see below.
Even if I only make an aggregation, why do all the documents come back with the result? I mean, if I make a group by query in SQL it retrieves only grouped data, but Elasticsearch retrieves all documents as if I had made a normal search query. How do I skip them?
Try this:
{
  "query" : {
    "match_all" : {}
  },
  "size" : 0,
  "aggs" : {
    "categories" : {
      "terms" : {
        "field" : "category",
        "size" : 999999,
        "order" : {
          "sum_category" : "desc"
        }
      },
      "aggs" : {
        "sum_category" : {
          "sum" : {
            "field" : "respsize"
          }
        }
      }
    }
  }
}
1) See the note in (2) for what your sort is doing. As for ordering the categories by the value of sum_category, see the order portion. There appears to be an old, closed issue related to that, https://github.com/elastic/elasticsearch/issues/4643, but it worked fine for me with v1.5.2 of Elasticsearch.
2) Although your request has no match_all query, I think that is effectively what you are getting results for, so the sort you specified is applied to those results. To not get them back, I added the size: 0 portion.
Do you want buckets for all the categories? I noticed you did not specify size for the main aggregation. That's what the size: 999999 portion is for.
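In SQL terms this is a GROUP BY category with SUM(respsize), ordered by the sum. A minimal in-memory Python sketch of the same computation, with made-up sample documents and a hypothetical function name:

```python
from collections import defaultdict


def sum_by_category(docs):
    """Group by category, sum respsize, and order by the sum, descending."""
    totals = defaultdict(int)
    for doc in docs:
        totals[doc["category"]] += doc["respsize"]
    # Same role as "order": {"sum_category": "desc"} in the aggregation.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


logs = [
    {"category": "images", "respsize": 500},
    {"category": "html", "respsize": 120},
    {"category": "images", "respsize": 300},
    {"category": "css", "respsize": 200},
]
print(sum_by_category(logs))  # [('images', 800), ('css', 200), ('html', 120)]
```

Only the grouped totals come back, which is what size: 0 achieves in the request: the buckets are returned without the matching documents.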

elasticsearch query to find documents that don't exist

Is there a way in Elasticsearch, through filters, queries, aggregations, etc., to search for a list of document ids and have it return which ids did not hit?
With a small list it is easy enough to compare the results against the requested ids list, but I'm dealing with lists of ids in the tens of thousands and it is not going to be performant to do that.
Do you mean the not filter, from https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-not-filter.html
"filtered" : {
  "query" : {
    "term" : { "name.first" : "shay" }
  },
  "filter" : {
    "not" : {
      "range" : {
        "postDate" : {
          "from" : "2010-03-01",
          "to" : "2010-04-01"
        }
      }
    }
  }
}
Take a look at the guide at https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
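Note that the filtered query and the not filter belong to older Elasticsearch releases. Newer versions express the same condition with a bool query and must_not. The Python sketch below builds such a body as a plain dict. The function name is hypothetical, and it assumes the inclusive from/to bounds map to gte/lte.

```python
def exclude_date_range(first_name, start, end):
    """Match documents by first name, excluding a postDate range."""
    return {
        "query": {
            "bool": {
                # Replaces the "query" part of the filtered query.
                "must": {"term": {"name.first": first_name}},
                # Replaces the "not" filter wrapped around the range.
                "must_not": {
                    "range": {"postDate": {"gte": start, "lte": end}}
                },
            }
        }
    }


body = exclude_date_range("shay", "2010-03-01", "2010-04-01")
print(body)
```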

Resources