Elasticsearch: How many words occurred once

I am modifying my question to be a little more generic so please humor me...
Say I have an Elasticsearch index where each document holds a word from a textbook. Is there a way I can tell how many words occurred just once, how many twice, and so on?
i.e. the result is something like this:
# words occurring once = 10,001,
twice = 503,
thrice = 807,
four times = 997,
five times = 23
Is there a way to do this in Elasticsearch?
I am not looking for "give me the top X words that occur most often" - that is easily retrieved by doing an aggregation.
Thanks!

Suppose your documents have a field word that holds a word from the textbook. Your use case can be solved with the terms bucket aggregation, which groups all occurrences of a word into one bucket. Your query would look like this:
{
  "aggs" : {
    "word_count" : {
      "terms" : { "field" : "word" }
    }
  }
}
With the following output:
{
  "aggregations" : {
    "word_count" : {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets" : [
        {
          "key" : "The",
          "doc_count" : 10
        },
        {
          "key" : "wild",
          "doc_count" : 2
        },
        {
          "key" : "fox",
          "doc_count" : 3
        }
      ]
    }
  }
}
where doc_count indicates the number of occurrences of each word.
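As a side note, the terms aggregation gives you the count per word, not the distribution (how many words occur exactly once, twice, and so on); that final tally would still happen client-side over the returned buckets. If you are mainly after the rarest words, recent Elasticsearch versions also offer a rare_terms aggregation, which only buckets terms appearing at most max_doc_count times. A minimal sketch, assuming word is mapped as a keyword and your cluster version supports rare_terms:
{
  "size": 0,
  "aggs": {
    "rare_words": {
      "rare_terms": {
        "field": "word",
        "max_doc_count": 2
      }
    }
  }
}
Counting how many returned buckets have a doc_count of exactly 1 or exactly 2 then gives the "occurred once" and "occurred twice" figures, though that counting still happens on the client.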

Related

Counting unique buckets from aggregation

I am trying to get the unique count for all labels used on a set of documents. In order to do that, and have the JSON returned in the bucket (cardinality doesn't return the JSON and the count together), I need to write a pipeline query.
My query gets me halfway there, but I'm missing the second part that counts the number of buckets a label appears in.
Here's my query:
{
  "size": 0,
  "aggs" : {
    "unique_count" : {
      "composite" : {
        "sources" : [
          { "metadataId" : { "terms" : { "field" : "document.metadata.id" } } },
          { "label" : { "terms" : { "field" : "document.label" } } }
        ]
      }
    }
  }
}
This produces
...
"buckets" : [
  {
    "key" : {
      "metadataId" : "1",
      "label" : "label one"
    },
    "doc_count" : 2
  },
  {
    "key" : {
      "metadataId" : "2",
      "label" : "label one"
    },
    "doc_count" : 1
  },
  {
    "key" : {
      "metadataId" : "3",
      "label" : "label three"
    },
    "doc_count" : 3
  }
]
...
The problem I'm facing is that each bucket is considered unique, and what I would like to return is the sum of the unique counts. For example, in the buckets above, the label "label one" appears in two buckets, so its doc_count should be 2, while "label three" should have a doc_count of 1.
After the last phase in the pipeline I'd like to see the following output:
"buckets" : [
  {
    "label" : "label one",
    "doc_count" : 2
  },
  {
    "label" : "label three",
    "doc_count" : 1
  }
]
I've tried all sorts of things, but they're just not getting me close to the output I need. Can anyone point me in the right direction?
Try nested terms aggregations, where the first-level aggregation is on label and the second level is on the metadataId field. The aggs block should look something like this:
"aggs" : {
  "labels": {
    "terms": {
      "field": "label.keyword",
      "size": 1000
    },
    "aggs": {
      "metadata": {
        "terms": {
          "field": "metadataId.keyword",
          "size": 1000
        }
      }
    }
  }
}
As output, you will get buckets of labels, with key set to the label value and doc_count to the number of docs matching that label. Each label bucket will contain nested metadataId buckets, with key set to the metadataId value and doc_count to the number of docs matching both that label and that metadataId.
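If all you ultimately need per label is the number of distinct metadataId values, rather than the individual sub-buckets, a cardinality sub-aggregation could stand in for the nested terms. A rough sketch under the same field-name assumptions as above (note that cardinality counts are approximate for very high-cardinality fields):
"aggs" : {
  "labels": {
    "terms": {
      "field": "label.keyword",
      "size": 1000
    },
    "aggs": {
      "unique_metadata": {
        "cardinality": { "field": "metadataId.keyword" }
      }
    }
  }
}
Each label bucket then carries a unique_metadata.value with the distinct metadataId count for that label.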

Elasticsearch Aggregation most common list of integers

I am looking for an Elasticsearch aggregation + mapping that will return the most common list value for a certain field.
For example for docs:
{"ToneCurvePV2012": [1,2,3]}
{"ToneCurvePV2012": [1,5,6]}
{"ToneCurvePV2012": [1,7,8]}
{"ToneCurvePV2012": [1,2,3]}
I wish for the aggregation result:
[1,2,3] (since it appears twice).
So far, every aggregation I have tried returns: 1
This is not possible with the default terms aggregation. You need to use a terms aggregation with a script. Please note that this might impact your cluster's performance.
Here I have used a script that builds a string from the array and aggregates on it. So if you have an array value like [1,2,3], the script creates its string representation '[1,2,3]', and that key is used for the aggregation.
Below is a sample query you can use to generate the aggregation you expect:
POST index1/_search
{
  "size": 0,
  "aggs": {
    "tone_s": {
      "terms": {
        "script": {
          "source": "def value='['; for(int i=0;i<doc['ToneCurvePV2012'].length;i++){value= value + doc['ToneCurvePV2012'][i] + ',';} value+= ']'; value = value.replace(',]', ']'); return value;"
        }
      }
    }
  }
}
Output:
{
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "tone_s" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "[1,2,3]",
          "doc_count" : 2
        },
        {
          "key" : "[1,5,6]",
          "doc_count" : 1
        },
        {
          "key" : "[1,7,8]",
          "doc_count" : 1
        }
      ]
    }
  }
}
PS: the key comes back as a string, not as an array, in the aggregation response.
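On newer clusters (7.11+), the same idea could also be expressed as a runtime field instead of an aggregation script; the per-document scripting cost is the same, but the aggregation itself stays a plain terms aggregation and the field can be reused elsewhere. A sketch reusing the script above, with the runtime field name tone_curve_key chosen here purely for illustration:
POST index1/_search
{
  "size": 0,
  "runtime_mappings": {
    "tone_curve_key": {
      "type": "keyword",
      "script": {
        "source": "def value='['; for(int i=0;i<doc['ToneCurvePV2012'].length;i++){value= value + doc['ToneCurvePV2012'][i] + ',';} value+= ']'; value = value.replace(',]', ']'); emit(value);"
      }
    }
  },
  "aggs": {
    "tone_s": {
      "terms": { "field": "tone_curve_key" }
    }
  }
}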

ElasticSearch search, get unique categories of returned products

In an e-shop with thousands of products, we have a search bar at the top. The expected output of the search is a list of categories that contain products matching the query.
For example searching for 'iphone' should return a list of categories where there are products with that keyword.
e.g.
- Mobile phones
- Batteries for phones
- Case for phones
- etc.
What I did is search the products index for the keyword, get the results, pluck the category_id from each product, remove duplicates, and do a /_mget on the categories index with the ids I should display.
This, however, seems inefficient, since the first search might return 10k results (if the query is too generic), which I then loop through just to collect their category_ids.
I am looking for a better way to do this. Any ideas on how to make it more efficient?
Take a look at Elasticsearch aggregations: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
A good place to start would be the terms aggregation, which is a bucket aggregation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
An example:
GET /_search
{
  "query": {...},
  "aggs" : {
    "categories" : {
      "terms" : { "field" : "category_name" }
    }
  }
}
The response should look something like this, with the field values and their counts grouped into buckets:
{
  ...
  "aggregations" : {
    "categories" : {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets" : [
        {
          "key" : "Mobile phones",
          "doc_count" : 6
        },
        {
          "key" : "Batteries for phones",
          "doc_count" : 3
        },
        {
          "key" : "Cases for phones",
          "doc_count" : 2
        }
      ]
    }
  }
}
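If you also need the full category documents (name, image, and so on) rather than just a display name, the same aggregation can run on the id field instead: it collapses the 10k product hits into a handful of distinct category ids, and a single _mget against the categories index then fetches their details. A rough sketch; the products/categories index names, the name query field, and the category_id keyword field are assumptions about your mapping:
GET /products/_search
{
  "size": 0,
  "query": { "match": { "name": "iphone" } },
  "aggs": {
    "categories": {
      "terms": { "field": "category_id", "size": 100 }
    }
  }
}

GET /categories/_mget
{
  "ids": ["<bucket key 1>", "<bucket key 2>"]
}
where the ids in the second request are the bucket keys returned by the first one.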

Elasticsearch plugin to classify documents

Is there an elasticsearch plugin out there that would allow me to classify the documents that I enter in an index?
The best solution for me would be a classification of the most recurrent terms (or concepts), displayed as a sort of tag cloud that the user can navigate.
Is there a way to achieve this? Any suggestions?
Thanks
The basic idea is to use a terms aggregation, which will yield one bucket per term.
POST /_search
{
  "aggs" : {
    "genres" : {
      "terms" : { "field" : "genre" }
    }
  }
}
The response will be ordered by decreasing number of term occurrences:
{
  ...
  "aggregations" : {
    "genres" : {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets" : [
        {
          "key" : "jazz",
          "doc_count" : 10
        },
        {
          "key" : "rock",
          "doc_count" : 5
        },
        {
          "key" : "electronic",
          "doc_count" : 2
        }
      ]
    }
  }
}
If you're using Kibana, you can directly create a tag cloud visualization based on those terms.
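If the terms you want in the cloud live in free text rather than in a structured field like genre, it may also be worth looking at the significant_text aggregation (available since 6.0), which surfaces terms that are unusually common in the documents matching a query. A rough sketch, where the text field name content is just a placeholder:
POST /_search
{
  "size": 0,
  "query": { "match": { "content": "elasticsearch" } },
  "aggs": {
    "keywords": {
      "significant_text": { "field": "content" }
    }
  }
}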

Elasticsearch, aggregation, how to count accurately in the estimated final list

The Elasticsearch (ES) terms aggregation result is approximate, both in terms of which finalists appear and their counts. https://www.elastic.co/guide/en/elasticsearch/reference/1.6/search-aggregations-bucket-terms-aggregation.html
I'd like to have accurate counts for the estimated finalists, even though the finalist list itself is approximate. I want to eliminate the per-bucket document count error.
I am thinking of issuing a second query filtered by the finalists; since I know the number of finalists, I can count them accurately if I set size = #finalists.
Using the example from the link above: after I have the top 5 products (A, Z, C, G, B) from the first aggregation result, I want to find their accurate counts:
{
  ...
  "aggregations" : {
    "products" : {
      "doc_count_error_upper_bound" : 46,
      "buckets" : [
        {
          "key" : "Product A",
          "doc_count" : 100,
          "doc_count_error_upper_bound" : 0
        },
        {
          "key" : "Product Z",
          "doc_count" : 52,
          "doc_count_error_upper_bound" : 2
        },
        ...
      ]
    }
  }
}
Since these doc_counts are only estimates, I can issue a second query filtered by the product ids:
{
  ...
  "query": {
    "filtered": {
      "filter": {
        "terms": { "product": ["Product A", "Product Z", "Product C", "Product G", "Product B"] }
      }
    }
  },
  "aggs": {
    "products": {
      "terms": {
        "field": "product",
        "size": 5,
        "shard_size": 5
      }
    }
  }
}
My questions are:
- Does this give me the correct counts for A, Z, C, G, and B?
- Is there a better way to do this inside one query, maybe with a nested aggregation?
- The parsing of aggregation results to prepare the filters is done in Java code, and it is error-prone. Is there an example of this task, or can it be done by ES?
Thanks in advance.
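Not a full answer, but worth noting: the terms aggregation documentation linked above describes shard_size as the main knob for tightening accuracy in a single pass; asking each shard for more candidate terms than the final size reduces doc_count_error_upper_bound without a second query. A sketch of the first query with a larger, arbitrarily chosen shard_size:
{
  "aggs": {
    "products": {
      "terms": {
        "field": "product",
        "size": 5,
        "shard_size": 100
      }
    }
  }
}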
