How to get distinct values of a field in ES? - elasticsearch

I am trying to calculate distinct values of a field in ES. For example if I have an index containing documents like:
{
"NAME": "XYZ",
"TITLE": "ABC"
}
{
"NAME": "RTY",
"TITLE": "BNM"
}
I want to have an output like this:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 10000,
"max_score": 1,
"hits": [
{
"_index": "record_new",
"_type": "record_new",
"_id": "AWChga1952qKS23vpN8J",
"_score": 1,
"_source": {
"TITLE":{
"ABC",
"BNM"
}
}
}]
}
}
How can I get the distinct values of the "title" field in this format? I have tried using an aggregation, but the output is very weird. Please help.

Test Data:
PUT http://localhost:9200/stackoverflow/os/1
{
"NAME": "XYZ",
"TITLE": "LINUX OS"
}
PUT http://localhost:9200/stackoverflow/os/2
{
"NAME": "XYZ",
"TITLE": "WINDOWS SERVER"
}
First Query Attempt:
Note: I have used POST here instead of GET, since most REST clients do not support sending a payload with GET.
POST http://localhost:9200/stackoverflow/_search
{
"size":0,
"aggs":{
"uniq_soft_tags":{
"terms":{
"field":"TITLE"
}
}
}
}
If you did not define a mapping for your data and you run the query above, you will most probably end up with the error below.
Fielddata is disabled on text fields by default. Set fielddata=true on [TITLE] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory
Read more about that in the Elasticsearch fielddata documentation.
Adding the mapping to enable Fielddata:
PUT http://localhost:9200/stackoverflow/_mapping/os/
{
"properties": {
"TITLE": {
"type": "text",
"fielddata": true
}
}
}
Second Query Attempt:
POST http://localhost:9200/stackoverflow/_search
{
"size":0,
"aggs":{
"uniq_soft_tags":{
"terms":{
"field":"TITLE"
}
}
}
}
Results:
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"uniq_soft_tags": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "linux",
"doc_count": 1
},
{
"key": "os",
"doc_count": 1
},
{
"key": "server",
"doc_count": 1
},
{
"key": "windows",
"doc_count": 1
}
]
}
}
}
Note that doc_count values from a terms aggregation can be approximate; doc_count_error_upper_bound (0 here) reports the maximum possible error.
Make sure to read the following section in the docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html#before-enabling-fielddata
It explains where that splitting behaviour comes from.
Before enabling fielddata
Before you enable fielddata, consider why you are using a text field
for aggregations, sorting, or in a script. It usually doesn’t make
sense to do so.
A text field is analyzed before indexing so that a value like New York
can be found by searching for new or for york. A terms aggregation on
this field will return a new bucket and a york bucket, when you
probably want a single bucket called New York.
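The splitting can be imitated outside Elasticsearch. The Python sketch below is only a crude stand-in for the standard analyzer (it lowercases and splits on whitespace, whereas the real analyzer uses Unicode text segmentation), and the function names are invented for illustration. It counts documents per term for the two test titles, once as an analyzed text field and once as an unanalyzed value.

```python
from collections import Counter


def crude_analyze(value):
    """Rough approximation of text analysis: lowercase, split on whitespace."""
    return value.lower().split()


def terms_buckets(values, analyzed):
    """Count documents per term, similar in spirit to a terms aggregation."""
    counts = Counter()
    for value in values:
        # An analyzed field contributes each distinct token once per document;
        # an unanalyzed field contributes the whole value unchanged.
        terms = set(crude_analyze(value)) if analyzed else {value}
        counts.update(terms)
    return dict(counts)


titles = ["LINUX OS", "WINDOWS SERVER"]
text_buckets = terms_buckets(titles, analyzed=True)
keyword_buckets = terms_buckets(titles, analyzed=False)
```

With the analyzed variant every word becomes its own bucket, which matches the linux / os / server / windows buckets in the results above.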
UPDATE:
To prevent splitting behaviour you have to provide a mapping as follows. Note that with this you would not need the previous mapping where we set Fielddata to true.
PUT http://localhost:9200/stackoverflow/_mapping/os/
{
"properties": {
"TITLE": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
Now we can use the TITLE field for searches and the TITLE.keyword field for aggregations, sorting, or scripts.
Third Query Attempt:
POST http://localhost:9200/stackoverflow/_search
{
"size":0,
"aggs":{
"uniq_soft_tags":{
"terms":{
"field":"TITLE.keyword"
}
}
}
}
Results:
{
"took": 59,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"uniq_soft_tags": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "LINUX OS",
"doc_count": 1
},
{
"key": "WINDOWS SERVER",
"doc_count": 1
}
]
}
}
}
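If the goal is a plain list of distinct values, as in the question, the bucket keys can be pulled out of the response on the client side. This is a minimal Python sketch that assumes the response has already been parsed into a dict; the helper name distinct_values is made up for illustration. Keep in mind that a terms aggregation only returns as many buckets as its size parameter allows.

```python
def distinct_values(response, agg_name):
    """Return the bucket keys of a terms aggregation as a plain list."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [bucket["key"] for bucket in buckets]


# Trimmed copy of the aggregation part of the response shown above.
response = {
    "aggregations": {
        "uniq_soft_tags": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "LINUX OS", "doc_count": 1},
                {"key": "WINDOWS SERVER", "doc_count": 1},
            ],
        }
    }
}

titles = distinct_values(response, "uniq_soft_tags")
print(titles)  # ['LINUX OS', 'WINDOWS SERVER']
```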

Related

How to get the number of documents for each occurrence in Elastic?

I have an Elastic index (say file) where I append a document every time the file is downloaded by a client.
Each document is quite basic: it contains a filename field and a date field, when, that records the time of the download.
What I want to achieve is to get, for each file, the number of times it has been downloaded in the last 3 months.
For the moment, the closest I get is with this query:
{
"query": {
"range": {
"when": {
"gte": "now-3M"
}
}
},
"aggs": {
"downloads": {
"terms": {
"field": "filename.keyword"
}
}
}
}
The result looks something like this:
{
"took": 793,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": 1.0,
"hits": [
{
"_index": "file",
"_type": "_doc",
"_id": "8DkTFHQB3kG435svAA3O",
"_score": 1.0,
"_source": {
"filename": "taz",
"id": 24009,
"when": "2020-08-21T08:11:54.943Z"
}
},
...
]
},
"aggregations": {
"downloads": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 418486,
"buckets": [
{
"key": "file1",
"doc_count": 313873
},
{
"key": "file2",
"doc_count": 281504
},
...,
{
"key": "file10",
"doc_count": 10662
}
]
}
}
}
So I am interested in aggregations.downloads.buckets, but this is limited to 10 results.
What do I need to change in my query to get the whole list (in my case, I will have ~15,000 different files)?
Thanks.
A terms aggregation returns 10 buckets by default. To increase that, set the size parameter of the aggregation:
{
"query": {
"range": {
"when": {
"gte": "now-3M"
}
}
},
"aggs": {
"downloads": {
"terms": {
"field": "filename.keyword",
"size": 15000
}
}
}
}
Note that there are strategies to paginate those buckets using a composite aggregation.
Also note that as your index grows, you may hit the default bucket limit as well (search.max_buckets). It is a dynamic cluster-wide setting, so it can be changed.
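To illustrate the composite aggregation approach, here is a hedged Python sketch of the paging loop. It assumes a caller-supplied function, search, that sends a request body to the index's _search endpoint and returns the parsed JSON response; that function, the source name "name", and the helper names are all hypothetical. Each page is requested with the after_key returned by the previous page.

```python
def composite_body(field, page_size, after_key=None):
    """Build a search body that pages through the distinct values of `field`."""
    composite = {
        "size": page_size,
        "sources": [{"name": {"terms": {"field": field}}}],
    }
    if after_key is not None:
        # Resume right after the last bucket of the previous page.
        composite["after"] = after_key
    return {
        "size": 0,
        "query": {"range": {"when": {"gte": "now-3M"}}},
        "aggs": {"downloads": {"composite": composite}},
    }


def all_download_counts(search, field="filename.keyword", page_size=1000):
    """Collect {filename: doc_count} across all pages.

    `search` is a hypothetical callable: it takes a request body (dict)
    and returns the parsed search response (dict).
    """
    counts = {}
    after_key = None
    while True:
        response = search(composite_body(field, page_size, after_key))
        agg = response["aggregations"]["downloads"]
        for bucket in agg["buckets"]:
            counts[bucket["key"]["name"]] = bucket["doc_count"]
        after_key = agg.get("after_key")
        # Stop when a page comes back empty or no continuation key is given.
        if not agg["buckets"] or after_key is None:
            return counts
```

Compared with a single large size value, this keeps each response small, at the cost of several round trips and of buckets arriving in key order rather than sorted by doc_count.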

Group results returned by elasticsearch query based on query terms

I am very new with elasticsearch. I am facing an issue building a query. My document structure is like:
{
latlng: {
lat: '<some-latitude>',
lon: '<some-longitude>'
},
gmap_result: {<Some object>}
}
I am doing a search on a list of lat-long coordinates. For each coordinate, I am fetching the results that are within 100m. I have been able to do this part. The tricky part is that I do not know which results in the output correspond to which query term. I think this requires using aggregations at some level, but I am currently clueless about how to proceed.
An aggregate query is the correct approach. You can learn about them here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
An example is below. In this example, I am using a match query to find all instances of the word test in the field title and then aggregating the field status to count the number of results with the word test that are in each status.
GET /my_index/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"title": "test"
}
}
]
}
},
"aggs": {
"count_by_status": {
"terms": {
"field": "status"
}
}
},
"size": 0
}
The results look like this:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 346,
"max_score": 0,
"hits": []
},
"aggregations": {
"count_by_status": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Open",
"doc_count": 283
},
{
"key": "Completed",
"doc_count": 36
},
{
"key": "On Hold",
"doc_count": 12
},
{
"key": "Withdrawn",
"doc_count": 10
},
{
"key": "Declined",
"doc_count": 5
}
]
}
}
}
If you provide your query, it would help us give a more specific aggregate query for you to use.
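For the geo case in the question, one possible shape is a filters aggregation with one named geo_distance filter per query point, so that each bucket name identifies the coordinate it belongs to. This Python sketch only builds the request body. It assumes that latlng is mapped as a geo_point, and the bucket names (point_0, point_1, ...), the aggregation names, and the function name are invented for illustration; it is not taken from the original answer.

```python
def grouped_geo_search(points, distance="100m", field="latlng", hits_per_point=10):
    """Build a search body with one named geo_distance filter per query point."""
    named_filters = {}
    for index, (lat, lon) in enumerate(points):
        # The bucket name ties each group of results back to its query point.
        named_filters[f"point_{index}"] = {
            "geo_distance": {
                "distance": distance,
                field: {"lat": lat, "lon": lon},
            }
        }
    return {
        "size": 0,
        "aggs": {
            "by_point": {
                "filters": {"filters": named_filters},
                # top_hits returns the matching documents inside each bucket.
                "aggs": {"nearby": {"top_hits": {"size": hits_per_point}}},
            }
        },
    }


body = grouped_geo_search([(40.7128, -74.0060), (51.5074, -0.1278)])
```

In the response, the documents for the first coordinate would then be found under aggregations.by_point.buckets.point_0.nearby.hits.hits, and so on for the other points.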

Boosting elastic aggregation result

I have an elastic index for products. Each product has a Brand attribute, and I "have to" create an aggregation that returns the Brands of the products.
My Sample Query:
GET /products/product/_search
{
"size": 0,
"aggs": {
"myFancyFilter": {
"filter": {
"match_all": {}
},
"aggs": {
"inner": {
"terms": {
"field": "Brand",
"size": 3
}
}
}
}
},
"query": {
"match_all": {}
}
}
And the result:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 236952,
"max_score": 0,
"hits": []
},
"aggregations": {
"myFancyFilter": {
"doc_count": 236952,
"inner": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 139267,
"buckets": [
{
"key": "Brand1",
"doc_count": 3144
},
{
"key": "Brand2",
"doc_count": 1759
},
{
"key": "Brand3",
"doc_count": 1737
}
]
}
}
}
}
It works perfectly for me. Elastic sorts buckets by doc_count, but I would like to manipulate the bucket order in the result. For example, assume that I have Brand5 and I want to move it up to position #2, so that the result comes back in the order Brand1, Brand5, Brand3.
If it were a query rather than an aggregation, I could use function_score, but here I have no idea. Any clues?
What you are looking for is a way to define your own sort order and apply it to the aggregation in Elasticsearch. I was able to come up with a solution by renaming the aggregation terms in the following manner:
Brand1 to a_Brand1
Brand5 to b_Brand5
Brand3 to c_Brand3
And then apply sorting on the terms so that sorting happens lexicographically.
Of course this may not be the exact or the best solution but I felt this can help.
Below is the query that I've used. Please note that my field name is brand and it is a multifield and I'm using the field brand.keyword.
POST testdataindex/_search
{
"size":0,
"query":{
"match_all":{
}
},
"aggs":{
"myFancyFilter":{
"filter":{
"match_all":{
}
},
"aggs":{
"inner":{
"terms":{
"script":{
"lang":"painless",
"inline":"if(params.newNames.containsKey(doc['brand.keyword'].value)) { return params.newNames[doc['brand.keyword'].value];} return null;",
"params":{
"newNames":{
"Brand1":"a_Brand1",
"Brand5":"b_Brand5",
"Brand3":"c_Brand3"
}
}
},
"order":{
"_term":"asc"
}
}
}
}
}
}
}
I've created sample data with the brand names Brand1, Brand3 and Brand5; below is how the results appear. Note the change in the term names.
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 8,
"max_score": 0,
"hits": []
},
"aggregations": {
"myFancyFilter": {
"doc_count": 8,
"inner": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "a_Brand1",
"doc_count": 2
},
{
"key": "b_Brand5",
"doc_count": 4
},
{
"key": "c_Brand3",
"doc_count": 2
}
]
}
}
}
}
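The newNames parameter does not have to be written by hand. As a small Python sketch (the helper names are made up, and single-letter prefixes limit it to 26 brands), the map can be generated from the desired order, and the prefix can be stripped again when reading the bucket keys back.

```python
import string


def build_rename_map(ordered_brands):
    """Map each brand to a prefixed name that sorts in the desired order."""
    letters = string.ascii_lowercase
    if len(ordered_brands) > len(letters):
        raise ValueError("single-letter prefixes only cover 26 brands")
    return {
        brand: f"{letters[position]}_{brand}"
        for position, brand in enumerate(ordered_brands)
    }


def strip_prefix(bucket_key):
    """Undo the rename, e.g. 'b_Brand5' becomes 'Brand5'."""
    return bucket_key.split("_", 1)[1]


# Desired order from the question: Brand1, Brand5, Brand3.
new_names = build_rename_map(["Brand1", "Brand5", "Brand3"])
```

The resulting dict can be passed as params.newNames in the script above.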
Hope it helps!

Get Percentage of Values in Elasticsearch

I have some test documents that look like
"hits": {
...
"_source": {
"student": "DTWjkg",
"name": "My Name",
"grade": "A"
...
"student": "ggddee",
"name": "My Name2",
"grade": "B"
...
"student": "ggddee",
"name": "My Name3",
"grade": "A"
And I wanted to get the percentage of students that have a grade of B, the result would be "33%", assuming there were only 3 students.
How would I do this in Elasticsearch?
So far I have this aggregation, which I feel like is close:
"aggs": {
"gradeBPercent": {
"terms": {
"field" : "grade",
"script" : "_value == 'B'"
}
}
}
This returns:
"aggregations": {
"gradeBPercent": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "false",
"doc_count": 2
},
{
"key": "true",
"doc_count": 1
}
]
}
}
I'm not necessarily looking for an exact answer, perhaps just terms and keywords I could google. I've read over the Elasticsearch docs and have not found anything that helps.
First off, you shouldn't need a script for this aggregation. If you want to limit your results to everyone where value == 'B', then you should do that using a filter, not a script.
Elasticsearch won't return a percentage directly, but you can easily calculate it from the result of a terms aggregation.
Example:
GET devdev/audittrail/_search
{
"size": 0,
"aggs": {
"a1": {
"terms": {
"field": "uIDRequestID"
}
}
}
}
That returns:
{
"took": 12,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 25083,
"max_score": 0,
"hits": []
},
"aggregations": {
"a1": {
"doc_count_error_upper_bound": 9,
"sum_other_doc_count": 1300,
"buckets": [
{
"key": 556,
"doc_count": 34
},
{
"key": 393,
"doc_count": 28
},
{
"key": 528,
"doc_count": 15
}
]
}
}
}
So what does that response mean?
The hits.total field is the total number of records matching your query.
The doc_count field tells you how many items are in each bucket.
So for my example here, I could say that the key "556" shows up in 34 of 25083 documents, so its percentage is (34 / 25083) * 100, or about 0.14%.
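The same arithmetic can be done on the client for every bucket at once. This is a minimal Python sketch with made-up helper names, assuming the response has been parsed into a dict. It accepts hits.total either as a plain number or as an object with a value field, since both shapes occur depending on the Elasticsearch version; when the object's relation is "gte", the total is only a lower bound and the percentages would be off.

```python
def total_hits(response):
    """Read hits.total, which is an int in older versions and an object in newer ones."""
    total = response["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


def bucket_percentages(response, agg_name):
    """Return {bucket key: percentage of all matching documents}."""
    total = total_hits(response)
    buckets = response["aggregations"][agg_name]["buckets"]
    return {b["key"]: 100.0 * b["doc_count"] / total for b in buckets}


# Trimmed copy of the response shown above.
response = {
    "hits": {"total": 25083},
    "aggregations": {
        "a1": {
            "buckets": [
                {"key": 556, "doc_count": 34},
                {"key": 393, "doc_count": 28},
                {"key": 528, "doc_count": 15},
            ]
        }
    },
}

percentages = bucket_percentages(response, "a1")
```

For the grades example in the question, a response with a total of 3 and a B bucket with doc_count 1 gives roughly 33.3 for the key B.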

How to return number of matches according to specific term in search query?

In my search query I have this:
...
term: { CategoryId: [1,2,3] }
...
I need to return how many matches were found for each category. For now just total number of matches is returned. Is it possible? I think this might be related to aggregation, however I can't find the right solution...
A sample query could be:
POST /test/products/_search
{
"size": 0,
"aggs": {
"category": {
"terms": {
"field": "category"
}
}
}
}
The response looks like this:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 10,
"max_score": 0,
"hits": []
},
"aggregations": {
"category": {
"buckets": [
{
"key": "1",
"doc_count": 10
},
{
"key": "2",
"doc_count": 12
}
]
}
}
}
This gives the number of documents for each category.
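To tie this back to the filter in the question, the aggregation can be combined with a terms query so that only the requested categories are counted. The Python sketch below builds such a body and turns the buckets into a dict. It assumes each document holds a single value in the category field, uses the CategoryId field name from the question as a default, and the helper names are invented for illustration.

```python
def category_count_body(category_ids, field="CategoryId"):
    """Build a body that filters on the given categories and counts each one."""
    return {
        "size": 0,
        "query": {"terms": {field: category_ids}},
        "aggs": {
            "category": {
                # One bucket per requested category is enough when each
                # document has exactly one category (an assumption here).
                "terms": {"field": field, "size": max(len(category_ids), 1)}
            }
        },
    }


def counts_by_category(response):
    """Turn the terms buckets into {category: doc_count}."""
    buckets = response["aggregations"]["category"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


body = category_count_body([1, 2, 3])
```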
Hope this helps!!
