Elasticsearch find unique items in list field - elasticsearch

I need to find the unique string values that occur in a list field.
The question is similar to ElasticSearch - Return Unique Values,
but now the field values are lists.
Records:
PUT items/1
{ "tags" : ["a", "b"] }
PUT items/2
{ "tags" : ["b", "c"] }
PUT items/3
{ "tags" : ["a", "d"] }
Query:
GET items/_search
{ ... }
# => Expected Response
["a", "b", "c", "d"]
Is there a way to run such a search?

Good news! We can use the exact same aggregation as the one in the SO post you linked to. In fact, if we were submitting a list of numeric values, our work would be done already! However, the main difference between this question and the one you referenced is that you are using a "string" type.
It is useful to know that in more recent versions of Elasticsearch there are two ways to represent strings, and the type is no longer called string. The keyword type treats the entire text as a single token, while the text type applies an analyzer that breaks the text up into many tokens and builds an index from those tokens.
For example, the string "Foxes are brown" is indexed as the single token "Foxes are brown" with keyword, or as the tokens ["foxes", "are", "brown"] with text. In your case, tags should be treated as keywords, so we'll need to tell Elasticsearch that the field is a keyword and not text, which is the default.
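To make the difference concrete, here is a rough Python sketch of the two behaviours. The function names keyword_tokens and text_tokens are made up for illustration, and text_tokens only imitates an analyzer that lowercases and splits on non-alphanumeric characters; it is not Elasticsearch's actual implementation.

```python
import re


def keyword_tokens(value):
    # keyword: the whole string is kept as one unmodified token
    return [value]


def text_tokens(value):
    # text: simplified stand-in for an analyzer (lowercase, then split
    # on anything that is not a letter or digit)
    return [tok for tok in re.split(r"[^0-9a-z]+", value.lower()) if tok]


print(keyword_tokens("Foxes are brown"))
print(text_tokens("Foxes are brown"))
```

A terms aggregation buckets by token, so with a keyword field every tag stays a single bucket key even if it contains spaces.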
NOTE: Using the keyword type whenever possible will alleviate the issue of needing to allow elasticsearch to set fielddata to true, which uses up a lot of memory in your cluster if this aggregation is used much. Tags and ordinal data are good candidates for the keyword type.
Anyways, let's get to the real stuff eh?
First, you're going to want to set the mapping for tags in the items as a keyword type.
curl --request PUT \
--url http://localhost:9200/items \
--header 'content-type: application/json' \
--data '{
"mappings": {
"item": {
"properties": {
"tags" : { "type": "keyword" }
}
}
}
}
'
Then you're going to run the aggregation similar to the aggregation in the post you referenced.
curl --request POST \
--url http://localhost:9200/items/item/_search \
--header 'content-type: application/json' \
--data '{
"size": 0,
"aggregations": {
"tags_term_agg": {
"terms": {
"field": "tags"
}
}
}
}'
Your response should look something like this.
{
"took": 24,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"tags_term_agg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "a",
"doc_count": 2
},
{
"key": "b",
"doc_count": 2
},
{
"key": "c",
"doc_count": 1
},
{
"key": "d",
"doc_count": 1
}
]
}
}
}
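To get from that response to the flat list the question asks for, you only need the bucket keys. A minimal Python sketch, assuming the response has already been parsed into a dict (the helper name unique_terms is hypothetical, and the response below is a trimmed copy of the one above):

```python
def unique_terms(response, agg_name):
    # Each bucket key is one distinct value of the aggregated field
    buckets = response["aggregations"][agg_name]["buckets"]
    return [bucket["key"] for bucket in buckets]


# Trimmed copy of the response shown above
response = {
    "aggregations": {
        "tags_term_agg": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "a", "doc_count": 2},
                {"key": "b", "doc_count": 2},
                {"key": "c", "doc_count": 1},
                {"key": "d", "doc_count": 1},
            ],
        }
    }
}

print(unique_terms(response, "tags_term_agg"))
```

Keep in mind that a terms aggregation returns only the top 10 buckets by default, so you may need to set "size" in the terms block if you have more distinct tags than that.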

Related

Elasticsearch aggregation on values in nested list (array)

I have stored some values in Elasticsearch nested data type (an array) but without using key/value pair. An example record would be:
{
"categories": [
"Category1",
"Category2"
],
"product_name": "productx"
}
Now I want to run an aggregation query to find the unique list of categories available. But all the examples I've seen point to mappings that have key/value pairs. Is there any way I can use the above schema as is, or do I need to change it to something like this to run an aggregation query?
{
"categories": [
{"name": "Category1"},
{"name": "Category2"}
],
"product_name": "productx"
}
Well, regarding the JSON structure, you need to take a step back and figure out whether you want a list or key-value pairs.
Looking at your example, I don't think you need key-value pairs, but again it's something you may want to clarify by understanding your domain, in case categories will have more properties later.
Regarding aggregation, as far as I know, aggregations work on any valid JSON structure.
For the data you've mentioned, you can make use of the aggregation query below. I'm also assuming the fields are of type keyword.
Aggregation Query
POST <your_index_name>/_search
{
"size": 0,
"aggs": {
"myaggs": {
"terms": {
"size": 100,
"script": {
"inline": """
def myString = "";
def list = new ArrayList();
for(int i=0; i<doc['categories'].length; i++){
myString = doc['categories'][i] + ", " + doc['product_name'].value;
list.add(myString);
}
return list;
"""
}
}
}
}
}
Aggregation Response
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0,
"hits": []
},
"aggregations": {
"myaggs": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "category1, productx",
"doc_count": 1
},
{
"key": "category2, productx",
"doc_count": 1
}
]
}
}
}
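If it helps to see what the script is doing, here is a small Python sketch that mimics it locally. It assumes the product field is called product_name, as in the sample record, and the helper names script_keys and bucket_counts are invented for this illustration.

```python
from collections import Counter


def script_keys(doc):
    # Mirrors the loop in the script: one "<category>, <product>" string
    # per value in the categories list
    return [f"{category}, {doc['product_name']}" for category in doc["categories"]]


def bucket_counts(docs):
    # A document counts at most once per key, like doc_count
    counts = Counter()
    for doc in docs:
        counts.update(set(script_keys(doc)))
    return dict(counts)


docs = [{"categories": ["Category1", "Category2"], "product_name": "productx"}]
print(bucket_counts(docs))
```

The sketch keeps the casing of the sample record; the keys Elasticsearch returns depend on how the fields are indexed.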
Hope it helps!

elasticsearch how to find number of occurrences

I wonder if it's possible to convert this sql query into ES query?
select top 10 app, cat, count(*) from err group by app, cat
Or in English it would be answering: "Show top app, cat and their counts", so this will be grouping by multiple fields and returning name and count.
For aggregating on a combination of multiple fields, you have to use scripting in Terms Aggregation like below:
POST <index name>/<type name>/_search?search_type=count
{
"aggs": {
"app_cat": {
"terms": {
"script" : "doc['app'].value + '#' + doc['cat'].value",
"size": 10
}
}
}
}
I am using # as a delimiter assuming that it is not present in any value of app and/or cat fields. You can use any other delimiter of your choice. You'll get a response something like below:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 10,
"max_score": 0,
"hits": []
},
"aggregations": {
"app_cat": {
"buckets": [
{
"key": "app2#cat2",
"doc_count": 4
},
{
"key": "app1#cat1",
"doc_count": 3
},
{
"key": "app2#cat1",
"doc_count": 2
},
{
"key": "app1#cat2",
"doc_count": 1
}
]
}
}
}
On the client side, you can get the individual values of app and cat fields from the aggregation response by string manipulations.
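For example, that client-side step could look like this in Python. This is only a sketch: split_bucket_key is a hypothetical helper, and it relies on the same assumption as above, namely that the delimiter never appears in the app value.

```python
def split_bucket_key(key, delimiter="#"):
    # Split on the first delimiter only
    app, cat = key.split(delimiter, 1)
    return app, cat


# Buckets copied from the response above
buckets = [
    {"key": "app2#cat2", "doc_count": 4},
    {"key": "app1#cat1", "doc_count": 3},
    {"key": "app2#cat1", "doc_count": 2},
    {"key": "app1#cat2", "doc_count": 1},
]

# One (app, cat, count) row per bucket, like the rows of the SQL query
rows = [(*split_bucket_key(b["key"]), b["doc_count"]) for b in buckets]
for row in rows:
    print(row)
```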
In newer versions of Elasticsearch, scripting is disabled by default due to security reasons. If you want to enable scripting, read this.
Terms aggregation is what you are looking for.

Elasticsearch: How do I find a term which exists in all documents of a certain type?

Say I have a type type1 for which one of the fields is an array:
curl -XPUT localhost:9200/index1/type1/1 -d '{
"field1": ["A", "B", "C"],
"field2": 1
}'
curl -XPUT localhost:9200/index1/type1/2 -d '{
"field1": ["A", "E", "D"],
"field2": 2
}'
I'd like to query for the values in field1 which are common to all documents of that type. So in this case, the query will return 'A'.
Is there a way to do it with an elasticsearch query? What if I want to also add a condition about another field?
Finally, is there any way to define such a query in Kibana 4 as well?
I don't think it's exactly what you asked for, but it can be (mis)used for it: run a terms aggregation on field1 and, knowing how many documents you have in total, check which terms occur in every document.
I'm re-pasting your queries, easier to work with them in Sense
Indexing:
PUT i30288948/type1/1
{
"field1": ["A", "B", "C"],
"field2": 1
}
PUT i30288948/type1/2
{
"field1": ["A", "E", "D"],
"field2": 2
}
Query:
POST i30288948/_search?search_type=count
{
"aggs": {
"terms": {
"terms": {
"field": "field1",
"size": 1
}
}
}
}
The result:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 4,
"buckets": [
{
"key": "a",
"doc_count": 2
}
]
}
}
}
Since hits.total is 2, any bucket whose doc_count is also 2 holds a term that appears in every document, which answers your question.
Whether this is suitable for Kibana, I unfortunately cannot say.
Update to first comment:
I understand the solution is not satisfying in general, but it could be solved. You said:
"First, I need to know the count beforehand."
You can get the total count from hits.total. I didn't include it in the original result paste, but the result above is now the complete response. By comparing each doc_count with that total, you can tell whether a term occurs in all documents.
"But more importantly - it doesn't get me all items. Say instead of E I have B again in item 2. Running the query then returns A, but I would expect to get both A and B."
You can increase the size of the aggregation, but I understand the solution is not really good, because you don't know in advance how many buckets match your total. Looking at sum_other_doc_count may help to figure out whether there is more data.
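The comparison could be done on the client roughly like this. It is a sketch under a few assumptions: hits.total is a plain integer as in the response above, buckets come back in the default descending doc_count order, and doc_count_error_upper_bound is 0. The function name terms_in_all_docs is made up.

```python
def terms_in_all_docs(response, agg_name):
    total = response["hits"]["total"]
    agg = response["aggregations"][agg_name]
    buckets = agg["buckets"]
    # A term is in every document when its doc_count equals the total
    common = [b["key"] for b in buckets if b["doc_count"] == total]
    # With descending doc_count order, further matches can only hide beyond
    # the last returned bucket if that bucket itself matched
    last_matches = bool(buckets) and buckets[-1]["doc_count"] == total
    maybe_incomplete = last_matches and agg["sum_other_doc_count"] > 0
    return common, maybe_incomplete


# Trimmed copy of the response above (size: 1)
response = {
    "hits": {"total": 2, "max_score": 0, "hits": []},
    "aggregations": {
        "terms": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 4,
            "buckets": [{"key": "a", "doc_count": 2}],
        }
    },
}

common, maybe_incomplete = terms_in_all_docs(response, "terms")
print(common, maybe_incomplete)
```

When the second value is true, rerun the query with a larger size until the last bucket drops below the total.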

Change the structure of ElasticSearch response json

In some cases, I don't need all of the fields in response json.
For example,
// request json
{
"_source": false,
"aggs": { ... },
"query": { ... }
}
// response json
{
"took": 123,
"timed_out": false,
"_shards": { ... },
"hits": {
"total": 123,
"max_score": 123,
"hits": [
{
"_index": "foo",
"_type": "bar",
"_id": "123",
"_score": 123
}
],
...
},
"aggregations": {
"foo": {
"buckets": [
{
"key": 123,
"doc_count": 123
},
...
]
}
}
}
Actually, I don't need _index/_type every time, and when I do aggregations I don't need the hits block at all.
"_source" : false or "_source": { "exclude": [ "foobar" ] } can help ignore/exclude the _source fields in hits block.
But can I change the structure of ES response json in a more common way? Thanks.
I recently needed to "slim down" the Elasticsearch response, as it was well over 1MB of JSON, and I started using the filter_path request parameter.
It lets you include or exclude specific fields and supports different types of wildcards. Do read the documentation for it, as there is quite a lot of detail there.
e.g.
_search?filter_path=aggregations.**.hits._source,aggregations.**.key,aggregations.**.doc_count
This reduced (in my case) the response size by half without significantly increasing the search duration, so it was well worth the effort.
In the hits section, you will always have the _index, _type and _id fields. If you want to retrieve only some specific fields in your search results, you can use the fields parameter in the root object:
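If you build the request in code, the filter list can be assembled rather than typed by hand. A small Python sketch using only the standard library; the helper search_path and the index name my_index are placeholders, not anything from the original setup:

```python
from urllib.parse import urlencode


def search_path(index, filter_paths):
    # filter_path takes a comma-separated list of dotted paths;
    # keep "," and "*" unescaped so the wildcards stay readable
    query = urlencode({"filter_path": ",".join(filter_paths)}, safe=",*")
    return f"/{index}/_search?{query}"


paths = [
    "aggregations.**.hits._source",
    "aggregations.**.key",
    "aggregations.**.doc_count",
]
print(search_path("my_index", paths))
```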
{
"query": { ... },
"aggs": { ... },
"fields":["fieldName1","fieldName2", etc...]
}
When doing aggregations, you can use the search_type (documentation) parameter with count value like this :
GET index/type/_search?search_type=count
It won't return any document but only the result count, and your aggregations will be computed in the exact same way.

Elasticsearch aggregations with object type fields

I am trying to figure something out :
Here's an example of a document that contains object properties, and then trying to do simple terms aggregations.
https://gist.github.com/BAmine/80e1be219d2ac272561a
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0,
"hits": []
},
"aggregations": {
"test": {
"buckets": [
{
"key": "canine",
"doc_count": 1,
"test2": {
"buckets": [
{
"key": "cat",
"doc_count": 1
},
{
"key": "dog",
"doc_count": 1
},
{
"key": "tiger",
"doc_count": 1
},
{
"key": "wolf",
"doc_count": 1
}
]
}
},
{
"key": "feline",
"doc_count": 1,
"test2": {
"buckets": [
{
"key": "cat",
"doc_count": 1
},
{
"key": "dog",
"doc_count": 1
},
{
"key": "tiger",
"doc_count": 1
},
{
"key": "wolf",
"doc_count": 1
}
]
}
}
]
}
}
}
The question is : How can I avoid getting, in my sub-aggregations, buckets whose keys do not belong to the parent aggregation's keys ( example : cat and tiger are not in the property whose label is canine) ?
Is there a way to do this without using nested properties ?
Thank you !
To have this work with the data as is; you could set the animals field's type to nested:
"animals":{
"type": "nested",
"properties": {
"label" : { "type" : "string"},
"names":{
"properties":{
"label" : {"type" : "string"}
}
}
}
}
This allows you to query that part of the document as separate objects. You could then use two filter aggregations within nested aggregations, one filtering for label == feline and the other for label == canine, and put aggregations inside each of these to get the two separate lists.
This solution has the drawback that you have to add another nested filter aggregation for each new class of animals you add later.
The solution @vadik suggested seems superior to me, as there doesn't seem to be anything about these lists that requires them to be in the same document. If there is, you could put them in separate documents with a common parent.
The problem is that you have one document for both kinds of animals, and that's why you get all four animals in each bucket. I suggest another approach instead: create 2 documents, one for each item in the animals array.
Try along those lines and you will get your result.
Why are you not getting the result?
The aggregation framework sees a single document and collects the values of animals.label. It finds two, canine and feline, and creates a bucket for each. The sub-aggregation then aggregates on animals.names.label within each bucket. Since the one document falls into both buckets and holds all four values of animals.names.label, every bucket lists all four. So ES is right. For the sub-aggregation to stay within its parent, each item in animals must be independently identifiable as a document; only then can the aggregation framework treat it as a container and aggregate animals.names.label inside animals.label. That is exactly what happens when you split the document into two.
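A quick way to picture this: for non-nested object fields, Elasticsearch flattens arrays of objects into multi-valued fields, so the link between a label and its names is lost. The Python sketch below imitates that flattening. The document shape is an assumption reconstructed from the bucket keys in the question (the gist itself is not reproduced here), and flatten is a hypothetical helper.

```python
def flatten(value, prefix="", out=None):
    # Collect every leaf value under its dotted path, ignoring which
    # array element it came from
    if out is None:
        out = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flatten(child, f"{prefix}.{key}" if prefix else key, out)
    elif isinstance(value, list):
        for item in value:
            flatten(item, prefix, out)
    else:
        out.setdefault(prefix, []).append(value)
    return out


# Assumed document shape
doc = {
    "animals": [
        {"label": "canine", "names": [{"label": "dog"}, {"label": "wolf"}]},
        {"label": "feline", "names": [{"label": "cat"}, {"label": "tiger"}]},
    ]
}

flat = flatten(doc)
print(flat)
```

After flattening, nothing says that dog belongs to canine rather than feline, which is why both parent buckets report all four names.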
Another thing that you can try is working with Nested Types. To understand why nested types may help, read this article.
