elasticsearch: count appearance of terms aggregation on other fields - elasticsearch

I want to count how many times the unique values returned by a terms aggregation appear in other fields, within the same query. Let's say:
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"unique_products": {
"terms": {
"field": "products.name.keyword",
"min_doc_count": 10
}
}
}
}
What I want is to count how many times each key returned in the buckets appears in another field.
My ideal output is:
"aggregations": {
"product_stat": {
"key": "<product_name>",
"sold": "<#>", # how many times the key appears in another field, such as sold
"bought": "<#>"
}
}

Elasticsearch cannot run a single terms aggregation over multiple fields. In short, if it did, aggregations would not be as fast as they are.
As the documentation suggests, there are two options:
use script terms aggregation (with performance penalty),
change how the documents are indexed so a normal terms aggregation can be used.
Depending on the structure of your data and your use-cases, you might get by with a complex aggregation + some processing on the client side. This can be done with sub aggregations like here, for example.
Hope that helps!
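The client-side processing mentioned above could look roughly like the sketch below. It assumes the request ran one terms aggregation per field: "unique_products" from the question, plus two hypothetical aggregations named "sold" and "bought" over whatever fields hold those values. It also assumes each aggregation's size was large enough to return every key of interest. The function name is made up for illustration.

```python
def merge_term_counts(aggregations, key_agg, other_aggs):
    """Join bucket counts from several terms aggregations on the bucket key."""
    # Index every secondary aggregation by bucket key for quick lookups.
    lookups = {
        name: {b["key"]: b["doc_count"] for b in aggregations[name]["buckets"]}
        for name in other_aggs
    }
    rows = []
    for bucket in aggregations[key_agg]["buckets"]:
        row = {"key": bucket["key"]}
        for name in other_aggs:
            # A key that never shows up in the other field counts as zero.
            row[name] = lookups[name].get(bucket["key"], 0)
        rows.append(row)
    return rows
```

Calling merge_term_counts(response["aggregations"], "unique_products", ["sold", "bought"]) would then give one row per product key, in the shape of the ideal output from the question.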

Related

How to find all duplicate documents in ElasticSearch

We have a need to walk over all of the documents in our AWS ElasticSearch cluster, version 6.0, and gather a count of all the duplicate user ids.
I have tried using a Data Visualization to aggregate counts on the user ids and export them, but the numbers don't match another source of our data that is searchable via traditional SQL.
What we would like to see is like this:
USER ID COUNT
userid1 4
userid22 3
...
I am not an advanced Lucene query person and have yet to find an answer to this question. If anyone can provide some insight into how to do this, I would be appreciative.
The following query counts the documents for each id and drops the ids that occur fewer than 2 times, so you'll get something like:
id:2, count:2
id:4, count:15
GET /index/_search
{
"query":{
"match_all":{}
},
"aggs":{
"user_id":{
"terms":{
"field":"user_id",
"size":100000,
"min_doc_count":2
}
}
}
}
More here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
If you want to get all duplicate userids with their counts:
First, find out how large the terms aggregation needs to be.
Get the number of distinct userid values with a cardinality aggregation.
GET index/type/_search
{
"size": 0,
"aggs": {
"maximum_match_counts": {
"cardinality": {
"field": "userid",
"precision_threshold": 100
}
}
}
}
Read the value of the maximum_match_counts aggregation from the response.
Now you can get all duplicate userids, using that value as the terms size:
GET index/type/_search
{
"size": 0,
"aggs": {
"userIds": {
"terms": {
"field": "userid",
"size": maximum_match_counts,
"min_doc_count": 2
}
}
}
}
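The two steps above can be chained in a small script. This is only a sketch: the helper names are invented, and it assumes the cardinality response has the usual shape, with the count under aggregations.maximum_match_counts.value. Note that cardinality is an approximate count, so the resulting size is an estimate rather than an exact bound.

```python
def cardinality_request(field, precision_threshold=100):
    """Body for step 1: estimate how many distinct values the field has."""
    return {
        "size": 0,
        "aggs": {
            "maximum_match_counts": {
                "cardinality": {
                    "field": field,
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


def duplicates_request(field, cardinality_response):
    """Body for step 2: terms aggregation sized from the step 1 response."""
    size = cardinality_response["aggregations"]["maximum_match_counts"]["value"]
    return {
        "size": 0,
        "aggs": {
            "userIds": {
                "terms": {
                    "field": field,
                    # Guard against an empty index, where the estimate is 0.
                    "size": max(size, 1),
                    "min_doc_count": 2,
                }
            }
        },
    }
```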
When you use a terms aggregation (Bharat's suggestion) with a size above 10K, you will get a warning that this approach will throw an error in future releases.
Instead of a terms aggregation, use a composite aggregation to scan all of your documents with the pagination/after_key method.
The composite aggregation can be used to paginate all buckets from a multi-level aggregation efficiently. This aggregation provides a way to stream all buckets of a specific aggregation, similarly to what scroll does for documents.
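A minimal sketch of that after_key loop follows, assuming a cluster version that supports the composite aggregation. The `search` argument is a hypothetical callable that sends a request body to the index and returns the parsed JSON response; the aggregation name "by_id" and source name "id" are arbitrary. Since the composite aggregation returns every bucket, the duplicate filter (doc_count of at least 2) is applied on the client here.

```python
def iter_duplicate_ids(search, field="userid", page_size=1000):
    """Yield (value, doc_count) for every value that occurs more than once."""
    after_key = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"id": {"terms": {"field": field}}}],
        }
        if after_key is not None:
            # Resume right after the last bucket of the previous page.
            composite["after"] = after_key
        body = {"size": 0, "aggs": {"by_id": {"composite": composite}}}
        agg = search(body)["aggregations"]["by_id"]
        for bucket in agg["buckets"]:
            if bucket["doc_count"] >= 2:
                yield bucket["key"]["id"], bucket["doc_count"]
        after_key = agg.get("after_key")
        # Stop once a page comes back empty or without an after_key.
        if not agg["buckets"] or after_key is None:
            break
```

Because this is a generator, the full bucket list never has to sit in memory at once; each page is requested only when the previous one has been consumed.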

Elasticsearch:: Sorting giving weird results

When I search for the first time, it sorts all documents and gives me the first 5 records. However, if the same search query is executed with the sort direction changed (ASC -> DESC), it does not sort all documents again; it takes the 5 documents retrieved by the previous query, sorts them in descending order, and returns them. I was expecting it to sort all available documents in DESC order and then retrieve the first 5 results.
Am I doing something wrong, or have I missed a concept?
My search query:
{
"sort": {
"taskid": {
"order": "ASC"
}
},
"from": 0,
"size": 5,
"query": {
"filtered": {
"query": {
"match_all": []
}
}
}
}
I have data with taskid 1 to 100. The query above fetched the records with taskid 1 to 5 on the first attempt. When I changed the sort direction to desc, I expected the documents with taskid 96-100 (in the sequence 100, 99, 98, 97, 96) to be returned; instead I got the documents with taskid 5, 4, 3, 2, 1, in that sequence. This means the sorting was done on the previously returned result only.
Please note that taskid and _id are the same in my documents. I added a redundant field to my mapping that holds the same value as _id.
Just change the value of the order key to lowercase and you are good to go.
{
"sort": {
"taskid": {
"order": "asc" // or "desc"
}
},
"from": 0,
"size": 5,
"query": {
"filtered": {
"query": {
"match_all": []
}
}
}
}
Hope this helps..
In Elasticsearch, sorting is applied after the results are extracted. For the query in your question, the results are first filtered by the search criteria, and then sorting is applied to the filtered results.
If it looks like you are only getting results based on an old subset of your data, then it may be that your newer data has not been indexed yet. This can happen easily in an automated test, but with manual testing it is less likely.
By default the index is refreshed once per second, so adding a delay/sleep of about a second between indexing and searching should fix your test if this is the problem.

Compare documents in Elasticsearch

I am new to Elasticsearch and I am trying to get all documents which have the same mobile type. I couldn't find a relevant question and am currently stuck.
curl -XPUT 'http://localhost:9200/sessions/session/1' \
-d '{"useragent": "1121212","mobile": "android", "browser": "mozilla", "device": "computer", "service-code": "1112"}'
EDIT -
I need the Elasticsearch equivalent of the following -
SELECT * FROM session s1, session s2
WHERE s1.device = s2.device
What you are trying to achieve is a simple grouping of docs on a field, expressed as a self-join.
A similar notion of grouping can be achieved with a terms aggregation in Elasticsearch. However, this aggregation returns only group-level metrics such as count or sum; it does not return the individual records.
There is another aggregation, top_hits, which can be applied as a sub-aggregation of the terms aggregation to get those records.
The top_hits aggregator can effectively be used to group result sets
by certain fields via a bucket aggregator. One or more bucket
aggregators determines by which properties a result set get sliced
into.
Options
from - The offset from the first result you want to fetch.
size - The maximum number of top matching hits to return per bucket. By default the top three matching hits are returned.
sort - How the top matching hits should be sorted. By default the hits are sorted by the score of the main query.
Here is a sample query
{
"query": {
"match_all": {}
},
"aggs": {
"top-mobiles": {
"terms": {
"field": "device"
},
"aggs": {
"top_device_hits": {
"top_hits": {}
}
}
}
}
}
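To turn the response of the sample query into plain groups of documents, something like the following sketch could be used. It assumes the standard response layout for a terms aggregation with a top_hits sub-aggregation and reuses the aggregation names from the query above; the function name is made up. Keep in mind that top_hits returns only three hits per bucket unless its size option is raised.

```python
def group_hits(response, agg_name="top-mobiles", hits_name="top_device_hits"):
    """Map each bucket key to the _source of the documents in that bucket."""
    groups = {}
    for bucket in response["aggregations"][agg_name]["buckets"]:
        # top_hits nests a regular hits object inside every bucket.
        hits = bucket[hits_name]["hits"]["hits"]
        groups[bucket["key"]] = [hit["_source"] for hit in hits]
    return groups
```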

Limit and Offset in Term Aggregation ElasticSearch

There is a way to get the top n terms results. For example:
{
"aggs": {
"apiSalesRepUser": {
"terms": {
"field": "userName",
"size": 5
}
}
}
}
Is there any way to set the offset for the terms result?
If you mean something like ignore first m results and return the next n results then no; it is not possible. A workaround to that would be to set size to m + n and do client side processing to ignore the first m results.
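As a rough illustration of that workaround, the sketch below requests m + n buckets and slices off the first m on the client. The helper names are hypothetical, and the default aggregation name is taken from the question.

```python
def terms_request(field, offset, limit, agg_name="apiSalesRepUser"):
    """Ask for offset + limit buckets so the wanted page is included."""
    return {
        "size": 0,
        "aggs": {agg_name: {"terms": {"field": field, "size": offset + limit}}},
    }


def page_of_buckets(response, offset, limit, agg_name="apiSalesRepUser"):
    """Drop the first `offset` buckets and keep at most `limit` of the rest."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return buckets[offset:offset + limit]
```

The cost grows with the offset, since every earlier bucket is still computed and transferred.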
A little late, but (at least) since Elastic 5.2.0 you can use partitioning in the terms aggregation to paginate results.
https://www.elastic.co/guide/en/elasticsearch/reference/5.2/search-aggregations-bucket-terms-aggregation.html#_filtering_values_with_partitions
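A sketch of how partitioned requests might be generated is shown below, using the include option with partition and num_partitions described on the linked page. The function name is invented. Note that partitions split the terms into disjoint groups rather than into globally ranked pages, so each response is ordered only within its own partition.

```python
def partition_requests(field, num_partitions, size, agg_name="apiSalesRepUser"):
    """Yield one request body per partition of the field's terms."""
    for partition in range(num_partitions):
        yield {
            "size": 0,
            "aggs": {
                agg_name: {
                    "terms": {
                        "field": field,
                        # Each term falls into exactly one partition.
                        "include": {
                            "partition": partition,
                            "num_partitions": num_partitions,
                        },
                        "size": size,
                    }
                }
            },
        }
```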
Maybe this helps a bit (set the terms size large enough to cover from + size):
"aggregations": {
"apiSalesRepUser": {
"terms": {
"field": "userName",
"size": 9999
},
"aggregations": {
"limitBucket": {
"bucket_sort": {
"sort": [],
"from": 10,
"size": 20,
"gap_policy": "skip"
}
}
}
}
}
I am not sure what value to put in the terms size; I would suggest a reasonable one. It limits the initial aggregation, and the limitBucket aggregation then trims the terms buckets again. This will probably still load into memory everything covered by the terms size, so whether it is acceptable depends on your scenario, i.e. whether it is reasonable not to get all results (for example, if you have tens of thousands). It fits a Google-like search where you don't need to jump to page 1000.
Compared to the alternative of paging the data on the client side, this might save you some data transfer from ES, but as I said, weigh this carefully: it loads a lot of data into ES memory and you might run into memory issues in Elasticsearch.

Elasticsearch and aggregation of subqueries

I know that Elasticsearch allows sub-aggregations (i.e. nested aggregations), but I would like to apply an aggregation to the result of a "first" aggregation (or, more generally, of any query, aggregation or not).
Concrete example: I log events about user actions (for simplicity, my documents have a user_id and an action). I can make a query that counts the number of actions executed by each user. However, I would like to find the percentage (or count) of "active users" (e.g. users that have executed more than 10 actions). The ideal result would be a histogram over all users showing how active the users are.
Is there a way to create such a query? Or is there any other approach I can take, other than storing the aggregated results of the subquery and computing the histogram from that?
Note: I have seen the Elastic Search and "sub queries" question, but it was about something else, it is over a year and a half old, and Elasticsearch is being actively developed.
Additionally, it seems that version 1.4 will provide a scripted metric aggregation, but that would still require storing a counter for every user until the reduce phase. An "approximate solution" is good enough for me, similar to what ES uses internally for its aggregations.
Here is the query I have used, notice the "min_doc_count" in the aggregation.
{
"query": {
"filtered": {
"filter": {
"and": [
{ "term" : { "name": "did x" } },
{ "range": { "created_at": { "gte": "now-7d", "lte": "now" } } }
]
}
}
},
"aggregations": {
"my_agg": {
"terms": {
"field": "user_id",
"min_doc_count": 10,
"size": 0
}
}
}
}
This query returns the list of buckets (users) with more than 9 events in the specified time period. Just 'count' results to get the number of active users.
I have tested this approach with thousands of events and it works well. At a certain scale you will have to use Hadoop.
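For the histogram the question asks about, the buckets returned by my_agg can be binned on the client. The following is only a sketch with a made-up function name; it assumes `buckets` is the list found under aggregations.my_agg.buckets in the response.

```python
from collections import Counter


def activity_histogram(buckets, bin_width=10):
    """Count users per activity bin, keyed by the lower edge of each bin."""
    histogram = Counter()
    for bucket in buckets:
        # doc_count is the number of events logged for this user.
        lower_edge = (bucket["doc_count"] // bin_width) * bin_width
        histogram[lower_edge] += 1
    return dict(sorted(histogram.items()))
```

The number of active users is simply len(buckets), since min_doc_count already removed everyone below the threshold.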
