I'm trying to build a set of filters in a UI for an ES index. I'd like to aggregate all the documents, group certain properties by value, and get a count for each.
For example I'd like to be able to build a list of available filters like:
State :
TX (5)
NJ (1)
CA (10)
Source :
Location1 (30)
Location2 (25)
Location3 (22)
Where "State" and "Source" are different properties of the document type and the counts are in parenthesis obviously. I understand an Aggregation request would be what I want, I'm just looking for a little guidance. Ideally I'd like to do this with one request and not multiple requests for each property I need a group by count on.
So, if I am correct, you just want a count of documents per 'state' value, and the same for 'source'.
Here is a single request that covers both:
POST /_search
{
  "size": 0,
  "aggs": {
    "state": {
      "terms": {
        "field": "state"
      }
    },
    "source": {
      "terms": {
        "field": "source"
      }
    }
  }
}
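With both aggregations in place, the response buckets map directly onto the filter list from your question. An abridged sketch of the response, using the counts from the example above (terms buckets are sorted by doc_count descending by default):
{
  "aggregations": {
    "state": {
      "buckets": [
        { "key": "CA", "doc_count": 10 },
        { "key": "TX", "doc_count": 5 },
        { "key": "NJ", "doc_count": 1 }
      ]
    },
    "source": {
      "buckets": [
        { "key": "Location1", "doc_count": 30 },
        { "key": "Location2", "doc_count": 25 },
        { "key": "Location3", "doc_count": 22 }
      ]
    }
  }
}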
Does that help?
Hope everyone is staying safe!
I am trying to explore the proper way to tackle the following use case in Elasticsearch.
Let's say that I have about 700,000 docs which I would like to bucket on the basis of a field (let's call it primary_id). This primary id can be the same for more than one doc (usually up to 2-3 docs will share a primary_id). In all other cases the primary_id is not repeated in any other docs.
So, on average, out of every 10 docs I will have 8 unique primary ids and 1 primary id shared between 2 docs.
To ensure uniqueness I tried using the terms aggregation and I ended up getting buckets in response to my search request but not for the subsequent scroll requests. Upon googling, I found that scroll queries do not support aggregations.
As a result, I tried finding alternative solutions, and tried the solution in this link as well: https://lukasmestan.com/learn-how-to-use-scroll-elasticsearch-aggregation/
It suggests using multiple search requests, each specifying the partition number to fetch (depending on how many partitions you divide your results into). But I receive client timeouts even with high timeout settings on the client side.
Ideally, I want to know the best way to handle data where the cardinality of the field that forms the buckets is almost equal to the number of docs. The SQL equivalent would be select DISTINCT(primary_id) from .....
But in Elasticsearch, distinct values can only be computed via bucketing (terms aggregation).
I also use top_hits as a sub-aggregation under the terms aggregation to fetch the _source fields (see the sketch below).
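For reference, a minimal sketch of that setup (the index name and both size values are placeholders):
GET my-index/_search
{
  "size": 0,
  "aggs": {
    "by_primary_id": {
      "terms": {
        "field": "primary_id",
        "size": 100
      },
      "aggs": {
        "docs": {
          "top_hits": {
            "size": 3
          }
        }
      }
    }
  }
}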
Any help would be extremely appreciated!
Thanks!
There are 3 ways to paginate an aggregation:
Composite aggregation
Partition
Bucket sort
Partition you have already tried.
Composite aggregation: can combine multiple data sources into a single set of buckets, and allows pagination and sorting on them. It can only paginate linearly using after_key, i.e. you cannot jump from page 1 to page 3. You fetch "n" records, then pass the returned after_key to fetch the next "n" records (see the follow-up request after the example below).
GET index22/_search
{
"size": 0,
"aggs": {
"ValueCount": {
"value_count": {
"field": "id.keyword"
}
},
"pagination": {
"composite": {
"size": 2,
"sources": [
{
"TradeRef": {
"terms": {
"field": "id.keyword"
}
}
}
]
}
}
}
}
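To fetch the next page, pass the after_key from the previous response back in the after parameter. A sketch; the value shown is a placeholder for whatever after_key was returned:
GET index22/_search
{
  "size": 0,
  "aggs": {
    "pagination": {
      "composite": {
        "size": 2,
        "sources": [
          {
            "TradeRef": {
              "terms": {
                "field": "id.keyword"
              }
            }
          }
        ],
        "after": {
          "TradeRef": "LAST_RETURNED_KEY"
        }
      }
    }
  }
}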
Bucket sort
The bucket_sort aggregation, like all pipeline aggregations, is executed after all other non-pipeline aggregations. This means the sorting only applies to whatever buckets are already returned from the parent aggregation. For example, if the parent aggregation is terms and its size is set to 10, the bucket_sort will only sort over those 10 returned term buckets.
So this isn't suitable for your case.
You can increase the result size to a value greater than 10K by updating the index.max_result_window setting. Setting too big a size can cause out-of-memory issues, so you need to test how much your hardware can support.
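For example, a sketch of updating that setting on the index from above (the 50000 value is only an illustration; tune it to what your hardware can handle):
PUT index22/_settings
{
  "index": {
    "max_result_window": 50000
  }
}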
A better option is to use the scroll API and perform the distinct operation on the client side.
We have Elasticsearch documents with a "dealerId" field. Multiple documents can have the same "dealerId". We want to pick "N" random dealers from it.
What I have done so far: the following query returns at most 1000 "dealerId" values and their counts in descending order. We then randomly pick "N" records client-side.
{
"from":0,
"size":0,
"aggs":{
"CityIdCount":{
"terms":{
"field":"dealerId",
"order" : { "_term" : "desc" },
"size":1000
}
}
}
}
The downsides with this approach are:
If, in future, we have more than 1K unique dealers, this approach would fail, as it would pick only the top 1K "dealerId" occurrences. What should we put as "size" for this? (See the sketch after this question for one way to check the actual number of unique dealers.)
We are fetching all the data although we only require a random "N" (i.e. 3 or 4 random "dealerId" values) from the Elasticsearch server to the client. Can we somehow do this randomization in the elastic query itself, i.e. order: "random"?
I have read something similar here, but am checking whether there is a solution for this now.
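Regarding the first downside: one way to check how big "size" would need to be is a cardinality aggregation on the field. A sketch; the index name is a placeholder, and note that cardinality counts are approximate:
GET dealers/_search
{
  "size": 0,
  "aggs": {
    "unique_dealers": {
      "cardinality": {
        "field": "dealerId"
      }
    }
  }
}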
I'm trying to display the number of markets in an index. Each document has a field called market, and I want to aggregate the results like this:
"Advertising and sales" : 400
"Oil Industry" : 250
"Metal Industry" : 125
I know how to display these results using the query:
"aggs":{
"group_by_market":{
"terms":{
"field": "market"
}
}
}
The problem is that the results don't get displayed correctly: the market names are split apart and their parts counted separately. For example:
"Advertising": 400
"Sales": 400
"Oil": 322
...etc
How do I make it so the markets are aggregated on their full text?
The type of your field is text. You need to map the field as a "keyword" field (Elasticsearch version 5+); see the Mappings documentation.
In older versions, the mapping needs to be "not_analyzed"; see the Mappings documentation for your version.
The basic difference between the two is that a text field gets tokenized and is meant for full-text search, while a keyword field is meant for exact-value use cases like yours.
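For example, a sketch of a mapping that keeps full-text search on market while adding a keyword sub-field to aggregate on (7.x-style typeless mapping; the index name is a placeholder):
PUT markets
{
  "mappings": {
    "properties": {
      "market": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
The aggregation would then target "market.keyword" instead of "market". Note that existing documents would need to be reindexed for the new sub-field to be populated.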
I'm looking to add a feature to an existing query. Basically, I run a query that returns, say, 1000 documents. Those documents all have the same structure; only the values of certain fields vary. What I'd like is to not only get the full list as a result, but also count how many results have a field X with the value Y, how many have the same field X with the value Z, etc.
Basically, get all the results plus 4 or 5 "counts" that would act like the SQL "group by", in a way.
The point of this is to allow full-text search over all the clients in our database (without filtering), while showing how many of those are active clients, past clients, active prospects, etc.
Any way to do this without running additional / separate queries?
EDIT WITH ANSWER:
Aggregations are the way to go. Here's how I did it; it's so straightforward that I expected much harder work!
{
"query": {
"term": {
"_type":"client"
}
},
"aggregations" : {
"agg1" : {
"terms" : {
"field" : "listType.typeRef.keyword"
}
}
}
}
Note that the field is even nested inside a list of terms ("listType.typeRef.keyword") and not a single flat field; that's just how easy it was!
I believe what you are looking for is an aggregation query.
The documentation should be clear enough, but if you struggle, please share your ES query and we will help you from there.
I have indexed Twitter data in ES. There are 110 M unique Twitter user profiles and 650 M tweets. They are in separate indices: profiles (index: twitter-profiles, type: profiles) and tweets (index: twitter-tweets, type: tweets).
The profile's user_id_str is attached to every tweet.
I am running into a problem getting the occurrence count of a specific user. I tried facets/terms and the terms aggregation, but both give me a PartialShardFailureException because there is a lot of data to compute over.
I used the following query:
{
"aggs" : {
"userCount" : {
"terms" : { "field" : "user_id_str" }
}
}
}
Then I gave another method a try.
I used a second method, scan. Here I get the ids of profiles from the profiles type and then search for them in the tweets type. It gives me results, but a single result comes back after about 2 seconds. With 110 M users, that means I would have to wait for days.
Please give me any reasonable solution for this situation.
You could use the cardinality aggregation in combination with a term filter.
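A sketch of the idea (the user id value is a placeholder): a term filter scopes the request to a single user, and hits.total in the response is that user's tweet count, without building buckets for all 110 M users:
GET twitter-tweets/_search
{
  "size": 0,
  "query": {
    "term": {
      "user_id_str": "123456"
    }
  }
}
And if you instead need the number of distinct users overall, a cardinality aggregation gives an approximate count without the memory cost of a terms aggregation over every user:
GET twitter-tweets/_search
{
  "size": 0,
  "aggs": {
    "unique_users": {
      "cardinality": {
        "field": "user_id_str"
      }
    }
  }
}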