How to perform a distinct count query in Elasticsearch

I have an index with a host field. I am trying to retrieve the count of documents by distinct host name.
For example:
Host1:
Count: 72
Host2:
Count: 33
Host3:
Count: 153
Each document has a host field and it is a string. I assume I need to do something involving terms and cardinality, but I can't quite nail the syntax.

How to get all possible values for field host?
curl -XGET 'http://localhost:9200/articles/_search?pretty' -H 'Content-Type: application/json' -d '
{
"aggs" : {
"whatever_you_like_here" : {
"terms" : { "field" : "host", "size":10000 }
}
},
"size" : 0
}'
Notes:
The result will contain a doc_count for each unique value.
"size":10000 returns at most 10000 unique values; the default is 10.
"size":0 omits the hits; by default, "hits" contains 10 documents, which are not needed here.
By default, the buckets are ordered by doc_count in decreasing order.
Reference: bucket terms aggregation
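For readers calling Elasticsearch from a script, here is a minimal Python sketch of the same request body and of reading the buckets out of the response. The function names, the aggregation name by_host and the sample response are made up for illustration; the sample is hand-written from the counts in the question, not real output.

```python
def build_host_count_body(field="host", max_buckets=10000):
    """Request body for a terms aggregation; "size": 0 omits the hits."""
    return {
        "size": 0,
        "aggs": {
            "by_host": {  # arbitrary aggregation name (assumption)
                "terms": {"field": field, "size": max_buckets}
            }
        },
    }


def counts_by_key(response, agg_name="by_host"):
    """Turn the aggregation buckets into a {key: doc_count} dict."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Hand-written sample response (not real output), ordered by doc_count descending.
sample_response = {
    "aggregations": {
        "by_host": {
            "buckets": [
                {"key": "Host3", "doc_count": 153},
                {"key": "Host1", "doc_count": 72},
                {"key": "Host2", "doc_count": 33},
            ]
        }
    }
}

print(counts_by_key(sample_response))
```

The body returned by build_host_count_body can be serialized with json.dumps and sent with any HTTP client.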

Related

elasticsearch - query between document types

I have a production_order document_type
i.e.
{
part_number: "abc123",
start_date: "2018-01-20"
},
{
part_number: "1234",
start_date: "2018-04-16"
}
I want to create a commodity document type
i.e.
{
part_number: "abc123",
commodity: "1 meter machining"
},
{
part_number: "1234",
commodity: "small flat & form"
}
Production orders are datawarehoused every week and are immutable.
Commodities, on the other hand, could change over time, e.g. abc123 could change from 1 meter machining to 5 meter machining, so I don't want to store this data with the production_order records.
If a user searches for "small flat & form" in the commodity document type, I want to pull all matching records from the production_order document type, the match being between part number.
Obviously I can do this in a relational database with a join. Is it possible to do the same in elasticsearch?
If it helps, we have about 500k part numbers that will be commoditized and our production order data warehouse currently holds 20 million records.
I have found that you can now query across indices in Elasticsearch; however, you have to make sure your data is stored in a suitable shape. Here is an example from the Elasticsearch 6.3 docs:
Terms lookup twitter example At first we index the information for
user with id 2, specifically, its followers, then index a tweet from
user with id 1. Finally we search on all the tweets that match the
followers of user 2.
PUT /users/user/2
{
"followers" : ["1", "3"]
}
PUT /tweets/tweet/1
{
"user" : "1"
}
GET /tweets/_search
{
"query" : {
"terms" : {
"user" : {
"index" : "users",
"type" : "user",
"id" : "2",
"path" : "followers"
}
}
}
}
Here is the link to the original page
https://www.elastic.co/guide/en/elasticsearch/reference/6.1/query-dsl-terms-query.html
In my case above I need to set up my storage so that the commodity is a field and its value is an array of part numbers.
i.e.
{
"1 meter machining": ["abc1234", "1234"]
}
I can then look up the 1 meter machining part numbers against my production_order documents.
I have tested this and it works.
Joins are not supported in Elasticsearch.
You can query twice: first get all the part numbers for "small flat & form", then use those part numbers to query the other index.
Otherwise, try to find a way to merge the two into a single index, which would be better. Updating the commodities should not cause any problems once both are combined.
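The two-query approach can be sketched in Python as two request bodies, where the second is built from the hits of the first. This is only an illustration under assumptions: the function names are hypothetical, the field names part_number and commodity are taken from the question, and the sketch assumes all matching commodity documents come back in a single page of results.

```python
def commodity_query(commodity):
    """Step 1: find the commodity documents that match the searched text."""
    return {
        "_source": ["part_number"],  # only the part number is needed
        "query": {"match_phrase": {"commodity": commodity}},
    }


def production_orders_query(commodity_response):
    """Step 2: build a terms query from the part numbers found in step 1."""
    hits = commodity_response["hits"]["hits"]
    part_numbers = sorted({hit["_source"]["part_number"] for hit in hits})
    return {"query": {"terms": {"part_number": part_numbers}}}


# Hand-written sample of a step 1 response (not real output).
sample_commodity_response = {
    "hits": {"hits": [{"_source": {"part_number": "1234"}}]}
}

print(commodity_query("small flat & form"))
print(production_orders_query(sample_commodity_response))
```

The first body would be sent to the commodity index and the second to the production order index; the application does the "join" in between.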

How to use multiple query strings with aggregation in elasticsearch

How to use multiple query strings with aggregate functions in elasticsearch?
For example:
if a>0 AND a<1, then {"low":count(aggregate count of records within 0 to 1)}
else if a > 1 AND a < 100, then {"normal":count(aggregate count of records within 1 to 100)}
else {"high":count(aggregate count of records after 100)}
How to achieve this using Request Body Query string?
Thank you in advance.
Assuming that a is a field that you search on, I think the easiest way for you to do that is using the range aggregation with buckets for each of your use-cases (low, normal, high).
You cannot bind aggregations to conditions of your query; that is something you would have to do in your own code. But if you use the range aggregation, you could define your buckets like this:
POST /_search
{
"aggs" : {
"a_ranges" : {
"range" : {
"field" : "a",
"ranges" : [
{ "to" : 1 },
{ "from" : 1, "to" : 100 },
{ "from" : 100 }
]
}
}
}
}
Depending on your query, two of these buckets would remain empty, but this should give you the result you want
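To get the low / normal / high labels from the question directly in the response, each range can be given a "key". The Python sketch below builds such a body with the 1 and 100 boundaries from the question and mimics locally how a value falls into a range ("from" is inclusive, "to" is exclusive). The helper bucket_for is purely illustrative and is not part of any Elasticsearch client.

```python
# Ranges with explicit keys; boundaries taken from the question.
RANGES = [
    {"key": "low", "to": 1},
    {"key": "normal", "from": 1, "to": 100},
    {"key": "high", "from": 100},
]


def build_range_body(field="a"):
    """Request body for a range aggregation with labelled buckets."""
    return {
        "size": 0,
        "aggs": {"a_ranges": {"range": {"field": field, "ranges": RANGES}}},
    }


def bucket_for(value, ranges=RANGES):
    """Local illustration of range bucketing: from <= value < to."""
    for r in ranges:
        lower = r.get("from", float("-inf"))
        upper = r.get("to", float("inf"))
        if lower <= value < upper:
            return r["key"]
    return None


for value in (0.5, 1, 42, 100):
    print(value, bucket_for(value))
```

Note that a value of exactly 1 lands in "normal" and exactly 100 lands in "high", which differs slightly from the strict inequalities in the question.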

Aggregation distinct values in ElasticSearch

I'm trying to get the distinct values and their counts in Elasticsearch.
This can be done via:
"distinct_publisher": {
"terms": {
"field": "publisher", "size": 0
}
}
The problem I have is that it counts the individual terms: if a publisher value contains words separated by a space, e.g.:
"Chicken Dog"
and 5 documents have this value in the publisher field, then I get 5 for Chicken and 5 for Dog:
"buckets" : [
{
"key" : "chicken",
"doc_count" : 5
},
{
"key" : "dog",
"doc_count" : 5
},
...
]
But I want to get as the result:
"buckets" : [
{
"key" : "Chicken Dog",
"doc_count" : 5
}
]
The reason you're getting a count of 5 for each of chicken and dog is that your documents were analyzed at the time you indexed them.
This means Elasticsearch did some light processing to turn Chicken Dog into the tokens chicken and dog (lowercasing, and tokenizing on whitespace). You can see how Elasticsearch will analyze a given piece of text into searchable tokens by using the Analyze API, for example:
curl -XGET 'localhost:9200/_analyze?text=Chicken+Dog'
In order to aggregate over the "raw" distinct values, you need to use a not_analyzed mapping so that Elasticsearch doesn't do its usual processing (in Elasticsearch 5.0 and later, the equivalent is a field of type keyword). This reference may help. You may need to reindex your data to apply the not_analyzed mapping and get the result you want.
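As a rough illustration, the Python sketch below builds the properties fragment of such a mapping as a multi-field, so the analyzed publisher field stays searchable while an untouched copy is available for aggregation. The sub-field name raw and the function names are assumptions; where this fragment sits inside the full mapping request depends on the Elasticsearch version, so only the fragment is shown.

```python
def publisher_properties(legacy=True):
    """Mapping fragment with an un-analyzed "raw" sub-field (name is an assumption).

    legacy=True  -> pre-5.0 style: string + not_analyzed
    legacy=False -> 5.0+ style: text + keyword
    """
    if legacy:
        main = {"type": "string"}
        raw = {"type": "string", "index": "not_analyzed"}
    else:
        main = {"type": "text"}
        raw = {"type": "keyword"}
    main["fields"] = {"raw": raw}
    return {"publisher": main}


def distinct_publisher_agg():
    """Terms aggregation pointed at the un-analyzed sub-field."""
    return {"distinct_publisher": {"terms": {"field": "publisher.raw"}}}


print(publisher_properties(legacy=True))
print(publisher_properties(legacy=False))
print(distinct_publisher_agg())
```

With a mapping like this in place (and the data reindexed), aggregating on publisher.raw would be expected to yield one bucket for "Chicken Dog" instead of separate buckets for chicken and dog.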

Is it possible to sort buckets in Terms aggregation response on a non-term field?

I need to sort the buckets in the result of an Elasticsearch terms aggregation.
Below is one of the records indexed in Elasticsearch:
{"personId":"10","Salary":10000, "Age":20, "personName":"xyz"}
I am using a terms aggregation over the field Salary. Below is the query:
{
"aggs" : {
"genders" : {
"terms" : {
"field" : "Salary"
}
}
}
}
This query returns buckets based on the Salary values. These buckets can be sorted by the Salary value using order, as in the query below:
{
"aggs" : {
"genders" : {
"terms" : {
"field" : "Salary",
"order" : { "_term" : "asc" }
}
}
}
}
But I need to sort the buckets on the field Age (a non-terms field). Is there any way to do it?
The whole point of aggregations is to "dispatch" the documents into buckets, each of which is defined by the declared field of the terms aggregation, in your case Salary.
The buckets you get in the response are not documents anymore. For instance, in the bucket 10000, you'll get the count of documents which have Salary: 10000, and you'll have as many buckets as different Salary values there are in all your documents (by default only 10 buckets, though).
So, since buckets are not documents, and since a bucket can aggregate documents with different Age values, it's not clear how you'd like the Salary buckets to be sorted by Age.
Maybe, one way out of this could be to add a terms sub-aggregation on the Age field, so you get top Salary buckets and below that you get Age buckets. Then you can sort your Salary/Age bucket pairs any way you want.
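A minimal Python sketch of that suggestion: a terms aggregation on Salary with a terms sub-aggregation on Age, followed by client-side flattening and sorting of the Salary/Age pairs. The aggregation names salaries and ages, the function names and the sample response are hypothetical; the sample is hand-written, not real output.

```python
def salary_age_body():
    """Terms aggregation on Salary with a terms sub-aggregation on Age."""
    return {
        "size": 0,
        "aggs": {
            "salaries": {
                "terms": {"field": "Salary"},
                "aggs": {"ages": {"terms": {"field": "Age"}}},
            }
        },
    }


def salary_age_pairs(response):
    """Flatten nested buckets into (salary, age, doc_count), sorted by age then salary."""
    pairs = []
    for salary_bucket in response["aggregations"]["salaries"]["buckets"]:
        for age_bucket in salary_bucket["ages"]["buckets"]:
            pairs.append(
                (salary_bucket["key"], age_bucket["key"], age_bucket["doc_count"])
            )
    return sorted(pairs, key=lambda pair: (pair[1], pair[0]))


# Hand-written sample response (not real output).
sample_response = {
    "aggregations": {
        "salaries": {
            "buckets": [
                {
                    "key": 10000,
                    "doc_count": 3,
                    "ages": {
                        "buckets": [
                            {"key": 20, "doc_count": 2},
                            {"key": 35, "doc_count": 1},
                        ]
                    },
                },
                {
                    "key": 20000,
                    "doc_count": 1,
                    "ages": {"buckets": [{"key": 25, "doc_count": 1}]},
                },
            ]
        }
    }
}

for pair in salary_age_pairs(sample_response):
    print(pair)
```

The sort key here is one arbitrary choice; once the pairs are flattened, any ordering can be applied in application code.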

How to view the response for multiple indices for a single query

I have created multiple indices in Elasticsearch and have passed a single query to all of them. Is there any way to know how many results came from each index?
Here is a screenshot of my Elasticsearch head, showing a single aggregation applied to two indices:
[screenshot]
As you can see in the figure, I have run an aggregation named "posted_time" on the indices foodfind and comics (red box 1).
But in the response window, to the right, only the results for the index "comics" are shown. How can I see the results for the other index too?
You can use a terms aggregation on the _index field for this.
Let's say you need to run the same query on index-a, index-b and index-c.
You need to make the request in this pattern:
curl -XPOST 'http://localhost:9200/index-a,index-b,index-c/_search' -H 'Content-Type: application/json' -d '{
"aggs" : {
"indexStats" : {
"terms" : {
"field" : "_index"
}
}
}
}'
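When the list of indices is only known at run time, the request path and body can be assembled programmatically. The short Python sketch below does just that; the function name is hypothetical, and "size": 0 is an addition to the request above that leaves the hits out so only the per-index counts come back.

```python
def multi_index_count_request(indices):
    """Build the search path and body for counting documents per index."""
    path = "/" + ",".join(indices) + "/_search"
    body = {
        "size": 0,  # addition to the request above: skip the hits
        "aggs": {"indexStats": {"terms": {"field": "_index"}}},
    }
    return path, body


path, body = multi_index_count_request(["index-a", "index-b", "index-c"])
print(path)
print(body)
```

Each bucket under indexStats in the response would then carry an index name as its key and the number of matching documents from that index as its doc_count.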
