Elasticsearch: why is a facet so slow?

The query below, without a facet, took 18 milliseconds.
After adding a facet, it took 7408 milliseconds.
I have 183M records.
Facets provide aggregated data based on a search query, right?
Then why does the facet take so long to aggregate over 40 records?
Query without facet (took 18 milliseconds):
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"country_raw": "united states"
}
},
{
"term": {
"title_raw": "manager"
}
}
]
}
}
}
}
}
Response for the query without facet:
{
"took": 18,
"timed_out": false,
"_shards": {
"total": 6,
"successful": 6,
"failed": 0
},
"hits": {
"total": 40,
"max_score": 1,
"hits": [
....
]
}
}
Query with facet (took 7845 milliseconds):
{
"size": 0,
"facets": {
"title_facet": {
"terms": {
"field": "title_raw",
"size": 5
}
}
},
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"country_raw": "united states"
}
},
{
"term": {
"title_raw": "manager"
}
}
]
}
}
}
}
}
Facet Query Response
{
"took": 7408,
"timed_out": false,
"_shards": {
"total": 6,
"successful": 6,
"failed": 0
},
"hits": {
"total": 40,
"max_score": 0,
"hits": [ ]
},
"facets": {
"title_facet": {
"_type": "terms",
"missing": 0,
"total": 40,
"other": 0,
"terms": [
{
"term": "manager",
"count": 40
}
]
}
}
}

Did you try "aggs" instead of "facets"? As far as I remember, facets are deprecated.
https://www.elastic.co/guide/en/elasticsearch/reference/1.4/search-aggregations-bucket-terms-aggregation.html
{
"query" : {
"filtered" : {
"filter" : {
"bool" : {
"must" : [{
"term" : {
"country_raw" : "united states"
}
}, {
"term" : {
"title_raw" : "manager"
}
}
]
}
}
}
},
"aggs" : {
"title_facet" : {
"terms" : {
"field" : "title_raw",
"size" : 5
}
}
},
"sort" : {
"_score" : "desc"
}
}
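For simple terms facets the request body can be converted mechanically, because a terms aggregation accepts the same field and size options. The sketch below is only an illustration: facets_to_aggs is a hypothetical helper, not part of any Elasticsearch client, and it assumes the body contains nothing but plain terms facets.

```python
import copy


def facets_to_aggs(body):
    """Return a copy of a search body with its "facets" key renamed to "aggs".

    Hypothetical helper for illustration. It assumes every facet is a plain
    terms facet whose options (such as field and size) are also accepted by
    a terms aggregation, and it refuses anything else.
    """
    converted = copy.deepcopy(body)
    facets = converted.pop("facets", None)
    if facets is not None:
        for name, facet in facets.items():
            if set(facet) != {"terms"}:
                raise ValueError(f"facet {name!r} is not a plain terms facet")
        converted["aggs"] = facets
    return converted


# The facet section of the slow request from the question.
facet_body = {
    "size": 0,
    "facets": {"title_facet": {"terms": {"field": "title_raw", "size": 5}}},
}
aggs_body = facets_to_aggs(facet_body)
```

The original body is left untouched, so both variants can be timed against the same index.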

Related

Fetch all time date_histogram buckets results

I have the below query to fetch aggregations using Elasticsearch 7.1.
{
"query": {
"bool": {
"filter": [
{
"bool": {
"must": [
{
"match": {
"viewedInFeed": true
}
}
]
}
}
]
}
},
"size": 0,
"aggs": {
"viewed_in_feed_by_day": {
"date_histogram": {
"field": "createdDate",
"interval" : "day",
"format" : "yyyy-MM-dd",
"min_doc_count": 1
}
}
}
}
The result set has more than 10,000 hits, and I am not sure how to proceed, since scroll is not available for aggregations. See the response below.
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"viewed_in_feed_by_day": {
"buckets": [
{
"key_as_string": "2020-03-19",
"key": 1584576000000,
"doc_count": 3028
},
{
"key_as_string": "2020-03-20",
"key": 1584662400000,
"doc_count": 5384
},
{
"key_as_string": "2020-03-21",
"key": 1584748800000,
"doc_count": 3521
}
]
}
}
}
The _count API reports more than 10,000 documents, and removing "min_doc_count": 1 does not return more buckets, although I know there is more data.
Building on top of Jaspreet's comments I suggest the following:
Use track_total_hits=true to get the exact counts (since 7.0) while keeping the size=0 to only aggregate.
Use the stats aggregation to gain more insights before running your histograms.
GET dates/_search
{
"track_total_hits": true,
"size": 0,
"aggs": {
"dates_insights": {
"stats": {
"field": "createdDate"
}
},
"viewed_in_feed_by_day": {
"date_histogram": {
"field": "createdDate",
"interval" : "month",
"format" : "yyyy-MM-dd",
"min_doc_count": 1
}
}
}
}
yielding
...
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"viewed_in_feed_by_day" : {
"buckets" : [
{
"key_as_string" : "2020-01-01",
"key" : 1577836800000,
"doc_count" : 1
},
{
"key_as_string" : "2020-02-01",
"key" : 1580515200000,
"doc_count" : 1
},
{
"key_as_string" : "2020-03-01",
"key" : 1583020800000,
"doc_count" : 1
}
]
},
"dates_insights" : {
"count" : 3,
...
"min_as_string" : "2020-01-22T13:09:21.588Z",
"max_as_string" : "2020-03-22T13:09:21.588Z",
...
}
}
...
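Each bucket carries its own doc_count, independent of how hits.total is reported, so the number of aggregated documents can also be recovered on the client. A minimal sketch using the three buckets from the question's response; total_from_buckets is a name made up for this illustration.

```python
def total_from_buckets(response, agg_name):
    """Sum doc_count over all buckets of one aggregation in a search response."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return sum(bucket["doc_count"] for bucket in buckets)


# Trimmed copy of the response shown in the question.
response = {
    "hits": {"total": {"value": 10000, "relation": "gte"}},
    "aggregations": {
        "viewed_in_feed_by_day": {
            "buckets": [
                {"key_as_string": "2020-03-19", "doc_count": 3028},
                {"key_as_string": "2020-03-20", "doc_count": 5384},
                {"key_as_string": "2020-03-21", "doc_count": 3521},
            ]
        }
    },
}

total = total_from_buckets(response, "viewed_in_feed_by_day")
print(total)  # 11933, more than the 10,000 reported in hits.total
```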

ElasticSearch combine MUST and MUST_NOT

I need to find all documents, that contain given id from a list and have no field "device_data".
My search query:
{
"query": {
"bool" : {
"must" : [
{
"terms" : {
"id" : [
"1cbe0c01-6e0c-11e8-b79f-097b2a39b616"
]
}
}
],
"must_not" : [
{
"exists" : {
"field" : "device_data"
}
}
]
}
}
}
This still returns the document below, which I expect not to match because "device_data" is present. What am I doing wrong?
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 4.9881625,
"hits": [
{
"_index": "iot_data",
"_type": "sensors_by_id",
"_id": "[\"1cbe0c01-6e0c-11e8-b79f-097b2a39b616\",\"1cbe0c00-6e0c-11e8-b79f-097b2a39b616\"]",
"_score": 4.9881625,
"_source": {
"field_id": "123",
"device_data": {
"comm_nr": "xxxx1",
"id": "542b9010-67b6-11e8-ab71-997fe8a668b8",
"tag": "",
"type": ""
},
"groups": "group-test",
"id": "1cbe0c01-6e0c-11e8-b79f-097b2a39b616",
"time": "1cbe0c00-6e0c-11e8-b79f-097b2a39b616",
"username": "group-test"
}
}
]
}
}
You need to use a leaf field, such as device_data.id. Note that the nested wrapper below only applies if device_data is mapped as a nested type:
"must_not" : [
{
"nested": {
"path": "device_data",
"query": {
"exists" : {
"field" : "device_data.id"
}
}
}
}
]
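If device_data is a plain object field rather than a nested one (an assumption, since the mapping is not shown), the exists clause on the leaf field can go directly into must_not without a nested wrapper. A sketch that builds such a request body in Python; build_query is a hypothetical helper used only for illustration.

```python
def build_query(ids, absent_field):
    """Build a bool query: id must be one of `ids`, `absent_field` must not exist."""
    return {
        "query": {
            "bool": {
                "must": [{"terms": {"id": list(ids)}}],
                "must_not": [{"exists": {"field": absent_field}}],
            }
        }
    }


body = build_query(
    ["1cbe0c01-6e0c-11e8-b79f-097b2a39b616"],
    "device_data.id",  # leaf field inside the device_data object
)
```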

Get count of particular field in a document using Elasticsearch

Requirement:
I want to find the number of aId values assigned to a particular category ID
(i.e. for categoryId 2532 I want a count of 2, meaning it is assigned to two aIds).
I tried aggregations, but with those I only get the document count rather than the field count.
Mappings
"List": {
"properties": {
"aId": {
"type": "long"
},
"CategoryList": {
"properties": {
"categoryId": {
"type": "long"
},
"categoryName": {
"type": "string"
}
}
}
}
}
Sample Document:
"List": [
{
"aId": 33074,
"CategoryList": [
{
"categoryId": 2532,
"categoryName": "VODAFONE"
}
]
},
{
"aId": 12074,
"CategoryList": [
{
"categoryId": 2532,
"categoryName": "VODAFONE"
}
]
},
{
"aId": 120755,
"CategoryList": [
{
"categoryId": 1234,
"categoryName": "SMPLKE"
}
]
}
]
A cardinality aggregation will not give you the desired result. It returns the number of distinct values of a field, whereas you want the number of times a value appears.
You can use the following query: first filter the documents on CategoryList.categoryId, then run a simple terms aggregation on that field.
POST index_name1111/_search
{
"query": {
"bool": {
"must": [{
"term": {
"CategoryList.categoryId": {
"value": 2532
}
}
}]
}
},
"aggs": {
"count_is": {
"terms": {
"field": "CategoryList.categoryId",
"size": 10
}
}
}
}
Response of above query -
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"count_is": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 2532,
"doc_count": 2
}
]
}
}
}
Alternatively, drop the filter and run only the aggregation; it returns every categoryId with its number of appearances.
POST index_name1111/_search
{
"size": 0,
"aggs": {
"count_is": {
"terms": {
"field": "CategoryList.categoryId",
"size": 10
}
}
}
}
Response of above query
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": []
},
"aggregations": {
"count_is": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 2532,
"doc_count": 2
},
{
"key": 1234,
"doc_count": 1
}
]
}
}
}
For comparison, here is the same filter combined with a cardinality aggregation:
POST index_name1111/_search
{
"size": 0,
"query": {
"bool": {
"must": [{
"term": {
"CategoryList.categoryId": {
"value": 2532
}
}
}]
}
},
"aggs": {
"id_count": {
"cardinality": {
"field": "CategoryList.categoryId"
}
}
}
}
The response below is not the desired result: both matching documents have categoryId 2532, so the distinct count is 1.
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"id_count": {
"value": 1
}
}
}
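The difference between the two aggregations can be reproduced locally with the categoryId values from the sample data. This is only a plain-Python imitation of what the two aggregations count, not a call to Elasticsearch.

```python
from collections import Counter

# categoryId values from the three entries of the sample document.
category_ids = [2532, 2532, 1234]

# What the terms aggregation reports: occurrences per value.
occurrences = Counter(category_ids)

# What the cardinality aggregation reports after filtering on 2532:
# the number of distinct values among the matches.
matches = [cid for cid in category_ids if cid == 2532]
distinct = len(set(matches))

print(occurrences[2532], occurrences[1234], distinct)  # 2 1 1
```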
Hope this helps
Thanks

compute over results of elasticsearch aggregations

I have documents with the following structure:
{
"ga:bounces": "1",
"timestamp": "20160811",
"viewId": "125287857",
"ga:percentNewSessions": "100.0",
"ga:bounceRate": "100.0",
"ga:avgSessionDuration": "0.0",
"ga:sessions": "1",
"user": "xxcgf",
"ga:pageviewsPerSession": "1.0",
"webPropertyId": "UA-80489737-1",
"ga:pageviews": "1",
"dimension": "date",
"ga:users": "1",
"accountId": "80489737"
}
I am applying two aggregations using this query:
{
"size": 0,
"aggs": {
"total-new-sessions": {
"sum": {
"script": "doc['percentNewSessions'].value/100*doc['sessions'].value"
}
},
"total-sessions": {
"sum": {
"field": "ga:sessions"
}
}
}
}
and this is the output I am getting, which is exactly what I want:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 32,
"max_score": 0,
"hits": [ ]
},
"aggregations": {
"total-new-sessions": {
"value": 386.0000003814697
},
"total-sessions": {
"value": 516
}
}
}
Now I want to divide the results of the two aggregations. How can I do that within the query above? The final ratio is the only output I need.
UPDATE:
I tried using this query:
{
"size": 0,
"aggs": {
"total-new-sessions": {
"sum": {
"script": "doc['ga:percentNewSessions'].value/100*doc['ga:sessions'].value"
}
},
"total-sessions": {
"sum": {
"field": "ga:sessions"
}
},
"sessions": {
"bucket_script": {
"buckets_path": {
"total_new": "total-new-sessions",
"total": "total-sessions"
},
"script": "total_new / total"
}
}
}
}
But I get this error: "reason": "Invalid pipeline aggregation named [sessions] of type [bucket_script]. Only sibling pipeline aggregations are allowed at the top level"
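You can use a bucket_script aggregation to achieve this. bucket_script is a parent pipeline aggregation, so it has to be nested inside a multi-bucket aggregation (here a date_histogram) rather than placed at the top level:
{
"size": 0,
"aggs": {
"all": {
"date_histogram": {
"field": "timestamp",
"interval": "year"
},
"aggs": {
"total-new-sessions": {
"sum": {
"script": "doc['ga:percentNewSessions'].value/100*doc['ga:sessions'].value"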
}
},
"total-sessions": {
"sum": {
"field": "ga:sessions"
}
},
"ratio": {
"bucket_script": {
"buckets_path": {
"total_new": "total-new-sessions",
"total": "total-sessions"
},
"script": "total_new / total"
}
}
}
}
}
}
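If the two top-level sums from the question are kept as they are, the division can also be done on the client once the response arrives. A minimal sketch using the values from the question's response; ratio_from_response is a hypothetical helper and is not part of the answer above.

```python
def ratio_from_response(response, numerator, denominator):
    """Divide one single-value aggregation by another from a search response."""
    aggs = response["aggregations"]
    total = aggs[denominator]["value"]
    if not total:
        return None  # avoid dividing by zero or by a missing value
    return aggs[numerator]["value"] / total


# Aggregation part of the response shown in the question.
response = {
    "aggregations": {
        "total-new-sessions": {"value": 386.0000003814697},
        "total-sessions": {"value": 516},
    }
}

ratio = ratio_from_response(response, "total-new-sessions", "total-sessions")
```

This avoids the artificial wrapping bucket, at the cost of doing the arithmetic outside the query.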

Filter ElasticSearch result whose array contains at least 1 tag

I query Elasticsearch with the following DSL.
{
"query": {
"filtered": {
"query": {
"multi_match": {
"query": "Next",
"type": "phrase_prefix",
"fields": [
"defaultContent"
]
}
},
"filter": {
"bool": {
"must_not": {
"term": {
"_deleted": true
}
},
"should": [
{
"term": {
"site": "xxx"
}
},
{
"term": {
"site": "base"
}
}
]
}
}
}
}
}
It works and returns 1 match.
{
"took": 42,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 2.733073,
"hits": [
{
"_index": "cms",
"_type": "content",
"_id": "base>3453fm9lxkmyy_17",
"_score": 2.733073,
"_source": {
"tags": [
"tag1",
"tag2"
],
"site": "base",
"_rev": "1-3b6eb2b3c3d5554bb3ef3f16a299160c",
"defaultContent": "Next action to be settled",
"_id": "base>3453fm9lxkmyy_17",
"type": "content",
"key": "3453fm9lxkmyy_17"
}
}
]
}
}
Now I want to modify the DSL and add a new condition: only return documents whose tags contain tag1 or tag8.
{
"query": {
"filtered": {
"query": {
"multi_match": {
"query": "Next",
"type": "phrase_prefix",
"fields": [
"defaultContent"
]
}
},
"filter": {
"bool": {
"must": {
"term" : {
"tags" : ["tag1", "tag8"],
"minimum_should_match" : 1
}
},
"must_not": {
"term": {
"_deleted": true
}
},
"should": [
{
"term": {
"site": "xxx"
}
},
{
"term": {
"site": "base"
}
}
]
}
}
}
}
}
And then, I get nothing.
{
"took": 23,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
Am I doing something wrong? It should return 1 match because the document contains tag1.
The "term" filter is used when you want to match on a single term, much like SQL column1 = 'foo'.
You want the "terms" filter, which is the equivalent of SQL column1 IN ('foo', 'bar').
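The clause for the question's tags could then look like the body built below. The second function is a plain-Python imitation of the IN-style matching, meant only to show why the sample document qualifies; both helper names are made up for this sketch.

```python
def terms_filter(field, values):
    """Build a terms clause that matches documents containing any of `values`."""
    return {"terms": {field: list(values)}}


def matches_terms(doc, field, values):
    """Local imitation of the matching rule: at least one value is present."""
    present = doc.get(field, [])
    if not isinstance(present, list):
        present = [present]
    return any(value in values for value in present)


clause = terms_filter("tags", ["tag1", "tag8"])

# The document returned by the first query in the question (trimmed).
doc = {"tags": ["tag1", "tag2"], "site": "base"}
print(matches_terms(doc, "tags", ["tag1", "tag8"]))  # True
```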
