Elasticsearch cardinality aggregation giving completely wrong results

I am saving each page view of a website in an ES index, where each page is recognized by an entity_id.
I need to get the total count of unique page views since a given point in time.
I have the following mapping:
{
"my_index": {
"mappings": {
"page_views": {
"_all": {
"enabled": true
},
"properties": {
"created": {
"type": "long"
},
"entity_id": {
"type": "integer"
}
}
}
}
}
}
According to the Elasticsearch docs, the way to do that is using a cardinality aggregation.
Here is my search request:
GET my_index/page_views/_search
{
"filter": {
"bool": {
"must": [
[
{
"range": {
"created": {
"gte": 9999999999
}
}
}
]
]
}
},
"aggs": {
"distinct_entities": {
"cardinality": {
"field": "entity_id",
"precision_threshold": 100
}
}
}
}
Note that I used a timestamp in the future, so no hits are returned.
And the result I'm getting is:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
},
"aggregations": {
"distinct_entities": {
"value": 116
}
}
}
I don't understand how the number of unique page views could be 116, given that no page views at all match the search query. What am I doing wrong?

Your aggregation is returning the global value for the cardinality: a top-level filter is applied only to the hits, after the aggregations have been computed, so it does not restrict what the aggregation sees. If you want only the cardinality of the filtered set, one way to do that is to use a filter aggregation and nest your cardinality aggregation inside it. Leaving out the filtered query for clarity (you can add it back in easily enough), the query I tried looks like:
curl -XPOST "http://localhost:9200/my_index/page_views/_search" -d'
{
"size": 0,
"aggs": {
"filtered_entities": {
"filter": {
"bool": {
"must": [
[
{
"range": {
"created": {
"gte": 9999999999
}
}
}
]
]
}
},
"aggs": {
"distinct_entities": {
"cardinality": {
"field": "entity_id",
"precision_threshold": 100
}
}
}
}
}
}'
which returns:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0,
"hits": []
},
"aggregations": {
"filtered_entities": {
"doc_count": 0,
"distinct_entities": {
"value": 0
}
}
}
}
Here is some code you can play with:
http://sense.qbox.io/gist/bd90a74839ca56329e8de28c457190872d19fc1b
I used Elasticsearch 1.3.4, by the way.
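For anyone building this request from application code, here is a minimal Python sketch of the same idea: the range filter lives in a filter aggregation and the cardinality aggregation is nested under it. The helper names (distinct_count_since, read_distinct_count) are made up for illustration, and the body is only constructed and printed, not sent to a cluster.

```python
import json


def distinct_count_since(field, time_field, since, precision_threshold=100):
    """Build a search body that counts distinct `field` values among
    documents whose `time_field` is >= `since`."""
    return {
        "size": 0,  # we only need the aggregation, not the hits
        "aggs": {
            "filtered_entities": {
                # The filter aggregation narrows the document set first...
                "filter": {"range": {time_field: {"gte": since}}},
                "aggs": {
                    # ...and the cardinality is computed inside that set.
                    "distinct_entities": {
                        "cardinality": {
                            "field": field,
                            "precision_threshold": precision_threshold,
                        }
                    }
                },
            }
        },
    }


def read_distinct_count(response):
    """Pull the nested cardinality value out of a search response dict."""
    return response["aggregations"]["filtered_entities"]["distinct_entities"]["value"]


body = distinct_count_since("entity_id", "created", 9999999999)
print(json.dumps(body, indent=2))
```

Passing the response shown above to read_distinct_count gives the value nested under filtered_entities rather than a global one.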

Related

Elasticsearch aggregation shows incorrect total

Elasticsearch version is 7.4.2
I suck at Elasticsearch and I'm trying to figure out what's wrong with this query.
{
"size": 10,
"from": 0,
"query": {
"bool": {
"must": [
{
"exists": {
"field": "firstName"
}
},
{
"query_string": {
"query": "*",
"fields": [
"params.display",
"params.description",
"params.name",
"lastName"
]
}
},
{
"match": {
"status": "DONE"
}
}
],
"filter": [
{
"term": {
"success": true
}
}
]
}
},
"sort": {
"createDate": "desc"
},
"collapse": {
"field": "lastName.keyword",
"inner_hits": {
"name": "lastChange",
"size": 1,
"sort": [
{
"createDate": "desc"
}
]
}
},
"aggs": {
"total": {
"cardinality": {
"field": "lastName.keyword"
}
}
}
}
It returns:
"aggregations": {
"total": {
"value": 429896
}
}
So ~430k results, but in pagination we stop getting results around the 426k mark. Meaning, when I run the query with
{
"size": 10,
"from": 427000,
...
}
I get:
{
"took": 2215,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 10000,
"relation": "gte"
},
"max_score": null,
"hits": []
},
"aggregations": {
"total": {
"value": 429896
}
}
}
But if I change "from" to 426000, I still get results.
You are comparing the cardinality aggregation value of your lastName.keyword field with the total number of documents in the index; these are two different things.
You can check the total number of documents in your index using the count API. The from/size parameters are defined at the query level, i.e. they control which of the documents matching your search query are returned. Since you have not set track_total_hits, hits.total shows 10000 with relation gte, which means that at least 10,000 documents match your search query.
As for your aggregation, it returns 429896 in both cases because it does not depend on the from/size values you set for your query.
I was surprised when I found out that the cardinality aggregation has a precision control (the precision_threshold parameter).
Setting it to the maximum value was the solution for me.
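The two knobs mentioned in these answers can be sketched in Python as follows. This is only an illustration: with_exact_totals is a hypothetical helper, and the body is built as a plain dict rather than sent anywhere. It assumes Elasticsearch 7.x, where track_total_hits is a top-level search parameter, and uses 40000, which is the documented upper limit for precision_threshold.

```python
def with_exact_totals(search_body, cardinality_field, precision_threshold=40000):
    """Return a copy of `search_body` that asks for an exact hit count and
    the most precise cardinality estimate available."""
    body = dict(search_body)  # shallow copy so the caller's dict is untouched
    # Without this, hits.total is capped at 10000 with relation "gte".
    body["track_total_hits"] = True
    body["aggs"] = {
        "total": {
            "cardinality": {
                "field": cardinality_field,
                # Values above 40000 behave the same as 40000.
                "precision_threshold": precision_threshold,
            }
        }
    }
    return body


original = {"size": 10, "from": 0, "query": {"match_all": {}}}
tuned = with_exact_totals(original, "lastName.keyword")
print(tuned["track_total_hits"], tuned["aggs"]["total"]["cardinality"])
```

Keep in mind that the cardinality aggregation stays approximate once the number of distinct values exceeds the threshold, so with roughly 430k distinct last names a small gap between the aggregation value and the number of collapsed results can remain.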

Multiple Match Phrase Prefixes Return Zero Results In Elasticsearch

I have the following query (Elasticsearch version 2.3), which produces zero results.
{
"query": {
"bool": {
"must": [
{
"match_phrase_prefix": {
"phone": "123"
}
},
{
"match_phrase_prefix": {
"firstname": "First"
}
}
]
}
}
}
Output from above query:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
Output of above query with _explain
{
"_index": "index_name",
"_type": "doc_type",
"_id": "_explain",
"_version": 4,
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"created": false
}
However, when I run either of the following queries, I get results that include the one document matching both parts of the above query. The document also appears in the results if I include the full phone number.
Phone numbers are stored as strings without any formatting, e.g. "1234567890".
Is there any reason why the query with two prefixes returns zero results?
{
"query": {
"bool": {
"must": [
{
"match_phrase_prefix": {
"phone": "123"
}
}
]
}
}
}
{
"query": {
"bool": {
"must": [
{
"match_phrase_prefix": {
"firstname": "First"
}
}
]
}
}
}
I was able to get the results I wanted by changing the phone number query to a regexp query instead of a match_phrase_prefix query.
{
"query": {
"bool": {
"must": [
{
"regexp": {
"phone": "123[0-9]+"
}
},
{
"match_phrase_prefix": {
"firstname": "First"
}
}
]
}
}
}

Elasticsearch: how to find the percentage of documents that have a price lower than a given number

Let's say I want to allow users to enter the name of a city and the price of a thing (anything).
I need to know the percentage of things in that city whose value for a field is lower than the entered value.
I can search for a city like this:
"query": {
"filtered": {
"query": {
"match": {
"city": "Paris"
}
}
}
},
but I don't know how to do the rest. Could you help me, please?
Supposedly, the percentile ranks aggregation was intended as a means to achieve this.
Example:
POST <index>/<type>/_search
{
"filter": {
"term": {
"city": "blore"
}
},
"aggs": {
"rank": {
"percentile_ranks": {
"values": [
31
],
"field": "price"
}
}
},
"size": 0
}
But when testing it, I found that it is buggy, which I believe is related to an issue.
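So the workaround would be to calculate the percentage on the client side, once the document counts have been acquired with a query similar to the following. Note that the city filter is placed inside the query (a filtered query), because a top-level filter restricts only the hits and not the aggregations:
POST _search
{
"query": {
"filtered": {
"filter": {
"term": {
"city": "blore"
}
}
}
},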
"aggs": {
"less_price_filter": {
"filter": {
"range": {
"price": {
"lt": 60
}
}
}
}
},
"size": 0
}
Response
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": []
},
"aggregations": {
"less_price_filter": {
"doc_count": 2
}
}
}
The percentage can then be calculated on the client side as doc_count * 100 / total, where total is hits.total. For the response above, that is 2 * 100 / 3, or about 66.7%.
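The client-side step might look like the following Python sketch. The function name percentage_below is hypothetical; the response dict is the one shown above, and the isinstance check is an added assumption to cope with newer Elasticsearch versions, which return hits.total as an object instead of a number.

```python
def percentage_below(response, agg_name="less_price_filter"):
    """Percentage of matching documents that fall into the filter
    aggregation `agg_name`, computed from a search response dict."""
    total = response["hits"]["total"]
    if isinstance(total, dict):
        # Newer versions return {"value": ..., "relation": ...}.
        total = total["value"]
    if total == 0:
        return 0.0  # avoid dividing by zero when nothing matched
    matched = response["aggregations"][agg_name]["doc_count"]
    return matched * 100.0 / total


response = {
    "hits": {"total": 3, "max_score": 0, "hits": []},
    "aggregations": {"less_price_filter": {"doc_count": 2}},
}
print(round(percentage_below(response), 2))
```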

ElasticSearch count multiple fields grouped by

I have documents like
{"domain":"US", "zipcode":"11111", "eventType":"click", "id":"1", "time":100}
{"domain":"US", "zipcode":"22222", "eventType":"sell", "id":"2", "time":200}
{"domain":"US", "zipcode":"22222", "eventType":"click", "id":"3","time":150}
{"domain":"US", "zipcode":"11111", "eventType":"sell", "id":"4","time":350}
{"domain":"US", "zipcode":"33333", "eventType":"sell", "id":"5","time":225}
{"domain":"EU", "zipcode":"44444", "eventType":"click", "id":"5","time":120}
I want to filter these documents by eventType=sell and time between 125 and 400, group by domain followed by zipcode and count the documents in each bucket. So my output would be like (first and last docs would be ignored by the filters)
US, 11111,1
US, 22222,1
US, 33333,1
In SQL, this should have been straightforward. But I am not able to get this to work on ElasticSearch. Could someone please help me out here?
How do I write ElasticSearch query to accomplish the above?
This query seems to do what you want:
POST /test_index/_search
{
"size": 0,
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"eventType": "sell"
}
},
{
"range": {
"time": {
"gte": 125,
"lte": 400
}
}
}
]
}
}
}
},
"aggs": {
"zipcode_terms": {
"terms": {
"field": "zipcode"
}
}
}
}
returning
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": []
},
"aggregations": {
"zipcode_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "11111",
"doc_count": 1
},
{
"key": "22222",
"doc_count": 1
},
{
"key": "33333",
"doc_count": 1
}
]
}
}
}
(Note that there is only 1 "sell" at "22222", not 2).
Here is some code I used to test it:
http://sense.qbox.io/gist/1c4cb591ab72a6f3ae681df30fe023ddfca4225b
You might want to take a look at terms aggregations, the bool filter, and range filters.
EDIT: I just realized I left out the domain part, but it should be straightforward to add in a bucket aggregation on that as well if you need to.
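As a sketch of the domain part mentioned in the edit, the zipcode terms aggregation can be nested under a terms aggregation on domain. The Python below only builds the request body and flattens a response into (domain, zipcode, count) rows. The function names and the aggregation name domain_terms are made up for illustration, and the sample response is hand-written to show the expected shape, not captured from a cluster. It also assumes that domain and zipcode are indexed so that their terms come back verbatim (for example not_analyzed strings); with an analyzed field the keys may be lowercased.

```python
def sells_by_domain_and_zipcode(event_type, time_from, time_to):
    """Build a body that filters by event type and time range, then groups
    by domain and, within each domain, by zipcode."""
    return {
        "size": 0,
        "query": {
            "filtered": {
                "filter": {
                    "bool": {
                        "must": [
                            {"term": {"eventType": event_type}},
                            {"range": {"time": {"gte": time_from, "lte": time_to}}},
                        ]
                    }
                }
            }
        },
        "aggs": {
            "domain_terms": {
                "terms": {"field": "domain"},
                # Sub-aggregation: one zipcode bucket list per domain bucket.
                "aggs": {"zipcode_terms": {"terms": {"field": "zipcode"}}},
            }
        },
    }


def flatten_buckets(response):
    """Turn the nested buckets into (domain, zipcode, doc_count) rows."""
    rows = []
    for domain in response["aggregations"]["domain_terms"]["buckets"]:
        for zipcode in domain["zipcode_terms"]["buckets"]:
            rows.append((domain["key"], zipcode["key"], zipcode["doc_count"]))
    return rows


body = sells_by_domain_and_zipcode("sell", 125, 400)

# Hand-written response in the shape a nested terms aggregation returns.
sample_response = {
    "aggregations": {
        "domain_terms": {
            "buckets": [
                {
                    "key": "US",
                    "doc_count": 3,
                    "zipcode_terms": {
                        "buckets": [
                            {"key": "11111", "doc_count": 1},
                            {"key": "22222", "doc_count": 1},
                            {"key": "33333", "doc_count": 1},
                        ]
                    },
                }
            ]
        }
    }
}
for row in flatten_buckets(sample_response):
    print(row)
```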

Elasticsearch: find total hits for a date

I need to find the total number of records in my user table for a particular date. I am able to get the total hits, but I cannot find a query that fetches the data for a particular date.
Query
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"daily_team": {
"filter": {
"range": {
"created_date": {
"from": "2015-01-02",
"to": "2015-01-02"
}
}
}
}
}
}
Result
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 33,
"max_score": 0,
"hits": []
},
"aggregations": {
"daily_team": {
"doc_count": 1
}
}
}
Here "total": 33 is the total number of records in my database. I have only 22 records from the "starting date" to "2015-01-02". Could you please help me find a query for this? Thanks.
I found a solution: I just removed the "from" parameter from the range, and now I can get the count from doc_count.
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"daily_team": {
"filter": {
"range": {
"created_date": {
"to": "2015-01-02"
}
}
}
}
}
}
