Elasticsearch Birth date aggregation - elasticsearch

I need my filter to work like this:
18-24 | (16,635,890)
25-34 | (2,478,382)
35-44 | (1,129,493)
45-54 | (5,689,393)
55-64 | (4,585,933)
This is my ES mapping:
{
  "dynamic": "strict",
  "properties": {
    "birthdate": {
      "type": "date",
      "format": "m/d/yyyy"
    },
    "first_name": {
      "type": "keyword"
    },
    "last_name": {
      "type": "keyword"
    }
  }
}
I would like to know if it's possible to do this with this mapping. I'm not very experienced with ES; I believe doing this requires advanced ES knowledge.
Also, I tried this as a test, but it didn't aggregate the way I need :/
age: {
  terms: {
    field: 'birthdate'
  }
}
--------------------
"doc_count_error_upper_bound" => 0,
"sum_other_doc_count" => 0,
"buckets" => [
  {
    "key" => 1072915200000,
    "key_as_string" => "0/1/2004",
    "doc_count" => 1
  }
]
},
I tried reading the documentation and searching some forums, but without success. Thanks.

A good candidate for this would be the range aggregation, but since your birthdate is stored as a date, you'd need to calculate the age up to now before calculating the buckets. You can do so through a Painless script.
Putting it all together:
POST your-index/_search
{
  "size": 0,
  "aggs": {
    "price_ranges": {
      "range": {
        "script": {
          "source": "return (params.now_ms - doc['birthdate'].value.millis) / 1000 / 60 / 60 / 24 / 365",
          "params": {
            "now_ms": 1617958584396
          }
        },
        "ranges": [
          { "from": 18, "to": 24, "key": "18-24" },
          { "from": 25, "to": 34, "key": "25-34" }
        ]
      }
    }
  }
}
would return:
...
"aggregations" : {
  "price_ranges" : {
    "buckets" : [
      {
        "key" : "18-24",
        "from" : 18.0,
        "to" : 24.0,
        "doc_count" : 0
      },
      {
        "key" : "25-34",
        "from" : 25.0,
        "to" : 34.0,
        "doc_count" : 2
      },
      ...
    ]
  }
}
Note that the current timestamp wasn't obtained through a dynamic new Date() call but rather hardcoded as a parametrized now_ms variable. This is the recommended way of doing date math due to the distributed nature of Elasticsearch. For more info on this, check my answer to How to get current time as unix timestamp for script use.
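For illustration, the hardcoded now_ms value would typically be computed client-side just before sending the request. A minimal Python sketch (the index name and request structure mirror the example above; the function name is my own):

```python
import time

def build_age_ranges_query() -> dict:
    """Build the range-aggregation request body, computing now_ms client-side."""
    now_ms = int(time.time() * 1000)  # current epoch time in milliseconds
    return {
        "size": 0,
        "aggs": {
            "price_ranges": {
                "range": {
                    "script": {
                        "source": (
                            "return (params.now_ms - doc['birthdate'].value.millis)"
                            " / 1000 / 60 / 60 / 24 / 365"
                        ),
                        "params": {"now_ms": now_ms},
                    },
                    "ranges": [
                        {"from": 18, "to": 24, "key": "18-24"},
                        {"from": 25, "to": 34, "key": "25-34"},
                    ],
                },
            }
        },
    }
```

Passing the timestamp as a script parameter (rather than interpolating it into the script source) also lets Elasticsearch reuse the compiled script across requests.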
Shameless plug: if you're relatively new to ES, you might find my recently released Elasticsearch Handbook useful. One of the chapters is dedicated solely to aggregations and one to Painless scripting!

Related

Elasticsearch - Retrieving a list of documents whose last status is 51

I have an index in Elasticsearch with this kind of documents:
{
  "transactionId" : 5588,
  "clientId" : "1",
  "transactionType" : 1,
  "transactionStatus" : 51,
  "locationId" : 12,
  "images" : [
    {
      "imageId" : 5773,
      "imagePath" : "http://some/url/path",
      "imageType" : "dummyData",
      "ocrRead" : "XYZ999",
      "imageName" : "SOMENUMBERSANDCHARACTERS.jpg",
      "ocrConfidence" : "94.6",
      "ROITopLeftCoordinate" : "839x251",
      "ROIBottomRightCoordinate" : "999x323"
    }
  ],
  "creationTimestamp" : 1669645709130,
  "current" : true,
  "timestamp" : 1669646359686
}
It's an "add only" type of stack, where a record is never updated. For instance:
- a new record is added with "transactionStatus": 10
- the transactionId changes status, so a new record is added for the same transactionId with "transactionStatus": 51
and so on.
What I want to achieve is to get a list of 10 records whose last status is 51, but I can't write the correct query.
Here is what I've tried:
{
  "size": 10,
  "query": {
    "match_all": {}
  },
  "collapse": {
    "field": "transactionId",
    "inner_hits": {
      "name": "most_recent",
      "size": 1,
      "sort": [{ "timestamp": "desc" }]
    }
  },
  "post_filter": {
    "term": {
      "transactionStatus": "51"
    }
  }
}
If I change the "transactionStatus": 51 in the post_filter term to, say, 10, it returns a transactionId whose last record is not 10.
I don't know if I explained this properly. I apologize for my English; it's not my native language.
GET test_status/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "transactionStatus": 51
          }
        }
      ]
    }
  },
  "sort": [
    {
      "timestamp": {
        "order": "desc"
      }
    }
  ]
}
This one will filter and then sort by timestamp. Let me know if there is something missing.

Get the number of appearances of a particular term in an elasticsearch field

I have an elasticsearch index (posts) with following mappings:
{
  "id": "integer",
  "title": "text",
  "description": "text"
}
I simply want to find the number of occurrences of a particular term inside the description field for one particular document (I have the document id and the term to find).
e.g. I have a post like this: {id: 123, title: "some title", description: "my city is LA, this post description has two occurrences of word city "}.
I have the document id / post id for this post and just want to find how many times the word "city" appears in the description of this particular post (the result should be 2 in this case).
I can't seem to find a way to do this search; I don't want the occurrences across ALL documents, just for a single document and inside its one field. Please suggest a query for this. Thanks.
Elasticsearch version: 7.5
You can use a terms aggregation on your description field, but you need to make sure fielddata is set to true on it.
PUT kamboh/
{
  "mappings": {
    "properties": {
      "id": {
        "type": "integer"
      },
      "title": {
        "type": "text"
      },
      "description": {
        "type": "text",
        "fields": {
          "simple_analyzer": {
            "type": "text",
            "fielddata": true,
            "analyzer": "simple"
          },
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
Ingesting a sample doc:
PUT kamboh/_doc/1
{
  "id": 123,
  "title": "some title",
  "description": "my city is LA, this post description has two occurrences of word city "
}
Aggregating:
GET kamboh/_search
{
  "size": 0,
  "aggregations": {
    "terms_agg": {
      "terms": {
        "field": "description.simple_analyzer",
        "size": 20
      }
    }
  }
}
Yielding:
"aggregations" : {
  "terms_agg" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : "city",
        "doc_count" : 1
      },
      {
        "key" : "description",
        "doc_count" : 1
      },
      ...
    ]
  }
}
Now, as you can see, the simple analyzer split the string into words and lowercased them, but the bucket for city still shows a doc_count of 1 (a terms aggregation's doc_count counts documents containing a term, not occurrences). I could not come up with an analyzer that would keep the duplicates... With that being said,
It's advisable to do these word counts before you index!
You would split your string on whitespace and index it as an array of words instead of one long string.
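As a sketch of that pre-index approach (plain Python on the post from the question; the word_counts field name is hypothetical, and collections.Counter does the counting):

```python
from collections import Counter

def word_counts(description: str) -> dict:
    """Split on whitespace and count occurrences of each word before indexing."""
    return dict(Counter(description.split()))

post = {
    "id": 123,
    "title": "some title",
    "description": "my city is LA, this post description has two occurrences of word city ",
}

# Store the counts alongside the document at index time (hypothetical field),
# then read them back directly instead of aggregating at search time.
post["word_counts"] = word_counts(post["description"])
```

Note that punctuation stays attached to words ("LA," keeps its comma), just like in the whitespace-split scripted-metric output further below; apply whatever normalization you need before counting.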
This is also possible at search time, albeit very expensive and it does not scale well; you also need script.painless.regex.enabled: true in your elasticsearch.yml:
GET kamboh/_search
{
  "size": 0,
  "aggregations": {
    "terms_script": {
      "scripted_metric": {
        "params": {
          "word_of_interest": ""
        },
        "init_script": "state.map = [:];",
        "map_script": """
          if (!doc.containsKey('description')) return;
          def split_by_whitespace = / /.split(doc['description.keyword'].value);
          for (def word : split_by_whitespace) {
            // skip words we are not interested in (an empty param means count everything)
            if (params['word_of_interest'] != "" && params['word_of_interest'] != word) {
              continue;
            }
            if (state.map.containsKey(word)) {
              state.map[word] += 1;
              continue;
            }
            state.map[word] = 1;
          }
        """,
        "combine_script": "return state.map;",
        "reduce_script": "return states;"
      }
    }
  }
}
yielding
...
"aggregations" : {
  "terms_script" : {
    "value" : [
      {
        "occurrences" : 1,
        "post" : 1,
        "city" : 2, <------
        "LA," : 1,
        "of" : 1,
        "this" : 1,
        "description" : 1,
        "is" : 1,
        "has" : 1,
        "my" : 1,
        "two" : 1,
        "word" : 1
      }
    ]
  }
}
...

Average of differences calculated between two date fields

I'm working on a project that uses Elasticsearch to store data and show some complex statistics.
I have an index that looks like this:
Reservation {
  id: number
  check_in: Date
  check_out: Date
  created_at: Date
  // other fields...
}
I need to calculate the average difference in days between the check_in and created_at of my Reservations in a specific date range, and show the result as a number.
I tried this query:
{
  "script_fields": {
    "avgDates": {
      "script": {
        "lang": "expression",
        "source": "doc['created_at'].value - doc['check_in'].value"
      }
    }
  },
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "created_at": {
              "gte": "{{lastMountTimestamp}}",
              "lte": "{{currentTimestamp}}"
            }
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "avgBetweenDates": {
      "avg": {
        "field": "avgDates"
      }
    }
  }
}
Date fields are saved in ISO 8601 form (e.g. 2020-03-11T14:25:15+00:00); I don't know if this could cause issues.
The query gets some hits, so it works for sure! But it always returns null as the value of the avgBetweenDates aggregation.
I need a result like this:
"aggregations": {
  "avgBetweenDates": {
    "value": 3.14159 // Π is just an example!
  }
}
Any ideas will help!
Thank you.
Scripted fields are not stored fields in ES. You can only aggregate on stored fields, as scripted fields are created on the fly.
You can simply move the script logic into the avg aggregation as shown below. Note that for the sake of understanding, I've created a sample mapping, documents, query, and its response.
Mapping:
PUT my_date_index
{
  "mappings": {
    "properties": {
      "check_in": {
        "type": "date",
        "format": "date_time"
      },
      "check_out": {
        "type": "date",
        "format": "date_time"
      },
      "created_at": {
        "type": "date",
        "format": "date_time"
      }
    }
  }
}
Sample Documents:
POST my_date_index/_doc/1
{
  "check_in": "2019-01-15T00:00:00.000Z",
  "check_out": "2019-01-20T00:00:00.000Z",
  "created_at": "2019-01-17T00:00:00.000Z"
}
POST my_date_index/_doc/2
{
  "check_in": "2019-01-15T00:00:00.000Z",
  "check_out": "2019-01-22T00:00:00.000Z",
  "created_at": "2019-01-20T00:00:00.000Z"
}
Aggregation Query:
POST my_date_index/_search
{
  "size": 0,
  "aggs": {
    "my_dates_diff": {
      "avg": {
        "script": """
          ZonedDateTime d1 = doc['created_at'].value;
          ZonedDateTime d2 = doc['check_in'].value;
          long differenceInMillis = ChronoUnit.MILLIS.between(d1, d2);
          return Math.abs(differenceInMillis / 86400000);
        """
      }
    }
  }
}
Notice that you wanted the difference in number of days; the above logic does that.
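To sanity-check the expected result, the same difference-then-average logic can be reproduced in plain Python using the two sample documents ingested above:

```python
from datetime import datetime

docs = [
    {"check_in": "2019-01-15T00:00:00.000Z", "created_at": "2019-01-17T00:00:00.000Z"},
    {"check_in": "2019-01-15T00:00:00.000Z", "created_at": "2019-01-20T00:00:00.000Z"},
]

def days_between(doc: dict) -> float:
    """Absolute difference between created_at and check_in, in days."""
    fmt = "%Y-%m-%dT%H:%M:%S.%f%z"
    created_at = datetime.strptime(doc["created_at"].replace("Z", "+0000"), fmt)
    check_in = datetime.strptime(doc["check_in"].replace("Z", "+0000"), fmt)
    return abs((created_at - check_in).total_seconds()) / 86400

# Doc 1 differs by 2 days, doc 2 by 5 days, so the average is 3.5.
average = sum(days_between(d) for d in docs) / len(docs)
```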
Response:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "my_dates_diff" : {
      "value" : 3.5 <---- Average in Number of Days
    }
  }
}
Hope this helps!
Scripted fields created within the _search context can only be consumed within that scope. They're not visible to the aggregations! This means you'll have to go with either
moving your script to the aggs section and doing the avg there,
a scripted_metric aggregation (quite slow and difficult to get right),
or creating a dateDifference field at index time (preferably an int: the difference of the timestamps), which enables powerful numeric aggs like extended_stats that provide a statistically useful output like:
{
  ...
  "aggregations": {
    "grades_stats": {
      "count": 2,
      "min": 50.0,
      "max": 100.0,
      "avg": 75.0,
      "sum": 150.0,
      "sum_of_squares": 12500.0,
      "variance": 625.0,
      "std_deviation": 25.0,
      "std_deviation_bounds": {
        "upper": 125.0,
        "lower": 25.0
      }
    }
  }
}
and are always faster than computing the timestamp differences with a script.
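The index-time option can be as simple as enriching each document before it is sent to ES. A Python sketch, assuming the ISO 8601 fields from the question (the dateDifference field name is the hypothetical one suggested above):

```python
from datetime import datetime

def with_date_difference(reservation: dict) -> dict:
    """Add an integer dateDifference (in days) computed from the ISO 8601 fields."""
    check_in = datetime.fromisoformat(reservation["check_in"])
    created_at = datetime.fromisoformat(reservation["created_at"])
    enriched = dict(reservation)
    enriched["dateDifference"] = abs((check_in - created_at).days)
    return enriched

doc = with_date_difference({
    "check_in": "2020-03-14T14:25:15+00:00",
    "created_at": "2020-03-11T14:25:15+00:00",
})
# doc["dateDifference"] is now a plain int, ready for stats / extended_stats aggs.
```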

How to query if a time is between two field values

How do I search for documents where a given time falls between a start and end time? For example, I want to query the following document using only a time like "18:33" or "21:32": "18:33" would return the following document and "21:32" wouldn't. I don't care about the date part or the seconds.
{
  "my start time field": "2020-01-23T18:32:21.768Z",
  "my end time field": "2020-01-23T20:32:21.768Z"
}
I've reviewed Using the range query with date fields, but I'm not sure how to look only at times. Also, I want to check whether a time is between two field values, not whether a field is between two times.
Essentially, I want the Elasticsearch equivalent of BETWEEN in SQL Server. Like this answer, except I don't want to use the current time but a variable.
DECLARE @blah datetime2 = GETDATE()
SELECT *
FROM Table1 T
WHERE CAST(@blah AS TIME)
BETWEEN CAST(T.StartDate AS TIME) AND CAST(T.EndDate AS TIME)
As per the suggestion from the OP and the link he provided, I'm providing the second solution here:
Solution 2: Index separate fields for hour and minute as hh:mm
Note the format used, hour_minute. You can find the list of available formats under the aforementioned link.
Basically, you re-ingest the documents with separate fields holding the hour and minute values and execute range queries to get what you want.
Mapping:
PUT my_time_index
{
  "mappings": {
    "properties": {
      "start_time": {
        "type": "date",
        "format": "hour_minute"
      },
      "end_time": {
        "type": "date",
        "format": "hour_minute"
      }
    }
  }
}
Sample Document:
POST my_time_index/_doc/1
{
  "start_time": "18:32",
  "end_time": "20:32"
}
Query Request:
POST my_time_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "start_time": {
              "gte": "18:00"
            }
          }
        },
        {
          "range": {
            "end_time": {
              "lte": "21:00"
            }
          }
        }
      ]
    }
  }
}
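A handy property of this format is that zero-padded "HH:mm" strings compare lexicographically in the same order as chronologically, so the two range clauses behave like plain string comparisons. A small Python sketch of that idea, using the sample times above (the function is illustrative, not part of ES):

```python
def in_window(start: str, end: str, lower: str, upper: str) -> bool:
    """Zero-padded "HH:MM" strings sort lexicographically in chronological
    order, so plain string comparison mirrors the two range clauses above."""
    return start >= lower and end <= upper

# Document with start_time "18:32" and end_time "20:32" versus the
# query window gte "18:00" / lte "21:00":
matches = in_window("18:32", "20:32", "18:00", "21:00")
```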
Let me know if this helps!
Don't store times in a datetime datatype, based upon this discussion.
If you want to filter on the specific hour of the day, you need to extract it into its own field.
Via the Kibana Dev Tools -> Console
Create some mock data:
POST between-research/_doc/1
{
  "my start hour": 0,
  "my end hour": 12
}
POST between-research/_doc/2
{
  "my start hour": 13,
  "my end hour": 23
}
Perform the "between" search:
POST between-research/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "my start hour": {
              "lte": 10
            }
          }
        },
        {
          "range": {
            "my end hour": {
              "gte": 10
            }
          }
        }
      ]
    }
  }
}
Solution 1: Existing Date Format
Without changing your documents and ingesting hours and minutes separately, I've come up with the solution below. I don't think you'll be happy with the way ES gets you there, but it certainly works.
I've created a sample mapping, document, query, and response based on the data you've provided.
Mapping:
PUT my_date_index
{
  "mappings": {
    "properties": {
      "start_time": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
      },
      "end_time": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
      }
    }
  }
}
Sample Documents:
POST my_date_index/_doc/1
{
  "start_time": "2020-01-23T18:32:21.768Z",
  "end_time": "2020-01-23T20:32:21.768Z"
}
POST my_date_index/_doc/2
{
  "start_time": "2020-01-23T19:32:21.768Z",
  "end_time": "2020-01-23T20:32:21.768Z"
}
POST my_date_index/_doc/3
{
  "start_time": "2020-01-23T21:32:21.768Z",
  "end_time": "2020-01-23T22:32:21.768Z"
}
Query Request:
POST my_date_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "script": {
            "script": {
              "source": """
                ZonedDateTime zstart_time = doc['start_time'].value;
                int zstart_hour = zstart_time.getHour();
                int zstart_minute = zstart_time.getMinute();
                int zstart_total_minutes = zstart_hour * 60 + zstart_minute;
                ZonedDateTime zend_time = doc['end_time'].value;
                int zend_hour = zend_time.getHour();
                int zend_minute = zend_time.getMinute();
                int zend_total_minutes = zend_hour * 60 + zend_minute;
                int my_input_total_minutes = params.my_input_hour * 60 + params.my_input_minute;
                if (zstart_total_minutes <= my_input_total_minutes && zend_total_minutes >= my_input_total_minutes) {
                  return true;
                }
                return false;
              """,
              "params": {
                "my_input_hour": 20,
                "my_input_minute": 10
              }
            }
          }
        }
      ]
    }
  }
}
Basically:
calculate the number of minutes from start_time,
calculate the number of minutes from end_time,
calculate the number of minutes from params.my_input_hour and params.my_input_minute,
then execute the logic in the if condition as start_time <= input <= end_time using the minutes of all three values, and return the documents accordingly.
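Those steps boil down to a small pure function. A Python sketch of the script's logic, with the same parameter names as the Painless version:

```python
def time_in_range(start_hour: int, start_minute: int,
                  end_hour: int, end_minute: int,
                  my_input_hour: int, my_input_minute: int) -> bool:
    """Mirror of the Painless script: compare everything as minutes since midnight."""
    start_total = start_hour * 60 + start_minute
    end_total = end_hour * 60 + end_minute
    input_total = my_input_hour * 60 + my_input_minute
    return start_total <= input_total <= end_total

# Document 1 (18:32 - 20:32) matches an input of 20:10;
# document 3 (21:32 - 22:32) does not.
```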
Response:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 2.0,
    "hits" : [
      {
        "_index" : "my_time_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.0,
        "_source" : {
          "start_time" : "18:32",
          "end_time" : "20:32"
        }
      }
    ]
  }
}
Do test solution 1 thoroughly for performance issues, as script queries generally hurt performance; however, they come in handy when you have no other option.
Let me know if this helps!

ElasticSearch: aggregation on _score field?

I would like to use the stats or extended_stats aggregation on the _score field but can't find any examples of this being done (i.e., it seems like you can only use aggregations on actual document fields).
Is it possible to request aggregations on calculated "metadata" fields of each hit in an Elasticsearch query response (e.g., _score, _type, _shard, etc.)?
I'm assuming the answer is 'no' since fields like _score aren't indexed...
Note: The original answer is now outdated in terms of the latest version of Elasticsearch. The equivalent script using Groovy scripting would be:
{
  ...,
  "aggregations" : {
    "grades_stats" : {
      "stats" : {
        "script" : "_score"
      }
    }
  }
}
In order to make this work, you will need to enable dynamic scripting or, even better, store a file-based script and execute it by name (for added security by not enabling dynamic scripting)!
You can use a script and refer to the score using doc.score. More details are available in Elasticsearch's scripting documentation.
A sample stats aggregation could be:
{
  ...,
  "aggregations" : {
    "grades_stats" : {
      "stats" : {
        "script" : "doc.score"
      }
    }
  }
}
And the results would look like:
"aggregations": {
  "grades_stats": {
    "count": 165,
    "min": 0.46667441725730896,
    "max": 3.1525731086730957,
    "avg": 0.8296855776598959,
    "sum": 136.89812031388283
  }
}
A histogram may also be a useful aggregation:
"aggs": {
  "grades_histogram": {
    "histogram": {
      "script": "doc.score * 10",
      "interval": 3
    }
  }
}
Histogram results:
"aggregations": {
  "grades_histogram": {
    "buckets": [
      {
        "key": 3,
        "doc_count": 15
      },
      {
        "key": 6,
        "doc_count": 103
      },
      {
        "key": 9,
        "doc_count": 46
      },
      {
        "key": 30,
        "doc_count": 1
      }
    ]
  }
}
doc.score doesn't seem to work anymore. Using _score seems to work perfectly.
Example:
{
  ...,
  "aggregations" : {
    "grades_stats" : {
      "stats" : {
        "script" : "_score"
      }
    }
  }
}
