I'm using the ingest pipeline script processors to extract the day of the week from the local time for each document.
I'm using the client_ip to extract the timezone, use that along with the timestamp to extract the local time, and then extract day of week (and other features) from that local time.
This is my ingest pipeline:
{
"processors" : [
{
"set" : {
"field" : "#timestamp",
"override" : false,
"value" : "{{_ingest.timestamp}}"
}
},
{
"date" : {
"field" : "#timestamp",
"formats" : [
"EEE MMM dd HH:mm:ss 'UTC' yyyy"
],
"ignore_failure" : true,
"target_field" : "#timestamp"
}
},
{
"convert" : {
"field" : "client_ip",
"type" : "ip",
"ignore_failure" : true,
"ignore_missing" : true
}
},
{
"geoip" : {
"field" : "client_ip",
"target_field" : "client_geo",
"properties" : [
"continent_name",
"country_name",
"country_iso_code",
"region_iso_code",
"region_name",
"city_name",
"location",
"timezone"
],
"ignore_failure" : true,
"ignore_missing" : true
}
},
{
"script" : {
"description" : "Extract details of Dates",
"lang" : "painless",
"ignore_failure" : true,
"source" : """
LocalDateTime local_time = LocalDateTime.ofInstant( Instant.ofEpochMilli(ctx['#timestamp']), ZoneId.of(ctx['client_geo.timezone']));
int day_of_week = local_time.getDayOfWeek().getValue();
int hour_of_day = local_time.getHour();
int office_hours = 0;
if (day_of_week<6 && day_of_week>0) { if (hour_of_day >= 7 && hour_of_day <= 19 ) {office_hours =1;} else {office_hours = -1;}} else {office_hours = -1;}
ctx['day_of_week'] = day_of_week;
ctx['hour_of_day'] = hour_of_day;
ctx['office_hours'] = office_hours;
"""
}
}
]
}
The first two processors were added before for other purposes. I've added the last 3.
An example document could be the following:
"docs": [
{
"_source": {
"#timestamp": 43109942361111,
"client_ip": "89.160.20.128"
}
}
]
I'm getting the GeoIP fields in the data now, but none of the fields created by the script processor. What am I doing wrong?
EDIT
A few notes about the index that is affected by these changes:
The Dynamic mapping is off.
I have manually added the client_geo.timezone field to the mapping of the index as a keyword.
When I run the following scripted search on the index
GET index_name/_search
{
"script_fields": {
"day_of_week": {
"script": "doc['#timestamp'].value.withZoneSameInstant(ZoneId.of(doc['client_geo']['timezone'])).getDayOfWeek().getValue()"
}
}
}
I get the following runtime error in script execution:
"caused_by" : {
"type" : "illegal_argument_exception",
"reason" : "No field found for [client_geo] in mapping"
}
Thanks for the well-formed question and example.
I was able to replicate your problem and figured it out.
ctx is "the document source as it is". Consequently, the script does not automatically resolve dot-delimited field names into nested objects.
Your client data is added as such:
"client_geo" : {
"continent_name" : "Europe"
//<snip>..</snip>
}
So, you have to access it directly as a nested hash map.
Meaning ctx['client_geo.timezone'] should actually be ctx['client_geo']['timezone']
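The difference between the two lookups can be reproduced with plain Java maps, which is roughly how ctx behaves inside a Painless ingest script. This is only a sketch: the class and method names are made up for illustration, and "Europe/Stockholm" is an assumed timezone value for the sample IP. With the dotted key the lookup returns null, ZoneId.of(null) then throws, and because the script processor has "ignore_failure": true the error is presumably swallowed silently, which would explain why no script fields show up.

```java
import java.util.HashMap;
import java.util.Map;

final class CtxLookup {
    /** Builds a map shaped like the ingest document after the geoip processor ran. */
    static Map<String, Object> sampleCtx() {
        Map<String, Object> geo = new HashMap<>();
        geo.put("continent_name", "Europe");
        geo.put("timezone", "Europe/Stockholm"); // assumed value, for illustration only
        Map<String, Object> ctx = new HashMap<>();
        ctx.put("client_ip", "89.160.20.128");
        ctx.put("client_geo", geo);
        return ctx;
    }

    /** Dotted key: looks for one top-level key literally named "client_geo.timezone". */
    static Object dottedLookup(Map<String, Object> ctx) {
        return ctx.get("client_geo.timezone");
    }

    /** Nested lookup: fetch the inner map first, then the key inside it. */
    static Object nestedLookup(Map<String, Object> ctx) {
        Object geo = ctx.get("client_geo");
        return geo instanceof Map ? ((Map<?, ?>) geo).get("timezone") : null;
    }
}
```

dottedLookup returns null for the sample map, while nestedLookup returns the timezone string.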
Here is the full pipeline that worked for me:
"processors": [
{
"set": {
"field": "#timestamp",
"override": false,
"value": "{{_ingest.timestamp}}"
}
},
{
"date": {
"field": "#timestamp",
"formats": [
"EEE MMM dd HH:mm:ss 'UTC' yyyy"
],
"ignore_failure": true,
"target_field": "#timestamp"
}
},
{
"convert": {
"field": "client_ip",
"type": "ip",
"ignore_failure": true,
"ignore_missing": true
}
},
{
"geoip": {
"field": "client_ip",
"target_field": "client_geo",
"properties": [
"continent_name",
"country_name",
"country_iso_code",
"region_iso_code",
"region_name",
"city_name",
"location",
"timezone"
],
"ignore_failure": true,
"ignore_missing": true
}
},
{
"script": {
"description": "Extract details of Dates",
"lang": "painless",
"ignore_failure": true,
"source": """
LocalDateTime local_time = LocalDateTime.ofInstant(Instant.ofEpochMilli(ctx['#timestamp']), ZoneId.of(ctx['client_geo']['timezone']));
int day_of_week = local_time.getDayOfWeek().getValue();
int hour_of_day = local_time.getHour();
int office_hours = 0;
if (day_of_week<6 && day_of_week>0) { if (hour_of_day >= 7 && hour_of_day <= 19 ) {office_hours =1;} else {office_hours = -1;}} else {office_hours = -1;}
ctx['day_of_week'] = day_of_week;
ctx['hour_of_day'] = hour_of_day;
ctx['office_hours'] = office_hours;
"""
}
}
]
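The date arithmetic inside the script processor only uses java.time classes, so it can be checked outside Elasticsearch. Below is a minimal sketch, assuming the timestamp is an epoch-millisecond number as in the sample document; the class and method names are hypothetical, and the timezone passed in stands for whatever the geoip processor resolved.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

final class LocalTimeFeatures {
    /** Converts an epoch-millisecond timestamp to wall-clock time in the given zone. */
    static LocalDateTime localTime(long epochMillis, String zoneId) {
        return LocalDateTime.ofInstant(Instant.ofEpochMilli(epochMillis), ZoneId.of(zoneId));
    }

    /** ISO day of week: 1 = Monday ... 7 = Sunday. */
    static int dayOfWeek(long epochMillis, String zoneId) {
        return localTime(epochMillis, zoneId).getDayOfWeek().getValue();
    }

    static int hourOfDay(long epochMillis, String zoneId) {
        return localTime(epochMillis, zoneId).getHour();
    }

    /** 1 on Monday to Friday when the local hour is 7..19 (inclusive), otherwise -1. */
    static int officeHours(long epochMillis, String zoneId) {
        int day = dayOfWeek(epochMillis, zoneId);
        int hour = hourOfDay(epochMillis, zoneId);
        boolean weekday = day >= 1 && day <= 5;
        boolean working = hour >= 7 && hour <= 19;
        return (weekday && working) ? 1 : -1;
    }
}
```

Note that, like the script, this treats the whole 19:00 hour as office hours because only the hour is compared.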
Gist: I'm trying to write a custom filter on nested documents using Painless, and I want to add checks for documents that have no nested documents so that I avoid a null_pointer_exception.
I have a mapping as such (simplified and obfuscated)
{
"video_entry" : {
"aliases" : { },
"mappings" : {
"properties" : {
"captions_added" : {
"type" : "boolean"
},
"category" : {
"type" : "keyword"
},
"is_votable" : {
"type" : "boolean"
},
"members" : {
"type" : "nested",
"properties" : {
"country" : {
"type" : "keyword"
},
"date_of_birth" : {
"type" : "date"
}
}
}
}
}
}
}
Each video_entry document can have 0 or more members nested documents.
Sample Document
{
"captions_added": true,
"category" : "Mental Health",
"is_votable" : true,
"members": [
{"country": "Denmark", "date_of_birth": "1998-04-04T00:00:00"},
{"country": "Denmark", "date_of_birth": "1999-05-05T00:00:00"}
]
}
If one or more nested documents exist, we want to write Painless scripts that check certain fields across all the nested documents. My script works on indices with a few documents, but when I try it on a larger set of documents I get null pointer exceptions despite having every null check I could think of. I've tried various access patterns and error-checking mechanisms, but I still get exceptions.
POST /video_entry/_search
{
"query": {
"script": {
"script": {
"source": """
// various NULL checks that I already tried
// also tried short circuiting on finding null values
if (!params['_source'].empty && params['_source'].containsKey('members')) {
def total = 0;
for (item in params._source.members) {
// custom logic here
// if above logic holds true
// total += 1;
}
return total > 3;
}
return true;
""",
"lang": "painless"
}
}
}
}
Other Statements That I've Tried
if (params._source == null) {
return true;
}
if (params._source.members == null) {
return true;
}
if (!ctx._source.contains('members')) {
return true;
}
if (!params['_source'].empty && params['_source'].containsKey('members') &&
params['_source'].members.value != null) {
// logic here
}
if (doc.containsKey('members')) {
for (mem in params._source.members) {
}
}
Error Message
&& params._source.members",
^---- HERE"
"caused_by" : {
"type" : "null_pointer_exception",
"reason" : null
}
I've looked into changing the structure (flattening the document) and the usage of must_not as indicated in this answer. They don't suit our use case as we need to incorporate some more custom logic.
Different tutorials use ctx, some use doc, and some use params. To add to the confusion, Debug.explain(doc.members) and Debug.explain(params._source.members) return empty responses, so I'm having a hard time figuring out the types.
Any help is appreciated.
TL;DR
Elasticsearch flattens arrays of objects, so that
{
"group" : "fans",
"user" : [
{
"first" : "John",
"last" : "Smith"
},
{
"first" : "Alice",
"last" : "White"
}
]
}
turns into:
{
"group" : "fans",
"user.first" : [ "alice", "john" ],
"user.last" : [ "smith", "white" ]
}
To access the inner values of members, you need to reference them as doc['members.<field>'], because members does not exist as a field on its own.
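To make the flattening concrete, here is a small Java sketch that turns a document with an array of objects into dotted multi-value fields. It is an illustration only, not Elasticsearch's implementation: the class name is made up, and unlike the example above it keeps the original order and case of the values, since no analysis step is simulated.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ObjectFlattener {
    /** Flattens nested maps and lists into "a.b" -> [values...] entries. */
    static Map<String, List<Object>> flatten(Map<String, Object> source) {
        Map<String, List<Object>> out = new LinkedHashMap<>();
        collect("", source, out);
        return out;
    }

    private static void collect(String prefix, Object node, Map<String, List<Object>> out) {
        if (node instanceof Map) {
            // Objects extend the dotted path with each key.
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                String path = prefix.isEmpty() ? String.valueOf(e.getKey()) : prefix + "." + e.getKey();
                collect(path, e.getValue(), out);
            }
        } else if (node instanceof List) {
            // Arrays do not add a path segment; their items share the same path.
            for (Object item : (List<?>) node) {
                collect(prefix, item, out);
            }
        } else if (node != null) {
            out.computeIfAbsent(prefix, k -> new ArrayList<>()).add(node);
        }
    }
}
```

In this sketch a document with an empty members array produces no members.* entries at all, which matches the need to check size() before reading values in the script.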
Details
As you may know, Elasticsearch handles inner objects in its own way. [doc]
So you will need to reference them accordingly.
Here is what I did to make it work.
By the way, I have been using the Dev Tools console in Kibana.
PUT /so_test/
PUT /so_test/_mapping
{
"properties" : {
"captions_added" : {
"type" : "boolean"
},
"category" : {
"type" : "keyword"
},
"is_votable" : {
"type" : "boolean"
},
"members" : {
"properties" : {
"country" : {
"type" : "keyword"
},
"date_of_birth" : {
"type" : "date"
}
}
}
}
}
POST /so_test/_doc/
{
"captions_added": true,
"category" : "Mental Health",
"is_votable" : true,
"members": [
{"country": "Denmark", "date_of_birth": "1998-04-04T00:00:00"},
{"country": "Denmark", "date_of_birth": "1999-05-05T00:00:00"}
]
}
POST /so_test/_doc/
{
"captions_added": true,
"category" : "Mental breakdown",
"is_votable" : true,
"members": []
}
POST /so_test/_doc/
{
"captions_added": true,
"category" : "Mental success",
"is_votable" : true,
"members": [
{"country": "France", "date_of_birth": "1998-04-04T00:00:00"},
{"country": "Japan", "date_of_birth": "1999-05-05T00:00:00"}
]
}
And then I did this query (it is only a bool filter, but I guess making it work for your own use case should not prove too difficult)
GET /so_test/_search
{
"query":{
"bool": {
"filter": {
"script": {
"script": {
"lang": "painless",
"source": """
def flag = false;
// /!\ notice how the field is referenced /!\
if(doc['members.country'].size() != 0)
{
for (item in doc['members.country']) {
if (item == params.country){
flag = true
}
}
}
return flag;
""",
"params": {
"country": "Japan"
}
}
}
}
}
}
}
By the way, you said you were a bit confused about the contexts for Painless. You can find some details about them in the documentation. [doc]
In this case the filter context is the one we want to look at.
I'm using Elasticsearch 7.13 and running my requests from Kibana.
This is my mapping
{
"full_text" : {
"properties" : {
"title" : {
"type" : "text",
"fielddata" : true
}
}
}
}
This is my data
"full_text" : [
{
"title" : "Pkd chuyên cho thuê kingdom 101 1pn đến 3pn giá rẻ nhất thị trường chỉ 11 triệu/căn. lh 0919504***"
}
]
This is my code to sort by length of full_text.title
"sort": {
"_script": {
"type": "number",
"order": "desc",
"script": {
"lang": "painless",
"source": "doc['full_text.title'].value.length()"
}
}
}
So why does the sort value come back as only 7?
"_source" : {
"full_text" : [
{
"title" : "Pkd chuyên cho thuê kingdom 101 1pn đến 3pn giá rẻ nhất thị trường chỉ 11 triệu/căn. lh 0919504***"
}
]
},
"sort": [
7.0
]
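Because full_text.title is a text field with fielddata enabled, doc['full_text.title'] holds the analyzed tokens of the title rather than the original string, and .value returns only the first of them (here most likely "0919504", which is 7 characters long). You need to join the tokens back into one string.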
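The effect can be imitated in plain Java. The sketch below is a rough stand-in for analysis, not the real analyzer: it assumes lowercasing, splitting on non-alphanumeric characters, and sorted unique tokens, and it uses a made-up ASCII title. The class and method names are hypothetical.

```java
import java.util.Locale;
import java.util.TreeSet;

final class FielddataLength {
    /** Rough stand-in for analysis: lowercase, split on non-alphanumerics, keep sorted unique tokens. */
    static TreeSet<String> tokens(String text) {
        TreeSet<String> out = new TreeSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    /** What reading only the first token amounts to. */
    static int firstTokenLength(String text) {
        return tokens(text).first().length();
    }

    /** What joining all tokens with a single space amounts to. */
    static int joinedLength(String text) {
        return String.join(" ", tokens(text)).length();
    }
}
```

For the made-up title "Kingdom 101 for rent, call 0919504" (34 characters), the first token is "0919504" and the joined tokens are 33 characters long. So joining the tokens gives a value close to the original length but not necessarily identical, because punctuation and repeated words are lost under these assumptions.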
Try this:
"source": "int length = String.join(' ',doc['full_text.title']).length(); return length;"
Given the following document, where variants is mapped as a nested type and options as a flattened type:
{
"doc_type" : "product",
"id" : 1,
"variants" : [
{
"options" : {
"Size" : "XS"
},
"price" : 1
},
{
"options" : {
"Size" : "S",
"Material": "Wool"
},
"price" : 6.99
}
]
}
I want to run an aggregation that produces data in the following format:
{
"variants.options.Size": {
"buckets" : [
{
"key" : "XS",
"doc_count" : 1
},
{
"key" : "S",
"doc_count" : 1
}
]
},
"variants.options.Material": {
"buckets" : [
{
"key" : "Wool",
"doc_count" : 1
}
]
}
}
I could very easily do something like:
"aggs": {
"variants.options.Size": {
"terms": {
"field": "variants.options.Size"
}
},
"variants.options.Material": {
"terms": {
"field": "variants.options.Material"
}
}
}
The caveat here is that we're using the flattened type for options because the fields in options are dynamic, so there is no way for me to know beforehand that we want to aggregate on Size and Material.
Essentially, I want to tell Elasticsearch that it should aggregate on whatever keys it finds under options. Is there a way to do this?
"I want to tell Elasticsearch that it should aggregate on whatever keys it finds under options. Is there a way to do this?"
Not directly. I had the same question a while back. I haven't found a clean solution to this day and I'm convinced there isn't one.
Luckily, there's a scripted_metric workaround that I outlined here. Applying it to your use case:
POST your_index/_search
{
"size": 0,
"aggs": {
"dynamic_variant_options": {
"scripted_metric": {
"init_script": "state.buckets = [:];",
"map_script": """
def variants = params._source['variants'];
for (def variant : variants) {
for (def entry : variant['options'].entrySet()) {
def key = entry.getKey();
def value = entry.getValue();
def path = "variants.options." + key;
if (state.buckets.containsKey(path)) {
if (state.buckets[path].containsKey(value)) {
state.buckets[path][value] += 1;
} else {
state.buckets[path][value] = 1;
}
} else {
state.buckets[path] = [value:1];
}
}
}
""",
"combine_script": "return state",
"reduce_script": "return states"
}
}
}
}
would yield:
"aggregations" : {
"dynamic_variant_options" : {
"value" : [
{
"buckets" : {
"variants.options.Size" : {
"S" : 1,
"XS" : 1
},
"variants.options.Material" : {
"Wool" : 1
}
}
}
]
}
}
You'll need to adjust the painless code if you want the buckets to be arrays of key-doc_count pairs instead of hash maps like in my example.
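The counting done in map_script, plus the conversion to key/doc_count pairs mentioned above, can be sketched in plain Java as follows. This is an illustration of the logic under assumptions, not code that runs inside Elasticsearch: the class and method names are made up, values are converted to strings, and the counts are per variant, as in the script.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DynamicOptionBuckets {
    /** Counts every option value per dynamic key, e.g. "variants.options.Size" -> {XS=1, S=1}. */
    static Map<String, Map<String, Integer>> count(List<Map<String, Object>> variants) {
        Map<String, Map<String, Integer>> buckets = new LinkedHashMap<>();
        for (Map<String, Object> variant : variants) {
            Object options = variant.get("options");
            if (!(options instanceof Map)) {
                continue; // skip variants without an options object
            }
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) options).entrySet()) {
                String path = "variants.options." + entry.getKey();
                String value = String.valueOf(entry.getValue());
                buckets.computeIfAbsent(path, k -> new LinkedHashMap<>()).merge(value, 1, Integer::sum);
            }
        }
        return buckets;
    }

    /** Converts {value -> count} into a list of {"key": value, "doc_count": count} maps. */
    static List<Map<String, Object>> toBucketList(Map<String, Integer> counts) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            Map<String, Object> bucket = new LinkedHashMap<>();
            bucket.put("key", e.getKey());
            bucket.put("doc_count", e.getValue());
            out.add(bucket);
        }
        return out;
    }
}
```

The same merge-style counting could be ported back into the Painless map_script, and the conversion step into reduce_script, though that port is left untested here.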
How do I search for documents that are between a start and end time? For example, I want to query the following document using a time only like "18:33" or "21:32". "18:33" would return the following document and "21:32" wouldn't. I don't care about the date part nor the secs.
{
"my start time field": "2020-01-23T18:32:21.768Z",
"my end time field": "2020-01-23T20:32:21.768Z"
}
I've reviewed "Using the range query with date fields", but I'm not sure how to look at times only. Also, I want to see if a time is between two fields, not if a field is between two times.
Essentially, the Elasticsearch equivalent of BETWEEN for SQL Server. Like this answer except I don't want to use the current time but a variable.
DECLARE @blah datetime2 = GETDATE()
SELECT *
FROM Table1 T
WHERE CAST(@blah AS TIME)
BETWEEN cast(T.StartDate as TIME) AND cast(T.EndDate as TIME)
As per the OP's suggestion, and the link they provided about Stack Overflow's guidelines, I'm posting the second solution here:
Solution 2: Insert separate fields for hour minute as hh:mm
Note the format used, hour_minute. You can find the list of available formats under the aforementioned link.
Basically you re-ingest the documents with a separate field that would have hour and minute values and execute range queries to get what you want.
Mapping:
PUT my_time_index
{
"mappings": {
"properties": {
"start_time":{
"type": "date",
"format": "hour_minute"
},
"end_time":{
"type": "date",
"format": "hour_minute"
}
}
}
}
Sample Document:
POST my_time_index/_doc/1
{
"start_time": "18:32",
"end_time": "20:32"
}
Query Request:
POST my_time_index/_search
{
"query": {
"bool": {
"must": [
{
"range": {
"start_time": {
"gte": "18:00"
}
}
},
{
"range": {
"end_time": {
"lte": "21:00"
}
}
}
]
}
}
}
Let me know if this helps!
Don't store times in a datetime datatype based upon this discussion.
If you want to filter for a specific hour of the day, you need to extract that into its own field.
Via the Kibana Dev Tools -> Console
Create some mock data:
POST between-research/_doc/1
{
"my start hour": 0,
"my end hour": 12
}
POST between-research/_doc/2
{
"my start hour": 13,
"my end hour": 23
}
Perform "between" search
POST between-research/_search
{
"query": {
"bool": {
"must": [
{
"range": {
"my start hour": {
"lte": 10
}
}
},
{
"range": {
"my end hour": {
"gte": 10
}
}
}
]
}
}
}
Solution 1: Existing Date Format
This solution works without re-ingesting your hours and minutes as separate fields. You may not like the way Elasticsearch makes you do it, but it certainly works.
I've created a sample mapping, document, the query and response based on the data you've provided.
Mapping:
PUT my_date_index
{
"mappings": {
"properties": {
"start_time":{
"type": "date",
"format": "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
},
"end_time":{
"type": "date",
"format": "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
}
}
}
}
Sample Documents:
POST my_date_index/_doc/1
{
"start_time": "2020-01-23T18:32:21.768Z",
"end_time": "2020-01-23T20:32:21.768Z"
}
POST my_date_index/_doc/2
{
"start_time": "2020-01-23T19:32:21.768Z",
"end_time": "2020-01-23T20:32:21.768Z"
}
POST my_date_index/_doc/3
{
"start_time": "2020-01-23T21:32:21.768Z",
"end_time": "2020-01-23T22:32:21.768Z"
}
Query Request:
POST my_date_index/_search
{
"query": {
"bool": {
"must": [
{
"script": {
"script": {
"source": """
ZonedDateTime zstart_time = doc['start_time'].value;
int zstart_hour = zstart_time.getHour();
int zstart_minute = zstart_time.getMinute();
int zstart_total_minutes = zstart_hour * 60 + zstart_minute;
ZonedDateTime zend_time = doc['end_time'].value;
int zend_hour = zend_time.getHour();
int zend_minute = zend_time.getMinute();
int zend_total_minutes = zend_hour * 60 + zend_minute;
int my_input_total_minutes = params.my_input_hour * 60 + params.my_input_minute;
if(zstart_total_minutes <= my_input_total_minutes && zend_total_minutes >= my_input_total_minutes){
return true;
}
return false;
""",
"params": {
"my_input_hour": 20,
"my_input_minute": 10
}
}
}
}
]
}
}
}
Basically:
calculate the number of minutes since midnight for start_time
calculate the number of minutes since midnight for end_time
calculate the number of minutes since midnight from params.my_input_hour and params.my_input_minute
evaluate start_time <= input <= end_time on those three minute values and return the matching documents.
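The steps above can be sketched in plain Java with java.time, which is what the Painless script relies on as well. The class and method names are made up for illustration, and the sketch assumes the interval does not cross midnight.

```java
import java.time.ZonedDateTime;

final class TimeOfDayBetween {
    /** Minutes elapsed since midnight, ignoring the date and the seconds. */
    static int minutesOfDay(ZonedDateTime t) {
        return t.getHour() * 60 + t.getMinute();
    }

    /** True when hour:minute lies within [start, end], comparing time of day only. */
    static boolean isBetween(ZonedDateTime start, ZonedDateTime end, int hour, int minute) {
        int input = hour * 60 + minute;
        return minutesOfDay(start) <= input && input <= minutesOfDay(end);
    }
}
```

With the first sample document (18:32 to 20:32), an input of 18:33 or 20:10 matches and 21:32 does not. An interval such as 23:00 to 01:00 would need extra handling that this sketch leaves out.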
Response:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 2.0,
"hits" : [
{
"_index" : "my_time_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.0,
"_source" : {
"start_time" : "18:32",
"end_time" : "20:32"
}
}
]
}
}
Do test Solution 1 thoroughly for performance issues, as script queries generally hurt performance; however, they come in handy if you have no other option.
Let me know if this helps!
Using Elasticsearch 1.7.1, I have the following document structure:
"_source" : {
"questions" : {
"defaultQuestion" : {
"tag" : 0,
"gid" : 0,
"rid" : 0,
"caption" : "SRID"
},
"tableQuestion" : {
"rows" : [{
"ids" : {
"answerList" : ["3547", "3548"],
"tag" : "0",
"caption" : "Accounts"
},
"name" : {
"answerList" : ["Some Name"],
"tag" : "0",
"caption" : "Name"
}
}
],
"caption" : "BPI 1500541753537",
"id" : 644251570,
"tag" : ""
}
},
"id" : "447722821"
}
I want to add a new object to questions.tableQuestion.rows. My current script replaces the existing object with the new one. Kindly suggest how to append it instead. The following is my update script.
{ "update": {"_id": "935663867", "_retry_on_conflict" : 3} }
{ "script" : "ctx._source.questions += param1", "params" : {"param1" : {"tableQuestion": {"rows" : [ NEWROWOBJECT ]} } }}
You can write out the full path through the nested fields, right down to the rows property, and then use the += operator. It's also good to check whether the rows array is null and to initialize it in that case.
Checked with ES 2.4, but should be similar for earlier versions:
POST http://127.0.0.1:9200/sample/demo/{document_id}/_update
{
"script": {
"inline": "if (ctx._source.questions.tableQuestion.rows == null) ctx._source.questions.tableQuestion.rows = new ArrayList(); ctx._source.questions.tableQuestion.rows += param1;",
"params" : {
"param1" : {
"ids": {
"answerList": [
"478",
"255"
],
"tag": "2",
"caption": "My Test"
},
"name": {
"answerList": [
"My Name"
],
"tag": "1",
"caption": "My Demo"
}
}
}
}
}
For ES 5.x and Painless language the script is a bit different:
POST http://127.0.0.1:9200/sample/demo/{document_id}/_update
{
"script": {
"inline": "if (ctx._source.questions.tableQuestion.rows == null) { ctx._source.questions.tableQuestion.rows = new ArrayList();} ctx._source.questions.tableQuestion.rows.add(params.param1);",
"params" : {
"param1" : {
...
}
}
}
}
Update to the additional comment
If some part of the path is dynamic, you can also use parameters to build the path with the get(param_name) method. Try this syntax (I removed the null check for simplicity):
{
"script": {
"inline": "ctx._source.questions.get(param2).rows += param1;",
"params" : {
"param2" : "6105243",
"param1" : {
"ids": {
"answerList": [
"478",
"255"
],
"tag": "2",
"caption": "My Test"
},
"name": {
"answerList": [
"My Name"
],
"tag": "1",
"caption": "My Demo"
}
}
}
}
}
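The append-with-null-check logic, including the dynamic key looked up via get, can be sketched in plain Java as follows. Here source stands in for ctx._source and questionKey for the param2 value; the class and method names are made up, and the sketch assumes that questions and the addressed question object already exist.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class RowAppender {
    /** Appends row to source.questions[questionKey].rows, creating the list if it is missing. */
    @SuppressWarnings("unchecked")
    static void appendRow(Map<String, Object> source, String questionKey, Map<String, Object> row) {
        Map<String, Object> questions = (Map<String, Object>) source.get("questions");
        Map<String, Object> question = (Map<String, Object>) questions.get(questionKey);
        List<Object> rows = (List<Object>) question.get("rows");
        if (rows == null) {
            // Mirrors the null check in the scripts: start with an empty list.
            rows = new ArrayList<>();
            question.put("rows", rows);
        }
        rows.add(row); // append instead of replacing the existing rows
    }
}
```

Calling appendRow(source, "tableQuestion", newRow) keeps the existing rows and adds one more, which is the behaviour the add(...) call in the Painless version is meant to give.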