Elasticsearch update date in nested

I want to subtract 15 minutes from all dates in the history that are less than 15 minutes old, so I have to compare the date now minus 15 minutes against each record's date.
However, when I retrieve the date I cannot compare it, because it behaves like a String, and appending ".value" produces an error saying the attribute does not exist.
Error response:
"if(ctx._source.histories[i].creation_date.value"
dynamic getter [java.lang.String, value] not found
I tried other variants, each failing with a different error:
"if(ctx._source.histories[i].creation_date.date"
"if(ctx._source.histories[i].creation_date.getMillis()"
"if(ctx._source.histories[i].creation_date.value.getMillis()"
Update request (elasticsearch.js):
{
  "query": { "term": { "user_id": "USER_ID" } },
  "script": {
    "lang": "painless",
    "source": "for(int i = ctx._source.histories.length-1; i > 0; --i){ if(ctx._source.histories[i].creation_date.value > params.date) { ctx._source.histories[i].creation_date -= 1000 * 60 * 15; } }",
    "params": { "date": new Date() - 1000 * 60 * 15 }
  }
}
Mapping:
{
  "mappings": {
    "_doc": {
      "properties": {
        "histories": {
          "type": "nested",
          "properties": {
            "type": { "type": "text" },
            "key": { "type": "text" },
            "value": { "type": "text" },
            "ip": { "type": "ip" },
            "useragent": { "type": "text" },
            "creation_date": { "type": "date" }
          }
        }
      }
    }
  }
}
Elasticsearch info:
{
  "name" : "ZZZ",
  "cluster_name" : "YYY",
  "cluster_uuid" : "XXX",
  "version" : {
    "number" : "6.5.2",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "WWW",
    "build_date" : "2018-11-29T23:58:20.891072Z",
    "build_snapshot" : false,
    "lucene_version" : "7.5.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
Sample data:
{
  "hits": {
    "total": 1,
    "max_score": 4.13468,
    "hits": [
      {
        "_index": "myindex",
        "_type": "_doc",
        "_id": "H1dQ4WgBypYasGfnnXXI",
        "_score": 4.13468,
        "_source": {
          "infos": {
            "firsname": "John",
            "lastname": "Doe",
            "mail": "john.doe#stackoverflow.com"
          },
          "histories": [
            {
              "type": "auth",
              "key": "try",
              "value": "fail",
              "ip": "127.0.0.1",
              "useragent": "iPhoneX",
              "creation_date": "2019-02-19T16:49:00.396Z"
            },
            {
              "type": "auth",
              "key": "try",
              "value": "fail",
              "ip": "127.0.0.1",
              "useragent": "iPhoneX",
              "creation_date": "2019-02-19T16:50:00.396Z"
            }
          ]
        }
      }
    ]
  }
}

I have something that might help you (tested on ES 6.6.0). In an update script, ctx._source exposes the document's original JSON, so creation_date arrives as a plain String and has to be parsed before you can do any date math on it:
{
  "query": {
    "match_all": {}
  },
  "script": {
    "lang": "painless",
    "source": """
      // parse params.date to an Instant
      def paramDate = Instant.parse(params.date);
      // note: i >= 0 so the first history entry is included too
      for (int i = ctx._source.histories.length - 1; i >= 0; --i) {
        // parse the creation date to an Instant
        def creationDate = Instant.parse(ctx._source.histories[i].creation_date);
        // check the time difference between the two
        if (ChronoUnit.MINUTES.between(creationDate, paramDate) <= 15) {
          // subtract 15 minutes if the condition is satisfied
          ctx._source.histories[i].creation_date = creationDate.minusSeconds(900).toString();
        }
      }
    """,
    "params": {
      "date": "2019-02-19T16:45:00.000Z"
    }
  }
}
Note: I'm using triple quotes to make the query more readable, but feel free to inline it again as you see fit and remove the comments.
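For completeness, a minimal sketch of how this script could be submitted with the update by query API; the index name myindex and the user_id term are placeholders taken from the question, not verified values:
POST myindex/_update_by_query
{
  "query": { "term": { "user_id": "USER_ID" } },
  "script": {
    "lang": "painless",
    "source": "def paramDate = Instant.parse(params.date); for (int i = ctx._source.histories.length - 1; i >= 0; --i) { def creationDate = Instant.parse(ctx._source.histories[i].creation_date); if (ChronoUnit.MINUTES.between(creationDate, paramDate) <= 15) { ctx._source.histories[i].creation_date = creationDate.minusSeconds(900).toString(); } }",
    "params": { "date": "2019-02-19T16:45:00.000Z" }
  }
}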

Related

Aggregation on Latest Records Of same status in ElasticSearch

I have the following data in the ElasticSearch index some_index.
[
  {
    "_index": "some_index",
    "_source": {
      "cart": {
        "cart_id": 1,
        "cart_status": "new",
        "grandTotal": 12,
        "event": "some_event",
        "timestamp": "2022-12-01T00:00:00.000Z"
      }
    }
  },
  {
    "_index": "some_index",
    "_source": {
      "cart": {
        "cart_id": 1,
        "cart_status": "paid",
        "grandTotal": 12,
        "event": "some_event",
        "timestamp": "2022-12-02T00:00:00.000Z"
      }
    }
  },
  {
    "_index": "some_index",
    "_source": {
      "cart": {
        "cart_id": 2,
        "cart_status": "new",
        "grandTotal": 23,
        "event": "some_event",
        "timestamp": "2022-12-01T00:00:00.000Z"
      }
    }
  },
  {
    "_index": "some_index",
    "_source": {
      "cart": {
        "cart_id": 2,
        "cart_status": "paid",
        "grandTotal": 23,
        "event": "some_event",
        "timestamp": "2022-12-04T00:00:00.000Z"
      }
    }
  },
  {
    "_index": "some_index",
    "_source": {
      "cart": {
        "cart_id": 3,
        "cart_status": "new",
        "grandTotal": 17,
        "event": "some_event",
        "timestamp": "2022-12-01T00:00:00.000Z"
      }
    }
  },
  {
    "_index": "some_index",
    "_source": {
      "cart": {
        "cart_id": 3,
        "cart_status": "new",
        "grandTotal": 17,
        "event": "some_event",
        "timestamp": "2022-12-04T00:00:00.000Z"
      }
    }
  }
]
What I want to get is the sum of the grandTotals over the latest cart_status of each cart within a given time range.
With the example above, the result for timestamp >= 2022-12-01 00:00:00 and timestamp <= 2022-12-03 00:00:00 should be something like:
cart_status: new, sum grandTotal: 40, because within that time range the latest status "new" belongs to cart_id 3 and cart_id 2;
cart_status: paid, sum grandTotal: 12, because "paid" is the latest status only for cart_id 1.
What I tried is a sub-aggregation on top of top_hits, but ElasticSearch complains that "Aggregator [top_result] of type [top_hits] cannot accept sub-aggregations".
I also tried collapse to get the latest by status, but according to the docs there is no way to aggregate over the results of a collapse either.
Can someone please help me solve this? It seems like a common calculation, but it is not very trivial in ElasticSearch; in SQL it is quite easy with window functions.
I want to avoid persisting intermediate data into another index, because the query needs to be dynamic: users may ask for their calculations over any time range.
You can try the following way. Note, though, that for cart_status "new" the sum will be 52 rather than 40, because for the given time range it includes cart_id 1, which has a "new" status, along with carts 2 and 3.
Mappings:
PUT some_index
{
  "mappings": {
    "properties": {
      "timestamp": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||strict_date_optional_time||epoch_millis"
      },
      "cart_id": {
        "type": "keyword"
      },
      "cart_status": {
        "type": "keyword"
      },
      "grand_total": {
        "type": "long"
      },
      "event": {
        "type": "keyword"
      }
    }
  }
}
Bulk Insert:
POST _bulk
{ "index" : { "_index" : "some_index", "_id" : "1" } }
{ "cart_id" : "1" , "grand_total":12, "cart_status" : "new","timestamp":"2022-12-01T00:00:00.000Z", "event" : "some_event"}
{ "index" : { "_index" : "some_index", "_id" : "2" } }
{ "cart_id" : "1" , "grand_total":12, "cart_status" : "paid","timestamp":"2022-12-02T00:00:00.000Z", "event" : "some_event"}
{ "index" : { "_index" : "some_index", "_id" : "3" } }
{ "cart_id" : "2" , "grand_total":23, "cart_status" : "new","timestamp":"2022-12-01T00:00:00.000Z", "event" : "some_event"}
{ "index" : { "_index" : "some_index", "_id" : "4" } }
{ "cart_id" : "2" , "grand_total":23, "cart_status" : "paid","timestamp":"2022-12-04T00:00:00.000Z", "event" : "some_event"}
{ "index" : { "_index" : "some_index", "_id" : "5" } }
{ "cart_id" : "3" , "grand_total":17, "cart_status" : "new","timestamp":"2022-12-01T00:00:00.000Z", "event" : "some_event"}
{ "index" : { "_index" : "some_index", "_id" : "6" } }
{ "cart_id" : "3" , "grand_total":17, "cart_status" : "new","timestamp":"2022-12-04T00:00:00.000Z", "event" : "some_event"}
Query:
GET some_index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "timestamp": {
              "gte": "2022-12-01 00:00:00",
              "lte": "2022-12-03 00:00:00"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "card_status": {
      "terms": {
        "field": "cart_status"
      },
      "aggs": {
        "grandTotal": {
          "sum": {
            "field": "grand_total"
          }
        }
      }
    }
  }
}
Output:
{
  "took": 86,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "card_status": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "new",
          "doc_count": 3,
          "grandTotal": {
            "value": 52
          }
        },
        {
          "key": "paid",
          "doc_count": 1,
          "grandTotal": {
            "value": 12
          }
        }
      ]
    }
  }
}
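If you do need the latest status per cart (so that "new" sums to 40 in the example), one sketch of an alternative: bucket by cart_id, take the newest document per bucket with top_hits, and do the final group-by-status sums in application code, since top_hits cannot accept sub-aggregations. The size of 10000 on the terms aggregation is an assumption about the maximum number of carts:
GET some_index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "timestamp": {
              "gte": "2022-12-01 00:00:00",
              "lte": "2022-12-03 00:00:00"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "by_cart": {
      "terms": { "field": "cart_id", "size": 10000 },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "sort": [ { "timestamp": { "order": "desc" } } ],
            "_source": [ "cart_status", "grand_total" ]
          }
        }
      }
    }
  }
}
Each by_cart bucket then carries exactly one hit, the latest within the range, and summing grand_total per cart_status over those hits yields 40 for "new" and 12 for "paid".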

Weighted Average value over documents in elastic search

I need to calculate a weighted average value using Elasticsearch, and I can't change the structure of the documents. Assume there are two indexed documents. The first document:
const doc1 = {
  "id": "1",
  "userId": "2",
  "scores": [
    {
      "name": "score1",
      "value": 93.0
    },
    {
      "name": "score2",
      "value": 90.0
    },
    {
      "name": "score3",
      "value": 76.0
    }
  ],
  "metadata": {
    "weight": 130
  }
}
The second document:
const doc2 = {
  "id": "2",
  "userId": "2",
  "scores": [
    {
      "name": "score1",
      "value": 80.0
    },
    {
      "name": "score2",
      "value": 70.0
    },
    {
      "name": "score3",
      "value": 88.0
    }
  ],
  "metadata": {
    "weight": 50
  }
}
Calculations should be done with the following formula:
score1Avg = (doc1.scores['score1'].value * doc1.metadata.weight + doc2.scores['score1'].value * doc2.metadata.weight) / (doc1.metadata.weight + doc2.metadata.weight)
score2Avg = (doc1.scores['score2'].value * doc1.metadata.weight + doc2.scores['score2'].value * doc2.metadata.weight) / (doc1.metadata.weight + doc2.metadata.weight)
score3Avg = (doc1.scores['score3'].value * doc1.metadata.weight + doc2.scores['score3'].value * doc2.metadata.weight) / (doc1.metadata.weight + doc2.metadata.weight)
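Plugging the two sample documents into the formula, for example:
score1Avg = (93.0 * 130 + 80.0 * 50) / (130 + 50) = 16090 / 180 ≈ 89.39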
I tried something with the nested type for mapping scores, but then I can't access the parent document field metadata.weight. How should this be approached: should I use a nested mapping, or can this be done in some other way without it?
Edit: I ended up storing each scores element as a separate document. Instead of doc1, I now have the following documents.
{
  "id": "1",
  "userId": "2",
  "score": {
    "name": "score1",
    "value": 93.0
  },
  "metadata": {
    "weight": 130
  }
}
{
  "id": "1",
  "userId": "2",
  "score": {
    "name": "score2",
    "value": 90.0
  },
  "metadata": {
    "weight": 130
  }
}
{
  "id": "1",
  "userId": "2",
  "score": {
    "name": "score3",
    "value": 76.0
  },
  "metadata": {
    "weight": 130
  }
}
And the query is:
GET /scores/_search
{
  "size": 0,
  "aggs": {
    "group_by_score_and_user": {
      "composite": {
        "sources": [
          {
            "scoreName": {
              "terms": {
                "field": "score.name.keyword"
              }
            }
          },
          {
            "userId": {
              "terms": {
                "field": "userId.keyword"
              }
            }
          }
        ]
      },
      "aggs": {
        "avg": {
          "weighted_avg": {
            "value": { "field": "score.value" },
            "weight": { "field": "metadata.weight" }
          }
        }
      }
    }
  }
}
By the way, the query with the script approach takes about 120 ms on average against 5k documents, compared to roughly 35-40 ms for this one over 100k documents.
Edited to fit the requirement in the comment. As I said before, this is not an optimal solution at all: the use of scripts plus params._source (plus my subpar Java) will make this very slow or unusable with a lot of docs.
Still, I learned a lot.
Mapping:
{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "userId": {
        "type": "keyword"
      },
      "scores": {
        "properties": {
          "name": {
            "type": "keyword"
          },
          "value": {
            "type": "float"
          }
        }
      },
      "metadata": {
        "properties": {
          "weight": {
            "type": "float"
          }
        }
      }
    }
  }
}
Docs:
POST ron_test/_doc/1
{
  "id": "1",
  "userId": "2",
  "scores": [
    { "name": "score1", "value": 93.0 },
    { "name": "score2", "value": 90.0 },
    { "name": "score3", "value": 76.0 }
  ],
  "metadata": {
    "weight": 130
  }
}
POST ron_test/_doc/2
{
  "id": "2",
  "userId": "2",
  "scores": [
    { "name": "score1", "value": 80.0 },
    { "name": "score2", "value": 70.0 },
    { "name": "score3", "value": 88.0 }
  ],
  "metadata": {
    "weight": 50
  }
}
POST ron_test/_doc/3
{
  "id": "2",
  "userId": "2",
  "scores": [
    { "name": "score1", "value": 80.0 },
    { "name": "score2", "value": 70.0 },
    { "name": "score9", "value": 88.0 }
  ],
  "metadata": {
    "weight": 12
  }
}
POST ron_test/_doc/4
{
  "id": "2",
  "userId": "2",
  "scores": [
    { "name": "score9", "value": 50.0 }
  ],
  "metadata": {
    "weight": 17
  }
}
Query:
GET ron_test/_search
{
  "size": 0,
  "aggs": {
    "weigthed_avg": {
      "scripted_metric": {
        "init_script": """
          state.name_to_sum = new HashMap();
          state.name_to_weight = new HashMap();
        """,
        "map_script": """
          for (score in params._source['scores']) {
            def name = score['name'];
            def value = score['value'];
            def weight = doc['metadata.weight'].value;
            if (state.name_to_sum.containsKey(name)) {
              state.name_to_sum[name] += value * weight;
            } else {
              state.name_to_sum[name] = value * weight;
            }
            if (state.name_to_weight.containsKey(name)) {
              state.name_to_weight[name] += weight;
            } else {
              state.name_to_weight[name] = weight;
            }
          }
        """,
        "combine_script": "return [state.name_to_sum, state.name_to_weight]",
        "reduce_script": """
          def total_score_per_name = new HashMap();
          def total_weigth_per_name = new HashMap();
          for (state in states) {
            total_score_per_name = Stream.concat(total_score_per_name.entrySet().stream(), state[0].entrySet().stream())
              .collect(Collectors.groupingBy(Map.Entry::getKey,
                Collectors.summingDouble(Map.Entry::getValue)));
            total_weigth_per_name = Stream.concat(total_weigth_per_name.entrySet().stream(), state[1].entrySet().stream())
              .collect(Collectors.groupingBy(Map.Entry::getKey,
                Collectors.summingDouble(Map.Entry::getValue)));
          }
          def results = new HashMap();
          total_score_per_name.forEach((name, score) -> results[name] = score / total_weigth_per_name[name]);
          return results;
        """
      }
    }
  }
}
Results:
{
  "took" : 258,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "weigthed_avg" : {
      "value" : {
        "score9" : 65.72413793103448,
        "score2" : 83.54166666666667,
        "score3" : 79.33333333333333,
        "score1" : 88.80208333333333
      }
    }
  }
}
More info on scripted metric aggregations:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-scripted-metric-aggregation.html
By the way, the way I would choose to simplify this is to copy the metadata.weight value into every nested score element, as sketched below.
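A minimal sketch of that simplification (untested, and it assumes you can reindex into a new index, here called ron_test2): store the weight alongside each score, map scores as nested, and let a nested terms + weighted_avg aggregation do all the work:
PUT ron_test2
{
  "mappings": {
    "properties": {
      "userId": { "type": "keyword" },
      "scores": {
        "type": "nested",
        "properties": {
          "name": { "type": "keyword" },
          "value": { "type": "float" },
          "weight": { "type": "float" }
        }
      }
    }
  }
}
POST ron_test2/_doc/1
{
  "userId": "2",
  "scores": [
    { "name": "score1", "value": 93.0, "weight": 130 },
    { "name": "score2", "value": 90.0, "weight": 130 },
    { "name": "score3", "value": 76.0, "weight": 130 }
  ]
}
GET ron_test2/_search
{
  "size": 0,
  "aggs": {
    "scores": {
      "nested": { "path": "scores" },
      "aggs": {
        "by_name": {
          "terms": { "field": "scores.name" },
          "aggs": {
            "avg": {
              "weighted_avg": {
                "value": { "field": "scores.value" },
                "weight": { "field": "scores.weight" }
              }
            }
          }
        }
      }
    }
  }
}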

Nested Query in Elastic Search

I have a schema in Elasticsearch of this form:
{
  "index1" : {
    "mappings" : {
      "properties" : {
        "key1" : {
          "type" : "keyword"
        },
        "key2" : {
          "type" : "keyword"
        },
        "key3" : {
          "properties" : {
            "components" : {
              "type" : "nested",
              "properties" : {
                "sub1" : {
                  "type" : "keyword"
                },
                "sub2" : {
                  "type" : "keyword"
                },
                "sub3" : {
                  "type" : "keyword"
                }
              }
            }
          }
        }
      }
    }
  }
}
and then the data stored in Elasticsearch would be of the format:
{
  "_index" : "index1",
  "_type" : "_doc",
  "_id" : "1",
  "_score" : 1.0,
  "_source" : {
    "key1" : "val1",
    "key2" : "val2",
    "key3" : {
      "components" : [
        {
          "sub1" : "subval11",
          "sub3" : "subval13"
        },
        {
          "sub1" : "subval21",
          "sub2" : "subval22",
          "sub3" : "subval23"
        },
        {
          "sub1" : "subval31",
          "sub2" : "subval32",
          "sub3" : "subval33"
        }
      ]
    }
  }
}
As you can see, sub1, sub2 and sub3 might not be present in some of the objects under key3.
Now if I try to write a query to fetch the result based on key3.sub2 being subval22, using this query:
GET index1/_search
{
  "query": {
    "nested": {
      "path": "components",
      "query": {
        "bool": {
          "must": [
            {
              "match": { "key3.sub2": "subval22" }
            }
          ]
        }
      }
    }
  }
}
I always get this error:
{
  "error": {
    "root_cause": [
      {
        "type": "query_shard_exception",
        "reason": "failed to create query: {...}",
        "index_uuid": "1",
        "index": "index1"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "index1",
        "node": "1aK..",
        "reason": {
          "type": "query_shard_exception",
          "reason": "failed to create query: {...}",
          "index_uuid": "1",
          "index": "index1",
          "caused_by": {
            "type": "illegal_state_exception",
            "reason": "[nested] failed to find nested object under path [components]"
          }
        }
      }
    ]
  },
  "status": 400
}
I understand that since sub2 is not present in all the objects under components, this error is being thrown. I am looking for a way to search in such scenarios so that all the objects in the array are examined and, if a value matches, the document is returned.
Can someone help me get this working?
You made a mistake while defining your schema. The schema below works fine; note that I defined key3 itself as nested and changed the nested path to key3.
Index def:
{
  "mappings": {
    "properties": {
      "key1": {
        "type": "keyword"
      },
      "key2": {
        "type": "keyword"
      },
      "key3": {
        "type": "nested"
      }
    }
  }
}
Index your sample doc without any change:
{
  "key1": "val1",
  "key2": "val2",
  "key3": {
    "components": [ --> this was a diff
      {
        "sub1": "subval11",
        "sub3": "subval13"
      },
      {
        "sub1": "subval21",
        "sub2": "subval22",
        "sub3": "subval23"
      },
      {
        "sub1": "subval31",
        "sub2": "subval32",
        "sub3": "subval33"
      }
    ]
  }
}
Searching with your criteria:
{
  "query": {
    "nested": {
      "path": "key3", --> note this
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "key3.components.sub2": "subval22" --> note this
              }
            }
          ]
        }
      }
    }
  }
}
This brings back the proper search result:
"hits": [
  {
    "_index": "so_nested_61200509",
    "_type": "_doc",
    "_id": "2",
    "_score": 0.2876821,
    "_source": {
      "key1": "val1",
      "key2": "val2",
      "key3": {
        "components": [ --> note this
          {
            "sub1": "subval11",
            "sub3": "subval13"
          },
          {
            "sub1": "subval21",
            "sub2": "subval22",
            "sub3": "subval23"
          },
          {
            "sub1": "subval31",
            "sub2": "subval32",
            "sub3": "subval33"
          }
        ]
Edit: Based on the comment from the OP, I updated the sample doc, search query and result.

Elastic Search Won't Match For Arrays

I'm trying to search a document with the following structure:
{
  "_index": "XXX",
  "_type": "business",
  "_id": "1252809",
  "_score": 1,
  "_source": {
    "url": "http://Samuraijapanese.com",
    "raw_name": "Samurai Restaurant",
    "categories": [
      { "name": "Cafe" },
      { "name": "Cajun Restaurant" },
      { "name": "Candy Stores" }
    ],
    "location": {
      "lat": "32.9948649",
      "lon": "-117.2528171"
    },
    "address": "979 Lomas Santa Fe Dr",
    "zip": "92075",
    "phone": "8584810032",
    "short_name": "samurai-restaurant",
    "name": "Samurai Restaurant",
    "apt": "",
    "state": "CA",
    "stdhours": "",
    "city": "Solana Beach",
    "hours": "",
    "yelp": "",
    "twitter": "",
    "closed": 0
  }
}
Searching it for url, raw_name, address, etc. all works, but searching the categories returns nothing. I'm searching like so:
"query": {
"filtered" : {
"filter" : {
"geo_distance" : {
"location" : {
"lon" : "-117.15726",
"lat" : "32.71533"
},
"distance" : "5mi"
}
},
"query" : {
"multi_match" : {
"query" : "Cafe",
"fields" : [
"categories.name"
]
}
}
}
},
"sort": [
{
"_score" : {
"order" : "desc"
}
},
{
"_geo_distance": {
"location": {
"lat": 32.71533,
"lon": -117.15726
},
"order": "asc",
"sort_mode": "min"
}
}
],
"script_fields": {
"distance_from_origin": {
"script": "doc['location'].arcDistanceInKm(32.71533,-117.15726)"
}
},
"fields": ["_source"],
"from": 0,
"size": 10
}
If I switch out, for example, categories.name for address, and change the search term to Lomas, it returns the result.
Without seeing your type mapping I can't answer definitively, but I would guess you have mapped categories as nested. When querying sub-documents of type nested (as opposed to object), you have to use a nested query, as in the sketch below.
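For example, a minimal sketch of the inner query rewritten as a nested query, assuming categories is indeed mapped as nested; this would replace the multi_match block inside the filtered query above:
"query" : {
  "nested" : {
    "path" : "categories",
    "query" : {
      "match" : { "categories.name" : "Cafe" }
    }
  }
}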

Elasticsearch - MySQL Index Search Distance Search

I am trying to use Elasticsearch, indexed from a MySQL table, to find all addresses that are within x km of a particular point. I have indexed the table with the following:
{
  "type": "jdbc",
  "jdbc": {
    "strategy": "simple",
    "url": "jdbc:mysql://hostname/databasename",
    "user": "username",
    "password": "password",
    "sql": "SELECT name,address1,city,state,zip,lat as `location.lat`,lng as `location.lon` FROM addresses",
    "poll": "24h",
    "max_retries": 3,
    "max_retries_wait": "10s",
    "index" : "teststores",
    "type" : "providers"
  },
  "index": {
    "index": "addressindex",
    "autocommit": "true",
    "type": "mysql",
    "bulk_size": 100,
    "type_mapping": {
      "location_mapping" : {
        "properties" : {
          "pin" : {
            "type" : "geo_point"
          }
        }
      }
    }
  }
}
An example of the indexed data is the following:
"_index": "teststores",
"_type": "providers",
"_id": "Rue2Yxo7SSa_mi5-AzRycA",
"_score": 1,
"_source": {
"zip": "10003",
"name": "I Salon",
"state": "NY",
"address1": "150 East 14th Street",
"location":{
"lat": 40.7337,
"lon": -73.9881
},
"city": "New York"
}
I want to adjust the following query to use lat and lng for calculating the distance:
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "geo_distance" : {
          "distance" : "2km",
          "pin.location" : {
            "lat" : 40.686511,
            "lon" : -73.986574
          }
        }
      }
    }
  }
}
How can I adjust this to make the distance work and get all addresses within x kilometers?
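One likely issue, judging from the snippets above: the indexed documents carry their coordinates under location, while the filter targets pin.location and the type mapping only declares pin as a geo_point. A sketch of the same filter pointed at the field the data actually uses (assuming location is mapped as geo_point in the index) would be:
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "geo_distance": {
          "distance": "2km",
          "location": {
            "lat": 40.686511,
            "lon": -73.986574
          }
        }
      }
    }
  }
}
The distance value can then be raised from 2km to whatever x kilometers is required.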
