Elasticsearch: how do you remove an object property using Painless

Wondering if anybody knows why this update-by-query runs fine but nothing gets deleted, even though the task runs to completion and claims all records were updated? I can use the same syntax to delete entire person objects without any issues. Why does ES not delete object properties?
POST /16cf303e902f4445a560a8e9a5b9ea51/_update_by_query?conflicts=proceed&wait_for_completion=false
{
  "conflicts": "proceed",
  "query": {
    "exists": {
      "field": "person.hair_color"
    }
  },
  "script": {
    "lang": "painless",
    "source": "ctx._source.remove('person.hair_color');"
  }
}
Now, if I watch the task using the ID returned from the above call:
GET /_tasks/qLeuj8jqQgOPFGsEzL7u9Q:1776664
I get this (shortened version) claiming all documents were updated. However, all persons still have their hair color for some reason.
{
  "completed" : true,
  "task" : {
    "status" : {
      "updated" : 110345
    }
  }
}
Thanks for any guidance!

Figured this out finally! The _source is a map of the document's top-level fields, so remove('person.hair_color') looks for a literal key named "person.hair_color" rather than walking into the nested object. To delete a property of a nested object, you need to call remove() on the parent object instead:
POST /16cf303e902f4445a560a8e9a5b9ea51/_update_by_query?conflicts=proceed&wait_for_completion=false
{
  "conflicts": "proceed",
  "query": {
    "exists": {
      "field": "person.hair_color"
    }
  },
  "script": {
    "lang": "painless",
    "source": "ctx._source.person.remove('hair_color');"
  }
}
This works, in case anybody else hits the same issue!
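A small defensive variant (my own sketch, not part of the original answer): if some matching documents could lack the person object entirely, a null check keeps the script from throwing. Note that conflicts=proceed is already in the URL, so the body copy can be dropped:
POST /16cf303e902f4445a560a8e9a5b9ea51/_update_by_query?conflicts=proceed&wait_for_completion=false
{
  "query": {
    "exists": {
      "field": "person.hair_color"
    }
  },
  "script": {
    "lang": "painless",
    "source": "if (ctx._source.person != null) { ctx._source.person.remove('hair_color') }"
  }
}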

Related

Elasticsearch inline string replace seems to do nothing

We have some legacy fields in an Elasticsearch index which cause us some trouble, and we would like to perform a string replace over the whole index.
For instance, some old timestamps are stored in the format 2000-01-01T00:00:00.000+0100 but should be stored as 2000-01-01T00:00:00.000+01:00.
I tried to run the following query:
POST /my_index/_update_by_query
{
  "script": {
    "lang": "painless",
    "inline": "ctx._source.timestamp = ctx._source.timestamp.replace('+0100', '+01:00')"
  }
}
I ran the query within Kibana, but I always get a query timeout. I guess that is not necessarily bad considering the index is huge; however, I never see the fields updated.
Is there a way to see the status of such a query?
I also tried to create a search query for the update, but with no luck:
GET /my_index/_search
{
  "query": {
    "query_string": {
      "query": "*0100",
      "allow_leading_wildcard": true,
      "analyze_wildcard": true,
      "fields": ["timestamp"]
    }
  }
}
This unfortunately always returns an empty set; I'm not sure what might be wrong.
What would be the correct way to achieve such an update?
I would solve this with an ingest pipeline that you then use to update your whole index.
First, create the ingest pipeline below. It detects documents whose timestamp field ends with +0100 and rewrites the timestamp to use the correctly formatted timezone.
PUT _ingest/pipeline/fix-tz
{
  "processors": [
    {
      "dissect": {
        "if": "ctx.timestamp.endsWith('+0100')",
        "field": "timestamp",
        "pattern": "%{timestamp}+%{tz}"
      }
    },
    {
      "set": {
        "if": "ctx.tz != null",
        "field": "timestamp",
        "value": "{{timestamp}}+01:00"
      }
    },
    {
      "remove": {
        "if": "ctx.tz != null",
        "field": "tz"
      }
    }
  ]
}
Then, when the pipeline is created, you just have to update your index with it, like this:
POST my_index/_update_by_query?pipeline=fix-tz&wait_for_completion=false
Once this has run completely, your index should be properly updated.
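On the status question: since the update runs asynchronously when you append wait_for_completion=false (as in the command above), Elasticsearch returns a task ID immediately, and you can poll the Tasks API with it. The task ID below is a placeholder; use the one your call returns. The response includes counters such as total, updated, and batches:
GET _tasks/oTUltX4IQMOUUVeiohTt8A:12345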

Elasticsearch upsert based on query

Two years ago someone asked how to do upserts when you don't know a document's id. The (unaccepted) answer referenced the feature request that resulted in the _update_by_query API.
However, _update_by_query does not insert anything when there are no hits, so it is not really an upsert, just another way to do an update.
Is there a way to do an upsert without an _id yet? I know that my query will always return one or zero results. Or am I forced to do multiple requests (and maintain the uniqueness myself)?
This doesn't seem to be possible right now. _update provides an upsert attribute, but unfortunately this doesn't work with _update_by_query. The following just gives you an error along the lines of "Unknown key for a START_OBJECT in [upsert]":
POST website/doc/_update_by_query?conflicts=proceed
{
  "query": {
    "term": {
      "url": "http://foo.com"
    }
  },
  "script": {
    "inline": "ctx._source.views+=1",
    "lang": "painless"
  },
  "upsert": {
    "views": 1,
    "url": "http://foo.com"
  }
}
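One workaround (my own suggestion, not an official upsert-by-query feature): since the query field is unique and matches at most one document, you can use that field, or a hash of it, as the document _id. Then the plain _update API's upsert does the job in a single request. The id segment below is just an illustrative stand-in for the URL:
POST website/doc/foo.com/_update
{
  "script": {
    "inline": "ctx._source.views += 1",
    "lang": "painless"
  },
  "upsert": {
    "views": 1,
    "url": "http://foo.com"
  }
}
Uniqueness is then enforced by the _id itself, and each increment stays atomic per document.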
As an aside, _update_by_query does let you update documents without knowing their current values; for example, without knowing the in_stock value in any document, you can reduce its count by 1 everywhere:
POST products/_update_by_query
{
  "script": {
    "source": "ctx._source.in_stock--"
  },
  "query": {
    "match_all": {}
  }
}

Elasticsearch aggregate by field prefix

I have data entries of the form
{
  "id": "ABCxxx",
  // Other fields
}
Where ABC is a unique identifier that defines the "type" of this record. (For example a user would be USR1234..., an image would be IMG1234...)
I want to get a list of all the different types of records that I have in my ES index. So in essence I want to group by id, but only looking at the first three characters of the id.
This obviously doesn't work, because it buckets by the full id (so USR123 is different from USR456):
{
  "fields": ["id"],
  "aggs": {
    "group_by_id": {
      "terms": {
        "field": "id"
      }
    }
  }
}
How do I write this query?
You can use the Painless scripting language to accomplish this:
{
  "fields": ["id"],
  "aggs": {
    "group_by_id": {
      "terms": {
        "script": {
          "inline": "doc['id'].value.substring(0, 3)",
          "lang": "painless"
        }
      }
    }
  }
}
More info here. Note that doc['id'] returns doc values, so you need .value to get the string out before calling substring.
As paqash already suggested, the same can be achieved via a script, but I would suggest an alternative: store the "type" as a separate field in your schema.
For example:
USR1234 : {id:"USR1234", type:"USR"}
IMG1234 : {id:"IMG1234", type:"IMG"}
This avoids unnecessary complications in scripting and keeps your query interface clean.
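With that field in place, the query becomes a plain terms aggregation; here is a minimal sketch, assuming type is indexed as an exact-value field (a keyword, or a not_analyzed string in older versions):
{
  "size": 0,
  "aggs": {
    "group_by_type": {
      "terms": {
        "field": "type"
      }
    }
  }
}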

Sum-aggregation script for term frequencies without dynamic scripting

I'm trying to evaluate a web application for my master's thesis. For this I want to run a user study, where I prepare the data in Elastic Found and send my web application to the testers. As far as I know, Elastic Found does not allow dynamic scripting for security reasons. I'm trying to reformulate the following dynamic script query:
GET my_index/document/_search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "stadt": {
      "sum": {
        "script": "_index['textBody']['frankfurt'].tf()"
      }
    }
  }
}
This query sums up the term frequencies of the term frankfurt across the document field textBody.
To reformulate the query without dynamic scripting, I've taken a look at Groovy script files, but I still get parsing errors.
My approach was:
GET my_index/document/_search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "stadt": {
      "sum": {
        "script": {
          "script_id": "termFrequency",
          "lang": "groovy",
          "params": {
            "term": "frankfurt"
          }
        }
      }
    }
  }
}
and the file termFrequency.groovy in the scripts directory:
_index['textBody'][term].tf()
I get the following parsing error:
Parse Failure [Unexpected token START_OBJECT in [stadt].]
This is the correct syntax, assuming your file is inside the config/scripts directory.
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "stadt": {
      "sum": {
        "script_file": "termFrequency",
        "lang": "groovy",
        "params": {
          "term": "frankfurt"
        }
      }
    }
  },
  "size": 0
}
Also, term should be referenced as a variable rather than a string literal, so the script file should contain:
_index['textBody'][term].tf()
Hope this helps!
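For completeness, a sketch of the on-disk layout this assumes; in that era of Elasticsearch, file scripts were picked up automatically from the config/scripts directory and referenced by file name without the extension:
# <elasticsearch home>/config/scripts/termFrequency.groovy
_index['textBody'][term].tf()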

Aggregation over "LastUpdated" property or _timestamp

My Elasticsearch mapping looks roughly like this:
{
  "myIndex": {
    "mappings": {
      "myType": {
        "_timestamp": {
          "enabled": true,
          "store": true
        },
        "properties": {
          "LastUpdated": {
            "type": "date",
            "format": "dateOptionalTime"
          }
          /* lots of other properties */
        }
      }
    }
  }
}
So, _timestamp is enabled, and there's also a LastUpdated property on every document. LastUpdated can have a different value than _timestamp: sometimes documents get updated physically (e.g. updates to denormalized data), which updates _timestamp, but LastUpdated remains unchanged because the document hasn't actually been "updated" from a business perspective.
Also, there are many documents without a LastUpdated value (mostly old data).
What I'd like to do is run an aggregation which counts the number of documents per calendar day (kindly ignore the fact that the dates need to be midnight-aligned, please). For every document, use LastUpdated if it's there, otherwise use _timestamp.
Here's what I've tried:
{
  "aggregations": {
    "counts": {
      "terms": {
        "script": "doc.LastUpdated == empty ? doc._timestamp : doc.LastUpdated"
      }
    }
  }
}
The bucketization appears to work to some extent, but the keys in the result look weird:
buckets: [
  {
    key: org.elasticsearch.index.fielddata.ScriptDocValues$Longs#7ba1f463,
    doc_count: 300544
  },
  {
    key: org.elasticsearch.index.fielddata.ScriptDocValues$Longs#5a298acb,
    doc_count: 257222
  },
  {
    key: org.elasticsearch.index.fielddata.ScriptDocValues$Longs#6e451b5e,
    doc_count: 101117
  },
  ...
]
What's the proper way to run this aggregation and get meaningful keys (i.e. timestamps) in the result?
I've tested this and written a Groovy script for you:
POST index/type/_search
{
  "aggs": {
    "counts": {
      "terms": {
        "script": "ts=doc['_timestamp'].getValue();v=doc['LastUpdated'].getValue();rv=v?:ts;rv",
        "lang": "groovy"
      }
    }
  }
}
This returns the required result, with the keys as timestamps. Hope this helps!
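To get calendar-day buckets rather than one bucket per distinct millisecond, the same fallback logic can feed a date_histogram instead of a terms aggregation. This is my own sketch, assuming your Elasticsearch version's date_histogram accepts a script the way the terms aggregation does:
POST index/type/_search
{
  "size": 0,
  "aggs": {
    "counts_per_day": {
      "date_histogram": {
        "script": "ts=doc['_timestamp'].getValue();v=doc['LastUpdated'].getValue();v?:ts",
        "lang": "groovy",
        "interval": "day"
      }
    }
  }
}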
