I'm using the reindex API to adapt data from an old format into a new format like so:
POST /_reindex
{
  "source": {
    "index": "old_index"
  },
  "dest": {
    "index": "new_index"
  },
  "script": {
    "source": """
      ArrayList convertField(def str) {
        // [complicated conversion]
        return reformatted_data;
      }
      ctx._source.specific_field = convertField(ctx._source.specific_field);
    """
  }
}
For the sake of a load test I would like to duplicate the data into the new index (it doesn't need to be exactly the same; some scripted alterations would be fine).
The problem is, every time I run the reindex, all data in the target index is deleted and replaced by the new batch. How do I keep the current data and add to it, instead of replacing it?
The easiest way is to set the _id field of the reindexed documents to null, using the script field. This will generate a new GUID for the reindexed document. In your case:
POST /_reindex
{
  "source": {
    "index": "old_index"
  },
  "dest": {
    "index": "new_index"
  },
  "script": {
    "source": """
      ArrayList convertField(def str) {
        // [complicated conversion]
        return reformatted_data;
      }
      ctx._source.specific_field = convertField(ctx._source.specific_field);
      ctx._id = null
    """
  }
}
The reason is that there can only be one document with a given _id.
For my test, all I did was edit each _id:
ctx._id = ctx._id + "_1";
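Putting that together with the conversion script from above, the full request could look like this sketch (the "_1" suffix is arbitrary; anything that makes the _id unique will do):
POST /_reindex
{
  "source": {
    "index": "old_index"
  },
  "dest": {
    "index": "new_index"
  },
  "script": {
    "source": """
      ArrayList convertField(def str) {
        // [complicated conversion]
        return reformatted_data;
      }
      ctx._source.specific_field = convertField(ctx._source.specific_field);
      ctx._id = ctx._id + "_1";
    """
  }
}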
I tried reindexing daily indices from a remote cluster, following the reindex-daily-indices example:
POST _reindex
{
  "source": {
    "remote": {
      "host": "http://remote_es:9200"
    },
    "index": "telemetry-*"
  },
  "dest": {
    "index": "dummy"
  },
  "script": {
    "lang": "painless",
    "source": """
      ctx._index = 'telemetry-' + (ctx._index.substring('telemetry-'.length(), ctx._index.length()));
    """
  }
}
It looks like if the new ctx._index is exactly the same as the original ctx._index, it uses dest.index instead. It reindexes all the records into the "dummy" index.
Is this a bug or intended behaviour? I could not find any explanation for this behaviour.
Is there a way to reindex (multiple indices) from remote and still preserve the original name?
It's because, with your logic, the destination index name ends up the same as the source index name. In the documentation you linked to, they append '-1' to the end of the index name.
In your case, the following line just sets the destination index name to the same value as the source index name; reindex doesn't allow that, so it uses the destination index name specified in dest.index:
ctx._index = 'telemetry-' + (ctx._index.substring('telemetry-'.length(), ctx._index.length()));
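Adapting the documented approach to your case would look something like the sketch below, appending '-1' so that the computed index name differs from the source index name and the script's value is actually used (the suffix is just an example):
POST _reindex
{
  "source": {
    "remote": {
      "host": "http://remote_es:9200"
    },
    "index": "telemetry-*"
  },
  "dest": {
    "index": "dummy"
  },
  "script": {
    "lang": "painless",
    "source": """
      ctx._index = 'telemetry-' + (ctx._index.substring('telemetry-'.length(), ctx._index.length())) + '-1';
    """
  }
}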
Also worth noting that this case has been reported here and here.
I have queries like the ones below. When the date_partition field is mapped as "type" => "float", queries for 20220109, 20220108 and 20220107 all return the document. When the field is "type" => "long", only the query for 20220109 returns it, which is what I want.
For each of the queries below, the result is returned as if the query had been for 20220109:
--> 20220109, 20220108, 20220107
PUT date
{
  "mappings": {
    "properties": {
      "date_partition_float": {
        "type": "float"
      },
      "date_partition_long": {
        "type": "long"
      }
    }
  }
}

POST date/_doc
{
  "date_partition_float": "20220109",
  "date_partition_long": "20220109"
}

# this one returns the document
GET date/_search
{
  "query": {
    "match": {
      "date_partition_float": "20220108"
    }
  }
}

# nothing returned
GET date/_search
{
  "query": {
    "match": {
      "date_partition_long": "20220108"
    }
  }
}
Is this a bug, or is this how the float type works?
I have 2 years of data loaded into Elasticsearch in daily indices (day-1, day-2, ...), with about 20 GB of primary shard size per day (15 TB total). What is the best way to change the type of just this field?
I have 5 fields of float type in my mapping; what is the fastest way to change all of them?
Note: I have the solutions below in mind, but I'm afraid they are slow:
update by query API
reindex API
runtime fields in the search request (especially this one)
Thank you!
That date_partition field should have the date type with format=yyyyMMdd; that's the only sensible type to use here, not long, and even less so float.
PUT date
{
  "mappings": {
    "properties": {
      "date_partition": {
        "type": "date",
        "format": "yyyyMMdd"
      }
    }
  }
}
It's not logical to query for 20220108 and get the 20220109 document back in the results. The reason it happens with float is precision: a 32-bit float can only represent integers exactly up to 2^24 (16,777,216), so values like 20220107, 20220108 and 20220109 all round to the same stored float value and therefore match the same document.
Using the date type would also allow you to use proper time-based range queries and create date_histogram aggregations on your data.
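For example, with the date mapping above, a time-based range query could look like this (the bounds are just illustrative):
GET date/_search
{
  "query": {
    "range": {
      "date_partition": {
        "gte": "20220101",
        "lte": "20220131"
      }
    }
  }
}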
You can either recreate the index with the adequate type and reindex your data, or add a new field to your existing index and update it by query. Both options are valid.
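A minimal sketch of the second option (add a new field and update it by query), assuming the existing _source stores the value as a string as in the sample document above; the new field name date_partition is just for illustration:
PUT date/_mapping
{
  "properties": {
    "date_partition": {
      "type": "date",
      "format": "yyyyMMdd"
    }
  }
}

POST date/_update_by_query
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.date_partition = ctx._source.date_partition_long.toString()"
  }
}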
This could be the answer to my question => https://discuss.elastic.co/t/elasticsearch-data-type-float-returns-incorrect-results/300335
I have a Lambda that receives events from Kinesis and writes the event to an Elasticsearch cluster:

doc id | FirstTimestamp
d1     | 15974343498
Now, when we receive another event, I want to update the document in Elasticsearch to:

doc id | FirstTimestamp | SecondTimestamp | TimeTag
d1     | 15974343498    | 15974344498     | 1000
How can I do this without having to first GET the existing doc from Elasticsearch and then do a PUT?
I found the update option here, which lets me add the SecondTimestamp field, but how can I add the TimeTag field? It requires an operation using the FirstTimestamp.
The GET operation won't be necessary.
Depending on how easily you can configure how your writes happen, you could do the following:
Store a script which expects the doc-to-be-updated content as params:
POST _scripts/manage_time_tags
{
  "script": {
    "lang": "painless",
    "source": """
      if (ctx._source.FirstTimestamp != null && params.FirstTimestamp != null) {
        ctx._source.SecondTimestamp = params.FirstTimestamp;
        ctx._source.TimeTag = ctx._source.SecondTimestamp - ctx._source.FirstTimestamp;
      }
    """
  }
}
Instead of directly writing to ES as you were up until now, use the upsert method of the Update API:
POST myindex/_update/1
{
  "upsert": {
    "id": 1,
    "FirstTimestamp": 15974343498
  },
  "script": {
    "id": "manage_time_tags",
    "params": {
      "id": 1,
      "FirstTimestamp": 15974343498
    }
  }
}
This will ensure that if the document does not exist yet, the contents of upsert are indexed as the new document and the script doesn't even run.
As new events come in, simply call /_update/your_id again but with the most recent contents of id and FirstTimestamp.
POST myindex/_update/1
{
  "upsert": {
    "id": 1,
    "FirstTimestamp": 15974344498
  },
  "script": {
    "id": "manage_time_tags",
    "params": {
      "id": 1,
      "FirstTimestamp": 15974344498
    }
  }
}
Note: this should not be confused with the rather poorly named scripted upsert, which will run the script regardless of whether the doc already exists or not. This option should be omitted (or set to false).
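If you want to be explicit about it, the same request with the flag spelled out could look like this (just the earlier call with scripted_upsert added):
POST myindex/_update/1
{
  "scripted_upsert": false,
  "upsert": {
    "id": 1,
    "FirstTimestamp": 15974344498
  },
  "script": {
    "id": "manage_time_tags",
    "params": {
      "id": 1,
      "FirstTimestamp": 15974344498
    }
  }
}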
I have index_A, which includes a number field "foo".
I copy the mapping for index_A, and make a dev tools call PUT /index_B with the field foo changed to text, so the mapping portion of that is:
"foo": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
I then reindex index_A to index_B with:
POST _reindex
{
  "source": {
    "index": "index_A"
  },
  "dest": {
    "index": "index_B"
  }
}
When I go to view any document for index_B, the entry for the "foo" field is still a number. (I was expecting for example: "foo": 30 to become "foo" : "30" in the new document's source).
As much as I've read on Mappings and reindexing, I'm still at a loss on how to accomplish this. What specifically do I need to run in order to get this new index with "foo" as a text field, and all number entries for foo in the original index changed to text entries in the new index?
There's a distinction between how a field is stored vs indexed in ES. What you see inside of _source is stored and it's the "original" document that you've ingested. But there's no explicit casting based on the mapping type -- ES stores what it receives but then proceeds to index it as defined in the mapping.
In order to verify how a field was indexed, you can inspect the script stack returned in:
GET index_b/_search
{
  "script_fields": {
    "debugging_foo": {
      "script": {
        "source": "Debug.explain(doc['foo'])"
      }
    }
  }
}
as opposed to how a field was stored:
GET index_b/_search
{
  "script_fields": {
    "debugging_foo": {
      "script": {
        "source": "Debug.explain(params._source['foo'])"
      }
    }
  }
}
So in other words, rest assured that foo was indeed indexed as text + keyword.
If you'd like to explicitly cast a field value into a different data type in the _source, you can apply a script along the lines of:
POST _reindex
{
  "source": {
    "index": "index_a"
  },
  "dest": {
    "index": "index_b"
  },
  "script": {
    "source": "ctx._source.foo = '' + ctx._source.foo"
  }
}
I'm not overly familiar with Java, but I think ... = ctx._source.foo.toString() would work too.
FYI there's a coerce mapping parameter which sounds like it could be of use here but it only works the other way around -- casting/parsing from strings to numerical types etc.
FYI#2 There's a pipeline processor called convert that does exactly what I did in the above script, and more. (A pipeline is a pre-processor that runs before the fields are indexed in ES.) The good thing about pipelines is that they can be run as part of the _reindex process too.
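For illustration, a sketch of that pipeline-based approach could look like the following; the pipeline name cast_foo_to_string is made up for this example:
PUT _ingest/pipeline/cast_foo_to_string
{
  "description": "Cast the numeric foo field to a string before indexing",
  "processors": [
    {
      "convert": {
        "field": "foo",
        "type": "string"
      }
    }
  ]
}

POST _reindex
{
  "source": {
    "index": "index_a"
  },
  "dest": {
    "index": "index_b",
    "pipeline": "cast_foo_to_string"
  }
}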
Let's say I have an Elasticsearch index with around 10M documents in it. Now I need to add a new field with a default value, e.g. is_hotel_type=0, for each and every ES document. Later I'll update them as per my requirements.
To do that I've modified myindex with a PUT request like below:
PUT myindex
{
  "mappings": {
    "rp": {
      "properties": {
        "is_hotel_type": {
          "type": "integer"
        }
      }
    }
  }
}
Then I run a Painless script with a POST to update all the existing documents with the value is_hotel_type=0:
POST myindex/_update_by_query
{
  "query": {
    "match_all": {}
  },
  "script": "ctx._source.is_hotel_type = 0;"
}
But this process is very time consuming for a large index with 10M documents. In SQL we can usually set default values when creating new columns. So my question:
Is there any way in Elasticsearch to add a new field with a default value? I've tried the PUT request below with null_value, but it doesn't work.
PUT myindex/_mapping/rp
{
  "properties": {
    "is_hotel_type": {
      "type": "integer",
      "null_value": 0
    }
  }
}
I just want to know: is there any other way to do that without the script query?