How to copy some ElasticSearch data to a new index

Let's say I have movie data in my ElasticSearch and I created them like this:
curl -XPUT "http://192.168.0.2:9200/movies/movie/1" -d'
{
"title": "The Godfather",
"director": "Francis Ford Coppola",
"year": 1972
}'
And I have a bunch of movies from different years. I want to take all the movies from a particular year (so, 1972) and copy them to a new index called "70sMovies", but I couldn't see how to do that.

Since Elasticsearch 2.3 you can use the built-in _reindex API.
For example:
POST /_reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}
Or copy only a specific part by adding a filter/query:
POST /_reindex
{
  "source": {
    "index": "twitter",
    "query": {
      "term": {
        "user": "kimchy"
      }
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}
Read more: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
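Applied to the original question, a minimal sketch could look like this (note that index names must be lowercase, so "70smovies" rather than "70sMovies"):
POST /_reindex
{
  "source": {
    "index": "movies",
    "query": {
      "term": {
        "year": 1972
      }
    }
  },
  "dest": {
    "index": "70smovies"
  }
}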

The best approach would be to use the elasticsearch-dump tool: https://github.com/taskrabbit/elasticsearch-dump.
A real-world example I used:
elasticdump \
--input=http://localhost:9700/.kibana \
--output=http://localhost:9700/.kibana_read_only \
--type=mapping
elasticdump \
--input=http://localhost:9700/.kibana \
--output=http://localhost:9700/.kibana_read_only \
--type=data
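If you don't have the tool yet, it is distributed as an npm package, so (assuming Node.js/npm are installed):
npm install -g elasticdump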

Check out knapsack:
https://github.com/jprante/elasticsearch-knapsack
Once you have the plugin installed and working, you could export part of your index via query. For example:
curl -XPOST 'localhost:9200/test/test/_export' -d '{
  "query" : {
    "match" : {
      "myfield" : "myvalue"
    }
  },
  "fields" : [ "_parent", "_source" ]
}'
This will create a tarball with only your query results, which you can then import into another index.
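To pull the archive back into an index, knapsack also exposes an _import endpoint; the exact parameters (archive path, renaming the target index) depend on the plugin version, so treat this as a sketch and check the plugin README:
curl -XPOST 'localhost:9200/test/test/_import'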

To reindex a specific type from a source index into a destination index type, the syntax is:
POST _reindex/
{
  "source": {
    "index": "source_index",
    "type": "source_type",
    "query": {
      // add filter criteria
    }
  },
  "dest": {
    "index": "dest_index",
    "type": "dest_type"
  }
}

If the intent is to copy some portion of the data, or all of it, to an index with the same settings/mappings as the original index, one could use the clone API to achieve this. Something like below:
POST /<index>/_clone/<target-index>
OR
PUT /<index>/_clone/<target-index>
However, if the intent is to copy the data to a new index with different settings/mappings than the original index, one could use the reindex API. Something like below:
POST _reindex/
{
  "source": {
    "index": "source_index",
    "type": "source_type",
    "query": {
      // add filter criteria
    }
  },
  "dest": {
    "index": "dest_index",
    "type": "dest_type"
  }
}
*Note: In the case of the reindex API, the target index should be created with the desired settings/mappings before the actual API call; otherwise it will be auto-created with dynamic mappings.
For further reading on the difference between clone and reindex, refer to What's the difference between cloning and reindexing an index in Elasticsearch?
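As an illustration of that note, on recent (typeless) versions you might create the destination index with the desired settings and mappings first; the field names below are just placeholders:
PUT /dest_index
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "year": { "type": "integer" }
    }
  }
}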

You can do it easily with elasticsearch-dump (https://github.com/taskrabbit/elasticsearch-dump) in three steps. In the following example I copy the index "thor" to "thor2"
elasticdump --input=http://localhost:9200/thor --output=http://localhost:9200/thor2 --type=analyzer
elasticdump --input=http://localhost:9200/thor --output=http://localhost:9200/thor2 --type=mapping
elasticdump --input=http://localhost:9200/thor --output=http://localhost:9200/thor2 --type=data

Well, the straightforward way to do this is to write code, with the API of your choice, querying for "year": 1972 and then indexing that data into a new index. You would use the Search API or the Scan and Scroll API to get all the documents, and then either index them one by one or use the Bulk API:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-search.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-index_.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-bulk.html
Assuming you don't want to do this via code but are looking for a direct way of doing this, I suggest the Elasticsearch Snapshot and Restore. Basically you would take a snapshot of your existing index, restore it into a new index and then use the Delete command to delete all documents with a year other than 1972.
Snapshot And Restore
The snapshot and restore module allows to create snapshots of
individual indices or an entire cluster into a remote repository. At
the time of the initial release only shared file system repository was
supported, but now a range of backends are available via officially
supported repository plugins.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-snapshots.html
Delete By Query API
The delete by query API allows to delete documents from one or more
indices and one or more types based on a query. The query can either
be provided using a simple query string as a parameter, or using the
Query DSL defined within the request body.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-delete-by-query.html
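Put together, a rough sketch of that workflow could look like the following (the repository name and path are assumptions; the filesystem path must be whitelisted via path.repo in elasticsearch.yml, and the last step uses the modern _delete_by_query endpoint, which replaced the old delete-by-query API/plugin):
# 1. Register a filesystem snapshot repository
curl -XPUT "localhost:9200/_snapshot/my_backup" -d '{
  "type": "fs",
  "settings": { "location": "/mount/backups/my_backup" }
}'

# 2. Snapshot the movies index
curl -XPUT "localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true" -d '{
  "indices": "movies"
}'

# 3. Restore it under a new name
curl -XPOST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore" -d '{
  "indices": "movies",
  "rename_pattern": "movies",
  "rename_replacement": "70smovies"
}'

# 4. Delete everything that is not from 1972
curl -XPOST "localhost:9200/70smovies/_delete_by_query" -d '{
  "query": { "bool": { "must_not": { "term": { "year": 1972 } } } }
}'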

Since v7.4 the _clone API is available and can easily satisfy your need (see the docs for the relevant prerequisites and how to monitor the operation):
POST /<index>/_clone/<target-index>
Or:
PUT /<index>/_clone/<target-index>

You can use elasticdump --searchBody:
# Copy documents from movies to 70sMovies (filtering using query)
elasticdump \
--input=http://localhost:9200/movies \
--output=http://localhost:9200/70sMovies \
--type=data \
--searchBody="{\"query\":{\"term\":{\"username\": \"admin\"}}}" # <--- Your query here
more on elasticdump options here.

Related

Elasticsearch reindex API - Not able to reindex large number of documents

I'm using Elasticsearch's reindex API to migrate logs from an old cluster to a new version 7.9.2 cluster. Here is the command I'm using.
curl -X POST "new_host:9200/_reindex?pretty&refresh&wait_for_completion=true" -H 'Content-Type: application/json' -d'
{
"source": {
"remote": {
"host": "old_host:9200"
},
"index": "*",
"size": 10000,
"query": {
"match_all": {}
}
},
"conflicts": "proceed",
"dest": {
"index": "logstash"
}
}'
This gets only the last 10000 documents (one batch), and the request completes after that. However, I need to reindex more than a million documents. Is there a way to make the request run for all the matched documents? Can we set the number of batches in the request, or make the request issue batches until all documents are indexed?
One option I can think of is to send request recursively by modifying query on datetime. Is there a better way to do it? Can I get all the matched documents (1 million plus) in one request?
Remove the query and size params in order to get all the data. If you need to filter only the desired documents using a query, just remove the size to fetch all matched logs.
Using wait_for_completion=false as a query param will return the task id, and you will be able to monitor the reindex progress using GET /_tasks/<task_id>.
If you need or want to break the reindexing into several steps/chunks, consider using the slice feature.
BTW: Reindex one index after another instead of all at once using *, and consider using daily/monthly indices, as it becomes easier to resume the process on errors and to manage log retention compared to one whole index.
In order to improve the speed, you should reduce the replicas to 0 and set refresh_interval=-1 in the destination index before reindexing, and reset the values afterwards.
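A minimal sketch of that tuning step against the destination index from the example (reset the values once the reindex has finished):
# Before reindexing: disable replicas and refreshes on the destination index
curl -X PUT "new_host:9200/logstash/_settings" -H 'Content-Type: application/json' -d'
{
  "index": {
    "number_of_replicas": 0,
    "refresh_interval": "-1"
  }
}'

# After reindexing: restore the values (adjust the replica count to your setup)
curl -X PUT "new_host:9200/logstash/_settings" -H 'Content-Type: application/json' -d'
{
  "index": {
    "number_of_replicas": 1,
    "refresh_interval": "1s"
  }
}'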
curl -X POST "new_host:9200/_reindex?pretty&wait_for_completion=false" -H 'Content-Type: application/json' -d'
{
"source": {
"remote": {
"host": "old_host:9200"
},
"index": "index_name"
},
"conflicts": "proceed",
"dest": {
"index": "logstash"
}
}'
UPDATE based on comments:
While reindexing, there is at least one error that causes the reindexing to stop. The error is caused by at least one document (id=xiB9...) that has 'OK' as the value of the field 'fields.StatusCode', while the mapping in the destination index uses long as the data type, which causes the mentioned exception.
One solution is to change the source documents' StatusCode to 200, for example, but there will probably be more documents causing the very same error.
Another solution is to change the mapping in the destination index to the keyword type - that requires setting the mapping by hand before any data has been inserted, and possibly reindexing the data that is already present.

Elasticsearch 1.x add field copy of timestamp

I am working in ES 1.5.2. I have an index with documents, with stored timestamp values. I want to add a regular field to it, which will assume the value of the _timestamp field for that document. How can I do this? I could do
PUT twitter/_mapping/new_timestamp
{
  "properties": {
    "name": {
      "type": "float"
    }
  }
}
to create a regular field, but how can I copy over all the _timestamp values to it?
In ES 1.5.2, you can use the update by query plugin in order to reindex your documents and copy the _timestamp field to a regular field.
After installing the plugin with the following command:
bin/plugin -url http://oss.sonatype.org/content/repositories/releases/com/yakaz/elasticsearch/plugins/elasticsearch-action-updatebyquery/1.0.0/elasticsearch-action-updatebyquery-1.0.0.zip install elasticsearch-action-updatebyquery
And making sure that dynamic scripting is enabled in your elasticsearch.yml configuration file, you'll be able to run the following command
POST /twitter/_update_by_query
{
  "script": {
    "inline": "ctx._source.new_timestamp = ctx._timestamp"
  },
  "query": {
    "match_all": {}
  }
}
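To verify the copy afterwards, a quick existence query should return your documents (this assumes the _exists_ query_string syntax, which 1.x supports):
curl -XGET 'localhost:9200/twitter/_search?q=_exists_:new_timestamp&pretty'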

Best way to reindex multiple indices in ElasticSearch

I am using Elasticsearch 5.1.1 and have 500+ indices created with the default mapping provided by ES.
Now we have decided to use dynamic templates.
In order to apply this template/mapping to old indices I need to reindex all indices.
What is the best way to do it? Can we use Kibana for this? I couldn't find sufficient documentation to do so.
Example: Reindex from a daily index to a monthly index (August)
POST _reindex?slices=10&refresh
{
  "source": {
    "index": "myindex-2019.08.*"
  },
  "dest": {
    "index": "myindex-2019.08"
  }
}
Monitor the reindex task (wait until it is finished)
GET _tasks?detailed=true&actions=*reindex
Check if new index was created
GET _cat/indices/myindex-2019.08*?v&s=index
You can delete old indices
DELETE myindex-2019.08.*
Source:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
You can use the _reindex API which can also reindex multiple indices. It was specifically built for this.
Bash script to re-index all indices matching a pattern: https://gist.github.com/hartfordfive/e507bc47e17f4e03a89055918900e44d
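If you prefer not to depend on the gist, the same idea fits in a few lines of shell; a rough sketch (the host, the index pattern and the "-v2" suffix are assumptions you should adapt):
# Reindex every index matching the pattern into a "<index>-v2" copy
for index in $(curl -s 'localhost:9200/_cat/indices/logstash-*?h=index'); do
  curl -s -X POST 'localhost:9200/_reindex?wait_for_completion=true' \
       -H 'Content-Type: application/json' \
       -d "{\"source\":{\"index\":\"$index\"},\"dest\":{\"index\":\"$index-v2\"}}"
  echo " reindexed $index"
done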
If you want to filter on some field and reindex only the matching documents from an index, you can use this:
POST _reindex
{
  "source": {
    "index": "auditbeat",
    "query": {
      "match": {
        "agent.version": "7.6.0"
      }
    }
  },
  "dest": {
    "index": "auditbeat-7.6.0"
  }
}

Create new Elasticsearch index from query?

SQL has the "INSERT INTO ... SELECT" statement to fill a table with data from a query. Does anything like this exist for Elasticsearch?
This would save me from having to mass-delete data from an existing index using a query - which is something the official Elasticsearch 2.1 guide warns against:
Don’t use delete-by-query to clean out all or most documents in an index. Rather create a new index and perhaps reindex the documents you want to keep.
(Source: https://www.elastic.co/guide/en/elasticsearch/plugins/current/plugins-delete-by-query.html).
You can use the excellent utility from taskrabbit called elasticdump.
There are many options to customize the import process. In your case, I would use the searchBody option and go with something like this:
elasticdump \
--input=http://HOST:9200/source_index \
--output=http://HOST:9200/target_index \
--bulk=true \
--searchBody='{"query": { "match_all": {} } }'
You can customize the query, and only the matched documents from the source_index will be copied over to the target_index.
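For completeness: since Elasticsearch 2.3 the built-in _reindex API is the closest native equivalent of INSERT INTO ... SELECT, and it accepts the same kind of query (index names below are placeholders):
POST _reindex
{
  "source": {
    "index": "source_index",
    "query": { "match_all": {} }
  },
  "dest": {
    "index": "target_index"
  }
}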
Take a look at the create index API:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html
PUT /test
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "field1": { "type": "text" }
    }
  }
}

ElasticSearch - Reindexing your data with zero downtime

https://www.elastic.co/blog/changing-mapping-with-zero-downtime/
I'm trying to create a new index and reindex my data with zero downtime following this guide.
Now I have an index called "photoshooter" and I follow the steps
1) Create new index "photoshooter_v1" with the new mapping... (Done)
2) Create alias...
curl -XPOST localhost:9200/_aliases -d '
{
  "actions": [
    { "add": {
      "alias": "photoshooter",
      "index": "photoshooter_v1"
    }}
  ]
}'
and I get this error...
{
  "error": "InvalidAliasNameException[[photoshooter_v1] Invalid alias name [photoshooter], an index exists with the same name as the alias]",
  "status": 400
}
I think I'm missing something in the logic...
Let's say your current index is named "photoshooter". First, create an alias for this index:
{
  "actions": [
    { "add": {
      "alias": "photoshooter_docs",
      "index": "photoshooter"
    }}
  ]
}
test it - curl -XGET 'localhost:9200/photoshooter_docs/_search'
Note - from now on you will use 'photoshooter_docs' as the index name to interact with your index, which is actually 'photoshooter'.
Now create a new index with your new mapping, let's say we name it 'photoshooter_v2', and copy your 'photoshooter' index data to the new index (photoshooter_v2), as sketched below.
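On Elasticsearch 2.3 and later, that copy step can be done with the built-in _reindex API; a minimal sketch:
POST _reindex
{
  "source": {
    "index": "photoshooter"
  },
  "dest": {
    "index": "photoshooter_v2"
  }
}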
Once you have copied all your data, simply switch the alias from the previous index to the new one:
curl -XPOST localhost:9200/_aliases -d '
{
  "actions": [
    { "remove": {
      "alias": "photoshooter_docs",
      "index": "photoshooter"
    }},
    { "add": {
      "alias": "photoshooter_docs",
      "index": "photoshooter_v2"
    }}
  ]
}'
test it again -> curl -XGET 'localhost:9200/photoshooter_docs/_search'
Congrats, you have changed your mapping with zero downtime.
To copy the data you can also use tools like this:
https://github.com/mallocator/Elasticsearch-Exporter
Note - this tool also copies the mapping from the old index to the new index, which you might not want; read its documentation or adapt it to your use case.
Hope this helps.
It's very simple: you cannot create an alias with the same name as an index that already exists.
You'll need to pick a new name for the new index, reindex the data into the new one, and then remove the old one to be able to give it the same name.
If you want to do that on a daily basis, you might consider adding, say, the date to your index's name and switching to it every day.
