Retrieve data after deleting mapping in Elastic Search - elasticsearch

I am fairly new to Elasticsearch. Just this weekend I started trying out stuff in it, and while I think it's a pretty neat way to store documents, I came across the following problem. I was fooling around a bit with the mappings (without actually knowing at the time what they were and what they were for), and I accidentally deleted the mapping of my index, along with all the data it stored, by performing:
DELETE tst_environment/object/_mapping
{
  "properties" : {
    "title" : { "type": "string" }
  }
}
Is there any way to retrieve the lost data or am I, well .. fucked? Any information regarding the issue is more than welcome :)

Unless you have taken a snapshot of the index, it is not possible to retrieve the data once you have deleted the mapping.
You would have to reindex the data from the original source.
FWIW, the upcoming v2.0 of Elasticsearch no longer allows deleting mappings.
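For future protection, a minimal sketch of registering a filesystem snapshot repository and taking a snapshot (the repository name and location are illustrative; on recent versions the location must also be whitelisted via path.repo in the node configuration):
PUT /_snapshot/my_fs_backup
{
  "type": "fs",
  "settings": { "location": "/mount/backups/my_fs_backup" }
}

PUT /_snapshot/my_fs_backup/snapshot_1?wait_for_completion=true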

Related

elasticsearch querying only the document I want and saving it as a snapshot

I want to find a way to back up (snapshot) and restore only the documents (data) that I want out of Elasticsearch.
I looked at the Elasticsearch reference, but it only describes how to back up an entire index; I couldn't find a way to back up just the documents (data) matching a query.
Is there a way to back up only the desired data using mysql?
The code below backs up the entire index by storing a basic snapshot.
How can I modify something here?
PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "location": "my_backup_location"
  }
}
You can try creating another index using your query, and then snapshot this new index.
You can use the reindex API to create this new index, as sketched below.
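For instance, a rough sketch (the index names and the query are illustrative) that reindexes only the matching documents into a new index and then snapshots just that index into the my_backup repository from the question:
POST _reindex
{
  "source": {
    "index": "my_index",
    "query": { "match": { "status": "published" } }
  },
  "dest": { "index": "my_filtered_index" }
}

PUT /_snapshot/my_backup/snapshot_1?wait_for_completion=true
{
  "indices": "my_filtered_index"
}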

ElasticSearch : Concurrent updates to index while _reindex for the same index in progress

We have been using this link as a reference to accommodate any change in the mappings for a field in our index with zero downtime.
Question:
Considering the same example from the link above, when we reindex data from my_index_v1 to my_index_v2 using the _reindex API, does Elasticsearch guarantee that any concurrent updates happening in my_index_v1 will make it to my_index_v2?
For example, a document might get updated in my_index_v1 before or after it is copied over to my_index_v2 by the API.
Ultimately, we want to ensure that while we avoid any downtime for mapping changes (hence the _reindex with aliases and other cool ES features), none of the adds/updates are missed while this huge reindex is in progress, as we are talking about reindexing more than 50 GB of data.
Thanks,
Sandeep
The reindex API will not pick up changes made after the process has started.
One thing you can do is, once the first reindex is done, run the process again with version_type: external.
This will copy from the source index only those documents that are missing from the destination or have a different (newer) version there.
Here is an example:
POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  }
}
Setting version_type to external will cause Elasticsearch to preserve the version from the source, create any documents that are missing, and update any documents that have an older version in the destination index than they do in the source index.
One way to solve this is by using two aliases instead of one. One for queries (let’s call it read_alias), and one for indexing (write_alias). We can write our code so that all indexing happens through the write_alias and all queries go through the read_alias. Let's consider three periods of time:
Before rebuild
read_alias: points to current_index
write_alias: points to current_index
All queries return current data.
All modifications go into current_index.
During rebuild
read_alias: points to current_index
write_alias: points to new_index
All queries keep getting data as it existed before the rebuild, since searching code uses read_alias.
All rows, including modified ones, get indexed into the new_index, since both the rebuilding loop and the DB trigger use the write_alias.
After rebuild
read_alias: points to new_index
write_alias: points to new_index
All queries return new data, including the modifications made during rebuild.
All modifications go into new_index.
It should even be possible to get the modified data from queries while rebuilding, if we make the DB trigger code index modified rows into both the indices while the rebuild is going on (i.e., while the aliases point to different indices).
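As an illustration, a minimal sketch of the initial alias setup before the rebuild (the host and index name are placeholders):
curl -XPOST 'http://localhost:9200/_aliases' -d '
{
  "actions": [
    { "add": { "index": "current_index", "alias": "read_alias" } },
    { "add": { "index": "current_index", "alias": "write_alias" } }
  ]
}'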
It is often better to rebuild the index from source data using custom code instead of relying on the _reindex API, since that way we can add new fields that may not have been stored in the old index.
This article has some more details.
It looks like _reindex works off a snapshot of the source index taken when the operation starts.
That would suggest it cannot reasonably honor changes to the source happening in the middle of the process. You avoid downtime on the search side, but I think you would need to pause updates on the indexing side during this process.
Something you could do is keep track, in your index, of when each document was last modified. Then, once you finish indexing and switch the alias, you query the old index for what changed in the middle. Propagate those changes over to the new index and you get eventual consistency.
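For example, a rough sketch of that catch-up step, assuming each document carries a last_modified timestamp and you noted when the big reindex started (the field name and timestamp are illustrative):
POST _reindex
{
  "source": {
    "index": "my_index_v1",
    "query": {
      "range": { "last_modified": { "gte": "2018-01-01T00:00:00Z" } }
    }
  },
  "dest": {
    "index": "my_index_v2",
    "version_type": "external"
  }
}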

Elastic Search head plugin - Delete all records in an index

Can anyone please give me pointers on how to delete all records of an index using the head plugin of Elasticsearch?
What we usually do is form the query like
http://ElasticSearchServerURL/entities/entityName/uniqueIdentifierOfRecord
and then select DELETE from the GET/PUT/POST/DELETE drop down.
Now I want to delete all records of a particular entityName.
I tried referring to Elastic search delete operation and https://www.elastic.co/guide/en/elasticsearch/guide/current/_deleting_an_index.html but these do not resolve my issue, since this is not how we do it in the Head plug-in.
I also tried to find documentation but all I could find was some curl queries, which I again do not know how to use.
Any pointers will be of great help.
I hope you know how to use curl or the ES Marvel Sense console to perform REST actions.
The command below will delete all the data and the mapping of the index:
DELETE /<index-name>/
If you would like to delete all data, including the metadata created by ES:
DELETE /_all
If you would like to delete only the data, but keep the mapping and other metadata of the index:
DELETE /<index-name>/_query
{
  "query": {
    "match_all" : {}
  }
}
One way I often use, which is quite fast, is the delete-by-query API.
In the location field of the head plugin you can enter the following:
`/entities/entityName/_query?q=*`
and then select the DELETE HTTP method from the drop down and click on the "Request" button. Voilà.
The only downside of this method is that it has been deprecated since ES 1.5.3 and will be removed in ES 2.0, but until then it's still there for you to use occasionally for your development needs.
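If you prefer curl over the head plugin, the equivalent request would look like this (the same deprecation caveat applies; the server URL and entity names come from the question):
curl -XDELETE 'http://ElasticSearchServerURL/entities/entityName/_query?q=*'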

What is the best way to index Couchbase data on Elastic Search

I work with Couchbase DB and I want to index part of its data on Elastic Search (ES).
The data from Couchbase should be synced, i.e. if the document on CB changes, it should change the document on ES.
I have several questions about what is the best way to do it:
What is the best way to sync the data? I saw that there is a CB plugin for ES (http://www.couchbase.com/couchbase-server/connectors/elasticsearch), but is that the recommended way?
I don't want to store the whole CB document in ES, but only part of it, i.e. some of the fields I want to store and some not. How can I do that?
My documents may have different attributes and the differences may be big (e.g. 50 different attributes/fields). Assuming I want to index all these attributes in ES, will it affect performance to have that many fields indexed?
10x,
Given the doc link, I am assuming you are using Couchbase and not CouchDB.
You are following the correct link for use of Elastic Search with Couchbase. Per the documentation, configure the Cross Data Center Replication (XDCR) capabilities of Couchbase to push data to ES automatically as mutations occur.
Without a defined mapping file, ES will create a default mapping. You can provide your own mapping file (or alter the one it generates) to control which fields get indexed. Refer to the enabled property in the ES documentation at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-object-type.html.
Yes, indexing all fields will affect performance. You can find some performance management tips for the Couchbase integration at http://docs.couchbase.com/couchbase-elastic-search/#managing-performance. The preferred approach to the integration is perform the search in ES and only get keys back for the matched documents. You then make a multiget call against the Couchbase cluster to retrieve the document details themselves. So while ES will index many fields, you do not store all fields there nor do you retrieve their values from ES. The in-memory multiget against Couchbase is the fastest way to retrieve the matching documents, using the IDs from ES.
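As a rough sketch of that pattern (the index, field, and query value are illustrative), a search that returns only the matching document IDs, which you then feed into a Couchbase multiget:
curl -XGET 'http://localhost:9200/my_index/couchbaseDocument/_search' -d '
{
  "_source": false,
  "query": { "match": { "doc.title": "coffee" } }
}'
The IDs come back in hits.hits[]._id; the document bodies are then fetched from Couchbase using those keys.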
A lot of questions! Let me answer them one by one:
1) The best and readily available solution is to use the river plugin to sync the data dynamically. It also indexes only the changed documents, which helps a lot with performance.
2) Yes, you can restrict which fields are indexed in the river plugin.
The plugin documentation is available on the Couchbase website itself.
Refer: http://docs.couchbase.com/couchbase-elastic-search/
The GitHub river is still in development, but you can use the code and modify it to your needs:
https://github.com/mschoch/elasticsearch-river-couchbase
3) If you index all the fields, yes, there will be some lag in performance, so it is better to index only the fields you need. If you only need a field for storage, mark it as not_analyzed in the mapping; that will reduce both indexing and search time.
Hope it helps!
You might find this additional explanation regarding Don Stacy's answer to question 2 useful:
When replicating from Couchbase, there are 3 ways in which you can interfere with Elasticsearch's default mapping (before you start XDCR) and thus, as desired, not store certain fields by setting "store" = false:
Create manual mappings on your index
Create a dynamic template
Edit couchbase_template.json
Hints:
Note that when we do XDCR from Couchbase to Elasticsearch, Couchbase wraps the original document in a "doc" field. This means that you have to take this modified structure into account when you create your mapping. It would look something like this:
curl -XPUT 'http://localhost:9200/test/couchbaseDocument/_mapping' -d '
{
  "couchbaseDocument": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "doc": {
        "properties": {
          "your_field_name": {
            "store": true,
            ...
          },
          ...
        }
      }
    }
  }
}'
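For the dynamic template option mentioned above, a rough sketch (the index name, template name, and field pattern are illustrative); it keeps any field matching the pattern under the "doc" wrapper from being stored:
curl -XPUT 'http://localhost:9200/test' -d '
{
  "mappings": {
    "couchbaseDocument": {
      "dynamic_templates": [
        {
          "dont_store_internal": {
            "path_match": "doc.internal_*",
            "mapping": { "store": false }
          }
        }
      ]
    }
  }
}'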
Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html
Including/Excluding fields from _source: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html
Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/2.0/dynamic-templates.html
https://forums.couchbase.com/t/about-elasticsearch-plugin/2433
https://forums.couchbase.com/t/custom-maps-for-jsontypes-with-elasticsearch-plugin/395

Is there a smarter way to reindex elasticsearch?

I ask because our search is in a state of flux as we work things out, but each time we make a change to the index (change tokenizer or filter, or number of shards/replicas), we have to blow away the entire index and re-index all our Rails models back into Elasticsearch ... this means we have to factor in downtime to re-index all our records.
Is there a smarter way to do this that I'm not aware of?
I think @karmi got it right. However, let me explain it a bit more simply. I occasionally needed to upgrade a production schema with some new properties or analysis settings.
I recently started to use the scenario described below to do live, constant load, zero-downtime index migrations. You can do that remotely.
Here are the steps:
Assumptions:
You have index real1 and aliases real_write, real_read pointing to it,
the client writes only to real_write and reads only from real_read,
the _source property of the documents is available.
1. New index
Create real2 index with new mapping and settings of your choice.
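For example, a minimal sketch of creating the new index (the type name, field, analyzer, and settings are illustrative):
curl -XPUT 'http://esserver:9200/real2' -d '
{
  "settings": { "number_of_shards": 5, "number_of_replicas": 1 },
  "mappings": {
    "article": {
      "properties": {
        "title": { "type": "string", "analyzer": "english" }
      }
    }
  }
}'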
2. Writer alias switch
Using the following request, switch the write alias:
curl -XPOST 'http://esserver:9200/_aliases' -d '
{
  "actions" : [
    { "remove" : { "index" : "real1", "alias" : "real_write" } },
    { "add" : { "index" : "real2", "alias" : "real_write" } }
  ]
}'
This is an atomic operation. From this point on, real2 is populated with new client data on all nodes. Readers still use the old real1 via real_read. This is eventual consistency.
3. Old data migration
Data must be migrated from real1 to real2; however, the new documents in real2 must not be overwritten with old entries. The migration script should use the bulk API with the create operation (not index or update). I use the simple Ruby script es-reindex, which has a nice ETA status:
$ ruby es-reindex.rb http://esserver:9200/real1 http://esserver:9200/real2
UPDATE 2017: You may consider the new Reindex API instead of using the script. It has a lot of interesting features, like conflict reporting, etc.
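A rough equivalent with the Reindex API might look like this: op_type create only creates documents that are missing from real2, and conflicts proceed reports version conflicts instead of aborting:
POST _reindex
{
  "conflicts": "proceed",
  "source": { "index": "real1" },
  "dest": {
    "index": "real2",
    "op_type": "create"
  }
}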
4. Reader alias switch
Now real2 is up to date and clients are writing to it; however, they are still reading from real1. Let's update the reader alias:
curl -XPOST 'http://esserver:9200/_aliases' -d '
{
  "actions" : [
    { "remove" : { "index" : "real1", "alias" : "real_read" } },
    { "add" : { "index" : "real2", "alias" : "real_read" } }
  ]
}'
5. Backup and delete old index
Writes and reads go to real2. You can backup and delete real1 index from ES cluster.
Done!
Yes, there are smarter ways to re-index your data without downtime.
First, never, ever use the "final" index name as your real index name. So, if you'd like to name your index "articles", don't use that name as a physical index, but create an index such as "articles-2012-12-12" or "articles-A", "articles-1", etc.
Second, create an alias (e.g. "articles") pointing to that index. Your application will then use this alias, so you'll never need to manually change the index name, restart the application, etc.
Third, when you want or need to re-index the data, re-index it into a different index, let's say "articles-B" -- all the tools in Tire's indexing toolchain support you here.
When you're done, point the alias to the new index. In this way, you not only minimize downtime (there isn't any), you also have a safe snapshot: if you somehow mess up the indexing into the new index, you can just switch back to the old one, until you resolve the issue.
Wrote up a blog post about how I handled reindexing with no downtime recently. Takes some time to figure out all the little things that need to be in place to do so. Hope this helps!
https://summera.github.io/infrastructure/2016/07/04/reindexing-elasticsearch.html
To summarize:
Step 1: Prepare New Index
Create your new index with your new mapping. This can be on the same instance of Elasticsearch or on a brand new instance.
Step 2: Keep Indexes Up To Date
While you're reindexing you want to keep both your new and old indexes up to date. For a write operation, this can be done by sending the write operation to a background worker on both the new and old index.
Deletes are a bit trickier because there is a race condition between deleting and reindexing the record into the new index. So, you'll want to keep track of the records that need to be deleted during your reindex and process these when you are finished. If you aren't performing many deletes, another way would be to eliminate the possibility of a delete during your reindex.
Step 3: Perform Reindexing
You’ll want to use a scrolled search for reading the data and bulk API for inserting. Since after Step 2 you'll be writing new and updated documents to the new index in the background, you want to make sure you do NOT update existing documents in the new index with your bulk API requests.
This means that the operation you want for your bulk API requests is create, not index. From the documentation: “create will fail if a document with the same index and type exists already, whereas index will add or replace a document as necessary”. The main point here is you do not want old data from the scrolled search snapshot to overwrite new data in the new index.
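As an illustration (the index, type, ID, and field are placeholders), a bulk request using the create action, which will be rejected for any document that already exists in the new index:
curl -XPOST 'http://localhost:9200/_bulk' -d '
{ "create": { "_index": "new_index", "_type": "article", "_id": "1" } }
{ "title": "Document copied from the scroll snapshot" }
'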
There's a great script on github to help you with this process: es-reindex.
Step 4: Switch Over
Once you’re finished reindexing, it’s time to switch your search over to the new index. You’ll want to turn deletes back on or process the enqueued delete jobs for the new index. You may notice that searching the new index is a bit slow at first. This is because Elasticsearch and the JVM need time to warm up.
Perform any code changes you need so your application starts searching the new index. You can continue writing to the old index in case you run into problems and need to roll back. If you feel this is unnecessary, you can stop writing to it.
Step 5: Clean Up
At this point you should be completely transitioned to the new index. If everything is going well, perform any necessary cleanup such as:
Delete the old index host if it’s different from the new
Remove serialization code related to your old index
Maybe create another index, reindex all the data onto that one, and then make the switch when it's done re-indexing?
