Elasticsearch Reindex

There's an index that I want to apply updated mappings to. I have done my best to follow the Elasticsearch documentation and Stack Overflow, but I am now stuck.
The original index: logstash-index-YYYY.MM with data in it
I created index: logstash-index-new-YYYY.MM (which has a template for the new mapping)
Using the following query:
/logstash-index-YYYY.MM/_search?search_type=scan&scroll=1m
{
"query": {
"match_all": {}
},
"size": 30000
}
I get a _scroll_id, and since I have fewer than 30k docs I should only need to run it once.
How do I use that id to push the data into the new index?

You don't use the scroll id to push data into the new index. You use it to fetch the next portion of data from the scroll query.
When you run a scan query, the first pass doesn't return any results; it scans through the shards in your cluster and returns a scroll id. The next pass (using the scroll id from the first one) returns the actual results.
If you want to put that data into a new index, you should write a simple program in the language of your choice that reads this data and indexes it into your new index.
There is a very good article on the Elasticsearch blog about changing the mappings of your indices on the fly. Unfortunately, reindexing itself is not covered there.
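As a rough illustration of the "simple program" idea, here is a minimal Python sketch of the copy step only. It shows how the hits from one scroll page could be turned into a body for the `_bulk` endpoint; fetching the scroll pages and sending the body are left to whatever HTTP client you use. The function name `hits_to_bulk_body` and the sample hits are made up for this example.

```python
import json


def hits_to_bulk_body(hits, dest_index):
    """Turn the hits of one scroll page into an NDJSON body for POST /_bulk.

    Assumption: each hit has the usual "_id" and "_source" keys of a search
    response. On older clusters that use mapping types you would also carry
    "_type" over into the action line.
    """
    lines = []
    for hit in hits:
        # Action line: index the document into the new index under the same id.
        lines.append(json.dumps({"index": {"_index": dest_index, "_id": hit["_id"]}}))
        # Source line: the document itself, unchanged.
        lines.append(json.dumps(hit["_source"]))
    # The bulk API expects a newline after every line, including the last one.
    return ("\n".join(lines) + "\n") if lines else ""


# Hypothetical sample page, shaped like response["hits"]["hits"].
page = [
    {"_id": "1", "_source": {"message": "first"}},
    {"_id": "2", "_source": {"message": "second"}},
]
body = hits_to_bulk_body(page, "logstash-index-new-YYYY.MM")
```

You would repeat this for every page returned by the scroll until a page comes back with no hits.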

Related

elasticsearch querying only the document I want and saving it as a snapshot

I want to find a way to back up (snapshot) and restore only the documents (data) that I want out of Elasticsearch.
I looked through the Elasticsearch reference pages, but I only found a way to back up an entire index, not a way to back up just the documents matched by a query.
Is there a way to back up only the desired data using mysql?
The code below backs up the entire index by storing a basic snapshot.
How can I modify something here?
PUT /_snapshot/my_backup
{
"type": "fs",
"settings": {
"location": "my_backup_location"
}
}
You can try creating another index that contains only the documents matched by your query, and then snapshot this new index.
You can use the reindex API to create this new index.
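A minimal sketch of what such a reindex request body might look like, built in Python. The index names and the `term` query are placeholders invented for this example; the resulting dictionary would be sent as the JSON body of `POST _reindex`.

```python
import json


def build_reindex_body(source_index, dest_index, query):
    """Build a _reindex body that copies only documents matching `query`."""
    return {
        "source": {"index": source_index, "query": query},
        "dest": {"index": dest_index},
    }


# Hypothetical index names and query; replace them with your own.
body = build_reindex_body(
    "my_index",
    "my_index_backup_subset",
    {"term": {"status": "archived"}},
)
print(json.dumps(body, indent=2))
```

Once the reindex has finished, the smaller index can be snapshotted on its own.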

ElasticSearch reindex nested field as new documents

I am currently changing my ElasticSearch schema.
I previously had one type Product in my index with a nested field Product.users.
I now want two separate indices, one for Product and another for User, and to link the two in code.
I use the reindex API to reindex all my Product documents into the new index, removing the Product.users field with a script:
ctx._source.remove('users');
But I don't know how to reindex all my Product.users documents into the new User index, since in the script I get an ArrayList of users and I want to create one User document for each.
Does anyone know how to achieve that?
For those who may face this situation: I finally ended up reindexing the users nested field using both the scroll and bulk APIs.
Use the scroll API to get batches of Product documents
For each batch, iterate over its Product documents
For each document, iterate over Product.users
Create a new User document for each user and add it to a bulk request
Send the bulk request once the whole Product batch has been iterated
That does the job <3
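The inner part of those steps could look roughly like the following Python sketch. It only covers turning one batch of Product hits into bulk lines; the scroll loop and the HTTP call are omitted. The `product_id` link field, the index name and the sample batch are assumptions made for this example, not part of the original setup.

```python
import json


def users_to_bulk_actions(product_hits, user_index):
    """Yield (action, source) pairs, one per nested user.

    Assumptions: each hit has "_id" and "_source", the nested users live
    under "users", and "product_id" is a hypothetical field linking a user
    back to its product.
    """
    for hit in product_hits:
        for user in hit["_source"].get("users", []):
            doc = dict(user)  # copy so the original hit is left untouched
            doc["product_id"] = hit["_id"]
            yield {"index": {"_index": user_index}}, doc


# Hypothetical batch, shaped like the hits of one scroll page.
batch = [
    {"_id": "p1", "_source": {"name": "Chair", "users": [{"name": "ann"}, {"name": "bob"}]}},
    {"_id": "p2", "_source": {"name": "Desk"}},
]
pairs = list(users_to_bulk_actions(batch, "user"))

# NDJSON body for POST /_bulk: one action line followed by one source line.
bulk_body = "".join(json.dumps(part) + "\n" for pair in pairs for part in pair)
```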
What you need is called ETL (Extract, Transform, Load).
Most of the time it is handiest to write a small Python script that does exactly what you want, but with Elasticsearch there is one tool I love: Apache Spark with the elasticsearch-hadoop connector.
Sometimes Logstash can do the trick too, but with Spark you get:
SQL syntax, or support for Java/Scala/Python code
very fast reads from and writes to Elasticsearch thanks to distributed workers (1 ES shard = 1 Spark worker)
fault tolerance (a worker crashes? no problem)
clustering (ideal if you have billions of documents)
Use it with Apache Zeppelin (a notebook with Spark packaged and ready) and you will love it!
The simplest solution I can think of is to run the reindex command twice: once selecting the Product fields and reindexing into the new_Products index, and once for the users:
POST _reindex
{
"source": {
"index": "Product",
"type": "_doc",
"_source": ["fields", "to keep in", "new Products"],
"query": {
"match_all": {}
}
},
"dest": {
"index": "new_Products"
}
}
You should then be able to run the reindex again into the new_User index, selecting only Product.users in the second reindex.

Searching through an alias with filter is very slow in Elasticsearch

I have an elasticsearch index, my_index, with millions of documents, with key my_uuid. On top of that index I have several filtered aliases of the following form (showing only my_alias as retrieved by GET my_index/_alias/my_alias):
{
"my_index": {
"aliases": {
"my_alias": {
"filter": {
"terms": {
"my_uuid": [
"0944581b-9bf2-49e1-9bd0-4313d2398cf6",
"b6327e90-86f6-42eb-8fde-772397b8e926",
thousands of rows...
]
}
}
}
}
}
}
My understanding is that the filter will be cached transparently for me, without having to do any configuration. The thing is I am experiencing very slow searches, when going through the alias, which suggests that 1. the filter is not cached, or 2. it is wrongly written.
Indicative numbers:
GET my_index/_search -> 50ms
GET my_alias/_search -> 8000ms
I can provide further information on the cluster scale, and size of data if anyone considers this relevant.
I am using elasticsearch 2.4.1. I am getting the right results, it is just the performance that concerns me.
Matching each document against a 4MB list of uuids is definitely not the way to go. Try to imagine how many CPU cycles it requires. 8s is actually quite fast.
I would duplicate the subset of data in another index.
If you need changes to be reflected immediately, you will have to manage the subset index by hand:
when you delete a uuid from the list, you delete the corresponding documents
when you add a uuid, you copy the corresponding documents (the reindex API with a query is your friend)
when you insert a document, you have to check whether it should be added to the subset index too
when you delete a document, delete it in both indices
Force the document ids so they are the same in both indices. Beware of the refresh interval if you store the uuid list in an Elasticsearch index.
If updating the subset with new uuids is not time critical, you can just run the reindex every day or every hour.
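A minimal Python sketch of the first two bullets, assuming the list of uuids is kept somewhere you can diff. It only builds the request bodies: one for `POST _reindex` to copy documents for newly added uuids, and one for a delete-by-query on the subset index for removed uuids (on 2.x, delete-by-query is provided by a plugin). The function name and the subset index name are invented for this example.

```python
def subset_sync_requests(old_uuids, new_uuids, source_index, subset_index):
    """Return (reindex_body, delete_body) for a change in the uuid list.

    Either element is None when there is nothing to do on that side.
    """
    added = sorted(set(new_uuids) - set(old_uuids))
    removed = sorted(set(old_uuids) - set(new_uuids))

    reindex_body = None
    if added:
        # Copy the documents of the newly added uuids into the subset index.
        reindex_body = {
            "source": {"index": source_index, "query": {"terms": {"my_uuid": added}}},
            "dest": {"index": subset_index},
        }

    delete_body = None
    if removed:
        # Run this against the subset index to drop documents of removed uuids.
        delete_body = {"query": {"terms": {"my_uuid": removed}}}

    return reindex_body, delete_body


# Hypothetical change: "b" is dropped from the list and "c" is added.
reindex_body, delete_body = subset_sync_requests(
    ["a", "b"], ["a", "c"], "my_index", "my_index_subset"
)
```

Reindex keeps the original document ids by default, which fits the advice about having the same ids in both indices.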

Elasticsearch remove "one level" from the mapping

I need to flatten my index mapping.
My index has the following mapping:
"A": {
"properties": {
"B": {
"properties": {
-c
-d
-e
}
}
}
}
What I need is to delete "one level" in order to have a mapping like this:
"A": {
"properties": {
-c
-d
-e
}
}
Is it possible to obtain this result without reindexing all my data?
Short answer: no.
Longer answer: also no. This question has been asked many times. The answer will always be no, and this is why:
You can only find that which is stored in your index. In order to make your data searchable, your database needs to know what type of data each field contains and how it should be indexed. If you switch a field type from e.g. a string to a date, all of the data for that field that you already have indexed becomes useless. One way or another, you need to reindex that field.
This applies not just to Elasticsearch, but to any database that uses indices for searching. And if it isn't using indices then it is sacrificing speed for flexibility.
Elasticsearch (and Lucene) stores its indices in immutable segments; each segment is a "mini" inverted index. These segments are never updated in place. Updating a document actually creates a new document and marks the old document as deleted. As you add more documents (or update existing documents), new segments are created. A merge process runs in the background merging several smaller segments into a new big segment, after which the old segments are removed entirely.
Typically, an index in Elasticsearch will contain documents of different types. Each _type has its own schema or mapping. A single segment may contain documents of any type. So, if you want to change the field definition for a single field in a single type, you have little option but to reindex all of the documents in your index.
If you are interested in more detail, you can read the rest of this excerpt by Clinton Gormley.
I also suggest the following readings:
Elasticsearch Zero Downtime Reindexing – Problems and Solutions
The SO question: Is there a smarter way to reindex elasticsearch?
You have to create a new index with the updated mapping (one level removed). You cannot update the existing mapping in place to achieve what you want.
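When reindexing into the new index, each document's source also has to be reshaped to match the flattened mapping. A minimal Python sketch of that transformation, assuming "A" is an object field whose child "B" should be lifted up one level (if "A" is actually a mapping type, the same idea applies one level higher). The function name is made up for this example.

```python
def remove_level(source, parent="A", child="B"):
    """Return a copy of `source` with parent.child's fields moved into parent."""
    if parent not in source:
        return dict(source)
    result = dict(source)
    parent_obj = dict(result[parent])  # copy so the input is not mutated
    inner = parent_obj.pop(child, {})
    parent_obj.update(inner)           # c, d, e now sit directly under A
    result[parent] = parent_obj
    return result


# Hypothetical document following the mapping in the question.
old_doc = {"A": {"B": {"c": 1, "d": 2, "e": 3}}}
new_doc = remove_level(old_doc)
```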

What is the best way to index Couchbase data on Elastic Search

I work with Couchbase DB and I want to index part of its data on Elastic Search (ES).
The data from Couchbase should be synced, i.e. if the document on CB changes, it should change the document on ES.
I have several questions about what is the best way to do it:
What is the best way to sync the data? I saw that there is a CB plugin for ES (http://www.couchbase.com/couchbase-server/connectors/elasticsearch), but is that the recommended way?
I don't want to store the whole CB document in ES, only part of it, e.g. some fields should be stored and others not. How can I do it?
My documents may have different attributes and the difference may be big (e.g. 50 different attributes/fields). Assuming I want to index all these attributes to ES, will it affect performance because I have a lot of fields indexed?
Thanks,
Given the doc link, I am assuming you are using Couchbase and not CouchDB.
You are following the correct link for use of Elastic Search with Couchbase. Per the documentation, configure the Cross Data Center Replication (XDCR) capabilities of Couchbase to push data to ES automatically as mutations occur.
Without a defined mapping file, ES will create a default mapping. You can provide your own mapping file (or alter the one it generates) to control which fields get indexed. Refer to the enabled property in the ES documentation at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-object-type.html.
Yes, indexing all fields will affect performance. You can find some performance management tips for the Couchbase integration at http://docs.couchbase.com/couchbase-elastic-search/#managing-performance. The preferred approach to the integration is to perform the search in ES and only get keys back for the matched documents. You then make a multiget call against the Couchbase cluster to retrieve the document details themselves. So while ES will index many fields, you do not store all fields there, nor do you retrieve their values from ES. The in-memory multiget against Couchbase is the fastest way to retrieve the matching documents, using the IDs from ES.
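The "search in ES, then multiget from Couchbase" pattern might be sketched in Python as follows. To keep the example independent of any particular client library, the Couchbase lookup is passed in as a callable; `fetch_documents`, `fake_get_multi` and the sample data are all stand-ins invented for this example, and it assumes the ES document ids are the Couchbase keys.

```python
def matched_ids(es_response):
    """Extract the document ids from an Elasticsearch search response."""
    return [hit["_id"] for hit in es_response["hits"]["hits"]]


def fetch_documents(es_response, get_multi):
    """Look up the full documents for the matched ids.

    `get_multi` is any callable taking a list of keys and returning a
    {key: document} mapping, e.g. a thin wrapper around your Couchbase client.
    """
    ids = matched_ids(es_response)
    return get_multi(ids) if ids else {}


# Stand-ins for illustration only: an in-memory "bucket" and its multiget.
fake_store = {"user::1": {"name": "ann"}, "user::2": {"name": "bob"}}


def fake_get_multi(keys):
    return {key: fake_store[key] for key in keys}


response = {"hits": {"hits": [{"_id": "user::2"}]}}
docs = fetch_documents(response, fake_get_multi)
```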
Lots of questions!
Let me answer them one by one:
1) The best, already available solution is to use the river plugin to sync the data dynamically. It also indexes only the changed documents, which helps performance a lot.
2) Yes, you can restrict which fields are indexed in the river plugin. The plugin's documentation is available on the Couchbase website itself.
Refer: http://docs.couchbase.com/couchbase-elastic-search/
The river on GitHub is still in development, but you can use the code and modify it to suit your needs.
https://github.com/mschoch/elasticsearch-river-couchbase
3) If you index all the fields, yes, there will be some performance cost, so it is better to index only the fields you need. If you need to keep a field only for storage, mark it as not analyzed in the mapping. That will reduce indexing time and also search time.
Hope it helps!
You might find this additional explanation regarding Don Stacy's answer to question 2 useful:
When replicating from Couchbase, there are 3 ways in which you can interfere with Elasticsearch's default mapping (before you start XDCR) and thus, as desired, not store certain fields by setting "store" = false:
Create manual mappings on your index
Create a dynamic template
Edit couchbase_template.json
Hints:
Note that when we do XDCR from Couchbase to Elasticsearch, Couchbase wraps the original document in a "doc" field. This means that you have to take this modified structure into account when you create your mapping. It would look something like this:
curl -XPUT 'http://localhost:9200/test/couchbaseDocument/_mapping' -d '
{
"couchbaseDocument": {
"_source": {
"enabled": false
},
"properties": {
"doc": {
"properties": {
"your_field_name": {
"store": true,
...
},
...
}
}
}
}
}'
Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html
Including/Excluding fields from _source: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html
Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/2.0/dynamic-templates.html
https://forums.couchbase.com/t/about-elasticsearch-plugin/2433
https://forums.couchbase.com/t/custom-maps-for-jsontypes-with-elasticsearch-plugin/395
