ElasticSearch reindex nested field as new documents

I am currently changing my ElasticSearch schema.
I previously had one type Product in my index with a nested field Product.users.
I now want to have two different indices, one for Product and another one for User, and make the link between both in code.
I use the reindex API to reindex all my Product documents to the new index, removing the Product.users field with a script:
ctx._source.remove('users');
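A minimal sketch of that reindex call, assuming the Python client and hypothetical index names "product_v1" / "product_v2":
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Copy every Product document to the new index, dropping the nested users field on the way.
es.reindex(body={
    "source": {"index": "product_v1"},
    "dest": {"index": "product_v2"},
    "script": {"source": "ctx._source.remove('users')"}
})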
But I don't know how to reindex all my Product.users entries to the new User index, because in the script I get an ArrayList of users and I want to create one User document for each.
Does anyone know how to achieve that?

For those who may face this situation, I finally ended up reindexing the users nested field using both the scroll and bulk APIs:
I used the scroll API to get batches of Product documents
For each batch, iterate over those Product documents
For each document, iterate over Product.users
Create a new User document and add it to a bulk request
Send the bulk request when I finish iterating over the Product batch
Does the job <3
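A minimal sketch of that scroll + bulk loop, assuming the Python client and hypothetical index names ("products" for the old index, "user" for the new one):
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def user_docs():
    # Scroll over every Product document in batches.
    for product in helpers.scan(es, index="products", query={"query": {"match_all": {}}}):
        # Emit one User document per entry of the nested users field.
        for user in product["_source"].get("users", []):
            yield {"_index": "user", "_source": user}

# Send the generated User documents as bulk requests.
helpers.bulk(es, user_docs())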

What you need is called ETL (Extract, Transform, Load).
Most of the time it is handier to write a small Python script that does exactly what you want, but with Elasticsearch there is one combination I love: Apache Spark + the elasticsearch-hadoop plugin.
Also, sometimes Logstash can do the trick, but with Spark you get:
SQL syntax, plus support for Java/Scala/Python code
very fast reads/writes to Elasticsearch thanks to distributed workers (1 ES shard = 1 Spark worker)
fault tolerance (a worker crashes? no problem)
clustering (ideal if you have billions of documents)
Use it with Apache Zeppelin (a notebook with Spark packaged and ready), and you will love it!
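A rough PySpark sketch of that kind of ETL job, assuming the elasticsearch-hadoop connector is on the classpath and using hypothetical index names:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("es-etl").getOrCreate()

# Read the products index as a DataFrame through the es-hadoop connector;
# es.read.field.as.array.include tells it the nested users field is an array.
products = (spark.read.format("org.elasticsearch.spark.sql")
            .option("es.nodes", "localhost")
            .option("es.read.field.as.array.include", "users")
            .load("products"))

# One row per nested user, flattened into top-level columns.
users = products.select(explode("users").alias("user")).select("user.*")

# Write the users back to Elasticsearch as their own index.
(users.write.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost")
    .mode("append")
    .save("user"))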

The simplest solution I can think of is to run the reindex command twice: once selecting the Product fields and reindexing into the new_Products index, and once for the users:
POST _reindex
{
  "source": {
    "index": "Product",
    "type": "_doc",
    "_source": ["fields", "to keep in", "new Products"],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "new_Products"
  }
}
Then you should be able to run the reindex again into the new_User index by selecting only Product.users in the second reindex.
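A rough sketch of that second reindex, assuming the Python client; only the users field is carried over into the new_User index:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.reindex(body={
    "source": {
        "index": "Product",
        "_source": ["users"],   # keep only the nested users field
        "query": {"match_all": {}}
    },
    "dest": {"index": "new_User"}
})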

Related

Is it possible to partition an ElasticSearch index?

I have a large amount of source code which changes frequently on disk. The source code is organized (and probably best managed) in chunks of "projects". I would like to maintain a current index of the source code so that it can be searched. Historical versions of the documents are not required.
To avoid infinitely growing indexes from the delete/add process, I would like to manage the index in chunks (partitions?). The ingestion process would drop the chunk corresponding to a project before re-indexing the project. A brief absence of the data during re-indexing is tolerable.
When I execute a query, I need to hit all of the chunks. Management of the indexes is my primary concern -- performance less so.
I can imagine that there could be two ways this might work:
partition an index. Drop a partition, then rebuild it.
a meta-index. Each project would be created as an individual index, but some sort of a "meta" construct would allow all of them to be queried in a single operation.
From what I have read, this does not seem to be a candidate for rollover indexes.
There are more than 1000 projects. Specifying a list of projects when the query is executed is not practical.
Is it possible to partition an index so that I can manage (drop and reindex) it in named chunks, while maintaining the ability to query it as a single unified index?
Yes, you can achieve this using aliases.
Let's say you have the "old" version of the project data in index "project-1" and that index also has an alias "project".
Then you index the "new" version of the project data in index "project-2". All the queries are done on the alias "project" instead of querying the index directly.
So when you're done reindexing the new version of the data, you can simply switch the alias from "project-1" to "project-2". No interruption of service for your queries.
That's it!
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "project-2",
        "alias": "project"
      }
    },
    {
      "remove": {
        "index": "project-1",
        "alias": "project"
      }
    }
  ]
}

How to re-index Elasticsearch without stale reads?

I have indices with heavy read/write operations.
My indices have a read and a write alias.
When I need to update the mapping in my indices, I go through this process:
create a new index with the new mapping,
add the write alias to the new index,
delete the write alias from the old index,
reindex the data like this:
POST _reindex?wait_for_completion=false
{
  "conflicts": "proceed",
  "source": {
    "index": "old-index"
  },
  "dest": {
    "op_type": "create",
    "index": "new-index"
  }
}
While reindexing, the read alias points to the old index while the write alias points to the new index.
When the reindexing is complete, I create a read alias on the new index and delete the read alias on the old index.
This process works fine, but there is one caveat: while reindexing, the data is stale for the reading applications, i.e. updates cannot be read until I have switched the read alias to the new index.
Since I have quite large indices, the re-indexing takes many hours.
Is there any way to handle re-indexing without reading stale data?
I would of course like to write to two indices at the same time while re-indexing, but as I understand it, that's not possible.
The only workaround I can think of is to handle it on the client side, so that all writes go to both indices in two separate requests during re-indexing.
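A rough sketch of that client-side dual write, assuming the Python client and hypothetical index names "items-old" / "items-new":
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def write_document(doc_id, doc):
    # Write to the old index, which is still serving reads...
    es.index(index="items-old", id=doc_id, body=doc)
    # ...and to the new index, so it is already up to date when reads switch over.
    es.index(index="items-new", id=doc_id, body=doc)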
Any ideas or comments are much appreciated 🙏

Something "Materialized view"-like in ElasticSearch

I have a query which runs every time a website is loaded. This query aggregates over three different term fields and around 3 million documents, and therefore needs 6-7 seconds to complete. The data does not change that frequently and the currentness of the result is not critical.
I know that I can use an alias to create something "view"-like as in the RDBMS world. Is it also possible to populate it, so the query result gets cached? Is there any other way caching might help in this scenario, or do I have to create an additional index for the aggregated data and update it from time to time?
I know that the post is old, but regarding views, Elastic added data frames in 7.3.0.
You could also use the _reindex API:
POST /_reindex
{
  "source": {
    "index": "live_index"
  },
  "dest": {
    "index": "caching_index"
  }
}
But it will not change your ingestion problem.
About this, I think the solution is sharding for your index.
With 2 or more shards, and several nodes, Elasticsearch will be able to parallelize.
But an easier thing to test is to disable the refresh_interval when indexing and to re-enable it afterwards. It generally improves ingestion time a lot.
You can find a full article on this use case at
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html
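For example, a small sketch of toggling refresh_interval around a heavy indexing run, assuming the Python client and the hypothetical index name "caching_index":
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Disable refreshes while bulk indexing.
es.indices.put_settings(index="caching_index", body={"index": {"refresh_interval": "-1"}})

# ... run the heavy indexing / _reindex here ...

# Restore a regular refresh interval afterwards.
es.indices.put_settings(index="caching_index", body={"index": {"refresh_interval": "1s"}})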
You can create a materialized view. It's essentially a table that holds the data of the aggregated functions. Since you have already inserted the aggregated data, querying it will be faster. I feel there is no need to cache as well. I have created MVs myself and they improve performance tremendously. Having said that, you can also go with Elasticsearch, where you can cache the aggregated queries if your data is not changing frequently. I feel MVs and Elasticsearch give about the same performance.

How to reindex AWS Elasticsearch?

My Ruby/Sinatra app connects to an AWS ES cluster using the elasticsearch-ruby gem to index text documents that authorised users (authorised by indexing with their user ID) can search through. Now, I want to copy a document from one index to another to make a document query-able by a different, authorised user. I tried the _reindex endpoint as documented in this file, only to get the following error:
Elasticsearch::Transport::Transport::Errors::Unauthorized - [401] {"Message":"Your request: '/_reindex' is not allowed."}:
Googling around, I stumbled across an Amazon docs page that lists all supported operations on both their APIs, and for some twisted reason _reindex isn't there yet. Why is that? More importantly,
how do I get around this efficiently and achieve what I want to do?
You should double-check the Elasticsearch version deployed by AWS ES. The _reindex API became available in version 2.2, I believe. You can check the version number by GETting the ES root IP and port with curl, for example, and checking version.number.
To work around not having the _reindex endpoint, I would recommend you implement it yourself. This isn't too bad. You can use a scroll to iterate through all the documents you want to reindex. If it is the entire index, you can use a match_all query with the scroll. You can then manipulate the documents as you wish, or simply use the bulk API to post (i.e. reindex) the documents to the new index.
Make sure to have created the new index with the mapping template you want ahead of time.
The procedure above is best for reindexing lots of documents; if you just want to move one or a few documents (which it sounds like you do), grab the document from its existing index by ID and submit it to your second index.
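A minimal sketch of copying a single document between indices, assuming the Python client (the elasticsearch-ruby gem offers equivalent calls) and hypothetical index names:
from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-aws-es-endpoint:443")

# Fetch the document by ID from the source index...
doc = es.get(index="user-a-index", id="doc-1")

# ...and index its _source into the destination index under the same ID.
es.index(index="user-b-index", id="doc-1", body=doc["_source"])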
AWS Elasticsearch now supports remote reindex, check this documentation:
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/remote-reindex.html
Example below:
POST <local-domain-endpoint>/_reindex
{
  "source": {
    "remote": {
      "host": "https://remote-domain-endpoint:443"
    },
    "index": "remote_index"
  },
  "dest": {
    "index": "local_index"
  }
}

Elasticsearch Reindex

There's an index that I want to apply updated mappings to. I have done my best to follow the documentation on ES and Stack Overflow, but I am now stuck.
The original index: logstash-index-YYYY.MM with data in it
I created index: logstash-index-new-YYYY.MM (which has a template for the new mapping)
Using the following query:
/logstash-index-YYYY.MM/_search?search_type=scan&scroll=1m
{
  "query": {
    "match_all": {}
  },
  "size": 30000
}
I get a _scroll_id back, and since I have fewer than 30k docs I should only need to run it once.
How do I use that id to push the data into the new index?
You don't use the scroll_id to push the data into the new index. You use it to get the next portion of data from the scroll query.
When you run a scan query, the first pass doesn't return any results; it scans through the shards in your cluster and returns a scroll_id. Another pass (using the scroll_id from the first one) returns the actual results.
If you want to put that data into the new index, you should write a simple program in the language of your choice that fetches this data and then puts it into your new index.
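A rough sketch of such a program, assuming the Python client; it mirrors the scan/scroll flow described above (recent ES versions dropped search_type=scan, so a plain scrolled search is used):
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# Open the scroll with the initial search.
resp = es.search(index="logstash-index-YYYY.MM", scroll="1m",
                 body={"query": {"match_all": {}}, "size": 1000})

while resp["hits"]["hits"]:
    # Bulk-index the current batch into the new index.
    actions = [{"_index": "logstash-index-new-YYYY.MM", "_source": hit["_source"]}
               for hit in resp["hits"]["hits"]]
    helpers.bulk(es, actions)
    # Fetch the next batch using the scroll_id.
    resp = es.scroll(scroll_id=resp["_scroll_id"], scroll="1m")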
There is a very good article on the Elasticsearch blog about how to change the mappings of your indices on the fly. Unfortunately, reindexing itself is not covered there.
