Updating all data in Elasticsearch

Is there any way to update all data in Elasticsearch?
In the example below, the update is applied only to the document with ID 1 in external.
curl -XPOST 'localhost:9200/customer/external/1/_update?pretty' -d '
{
"doc": { "name": "Jane Doe", "age": 20 }
}'
Similarly, I need to update all my data in external. Is there a way, or a query, to update all the data?

Updating every document in an index means that every document is marked as deleted and a new version of it is indexed, which leaves a lot of "marked-as-deleted" documents behind.
When you run a query, ES automatically filters out those "marked-as-deleted" documents, and that filtering has an impact on the response time of the query. How much impact depends on the data, the use case and the query.
Also, if you update all documents, then unless you run a _forcemerge there will be segments (especially the larger ones) that still contain "marked-as-deleted" documents, and those segments are hard for Lucene/Elasticsearch to merge automatically.
My suggestion, if your indexing process is not too complex (for example, getting the data from a relational database and processing it before indexing into ES), is to drop the index completely and index fresh data. It might be more effective than updating all the documents.
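If you still prefer to update in place rather than rebuild, the request could look roughly like the sketch below. It only builds the path and body for an _update_by_query call and does not send anything. The helper name is made up for illustration, and the "source" script key and Painless are assumptions that hold on reasonably recent Elasticsearch versions (older ones used "inline"). The cost described above still applies: every touched document leaves a "marked-as-deleted" copy behind.

```python
import json


def build_update_all_request(index, fields):
    """Return (path, body) for an _update_by_query call that sets `fields` on every document.

    Hypothetical helper: it only builds the request, it does not talk to a cluster.
    Field names are assumed to be plain identifiers without quotes.
    """
    # One Painless assignment per field; the values travel as params, not inline in the script.
    source = " ".join(
        f"ctx._source['{name}'] = params['{name}'];" for name in sorted(fields)
    )
    body = {
        "query": {"match_all": {}},
        "script": {"lang": "painless", "source": source, "params": dict(fields)},
    }
    return f"/{index}/_update_by_query", body


path, body = build_update_all_request("customer", {"name": "Jane Doe", "age": 20})
print("POST", path)
print(json.dumps(body, indent=2))
```

The printed body can be sent with curl -XPOST in the same way as the single-document update in the question.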

ElasticSearch : Concurrent updates to index while _reindex for the same index in progress

We have been using this link as a reference to accommodate any change in the mappings for a field in our index with zero downtime.
Question:
Considering the same example as in the link above: when we reindex the data from my_index_v1 to my_index_v2 using the _reindex API, does Elasticsearch guarantee that any concurrent updates happening in my_index_v1 will make it to my_index_v2?
For example, a document might get updated in my_index_v1 before or after it is reindexed by the API into my_index_v2.
Ultimately, we did not want any downtime for mapping changes (hence the _reindex using aliases and other cool stuff from ES), but we also want to ensure that no adds or updates are missed while this huge reindex is in progress, as we are talking about reindexing more than 50 GB of data.
Thanks,
Sandeep
The reindex API will not pick up changes made after the process has started.
One thing you can do, once the reindexing process is done, is to start it again with version_type set to external.
This copies from the source index to the destination index only those documents that are missing from the destination or that have a newer version in the source.
Here is the example:
POST _reindex
{
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter",
"version_type": "external"
}
}
Setting version_type to external will cause Elasticsearch to preserve the version from the source, create any documents that are missing, and update any documents that have an older version in the destination index than they do in the source index.
One way to solve this is by using two aliases instead of one. One for queries (let’s call it read_alias), and one for indexing (write_alias). We can write our code so that all indexing happens through the write_alias and all queries go through the read_alias. Let's consider three periods of time:
Before rebuild
read_alias: points to current_index
write_alias: points to current_index
All queries return current data.
All modifications go into current_index.
During rebuild
read_alias: points to current_index
write_alias: points to new_index
All queries keep getting data as it existed before the rebuild, since searching code uses read_alias.
All rows, including modified ones, get indexed into the new_index, since both the rebuilding loop and the DB trigger use the write_alias.
After rebuild
read_alias: points to new_index
write_alias: points to new_index
All queries return new data, including the modifications made during rebuild.
All modifications go into new_index.
It should even be possible to get the modified data from queries while rebuilding, if we make the DB trigger code index modified rows into both the indices while the rebuild is going on (i.e., while the aliases point to different indices).
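The two alias switches described above could be expressed as bodies for the _aliases API, roughly as in this sketch. The function name is hypothetical and nothing is sent to a cluster. The remove and add actions sit in one request so the alias never points at nothing.

```python
def alias_swap_actions(alias, old_index, new_index):
    """Body for POST /_aliases that moves `alias` from old_index to new_index in one call.

    Hypothetical helper for illustration; it only builds the JSON body.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Start of rebuild: writes go to the new index, reads stay on the old one.
start_rebuild = alias_swap_actions("write_alias", "current_index", "new_index")

# End of rebuild: reads follow the writes to the new index.
finish_rebuild = alias_swap_actions("read_alias", "current_index", "new_index")

print(start_rebuild)
print(finish_rebuild)
```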
It is often better to rebuild the index from source data using custom code instead of relying on the _reindex API, since that way we can add new fields that may not have been stored in the old index.
This article has some more details.
It looks like _reindex works from a snapshot of the source index.
That suggests to me that it couldn't reasonably honor changes to the source happening in the middle of the process. You avoid downtime on the search side, but I think you would need to pause updates on the indexing side during this process.
Something you could do is keep track in your index of when each document was last modified. Then, once you finish indexing and switch the alias, you query the old index for what changed in the meantime. Propagate those changes over to the new index and you get eventual consistency.
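A minimal sketch of that catch-up step, assuming every document carries a date field recording its last modification. The field name last_modified and the helper are my own invention, not something from the question. The body is meant for a second POST _reindex that copies only documents changed since the rebuild started.

```python
from datetime import datetime, timezone


def catch_up_reindex_body(old_index, new_index, rebuild_started_at, field="last_modified"):
    """Build a _reindex body that copies only documents modified since the rebuild began.

    `field` is an assumed date field that the application keeps up to date on every write.
    """
    return {
        "source": {
            "index": old_index,
            "query": {"range": {field: {"gte": rebuild_started_at.isoformat()}}},
        },
        "dest": {"index": new_index},
    }


started = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
body = catch_up_reindex_body("my_index_v1", "my_index_v2", started)
print(body)
```

Note that this only catches adds and updates. Documents deleted from the old index during the rebuild would need separate handling.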

Searching through an alias with filter is very slow in Elasticsearch

I have an elasticsearch index, my_index, with millions of documents, with key my_uuid. On top of that index I have several filtered aliases of the following form (showing only my_alias as retrieved by GET my_index/_alias/my_alias):
{
"my_index": {
"aliases": {
"my_alias": {
"filter": {
"terms": {
"my_uuid": [
"0944581b-9bf2-49e1-9bd0-4313d2398cf6",
"b6327e90-86f6-42eb-8fde-772397b8e926",
thousands of rows...
]
}
}
}
}
}
}
My understanding is that the filter will be cached transparently for me, without having to do any configuration. The thing is I am experiencing very slow searches, when going through the alias, which suggests that 1. the filter is not cached, or 2. it is wrongly written.
Indicative numbers:
GET my_index/_search -> 50ms
GET my_alias/_search -> 8000ms
I can provide further information on the cluster scale, and size of data if anyone considers this relevant.
I am using elasticsearch 2.4.1. I am getting the right results, it is just the performance that concerns me.
Matching each document against a 4 MB list of uuids is definitely not the way to go. Try to imagine how many CPU cycles it requires. 8s is quite fast.
I would duplicate the subset of data in another index.
If you need to reflect changes immediately, you will have to manage the subset index by hand:
when you delete a uuid from the list, you delete the corresponding documents
when you add a uuid, you copy the corresponding documents (the reindex API with a query is your friend)
when you insert a document, you have to check whether the document should be added to the subset index too
when you delete a document, delete it in both indices
Force the document IDs so that they are the same in both indices. Beware of the refresh time if you store the uuid list in an Elasticsearch index.
If updating the subset with new uuids is not time critical, you can just run the reindex every day or every hour.
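The "reindex API with a query" step could look something like this sketch, which builds the body for POST _reindex. The subset index name my_subset_index and the helper are assumptions for illustration. By default _reindex keeps each document's _id, which fits the advice above about identical IDs in both indices.

```python
def copy_uuid_docs_body(source_index, subset_index, uuids):
    """Build a _reindex body that copies the documents of the given uuids into the subset index.

    Hypothetical helper; it only builds the request body.
    """
    return {
        "source": {
            "index": source_index,
            "query": {"terms": {"my_uuid": list(uuids)}},
        },
        "dest": {"index": subset_index},
    }


# Copy the documents for one newly added uuid.
body = copy_uuid_docs_body(
    "my_index",
    "my_subset_index",
    ["0944581b-9bf2-49e1-9bd0-4313d2398cf6"],
)
print(body)
```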

Something "Materialized view"-like in ElasticSearch

I have a query which runs every time a website is loaded. This query aggregates over three different term fields and around 3 million documents, and therefore needs 6-7 seconds to complete. The data does not change that frequently and the freshness of the result is not critical.
I know that I can use an alias to create something view-like, as in the RDBMS world. Is it also possible to populate it, so the query result gets cached? Is there any other way caching might help in this scenario, or do I have to create an additional index for the aggregated data and update it from time to time?
I know that the post is old, but regarding views: Elastic added data frames in 7.3.0.
You could also use the _reindex api
POST /_reindex
{
"source": {
"index": "live_index"
},
"dest": {
"index": "caching_index"
}
}
But it will not change your ingestion problem.
For that, I think the solution is to shard your index.
With 2 or more shards and several nodes, Elasticsearch will be able to parallelize the work.
But an easier thing to test is to disable refresh (set refresh_interval to -1) while indexing and to re-enable it afterwards. It generally improves ingestion time a lot.
You can find a full article on this use case at
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html
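As a rough sketch, the two settings changes around a bulk load could be built like this. The bodies are meant for PUT /caching_index/_settings. The helper name is made up, and "1s" is used as the value to restore because it is the documented default, so substitute whatever your index used before.

```python
def refresh_settings(interval):
    """Body for PUT /<index>/_settings that changes the index refresh interval."""
    return {"index": {"refresh_interval": interval}}


# Before the bulk load: "-1" turns periodic refresh off.
disable_refresh = refresh_settings("-1")

# After the bulk load: restore the previous value (assumed here to be the 1s default).
restore_refresh = refresh_settings("1s")

print(disable_refresh)
print(restore_refresh)
```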
You could create a materialized view. It is ultimately a table that holds the results of the aggregate functions. Since the aggregated data has already been inserted, querying it is faster, and I feel there is no need to cache as well. I have created MVs myself, and they improve performance tremendously. Having said that, you can also go with Elasticsearch, where aggregated queries can be cached if your data is not changing frequently. I feel an MV and Elasticsearch give the same performance.

Elasticsearch remove "one level" from the mapping

I need to restructure my index mapping.
My index has the following mapping
"A": {
"properties": {
"B": {
"properties": {
-c
-d
-e
}
}
}
}
What I need is to delete "one level" in order to have a mapping like this:
"A": {
"properties": {
-c
-d
-e
}
}
Is it possible to obtain this result without reindexing all my data?
Short answer: no.
Longer answer: also no. This question has been asked many times. The answer will always be no, and this is why:
You can only find that which is stored in your index. In order to make your data searchable, your database needs to know what type of data each field contains and how it should be indexed. If you switch a field type from e.g. a string to a date, all of the data for that field that you already have indexed becomes useless. One way or another, you need to reindex that field.
This applies not just to Elasticsearch, but to any database that uses indices for searching. And if it isn't using indices then it is sacrificing speed for flexibility.
Elasticsearch (and Lucene) stores its indices in immutable segments: each segment is a "mini" inverted index. These segments are never updated in place. Updating a document actually creates a new document and marks the old document as deleted. As you add more documents (or update existing documents), new segments are created. A merge process runs in the background merging several smaller segments into a new big segment, after which the old segments are removed entirely.
Typically, an index in Elasticsearch will contain documents of different types. Each _type has its own schema or mapping. A single segment may contain documents of any type. So, if you want to change the field definition for a single field in a single type, you have little option but to reindex all of the documents in your index.
If you are interested in more info, you can read the rest of the excerpt here by Clinton Gormley.
I also suggest the following readings:
Elasticsearch Zero Downtime Reindexing – Problems and Solutions
The SO question : Is there a smarter way to reindex elasticsearch?
You have to create a new index with the updated mapping (one level removed). You cannot update the existing mapping to achieve what you want.
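A minimal sketch of what that reindex could look like, assuming the new index already exists with the flattened mapping. The index names and helper names are hypothetical, and the Painless script and "source" key assume a reasonably recent Elasticsearch version. The pure-Python function mirrors what the script is meant to do to each document, so the transformation can be checked without a cluster.

```python
def flatten_one_level(doc, outer="A", inner="B"):
    """Return a copy of doc in which doc[outer][inner] replaces doc[outer]."""
    result = dict(doc)
    nested = result.get(outer)
    if isinstance(nested, dict) and isinstance(nested.get(inner), dict):
        result[outer] = dict(nested[inner])
    return result


def flatten_reindex_body(source_index, dest_index, outer="A", inner="B"):
    """Build a _reindex body whose script applies the same flattening on the server side."""
    script = (
        f"if (ctx._source['{outer}'] != null && ctx._source['{outer}']['{inner}'] != null) "
        f"{{ ctx._source['{outer}'] = ctx._source['{outer}']['{inner}']; }}"
    )
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "script": {"lang": "painless", "source": script},
    }


flat = flatten_one_level({"A": {"B": {"c": 1, "d": 2, "e": 3}}})
body = flatten_reindex_body("old_index", "new_index")
print(flat)
print(body["script"]["source"])
```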

elasticsearch - routing VS. indexing for query performance

I'm planning a strategy for querying millions of docs in date and user directions.
Option 1 - indexing by user. routing by date.
Option 2 - indexing by date. routing by user.
What are the differences or advantages when using routing or indexing?
One of the design patterns that Shay Banon (of Elasticsearch) recommends is: index by time range, route by user and use aliasing.
Create an index for each day (or a date range) and route documents on user field, so you could 'retire' older logs and you don't need queries to execute on all shards:
$ curl -XPOST localhost:9200/user_logs_20140418 -d '{
"mappings" : {
"user_log" : {
"_routing": {
"required": true,
"path": "user"
},
"properties" : {
"user" : { "type" : "string" },
"log_time": { "type": "date" }
}
}
}
}'
Create an alias to filter and route on users, so you could query for documents of user_foo:
$ curl -XPOST localhost:9200/_aliases -d '{
"actions": [{
"add": {
"index": "user_logs_20140418",
"alias": "user_foo",
"filter": {"term": {"user": "foo"}},
"routing": "foo"
}
}]
}'
Create aliases for time windows, so you could query for documents this_week:
$ curl -XPOST localhost:9200/_aliases -d '{
"actions": [
{
"add": {
"indices": ["user_logs_20140418", "user_logs_20140417", "user_logs_20140416", "user_logs_20140415", "user_logs_20140414"],
"alias": "this_week"
}
},
{
"remove": {
"indices": ["user_logs_20140413", "user_logs_20140412", "user_logs_20140411", "user_logs_20140410", "user_logs_20140409", "user_logs_20140408", "user_logs_20140407"],
"alias": "this_week"
}
}
]
}'
Some of the advantages of this approach:
if you search using aliases for users, you hit only shards where the users' data resides
if a user's data grows, you could consider creating a separate index for that user (all you need is to point that user's alias to the new index)
no performance implications from over-allocating shards
you could 'retire' older logs by simply closing (when you close indices, they consume practically no resources) or deleting an entire index (deleting an index is simpler than deleting documents within an index)
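Since the this_week alias has to be moved every day, the index lists are better generated than typed. Here is a rough sketch that rebuilds the same lists as the curl example for a given date. The function names and the 5-day/7-day window sizes are assumptions taken from that example, and the remove action assumes all the listed indices actually exist.

```python
from datetime import date, timedelta


def index_name(day, prefix="user_logs_"):
    """Daily index name in the user_logs_YYYYMMDD scheme used above."""
    return prefix + day.strftime("%Y%m%d")


def this_week_actions(today, window=5, retire=7, alias="this_week"):
    """Body for POST /_aliases: keep the last `window` days, drop the `retire` days before them."""
    keep = [index_name(today - timedelta(days=i)) for i in range(window)]
    drop = [index_name(today - timedelta(days=i)) for i in range(window, window + retire)]
    return {
        "actions": [
            {"add": {"indices": keep, "alias": alias}},
            {"remove": {"indices": drop, "alias": alias}},
        ]
    }


body = this_week_actions(date(2014, 4, 18))
print(body)
```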
Indexing is the process of parsing a document (tokenizing and filtering it) and storing the result in an inverted index. It's like the index at the back of a textbook.
When the indexed data exceeds the limits of one server, instead of upgrading the server you add another server and share the data between them. This process is called sharding.
A search runs on all shards, performs a map-reduce and returns the results. If we group similar data together and search only within that specific group, it reduces the processing power needed and increases speed.
Routing is used to store a group of data on particular shards. The field selected for routing should be present in all docs and should not contain multiple values.
Note: routing should be used in an environment with multiple shards, not on a single node. On a single node there is no use for it.
Let's define the terms first.
Indexing, in the context of Elasticsearch, can mean many things:
indexing a document: writing a new document to Elasticsearch
indexing a field: defining a field in the mapping (schema) as indexed. All fields that you search on need to be indexed (and all fields are indexed by default)
Elasticsearch index: this is a unit of configuration (e.g. the schema/mapping) and of data (i.e. some files on disk). It's like a database, in the sense that a document is written to an index. When you search, you can reach out to one or more indices
Lucene index: an Elasticsearch index can be divided into N shards. A shard is a Lucene index. When you index a document, that document gets routed to one of the shards. When you search in the index, the search is broadcasted to a copy of each shard. Each shard replies with what it knows, then results are aggregated and sent back to the client
Judging by the context, "indexing by user" and "indexing by date" refers to having one index per user or one index per date interval (e.g. day).
Routing refers to sending documents to shards as I described earlier. By default, this is done quite randomly: a hash range is divided by the number of shards. When a document comes in, Elasticsearch hashes its _id. The hash falls into the hash range of one of the shards ==> that's where the document goes.
You can use custom routing to control this: instead of hashing the _id, Elasticsearch can hash a routing value (e.g. the user name). As a result, all documents with the same routing value (i.e. same user) land on the same shard. Routing can then be used at query time, so that Elasticsearch queries just one shard (per index) instead of N. This can bring massive query performance gains (check slide 24 in particular).
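To make the difference concrete, here is a toy simulation of shard selection. It is only a sketch: CRC32 stands in for the hash function, which is not the one Elasticsearch really uses, and the function names are made up. It shows that hashing the _id scatters a user's documents, while hashing the user name as the routing value keeps them on a single shard.

```python
import zlib
from collections import defaultdict


def pick_shard(routing_value, num_shards):
    """Map a routing value to a shard number in [0, num_shards)."""
    # Stand-in hash for illustration only; Elasticsearch uses its own hash function.
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def shards_per_user(docs, num_shards, route_by_user):
    """docs: iterable of (doc_id, user). Returns {user: set of shards holding that user's docs}."""
    placement = defaultdict(set)
    for doc_id, user in docs:
        key = user if route_by_user else doc_id
        placement[user].add(pick_shard(key, num_shards))
    return dict(placement)


docs = [(f"log-{i}", "foo" if i % 2 else "bar") for i in range(20)]
print("routing by _id: ", shards_per_user(docs, 5, route_by_user=False))
print("routing by user:", shards_per_user(docs, 5, route_by_user=True))
```

With routing by user, each user maps to exactly one shard, which is why a query carrying the same routing value only needs to visit that shard.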
Back to the question at hand, I would take it as "what are the differences or advantages when breaking data down by index or using routing?"
To answer, the strategy should account for:
how indexing (writing) is done. If there's heavy indexing, you need to make sure all nodes participate (i.e. write similar amounts of data on the same number of shards), otherwise there will be bottlenecks
how data is queried. If queries often refer to a single user's data, it's useful to have data already broken down by user (index per user or routing by user)
total number of shards. The more shards, nodes and fields you have, the bigger the cluster state. If the cluster state size becomes large (e.g. larger than a few 10s of MB), it becomes harder to keep in sync on all nodes, leading to cluster instability. As a rule of thumb, you'll want to stay within a few 10s of thousands of shards in a single Elasticsearch cluster
In practice, I've seen the following designs:
1. one index per fixed time interval. You'll see this with logs (e.g. Logstash writes to daily indices by default)
2. one index per time interval, rotated by size. This maintains constant index sizes even if write throughput varies
3. one index "series" (either 1. or 2.) per user. This works well if you have few users, because it eliminates filtering. But it won't work with many users because you'd have too many shards
4. one index per time interval (either 1. or 2.) with lots of shards and routing by user. This works well if you have many users. As Mahesh pointed out, it's problematic if some users have lots of data, leading to uneven shards. In this case, you need a way to reindex big users into their own indices (see 3.), and you can use aliases to hide this logic from the application.
I didn't see a design with one index per user and routing by date interval yet. The main disadvantage here is that you'll likely write to one shard at a time (the shard containing today's hash). This will limit your write throughput and your ability to balance writes. But maybe this design works well for a high-but-not-huge number of users (e.g. 1K), few writes and lots of queries for limited time intervals.
BTW, if you want to learn more about this stuff, we have an Elasticsearch Operations training, where we discuss a lot about architecture, trade-offs, how Elasticsearch works under the hood. (disclosure: I deliver this class)
