I am working with Elasticsearch and need to index new data that replaces the old data. This replacement happens every day.
My requirement is that until indexing of the new data is complete, users should be able to search only the old data. Once indexing completes, there should be a pointer in Elasticsearch that switches to the newly indexed data instantly, followed by deletion of the old data. This way I want to achieve zero downtime in the process. Indexing the data may take around 1 hour to complete.
Is there any switching concept in Elasticsearch which can handle this scenario?
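Index aliases are what you want. Have clients search through an alias that points at the old index, build the new index alongside it, then move the alias to the new index in a single atomic request and delete the old index.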
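To make the alias switch concrete, here is a minimal sketch that builds the request body for `POST /_aliases`. The alias and index names are hypothetical, and the function only constructs the JSON; sending it with your HTTP client of choice is left out.

```python
import json


def build_alias_swap(alias: str, old_index: str, new_index: str) -> dict:
    """Build the body for POST /_aliases that moves `alias` to `new_index`.

    Both actions travel in one request, so searches against the alias
    never hit a moment where it points at neither index.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: one index per daily load, one stable alias for searches.
body = build_alias_swap("products", "products-2024.03.04", "products-2024.03.05")
print(json.dumps(body, indent=2))
```

Once the swap has succeeded, the old index can be dropped with `DELETE /products-2024.03.04`. Users keep querying `products` throughout the hour-long load and never need to know the real index name.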
Related
I recently made a new version of an index for my Elasticsearch data with some new fields included. I re-indexed from the old index, so the new index has all of the old data plus the new mapping that includes the new fields.
Now, I'd like to update all of my elasticsearch data in the index to include these new fields, which I can calculate by making some separate database + api calls to other sources.
What is the best way to do this, given that there are millions of records in the index?
Logistically speaking I'm not sure how to accomplish this... as in how can I keep track of the records that I've updated? I've been reading about the scroll api, but not certain if this is valid because of the max scroll time of 24 hours (what if it takes longer than that)? Also a serious consideration is that since I need to make other database calls to calculate the new field values, I don't want to hammer that database for too long in a single session.
Would there be some way to run an update for say 10 minutes every night, but keep track of what records have been updated/need updating?
I'm just not sure about a lot of this, would appreciate any insights or other ideas on how to go about it.
You would need to run an update by query on your original index, which is expensive.
You might be able to use aliases that point to the indices behind them. When you want to make a change, create a new index with the new mappings etc. and attach it to the alias so that new incoming data gets written correctly, then reindex the "old" data into the new index.
That will depend on the details of what you're doing, though.
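On the question of tracking progress across short nightly runs, one possible approach (not something the answer above prescribes) is to let the index itself be the bookkeeping: search for documents that do not yet have the new field, and stop when a time budget runs out. The sketch below assumes hypothetical `fetch_batch` and `update_batch` callables that wrap your own search and bulk-update calls, and it assumes updates become visible to the next search (for example after a refresh).

```python
import time
from typing import Callable, List


def missing_field_query(field: str, batch_size: int) -> dict:
    """Search body matching documents that do not have `field` yet."""
    return {
        "size": batch_size,
        "query": {"bool": {"must_not": [{"exists": {"field": field}}]}},
    }


def run_time_boxed(
    fetch_batch: Callable[[], List[dict]],
    update_batch: Callable[[List[dict]], None],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Process batches until the budget is spent or nothing is left.

    Returns the number of documents handled in this run. A batch that has
    started is always finished, so a run can overshoot the budget slightly.
    """
    deadline = clock() + budget_seconds
    updated = 0
    while clock() < deadline:
        batch = fetch_batch()  # hypothetical: runs missing_field_query(...)
        if not batch:
            break  # every document already has the new field
        update_batch(batch)  # hypothetical: computes and writes the field
        updated += len(batch)
    return updated
```

Because each run re-queries for what is still missing, there is no scroll context to keep alive between nights, and a 10-minute budget caps how long the external database is queried in one session.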
I have two types of indices in my Elasticsearch cluster. The first contains data that is updated in near-real time. The second is data I can use to enhance the first, and it is updated nightly. I am new to Elasticsearch and I'm wondering if there are any good patterns that would easily allow me to update the streaming data with the nightly batches.
I've looked at the enrichment processor, but that appears to enrich at time of index. The enrichment data I have might be there, or might show up that night.
My goal is to create a dashboard that uses the enrichment index to help identify what documents in the streaming data I care about; and eventually add more fields for detailed exploration from there. In SQL terms: "count the number of documents where the ID of the stream document exists in the enrichment data", but that is pretty much a JOIN which I believe I should be avoiding given the large size of both indices.
Enrich processors can run at index time, but also after documents have already been indexed, by using the _update_by_query endpoint with an ingest pipeline.
The idea is this: you index your streaming data in real time. Once your second data set comes in, you can create a new index to store it, then build an enrich index from it (by creating and executing an enrich policy), and finally update your first data set through a pipeline that uses the enrich processor.
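The sequence described above can be sketched as an ordered list of REST requests. The index names, field names, policy name, and `target_field` below are all hypothetical; the function only builds the method, path, and body for each step and does not send anything.

```python
from typing import List, Optional, Sequence, Tuple

Request = Tuple[str, str, Optional[dict]]  # (HTTP method, path, JSON body)


def enrich_requests(
    source_index: str,
    target_index: str,
    match_field: str,
    enrich_fields: Sequence[str],
    name: str = "nightly-enrich",
) -> List[Request]:
    """Requests to enrich already-indexed documents, in execution order."""
    policy = {
        "match": {
            "indices": source_index,
            "match_field": match_field,
            "enrich_fields": list(enrich_fields),
        }
    }
    pipeline = {
        "processors": [
            {
                "enrich": {
                    "policy_name": name,
                    "field": match_field,
                    "target_field": "enrichment",  # hypothetical field name
                }
            }
        ]
    }
    return [
        ("PUT", f"/_enrich/policy/{name}", policy),            # define policy
        ("POST", f"/_enrich/policy/{name}/_execute", None),     # build enrich index
        ("PUT", f"/_ingest/pipeline/{name}", pipeline),         # define pipeline
        ("POST", f"/{target_index}/_update_by_query?pipeline={name}", None),
    ]


for method, path, _body in enrich_requests("nightly", "stream", "id", ["category"]):
    print(method, path)
```

On later nights the policy and pipeline already exist, so only the `_execute` and `_update_by_query` steps would need repeating. After the update, a dashboard can filter on whether the target field exists instead of performing a join.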
I am using Elasticsearch with Kibana to store and visualize data from my logs. I know it is customary to use Logstash, but I just use the Elasticsearch REST API and POST new elements to it.
I am looking for best practices on how I should manage my indices, given that I have about 50k logs per day and want to visualize sometimes weekly, sometimes monthly, and sometimes yearly data. I also have no need for more than one node; I don't need a highly available cluster.
So I am basically trying to determine:
- How should I store my indexes, by time? Monthly? Weekly? One index for everything?
- What are the disadvantages of a huge index (one index that contains all my data)? Does it mean that the entire index is in memory?
Thank you.
I like to match indexes to the data retention policy. Daily indexes work very well for log files, so you can expire one day's worth after X days of retention.
The fewer indexes/shards you have, the less RAM is used in overhead by Elasticsearch to manage them.
The mapping for a field is frozen when the field is added to the index. With a daily index, I can update the mapping, have it take effect for the new indexes, and wait for the old ones to expire. With longer-term indexes, you'd probably need to reindex the data, which I always try to avoid.
The number of primary shards is also fixed when you create the index (the number of replicas can still be changed afterwards).
You can visualize them in Kibana regardless of how they're stored. Use the @timestamp field as your X-axis and change the "interval" to the period you want.
Using Logstash would be important if you wanted to alter your logs at all. We do a lot of normalization and creation of new fields, so it's very helpful. If it's not a requirement for you, you might also look into Filebeat, which can write directly to Elasticsearch.
Lots to consider...
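As an illustration of matching daily indexes to a retention policy, here is a small sketch that names the index for a given day and picks out the indexes that have aged past retention. The `logs-` prefix and the date layout are assumptions for the example; deleting the returned indexes (for instance with `DELETE /<index>`) is left to the caller.

```python
from datetime import date, datetime, timedelta
from typing import Iterable, List


def daily_index_name(day: date, prefix: str = "logs-") -> str:
    """Name of the index that holds one day's documents."""
    return f"{prefix}{day:%Y.%m.%d}"


def expired_indexes(
    names: Iterable[str], today: date, retention_days: int, prefix: str = "logs-"
) -> List[str]:
    """Return the daily indexes whose date is older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in names:
        if not name.startswith(prefix):
            continue  # not one of our daily indexes
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # suffix is not a date, leave the index alone
        if day < cutoff:
            expired.append(name)
    return expired
```

With 50k logs per day on a single node, daily indexes may well be more granular than necessary; the same functions work for monthly indexes if the format string is changed to `%Y.%m`.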
An example of this is the Logstash format. Logstash names its Elasticsearch indexes [logstash-]YYYY.MM.DD, so a new index is used each day. Elasticsearch itself is then queried by Kibana. Is there a reason it's done this way? What is the advantage?
Advantages that come to mind:
If you're looking for Tuesday's data, you can just look in Tuesday's index.
You can delete old data more easily.
If you want to modify the mappings, you can update the template and the changes will take effect the next day. It's your choice if you want to reindex the old data or not.
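The template point can be sketched as follows. This builds a body for `PUT /_index_template/<name>`, assuming a cluster version that supports composable index templates (older clusters use the legacy `_template` endpoint with a different body layout). The pattern, field, and shard settings are illustrative only.

```python
def build_index_template(
    pattern: str, properties: dict, shards: int = 1, replicas: int = 0
) -> dict:
    """Body for PUT /_index_template/<name>.

    Every index created later whose name matches `pattern` picks up these
    settings and mappings; indexes that already exist are not changed.
    """
    return {
        "index_patterns": [pattern],
        "template": {
            "settings": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            },
            "mappings": {"properties": properties},
        },
    }


# Hypothetical mapping change: treat `status` as a keyword from tomorrow on.
template = build_index_template("logstash-*", {"status": {"type": "keyword"}})
```

Updating the template today means tomorrow's `logstash-YYYY.MM.DD` index is created with the new mapping, while older daily indexes keep their original mapping until they are reindexed or expire.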
We are thinking about implementing some sort of message cache which would hold onto the messages we send to our search index, so that they persist while the index is down for an extended period of time (for example during a complete re-index) and can then be 're-applied'. These messages are creations or updates of the documents we index. If space were cheap enough, with something as scalable as Couchbase we might even be able to hold all messages, but I haven't done any estimates of message size and quantity yet. Anyway, I suggested Couchbase + XDCR + Elasticsearch for this task, as most of the work would be done automatically. However, I have 4 remaining questions:
1. If we were implementing this as a cache, I would not want Elasticsearch to remove any documents that were not in Couchbase. Is this possible (perhaps it is even the default behaviour)?
2. Is it possible to apply some sort of versioning so that a document in the index is not overwritten by an older version coming from Couchbase?
3. If I were to add a new field to the index, I might need to re-index from the actual document datasource and then re-apply all the messages stored in Couchbase. I may have 100 million documents in Elasticsearch and, say, 500,000 documents in Couchbase that I want to re-apply to Elasticsearch. What would the speed be like?
4. Would I be able to apply any sort of logic in between Couchbase and Elasticsearch?
Update:
So we store documents in an RDBMS as we need instant access to inserted docs plus some other stuff. We send limited versions of the document to a search engine via messages. If we want to add a field to the index we need to re-index the system from the RDBMS somehow. If we have this Couchbase message cache we could add the field to messages first, then switch off the indexing of old messages and re-index from the RDBMS. We could then switch back on the indexing of the messages and the entire 'queue' of messages would be indexed without having lost anything.
This system (if it worked) would remove the need for an MQ server, a message listener and make sure no documents were missing from the index.
The versioning would be necessary as we don't want to apply an 'update' to the index which actually contains a more recent document (not sure if this would ever happen now I think about it).
I appreciate it's probably not too great a job to implement points 1 and 4 by changing the Elasticsearch plugin code but I would like to confirm that the idea is reasonable first!
The Couchbase-Elasticsearch integration today should be seen as an indexing engine for Couchbase. This means the index is "managed/controlled" by the data that is in Couchbase.
XDCR is used to send "all the events" to Elasticsearch. This means the index is updated every time a document (stored in Couchbase) is created, modified, or deleted.
So "all the documents" stored in a Couchbase bucket are indexed into Elasticsearch.
Let's answer your questions one by one, based on the current implementation of the Couchbase-Elasticsearch integration.
1. When a document is removed from Couchbase, the Elasticsearch index is updated (the entry is removed).
2. I'm not sure I understand the question. How could an "older" version come from Couchbase? Anyway, once again, every time a document stored in Couchbase is modified, the index in Elasticsearch is updated.
3. I'm not sure I understand where you want to add the new field. If it is in a document stored in Couchbase, the index will be updated when the document is sent to Elasticsearch. But based on what I said before: all documents "stored" in Couchbase will be present in the Elasticsearch index.
4. Not with the plugin as it is today, but as you know it is an open source project, so you can either add some logic to it or even contribute your ideas to the project (https://github.com/couchbaselabs/elasticsearch-transport-couchbase).
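On the versioning question, Elasticsearch itself offers external versioning: an index request sent with `version_type=external` is applied only if the supplied version is higher than the stored one, and is otherwise rejected with a version conflict. Whether the Couchbase plugin can pass such a version along is not something this thread confirms, so the sketch below is only an illustration of the rule, using a hypothetical in-memory store in place of a real cluster.

```python
from typing import Dict, Tuple


def external_version_path(index: str, doc_id: str, version: int) -> str:
    """Path for an index request that carries an externally managed version."""
    return f"/{index}/_doc/{doc_id}?version={version}&version_type=external"


class VersionedStore:
    """In-memory stand-in that mimics the external-version rule."""

    def __init__(self) -> None:
        self._docs: Dict[str, Tuple[int, dict]] = {}

    def index(self, doc_id: str, version: int, doc: dict) -> bool:
        """Store `doc` only if `version` is newer than what is already held."""
        current = self._docs.get(doc_id)
        if current is not None and version <= current[0]:
            return False  # a real cluster would answer with a version conflict
        self._docs[doc_id] = (version, doc)
        return True

    def get(self, doc_id: str) -> dict:
        return self._docs[doc_id][1]


store = VersionedStore()
store.index("doc-1", 2, {"title": "current"})
store.index("doc-1", 1, {"title": "stale"})  # rejected, version 1 is older
```

With a rule like this in place, replaying a queue of cached messages after a re-index could not overwrite a document with an older copy, provided each message carries a monotonically increasing version such as a revision counter from the RDBMS.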
So let me ask you more questions:
- how do you insert documents into your application? (and where: Couchbase? Elasticsearch?)
- what are the types of documents?
- what do you want to cache into Couchbase?