Contents of elasticsearch snapshot - elasticsearch

We are going to be using the snapshot API for a blue/green deployment of our cluster. We want to snapshot the existing cluster, spin up a new cluster, and restore the data from the snapshot. We also need to apply any changes made to the existing cluster's data to our new cluster (before we switch over and make the new cluster live).
The thinking is that we can index data from our database that has changed after the timestamp of when the snapshot was created, to ensure that any writes that happened to the running live cluster get applied to the new cluster (the new cluster only has the data restored from the snapshot). My question is which timestamp to use. The snapshot API has start_time and end_time values for a given snapshot, but I am not certain that end_time in this context means "all data modified up to this point". I feel like it is just a marker to tell you how long the operation took. I may be wrong.
Does anyone know how to find out what a snapshot contains? Can we use the end_time as a marker to know that the snapshot contains all data modifications before that date?
Thanks!

According to the documentation:
The snapshotting process is executed in a non-blocking fashion. All indexing and searching operations can continue to be executed against the index that is being snapshotted. However, a snapshot represents the point-in-time view of the index at the moment the snapshot was created, so no records that were added to the index after the snapshot process was started will be present in the snapshot.
You will need to use start_time or start_time_in_millis.
Because snapshots are incremental, you can create a first full snapshot and then take another snapshot right after the first one finishes; the second one will be almost instant.
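A rough sketch of that catch-up step, using the snapshot info API to read the start time (the repository, snapshot, and index names are placeholders, and fetch_rows_modified_since is a hypothetical helper against your database):

    import requests

    ES = "http://localhost:9200"          # old (snapshotted) cluster
    NEW_ES = "http://new-cluster:9200"    # cluster restored from the snapshot

    # The snapshot info response lists the indices the snapshot contains,
    # plus start_time(_in_millis) and end_time(_in_millis).
    info = requests.get(f"{ES}/_snapshot/my_backup/snapshot_1").json()
    snap = info["snapshots"][0]
    cutoff_ms = snap["start_time_in_millis"]   # point-in-time the snapshot represents

    # Re-apply anything the database changed after the cutoff to the new cluster.
    for row in fetch_rows_modified_since(cutoff_ms):   # hypothetical DB helper
        requests.put(f"{NEW_ES}/my_index/_doc/{row['id']}", json=row)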
One more question: why build functionality that is already implemented in Elasticsearch? If you can run both clusters at the same time, you can merge them, let them sync, switch write queries to the new cluster, and gradually disconnect the old servers from the merged cluster, leaving only the new ones.
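If you do go the merged-cluster route, one way to drain the old nodes once everything has synced is allocation filtering; a minimal sketch (the excluded IPs are placeholders for the old servers):

    import requests

    ES = "http://localhost:9200"

    # Ask the merged cluster to move all shards off the old nodes; once
    # relocation finishes, those nodes can be stopped one by one.
    requests.put(f"{ES}/_cluster/settings", json={
        "transient": {
            "cluster.routing.allocation.exclude._ip": "10.0.0.1,10.0.0.2"
        }
    })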

Related

Debezium Connector: Capture changes starting from specific SCN

I wonder whether I can start capturing changes from a specific Oracle SCN using the Debezium connector (with LogMiner enabled). The official spec lists only two properties that I can tune:
log.mining.scn.gap.detection.gap.size.min - Specifies the minimum gap size. (Default - 1000000)
log.mining.scn.gap.detection.time.interval.max.ms - Specifies the maximum time interval. (Default - 20000)
So does that mean there is no way to specify an SCN as the point from which to start replication, or am I missing something?
As an example, here is what I am trying to do: given an Oracle snapshot №1, I can fully load and convert all the data to another database using dedicated tools. When I then receive a new, updated snapshot №2, the tool I was using does not meet the requirements for replicating the delta between snapshots 1 and 2, so I need to find another approach. Debezium, as an open-source tool, can probably help here.
A workaround that first comes to mind is running Debezium with the initial load prior to the end of snapshot №1, then restarting the Debezium process with snapshot №2 as the source and replicating all data through Kafka and a sink connector to the target database.
Are there any pitfalls that I don't see at this moment?
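For reference, this is roughly how those two properties sit in a connector registration against the Kafka Connect REST API (a sketch only; the hostnames, credentials, and connector name are placeholders, and other required Oracle connector properties are omitted):

    import requests

    connector = {
        "name": "oracle-connector",     # placeholder connector name
        "config": {
            "connector.class": "io.debezium.connector.oracle.OracleConnector",
            "database.hostname": "oracle-host",
            "database.port": "1521",
            "database.user": "c##dbzuser",
            "database.password": "dbz",
            "database.dbname": "ORCLCDB",
            # the two gap-detection properties mentioned above, at their defaults
            "log.mining.scn.gap.detection.gap.size.min": "1000000",
            "log.mining.scn.gap.detection.time.interval.max.ms": "20000",
        },
    }
    requests.post("http://connect:8083/connectors", json=connector)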

How to delete empty partitions in cratedb?

CrateDB 4.x.x
We have one table that is partitioned by day.
We take a snapshot of the table based on that partition, and after taking the backup we delete that day's data.
Due to the many partitions, the shard count is more than 2000 (the configured number of shards per partition is 6).
I have observed that old partitions contain no data but still exist in the database.
Because of this, it takes more time for the cluster to become healthy and writable after restarting CrateDB.
So is there any way to delete those partitions?
Is there any way to stop data replication on cluster startup? It takes too much time for the cluster to become healthy, and the table is not writable until that process finishes.
Any solution for this issue would be a great help.
You should be able to delete empty partitions with a DELETE using an exact match on the partitioned-by column, e.g. DELETE FROM <tbl> WHERE <partitioned_by_column> = <value>.
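A small sketch of doing that from Python with the crate client, looping over the already backed-up days (the host, table, and column names are placeholders):

    from crate import client

    # Connect to any CrateDB node's HTTP endpoint.
    conn = client.connect("http://localhost:4200")
    cursor = conn.cursor()

    # An exact match on the partition column drops the whole (now empty)
    # partition instead of deleting row by row.
    for day in ["2021-01-01", "2021-01-02"]:   # days already snapshotted and deleted
        cursor.execute("DELETE FROM metrics WHERE day = ?", (day,))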

Is there any option for cold-bootstrapping a persistent store in Kafka Streams?

I have been working with Kafka Streams for a couple of months. We are using RocksDB to store data. Now, the changelog topic keeps data for only a few days, while our application's persistent stores hold data for a few months. How will the store state be restored if a partition is moved from one node to another (which, I think, happens through the changelog)?
Also, suppose the node containing the active task goes down and a new node is introduced. The replica will be promoted to active, and a new replica will start building on the new node. If the changelog has only a few days of data, the new replica will have only that data instead of the original few months.
So, is there any option to transfer data to a replica from the active store rather than from the changelog (as it only has a fraction of the data)?
Changelog topics that are used to back up stores don't have a retention time but are configured with log compaction enabled (cf. https://kafka.apache.org/documentation/#compaction). Thus, it's guaranteed that no data is lost no matter how long you run. The changelog topic will always contain exactly the same data as your RocksDB stores.
Thus, for fail-over or scale-out, when a task migrates and a store needs to be rebuilt, it will be a complete copy of the original store.
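If you want to double-check this on your own topics, a sketch using the kafka-python admin client (the topic name is a placeholder following the <application.id>-<store name>-changelog convention):

    from kafka.admin import KafkaAdminClient, ConfigResource, ConfigResourceType

    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    resource = ConfigResource(ConfigResourceType.TOPIC, "my-app-my-store-changelog")

    # The returned config should show cleanup.policy=compact rather than a
    # time-based delete policy.
    print(admin.describe_configs([resource]))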

Hbase snapshot incremental backup

In HBase, we can use ExportSnapshot, which exports the snapshot data and metadata to, say, another HBase cluster. On the second cluster we can then run "list_snapshots" to check the exported snapshot, and after clone_snapshot "snapshot_name", "new_table_name" we can restore the snapshot on the second cluster.
So, is there any method or utility available to take incremental backups of HBase snapshots, say on a 7-day cycle?
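Not an answer to the incremental part, but a sketch of scripting the periodic snapshot-and-export flow described above with the same ExportSnapshot tool (the table name, destination cluster, and mapper count are placeholders):

    import subprocess
    from datetime import date

    TABLE = "my_table"                        # placeholder table name
    snapshot = f"{TABLE}-{date.today()}"      # e.g. my_table-2024-01-07

    # Take a snapshot of the table (same as `snapshot '<table>', '<name>'` in the hbase shell).
    subprocess.run(["hbase", "shell", "-n"],
                   input=f"snapshot '{TABLE}', '{snapshot}'\n",
                   text=True, check=True)

    # Export the snapshot's data and metadata to the second cluster.
    subprocess.run([
        "hbase", "org.apache.hadoop.hbase.snapshot.ExportSnapshot",
        "-snapshot", snapshot,
        "-copy-to", "hdfs://cluster2:8020/hbase",
        "-mappers", "16",
    ], check=True)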

Cassandra node does not pull data after cleanup and start

I deleted some data files by mistake from one of the Cassandra nodes.
After that I stopped that node, removed the data, commitlog, and saved_caches dirs from it, and started it again.
The node joined and shows as UN in nodetool status and in OpsCenter; it also owns 15.3% of the tokens.
I expected it to start pulling data from the other nodes, but its data stays at 157.31 KB and it's not doing anything.
In the log, the last entry was half an hour ago: Handshaking version with DB03/10.2.106.3 (which is its own IP).
How can I balance the data again?
EDIT: The Cassandra version we use is 2.0.12 (not 2.1 as originally stated).
EDIT: In cassandra.yaml there is no auto_bootstrap entry, so it should be at the default setting of true, according to http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/configuration/configCassandra_yaml_r.html
Try nodetool rebuild, which DataStax describes as "rebuilds data by streaming from other nodes".
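For example (a sketch; the source datacenter name is a placeholder and can be omitted on a single-datacenter cluster):

    import subprocess

    # Stream this node's data back from replicas in the named datacenter.
    subprocess.run(["nodetool", "rebuild", "DC1"], check=True)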
