Create an EBS volume from the last of several EBS snapshots of the same volume

I have an EBS-backed EC2 instance. I take snapshots of the EBS volume a few times, say s1, s2, s3, with s3 being the most recent. Now I need to launch another EBS-backed EC2 instance and want to apply the snapshots taken earlier to the EBS volume of the new instance. I know that EBS snapshots are taken incrementally, meaning only the blocks changed since the last snapshot are captured. I wonder: if I only apply the last snapshot (s3) to the new EBS volume, does that mean the data captured in s1 and s2 won't make it onto the new volume? Or, put another way, do I need to apply s1, s2, and s3 sequentially and manually to the new volume in order to get the full data set?

When you create a new volume from snapshot s3, you will get the full state the EBS volume was in when you created s3.
Snapshots are saved incrementally for performance, but restoring from any given snapshot gives you the complete, consistent state of the volume at that point in time.
Even though snapshots are stored incrementally, when you delete a snapshot, only the data not needed by any other snapshot is removed. So regardless of which prior snapshots have been deleted, every active snapshot contains all the information needed to restore the volume.
http://aws.amazon.com/ebs/
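For illustration, here is a minimal sketch using boto3 (the snapshot ID, instance ID, availability zone, and device name are placeholders, not values from the question):

# Sketch: restore the latest snapshot ("s3") to a fresh volume and attach it.
# Assumes boto3 is installed and AWS credentials/region are configured.
import boto3

ec2 = boto3.client("ec2")

# A volume created from s3 alone already contains the full data set.
volume = ec2.create_volume(
    SnapshotId="snap-0123456789abcdef0",   # hypothetical ID for snapshot s3
    AvailabilityZone="us-east-1a",
)

# Wait until the volume is ready, then attach it to the new instance.
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",      # hypothetical new instance
    Device="/dev/sdf",
)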

Related

Contents of elasticsearch snapshot

We are going to use the snapshot API for a blue/green deployment of our cluster. We want to snapshot the existing cluster, spin up a new cluster, and restore the data from the snapshot. We also need to apply any changes made to the existing cluster's data to our new cluster (before we switch over and make the new cluster live).
The thinking is that we can index data from our database that changed after the timestamp at which the snapshot was created, to ensure that any writes that happened against the running live cluster also get applied to the new cluster (the new cluster only has the data restored from the snapshot). My question is which timestamp to use. The snapshot API exposes start_time and end_time values for a given snapshot, but I am not certain that end_time in this context means "all data modified up to this point". I suspect it is just a marker telling you how long the operation took. I may be wrong.
Does anyone know how to find out what a snapshot contains? Can we use end_time as a marker indicating that the snapshot contains all data modifications made before that date?
Thanks!
According to the documentation:
The snapshot process is executed in a non-blocking fashion. All indexing and searching operations can continue to be executed against the index that is being snapshotted. However, a snapshot represents a point-in-time view of the index at the moment the snapshot was created, so no records that were added to the index after the snapshot process was started will be present in the snapshot.
You will need to use start_time or start_time_in_millis.
Because snapshots are incremental, you can create a first full snapshot and then take one more snapshot right after the first one finishes; the second one will be almost instant.
One more question: why recreate functionality already implemented in Elasticsearch? If you can run both clusters at the same time, you can merge them, let them sync, switch write queries to the new nodes, and gradually disconnect the old servers from the merged cluster, leaving only the new ones.
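For illustration, a minimal sketch using the elasticsearch-py client (the repository name, snapshot names, and host are assumptions, not taken from the question):

# Sketch: take a full snapshot, then an almost-instant incremental one,
# and read its start_time to use as the replay cut-off.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.snapshot.create(repository="my_repo", snapshot="snap_full",
                   wait_for_completion=True)
es.snapshot.create(repository="my_repo", snapshot="snap_incremental",
                   wait_for_completion=True)

info = es.snapshot.get(repository="my_repo", snapshot="snap_incremental")
snap = info["snapshots"][0]
print(snap["start_time"], snap["start_time_in_millis"])

Anything written to the live cluster after that start_time is what you would re-index from your database into the new cluster.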

Is there any option for cold-bootstrapping a persistent store in Kafka Streams?

I have been working with Kafka Streams for a couple of months. We are using RocksDB to store data. The changelog topic keeps data for only a few days, while our application's persistent stores hold data from a few months. How will the store state be restored if a partition is moved from one node to another (which, I think, happens through the changelog)?
Also, suppose the node containing the active task goes down and a new node is introduced. The replica will be promoted to active and a new replica will start being built on the new node. If the changelog has only a few days of data, the new replica will have only that data instead of the original few months.
So, is there any option to transfer data to a replica from the active store rather than from the changelog (as it only has a fraction of the data)?
Changelog topics that are used to back up stores don't have a retention time but are configured with log compaction enabled (cf. https://kafka.apache.org/documentation/#compaction). Thus, it's guaranteed that no data is lost no matter how long you run. The changelog topic will always contain the exact same data as your RocksDB stores.
Thus, for fail-over or scale-out, when a task migrates and a store needs to be rebuilt, the rebuilt store will be a complete copy of the original store.
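If you want to verify this on your own topics, here is a minimal sketch using the confluent-kafka admin client (the broker address and the changelog topic name, which normally follows the <application.id>-<store name>-changelog pattern, are assumptions):

# Sketch: check that a Kafka Streams changelog topic is log-compacted
# rather than time-retained.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Hypothetical changelog topic name.
resource = ConfigResource(ConfigResource.Type.TOPIC,
                          "my-app-my-store-changelog")

for res, future in admin.describe_configs([resource]).items():
    configs = future.result()
    print(configs["cleanup.policy"].value)  # expected: "compact"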

How does backup work when the flow.xml size is more than the max storage?

I have checked the properties below, used for the backup operations in NiFi 1.0.0, with respect to this JIRA:
https://issues.apache.org/jira/browse/NIFI-2145
nifi.flow.configuration.archive.max.time=1 hours
nifi.flow.configuration.archive.max.storage=1 MB
There are two backup files: "conf/flow.xml.gz" and "conf/archive/flow.xml.gz".
Archived workflows (conf/archive/flow.xml.gz) are saved hourly, as set by the "max.time" property.
At a particular point I reached "1 MB" (the value set as the storage limit).
At that point NiFi deleted the existing conf/archive/flow.xml.gz completely and did not write a new flow file into conf/archive/flow.xml.gz because the size limit was exceeded.
No log message shows that the new flow.xml.gz is larger than the specified storage.
Why does it delete the existing flows and not write the new flows because of the storage limit?
In this case, has one of the two backup operations failed or not?

Greenplum DCA: how to back up and restore from version V2 to V3

We have a small array of Greenplum DCA V1 and V3 appliances.
We are trying to work out the backup/restore process steps between them.
As novices to DCA appliances, we are banging our heads against the wall trying to understand the parallel backup process in a logical way.
We tried to run a parallel backup using gpcrondump/gpdbrestore, but did not understand how the process executes
on the master host
on the segment hosts
The question is: how does a parallel backup work in a master-segment DCA environment, from version to version?
gpcrondump executes a backup in parallel. It basically coordinates the backups across all segments. By default, each segment will create a db_dumps directory in each segment's $PGDATA directory and a sub-directory under that with a date format.
For example, let's say you have 4 segments per host and hosts sdw1-4. The dumps will be created in:
/data1/gpseg0/db_dumps/20161111/
/data1/gpseg1/db_dumps/20161111/
/data2/gpseg2/db_dumps/20161111/
/data2/gpseg3/db_dumps/20161111/
This repeats across all segments.
Each segment dumps only its own data to this dump location. gpcrondump names the files, makes sure the dump completes successfully, etc., while each segment dumps its data independently of the other segments. Thus, it is done in parallel.
The master will also have a backup directory created but there isn't much data in this location. It is mainly metadata about the backup that was executed.
The metadata for each backup is pretty important. It contains the segment id and the content id for the backup.
gpdbrestore restores a backup created by gpcrondump. It reads the backup files, loads them into the database, and makes sure the segment id and content id match the target. So the number of segments in the backup must match the number of segments you restore to, and the mapping of segment id to content id must be the same.
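For illustration only, a minimal sketch of driving both utilities from the master host (the database name, timestamp key, and flags are assumptions based on common gpcrondump/gpdbrestore usage, not taken from this answer):

# Sketch: run a parallel dump and a restore from the Greenplum master.
# Assumes the Greenplum environment (greenplum_path.sh, MASTER_DATA_DIRECTORY)
# is already sourced so the utilities are on PATH.
import subprocess

# Dump database "mydb"; each segment writes its own files under
# <segment data dir>/db_dumps/<YYYYMMDD>/ in parallel.
subprocess.run(["gpcrondump", "-x", "mydb", "-a"], check=True)

# Restore the dump identified by its timestamp key (hypothetical value);
# the target cluster must have the same segment count and content-id mapping.
subprocess.run(["gpdbrestore", "-t", "20161111120000", "-a"], check=True)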
Migration from one cluster can be done multiple ways. One way is to do a backup and then restore. This requires the same configuration in both clusters. You have to copy all of the backup files from one cluster to the other as well. Alternatively, you could backup and restore from a backup device like DataDomain.
You can also use a built-in tool called gptransfer. This doesn't use a backup; instead, it uses external tables to transfer from one cluster to another. The configuration of the two clusters doesn't have to be the same when using this tool, but if you are going from a larger cluster to a smaller cluster, it will not be done in parallel.
I highly recommend you reach out to your Pivotal Account Rep to get some assistance. More than likely, you have already paid for services when buying the new DCA that will cover part or all of the migration work. You will have to configure networking between the two clusters which requires some help from EMC too.
Good luck!!

How to store the last processed S3 file name in a Redshift database

So far I have copied data from Amazon S3 to Amazon Redshift using AWS Data Pipeline, but only for the current date and time. I want to copy data from S3 to Redshift every 30 minutes, and I also want the name of the last processed S3 file to be stored in another Redshift table.
Could somebody answer this question?
You can use the RedshiftCopyActivity Data Pipeline object to do exactly this. The schedule field of the RedshiftCopyActivity object accepts a Data Pipeline Schedule object that can run on 30-minute intervals. You'll need to define a full pipeline in JSON, including all your AWS resource info (Redshift data nodes, EC2 instances, S3 bucket and key). The file path for the source data in the JSON template could point to a static file that is overwritten every 30 minutes by whatever produces the data.
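For the part about recording which S3 file was loaded last, here is a minimal sketch outside of Data Pipeline (the connection details, table names, S3 key, and IAM role ARN are all placeholders), assuming psycopg2 and a tracking table processed_files(s3_key, loaded_at):

# Sketch: COPY one S3 file into Redshift and record its key in a tracking table.
import psycopg2

conn = psycopg2.connect(host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="mydb", user="admin", password="...")

s3_key = "s3://my-bucket/exports/latest.csv"  # hypothetical source file

with conn, conn.cursor() as cur:
    # Load the file into the target table.
    cur.execute(
        "COPY public.my_table "
        f"FROM '{s3_key}' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role' "
        "FORMAT AS CSV;"
    )
    # Remember which file was processed, in a separate table.
    cur.execute(
        "INSERT INTO public.processed_files (s3_key, loaded_at) "
        "VALUES (%s, GETDATE());",
        (s3_key,),
    )
conn.close()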
