Is there a way to preserve data during disk migration on SoftLayer?

I have many VMs with local system and portable disks, and I want to migrate them to SAN disks (both system and portable) through the SL API. From some experiments in the SL portal, it appears I have to migrate the system disk to SAN first, open an SL support ticket to delete the local portable disk, and then order a SAN portable disk once the system disk migration has completed. In that process the data on the local portable disk is not copied to the SAN disk automatically, so users have to copy the data manually themselves. Is there a safer or better way to migrate the data as part of the disk migration? Thanks.

Sorry, there is no way to preserve data during disk migration. The user needs to back up and restore the data manually.
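If you do end up copying the data by hand, a minimal sketch of one approach is below, assuming both the local portable disk and the new SAN portable disk are mounted on the VM at the same time; the mount points /mnt/local-portable and /mnt/san-portable are hypothetical and depend on your setup.

    # Hypothetical mount points -- adjust to how the disks are attached to the VM.
    LOCAL_SRC=/mnt/local-portable
    SAN_DST=/mnt/san-portable

    # Dry run first to see what would be copied.
    rsync -aHAX --numeric-ids --dry-run "$LOCAL_SRC/" "$SAN_DST/"

    # Real copy, preserving permissions, ownership, hard links, ACLs and xattrs.
    rsync -aHAX --numeric-ids --delete "$LOCAL_SRC/" "$SAN_DST/"

    # Compare file listings before asking support to delete the local disk.
    diff <(cd "$LOCAL_SRC" && find . -type f | sort) \
         <(cd "$SAN_DST" && find . -type f | sort)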

Related

How to back up entire HDFS data on a local machine

We have a small CDH cluster of 3 nodes with approximately 2 TB of data. We are planning to expand it, but before that the current Hadoop machines/racks are being relocated. I want to make sure I have a backup on a local machine in case the racks somehow are not relocated (or get damaged on the way) and we have to install new ones. How do I ensure this?
I have taken a snapshot of the HDFS data from Cloudera Manager as a backup, but it resides on the cluster. In this case I need to take a backup of the whole data set onto a local machine or hard drive. Please advise.
DistCp the data somewhere.
Possible options:
own solution - a temporary cluster. 2 TB is not that much, and hardware is cheap.
managed solution - copy to the cloud. There are plenty of storage-as-a-service providers; if you are not sure, S3 should work for you. Of course the data transfer is your cost, but there is always a trade-off between a managed service and something you build yourself.
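As a minimal sketch of the DistCp option, assuming the cluster can reach S3 and that the bucket name, credentials and HDFS path below are placeholders you would substitute:

    # Placeholder bucket and credentials -- substitute your own.
    BUCKET=s3a://my-hdfs-backup-bucket

    # Copy everything under /data in HDFS to the bucket. -update skips files
    # that are already present and unchanged, so the command can be re-run
    # if the transfer is interrupted.
    hadoop distcp \
      -D fs.s3a.access.key=YOUR_ACCESS_KEY \
      -D fs.s3a.secret.key=YOUR_SECRET_KEY \
      -update \
      hdfs:///data \
      "$BUCKET"/data

If the target is instead a drive attached to an edge node, hdfs dfs -get /data /mnt/backup-drive/data (with a hypothetical mount point) is the simpler route for a data set of this size.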

How do s3n/s3a manage files?

I've been using services like Kafka Connect and Secor to persist Parquet files to S3. I'm not very familiar with HDFS or Hadoop, but it seems like these services typically write temporary files either into local memory or to disk before writing in bulk to S3. Do the s3n/s3a file systems virtualize an HDFS-style file system locally and then push at configured intervals, or is there a one-to-one correspondence between a write to s3n/s3a and a write to S3?
I'm not entirely sure I'm asking the right question here. Any guidance would be appreciated.
S3A/S3N just implement the Hadoop FileSystem APIs against the remote object store, including pretending it has directories you can rename and delete.
They have historically saved all the data you write to the local disk until you close() the output stream, at which point the upload takes place (which can be slow). This means that you must have as much temporary space as the biggest object you plan to create.
Hadoop 2.8 has a fast upload stream which uploads the file in 5+MB blocks as it gets written, then in the final close() makes it visible in the object store. This is measurably faster when generating lots of data in a single stream. This also avoids needing so much disk space.
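As a rough sketch of turning that behaviour on per job (the property names come from the S3A documentation; the paths and bucket are placeholders):

    # Enable S3A "fast upload": data is buffered and uploaded in multipart
    # blocks as the stream is written, instead of being staged on local disk
    # until close(). The buffer can be "disk", "array" (on-heap) or "bytebuffer".
    hadoop distcp \
      -D fs.s3a.fast.upload=true \
      -D fs.s3a.fast.upload.buffer=disk \
      -D fs.s3a.multipart.size=64M \
      hdfs:///logs \
      s3a://example-bucket/logs

The same properties can also be set cluster-wide in core-site.xml instead of per command.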

How to do a Backup and Restore of Cassandra Nodes in AWS?

We have two m3.large instances that we want to back up. How do we go about it?
The data is on the SSD drive.
nodetool snapshot will write the snapshot data back to the same SSD drive. What is the correct procedure to follow?
You can certainly use nodetool snapshot to back up your data on each node. You will have to have enough SSD space to account for the snapshots and the compaction frequency; typically you would need about 50% of the SSD storage reserved for this. There are other options as well. DataStax OpsCenter has backup and recovery capabilities that use snapshots and help automate some of the steps, but you will need storage allocated for that as well. Talena also has a solution for backup/restore and test/dev management for Cassandra (and other data stores like HDFS, Hive, Impala, Vertica, etc.). It relies less on snapshots by making copies off-cluster and simplifying restores.
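A hedged sketch of the snapshot-then-copy-off-cluster routine is below; the keyspace, snapshot tag, bucket and Cassandra data directory are placeholders, and the script would be run on each node.

    #!/usr/bin/env bash
    set -euo pipefail

    KEYSPACE=my_keyspace              # placeholder keyspace
    TAG=backup-$(date +%Y%m%d)        # snapshot tag
    BUCKET=s3://example-cassandra-backups/$(hostname)
    DATA_DIR=/var/lib/cassandra/data  # default data directory; adjust if needed

    # 1. Take a point-in-time snapshot (hard links, so cheap on disk).
    nodetool snapshot -t "$TAG" "$KEYSPACE"

    # 2. Copy the snapshot directories off the node so a disk failure
    #    does not take the backup with it.
    find "$DATA_DIR/$KEYSPACE" -type d -path "*snapshots/$TAG" |
    while read -r dir; do
      aws s3 sync "$dir" "$BUCKET/$TAG/${dir#$DATA_DIR/}"
    done

    # 3. Remove the local snapshot once the copy is verified, to free SSD space.
    nodetool clearsnapshot -t "$TAG" "$KEYSPACE"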

Do Amazon EBS snapshots retain deleted data?

Consider the following scenario:
I have a file with sensitive information stored on an EBS-backed EC2 instance. I delete this file in the standard non-secure way (rm -f my_secret_file). Once the file is deleted, I immediately shut down the instance and take an EBS snapshot. (Or create an AMI... either one, really.)
If a malicious party was able to gain access to the snapshot and mount/boot it, could they undelete any portion of my_secret_file using the various filesystem tools available? Put another way, do the EBS snapshots retain the data that existed in "unallocated"/deleted blocks at the time?
Yes - I would be extremely surprised if they didn't. EBS snapshots are block-level snapshots, so they will capture everything regardless of the logical state of the file system, much like a hard disk image.
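If that is a concern, one rough mitigation (assuming a conventional ext4-style filesystem and tolerance for the extra I/O) is to overwrite the file, or the free space, before the snapshot is taken, so the deleted blocks no longer hold recoverable data:

    # Overwrite the file contents before removal (best effort; journaling or
    # copy-on-write filesystems may still keep old blocks elsewhere).
    shred -u -n 1 my_secret_file

    # Or, after a plain rm, fill the free space with zeros so previously
    # freed blocks are overwritten, then remove the filler file.
    dd if=/dev/zero of=/zero.fill bs=1M || true
    sync
    rm -f /zero.fill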

EBS for storing databases vs. website files

I spent the day experimenting with AWS for the first time. I've got an EC2 instance running, and I mounted an Elastic Block Store (EBS) volume to keep the MySQL databases.
Does it make sense to also put my web application files on the EBS volume, or should I just deploy them to the normal EC2 file system?
When you say your web application files, I'm not sure what exactly you are referring to.
If you are referring to your deployed code, it probably doesn't make sense to use EBS. What you want to do is create an AMI with your prerequisites, then have a script that creates an instance of that AMI and deploys your latest code (see the sketch after this answer). I highly recommend you automate and test this process, as it's easy to forget about some setting you would otherwise have to change manually.
If you are storing data files that are modified by the running application, EBS may make sense. If this is something like user-uploaded images or similar, you will likely find that S3 gives you a much simpler model.
EBS would be good for: databases, Lucene indexes, a file-based CMS, an SVN repository, or anything similar to that.
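As a hedged sketch of the launch-and-deploy script mentioned above, assuming the AWS CLI is configured and that the AMI ID, key pair, security group and repository URL are all placeholders:

    #!/usr/bin/env bash
    set -euo pipefail

    AMI_ID=ami-0123456789abcdef0            # placeholder: your pre-baked AMI
    KEY_NAME=my-keypair                     # placeholder key pair
    SECURITY_GROUP=web-servers              # placeholder security group
    REPO=https://example.com/git/myapp.git  # placeholder repository

    # Launch an instance from the pre-built AMI and deploy the latest code
    # on first boot via user data.
    aws ec2 run-instances \
      --image-id "$AMI_ID" \
      --instance-type t3.small \
      --key-name "$KEY_NAME" \
      --security-groups "$SECURITY_GROUP" \
      --user-data "#!/bin/bash
    git clone $REPO /var/www/myapp
    systemctl restart apache2"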
EBS gives you persistent storage, so if your EC2 instance fails the files still exist. Apparently there is increased I/O performance, but I would test it to be sure.
If your files are going to change frequently (like a DB does) and you don't want to keep syncing them to S3 (or somewhere else), then an EBS volume is a good way to go. If you make infrequent changes and you can sync the files manually (or with a script) as necessary, then store them in S3. If you need to shut down or you lose your instance for whatever reason, you can just pull them down when you start up the new instance.
This is also assuming that you care about cost. If cost is not an issue, using EBS is less complicated.
I'm not sure if you plan on having a separate EBS volume for your DB and your web files, but if you only plan on having one volume and you have enough empty space on it for your web files, then again, EBS is less complicated.
If it's performance you are worried about, as mentioned, it's best to test your particular app.
Our approach is to have a script pre-deployed on our AMI that fetches the latest and greatest version of the code from source control. That makes it very straightforward to launch new instances quickly, or update all running instances (we take them out of the load balancing rotation one at a time, run the script, and put them back in the rotation).
UPDATE:
Reading between the lines, it looks like you're mounting a separate EBS volume to an instance-store-backed instance. AWS recently introduced EBS-backed instances that have a ton of benefits vs. the old instance-store ones. I still mount my MySQL data on a separate EBS volume, though, so that I can easily mount it to a different server if needed.
I strongly suggest an EBS-backed instance with a separate EBS volume for the MySQL data.
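As a minimal sketch of that setup, assuming a fresh, unformatted EBS volume and a configured AWS CLI; the volume ID, instance ID, device name and mount point are placeholders:

    #!/usr/bin/env bash
    set -euo pipefail

    VOLUME_ID=vol-0123456789abcdef0   # placeholder EBS volume
    INSTANCE_ID=i-0123456789abcdef0   # placeholder instance
    DEVICE=/dev/xvdf                  # device name as seen by the instance
    MOUNT_POINT=/data/mysql

    # Attach the volume (run from a machine with AWS CLI access).
    aws ec2 attach-volume --volume-id "$VOLUME_ID" \
      --instance-id "$INSTANCE_ID" --device "$DEVICE"

    # On the instance: format once, mount, and move the MySQL data directory.
    sudo mkfs.ext4 "$DEVICE"          # only on a brand-new, empty volume!
    sudo mkdir -p "$MOUNT_POINT"
    sudo mount "$DEVICE" "$MOUNT_POINT"

    sudo systemctl stop mysql         # service may be called mysqld on some distros
    sudo rsync -a /var/lib/mysql/ "$MOUNT_POINT/"
    # Point MySQL at the new location (e.g. datadir=/data/mysql in my.cnf),
    # then restart.
    sudo systemctl start mysql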
