I have a large data set to analyze and I am planning to use Amazon EC2 for the computation. Where can I store the data so that my instances can work on it?
There is a lot of lingo in the Amazon world.
You can either store the data on an EBS drive connected to your EC2 instance, or if it is in MySQL format or a simple format, you could consider storing it on Amazon's managed MySQL service called RDS.
EC2 instances can be backed either by S3 storage or by EBS volumes. If you want rapid access to your data, choose an EC2 instance backed by Amazon Elastic Block Store (EBS). EBS gives you the flexibility to use any database or data structure you want.
We are trying to use the d2.xlarge instance type for worker nodes in a Hadoop cluster.
When I look at the AWS site
https://aws.amazon.com/ec2/instance-types/
it says HDD-based local storage.
Is that EBS storage, or is it attached to the EC2 instance?
You can attach Elastic Block Store (EBS) volumes to any Amazon EC2 instance. EBS is persistent disk storage.
Some EC2 instances also have Instance Store, which is locally attached disk storage. If the instance is stopped or terminated, the contents of Instance Store are lost.
Instance Store is popular for use with Amazon EMR because it provides very large amounts of storage for HDFS. However, please be aware that the data is lost if the cluster is terminated.
The d2.xlarge instance type has 3 x 2000 GB instance store drives, which are stored on magnetic disk. This is in addition to any EBS volumes you attach.
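To get a feel for how much HDFS space that gives a cluster, here is a rough back-of-the-envelope sketch. The drive count and size come from the spec above; the replication factor of 3 is an assumption (it is the usual HDFS default, but your cluster may be configured differently), and the function names are just illustrative.

```python
# Rough capacity arithmetic for d2.xlarge worker nodes.
DRIVES_PER_NODE = 3    # from the instance spec quoted above
GB_PER_DRIVE = 2000    # from the instance spec quoted above


def raw_instance_store_gb(nodes: int) -> int:
    """Total instance store capacity across the cluster, in GB."""
    return nodes * DRIVES_PER_NODE * GB_PER_DRIVE


def usable_hdfs_gb(nodes: int, replication: int = 3) -> float:
    """Approximate HDFS capacity after replication.

    Assumption: every block is stored `replication` times, and space
    taken by the OS or temporary files is ignored.
    """
    return raw_instance_store_gb(nodes) / replication


if __name__ == "__main__":
    print(raw_instance_store_gb(10))  # 60000
    print(usable_hdfs_gb(10))         # 20000.0
```

So ten workers would hold roughly 60 TB raw, or about 20 TB of replicated HDFS data under that assumption.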
Local means attached storage. HDD means spinning disks (not SSDs).
I do some scientific calculations that produce intermediate results on each iteration, so I think I can use spot instances to reduce the cost of processing.
How can I save the intermediate results on each iteration?
How can I automatically restart the instance from the last checkpoint when it is terminated?
When the spot price of an Amazon EC2 instance rises above your bid price, your Amazon EC2 instance is terminated. A 2-minute notice is provided via the metadata interface. You can use this notice as a trigger for saving your work, or you could simply save work at regular intervals regardless of the notice period.
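A minimal sketch of watching for that notice from inside the instance might look like the following. It polls the `spot/termination-time` metadata path, which returns 404 until an interruption is scheduled. Assumptions: the instance allows plain (IMDSv1) metadata requests, so no session token is sent; and `step` and `save` are hypothetical callbacks standing in for your own iteration and checkpoint code.

```python
import urllib.error
import urllib.request

# Metadata path that carries the spot termination time.
# It returns HTTP 404 until an interruption has been scheduled.
NOTICE_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"


def fetch_notice(url: str = NOTICE_URL, timeout: float = 2.0):
    """Return the termination timestamp as text, or None if no notice is pending."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode().strip() or None
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # no interruption scheduled
        raise
    except OSError:
        # Assumption: an unreachable metadata service (e.g. not running
        # on EC2) is treated the same as "no notice".
        return None


def run_until_interrupted(step, save, fetch=fetch_notice, max_iterations=1000):
    """Call step(i) per iteration; call save(i) once a notice appears.

    Returns the index of the first iteration that was NOT run.
    """
    for i in range(max_iterations):
        if fetch() is not None:
            save(i)
            return i
        step(i)
    save(max_iterations)
    return max_iterations
```

Passing `fetch` as a parameter keeps the loop testable off EC2, and checking once per iteration only makes sense if a single iteration is much shorter than the two-minute window.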
Do not save your work only "locally", since the Amazon EBS volumes will either be deleted (e.g. the boot volume) or disconnected (e.g. data volumes). I would recommend that you save your work in a persistent datastore, such as a database or Amazon S3.
One option would be to save files to your local disk, but use the AWS Command-Line Interface (CLI) to copy the files to Amazon S3 using the aws s3 sync command.
Then, if you have configured a persistent spot instance, simply copy the files from Amazon S3 when the new Amazon EC2 spot instance is started.
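As a rough sketch of that save-and-resume cycle: write a checkpoint file locally on each iteration, push the directory to S3 with `aws s3 sync`, and pull it back when a replacement instance boots. The file name `checkpoint.json`, the function names, and the bucket URI in the note below are all made up for illustration; the code also assumes the AWS CLI is installed and has credentials.

```python
import json
import subprocess
from pathlib import Path


def save_checkpoint(directory: Path, iteration: int, state: dict) -> Path:
    """Write the checkpoint via a temp file so an interruption cannot leave it half-written."""
    directory.mkdir(parents=True, exist_ok=True)
    tmp = directory / "checkpoint.json.tmp"
    final = directory / "checkpoint.json"
    tmp.write_text(json.dumps({"iteration": iteration, "state": state}))
    tmp.replace(final)  # rename within the same directory
    return final


def load_checkpoint(directory: Path):
    """Return (next_iteration, state), or (0, {}) when no checkpoint exists yet."""
    path = directory / "checkpoint.json"
    if not path.exists():
        return 0, {}
    data = json.loads(path.read_text())
    return data["iteration"] + 1, data["state"]


def sync_command(source: str, destination: str) -> list:
    """Build the `aws s3 sync` invocation; argument order sets the direction."""
    return ["aws", "s3", "sync", source, destination]


def push(directory: Path, s3_uri: str) -> None:
    """Copy new or changed local files up to S3."""
    subprocess.run(sync_command(str(directory), s3_uri), check=True)


def pull(directory: Path, s3_uri: str) -> None:
    """Copy files from S3 down to the local directory (run at instance start)."""
    subprocess.run(sync_command(s3_uri, str(directory)), check=True)
```

On startup you would call something like `pull(Path("work"), "s3://my-bucket/checkpoints")` followed by `load_checkpoint(Path("work"))`, then call `save_checkpoint` and `push` after each iteration (or every few iterations if the uploads are slow).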
See:
Spot Instance Interruptions
Basically, I'm on the free tier with a single t1.micro instance. I want to use the Wikipedia dump file public data set.
Would Amazon charge me if I'm processing some 2-4 GB of data in my instance from that dataset?
Data transferred into the AWS network is free; you are charged only when data moves out of the AWS network.
In order to use a public dataset, you need to create an EBS volume from the corresponding snapshot and attach it to your instance. So you will be charged for storage when using an AWS public dataset, since the EBS volume counts toward your usage, if I'm not mistaken.
The way I see it, Amazon is simply hosting the full datasets on their servers, but if you want to use them you will most likely be charged because you're using EBS as your storage.
source
I am about to migrate a large web project (many sites using common data) to EC2 and I am wondering what the best setup would be (I am very much a newbie with Amazon AWS).
The site pages are rebuilt by scripts once a week and the resulting static pages are served (currently about 7 to 10k views a day). In between the weekly builds I would like to access the DB to add or edit data.
I am thinking either EC2 + RDS or EC2 and S3 (S3 having the advantage of keeping a copy of the static pages too). Do these options sound reasonable, based on what I have mentioned?
Thanks in advance
We're using EC2 (we experimented with a few instance types and learned that CPU extra large worked best for our type of application), and rather than using RDS we use EBS extensively:
one EBS volume for the running code and one EBS volume that holds the MySQL database files.
S3 is used mostly for incremental backups, since an EBS volume can easily be mounted on any other instance.
How is an Amazon RDS DB instance different from a normal EC2 instance, other than the fact that the RDS DB instance has a database server running on it?
When an EC2 instance goes down, all the data associated with it also vanishes (when you don't attach an EBS volume). Is this true for an RDS DB instance as well?
I have already set up my database server as follows: one small instance (m1.small) with MySQL, with a 10 GB EBS volume attached and the MySQL data directories pointed at the EBS volume.
Is the small RDS instance any different from the above?
An RDS db instance can be configured to not lose any data during downtime, either planned or unplanned. For unplanned downtime, AWS keeps transaction logs which are replayed automatically on a failover instance. These logs can also be used to get an instance to a specific point in time.
For planned downtime, you create a DB snapshot prior to stopping the instance, and can later start a new instance with the saved snapshot.
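If you script this with boto3, the calls involved would look roughly like the sketch below. The `rds` argument is assumed to be a client from `boto3.client("rds")`; the wrapper function names and all identifiers you pass in are placeholders. The snapshot has to finish before you delete or restore anything, and that waiting step is left out here.

```python
def snapshot_before_stop(rds, instance_id: str, snapshot_id: str) -> None:
    """Request a manual DB snapshot of the running instance."""
    rds.create_db_snapshot(
        DBInstanceIdentifier=instance_id,
        DBSnapshotIdentifier=snapshot_id,
    )


def start_from_snapshot(rds, new_instance_id: str, snapshot_id: str) -> None:
    """Create a new DB instance whose data comes from the saved snapshot."""
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=new_instance_id,
        DBSnapshotIdentifier=snapshot_id,
    )


def restore_to_latest(rds, source_id: str, target_id: str) -> None:
    """Point-in-time restore into a new instance, using the latest restorable time."""
    rds.restore_db_instance_to_point_in_time(
        SourceDBInstanceIdentifier=source_id,
        TargetDBInstanceIdentifier=target_id,
        UseLatestRestorableTime=True,
    )
```

Note that both restore calls create a new DB instance rather than overwriting the old one, so the endpoint your application connects to will change unless you reuse the identifier.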
RDS is a managed MySQL service, which means you only start it and load data into it and you're ready to go.
Is the small RDS instance any different from the above?
The small RDS instance is 64-bit and supports Multi-AZ failover, and its pricing is naturally a little higher than running MySQL on EC2.
MySQL on EC2 needs more administration, but you can set it up to do replication and customize it for better performance than RDS.
See also http://www.dotdeb.org/2010/05/04/mysql-on-amazon-benchmarks-rds-vs-ec2/