Are there Data Transfer Charges for using Public Data Sets in Amazon AWS? - hadoop

Basically, I'm in a free tier with my single t1.micro instance. I want to use the Wikipedia dump file public data set.
Would Amazon charge me if I'm processing some 2-4 GB of data in my instance from that dataset?

Data transferred into the AWS network is free; you are charged only when data moves out of the AWS network.

To use a public dataset, you need to create an EBS volume from the corresponding public snapshot and attach it to your instance. So you will be charged for storage when using an AWS public dataset, since your EBS volume counts toward your usage, if I'm not mistaken.
The way I see it, Amazon is simply hosting the full datasets on their servers, but if you want to use them you will most likely be charged because you're using EBS as your storage.


Does the d2.xlarge instance have EBS storage?

We are trying to use the d2.xlarge instance type for worker nodes in a Hadoop cluster.
When I look at the AWS site, it says "HDD-based local storage".
Is that EBS storage, or is it attached to the EC2 instance?
You can attach Elastic Block Store (EBS) volumes to any Amazon EC2 instance. EBS is persistent disk storage.
Some EC2 instances also have Instance Store, which is locally attached disk storage. If the instance is stopped or terminated, the contents of Instance Store are lost.
Instance Store is popular for use with Amazon EMR because it provides very large amounts of storage for HDFS. However, please be aware that the data is lost if the cluster is terminated.
The d2.xlarge instance type has 3 x 2000 GB instance store drives, which are magnetic disks. This is in addition to any EBS volumes you attach.
"Local" means storage attached directly to the instance. "HDD" means spinning disks (not SSDs).

How to save intermediate results on Amazon EC2 when a spot instance is used?

I do some scientific calculations that produce intermediate results on each iteration, so I think I can use a spot instance to reduce the cost of processing.
How can I save the intermediate results on each iteration?
How can I automatically rerun the instance from the last checkpoint when it is terminated?
When the spot price of an Amazon EC2 instance rises above your bid price, your Amazon EC2 instance is terminated. A 2-minute notice is provided via the metadata interface. You can use this notice as a trigger for saving your work, or you could simply save work at regular intervals regardless of the notice period.
Do not rely on work saved only "locally", since the Amazon EBS volumes will either be deleted (e.g. the boot volume) or disconnected (e.g. data volumes). I would recommend that you save your work in a persistent datastore, such as a database or Amazon S3.
One option would be to save files to your local disk, but use the AWS Command-Line Interface (CLI) to copy the files to Amazon S3 using the aws s3 sync command.
Then, if you have configured a persistent spot instance, simply copy the files from Amazon S3 when the new Amazon EC2 spot instance is started.
See also: Spot Instance Interruptions (AWS documentation)
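The checkpoint-and-resume idea from this answer could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the function names, the JSON state layout, and the `step` callback are all assumptions made for the example. Only the `aws s3 sync` command comes from the answer itself, and the S3 URI you pass to it would be your own bucket.

```python
import json
import os
import subprocess
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so an interruption never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def sync_to_s3(local_dir, s3_uri):
    """Copy the checkpoint directory to S3 using the AWS CLI (must be installed and configured)."""
    subprocess.run(["aws", "s3", "sync", local_dir, s3_uri], check=True)


def run(checkpoint_path, total_iterations, step, after_save=None):
    """Resume from the last checkpoint and save after every iteration.

    step is a hypothetical function that advances the calculation by one iteration.
    after_save is an optional hook, e.g. a call to sync_to_s3.
    """
    state = load_checkpoint(checkpoint_path, {"iteration": 0, "value": 0})
    while state["iteration"] < total_iterations:
        state["value"] = step(state["value"])
        state["iteration"] += 1
        save_checkpoint(checkpoint_path, state)
        if after_save is not None:
            after_save()
    return state
```

On a freshly started spot instance you would first copy the checkpoint directory back from S3 (again with aws s3 sync, in the other direction) and then call run with the same checkpoint path, so the loop continues from the saved iteration rather than from zero.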

Amazon EC2 - Automatic Restore Snapshot

I'm looking to set up a demo environment in Amazon that consists of a pre-configured EC2 image that resets itself back to a snapshot configuration every hour. This would be a Linux VM.
What would be the best way to go about doing this in EC2? Does Amazon offer any tools for scheduling and reverting to the snapshot or would this need to be done from a third party VM or software?
There is no VMware-like 'snapshot' functionality in Amazon EC2 (where you can roll back to a point in time).
The network-attached disk storage system used with Amazon EC2 is called Amazon Elastic Block Store (EBS). While EBS does have a 'snapshot' function, this actually takes a backup of an EBS Volume and stores it in Amazon S3. The snapshot can then be used to create a new EBS volume, which will contain the same contents as the original disk at the time the snapshot was created.
One option would be to launch a new Amazon EC2 instance, which will automatically create a new boot disk from the indicated Amazon Machine Image (AMI). This is the way to launch new machines with the same disk content. However, this might not lend itself well to your "reset every hour" requirement, since it requires a new machine to be started, which will also trigger a new hourly billing cycle.
You might be able to script the deletion of files or the reload of some database tables, but this will depend upon your particular system and applications.
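As a rough illustration of the "script the deletion of files" approach, the sketch below restores a working directory from a pristine copy kept elsewhere on the same disk. The function name and both directory arguments are hypothetical; which files or database tables actually need resetting depends entirely on your application.

```python
import shutil
from pathlib import Path


def reset_to_baseline(baseline_dir, live_dir):
    """Discard everything in live_dir and restore it from baseline_dir."""
    baseline = Path(baseline_dir)
    live = Path(live_dir)
    if not baseline.is_dir():
        # Refuse to wipe the live directory if there is nothing to restore from.
        raise FileNotFoundError(f"baseline directory not found: {baseline}")
    if live.exists():
        shutil.rmtree(live)
    shutil.copytree(baseline, live)
```

A script calling this function could be run from cron once an hour on the Linux instance, which avoids launching a new machine for every reset.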

Where is the data stored by using Amazon EC2?

I have a large set of data to be analyzed and I am planning to use Amazon EC2 for the computation. So I am wondering where I can store the data for computing.
There is a lot of lingo in the Amazon world.
You can either store the data on an EBS volume attached to your EC2 instance or, if it is in MySQL format or a simple format, consider storing it on Amazon RDS, Amazon's managed relational database service, which supports MySQL.
EC2 instances can be backed either by S3 storage or by EBS volumes. If you want rapid access to your data, you will need to choose an EC2 instance backed by Amazon Elastic Block Store (EBS). EBS gives you the flexibility to use any database or data structure you want.

How do multiple EC2 instances (scaling) work with one EBS volume for data storage?

So, in a simple situation, if there is only one instance, I can store the data on an EBS volume mounted on that instance, e.g. at /mnt/db.
However, how does it work if I scale and have multiple instances (with either static or dynamic scaling)?
Because one EBS volume can only attach to one instance, does having multiple instances mean that I have to attach an EBS volume to each instance? If that's the case, the data on each instance's EBS volume will be different.
Obviously, I want all instances to access (read and write) a single volume as the data store. The data in the volume will grow constantly, and there can be no downtime.
What is the solution? Is there a way to access the data without mounting the device (EBS), just by calling it?
Here is what I can think of:
1) If each instance has its own EBS volume, then at each time interval (e.g. 1 hour) every instance unmounts and detaches its EBS volume and attaches a new one. One powerful instance then mounts all the volumes that were just detached and aggregates the data.
2) Similar to 1), but instead of detaching and attaching, I just take a snapshot of every instance's volume. The powerful instance then aggregates the data from the snapshots and saves the result to either another EBS volume or S3.
These two approaches seem workable, but they require a lot of work. Is there a smarter way to approach this problem? Thanks.
By the way, because of performance issues, I cannot have the instances write data to S3. :)
Oh, how about this:
3) First, all instances have their own EBS volume and write data to it. Then each hour the data is sent to S3, and another instance aggregates it.
How about having an NFS instance that the other instances can mount?
It seems that you need to create an EBS snapshot of your most up-to-date EC2 instance. This will create an EBS-backed AMI. You would then need to terminate all your EC2 instances that are not up to date and launch a new stack of instances from your newly created AMI. If you have a load balancer running, you would also have to attach these new instances to it.
It seems a little long-winded, but it can all be done programmatically. At least, this is how I think scaling in the cloud with Amazon works as far as propagating changes across multiple instances goes. Somebody with more experience should verify this. I plan to test it out myself later on.
