I want to work with an EBS snapshot in an EMR job. Because the mapper reads from the snapshot, I want the snapshot mounted on every node. Is there an easy way to do that other than logging in to each node? I suppose I could make mounting it the first step of my MapReduce job, but that seems wrong. Is there an easier way to do it?
It is possible, but you'll have to jump through some hoops to get it to work, assuming you have a recipe to create an EBS volume from the EBS snapshot in a shell script. EMR provides bootstrap actions, which are just shell scripts you can create and run. Bootstrap actions run before any jobs (steps, in EMR terms) are allowed to run.
Here are the steps you need to have your shell script perform:
Create a new EBS volume based on your snapshot. The aws binary is installed on all EMR instances, so that's your best bet. Assuming you know the snapshot id, this should be straightforward:
http://docs.aws.amazon.com/cli/latest/reference/ec2/create-volume.html
Make sure you set the DeleteOnTermination attribute on the attachment, so the volume is deleted when the node terminates (this is an attribute of the attachment, not an option of create-volume itself).
You will need to parse the response to get the EBS volume id.
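A minimal sketch of this first part, assuming the snapshot ID is baked into the script (snap-0123456789abcdef0 is a placeholder), the node's IAM role has the needed EC2 permissions, and the installed aws CLI is new enough to do the response parsing for you with --query:
#!/bin/bash
# Placeholder snapshot ID - replace with your own
SNAPSHOT_ID=snap-0123456789abcdef0
# The volume must be created in the same availability zone as this node
AZ=$(wget -q -O - http://instance-data/latest/meta-data/placement/availability-zone)
# Create the volume and capture its ID from the response
VOLUME_ID=$(aws ec2 create-volume \
    --snapshot-id "$SNAPSHOT_ID" \
    --availability-zone "$AZ" \
    --query 'VolumeId' --output text)
# Block until the volume is ready to be attached
aws ec2 wait volume-available --volume-ids "$VOLUME_ID"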
Attach the volume you just created (using the EBS volume id) to the current instance:
http://docs.aws.amazon.com/cli/latest/reference/ec2/attach-volume.html
To get the current instance id, use the metadata service:
wget -q -O - http://instance-data/latest/meta-data/instance-id
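Continuing the sketch, attaching the volume, flagging the attachment for delete-on-termination, and mounting it might look like this (the /dev/sdf device name and the /mnt/snapshot mount point are assumptions; the kernel on the node may expose /dev/sdf as /dev/xvdf):
# Attach the new volume to this node
INSTANCE_ID=$(wget -q -O - http://instance-data/latest/meta-data/instance-id)
aws ec2 attach-volume --volume-id "$VOLUME_ID" --instance-id "$INSTANCE_ID" --device /dev/sdf
# Set DeleteOnTermination on the attachment so the volume dies with the node
aws ec2 modify-instance-attribute --instance-id "$INSTANCE_ID" \
    --block-device-mappings "[{\"DeviceName\":\"/dev/sdf\",\"Ebs\":{\"DeleteOnTermination\":true}}]"
# Wait for the attachment to complete, then mount the filesystem
aws ec2 wait volume-in-use --volume-ids "$VOLUME_ID"
sudo mkdir -p /mnt/snapshot
sudo mount /dev/xvdf /mnt/snapshot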
Once you have your shell script, you need to upload it to S3, and then add that script as a bootstrap action to your cluster:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html
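Adding the script when launching the cluster might look something like this (the bucket, script name, release label, and instance settings are all placeholders):
aws emr create-cluster --name "snapshot-cluster" \
    --release-label emr-5.36.0 \
    --use-default-roles \
    --instance-type m5.xlarge --instance-count 3 \
    --bootstrap-actions Path=s3://my-bucket/mount-snapshot.sh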
Also beware: you will be charged for each EBS volume you create, so make sure the delete-on-termination logic is set up properly!
I am a data engineer with experience in designing and creating data integration and ELT processes. Below is my use case; I need to move my process to AWS and would like your opinion.
My files to be processed are in S3, and I need to process them using Hadoop. I have existing logic written in Hive and just need to migrate it to AWS. Is the approach below correct/feasible?
1. Spin up a fleet of EC2 instances, initially say 5, and enable autoscaling.
2. Create an EFS and mount it on the EC2 instances.
3. Copy the files from S3 to EFS as Hadoop tables.
4. Run Hive queries on top of the data in EFS and create new tables.
5. Once the process is completed, move/export the final reports table from EFS to S3 (somehow). I am not sure whether this is possible; if it is not, this entire solution is not feasible.
6. Terminate the EFS and EC2 instances.
If the above method is correct, how does the Hadoop orchestration happen using EFS?
Thanks,
KR
Spin up a fleet of EC2 instances, initially say 5, and enable autoscaling.
I'm not sure you need the autoscaling. Why? Let's say you start a "big" query which takes lots of time and CPU. Autoscaling will start more instances, but how would it run a "fraction" of the already-running query on the new machines? All machines need to be ready before you run the query, so keep that in mind. Or in other words: only the machines that are available now will handle the query.
Copy the files from S3 to EFS as Hadoop tables.
There isn't any problem with this idea. Just keep in mind that you can keep the data in EFS.
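Since a mounted EFS filesystem looks like any local path, both the copy in (your step 3) and the export back to S3 (your step 5, which you were unsure about) can be plain aws s3 commands; a sketch, with placeholder bucket names and mount point:
# Copy the input files from S3 into the mounted EFS
aws s3 sync s3://my-input-bucket/raw/ /mnt/efs/warehouse/raw/
# ... run the Hive queries ...
# Export the final reports back to S3
aws s3 sync /mnt/efs/warehouse/reports/ s3://my-output-bucket/reports/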
If EFS is too pricey for you, check the options for provisioning EBS magnetic volumes with RAID 0. You will get great speeds at minimal cost.
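A sketch of the RAID 0 idea, assuming two EBS volumes are already attached to the instance as /dev/xvdf and /dev/xvdg (device names and mount point are placeholders):
# Stripe the two volumes into a single RAID 0 array
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdf /dev/xvdg
# Create a filesystem on the array and mount it
sudo mkfs -t ext4 /dev/md0
sudo mkdir -p /data
sudo mount /dev/md0 /data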
The rest is okay, and this is one of the ways to do "on demand" interactive analytics.
Please take a look at AWS Athena. It's a service which allows you to run queries directly on S3 objects. You can use JSON and even Parquet (which is much more efficient!). This service may be enough for your needs.
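As a rough sketch of what that looks like from the CLI (the database, table, and bucket names here are made up):
# Submit a query; results are written to the given S3 location
aws athena start-query-execution \
    --query-string "SELECT report_date, count(*) FROM mydb.reports GROUP BY report_date" \
    --result-configuration "OutputLocation=s3://my-bucket/athena-results/"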
Good luck!
I'm looking to set up a demo environment in Amazon that consists of a pre-configured EC2 image that resets itself back to a snapshot configuration every hour; this would be a Linux VM.
What would be the best way to go about doing this in EC2? Does Amazon offer any tools for scheduling and reverting to the snapshot or would this need to be done from a third party VM or software?
There is no VMWare-like 'snapshot' functionality in Amazon EC2 (where you can roll-back to a point-in-time).
The network-attached disk storage system used with Amazon EC2 is called Amazon Elastic Block Store (EBS). While EBS does have a 'snapshot' function, this actually takes a backup of an EBS Volume and stores it in Amazon S3. The snapshot can then be used to create a new EBS volume, which will contain the same contents as the original disk at the time the snapshot was created.
One option would be to launch a new Amazon EC2 instance, which will automatically create a new boot disk from the indicated Amazon Machine Image (AMI). This is the way to launch new machines with the same disk content. However, this might not lend itself well to your "revert every hour" requirement, since it requires a new machine to be started, which will also trigger a new hourly billing cycle.
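If you did want to go that route anyway, a sketch of an hourly reset, e.g. driven by cron on a separate controller machine (the AMI ID, instance ID, and instance type are placeholders):
# Throw away the current demo instance and start a fresh one from the AMI
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
aws ec2 run-instances --image-id ami-0123456789abcdef0 \
    --instance-type t3.micro --count 1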
You might be able to script the deletion of files or the reload of some database tables, but this will depend upon your particular system and applications.
I have several tasks I'm performing on AWS EMR which don't share data, and I would like to use the same EMR cluster to perform them one after another. Is there a way to clean a running EMR cluster back to its initial state (remove Hive tables, clean all HDFS files, etc.) to avoid collisions between their data?
I want to reuse the EMR cluster for several reasons:
Creating a new EMR cluster can take 5-10 minutes.
My tasks are relatively short, 20-25 minutes.
Once an EMR cluster is created, you are already paying for the full hour.
We didn't find a "quick and clean" API to achieve this behaviour. Instead, we settled on a simple working methodology that guarantees we can clean all the data:
We work in a specific database instead of the default one.
We put all our internal data files under a specific location in HDFS.
So every time a task starts, it first drops this specific database if it exists, recreates it, and recursively deletes all the data under the specific location in HDFS.
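A sketch of that cleanup step, run at the start of every task (the database name and HDFS path are our own examples, not anything EMR-specific):
# Drop and recreate the task-specific Hive database
hive -e "DROP DATABASE IF EXISTS task_db CASCADE; CREATE DATABASE task_db;"
# Recursively delete and recreate the task-specific HDFS location
hdfs dfs -rm -r -f /task-data
hdfs dfs -mkdir -p /task-data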
As I understand it, for an EBS-backed EC2 instance, its root device will be an EBS volume. Now, if I want the content of that EBS volume to be a snapshot that I took earlier (of the root device of another EBS-backed EC2 instance), how can I do that?
The short version is that you find the snapshot in the AWS management console, click the Launch button, and follow the steps in the wizard (to e.g. select availability zone).
There is a detailed walk-through here:
http://www.techrepublic.com/blog/datacenter/how-to-create-a-new-ami-from-a-snapshot-and-launch-a-new-vm/5349
This can also be done a number of other ways, including
From the command line / a script (see the sketch after this list)
Programmatically through the API
Automatically e.g. using Auto Scaling
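For example, the command-line route might look like this (the snapshot ID, device name, AMI name, and instance type are placeholders; this assumes an HVM image with its root device on /dev/xvda):
# Register an AMI whose root device is created from the snapshot
AMI_ID=$(aws ec2 register-image \
    --name "restored-from-snapshot" \
    --virtualization-type hvm \
    --root-device-name /dev/xvda \
    --block-device-mappings "[{\"DeviceName\":\"/dev/xvda\",\"Ebs\":{\"SnapshotId\":\"snap-0123456789abcdef0\"}}]" \
    --query 'ImageId' --output text)
# Launch a new instance from that AMI
aws ec2 run-instances --image-id "$AMI_ID" --instance-type t3.micro --count 1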
I have an EC2 instance set up where the MySQL DB is mounted on a separate volume.
(as detailed in http://aws.amazon.com/articles/1663 )
I want to duplicate this instance setup so that the application servers on the duplicated instances share the DB volume attached to the already-running EC2 instance. (I can specify the MySQL IP through a configuration file.)
Since almost everything in the setup except the MySQL IP is identical, I'd like to create an AMI from the first instance and slightly modify it to create the 2nd and 3rd instances.
The question is whether the mount information stored in the first instance will take effect when I launch the 2nd instance.
To elaborate on the question:
1. I read that a volume cannot be attached to more than one EC2 instance at the same time.
2. The running instance attaches/mounts a volume to itself on startup (or so it seems).
3. If I were to create an AMI from the first instance and use it to launch other instances, how would the auto attach/mount information (which, I assume, will be stored in the AMI) affect the other instances?
Eugene,
Mounting the same device on several servers is not possible, so you had better forget about this option.
The best solution is to:
Create a copy of your master instance.
Detach the data volume from the copy. We are going to create an image from this new instance, and you don't want a useless copy of the drive to be re-created every time.
Change the settings that you need to change in order to make this server rely on the remote (master) MySQL server.
Once you are satisfied with the outcome, create an image from this instance (see the sketch below).
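A sketch of the detach and imaging steps from the CLI (the volume ID, instance ID, and image name are placeholders):
# Detach the copied data volume before imaging the instance
aws ec2 detach-volume --volume-id vol-0123456789abcdef0
# Create the AMI from the cleaned-up instance
aws ec2 create-image --instance-id i-0123456789abcdef0 \
    --name "app-server-template" --no-reboot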
Good luck!
Dotan