ETL process in AWS using EC2 instances and EFS - Hadoop

I am a data engineer with experience in designing and creating data integration and ELT processes. Below is my use case; I need to move my process to AWS and would like your opinion.
My files to be processed are in S3. I need to process those files using Hadoop. I have existing logic written in Hive and just need to migrate it to AWS. Is the below approach correct/feasible?
1. Spin up a fleet of EC2 instances, initially say 5, and enable autoscaling.
2. Create an EFS file system and mount it on the EC2 instances.
3. Copy the files from S3 to EFS as Hadoop tables.
4. Run Hive queries on top of the data in EFS and create new tables.
5. Once the process is complete, move/export the final reports table from EFS to S3 (somehow). I am not sure whether this is possible; if it is not, this entire solution is not feasible (a rough sketch of what I have in mind follows below).
6. Terminate the EFS file system and the EC2 instances.
If the above method is correct, how does the Hadoop orchestration happen using EFS?
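For step 5, I imagine something like the following could work; a rough boto3 sketch (the EFS path and bucket name are placeholders, not real values):

import os
import boto3

s3 = boto3.client("s3")

REPORTS_DIR = "/mnt/efs/warehouse/reports"   # assumed EFS mount path
BUCKET = "my-reports-bucket"                 # assumed target bucket

# Walk the report table's files on EFS and upload each one to S3
for root, _dirs, files in os.walk(REPORTS_DIR):
    for name in files:
        local_path = os.path.join(root, name)
        key = os.path.relpath(local_path, REPORTS_DIR)
        s3.upload_file(local_path, BUCKET, f"reports/{key}")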
Thanks,
KR

Spin up a fleet of EC2 instances, initially say 5, and enable autoscaling.
I'm not sure you need the autoscaling.
Why? Say you start a "big" query that takes a lot of time and CPU.
Autoscaling will start more instances, but how will it run a fraction of the query on the new machines?
All machines need to be ready before you run the query; keep that in mind.
In other words: only the machines that are available now will handle the query.
Copy the files from S3 to EFS as Hadoop tables.
There isn't any problem with this idea.
Just keep in mind that you can keep the data in EFS.
If EFS is too pricey for you, check the options for provisioning EBS magnetic volumes with RAID 0.
You will get good speeds at minimal cost.
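For the copy in step 3, a minimal boto3 sketch that pulls a prefix from S3 down to the EFS mount (the bucket, prefix, and mount path are assumptions):

import os
import boto3

s3 = boto3.client("s3")

BUCKET = "my-input-bucket"        # assumed source bucket
PREFIX = "raw/"                   # assumed key prefix
EFS_DIR = "/mnt/efs/warehouse"    # assumed EFS mount path

# Page through the objects under the prefix and download each to EFS
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith("/"):   # skip zero-byte "folder" markers
            continue
        dest = os.path.join(EFS_DIR, obj["Key"])
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        s3.download_file(BUCKET, obj["Key"], dest)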
The rest is okay, and this is one of the ways to do "on demand" interactive analytics.
However, please take a look at AWS Athena.
It's a service that allows you to run queries directly on S3 objects.
You can use JSON and even Parquet (which is much more efficient!).
This service may be enough for your needs.
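For a sense of what that looks like, a minimal boto3 sketch (the database, table, and output location are assumptions):

import boto3

athena = boto3.client("athena")

# Start a query against a table defined over S3 objects
resp = athena.start_query_execution(
    QueryString="SELECT report_date, total FROM reports LIMIT 10",
    QueryExecutionContext={"Database": "analytics"},                 # assumed database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # assumed bucket
)

# Results land in the output location; poll/fetch them via the API once finished
print(resp["QueryExecutionId"])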
Good luck!

Related

How to update a Laravel project in AWS Elastic Beanstalk while keeping the same storage

I want to update my Laravel project in AWS Elastic Beanstalk, but the problem is that the storage in Elastic Beanstalk is now different, and I want to keep it. I don't know how, because my project contains a storage folder, but it's empty, and if I update it, I'll lose all the files.
How can I update the code but keep the storage?
Your application should be designed to be stateless. The reason is that your EB instances always run in an Auto Scaling group.
This means that they can be terminated and replaced at any time, without your knowledge or involvement. There are many scenarios under which that may happen: Availability Zone re-balancing, migration to new physical hardware, scaling in and out, or instance health degradation.
Consequently, you are always at risk of losing your storage, whether you like it or not.
Therefore your application should be designed to be stateless, meaning that it does not store any data on the instance. This is usually achieved by storing the data in external storage such as EFS:
How can I mount an Amazon EFS volume to an instance in my Elastic Beanstalk environment?
But if you still want to keep your design, you can always use .ebextensions scripts to help you preserve the storage folder. Specifically, in commands you would make a copy of your storage folder to a safe location at the start of the new deployment. Then in container_commands you would copy the files back into your new application folder, just before the new deployment completes.
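A minimal sketch of that idea (the paths are assumptions; adjust them to your environment):

commands:
  01_backup_storage:
    command: cp -a /var/app/current/storage /tmp/storage-backup
    ignoreErrors: true    # first deploy: the folder may not exist yet

container_commands:
  01_restore_storage:
    # container_commands run from the staging dir of the new version,
    # so a relative path targets the incoming application folder
    command: cp -a /tmp/storage-backup/. storage/
    ignoreErrors: true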

Read data from Amazon HBase

Can anyone tell me whether I can read data from Amazon HBase using org.apache.hadoop.conf.Configuration and org.apache.hadoop.hbase.client.HTablePool?
We are migrating to Amazon's EMR framework, with HBase running on top of it.
The present implementation is based on pure Apache Hadoop and HBase distributions. I'm trying to verify that no code changes are needed even if we migrate to Amazon's EMR.
Please share your thoughts.
While it should not be a problem in principle, I would expect issues and changes related to the nature of EC2 and its networking.
HBase relies on RegionServers being able to renew their leases in a timely manner. If RegionServers are too busy, because of some massive operations running on them, they cannot do so and get kicked out of the cluster.
On Amazon, the performance of EC2 instances is much less predictable than in a dedicated cluster (unless you use cluster instances), so adjusting timeout parameters and/or the nature of your loads might be needed to get the cluster to work properly.
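For illustration, the kind of timeout adjustment meant here might look like this in hbase-site.xml (the property names are standard HBase settings; the values are assumptions to tune for your load):

<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>  <!-- give busy RegionServers more time to renew their lease -->
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <value>120000</value>  <!-- tolerate slower RPCs on less predictable EC2 hardware -->
</property>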

How do multiple EC2 instances (scaling) work on one EBS volume for data storage?

So, in a simple situation, if there is only one instance, I can store the data on an EBS volume mounted on that instance, e.g. /mnt/db.
However, how does it work if I scale out and have multiple instances (either static or dynamic scaling)?
Because one EBS volume can only be attached to one instance, if I have multiple instances, does it mean that I have to attach an EBS volume to each instance? If that's the case, the data on each instance's EBS volume will be different.
Obviously, I want all instances to read from and write to a single volume (as data storage), and the data in the volume will constantly grow, with no downtime.
What is the solution? Is there a way to access the data without mounting the device (EBS)?
Here is what I can think of:
1) If each instance has its own EBS volume, then at each time interval (e.g. one hour), all instances unmount and detach their EBS volume and attach a new one. Then one powerful instance mounts all the volumes just detached and aggregates the data.
2) Or, similar to 1), instead of detaching and attaching, I just take a snapshot of all volumes for all instances. Then the powerful instance aggregates the data from the snapshots and saves the result to either another EBS volume or S3.
These two approaches seem workable, but they require a lot of work. Is there a smarter way to approach this problem? Thanks.
By the way, because of performance issues, I cannot have the instances write data directly to S3. :)
Oh, how about this:
3) All instances have their own EBS volume and write data to it. Then, each hour, the data is sent to S3, and another instance aggregates it.
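For options 1) and 3), I assume the detach/attach mechanics could be scripted with boto3, something like this (the volume and instance IDs are placeholders):

import boto3

ec2 = boto3.client("ec2")

VOLUME_ID = "vol-0123456789abcdef0"     # placeholder volume
AGGREGATOR = "i-0aaaaaaaaaaaaaaa1"      # placeholder "powerful" instance

# Detach the volume from its current instance (unmount the filesystem first!)
ec2.detach_volume(VolumeId=VOLUME_ID)
ec2.get_waiter("volume_available").wait(VolumeIds=[VOLUME_ID])

# Attach it to the aggregator instance on a free device name
ec2.attach_volume(VolumeId=VOLUME_ID, InstanceId=AGGREGATOR, Device="/dev/sdf")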
Or how about having an NFS instance that can be mounted by the other instances?
It seems that you need to create an EBS snapshot of your most up-to-date EC2 instance. This will create an EBS-backed AMI. You would then need to terminate all your EC2 instances that are not up to date and launch a new stack of instances from your newly created AMI. If you had a load balancer running, you would have to attach these new instances to your load balancer as well.
It seems a little long-winded, but it can all be done programmatically. At least this is how I think scaling in the cloud with Amazon works as far as propagating changes across multiple instances goes. Somebody else with more experience should verify this. I plan to test it out myself later on.
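A rough sketch of that flow with boto3 (the instance IDs, AMI name, instance type, and counts are placeholders, not a drop-in script):

import boto3

ec2 = boto3.client("ec2")

# 1. Create an EBS-backed AMI from the most up-to-date instance
#    (note: by default this reboots the source instance)
image = ec2.create_image(InstanceId="i-0123456789abcdef0", Name="app-snapshot-v2")

# Wait until the AMI is available before launching from it
ec2.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])

# 2. Launch replacement instances from the new AMI
ec2.run_instances(ImageId=image["ImageId"], InstanceType="t3.medium",
                  MinCount=2, MaxCount=2)

# 3. Terminate the out-of-date instances (then re-register the new
#    ones with the load balancer, if you have one)
ec2.terminate_instances(InstanceIds=["i-0aaaaaaaaaaaaaaa1", "i-0bbbbbbbbbbbbbbb2"])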

How to rsync to all Amazon EC2 servers?

I have a Scalr EC2 cluster, and want an easy way to synchronize files across all instances.
For example, I have a bunch of files in /var/www on one instance, and I want to be able to identify all of the other hosts and then rsync to each of those hosts to update their files.
ls /etc/aws/hosts/app/
returns the IP addresses of all of the other instances
10.1.2.3
10.1.33.2
10.166.23.1
Ideas?
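For concreteness, the loop I have in mind would be something like this (assuming each entry under /etc/aws/hosts/app/ is named by an IP and SSH keys are already set up between the hosts):

import os
import subprocess

HOSTS_DIR = "/etc/aws/hosts/app"
SRC = "/var/www/"   # trailing slash: sync the contents, not the directory itself

# The directory listing itself gives the peer IPs, as shown above
ips = os.listdir(HOSTS_DIR)

# Push the local tree to every peer
for ip in ips:
    subprocess.run(["rsync", "-az", "--delete", SRC, f"{ip}:{SRC}"], check=True)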
As Zach said, you could use S3.
You could download one of the many clients out there for mapping drives to S3 (search for S3 and WebDAV).
If I were going to go this route, I would set up an S3 bucket with all my shared files and use JetS3t in a cron job to sync each node's local drive to the bucket (pulling down S3 bucket updates). Then, since I normally use Eclipse and Ant for building, I would create an Ant job for deploying updates to the S3 bucket (pushing updates up to the S3 bucket).
From http://jets3t.s3.amazonaws.com/applications/synchronize.html
Usage: Synchronize [options] UP <S3Path> <File/Directory>
(...)
or: Synchronize [options] DOWN
UP : Synchronize the contents of the Local Directory with S3.
DOWN : Synchronize the contents of S3 with the Local Directory
...
I would recommend the above solution, if you don't need cross-node file locking. It's easy and every system can just pull data from a central location.
If you need more cross-node locking:
An ideal solution would be to use IBM's GPFS, but IBM doesn't just give it away (at least not yet). Even though it's designed for high-performance interconnects, it also has the ability to be used over slower connections. We used it as a replacement for NFS and it was amazingly fast (about 3 times faster than NFS). There may be something similar that is open source, but I don't know. EDIT: OpenAFS may work well for building a clustered filesystem over many EC2 instances.
Have you evaluated using NFS? Maybe you could dedicate one instance as an NFS host.

EBS for storing databases vs. website files

I spent the day experimenting with AWS for the first time. I've got an EC2 instance running and I mounted an Elastic Block Store (EBS) to keep the MySQL databases.
Does it make sense to also put my web application files on the EBS, or should I just deploy them to the normal EC2 file system?
When you say your web application files, I'm not sure what exactly you are referring to.
If you are referring to your deployed code, it probably doesn't make sense to use EBS. What you want to do is create an AMI with your prerequisites, then have a script to create an instance of that AMI and deploy your latest code. I highly recommend you automate and test this process as it's easy to forget about some setting you have to manually change somewhere.
If you are storing data files that are modified by the running application, EBS may make sense. If this is something like user-uploaded images or similar, you will likely find that S3 gives you a much simpler model.
EBS would be good for: databases, lucene indexes, file based CMS, SVN repository, or anything similar to that.
EBS gives you persistent storage, so if your EC2 instance fails, the files still exist. Apparently there is increased I/O performance, but I would test it to be sure.
If your files are going to change frequently (like a DB does) and you don't want to keep syncing them to S3 (or somewhere else), then an EBS is a good way to go. If you make infrequent changes and you can manually (or scripted) sync the files as necessary then store them in S3. If you need to shutdown or you lose your instance for whatever reason, you can just pull them down when you start up the new instance.
This is also assuming that you care about cost. If cost is not an issue, using the EBS is less complicated.
I'm not sure if you plan on having a separate EBS for your DB and your web files but if you only plan on having one EBS and you have enough empty space on it for your web files, then again, the EBS is less complicated.
If it's performance you are worried about, as mentioned, it's best to test your particular app.
Our approach is to have a script pre-deployed on our AMI that fetches the latest and greatest version of the code from source control. That makes it very straightforward to launch new instances quickly, or update all running instances (we take them out of the load balancing rotation one at a time, run the script, and put them back in the rotation).
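For illustration, the pre-baked script can be very simple, something like this (the checkout path and service name are assumptions, not our actual setup):

import subprocess

APP_DIR = "/var/www/app"   # assumed checkout location baked into the AMI

# Pull the latest code from source control
subprocess.run(["git", "-C", APP_DIR, "pull", "--ff-only"], check=True)

# Restart the app server so the new code is served (service name assumed)
subprocess.run(["sudo", "service", "httpd", "restart"], check=True)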
UPDATE:
Reading between the lines it looks like you're mounting a separate EBS volume to an instance-store backed instance. AWS recently introduced EBS backed instances that have a ton of benefits vs. the old instance-store ones. I still mount my MySQL data on a separate EBS partition, though, so that I can easily mount it to a different server if needed.
I strongly suggest an EBS backed instance with a separate EBS volume for the MySQL data.
