Data migration of 200 TB of Hadoop data between AWS VPCs

I have a technical challenge on my hands: I need to transfer 200 TB of Hadoop data between two different AWS VPCs. The restrictions are:
No VPC peering
No third-party software installation
This is the solution I have thought through so far. I tried to cut out one hop, but the performance is not great:
Hadoop data to EFS --> EFS to EFS --> EFS to Hadoop

1) Please don't use an EFS-to-EFS copy. It is very slow compared to S3 replication.
2) Use multiple buckets to replicate the data, for example 10 buckets with replication enabled.
3) Use distcp to copy the data from HDFS to S3.
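
As a rough illustration of points 2 and 3, the sketch below spreads a list of HDFS directories round-robin over several buckets and builds one `hadoop distcp` command per directory. The directory names, the bucket names and the `s3a://` scheme are assumptions made for the example, not details from the question; the script only prints the commands and does not run them.

```python
from itertools import cycle


def plan_distcp(hdfs_dirs, buckets, scheme="s3a"):
    """Assign each HDFS directory to a bucket round-robin and return one
    'hadoop distcp <src> <dst>' argument list per directory."""
    commands = []
    for src, bucket in zip(hdfs_dirs, cycle(buckets)):
        # src is an absolute HDFS path, so it is reused as the object prefix.
        dst = f"{scheme}://{bucket}{src}"
        commands.append(["hadoop", "distcp", src, dst])
    return commands


if __name__ == "__main__":
    # Hypothetical directory and bucket names, for illustration only.
    dirs = [f"/data/part{i:02d}" for i in range(20)]
    buckets = [f"migration-bucket-{i}" for i in range(10)]
    for cmd in plan_distcp(dirs, buckets):
        print(" ".join(cmd))
```

Each printed line could then be submitted as a separate distcp job, so that the copies into different buckets run in parallel.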

Related

ETL process in AWS using EC2-s and EFS

I am a data engineer with experience in designing and creating data integration and ELT processes. Below is my use case; I need to move my process to AWS and would like your opinion.
My files to be processed are in S3. I need to process those files using Hadoop. I have existing logic written in Hive and just need to migrate it to AWS. Is the approach below correct and feasible?
1. Spin up a fleet of EC2 instances, initially say 5, and enable autoscaling.
2. Create an EFS and mount it on the EC2 instances.
3. Copy the files from S3 to EFS as Hadoop tables.
4. Run Hive queries on top of the data in EFS and create new tables.
5. Once the process is completed, move/export the final reports table from EFS to S3 (somehow). I am not sure whether this is possible; if it is not, the entire solution is not feasible.
6. Terminate the EFS and the EC2 instances.
If the above method is correct, how does the Hadoop orchestration happen using EFS?
Thanks,
KR
"Spin up a fleet of EC2 instances, initially say 5, enable autoscaling."
I'm not sure you need the autoscaling.
Why?
Let's say you start a "big" query that takes a lot of time and CPU.
Autoscaling will start more instances, but how would a fraction of the already-running query get moved onto the new machines?
All machines need to be ready before you run the query. Just keep that in mind.
In other words: only the machines that are available when the query starts will handle it.
"Copy file from S3 to EFS as Hadoop tables."
There isn't any problem with this idea.
Just keep in mind that you can keep the data in EFS.
If EFS is too pricey for you, look into provisioning magnetic EBS volumes in RAID 0.
You will get good speeds at minimal cost.
The rest is okay, and this is one of the ways to do "on demand" interactive analytics.
Please take a look at AWS Athena.
It is a service that lets you run queries on S3 objects.
You can use JSON and even Parquet (which is much more efficient!).
This service may be enough for your needs.
Good luck!

How could I relate Amazon EC2, S3 and my HDFS?

I am learning Hadoop in pseudo-distributed mode, so I don't know much about clusters. When I read about clusters, I learned that S3 is a data storage service and EC2 is a computing service, but I couldn't understand how they are actually used. Will my HDFS be available in S3? If so: when I was learning Hive, I came across moving data from HDFS to S3, which was described as archival logic.
hadoop distcp /data/log_messages/2011/12/02 s3n://ourbucket/logs/2011/12/02
If my HDFS data lands on S3, how is that beneficial? This might be a silly question, but an overview would be helpful.
S3 is just storage; it does no computation. You can think of S3 as a bucket that holds data, which you retrieve through its API.
If you are using AWS/EC2, your Hadoop cluster runs on EC2, which is separate from S3. HDFS is simply Hadoop's file system, designed to maximize input/output performance.
The command you shared is a distributed copy: it copies data from your HDFS to S3. In short, a Hadoop cluster on EC2 uses HDFS as its default file system, and you can move archived or unused data to S3, since S3 storage is cheaper than keeping the data on EC2 machines.

Need help to set up a Hadoop cluster in AWS

I would like to set up a Hadoop cluster in AWS with a total capacity of approximately 100 TB. If I choose from the AWS instance types listed at http://aws.amazon.com/ec2/instance-types/ , I do not get an ideal configuration for data nodes; I would like to use local disks (SSD or non-SSD) for the worker nodes. For example, if I select the cc2.8xlarge instance for data nodes, then for 100 TB I will have to set up 30 cc2.8xlarge instances, which would be very costly. Could you please suggest how I should configure my cluster in AWS (EC2) with the minimum number of data nodes, or is there a standard configuration for Hadoop in AWS?
It sounds very much like you want to consider Elastic MapReduce, which is a core AWS service based on Hadoop.
http://aws.amazon.com/elasticmapreduce/
You can specify your configuration and the cluster will launch for you - much easier than trying to configure EC2 instances yourself.
If you want to run Hadoop yourself, then use EBS drives. You can mount a bunch of drives (around 10-20, as I recall) on each node, and each drive can be up to 1 TB (newer EBS volume types allow larger volumes).
If you don't want to do it yourself, then look into EMR like monkeymatrix said.
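
To put numbers on the EBS suggestion, here is a small back-of-the-envelope sketch. It uses the 100 TB target from the question and the 10-20 drives of up to 1 TB per node from the answer. The `replication` parameter is an assumption added for illustration: with the default of 1 the 100 TB is read as raw disk capacity, while 3 (the HDFS default replication factor) reads it as usable data.

```python
import math


def nodes_needed(capacity_tb, drives_per_node, drive_size_tb=1.0, replication=1):
    """Return how many data nodes are needed to hold capacity_tb.

    replication=1 treats capacity_tb as raw disk space; a higher value
    multiplies it by the number of HDFS replicas (an assumption here).
    """
    raw_tb = capacity_tb * replication
    per_node_tb = drives_per_node * drive_size_tb
    return math.ceil(raw_tb / per_node_tb)


if __name__ == "__main__":
    print(nodes_needed(100, drives_per_node=10))                 # 10
    print(nodes_needed(100, drives_per_node=20))                 # 5
    print(nodes_needed(100, drives_per_node=10, replication=3))  # 30
```

Under these assumptions, attaching 20 drives per node brings 100 TB of raw capacity down to 5 data nodes, compared with the 30 instances mentioned in the question.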

Amazon S3 with a local Hadoop cluster

I have about 40 TB of data in Amazon S3 which I need to analyze using MapReduce. Our current IT policies do not provide an Amazon EMR account for this, so I have to rely on a locally managed Hadoop cluster. I would like advice on whether it is advisable to use a local Hadoop cluster when our data is actually stored on S3.
Please check out https://wiki.apache.org/hadoop/AmazonS3 on how to use S3 as a replacement for HDFS. You can choose either S3 Native FileSystem or S3 Block FileSystem.

Syncing between Amazon EBS Devices

I have 2 EC2 instances, each with its own EBS volume attached. A load balancer sits in front of the EC2 instances.
These instances run CMS-driven sites, where users can upload files.
What would be the best solution to the problem of a file getting uploaded to one EBS volume and the load balancer sending a visitor to the EC2 instance whose EBS volume does not have the file? Some sort of cron job which runs an rsync?
Suggestions very welcome!
Thanks
S
I believe the best solution would be to use a single shared storage such as Amazon S3. Ideally, use a plugin for your CMS that stores users' files on S3. If there is no such plugin, you can use the s3fs FUSE adapter to mount the bucket as a file system on both instances and configure your CMS to store those files in that directory.
There are several solutions to this problem. Off the top of my head:
NFS/Samba shared directory between instances
SVN deploy
Cluster file systems - OCFS/GFS
Cloud management such as Capistrano, triggering a deploy when you need it
And of course cron jobs that run ftp, scp, rsync, s3sync/copy, etc.
Or possibly, set up one EC2 instance as an NFS server and share its directories with your other instances.
There are multiple solutions to keep the data on both EC2 instances in sync, with or without EBS volumes.
You can use the AWS EFS service instead of EBS volumes. An EFS file system can be shared between EC2 instances within a VPC, and both instances will see the same data at the path where EFS is mounted.
Another solution is Gluster file storage. This can also work between EBS volumes in different AWS regions. Refer to this link: http://sanketdangi.com/post/5601762671/gluster-config-aws-multi-az
You can mount an S3 bucket on your EC2 instances using S3 FUSE. Refer to this link: https://github.com/s3fs-fuse/s3fs-fuse/wiki/Fuse-Over-Amazon
Maybe you can also use "s3 sync" on both EBS volumes. This way both volumes stay in sync via S3. Refer to this link: https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
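
A minimal sketch of the "s3 sync" idea, assuming the AWS CLI is installed and configured on both instances. The upload directory, bucket name and prefix are hypothetical. Each instance would run the pair of commands periodically (for example from cron): first pushing its local uploads to the bucket, then pulling whatever the other instance has pushed.

```python
import subprocess


def sync_commands(local_dir, bucket, prefix="uploads"):
    """Build the push and pull 'aws s3 sync' commands for one instance."""
    remote = f"s3://{bucket}/{prefix}"
    push = ["aws", "s3", "sync", local_dir, remote]  # local -> S3
    pull = ["aws", "s3", "sync", remote, local_dir]  # S3 -> local
    return [push, pull]


def run_sync(local_dir, bucket, prefix="uploads"):
    """Run push then pull; raises CalledProcessError if either fails."""
    for cmd in sync_commands(local_dir, bucket, prefix):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical path and bucket name; only prints the commands.
    for cmd in sync_commands("/var/www/uploads", "my-cms-uploads"):
        print(" ".join(cmd))
```

Note that `aws s3 sync` does not remove files at the destination unless `--delete` is passed, so with this sketch a file deleted on one instance would be restored by the next pull.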
