How can I relate Amazon EC2, S3 and my HDFS? - hadoop

I am learning Hadoop in pseudo-distributed mode, so I am not very familiar with clusters. From what I have read, S3 is a data storage service and EC2 is a computing service, but I couldn't understand the real use of them together. Will my HDFS be available in S3? If yes, while learning Hive I came across moving data from HDFS to S3, and this was described as archival logic:
hadoop distcp /data/log_messages/2011/12/02 s3n://ourbucket/logs/2011/12/02
If my HDFS lands on S3, how would that be beneficial? This might be silly, but if someone could give me an overview, that would be helpful.

S3 is just storage; no computation runs there. You can think of S3 as a bucket that holds data, and you can retrieve data from it using its API.
If you are using AWS, your Hadoop cluster will run on EC2 instances, which are separate from S3. HDFS is Hadoop's own file system, designed to maximize input/output performance.
The command you shared is DistCp (distributed copy). It copies data from your HDFS to S3. In short, an EC2-based Hadoop environment will have HDFS as its default file system, and you can move archived or unused data to S3, since S3 storage is cheaper than keeping it on EC2 machines.
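As a minimal sketch of that archival flow (the bucket name and paths below are hypothetical, and the newer s3a connector is assumed in place of s3n), you could copy a month of logs to S3 and then free the HDFS space:
# copy the archive data from HDFS to S3 via DistCp (credentials passed as -D options)
hadoop distcp -Dfs.s3a.access.key=AKIA... -Dfs.s3a.secret.key=... \
    hdfs:///data/log_messages/2011/12 s3a://ourbucket/logs/2011/12
# once the copy is verified, reclaim the space on the cluster
hadoop fs -rm -r /data/log_messages/2011/12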

Related

What is the recommended DefaultFS (File system) for Hadoop on ephemeral Dataproc clusters?

What is the recommended defaultFS (file system) for Hadoop on Dataproc? Are there any benchmarks or considerations available around using GCS vs HDFS as the default file system?
I was also trying to test things out and discovered that when I set the defaultFS to a gs:// path, the Hive scratch files are created both on HDFS and on the GCS path. Is this happening synchronously and adding to latency, or does the write to GCS happen after the fact?
Would appreciate any guidance or references around this.
Thank you
PS: These are ephemeral Dataproc clusters that are going to be using GCS for all persistent data.
HDFS is faster. There should already be public benchmarks for that, or it can simply be taken as fact, because GCS is networked storage while HDFS is mounted directly on the Dataproc VMs.
"Recommended" would be persistent storage, though, so GCS, but perhaps only after finalizing the data in the application. For example, you might not want Hive scratch files in GCS since they will never be used outside the current query session, but you would want Spark checkpoints there if you run periodic batch jobs that scale down the HDFS cluster between executions.
I would say the default (HDFS) is the recommended option. Typically, the input and output data of Dataproc jobs are persisted outside the cluster in GCS or BigQuery; the cluster itself is used for compute and intermediate data. That intermediate data is stored on local disks directly or through HDFS, which eventually also goes to local disks. After the job is done, you can safely delete the cluster and pay only for the storage of the input and output data, which saves cost.
Also, HDFS usually has lower latency for intermediate data, especially with lots of small files and metadata operations such as directory renames. GCS is better at throughput for large files.
But when using HDFS, you need to provision sufficient disk space (at least 1 TB per node) and consider using local SSDs. See https://cloud.google.com/dataproc/docs/support/spark-job-tuning#optimize_disk_size for more details.
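A minimal sketch of that setup with the gcloud CLI (cluster name, region, sizes, and paths are hypothetical): keep HDFS as the defaultFS for intermediate data, size the disks and local SSDs for it, and have jobs reference gs:// URIs only for persistent input and output.
# ephemeral cluster: HDFS (default) for scratch/intermediate data, GCS for persistent data
gcloud dataproc clusters create ephemeral-etl-1 \
    --region=us-central1 \
    --num-workers=4 \
    --worker-boot-disk-size=1000GB \
    --num-worker-local-ssds=2
# jobs read and write persistent data with explicit gs:// URIs for input and output
# delete the cluster when the job finishes; only the data in GCS/BigQuery remains
gcloud dataproc clusters delete ephemeral-etl-1 --region=us-central1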

Data Migration of 200TB of Hadoop data between AWS VPCs

I have a technology challenge on my hands: I need to transfer 200 TB of Hadoop data between two different AWS VPCs, with the following restrictions:
No VPC peering
No third-party software installation
The following is the solution I thought through. I tried to cut one hop, but the performance is not that great:
Hadoop data to EFS --> EFS to EFS --> EFS to Hadoop
1) Please don't use EFS-to-EFS copy; it is very slow compared to S3 replication.
2) Use multiple buckets to replicate the data; go for 10 buckets with replication enabled.
3) Use DistCp to copy the data from HDFS to S3 (see the sketch below).
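A hedged sketch of step 3 (bucket names, paths, and tuning values are hypothetical): run one DistCp per top-level directory so the copies are spread across the replicated buckets, and raise the mapper count for a transfer of this size.
# copy one slice of the data into one of the replicated buckets
hadoop distcp \
    -Dfs.s3a.access.key=AKIA... -Dfs.s3a.secret.key=... \
    -m 100 -strategy dynamic -update \
    hdfs:///data/part01 s3a://migration-bucket-01/part01
# repeat for part02 -> migration-bucket-02, and so on; bucket replication then
# carries the objects over to the destination side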

Is Namenode still necessary if I use S3 instead of HDFS?

Recently I have been setting up my Hadoop cluster over an object store, with all data files stored in S3 instead of HDFS, and I successfully ran Spark and MapReduce over S3. So I wonder whether my NameNode is still necessary, and if so, what does it do while I am running Hadoop applications over S3? Thanks.
No, provided you have a means to deal with the fact that S3 lacks the consistency needed by the work committers that ship with Hadoop. Every so often, if S3's listings are inconsistent enough, your results will be invalid and you won't even notice.
Different suppliers of Spark on AWS solve this in their own way. If you are using ASF Spark, there is nothing bundled which can do this.
https://www.youtube.com/watch?v=BgHrff5yAQo
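As a hedged illustration, not taken from the answer above: on Hadoop 3.1+ the S3A committers can be enabled so that job commits no longer depend on listing and renaming in S3. The class, bucket, and jar names below are examples only.
spark-submit \
    --conf spark.hadoop.fs.s3a.committer.name=directory \
    --conf spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a=org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory \
    --class com.example.MyJob my-job.jar \
    s3a://mybucket/input s3a://mybucket/output
# note: writing Parquet/ORC from Spark this way typically also needs the
# spark-hadoop-cloud committer bindings on the classpath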

How can I couple Amazon Glacier / S3 with hadoop map reduce / spark?

I need to process data stored in Amazon S3 and Amazon Glacier with Hadoop / EMR and save the output data in an RDBMS, e.g. Vertica.
I am a total noob in big data. I have only gone through a few online sessions and slide decks about MapReduce and Spark, and created a few dummy MapReduce jobs for learning purposes.
Till now I only have commands that let me import data from S3 to HDFS in Amazon EMR, and after processing, store the results in HDFS files.
So here are my questions:
Is it really mandatory to sync data from S3 to HDFS first before executing MapReduce, or is there a way to use S3 directly?
How can I make Hadoop access Amazon Glacier data?
And finally, how can I store the output to a database?
Any suggestion / reference is welcome.
EMR clusters are able to read/write to/from S3, so there is no need to copy data to the cluster. S3 has a Hadoop FileSystem implementation, so it can mostly be treated the same as HDFS.
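For instance (a hedged sketch; the jar, class, bucket, and prefixes are hypothetical), on EMR the S3 paths can be passed to a job exactly where HDFS paths would go:
# run a MapReduce job reading from and writing to S3 directly, no HDFS staging
hadoop jar my-wordcount.jar com.example.WordCount \
    s3://mybucket/input/logs/ s3://mybucket/output/wordcount/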
AFAIK your MR/Spark jobs cannot directly access data in Glacier; the data first has to be restored from Glacier, which is by itself a lengthy procedure.
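A hedged sketch of that restore step using the AWS CLI (bucket, key, and retention values are hypothetical); the restore can take hours depending on the retrieval tier, and only afterwards can Hadoop read the object from S3:
aws s3api restore-object \
    --bucket mybucket \
    --key archive/2015/data.gz \
    --restore-request '{"Days":7,"GlacierJobParameters":{"Tier":"Standard"}}'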
Check out Sqoop for pumping data between HDFS and a database.
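As a rough example of that last step (the JDBC URL, credentials, table, and paths are hypothetical, and the Vertica JDBC driver jar would need to be on Sqoop's classpath), a Sqoop export can push job output from HDFS into an RDBMS table:
sqoop export \
    --connect jdbc:vertica://db-host:5433/analytics \
    --driver com.vertica.jdbc.Driver \
    --username dbuser --password-file /user/hadoop/db.password \
    --table results \
    --export-dir /output/wordcount \
    --input-fields-terminated-by '\t'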

Amazon S3 with a local Hadoop cluster

I have about 40 TB of data in Amazon S3 which I need to analyze using MapReduce. Our current IT policies do not provide an Amazon EMR account for this, so I have to rely on a locally managed Hadoop cluster. I wanted to get advice on whether it is advisable to use a local Hadoop cluster when our data is actually stored in S3.
Please check out https://wiki.apache.org/hadoop/AmazonS3 for how to use S3 as a replacement for HDFS. You can choose either the S3 Native FileSystem or the S3 Block FileSystem.
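As a hedged sketch (access keys, bucket, and paths are hypothetical, and the newer s3a connector is assumed), a local cluster can either read S3 directly or pull the data into its own HDFS first with DistCp:
# list the S3 data from the local cluster (generic -D options carry the credentials)
hadoop fs -Dfs.s3a.access.key=AKIA... -Dfs.s3a.secret.key=... \
    -ls s3a://mybucket/dataset/
# or copy it into local HDFS once and analyze it there
hadoop distcp -Dfs.s3a.access.key=AKIA... -Dfs.s3a.secret.key=... \
    s3a://mybucket/dataset/ hdfs:///data/dataset/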
