Is Namenode still necessary if I use S3 instead of HDFS? - hadoop

Recently I am setting up my Hadoop cluster over Object Store with S3, all data file are store in S3 instead of HDFS, and I successfully run spark and MP over S3, so I wonder if my namenode is still necessary, if so, what does my namenode do while I am running hadoop application over S3? Thanks.

No, provided you have a means to deal with the fact that S3 lacks the consistency needed by the shipping work committers. Every so often, if S3's listings are inconsistent enough, your results will be invalid and you won't even notice.
Different suppliers of Spark on AWS solve this in their own way. If you are using ASF spark, there is nothing bundled which can do this.
https://www.youtube.com/watch?v=BgHrff5yAQo

Related

What is the recommended DefaultFS (File system) for Hadoop on ephemeral Dataproc clusters?

What is the recommended DefaultFS (File system) for Hadoop on Dataproc. Are there any benchmarks, considerations available around using GCS vs HDFS as the default file system?
I was also trying to test things out and discovered that when I set the DefaultFS to a gs:// path, the Hive scratch files are getting created - both on HDFS as well as the GCS paths. Is this happening synchronously and adding to latency or does the write to GCS happen after the fact?
Would appreciate any guidance, reference around this.
Thank you
PS: These are ephemeral Dataproc clusters that are going to be using GCS for all persistent data.
HDFS is faster. There should already be public benchmarks for that, or just taken as a fact because GCS is networked storage where HDFS is directly mounted in the Dataproc VMs.
"Recommended" would be persistent storage, though, so GCS, but maybe only after finalizing the data in the applications. For example, you might not want Hive scratch files in GCS since they'll never be used outside of the current query session, but you would want Spark checkpoints if you're running periodic batch jobs that scale down the HDFS cluster in between executions
I would say the default (HDFS) is the recommended. Typically, the input and output data of Dataproc jobs are persisted outside of the cluster in GCS or BigQuery, the cluster is used for compute and intermediate data. These intermediate data are stored on local disks directly or through HDFS which eventually also goes to local disks. After the job is done, you can safely delete the cluster, only pay for the storage of input and output data to save cost.
Also HDFS usually has lower latency for intermediate data, especially for lots of small files and metadata operations, e.g. dir rename. GCS is better at throughput for large files.
But when using HDFS, you need to provision sufficient disk space (at least 1TB each node) and consider using local SSDs. See https://cloud.google.com/dataproc/docs/support/spark-job-tuning#optimize_disk_size for more details.

How can i copy files from external Hadoop cluster to Amazon S3 without running any commands on the cluster

I have scenario in which i have to pull data from Hadoop cluster into AWS.
I understand running dist-cp on the hadoop cluster is a way to copy the data into s3, but i have a restriction here, i wont be able to run any commands in the cluster. I should be able to pull the files from hadoop cluster into AWS. The data is available in hive.
I thought of the below options:
1) Sqoop data from Hive ? Is it possible ?
2) S3-distcp (running it on aws), if so what would be the configuration needed ?
Any Suggestions ?
If the hadoop cluster is visible from EC2-land, you could run a distcp command there, or, if it's a specific bit of data, some hive query which uses hdfs:// as input and writes out to s3. You'll need to deal with kerberos auth though: you cannot use distcp in an un-kerberized cluster to read data from a kerberized one, though you can go the other way.
You can also run distcp locally in 1+ machine, though you are limited by the bandwidth of those individual systems. distcp is best when it schedules the uploads on the hosts which actually have the data.
Finally, if it is incremental backup you are interested in, you can use the HDFS audit log as a source of changed files...this is what incremental backup tools tend to use

Hadoop (HDFS) - file versioning

At the given time I have user file system in my application (apache CMIS).
As it's growing bigger, I'm doubting to move to hadoop (HDFS) as we need to run some statistics on it as well.
The problem:
The current file system provides versioning of the files.
When I read about hadoop - HDFS- and file versioning, I found most of the time that I have to write this (versioning) layer myself.
Is there already something available to manage versioning of files in HDFS or do I really have to write it myself (don't want to reinvent the hot water, but don't find a proper solution either).
Answer
For full details: see comments on answer(s) below
Hadoop (HDFS) doesn't support versioning of files. You can get this functionality when you combine hadoop with (amazon) S3:
Hadoop will use S3 as the filesystem (without chuncks, but recovery will be provided by S3). This solution comes with the versioning of files that S3 provides.
Hadoop will still use YARN for the distributed processing.
Versioning is not possible with HDFS.
Instead you can use Amazon S3, which provides Versioning and is also compatible with Hadoop.
HDFS supports snapshots. I think that's as close as you can get to "versioning" with HDFS.

Which version of Spark to download?

I understand you can download Spark source code (1.5.1), or prebuilt binaries for various versions of Hadoop. As of Oct 2015, the Spark webpage http://spark.apache.org/downloads.html has prebuilt binaries against Hadoop 2.6+, 2.4+, 2.3, and 1.X.
I'm not sure what version to download.
I want to run a Spark cluster in standalone mode using AWS machines.
<EDIT>
I will be running a 24/7 streaming process. My data will be coming from a Kafka stream. I thought about using spark-ec2, but since I already have persistent ec2 machines, I thought I might as well use them.
My understanding is that since my persistent workers need to perform checkpoint(), it needs to have access to some kind of shared file system with the master node. S3 seems like a logical choice.
</EDIT>
This means I need to access S3, but not hdfs. I do not have Hadoop installed.
I got a pre-built Spark for Hadoop 2.6. I can run it in local mode, such as the wordcount example. However, whenever I start it up, I get this message
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Is this a problem? Do I need hadoop?
<EDIT>
It's not a show stopper but I want to make sure I understand the reason of this warning message. I was under the assumption that Spark doesn't need Hadoop, so why is it even showing up?
</EDIT>
I'm not sure what version to download.
This consideration will also be guided by what existing code you are using, features you require, and bug tolerance.
I want to run a Spark cluster in standalone mode using AWS instances.
Have you considered simply running Apache Spark on Amazon EMR? See also How can I run Spark on a cluster? from Spark's FAQ, and their reference to their EC2 scripts.
This means I need to access S3, but not hdfs
One does not imply the other. You can run a Spark cluster on EC2 instances perfectly fine, and never have to access S3. While many examples are written using S3 access through the out-of-the-box S3 "fs" drivers for the Hadoop library, pay attention that there are now 3 different access methods. Configure as appropriate.
However, your choice of libraries to load will depend on where your data is. Spark can access any filesystem supported by Hadoop, from which there are several to choose.
Is your data even in files? Depending on your application, and where your data is, you may only need to use Data Frame over SQL, Cassandra, or others!
However, whenever I start it up, I get this message
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Is this a problem? Do I need hadoop?
Not a problem. It is telling you that it is falling back to a non-optimum implementation. Others have asked this question, too.
In general, it sounds like you don't have any application needs right now, so you don't have any dependencies. Dependencies are what would drive different configurations such as access to S3, HDFS, etc.
I can run it in local mode, such as the wordcount example.
So, you're good?
UPDATE
I've edited the original post
My data will be coming from a Kafka stream. ... My understanding is that .. my persistent workers need to perform checkpoint().
Yes, the Direct Kafka approach is available from Spark 1.3 on, and per that article, uses checkpoints. These require a "fault-tolerant, reliable file system (e.g., HDFS, S3, etc.)". See the Spark Streaming + Kafka Integration Guide for your version for specific caveats.
So why [do I see the Hadoop warning message]?
The Spark download only comes with so many Hadoop client libraries. With a fully-configured Hadoop installation, there are also platform-specific native binaries for certain packages. These get used if available. To use them, augment Spark's classpath; otherwise, the loader will fallback to less performant versions.
Depending on your configuration, you may be able to take advantage of a fully configured Hadoop or HDFS installation. You mention taking advantage of your existing, persistent EC2 instances, rather than using something new. There's a tradeoff between S3 and HDFS: S3 is a new resource (more cost) but survives when your instance is offline (can take compute down and have persisted storage); however, S3 might suffer from latency compared to HDFS (you already have the machines, why not run a filesystem over them?), as well as not behave like a filesystem in all cases. This tradeoff is described by Microsoft for choosing Azure storage vs. HDFS, for example, when using HDInsight.
We're also running Spark on EC2 against S3 (via the s3n file system). We had some issue with the pre-built versions for Hadoop 2.x. Regrettably I don't remember what the issue was. But in the end we're running with the pre-built Spark for Hadoop 1.x and it works great.

How could i relate Amazon EC2,S3 and my HDFS?

I am learning hadoop in a pseudo distributed mode,so not much aware of the cluster. So when browsed about cluster i get that S3 is a data storage device. And EC2 is a computing service,but couldn't understand the real use of it. Will my HDFS be available in S3. If yes when i was learning hive i came across moving data from HDFS to S3 and this is mentioned as a archival logic.
hadoop distcp /data/log_messages/2011/12/02 s3n://ourbucket/logs/2011/12/02
My HDFS is landed on S3 so how would it be beneficial? This might be silly but if some one could give me a overview that would be helpful for me.
S3 is just storage, no computation is allowed. You can think S3 as a bucket which can hold data & you can retrieve data from it using there API.
If you are using AWS/EC2 then your hadoop cluster will be on AWS/EC2, it is different from S3. HDFS is just a file system in hadoop for maximizing input/output performance.
The command which you shared is distributed copy. It will copy data from your hdfs to S3. In short, EC2 will have HDFS as default file system in hadoop environment and you can move archive data or unused data to S3, as S3 storage is cheaper than EC2 machines.

Resources