Kubernetes distributed filesystem - hadoop

Well, my company is considering to move from Hadoop to Kubernetes. We can find solutions in Kubernetes for tools such as cassandra, sparks, etc. So the last problem for us is how to store massive amount of files in Kubernetes, saying 1 PB. FYI, we DO NOT want to use online storage services such as S3.
As far as I know, HDFS is merely used in Kubernetes and there are a few replacement products such as Torus and Quobyte. So my question is, any recommendation for the filesystem on Kubernetes? Or any better solution?
Many thanks.

You can use a Hadoop Compatible FileSystem such as Ceph or Minio. Both of which offer S3-compatible REST APIs for reading and writing. In Kubernetes, Ceph can be deployed using the Rook project.
But overall, running HDFS in Kubernetes would require stateful services like the NameNode, and DataNodes with proper affinity and network rules in place. The Hadoop Ozone project is a realization that object storage is more common for microservice workloads than HDFS block storage as reasonably trying to analyze PB of data using distributed microservices wasn't feasible. (I'm only speculating)
The alternative is to use Docker support in Hadoop & YARN 3.x

Related

Install Druid on AWS EMR

Just started exploring Druid we dint find any blog on link to install Druid on AWS, Is there any chance to install Druid on AWS EMR ? If so if there are any per-defined Cloud Formation to execute it will be help full for my R&D on Druid.
its pretty straighforward to setup a basic single cluster druid
launch EMR with a single node master, like r3.4xlarge
download imply tar (comes with druid and pivot), https://docs.imply.io/on-prem/quickstart
tar -xzf imply-3.1.8.1.tar.gz cd imply-3.1.8.1
bin/supervise -c conf/supervise/quickstart.conf
If you are looking for a full cluster deploy, EMR is not the right tool.
If you know EKS / kubernetes, I think the easiest way to get started is using Helm
https://github.com/helm/charts/tree/master/incubator/druid
Other option is to look for Imply Cloud
They also solid documentation around Druid. Druid'd own documentation is pretty intense. I found imply to be better for beginners.
https://docs.imply.io/cloud/
Although for POC, a single r3.4xlarge or i3.4xlarge having some 200G storage is good enough
The most likely reason why you would not find much documentation is that the two things have a different nature.
Druid is meant to be long lived and statefull, where the EMR hadoop variant is meant to spin up and down in a more ephemeral manner. As such the combination is somewhat awkward.
Consider using a different hadoop distribution like HDP. Of course you can easily deploy it on AWS if needed, or on your own hardware if you want to minimize infra costs.
Disclaimer: I am an employee of Cloudera, the distributor of HDP, which is currently the most common hadoop platform under Druid.

How to provision a Hadoop ecosystem cluster with OpenShift?

We are searching a viable way for provisioning a Hadoop ecosystem cluster with OpenShift (based on Docker). We look to build up a cluster using the services of the Hadoop ecosystem, i.e. HDFS, YARN, Spark, Hive, HBase, ZooKeeper etc.
My team has been using Hortonworks HDP for on-premise hardware but will now switch into a OpenShift-based infrastructure. Hortonworks Cloudbreak seems not to be suitable for OpenShift-based infrastructures. I have found this article that describes the integration of YARN into OpenShift but it seems like there are no further information available.
What is the easiest way to provision a Hadoop ecosystem cluster on OpenShift? Manually adding all the services feels error-prone and hard to administer. I have stumbled upon the Docker images of these separate services, but it is not comparable to the automated provisioning you get with a platform like Hortonworks HDP. Any guidance is appreciated.
If you install Openstack within Openshift, Sahara allows provisioning of Openstack Hadoop clusters
Alternatively, Cloudbreak is Hortonwork's tool for provisioning container based cloud deployments
Both provides Ambari, allowing you the same interface for cluster administration as HDP.
FWIW, I personally don't find the reason for putting Hadoop in containers. Your datanodes are locked to specific disks. There's no improvement in running several smaller ResourceManagers on a single host. Plus, for YARN, you'd be running containers within containers. And for the namenode, you must have a replicated Fsimage + Editlog because the container could be placed on any system

Which version of Spark to download?

I understand you can download Spark source code (1.5.1), or prebuilt binaries for various versions of Hadoop. As of Oct 2015, the Spark webpage http://spark.apache.org/downloads.html has prebuilt binaries against Hadoop 2.6+, 2.4+, 2.3, and 1.X.
I'm not sure what version to download.
I want to run a Spark cluster in standalone mode using AWS machines.
<EDIT>
I will be running a 24/7 streaming process. My data will be coming from a Kafka stream. I thought about using spark-ec2, but since I already have persistent ec2 machines, I thought I might as well use them.
My understanding is that since my persistent workers need to perform checkpoint(), it needs to have access to some kind of shared file system with the master node. S3 seems like a logical choice.
</EDIT>
This means I need to access S3, but not hdfs. I do not have Hadoop installed.
I got a pre-built Spark for Hadoop 2.6. I can run it in local mode, such as the wordcount example. However, whenever I start it up, I get this message
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Is this a problem? Do I need hadoop?
<EDIT>
It's not a show stopper but I want to make sure I understand the reason of this warning message. I was under the assumption that Spark doesn't need Hadoop, so why is it even showing up?
</EDIT>
I'm not sure what version to download.
This consideration will also be guided by what existing code you are using, features you require, and bug tolerance.
I want to run a Spark cluster in standalone mode using AWS instances.
Have you considered simply running Apache Spark on Amazon EMR? See also How can I run Spark on a cluster? from Spark's FAQ, and their reference to their EC2 scripts.
This means I need to access S3, but not hdfs
One does not imply the other. You can run a Spark cluster on EC2 instances perfectly fine, and never have to access S3. While many examples are written using S3 access through the out-of-the-box S3 "fs" drivers for the Hadoop library, pay attention that there are now 3 different access methods. Configure as appropriate.
However, your choice of libraries to load will depend on where your data is. Spark can access any filesystem supported by Hadoop, from which there are several to choose.
Is your data even in files? Depending on your application, and where your data is, you may only need to use Data Frame over SQL, Cassandra, or others!
However, whenever I start it up, I get this message
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Is this a problem? Do I need hadoop?
Not a problem. It is telling you that it is falling back to a non-optimum implementation. Others have asked this question, too.
In general, it sounds like you don't have any application needs right now, so you don't have any dependencies. Dependencies are what would drive different configurations such as access to S3, HDFS, etc.
I can run it in local mode, such as the wordcount example.
So, you're good?
UPDATE
I've edited the original post
My data will be coming from a Kafka stream. ... My understanding is that .. my persistent workers need to perform checkpoint().
Yes, the Direct Kafka approach is available from Spark 1.3 on, and per that article, uses checkpoints. These require a "fault-tolerant, reliable file system (e.g., HDFS, S3, etc.)". See the Spark Streaming + Kafka Integration Guide for your version for specific caveats.
So why [do I see the Hadoop warning message]?
The Spark download only comes with so many Hadoop client libraries. With a fully-configured Hadoop installation, there are also platform-specific native binaries for certain packages. These get used if available. To use them, augment Spark's classpath; otherwise, the loader will fallback to less performant versions.
Depending on your configuration, you may be able to take advantage of a fully configured Hadoop or HDFS installation. You mention taking advantage of your existing, persistent EC2 instances, rather than using something new. There's a tradeoff between S3 and HDFS: S3 is a new resource (more cost) but survives when your instance is offline (can take compute down and have persisted storage); however, S3 might suffer from latency compared to HDFS (you already have the machines, why not run a filesystem over them?), as well as not behave like a filesystem in all cases. This tradeoff is described by Microsoft for choosing Azure storage vs. HDFS, for example, when using HDInsight.
We're also running Spark on EC2 against S3 (via the s3n file system). We had some issue with the pre-built versions for Hadoop 2.x. Regrettably I don't remember what the issue was. But in the end we're running with the pre-built Spark for Hadoop 1.x and it works great.

Did hortan sandbox can use as a single node Hadoop cluster

I like to study about Hadoop multinode setup and installation, by referring the above tutorial I understand that single node cluster environment can be used as node for the multinode cluster
http://bigdatahandler.com/hadoop-hdfs/hadoop-multi-node-cluster-setup/
Currently I am learning Hadoop using Horton sandbox, can we use a sandbox system as a single node environment?
If not what is the difference between sandbox and traditional Hadoop cluster installation
The sandbox images (from Hortonworks and Cloudera) provide the user with a pre-configured development environment with all the usual tools already available and installed (pig, hive etc.). Since the image is a single "system" it is set-up such that the hadoop cluster is single-node: i.e. everything - HDFS, Hadoop map-reduce etc. - is local to that image. That is a massive benefit, as anyone who has set up a hadoop cluster will tell you! It allows you to get up-and-running with very little operational overhead.
What these sandboxes do not provide, however, is realistic cluster behaviour as you have only one node. But there other possibilities - tools such as Vagrant and Docker - that would allow you to do this (I have not tried it myself).
The big data handler link you shared seems to be about combining several of these standalone, inherently single-node "clusters" so that you have something more realistic. But I would guess setting this up so that YARN, Zookeeper and other services are not duplicated comes with a not insignificant challenge.

Recommendations for Hadoop on EC2?

When running Hadoop in EC2, I seem to have two options:
A: Manage the cluster myself, using the EC2-specific shell scripts that come with Hadoop.
B: Use Elastic MapReduce, and pay a little extra for the convenience.
I'm leaning towards B, but I'd appreciate some advice from people with more experience. Here are my questions:
Are there any tasks that can be done with one of these methods but not the other?
Are there other options besides these two that I'm overlooking?
If I choose B, how easy would it be to go back to A? That is, what's the danger of vendor lock-in?
Third option:
You can use apache whirr to set up an hadoop cluster on ec2 (rackspace is also supported)
I have been told by people close to the Amazon Elastic MapReduce (EMR) development team that there are at least two other advantages to using EMR: a) Amazon is actively applying bug fixes and performance enhancements to the Hadoop code base used on EMR, and b) Amazon employs a high performance network between EMR servers and S3 servers that may not be available between EC2 servers and S3 servers.
UPDATE: See #mat's comments that refute the rumored advantages of using EMR.
Disclaimer: I'm the founder of Axemblr.com
There are also commercial alternatives you can use. Axemblr Tool for Cloudera CDH3 is a tool we are building that can deploy a cluster in just a few minutes with all you need (including Cloudera Hue, Mahout & Pig).
We are also building an alternative to EMR that's fully compatible from an API perspective, targeted at private clouds.
If you are wondering why it makes sense to run CDH on EC2 rather than EMR see:
http://www.quora.com/What-are-the-advantages-disadvantages-running-Clouderas-distribution-for-Hadoop-on-EC2-instances-rather-than-using-Amazons-Elastic-Map-Reduce-Service

Resources