The Hadoop documentation states that the DockerContainerExecutor (DCE) does not support a cluster running in secure mode (Kerberos): https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/DockerContainerExecutor.html
Are people working on this? Is there a way around this limitation?
OK. There's no current work on DCE (YARN-2466). Efforts have shifted towards supporting Docker containers in the LinuxContainerExecutor (YARN-3611), which will support Kerberos. There is no documentation yet (YARN-5258), and many of these features are expected to be part of the Apache Hadoop 2.8 release.
Source and more info:
https://community.hortonworks.com/questions/39064/can-i-run-dce-docker-container-executor-on-yarn-wi.html
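For reference, once Docker support in the LinuxContainerExecutor is available, launching a container in a Docker image looks roughly like the sketch below. This is only an illustration: the environment variable names are taken from the later upstream (Hadoop 3.x) documentation and may differ in the 2.8 release, and the image name and paths are placeholders.

# Sketch: run the YARN distributed-shell example inside a Docker image
# via the LinuxContainerExecutor (names per later Hadoop docs; may differ in 2.8).
yarn jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-*.jar \
  -jar $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-*.jar \
  -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker \
  -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=centos:7 \
  -shell_command "cat /etc/os-release" \
  -num_containers 1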
Has someone successfully run Flink jobs with this kind of setup (Github CI CD and Kubernetes)?
Since Flink jobs can't be dockerized and deployed in a natural way as part of the container, I am not sure what the best way of doing this is.
Thanks
Yes, this can be done. For the dockerizing portion, see the docs about running Flink on Docker and running Flink on Kubernetes, as well as Patrick Lukas' Flink Forward talk "Flink in Containerland". You'll find links to Docker Hub, GitHub, SlideShare, and YouTube behind these links.
dA Platform 2 is a commercial offering from data Artisans that supports CI/CD integrations for Flink on Kubernetes. The demo video from the product announcement at Flink Forward Berlin 2017 illustrates this.
I solved this. Flink jobs can be dockerized and deployed in a natural way as part of the container.
I extended the Flink Docker image and added a "flink run some.jar" step to it.
It works perfectly:
https://github.com/Aleksandr-Filichkin/flink-k8s
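In other words, the job jar is baked into an image built on top of the Flink base image and submitted from inside the container. A minimal sketch of that idea (the image tag and jar name below are placeholders, not taken from the linked repo):

# Build a job-specific image on top of the official Flink image.
cat > Dockerfile <<'EOF'
FROM flink:latest
COPY target/my-flink-job.jar /opt/my-flink-job.jar
EOF
docker build -t my-flink-job .
# Inside the running container (or as the command in the Kubernetes job spec):
#   flink run /opt/my-flink-job.jar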
I want to use Big Data analytics for my work. I have already implemented all the Docker stuff, creating containers within containers. I am new to Big Data, however, and I have come to know that using Hadoop for HDFS, with Spark instead of MapReduce on top of it, is the best way for websites and applications when speed matters (is it?). Will this work on my Docker containers? It'd be very helpful if someone could direct me somewhere to learn more.
You can try playing with the Cloudera QuickStart Docker image to get started. Please take a look at https://hub.docker.com/r/cloudera/quickstart/. This image provides a single-node deployment of Cloudera's Hadoop platform (CDH) together with Cloudera Manager, and it also includes Spark.
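Getting it running is roughly a one-liner. The flags below follow the image's documentation as I recall it, so check the Docker Hub page in case they have changed:

docker pull cloudera/quickstart:latest
docker run --hostname=quickstart.cloudera --privileged=true -t -i \
  -p 8888:8888 -p 7180:7180 \
  cloudera/quickstart /usr/bin/docker-quickstart
# 8888 = Hue; 7180 = Cloudera Manager (if you start it inside the container)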
I am in the planning phase of a multi-node Hadoop cluster in a Docker-based environment, so it should be based on a lightweight, easy-to-use virtualized system.
The current architecture (according to the documentation) contains 1 master and 3 slave nodes. The host machine uses the HDFS filesystem and KVM for virtualization.
The whole cloud is managed by Cloudera Manager. There are several Hadoop modules installed on this cluster. There is also a NodeJS data upload service.
This time I should make the architecture Docker-based.
I have read several tutorials and formed some opinions, but I also have open questions.
A. What do you think, is https://github.com/Lewuathe/docker-hadoop-cluster a good base for my project? I have also found an official image, but it is single-node.
B. How would the system requirements change if I wanted to run all of this in a single container? That would be great, because this architecture should work in different locations, so changes could easily be transferred between them. Synchronization between these so-called clones would be important.
C. Do you have some other ideas, maybe best practices?
As of September 2016 there is no quick answer.
https://github.com/Lewuathe/docker-hadoop-cluster does not seem like a good start, since whatever you pick should also be universal enough to cover your option B.
Keep an eye on https://github.com/sequenceiq/hadoop-docker and https://github.com/kiwenlau/hadoop-cluster-docker
To address your question C., you may want to check out BlueData's software platform: http://www.bluedata.com/blog/2015/06/docker-containers-big-data-clusters
It's designed to run multi-node Hadoop clusters in a Docker-based environment and there is a free version available for download (you can also run it in an AWS EC2 instance).
This work has already been done for you, actually:
https://hub.docker.com/r/cloudera/clusterdock/
It includes a pre-packaged multi-node CDH cluster, with Cloudera Manager available as an optional component for cluster management.
I configured a multi-node Hadoop environment on AWS (1 master / 3 slaves running on Ubuntu 14.04). Now I am planning to install and configure other Apache bricks (not sure which ones exactly yet). I decided to start with HBase.
Here is my dilemma: should I install ZooKeeper standalone and then HBase (taking into consideration future bricks like Pig, Hive, ...), or should I use the ZooKeeper bundled with HBase?
How might this choice affect subsequent architecture design?
Thanks for sharing your views/personal experiences!
It doesn't really matter all that much from a capabilities point of view. If you install the bundled HBase+ZK, you'll still be able to use ZK later on to support other bricks. Since installing the bundle is likely to be the quickest path to a working HBase, it is probably the best option for you.
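The switch between the two setups is a single knob, so you are not locked in. A sketch using the standard HBase configuration files (paths relative to your HBase install):

# conf/hbase-env.sh -- let HBase manage its bundled ZooKeeper (the default):
export HBASE_MANAGES_ZK=true
# Later, to move to a standalone ensemble, set it to false and point
# hbase.zookeeper.quorum in conf/hbase-site.xml at your external ZK hosts.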
In any production environment, a ZK ensemble is recommended to run on separate machines (an odd number of them).
For learning and experimenting, it can co-exist on one machine.
More information at https://zookeeper.apache.org/doc/r3.3.2/zookeeperAdmin.html
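For when you do move to a dedicated ensemble, a minimal zoo.cfg for three nodes looks like the sketch below (hostnames and dataDir are placeholders):

cat > conf/zoo.cfg <<'EOF'
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
EOF
# Each node also needs a myid file in dataDir containing its server number (1-3).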
I have been using a Hadoop cluster, created using Google's script, for a few months.
Every time I boot the machines I have to manually start Hadoop using:
sudo su hadoop
cd /home/hadoop/hadoop-install/sbin
./start-all.sh
Besides scripting, how can I resolve this?
Or is this just the way it is by default?
(The first boot after cluster creation always starts Hadoop automatically; why not every boot?)
You have to configure this using init.d.
The linked document provides more details and a sample script for Datameer; you need to follow similar steps. The script should be smart enough to check that all the nodes in the cluster are up before invoking the start script over SSH.
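A minimal sketch of the idea on a systemd-based image, reusing the user and paths from the question (adapt to an init.d script if your image predates systemd; you may also need to export JAVA_HOME and make sure the hadoop user's SSH keys are in place):

sudo tee /etc/systemd/system/hadoop.service >/dev/null <<'EOF'
[Unit]
Description=Start Hadoop daemons at boot
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
User=hadoop
ExecStart=/home/hadoop/hadoop-install/sbin/start-all.sh

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable hadoop.service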
While different third-party scripts and "getting started" solutions like Cloud Launcher have varying degrees of support for automatically restarting Hadoop on boot, the officially supported tools are bdutil (a do-it-yourself deployment tool) and Google Cloud Dataproc (a managed service), both of which are already configured with init.d and/or systemd to start Hadoop automatically on boot.
More detailed instructions on using bdutil are available here.