I am planning to set up a Hadoop cluster of about 5 machines. From some background reading, I understand that I need to install Hadoop on each of those machines to build the cluster.
Originally I planned to install a Linux distribution on each of these machines, then install Hadoop separately and configure the machines to work together.
Recently I came across some Hadoop distributions, such as Cloudera and Hortonworks. My question is: should I install a distribution such as Cloudera or Hortonworks on each of those machines, or should I install Hadoop separately as I described earlier?
Will using a distribution make my task easier, or will it require more knowledge than a plain Hadoop installation?
I'm a beginner in Hadoop too (~1.5 months). Using a distribution can be very helpful if you use the automated installer (Cloudera Manager for Cloudera or Ambari for Hortonworks). It installs and deploys Hadoop and the services you choose (Hive, Impala, Spark, Hue, ...) across the whole cluster very quickly. The main disadvantage, in my opinion, is that you can't really optimize and customize your installation, but for a first attempt it makes running some simple cases much easier.
I would highly recommend using a distro rather than doing it manually. Even with a distro it will be complicated the first time, as there are a lot of separate services that need to be running, depending on what you want in addition to a base Hadoop install.
Also, do you intend to have a cluster of just 5 machines? If so, Hadoop may not be the right solution for you. You could potentially run all the masters on a single server and have a 4-node cluster, but that is probably not going to perform all that well. Note that the typical replication factor for HDFS is 3, so 4 nodes is just barely enough. If one or two machines go down, you could easily lose data in a production cluster. Personally I would recommend at least 8 nodes and one or two servers for the masters, so a total cluster size of 9 or 10, preferably 10.
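The replication arithmetic behind this can be sketched roughly as follows. This is a back-of-the-envelope model, not how HDFS actually places blocks: it assumes each block's replicas sit on distinct nodes chosen uniformly at random, that the failures happen at once, and it ignores rack awareness and re-replication. The function name is made up for illustration.

```python
from math import comb


def fraction_of_blocks_lost(nodes, replication, failed):
    """Fraction of blocks whose replicas ALL sit on failed nodes.

    Assumption (not real HDFS placement): each block's replicas are on
    `replication` distinct nodes picked uniformly at random, and the
    failures happen together, before any re-replication.
    """
    if failed < replication:
        return 0.0  # at least one replica of every block survives
    return comb(failed, replication) / comb(nodes, replication)


if __name__ == "__main__":
    for nodes in (4, 10):
        for failed in (1, 2, 3):
            lost = fraction_of_blocks_lost(nodes, 3, failed)
            print(f"{nodes} nodes, {failed} failed: {lost:.4f} of blocks lost")
```

Under these assumptions, three simultaneous failures cost a 4-node cluster a quarter of its blocks but a 10-node cluster only 1 in 120. Note also that once a 4-node cluster is down to two nodes, it can no longer hold three replicas of anything.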
I am experimenting with Hadoop and Spark, as the company I work for is getting ready to start spinning up Hadoop and wants to use Spark and other resources to do a lot of machine learning on our data.
Most of that falls to me, so I am preparing by learning on my own.
I have set up a machine as a single-node Hadoop cluster.
Here is what I have:
CentOS 7 (minimal server install, added XOrg and OpenBox for GUI)
Python 2.7
Hadoop 2.7.2
Spark 2.0.0
I followed these guides to set this up:
http://www.tecmint.com/install-configure-apache-hadoop-centos-7/
http://davidssysadminnotes.blogspot.com/2016/01/installing-spark-centos-7.html
When I attempt to run 'pyspark' I get the following:
IPYTHON and IPYTHON_OPTS are removed in Spark 2.0+. Remove these from the environment and set PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS instead.
I opened up the pyspark file in vi and examined it.
I see a lot of stuff going on there, but I don't know where to start to make the corrections I need to make.
My Spark installation is under:
/opt/spark-latest
The pyspark script is under /opt/spark-latest/bin/ and my Hadoop installation (though I don't think this factors in) is under /opt/hadoop/.
I know there must be a change I need to make in the pyspark file somewhere; I just don't know where to begin.
I did some googling and found references to similar things, but nothing that indicated steps in order to fix this.
Can anyone give me a push in the right direction?
If you are just starting to learn how Spark fits into a Hadoop environment, note that at the moment Spark 2.0 isn't officially supported by the main distributions (Cloudera CDH or Hortonworks HDP). I'll go ahead and assume your company isn't standing up Hadoop outside of one of those distributions (because of enterprise support).
That being said, Spark 1.6 (with Hadoop 2.6) is the latest supported version. The reason is that there are a few breaking changes in Spark 2.0.
Now, if you use Spark 1.6, you shouldn't get those errors. Anaconda isn't strictly necessary (the PySpark and Scala shells should just work). If you use Jupyter notebooks, you could look up Apache Toree, which I've had good success with for getting notebooks set up. Otherwise, Apache Zeppelin is probably the recommended notebook environment in a production Hadoop cluster.
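If you do stay on Spark 2.0, the error message itself describes the fix: drop IPYTHON and IPYTHON_OPTS from the environment and use PYSPARK_DRIVER_PYTHON (and its _OPTS companion) instead, rather than editing the pyspark script. Below is a minimal sketch of a launcher that does this, assuming the /opt/spark-latest path from the question. The helper name and the choice of "ipython" as the driver are illustrative assumptions.

```python
import os
import subprocess

# Path taken from the question; adjust for your install.
PYSPARK = "/opt/spark-latest/bin/pyspark"


def spark2_env(env, driver_python="ipython"):
    """Return a copy of env that Spark 2.0's pyspark script will accept.

    Hypothetical helper: removes the variables Spark 2.0 rejects and
    carries their intent over to the replacement variables.
    """
    new_env = dict(env)
    wanted_ipython = new_env.pop("IPYTHON", "") == "1"
    old_opts = new_env.pop("IPYTHON_OPTS", "")
    if wanted_ipython or old_opts:
        new_env.setdefault("PYSPARK_DRIVER_PYTHON", driver_python)
    if old_opts:
        new_env.setdefault("PYSPARK_DRIVER_PYTHON_OPTS", old_opts)
    return new_env


if __name__ == "__main__":
    if os.path.exists(PYSPARK):
        # Launch the shell with the cleaned environment.
        subprocess.call([PYSPARK], env=spark2_env(os.environ))
    else:
        print("pyspark not found at", PYSPARK)
```

The longer-term fix is to find where IPYTHON or IPYTHON_OPTS gets exported (often a shell profile such as ~/.bashrc, though that is a guess about this setup) and remove it there.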
All the documentation about deploying a Spark cluster on Amazon EC2 is written for Linux environments. However, my distributed project currently depends on some Windows functionality, and I would like to start working with a Windows cluster while making the necessary changes.
I would like to know if there is any way to deploy a Windows Spark cluster on EC2 that is relatively similar to the spark-ec2 script provided by Spark.
spark-ec2 currently only supports launching clusters in EC2 using specific Linux AMIs, so deploying a Windows Spark cluster is currently not possible using that tool. I doubt that spark-ec2 will ever have that capability, since all of the setup scripts it uses assume a Linux host.
That said, Databricks recently announced a community-managed index of Spark packages, and people are adding in stuff there all the time. For example, there is already a package to let you launch Spark clusters on Google's Compute Engine.
Though there doesn't currently appear to be anything for you, I would keep my eye on that community index for something that lets you launch Windows Spark clusters on EC2.
On the Spark Packages index suggested by Nick, there is a recently added project by Sigmoid Analytics, spark_azure, that lets you launch a Spark cluster on Azure:
https://github.com/sigmoidanalytics/spark_azure
I am trying to learn Hadoop. Is it possible to install Hadoop on a Linux box and try most (if not all) of the Hadoop utilities?
You can download the CDH3 virtual machine from Cloudera (https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM) and have everything integrated in one VM. IMHO it is the simplest way to start with Hadoop.
Yes. It is possible.
Hadoop can run in two modes locally:
Standalone mode -- no daemons start up; you can run Hadoop jobs locally off of local files.
Pseudo-distributed mode -- effectively distributed mode, but all of the daemons start up locally on one node.
How to set these up and get started with them is documented on the Hadoop site.
Since you say you want to try Hadoop utilities, you probably want to try out pseudo-distributed mode. When using the command-line tools, MapReduce jobs, Pig, Hive, etc., a local cluster running in pseudo-distributed mode will look like a 1000-node cluster (except that it can't hold as much data).
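As a rough illustration of what pseudo-distributed mode involves, the sketch below writes the two small config files that the Apache single-node setup guide describes. The property names and values (fs.defaultFS, hdfs://localhost:9000, dfs.replication of 1) follow the Hadoop 2.x documentation. Older releases such as the 0.20 line used fs.default.name instead, so check the docs for your version. The output directory and helper names are assumptions.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_site_xml(path, properties):
    """Write a Hadoop-style *-site.xml file from a dict of properties."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


def write_pseudo_distributed_conf(conf_dir):
    """Write core-site.xml and hdfs-site.xml for a one-node cluster."""
    # All daemons run on this machine, so the filesystem URI is localhost.
    write_site_xml(os.path.join(conf_dir, "core-site.xml"),
                   {"fs.defaultFS": "hdfs://localhost:9000"})
    # Only one DataNode exists, so keep a single replica of each block.
    write_site_xml(os.path.join(conf_dir, "hdfs-site.xml"),
                   {"dfs.replication": 1})


if __name__ == "__main__":
    # Written to a scratch directory here; a real install would use
    # its own config directory instead.
    out_dir = tempfile.mkdtemp()
    write_pseudo_distributed_conf(out_dir)
    print("wrote config files to", out_dir)
```

Standalone mode needs neither file, which is why nothing has to be started before running a job against local files.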
I am using HBase. I have installed and have the distributed environment running now.
However, it shows a warning in HMaster's interface page:
"You are currently running the HMaster without HDFS append support enabled. This may result in data loss"
How can I solve this if I don't use CDH3's Hadoop? Can someone give me very detailed instructions, please?
Thanks!!!!
As you just found out, you cannot (should not) use the standard Apache release of Hadoop 0.20.* with HBase, as it is missing append support (HDFS-200). There is no official ASF Hadoop release that has append support. Cloudera's release is the easiest way; can you elaborate on why you cannot use it? It is distributed under the same license as the Apache release, and if you use a tarball release it is similar to the Apache release, so you don't need special permission to install RPMs.
The other choices that I am aware of are rolling your own Hadoop from the hadoop-append branch (not fun) and using MapR, which I have no first-hand experience with.
For a while on the HBase mailing lists, some people have had luck replacing the Hadoop jar in their Hadoop install with the Hadoop jar that gets distributed with HBase. That approach does seem fraught with risk, and not everyone is happy with it.
At my house I have about 10 computers, all with different processors and speeds (all x86 compatible). I would like to cluster these. I have looked at openMosix, but since development on it has stopped I have decided against using it. I would prefer to use the latest or next-to-latest version of a mainstream Linux distribution (Suse 11, Suse 10.3, Fedora 9, etc.).
Does anyone know any good sites (or books) that explain how to get a cluster up and running using free open source applications that are common on most mainstream distributions?
I would like a load-balancing cluster for custom software I would be writing. I cannot use something like Folding@home because I need constant contact with every part of the application. For example, I might run a simulation in which one computer controls where rain is falling and another controls what my herbivores are doing.
I recently set up an OpenMPI cluster using Ubuntu. An existing write-up is at https://wiki.ubuntu.com/MpichCluster .
Your question is too vague. What cluster application do you want to use?
By far the easiest way to set up a "cluster" is to install Folding@home on each of your machines. But I doubt that's really what you're asking for.
I have set up clusters for music/video transcoding using simple bash scripts and ssh shared keys before.
I manage mail server clusters at work.
You only need a cluster if you know what you want to do. Come back with an actual requirement, and someone will suggest a solution.
Take a look at Rocks. It's a full-blown cluster "distribution" based on CentOS 5.1. It installs all you need (libs, applications and tools) to run a cluster and is dead simple to install and use. You do all the tweaking and configuration on the master node, and it helps you with kickstarting all your other nodes. I recently installed a 1200+ node (over 10,000 cores!) cluster with it, and I would not hesitate to install it on a 4-node cluster, since installing the master takes next to no work!
You could either run applications written for cluster libs such as MPI or PVM, or you could use the queue system (Sun Grid Engine) to distribute any type of job. Or use distcc to compile code of your choice on all nodes!
And it's open source, GPL, free, everything that you like!
I think he's looking for something similar to openMosix: some kind of general cluster on top of which any application can run distributed among the nodes. AFAIK there's nothing like that available. MPI-based clusters are the closest thing you can get, but I think you can only run MPI applications on them.
Linux Virtual Server
http://www.linuxvirtualserver.org/
I use PVM and it works. But even with just a nice SSH setup that allows login without entering a password, you can easily launch commands remotely on your different computing nodes.
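The passwordless-SSH approach can be sketched in a few lines. This assumes key-based login is already set up for every node. The host names and the command are placeholders, and the helper names are made up for illustration.

```python
import subprocess


def ssh_command(host, remote_cmd):
    """Build the argv for running remote_cmd on host over SSH.

    BatchMode=yes makes ssh fail instead of prompting for a password,
    which is what you want when key-based login should already work.
    """
    return ["ssh", "-o", "BatchMode=yes", host, remote_cmd]


def run_on_nodes(hosts, remote_cmd):
    """Start remote_cmd on every host in parallel and collect the output."""
    procs = {
        host: subprocess.Popen(ssh_command(host, remote_cmd),
                               stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT)
        for host in hosts
    }
    results = {}
    for host, proc in procs.items():
        output, _ = proc.communicate()
        results[host] = (proc.returncode, output.decode(errors="replace"))
    return results


if __name__ == "__main__":
    # Dry run: only show the commands that would be executed.
    for host in ["node1", "node2"]:  # placeholder host names
        print(" ".join(ssh_command(host, "uptime")))
```

Calling run_on_nodes with your real host names starts the command on all of them at once and returns each node's exit code and output. This is enough for independent jobs, but it gives the processes no way to talk to each other, which is what MPI or PVM adds.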