All the documentation about deploying a Spark cluster on Amazon EC2 assumes a Linux environment. However, my distributed project currently depends on some Windows functionality, and I would like to start working with a Windows cluster while I make the necessary changes.
Is there any way to deploy a Windows Spark cluster on EC2, in a manner reasonably similar to the spark-ec2 script provided by Spark?
spark-ec2 currently only supports launching clusters on EC2 using specific Linux AMIs, so deploying a Windows Spark cluster is not possible with that tool. I doubt that spark-ec2 will ever gain that capability, since all of the setup scripts it uses assume a Linux host.
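For reference, the Linux-only workflow that spark-ec2 does support looks roughly like this (cluster name, key pair, and slave count are placeholders; check the Spark EC2 docs for the full option list):

```shell
# Credentials come from the environment; values here are placeholders.
export AWS_ACCESS_KEY_ID=<your-access-key>
export AWS_SECRET_ACCESS_KEY=<your-secret-key>

# Launch a cluster of 1 master + 2 slaves, all on the project's Linux AMI:
./spark-ec2 --key-pair=my-keypair \
            --identity-file=my-keypair.pem \
            --slaves=2 \
            launch my-spark-cluster

# ...and tear it down later:
./spark-ec2 destroy my-spark-cluster
```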
That said, Databricks recently announced a community-managed index of Spark packages, and people are adding in stuff there all the time. For example, there is already a package to let you launch Spark clusters on Google's Compute Engine.
Though there doesn't currently appear to be anything there for your use case, I would keep an eye on that community index for something that lets you launch Windows Spark clusters on EC2.
On the Spark Packages index that Nick suggested, you can see a recently added project by Sigmoid Analytics, spark_azure, which lets you launch a Spark cluster on Azure:
https://github.com/sigmoidanalytics/spark_azure
I have an RHEL 7 server on which I am trying to create a common data lake platform for POC and learning purposes. I have set up Hadoop, Hive, ZooKeeper, Kafka, Spark, and Sqoop separately.
Installing these components separately has turned out to be tricky and is taking a lot of effort, even though this is for internal use and not production.
I am now trying to install the CDH package on this server.
Is it possible to do so? Will it overlap with the current installations? How can this be achieved?
Note: the reason we went with separate installations was the unavailability of internet access on the server at that time. The reason for going with CDH now is that internet access will be available for a few days after some approvals, plus CDH saves a lot of time and effort and includes the components required to set up a data lake.
Can someone please help me out here?
Yes, it is feasible to set up CDH with Docker without disturbing your existing configs. Check out the link below for a setup guide. I have tested this, and it works fine even though I have the individual tools set up.
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/quickstart_docker_container.html
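The core of that guide boils down to two commands (from memory, so double-check the linked page; the image tag and port mappings may differ for your CDH version):

```shell
# Pull the single-node CDH quickstart image from Docker Hub:
docker pull cloudera/quickstart:latest

# Start it with an interactive shell; --privileged is needed for the
# bundled services, and 8888 is forwarded for the Hue web UI.
docker run --hostname=quickstart.cloudera --privileged=true -t -i \
    -p 8888:8888 \
    cloudera/quickstart /usr/bin/docker-quickstart
```

Because everything runs inside the container, it will not touch the Hadoop/Hive/Spark installs you already have on the host.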
The Spark distribution includes an EC2 launch script that points to a location on GitHub for Spark AMIs. Unfortunately, the AMI (there is only one) is Amazon Linux, which is very limited. Specifically, the Amazon Linux AMI has limited package support.
So if, for example, I want PHP 5.4 (instead of the default 5.3) on Amazon Linux, I'm out of luck.
Are there any non-Amazon-Linux AMIs available for use with spark-ec2?
I don't know of an up-to-date set of Spark AMIs apart from the ones provided by the Spark project.
That said, I have developed a way using Packer to automatically create a set of Spark AMIs from a set of base AMIs and some Bash scripts:
https://github.com/nchammas/spark-ec2/tree/packer/image-build
This is being done as part of SPARK-3821.
You'll need to do some work to get this to work with Ubuntu, since the scripts currently assume a yum-based Linux distribution.
Basically:
These lines define the base AMIs to build on.
These lines show the scripts that are being run to build the image.
These and these lines tell Packer to copy the built AMIs to all EC2 regions. You probably want to change this.
The shortest path to success for you might be to try a CentOS or Fedora base image that has the packages you are looking for. That will minimize the changes you have to make to the Bash scripts.
Around the Spark 1.4 release timeframe (roughly June/July 2015), I will work to have this merged into the main spark-ec2 repo.
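With that setup, producing the AMIs is driven by a single Packer run against a template file (the filename here is illustrative; the actual template lives in the linked repo):

```shell
# Check the template for errors, then build the AMIs it describes.
# Packer launches a temporary builder instance from each base AMI,
# runs the provisioning scripts, and snapshots the result.
packer validate spark-ami.json
packer build spark-ami.json
```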
I am planning to set up a Hadoop cluster of about 5 machines. From some background study, I understand that I need to install Hadoop on each of those machines in order to build the cluster.
Originally I was planning to install a Linux distribution on each machine, install Hadoop separately, and configure each machine to work in parallel.
Recently I came across Hadoop distributions such as Cloudera and Hortonworks. My question is: should I install a distribution such as Cloudera or Hortonworks on each of those machines, or should I install Hadoop separately as I described earlier?
Will using a distribution make my task easier, or would it require more knowledge to handle than a plain Hadoop installation?
I'm a beginner in Hadoop too (~1.5 months). Using a distribution can be very helpful if you use the automated installation path (Cloudera Manager for Cloudera, or Ambari for Hortonworks). It installs and deploys Hadoop plus the services you choose (Hive, Impala, Spark, Hue, ...) across the whole cluster very quickly. The main disadvantage, in my opinion, is that you can't really optimize and personalize your installation, but for a first time it's much easier to get some simple cases running.
I would highly recommend using a distro rather than doing it manually. Even using a distro will be complicated the first time, as there are a lot of separate services that need to be running depending on what you want, in addition to a base Hadoop install.
Also, do you intend to have a cluster of just 5 machines? If so, Hadoop may not be the right solution for you. You could potentially run all the masters on a single server and have a 4-node cluster, but that is probably not going to perform all that well. Note that the typical replication factor for HDFS is 3, so 4 nodes is just barely enough; if one or two machines go down, you could easily lose data in a production cluster. Personally, I would recommend at least 8 nodes plus one or two servers for the masters, so a total cluster size of 9 or 10, preferably 10.
I have successfully installed Hadoop in pseudo-distributed mode on a VM. I write code in Eclipse, export it as a jar file onto the Hadoop cluster, and then debug there. Now, just for learning purposes, I am trying to install Hadoop in local (standalone) mode on my Windows machine. That way I can test without going through all the hassle of creating jar files, exporting them, and testing on the Hadoop cluster.
My question is: can anyone help me understand how Hadoop works in local mode (HDFS vs. the local file system) on Windows, and how I can configure Hadoop locally on the Windows machine (what steps should I follow)?
I tried following various blogs on this but was not able to understand much from them, so I'm posting here.
Let me know if any other information is needed. Thanks in advance.
Unfortunately, you can't use Hadoop on Windows out of the box; however, you can use Cygwin to achieve effectively the same thing.
I managed to get local mode and distributed mode running directly from Cygwin, but I was unable to get pseudo-distributed mode to work nicely due to various cygpath conversion issues between Unix- and Windows-style paths.
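Local (standalone) mode itself needs almost no configuration, which is what makes it attractive for quick testing. A rough sketch of a run under Cygwin (jar name and paths are illustrative):

```shell
# In local mode core-site.xml and mapred-site.xml can stay at their
# defaults: fs.defaultFS is file:/// (no HDFS daemon) and the whole
# job runs in a single JVM, reading and writing the local filesystem.
hadoop jar hadoop-examples.jar wordcount file:///c:/input file:///c:/output
```

Because there is no HDFS involved, you can inspect the output directory directly with Windows tools, which is exactly the fast edit-test loop the question is after.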
However, in practice I still build the jars and send them straight across to the cluster using rsync, as that is much faster once your project reaches a certain size, and remote debugging can be done from Eclipse on Windows against the Linux cluster.
At my house I have about 10 computers, all with different processors and speeds (all x86-compatible). I would like to cluster them. I have looked at openMosix, but since development on it has stopped, I have decided against using it. I would prefer to use the latest (or next-to-latest) version of a mainstream Linux distribution (SUSE 11, SUSE 10.3, Fedora 9, etc.).
Does anyone know any good sites (or books) that explain how to get a cluster up and running using free open source applications that are common on most mainstream distributions?
I would like a load-balancing cluster for custom software I will be writing. I can't use something like Folding@home because I need constant contact with every part of the application. For example, if I were running a simulation, one computer might control where rain is falling while another controls what the herbivores are doing in the simulation.
I recently set up an OpenMPI cluster using Ubuntu. An existing write-up is at https://wiki.ubuntu.com/MpichCluster .
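Once OpenMPI is installed on every machine and passwordless ssh is working between them, running something across the cluster is just a hostfile plus one command (hostnames and slot counts below are placeholders):

```shell
# hostfile: one line per machine, with how many processes it may run.
cat > hostfile <<'EOF'
node1 slots=4
node2 slots=2
node3 slots=2
EOF

# Launch 8 copies of the program, spread across the listed nodes:
mpirun --hostfile hostfile -np 8 ./my_simulation
```

This fits the rain/herbivore example in the question: each MPI rank can own one part of the simulation and they exchange state with MPI messages.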
Your question is too vague. What cluster application do you want to use?
By far the easiest way to set up a "cluster" is to install Folding@home on each of your machines. But I doubt that's really what you're asking for.
I have set up clusters for music/video transcoding using simple bash scripts and ssh shared keys before.
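A stripped-down sketch of that bash-plus-ssh approach, dispatching transcoding jobs round-robin across machines. Hostnames and filenames are placeholders, and `echo` stands in for actually running `ssh`, so the script just prints the dispatch plan:

```shell
# Round-robin job dispatch over ssh shared keys.
# 'echo' prints the plan; drop it to really run the remote commands.
HOSTS="node1 node2 node3"
set -- $HOSTS
NHOSTS=$#

i=0
for f in clip1.avi clip2.avi clip3.avi clip4.avi; do
    # Pick the (i mod NHOSTS)-th host for this file.
    set -- $HOSTS
    shift $(( i % NHOSTS ))
    host=$1
    echo "ssh $host ffmpeg -i $f $f.mp4"
    i=$(( i + 1 ))
done
```

With shared keys already in place, no passwords are prompted for, so the whole loop runs unattended.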
I manage mail server clusters at work.
You only need a cluster if you know what you want to do. Come back with an actual requirement, and someone will suggest a solution.
Take a look at Rocks. It's a full-blown cluster "distribution" based on CentOS 5.1. It installs everything you need (libraries, applications, and tools) to run a cluster and is dead simple to install and use. You do all the tweaking and configuration on the master node, and it helps you kickstart all your other nodes. I recently installed a 1200+ node (over 10,000 cores!) cluster with it, and I would not hesitate to install it on a 4-node cluster, since installing the master is hardly any work!
You can either run applications written for cluster libraries such as MPI or PVM, or use the queueing system (Sun Grid Engine) to distribute any type of job. Or use distcc to compile the code of your choice on all nodes!
And it's open source, GPL-licensed, free: everything you like!
I think he's looking for something similar to openMosix: some kind of general-purpose cluster on top of which any application can run distributed among the nodes. AFAIK there's nothing like that available. MPI-based clusters are the closest thing you can get, but I think you can only run MPI applications on them.
Linux Virtual Server
http://www.linuxvirtualserver.org/
I use PVM and it works. But even with just a nice SSH setup that allows logging in to a machine without entering a password, you can easily launch commands remotely on your different compute nodes.
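Setting up that passwordless login is a one-time job per node (usernames and hostnames here are placeholders):

```shell
# Generate a key pair with an empty passphrase, then push the public
# key to each compute node's authorized_keys.
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id user@node1
ssh-copy-id user@node2

# After that, commands run remotely with no password prompt:
ssh node1 uptime
```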