How to Set Up a Low-Cost Cluster - cluster-computing

At my house I have about 10 computers, all with different processors and speeds (all x86 compatible). I would like to cluster these. I have looked at openMosix, but since development on it has stopped I have decided against using it. I would prefer to use the latest or next-to-latest version of a mainstream Linux distribution (Suse 11, Suse 10.3, Fedora 9, etc.).
Does anyone know any good sites (or books) that explain how to get a cluster up and running using free open source applications that are common on most mainstream distributions?
I would like a load-balancing cluster for custom software I will be writing. I cannot use something like Folding@home because I need constant contact with every part of the application. For example, if I were running a simulation, one computer might control where rain is falling while another controls what my herbivores are doing.

I recently set up an OpenMPI cluster using Ubuntu. An existing write-up is at https://wiki.ubuntu.com/MpichCluster.

Your question is too vague. What cluster application do you want to use?
By far the easiest way to set up a "cluster" is to install Folding@home on each of your machines. But I doubt that's really what you're asking for.
I have set up clusters for music/video transcoding using simple bash scripts and ssh shared keys before.
I manage mail server clusters at work.

You only need a cluster if you know what you want to do. Come back with an actual requirement, and someone will suggest a solution.

Take a look at Rocks. It's a full-blown cluster "distribution" based on CentOS 5.1. It installs all you need (libs, applications and tools) to run a cluster and is dead simple to install and use. You do all the tweaking and configuration on the master node, and it helps you kickstart all your other nodes. I recently installed a cluster of 1200+ nodes (over 10,000 cores!) with it, and I would not hesitate to install it on a 4-node cluster, since the work needed to install the master is practically none!
You could either run applications written for cluster libs such as MPI or PVM, or use the queue system (Sun Grid Engine) to distribute any type of job. Or use distcc to compile your code of choice on all nodes!
And it's open source, gpl, free, everything that you like!

I think he's looking for something similar to openMosix: some kind of general cluster on top of which any application can run distributed among the nodes. AFAIK there's nothing like that available. MPI-based clusters are the closest thing you can get, but I think you can only run MPI applications on them.
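To give an idea of what "an MPI application" means here, below is a minimal sketch, not a definitive implementation. It assumes the third-party mpi4py binding is installed on every node, and the subsystem names and the `split_work` helper are hypothetical, borrowed from the rain/herbivore example in the question. Each process takes a share of the simulation and rank 0 collects the results.

```python
def split_work(items, size, rank):
    """Return the items handled by the process with this rank (round-robin)."""
    return items[rank::size]


def main():
    # Assumption: mpi4py is installed; it is not part of the standard library.
    try:
        from mpi4py import MPI
    except ImportError:
        print("mpi4py is not installed; nothing to run")
        return

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # index of this process
    size = comm.Get_size()   # total number of processes

    # Hypothetical parts of the simulation, one or more per process.
    subsystems = ["rain", "herbivores", "plants", "predators"]
    mine = split_work(subsystems, size, rank)

    # Stand-in for a real simulation step on this process's subsystems.
    local_state = {name: "updated by rank %d" % rank for name in mine}

    # Collect every process's state on rank 0.
    all_states = comm.gather(local_state, root=0)
    if rank == 0:
        for state in all_states:
            print(state)


if __name__ == "__main__":
    main()
```

Such a script is normally started through the MPI launcher that ships with your MPI implementation, so that one copy runs per process across the nodes.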

Linux Virtual Server
http://www.linuxvirtualserver.org/

I use PVM and it works. But even with nothing more than a nice ssh setup that allows login without entering a password, you can easily launch commands remotely on your different computing nodes.
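As a rough sketch of the ssh approach, assuming key-based login already works: the host names below are hypothetical and the helper functions are made up for illustration. It runs the same command on every node in parallel and prints what came back.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical host names; replace them with your own nodes.
NODES = ["node01", "node02", "node03"]


def build_ssh_command(host, remote_command):
    """Return the argv list that runs remote_command on host via ssh."""
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # which is the behaviour you want once key-based login is set up.
    return ["ssh", "-o", "BatchMode=yes", host, remote_command]


def run_on_node(host, remote_command, timeout=60):
    """Run one command on one node; return (host, exit code, stdout)."""
    result = subprocess.run(
        build_ssh_command(host, remote_command),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return host, result.returncode, result.stdout


def run_everywhere(hosts, remote_command):
    """Run the same command on every host in parallel, one thread per host."""
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        return list(pool.map(lambda h: run_on_node(h, remote_command), hosts))


if __name__ == "__main__":
    for host, code, out in run_everywhere(NODES, "uname -n"):
        print(host, code, out.strip())
```

Threads are enough here because each worker just waits on an ssh process; the real work happens on the remote nodes.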

Related

In GCP, what is the difference between SSH'ing into a VM and using Cloud Shell?

I'm trying to learn ML on GCP. Some of the Qwiklabs and tutorials start with Cloud Shell to set up things like env variables and install Python packages, while others start by opening an SSH terminal into a VM to do those preliminary steps.
I can't really tell the difference between the two approaches, other than the fact that in the second case a VM needs to be provisioned first. Presumably, when you use Cloud Shell some sort of VM instance is being provisioned for you behind the scenes anyway.
So how are the two approaches different?
Cloud Shell is a product designed to give you a large number of preconfigured tools that are kept updated, while being quick to start, accessible from the UI, and free. Basically, it's a quick way to get an interactive shell. You can learn more about this environment from its documentation.
There are also limits to Cloud Shell -- you can only use it for 60 hours a week, if you go idle your session is terminated, and there is only 5GB of storage. It is also only an f1-micro instance, IIRC. So while it is provisioned for you (and free!), it isn't really useful for anything other than an interactive shell.
On the other hand, SSHing into a VM places you directly in a terminal on that VM, much like you would on any specific host -- you only have whatever tools that the image installed onto that VM provides (and many VMs come pretty bare bones, it depends on the image). But, you're now in a terminal on the host that is likely executing the code you want to work with, and it has as much CPU and RAM as you provisioned in that instance.
As far as guides pointing you to one or the other, that's really up to them, but I suspect they'd point client/tool-type work to Cloud Shell (since it's easy and a reasonably standard environment, which can even be scripted with tutorials), while they'd probably point anything about installing software for production use to a "real" VM.

Is it possible to install CDH on a RHEL7 server where Hadoop and a few other components are installed separately?

I have a RHEL7 server on which I am trying to create a common data lake platform for POC and learning purposes. I have set up Hadoop, Hive, Zookeeper, Kafka, Spark, and Sqoop separately.
Installing these components separately turns out to be a tricky affair and is taking a lot of effort, even though this is for internal use and not production.
I am now trying to install the CDH package on this server.
Is it possible to do so? Will it overlap with the current installations?
How can this be achieved?
Note: The reason we went with separate installations is that the server had no internet access at the time.
The reason for going with CDH now is that internet access is available for a few days after some approvals; plus, CDH saves a lot of time and effort and includes the components required to set up a data lake.
Can someone please help me out here?
Yes, it is feasible to set up CDH with Docker without disturbing the existing configs. Check out the link below for a setup guide. I have tested this, and it works fine even with the individual tools already set up.
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/quickstart_docker_container.html

Running Jenkins slave on different OS than master (and host)

I'm trying to introduce continuous integration in an old project, and we have a quite specific situation: the CI server can only be placed on our test server, which runs CentOS. The server has quite a lot of unused RAM and CPU capacity.
However, we need to run the Ant builds on Windows (this is also how the project did packaging before), because it turned out that the Unix versions of Java and Ant do not produce the same output (checked with a binary compare).
I drew up a diagram of how in my mind it could work, but I'm really wondering whether that is even possible (with already given tools).
The black part is implemented, I'm curious whether the red part could be possible. Could the Jenkins slave communicate with master on different OS?
It should be possible. I have a feeling you will need to play with your network settings. But before you start changing anything, see if you can start a headless slave by following these directions: https://wiki.jenkins-ci.org/display/JENKINS/Step+by+step+guide+to+set+up+master+and+slave+machine
Using VirtualBox on CentOS, it will be possible to run a Windows VM on your CentOS host.
I'm not sure you need Docker to launch your Jenkins slave.
It may be better to use a standard JNLP Windows service to connect your Windows slave to the Dockerised Jenkins master.
If the master is not able to view the Windows node using this method, you may have to tweak your network configuration on the Windows VM.
But I'm not sure it's necessary.

Can I use a Hadoop distribution instead of manually installing?

I am planning to implement a hadoop cluster with about 5 machines. With some background study, I understood that I need to install hadoop on each of those machines in order to implement the cluster.
Earlier I was planning to install a Linux distribution on each of these machines, and then install hadoop separately, and configure each machine to work in parallel.
Recently I came across some Hadoop distributions, such as Cloudera and Hortonworks. My question is, should I install a distribution such as Cloudera or Hortonworks on each of those machines, or should I install Hadoop separately as I described earlier?
Will using a distribution make my task easier or would it need more knowledge to handle them than pure hadoop installation?
I'm a beginner in Hadoop too (~1.5 months). Using a distribution can be very helpful if you use the automated installer (Cloudera Manager for Cloudera or Ambari for Hortonworks). It installs and deploys Hadoop and the services you choose (Hive, Impala, Spark, Hue, ...) across the whole cluster very quickly. The main disadvantage, in my opinion, is that you can't really optimize and customize your installation, but for a first attempt it's much easier to get some simple cases running.
I would highly recommend using a distro rather than doing it manually. Even using a distro will be complicated the first time, as there are a lot of separate services that need to be running, depending on what you want in addition to a base Hadoop install.
Also, do you intend to have a cluster size of just 5 machines? If so, Hadoop may not be the right solution for you. You could potentially run all the masters on a single server and have a 4-node cluster, but that is probably not going to perform all that well. Note that the typical redundancy for HDFS is 3, so 4 nodes is just barely enough. If one or two machines go down, you could easily lose data in a production cluster. Personally, I would recommend at least 8 nodes and one or two servers for the masters, so a total cluster size of 9 or 10, preferably 10.
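The arithmetic behind "4 nodes is just barely enough" can be sketched as follows. This is only an illustration: the function names and the 2 TB per-node disk size are assumptions, and it relies on HDFS placing the replicas of a block on distinct data nodes.

```python
def usable_capacity_tb(data_nodes, disk_per_node_tb, replication=3):
    """Approximate space left for data once every block is stored `replication` times."""
    return data_nodes * disk_per_node_tb / replication


def spare_nodes(data_nodes, replication=3):
    """Data nodes that can fail while each replica can still sit on its own node."""
    return data_nodes - replication


if __name__ == "__main__":
    disk_tb = 2.0  # hypothetical disk size per data node
    for nodes in (4, 8):
        print(
            nodes, "data nodes:",
            round(usable_capacity_tb(nodes, disk_tb), 2), "TB usable,",
            spare_nodes(nodes), "spare node(s)",
        )
```

With 4 data nodes and a replication factor of 3 there is exactly one spare node, so a second failure leaves blocks that cannot be brought back to full replication; 8 data nodes leave five spare.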

Apache Cassandra and Windows

What are the fine-tuning configurations for Apache Cassandra on a Windows machine? I have seen "Unable to create new native thread" caused by too low a "max user processes" limit on Linux, and one of the solutions is [1].
[1] http://vanjikumaran.blogspot.com/2014/01/unable-to-create-new-native-thread-and.html
So, what are the best practices for Apache Cassandra configuration and OS settings on Windows?
The best practice for "Cassandra on Windows" right now is "don't". There are a bunch of edge case issues that crop up on Windows because things like file handles behave slightly differently and do not have the same guarantees they do on Linux.
It works well enough to run a dev/test instance on your Windows box for development purposes. But for anything other than that you should really use Linux, as that is what everyone else uses, and it has the most testing.
Here is a blog post with the current status of Cassandra on Windows:
http://www.datastax.com/dev/blog/cassandra-and-windows-past-present-and-future
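On the Linux side, the "max user processes" limit mentioned in the question can be inspected from Python's standard `resource` module. A minimal sketch, with hypothetical helper names; note that `resource` is Unix-only and does not exist on Windows:

```python
import resource  # standard library, Unix-only


def nproc_limits():
    """Return the (soft, hard) "max user processes" limits for this process."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


def describe_limit(value):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    soft, hard = nproc_limits()
    print("max user processes: soft =", describe_limit(soft),
          "hard =", describe_limit(hard))
```

The soft value is the one the "Unable to create new native thread" error runs into; it can be raised up to the hard value without extra privileges.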
