What is the purpose of a single Hadoop node?

I am new to Hadoop so this may seem like a silly question.
The purpose of Hadoop is to distribute processing power and storage across multiple computers.
So what is the purpose of a single Hadoop node? It's only one computer, so there is no distribution or sharing of resources available, is there?

Strictly learning and getting started.
It is also useful for local unit testing of very small workloads without touching a standing production cluster or a large dataset - for example, parsing a file and making sure the logic works before you make it scalable and run into issues at scale.
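A minimal sketch of that kind of single-node test, assuming a hypothetical parse_record function and a hypothetical "timestamp,user,bytes" record format that would later become the mapper logic:

    # test_parser.py - unit-test parsing logic locally before scaling it out.
    # parse_record and the record format are hypothetical examples.
    import unittest

    def parse_record(line):
        # Split a comma-separated record into typed fields.
        timestamp, user, raw_bytes = line.rstrip("\n").split(",")
        return timestamp, user, int(raw_bytes)

    class ParseRecordTest(unittest.TestCase):
        def test_valid_line(self):
            self.assertEqual(parse_record("2016-01-01T00:00:00,alice,1024\n"),
                             ("2016-01-01T00:00:00", "alice", 1024))

        def test_malformed_line_raises(self):
            with self.assertRaises(ValueError):
                parse_record("not,a,number")

    if __name__ == "__main__":
        unittest.main()

Once the logic is trusted, the same function can be dropped into a MapReduce or Spark job and run on a real cluster.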


High Performance Cluster Virtualization required in Hadoop on Mesos

Our department at work just bought 4 nodes (servers) each with 80 cores and a bunch of memory and disk space.
We are just in the beginning stages and want to make sure that the nodes are brought into a cluster correctly for what we will want to use it for as well as future use.
Anticipated use is focused on machine learning / big data; essentially we are the advanced analytics team. We have SQL servers and databases set up for the full data. Our primary objective is to use the data to gain business insights, develop algorithms, and build optimization engines for the org's data and processes. Tools we might need at some point:
-Docker images for developed applications
-A place to run jobs when developing new algorithms, in batch and maybe in real time
-Python ML algorithms
-Spark jobs
-Possibly a Hadoop cluster (uncertain about this one for now)
-The ability to run batch jobs as well as interactive jobs
Our current plan is to run Chronos and eventually Marathon as well for the scheduling. We plan on Apache Mesos for the resource management.
Finally, to the question. Our IT department informed us that to run a Hadoop cluster, we have to virtualize each node. This virtualization takes up 8 cores on each node as well as gigabytes of memory and a ton of disk space. Are they correct? How can we reduce the overhead so we aren't consuming 10-20% of our resources just setting up the servers?
As an added bonus, are there good books on setting up a Mesos cluster, adding Hadoop, and configuring everything?
Based on some comments, maybe we don't need Hadoop, in which case we wouldn't need virtualization.
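If Hadoop does turn out to be unnecessary, note that Spark can talk to Mesos directly, with no Hadoop layer in between. A rough sketch, assuming a Spark build with Mesos support; the master URL, resource sizes, and input path are illustrative assumptions:

    # mesos_smoke_test.py - trivial PySpark job submitted straight to Mesos.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("mesos-smoke-test")
             .master("mesos://mesos-master.example.com:5050")  # hypothetical master
             .config("spark.executor.memory", "16g")
             .getOrCreate())

    # Count lines in a shared file just to confirm executors launch via Mesos.
    print(spark.read.text("/shared/data/sample.txt").count())
    spark.stop()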

Difference between BOINC and Hadoop/Spark/etc

What's the difference between BOINC (https://en.wikipedia.org/wiki/Berkeley_Open_Infrastructure_for_Network_Computing) and general big data frameworks like Hadoop/Spark/etc.? They all seem to be distributed computing frameworks - are there places where I can read about the differences, or about BOINC in particular?
It seems the Large Hadron Collider in Europe is using BOINC - why not Hadoop?
Thanks.
BOINC is software that can use the unused CPU and GPU cycles on a computer to do scientific computing.
BOINC is strictly a single application that enables grid computing using unused computation cycles.
Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework.
The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part called MapReduce.
(emphasis added to framework and its dual functionality)
Here, you see Hadoop is a framework (also referred to as an ecosystem) that has both storage and computing capabilities. Hadoop vendors such as Cloudera and Hortonworks bundle additional functionality into it (Hive, HBase, Pig, Spark, etc.) as well as a few security/auditing tools.
Additionally, hardware failure is handled differently by these two clusters. If a BOINC node dies, there is no fault tolerance; those resources are lost. In the case of Hadoop, data is replicated and tasks are re-run a certain number of times before eventually failing, but these steps are traceable as long as the logging services built into the framework are running.
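To make the MapReduce half concrete, here is a minimal word-count sketch using Hadoop Streaming with a single Python script acting as both mapper and reducer; the streaming jar and the input/output paths in the comments are illustrative assumptions:

    # wordcount.py - Hadoop Streaming word count; run the same script as mapper
    # and reducer, e.g. (paths are placeholders):
    #   hadoop jar hadoop-streaming.jar -input /in -output /out \
    #     -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" \
    #     -file wordcount.py
    # It can also be tested on one machine without any cluster:
    #   cat input.txt | python wordcount.py map | sort | python wordcount.py reduce
    import sys

    def mapper():
        # Emit one "word<TAB>1" pair per word read from stdin.
        for line in sys.stdin:
            for word in line.split():
                print("%s\t1" % word)

    def reducer():
        # Hadoop delivers mapper output sorted by key, so equal words are adjacent.
        current, count = None, 0
        for line in sys.stdin:
            word, _, value = line.rstrip("\n").partition("\t")
            if word != current:
                if current is not None:
                    print("%s\t%d" % (current, count))
                current, count = word, 0
            count += int(value)
        if current is not None:
            print("%s\t%d" % (current, count))

    if __name__ == "__main__":
        mapper() if sys.argv[1:2] == ["map"] else reducer()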
It seems the Large Hadron Collider in Europe is using BOINC - why not Hadoop?
Because BOINC provides software that anyone in the world can install to join the cluster, they gain a large amount of computing power from anywhere, practically for free.
They might be using Hadoop internally to do some storage and perhaps Spark to do additional computing, but buying commodity hardware in bulk and building/maintaining that cluster seems cost-prohibitive.
What BOINC and Hadoop have in common is that they exploit the fact that a big problem can be solved in many parts. Both are most associated with distributing data across many computers, rather than an application.
The difference is the degree of synchronisation between all contributing machines. With Hadoop, the synchronisation is very tight: at some point you expect all data to be collected from all machines in order to come up with the final analysis. You literally wait for the last one, and nothing is returned until that last fraction of the job has completed.
With BOINC, there is no synchronicity at all. You have many thousands of jobs to be run. The BOINC server side, run by the project maintainers, orchestrates the delivery of jobs to the BOINC clients run by volunteers.
With BOINC, the project maintainers have no control over the clients at all. If a client does not return a result, the work unit is sent out again elsewhere. With Hadoop, the whole cluster is accessible to the project maintainer. With BOINC, the application has to be provided for different platforms since it is completely uncertain what platform a volunteer offers; with Hadoop, everything is well-defined and typically very homogeneous. BOINC's largest projects have many tens of thousands of regular volunteers; Hadoop has whatever you can afford to buy or rent.

What areas of Hadoop's network usage are interesting to observe and profile, as part of a college project?

I'm planning to profile a part of Hadoop's MapReduce for a grad school project, focusing on the network-related aspects. I have found a few papers on the topic, but I was wondering if there are some well-known areas of study, and some existing resources about them.
I don't need to break any new ground. Even if I can only reproduce a well-known existing pattern of network utilization, that is good enough.
The best way to come up with the full list of network-related bottlenecks that could occur in a MapReduce scenario is to understand how the daemons work with each other.
Get to know the entire flow of a MapReduce job. You can find this in a blog post I wrote some time back - Introducing Hadoop.
The JobTracker and the TaskTracker are the daemons that actually do the work in a Hadoop environment, so how the JobTracker assigns tasks and how the TaskTracker responds is an area that is prone to bottlenecks in the case of network issues.
The MapReduce "Shuffle and Sort" phase is another keyword you could look up; network issues there can cause major latency.
Also, as you probably already know, each node in the cluster needs to have passwordless SSH access to the other nodes. This is another area that could be affected by network issues.
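If you want raw numbers to go with those areas, a rough sketch for sampling per-node network throughput while a job (for example its shuffle phase) is running; psutil is assumed to be installed, and the one-minute sampling window is arbitrary:

    # net_sample.py - print MB/s sent/received on this node once per second.
    # Run it on the worker nodes while the MapReduce job executes and line the
    # samples up against the job's map, shuffle, and reduce phases.
    import time
    import psutil

    prev = psutil.net_io_counters()
    for _ in range(60):  # sample for one minute
        time.sleep(1)
        cur = psutil.net_io_counters()
        print("sent %8.2f MB/s  recv %8.2f MB/s" % (
            (cur.bytes_sent - prev.bytes_sent) / 1e6,
            (cur.bytes_recv - prev.bytes_recv) / 1e6))
        prev = cur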
I don't have any specific links to point but I hope I was able to point you in the correct direction.

Are there circumstances where an Akka-based application can replace a Hadoop setup?

From reading about Akka and my own beginning uses of it, it seems to me that Akka could be used, more simply than a Hadoop setup, for some applications. You wouldn't have HDFS available, but you could write an application that sends out pieces of work to different "mappers" and has the results sent to a "reducer", and it would be easier to set up than Hadoop in VMs or on hardware, with fewer services to configure.
Is this reasonable or are the two technologies used for totally different things?
Yes, totally reasonable. We have built a large scale (1000+ workers) map-reduce system using Akka 2.0. Akka 2.2+ is even better because you can use the clustering and remote deathwatch features instead of having to write that functionality yourself.
See this post to get a feel for how it might work.
Akka cluster is currently marked experimental, but the Akka team say it's more or less ready for prime time and people are using it in production. I would be very cautious about going in this direction; you may instead want to consider Hadoop, or using ZooKeeper with Akka and ZeroMQ or a message queue for horizontal scaling.
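For what it's worth, the mapper/reducer flow described in the question is not tied to Akka. The sketch below is plain single-machine Python (not Akka code), but it shows the same shape - split the work, send pieces to mappers, merge at a reducer - that Akka actors would distribute across nodes; the word-count workload and worker count are arbitrary illustrative choices:

    # pattern_sketch.py - the map/reduce message flow in miniature.
    from collections import Counter
    from concurrent.futures import ProcessPoolExecutor

    def mapper(chunk):
        # Each "mapper" counts words in its piece of the input.
        return Counter(chunk.split())

    def reducer(partials):
        # The "reducer" merges the partial counts into a final result.
        total = Counter()
        for partial in partials:
            total.update(partial)
        return total

    if __name__ == "__main__":
        chunks = ["the quick brown fox", "jumps over the lazy dog", "the end"]
        with ProcessPoolExecutor(max_workers=3) as pool:
            print(reducer(pool.map(mapper, chunks)))

In Akka, each mapper would be an actor (possibly on another machine), the reducer an actor aggregating their replies, and the cluster/deathwatch features mentioned above would handle workers disappearing.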

How to use a "Rocks" cluster

I've just joined a research lab at my university and been given access to a cluster to compile and run the C++ code that I write. I use SSH to access it and simply use the cluster like a Linux terminal.
I often have to wait a relatively long time while my code runs. I'm trying to figure out if there's a more efficient way to use the cluster. For example, there are different CPUs/nodes in the cluster, some of which are more heavily used than others. How do I access a specific CPU? I have access to the "Ganglia" overview page, which gives information about the different nodes.
Also, if I run 2 processes in different SSH windows, will they automatically use different processors or nodes, or do I have to specify that manually?
I couldn't find any documentation to help me with these issues, so I'd appreciate a little help.
Thanks
Simply running something on a cluster does not mean it is taking advantage of the cluster at all. By default, it will probably just run on the head node. Software needs to be written specifically for a cluster.
There is likely to be some kind of scheduler running that you need to interface with. Perhaps you could also see if distcc is installed and configured for your particular cluster (for doing the compilation across multiple machines). There may also be a particular flavour of MPI running to allow processes on different nodes to communicate.
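If MPI is available, the usual pattern is to launch one process per core across the nodes and let them split the work by rank. A toy sketch using mpi4py, assuming an MPI implementation and mpi4py are installed on the cluster and jobs are launched through the scheduler with something like mpiexec:

    # mpi_sum.py - each MPI rank sums its slice of a range, rank 0 combines them.
    # Launch with e.g.: mpiexec -n 4 python mpi_sum.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # this process's id
    size = comm.Get_size()   # total number of processes across all nodes

    partial = sum(range(rank, 1000000, size))         # this rank's share of the work
    total = comm.reduce(partial, op=MPI.SUM, root=0)  # combine on rank 0
    if rank == 0:
        print("total from %d processes: %d" % (size, total))

The same idea carries over to C++ with the MPI C bindings, but how you request nodes and cores from the scheduler is specific to your cluster.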
Cluster software setups tend to be very specialised to the hardware and computing environment. Really, I would recommend that you ask these kinds of questions of someone who has used the machine before, because any advice you receive here is unlikely to be completely accurate for your particular cluster.
