What's the difference between BOINC (https://en.wikipedia.org/wiki/Berkeley_Open_Infrastructure_for_Network_Computing)
and general big data frameworks like Hadoop/Spark/etc.? They all seem to be distributed computing frameworks - are there places where I can read about the differences, or about BOINC in particular?
It seems the Large Hadron Collider in Europe is using BOINC - why not Hadoop?
Thanks.
BOINC is software that can use the unused CPU and GPU cycles on a computer to do scientific computing.
In other words, BOINC is strictly a single application that enables grid computing using otherwise idle computation cycles.
Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework.
The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part called MapReduce.
(emphasis added to "framework" and its dual functionality)
Here, you see Hadoop is a framework (also referred to as an ecosystem) that has both storage and computing capabilities. Hadoop vendors such as Cloudera and Hortonworks bundle additional functionality into it (Hive, HBase, Pig, Spark, etc.), as well as a few security/auditing tools.
Additionally, hardware failure is handled differently by these two clusters. If a BOINC node dies, there is no fault tolerance; those resources are lost. In the case of Hadoop, data is replicated and failed tasks are re-run a certain number of times before the job eventually fails, and these steps are traceable as long as the logging services built into the framework are running.
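As a rough illustration of what that fault tolerance looks like in practice, here is a minimal PySpark sketch (assuming a Spark-on-HDFS setup; the output path is a placeholder and the values shown are just the common defaults) of the knobs behind block replication and task retries:

```python
from pyspark import SparkConf, SparkContext

# Minimal sketch of the fault-tolerance settings discussed above.
# dfs.replication: how many copies of each HDFS block are kept (3 is the usual default).
# spark.task.maxFailures: how many times a failed task is retried before the job fails.
conf = (SparkConf()
        .setAppName("fault-tolerance-demo")
        .set("spark.hadoop.dfs.replication", "3")   # passed through to the underlying Hadoop config
        .set("spark.task.maxFailures", "4"))        # retry a failed task up to 4 times

sc = SparkContext(conf=conf)

# Anything written to HDFS from this context inherits the replication factor above;
# if a node dies mid-job, its tasks are re-scheduled on other nodes automatically.
sc.parallelize(range(1000)).map(lambda x: x * x).saveAsTextFile("hdfs:///tmp/squares")  # placeholder path
```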
It seems the Large Hadron Collider in Europe is using BOINC - why not Hadoop?
Because BOINC provides software that anyone in the world can install to join the cluster, the project gains a large pool of computing power from anywhere, practically for free.
They might be using Hadoop internally for some storage, and perhaps Spark for additional computing, but buying commodity hardware in bulk and building/maintaining a cluster of that size seems cost-prohibitive.
What is similar between BOINC and Hadoop is that they both exploit the fact that a big problem can be solved in many parts. And both are primarily about distributing data and work across many computers, not about running a single application on one machine.
The difference is the degree of synchronisation between all contributing machines. With Hadoop the synchronisation is very tight, and at some point you expect all data to be collected from all machines before the final analysis is produced. You literally wait for the last one; nothing is returned until that last fraction of the job has completed.
With BOINC, there is no synchronicity at all. You have many thousands of jobs to be run. The BOINC server side, run by the project maintainers, orchestrates the delivery of work units to the BOINC clients run by volunteers.
With BOINC, the project maintainers have no control over the clients at all. If a client does not return a result, the work unit is simply sent out again to someone else. With Hadoop, the whole cluster is accessible to the project maintainer. With BOINC, the application must be provided for many different platforms, since it is completely uncertain what platform a volunteer offers; with Hadoop, everything is well defined and typically very homogeneous. BOINC's largest projects have many tens of thousands of regular volunteers; Hadoop has whatever you can afford to buy or rent.
Related
Our department at work just bought 4 nodes (servers) each with 80 cores and a bunch of memory and disk space.
We are just in the beginning stages and want to make sure that the nodes are brought into a cluster correctly for what we will want to use it for as well as future use.
Anticipated use is focused on machine learning/big data; essentially we are the advanced analytics team. We have SQL servers and databases set up for the full data. Our primary objective is to use the data to gain business insights, develop algorithms, and build optimization engines for the org's data and processes. Tools we might need at some point:
- Docker images for developed applications
- A place to run jobs when developing new algorithms, as batch jobs and maybe in real time
- Python ML algorithms
- Spark jobs
- Possibly a Hadoop cluster? (uncertain about this one for now)
- We want to run batch jobs, but also interactive jobs.
Our current plan is to run Chronos, and eventually Marathon as well, for scheduling. We plan on Apache Mesos for resource management.
Finally, to the question. Our IT department informed us that to run a Hadoop cluster, we have to virtualize each node. This virtualization takes up 8 cores on each node, as well as GBs of memory and a ton of disk space. Are they correct? How can we reduce this overhead so that we aren't consuming 10-20% of our resources just setting up the servers?
Finally, as an added bonus, are there good books on setting up a Mesos cluster, adding Hadoop, and configuring everything?
Based on some comments, maybe we don't need Hadoop, in which case we wouldn't need virtualization.
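If Hadoop does turn out to be unnecessary, one option (a sketch only; the Mesos master address, resource numbers, and file path below are hypothetical placeholders) is to point Spark straight at the Mesos cluster, which avoids a per-node virtualization layer entirely:

```python
from pyspark import SparkConf, SparkContext

# Hypothetical sketch: run Spark directly on a Mesos cluster, no Hadoop/HDFS required.
# "mesos-master.example.com:5050" is a placeholder for your Mesos master.
conf = (SparkConf()
        .setAppName("analytics-batch-job")
        .setMaster("mesos://mesos-master.example.com:5050")
        .set("spark.executor.memory", "8g")   # per-executor memory; tune to your nodes
        .set("spark.cores.max", "64"))        # cap on total cores taken from the Mesos pool

sc = SparkContext(conf=conf)

# Example workload: read data exported from the SQL servers as CSV and count rows.
rows = sc.textFile("file:///data/exports/orders.csv")  # placeholder path
print(rows.count())
```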
I am new to Hadoop so this may seem like a silly question.
The purpose of Hadoop is to distribute processing power and storage across multiple computers.
So what is the purpose of a single Hadoop node? It's only one computer, so there is no distribution or sharing of resources available?
Strictly learning and getting started.
Also useful for locally unit testing very small workloads without touching a standing production cluster or a large dataset - for example, parsing a file and making sure the logic works before you make it scalable and run into issues at scale.
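For instance, here is a minimal sketch of that kind of local test with PySpark (the sample data and parsing logic are made up for illustration) - the same code can later be pointed at a real cluster and real data:

```python
from pyspark import SparkContext

# Run Spark entirely on the local machine, using all available cores.
sc = SparkContext("local[*]", "parse-logic-test")

def parse_line(line):
    # Hypothetical parsing logic: "timestamp,user,amount" -> (user, amount)
    fields = line.split(",")
    return fields[1], float(fields[2])

# Test against a tiny in-memory sample before running against real data.
sample = [
    "2016-01-01T00:00:00,alice,12.50",
    "2016-01-01T00:05:00,bob,3.75",
]
parsed = sc.parallelize(sample).map(parse_line).collect()
print(parsed)  # [('alice', 12.5), ('bob', 3.75)]
sc.stop()
```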
Having run through configuration of both the Hadoop Big Insights and Apache Spark services on Bluemix, I noticed that Hadoop is very configurable. I have a choice of how many nodes there will be in the cluster, and the RAM, CPU cores, and hard disk space of those nodes.
But the Spark service seems less configurable. The only choice I have is to choose between 2 and 30 Spark executors.
I am working with Bluemix as part of an IBM IC4 project to evaluate these services, so I have a few questions about this.
Is it possible to configure the Spark service in a similar way to the Hadoop service? i.e. choose nodes, RAM of nodes, CPU cores etc.
What are Spark executors in this context? Are they nodes? If so, what are their specifications?
Is there a plan to improve the options for Spark's configuration in the future?
Apologies for the questions but I need to know these specifications in order to carry out my work.
The Big Insights service is what some would call a hosted service: when you provision an instance of this service, you get your own cluster with nodes configured as specified in the chosen plan. Consequently, you'll want to know exactly what each node you're paying for gives you. The Apache Spark service, on the other hand, is a shared compute service, wherein you pay for compute to run your Spark programs. Running Spark is about in-memory compute and creating RDDs over sources of data hosted by other data services. So in this context, what matters is how many concurrent jobs you can run, how many parallel tasks you can run, with how much memory, and so on. In the Spark service plan, the executors are an abstraction of this compute horsepower; unfortunately, it is hard to map that to physical hardware if you care about that. The plan description needs more elaboration and detail about how to translate this abstraction into your workload's needs.
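For reference, on a cluster you manage yourself, an executor is just a worker process whose count and size you set explicitly; a sketch of how that is usually expressed is below (the numbers are arbitrary examples, not Bluemix plan values):

```python
from pyspark import SparkConf, SparkContext

# Illustrative only: how executor count/size is typically expressed when you control the cluster.
# None of these values correspond to a Bluemix plan; they are placeholders.
conf = (SparkConf()
        .setAppName("executor-sizing-example")
        .set("spark.executor.instances", "6")   # how many executor processes to launch
        .set("spark.executor.cores", "2")       # CPU cores per executor
        .set("spark.executor.memory", "4g"))    # heap memory per executor

sc = SparkContext(conf=conf)
# Total parallelism available to this job is roughly instances * cores = 12 concurrent tasks.
```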
However, I understand that this should be improved considerably at some point in the near future. There have been rumors about moving to a single Spark service plan where you can dial in, whenever you want, how much compute you need, and that would take effect when you click "go" for all Spark jobs from that point forward; it seems like you could twiddle the dials until you get what you want, see what it would cost, then lock it in until the next time you need to change it. I can imagine something even more dynamic than that on a per-job basis. But anyway, that seems to be the direction things may be going for this compute service.
Help me understand the advantages of Hadoop over Teradata.
Why should we migrate from Teradata to Hadoop?
In my applications I have some reports retrieving data from Teradata; the reports are very slow because of the millions of rows of data.
Will migrating to Hadoop resolve this?
The main advantage of a Hadoop system is scalability with commodity hardware.
As pointed out by @dnoeth in the comments, Teradata also scales out, similar to Hadoop, but it can only do so with expensive servers, whereas Hadoop systems can scale out using any commodity hardware (more commonly available, less expensive hardware).
In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log,
they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for
more systems of computers.
—Grace Hopper
Hadoop Advantages
- Fault tolerance is provided as part of the system; graceful degradation and data availability are taken care of.
- Individual nodes in the cluster can vary in their capacities.
- Flexibility to add/remove nodes from the cluster without shutting the cluster down.
Hadoop Disadvantages
- It is a batch-processing system with high throughput and high latency.
- The Hadoop Distributed File System doesn't allow modifying existing files in place.
- Performance is very poor if used for small datasets.
Both are defined as a set of computers that work together and give the end users the perception of a single computer running behind them.
So what is the difference here?
What is the difference between a car and a sports car?
A cluster is a system, usually managed by a single company. Clusters normally have very low latency and consist of server hardware. A distributed system can be anything: having JS on the client and PHP server code that together make up a system is already called a distributed system by some people.
In general, when working with distributed systems you deal a lot with long latencies and unexpected failures (as in the p2p systems mentioned). When building a cluster (or a big cluster, which can be called a supercomputer) you try to prevent these by using more robust hardware and better network interconnects (InfiniBand). But nevertheless, a cluster is still a distributed system. (A sports car still has 4 wheels and an engine.)