Having run through configuration of both the Hadoop Big Insights and Apache Spark services on Bluemix, I noticed that Hadoop is very configurable. I can choose how many nodes there will be in the cluster, as well as the RAM, CPU cores, and hard disk space of those nodes.
The Spark service, however, seems less configurable. The only option is to pick between 2 and 30 Spark executors.
I am working with Bluemix as part of an IBM IC4 project to evaluate these services, so I have a few questions about this.
Is it possible to configure the Spark service in a similar way to the Hadoop service? i.e. choose nodes, RAM of nodes, CPU cores etc.
What are Spark executors in this context? Are they nodes? If so, what are their specifications?
Is there a plan to improve the options for Spark's configuration in the future?
Apologies for the questions but I need to know these specifications in order to carry out my work.
The Big Insights service is what some would call a hosted service. That is, when you provision an instance of this service, you get your own cluster with nodes configured as specified in the chosen plan. Consequently, you'll want to know exactly what each node you're paying for gives you. The Apache Spark service, on the other hand, is a shared compute service, in which you pay for compute to run your Spark programs. Running Spark is about in-memory compute and creating RDDs over data hosted by other data services. In this context, what matters is how many concurrent jobs you can run, how many parallel tasks you can run, with how much memory, and so on. In the Spark service plan, executors seem to be an abstraction over this compute horsepower; unfortunately, that makes it hard to map them to physical hardware if you care about that. The plan description needs more detail on how to translate this abstraction into your workload needs.
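One way to reason about the executor abstraction is as a number of task slots plus a pool of memory. The sketch below assumes every executor has the same fixed number of cores and amount of memory; the figures of 4 cores and 8 GB are hypothetical placeholders, not the actual specifications of the Bluemix plan, which the plan description does not give.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorPlan:
    """Hypothetical plan description; per-executor figures are assumptions."""
    executors: int
    cores_per_executor: int
    memory_gb_per_executor: float

    def parallel_tasks(self) -> int:
        # Spark runs one task per executor core when spark.task.cpus is 1 (its default).
        return self.executors * self.cores_per_executor

    def total_memory_gb(self) -> float:
        return self.executors * self.memory_gb_per_executor


def executors_needed(target_parallel_tasks: int, cores_per_executor: int) -> int:
    # Ceiling division: the smallest executor count that provides enough task slots.
    return -(-target_parallel_tasks // cores_per_executor)


plan = ExecutorPlan(executors=2, cores_per_executor=4, memory_gb_per_executor=8.0)
print(plan.parallel_tasks(), plan.total_memory_gb())
print(executors_needed(target_parallel_tasks=30, cores_per_executor=4))
```

Working backwards from the parallelism a workload needs, rather than from node specifications, is probably the more useful direction for a shared compute service like this one.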
However, I understand that this should improve considerably in the near future. There have been rumors about moving to a single Spark service plan where you can dial in, whenever you want, how much compute you need, and that would take effect when you click "go" for all Spark jobs from that point forward. It seems you could twiddle the dials until you get what you want, see what that would cost, then lock it in until the next time you need to change it. I can imagine something even more dynamic than that, on a per-job basis. Anyway, that seems to be the direction things may be going for this compute service.
I am trying to migrate our organization's Hadoop jobs to GCP...I am confused between GCP Dataflow and Dataproc...
I want to re-use Hadoop jobs we already have created and minimize the management of the cluster as much as possible. We also want to be able to persist data beyond the life of the cluster...
Can anyone suggest
I would just start with Dataproc, as it is very close to what you have.
Check out Dataproc initialization actions, https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions, create a simple cluster and get a feel for it.
Dataflow is completely managed and you don't operate any cluster resources, but you cannot migrate an on-site cluster to Dataflow as is; you need to migrate (and sometimes rewrite) your Hive/Pig/Oozie etc. jobs.
Cost for Dataflow is also calculated differently: there is no upfront cost compared with Dataproc, but every time you run a job on Dataflow you incur some cost for it.
The choice between Cloud Dataproc (a managed big data platform oriented towards Hadoop/Spark) and Cloud Dataflow (a managed big data platform oriented towards Apache Beam and streaming use cases) depends a lot on the nature of your Hadoop jobs and the activities you are performing.
To persist data beyond the life of the cluster, consider storing it on GCS or on persistent disks (PDs), depending on what your use case needs.
Our department at work just bought 4 nodes (servers) each with 80 cores and a bunch of memory and disk space.
We are just in the beginning stages and want to make sure that the nodes are brought into a cluster correctly for what we will want to use it for as well as future use.
Anticipated use is focused on machine learning and big data. Essentially, we are the advanced analytics team. We have SQL servers and databases set up for the full data. Our primary objective is to use the data to gain business insights, develop algorithms, and build optimization engines for the org's data and processes. Tools we might need at some point:
-Docker images for developed applications
-A place to run jobs when developing new algorithms, in batch and maybe in real time
-Python ML algorithms
-Spark jobs
-Possibly a Hadoop cluster (uncertain about this one for now)
-Batch jobs, but also interactive jobs
Our current plan is to run Chronos and eventually Marathon as well for the scheduling. We plan on Apache Mesos for the resource management.
Finally, to the question. Our IT department informed us that to run a Hadoop cluster, we have to virtualize each node. This virtualization takes up 8 cores on each node as well as GBs of memory and a ton of disk space. Are they correct? How can we reduce the overhead of our system so we aren't consuming 10-20% of our resources just setting up the servers?
As an added bonus, are there good books on setting up a Mesos cluster, adding Hadoop, and configuring everything?
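As a quick check of the core figures quoted above (4 nodes, 80 cores each, 8 cores per node taken by virtualization), the overhead works out as follows. Memory and disk are left out because the question gives no numbers for them; the function name is just for illustration.

```python
def overhead_fraction(reserved: float, total: float) -> float:
    """Share of a resource consumed by the virtualization layer."""
    if total <= 0:
        raise ValueError("total must be positive")
    return reserved / total


NODES = 4
CORES_PER_NODE = 80
HYPERVISOR_CORES_PER_NODE = 8  # figure quoted by the IT department

core_overhead = overhead_fraction(HYPERVISOR_CORES_PER_NODE, CORES_PER_NODE)
usable_cores = NODES * (CORES_PER_NODE - HYPERVISOR_CORES_PER_NODE)

print(f"core overhead per node: {core_overhead:.0%}")
print(f"usable cores across the cluster: {usable_cores} of {NODES * CORES_PER_NODE}")
```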
Based on some comments, maybe we don't need Hadoop, in which case we wouldn't need virtualization.
I am starting a big data initiative for my startup. In 2018, is there any reason to use Hadoop at all, given that Spark is touted to be way faster, primarily because it does not write intermediate data to disk the way Hadoop’s MR does?
I realize Spark has a higher need for RAM, but wouldn't that be a one-time CAPEX cost that pays for itself?
In general unless there are legacy projects why should one pick up Hadoop at all since Spark is available?
Would appreciate real world comparisons of the two, gotchas etc.?
Alternately are there use cases that Hadoop can solve but Spark cannot?
—————-comment below for actual problem————
I would use YARN as the resource manager with HDFS as the file system for Spark.
I also realize that Spark overlaps quite a bit with the Hadoop ecosystem.
The comparisons are:
MapReduce vs Spark code
SparkSQL vs Hive
People mention Pig too, but not many people want to learn a custom query language. And if I had to use Pig as a data scientist, why wouldn’t I use, say, Apache NiFi with Hadoop?
Also not sure how Spark handles the following:
If data does not fit in RAM, then what? Back to a disk-based paradigm (not talking about streaming use cases here), so no better than MapReduce? How does Tez make MR2 better?
Hadoop 3 has support for Erasure coding to reduce data replication. What does Spark do?
Where I am unclear is the plethora of overlapping choices. For example, streaming alone has:
Spark streaming
Apache storm
Apache Samza
Kafka streams
Commercial CEP tools (Oracle CEP, TIBCO, etc.)
A lot of them use a DAG similar to Spark’s core engine, so it is hard to pick one over the other.
Use case:
The app sends data to the middleware until the end of an event. An event can end on a specified schedule or when a business condition is met.
The middleware must show, in real time, the running sum of a value (simplifying) sent by users from their app instances. It is accepted that the value the middleware shows is a floor of the actual sum and that the real value can be higher. The plan is to use Kafka Streams here, with a consumer that adds up all the inputs with minimal latency; the consumer posts to a cache, which the apps poll to show the current running total.
The middleware logs all input.
After the event ends, a big data job scans the log data and database records to get an accurate count, comparing all DB values and log entries (an audit) against the value shown via Kafka. The value calculated by this scheme is the final value.
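The use case above can be sketched as a live running total that may under-count, followed by an audit pass once the event ends. This is plain Python rather than Kafka Streams code; the input IDs, the dict-shaped log and database records, and the rule of counting an input that appears in either source are all assumptions made for illustration.

```python
from typing import Dict


class LiveTotal:
    """Stands in for the stream consumer that feeds the polled cache."""

    def __init__(self) -> None:
        self.value = 0
        self._seen = set()

    def consume(self, input_id: str, amount: int) -> None:
        # Ignore redelivered inputs so the live value never over-counts.
        if input_id not in self._seen:
            self._seen.add(input_id)
            self.value += amount


def audited_total(log_entries: Dict[str, int], db_records: Dict[str, int]) -> int:
    # Assumed audit rule: an input counts if it appears in either source.
    merged = {**db_records, **log_entries}
    return sum(merged.values())


live = LiveTotal()
for input_id, amount in [("a", 5), ("b", 3), ("a", 5)]:  # "a" is redelivered
    live.consume(input_id, amount)

# Input "c" reached the log but never reached the live consumer.
final = audited_total({"a": 5, "b": 3, "c": 2}, {"a": 5, "b": 3})
print(live.value, final)
```

With non-negative amounts, the live value stays at or below the audited value, which matches the "floor" requirement in the description.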
Design choices:
I like Kafka because it decouples the application middleware and offers low-latency, high-throughput messaging. Streams code is easy to write. I'm happy for someone to argue for Spark Streaming, Apache Storm, or Apache Samza instead.
The application itself is Java code on a Tomcat server with REST endpoints for iOS/Android clients. We are not doing client caching because the running total must be explicitly live.
You're confusing Hadoop with just MapReduce. Hadoop is an ecosystem of MapReduce, HDFS, and YARN.
First of all, Spark doesn't have a filesystem. That's primarily why Hadoop is nice, in my book. Sure, you can use S3, or many other cloud storages, or bare metal data stores like Ceph, or GlusterFS, but from what I've researched, HDFS is by far the fastest when processing data.
Maybe you're not familiar with the concept of rack locality that YARN offers. If you use Spark Standalone mode with any file system not mounted under the Spark executors, then all your data requests will need to be pulled over a network connection, saturating the network and causing a bottleneck regardless of memory. Compare that to Spark executors running on the YARN NodeManagers, where the HDFS DataNodes are ideally also NodeManagers, so tasks can read their data locally.
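To make the locality argument concrete, here is a minimal sketch of locality-aware placement: prefer an executor on a host that stores the block, then one in the same rack, then anything. The level names match Spark's locality levels, but the function is a simplification for illustration and not Spark's or YARN's actual scheduler; the host and rack names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def pick_executor(
    block_hosts: Set[str],
    executor_hosts: List[str],
    rack_of: Dict[str, str],
) -> Tuple[str, str]:
    """Return (host, locality level) for a task that reads one block."""
    # Best case: an executor runs on a host that holds a replica of the block.
    for host in executor_hosts:
        if host in block_hosts:
            return host, "NODE_LOCAL"

    # Next best: an executor in the same rack as one of the replicas.
    block_racks = {rack_of[h] for h in block_hosts if h in rack_of}
    for host in executor_hosts:
        if rack_of.get(host) in block_racks:
            return host, "RACK_LOCAL"

    # Otherwise the data has to cross racks over the network.
    return executor_hosts[0], "ANY"


rack_of = {"dn1": "rack1", "dn2": "rack1", "dn3": "rack2", "ex9": "rack3"}
print(pick_executor({"dn1"}, ["dn2", "dn1"], rack_of))
print(pick_executor({"dn1"}, ["dn3", "dn2"], rack_of))
print(pick_executor({"dn1"}, ["dn3", "ex9"], rack_of))
```

When executors never share hosts with the storage layer, every task falls through to the last branch, which is the network bottleneck described above.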
A similar problem - people say Hive is slow, SparkSQL is faster. Well, that's true if you run Hive with MapReduce instead of Tez or Spark execution modes.
Now, if you want streaming and real-time events rather than the batch world commonly associated with Hadoop, you might want to research the SMACK stack.
Update
Pig as a data scientist why wouldn’t I use say an Apache NiFi with Hadoop
Pig is not comparable to NiFi.
You can use NiFi; nothing is stopping you. It would run closer to real-time than Spark micro batches. And it is a good tool to pair with Kafka.
plethora of overlapping choices
Yes, and you didn't even list them all... It's up to some BigData architect in your company to come up with a solution. You'll find that vendor support from Confluent is mostly for Kafka. I haven't seen them talking about Samza much. Hortonworks will support Storm, NiFi, and Spark, but they aren't running the latest version of Kafka if you want fancy features like KSQL. StreamSets is a similar company offering a tool that competes with NiFi, staffed by people with backgrounds in other batch/streaming Apache projects.
Storm and Samza are two ways to do the same thing, as far as I know. I think Flink is more programmer friendly than Storm. I don't have experience with Samza, though I work closely with people who primarily use Kafka Streams rather than Samza. And Kafka Streams isn't DAG based; it's just a high-level Kafka library, embeddable in any JVM application.
If data does not fit in RAM then what ?
By default, it spills to disk... Spark has parameters to configure if you don't want disk to be touched. In which case, your jobs die of OOM more quickly, obviously.
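The idea of spilling can be illustrated outside Spark with an external sort: keep a bounded buffer in memory, write sorted runs to temporary files when the buffer fills, and merge the runs at the end. This is an analogy for the technique, not Spark's implementation; `max_in_memory` is a made-up parameter for the sketch.

```python
import heapq
import os
import tempfile
from typing import Iterable, Iterator, List


def _spill(sorted_run: List[int], spill_dir: str) -> str:
    # Write one sorted run to its own temporary file and return its path.
    fd, path = tempfile.mkstemp(dir=spill_dir, suffix=".run")
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{v}\n" for v in sorted_run)
    return path


def _read_run(path: str) -> Iterator[int]:
    # Stream a spilled run back one value at a time.
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(values: Iterable[int], max_in_memory: int) -> List[int]:
    """Sort values while buffering at most max_in_memory of them before spilling."""
    with tempfile.TemporaryDirectory() as spill_dir:
        runs: List[str] = []
        buffer: List[int] = []
        for v in values:
            buffer.append(v)
            if len(buffer) >= max_in_memory:
                runs.append(_spill(sorted(buffer), spill_dir))
                buffer = []
        buffer.sort()
        # Merge the spilled runs with whatever is still in memory.
        streams = [_read_run(p) for p in runs] + [iter(buffer)]
        return list(heapq.merge(*streams))


print(external_sort([5, 3, 9, 1, 7, 2], max_in_memory=2))
```

The job still finishes when the data exceeds the buffer; it just pays for disk I/O, which is the trade-off against failing with an out-of-memory error.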
How does Tez make MR2 better?
Tez isn't MR. It creates more optimized DAGs like Spark does. Go read about it.
Hadoop 3 has support for Erasure coding to reduce data replication. What does Spark do?
Spark has no filesystem. We already covered this. Erasure coding is primarily for data at rest, not during processing. I actually don't know if Spark supports Hadoop 3 yet.
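For context on why erasure coding matters for data at rest, the storage overhead can be compared with a little arithmetic. The sketch assumes 3-way replication and a Reed-Solomon layout with 6 data and 3 parity units, which are the commonly cited HDFS defaults; other policies would give different numbers.

```python
def replication_overhead(replicas: int) -> float:
    """Extra storage per byte of data when every block is copied `replicas` times."""
    return float(replicas - 1)


def erasure_coding_overhead(data_units: int, parity_units: int) -> float:
    """Extra storage per byte of data for a (data, parity) erasure-coding layout."""
    return parity_units / data_units


print(f"3x replication: {replication_overhead(3):.0%} extra storage")
print(f"RS(6,3) erasure coding: {erasure_coding_overhead(6, 3):.0%} extra storage")
```

None of this changes how Spark processes data; it only affects how much disk the underlying file system needs.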
Application itself is Java code on Tomcat server with REST end points for iOS/ Android clients
Personally, I would use Kafka Streams here because 1) you are using Java already and 2) it runs as a standalone thread in your code that lets you read from and publish to Kafka without Hadoop/YARN or Spark clusters. It's not clear from your listed client-server architecture what your question has to do with Hadoop, but feel free to string an additional line from a Kafka topic to a database/analytics engine of your choice. The Kafka Connect framework has many connectors for you to choose from.
You could also use NiFi as your mobile REST API to just ExposeHTTP and send requests to it, then route flows based on attributes in the data. Then, manipulate and publish to Kafka as well as other systems.
Spark and Hadoop work in a pretty similar way when solving MapReduce-style problems.
Hadoop is still very relevant from the HDFS point of view. HDFS is a well-known, widely used solution for big data storage. But your question is about MapReduce.
Spark is the best option if you have good machines with plenty of memory and network throughput. But that kind of machine is expensive, and sometimes your best option is to use Hadoop to process your data. Spark is great and fast, but without a good cluster you can run into maddening memory issues when you try to fit too much data in memory. Hadoop can be the better choice in that case, although this problem becomes less relevant year after year.
So Hadoop is here to complement Spark. Hadoop is not only MapReduce; it is an ecosystem. Spark doesn't have a distributed file system, and it needs one to work well. Spark doesn't have a resource manager either, whereas Hadoop has one called YARN, and Spark in cluster mode needs a resource manager.
Conclusion
Hadoop is still relevant as an ecosystem, but MapReduce on its own is, I would say, no longer used.
What's the difference between BOINC https://en.wikipedia.org/wiki/Berkeley_Open_Infrastructure_for_Network_Computing
vs. general Hadoop/Spark/etc. big data frameworks? They all seem to be distributed computing frameworks. Are there places where I can read about the differences, or about BOINC in particular?
Seems the Large Hadron Collider in EU is using BOINC, why not Hadoop?
Thanks.
BOINC is software that can use the unused CPU and GPU cycles on a computer to do scientific computing
BOINC is strictly a single application that enables grid computing using unused computation cycles.
Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework.
The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part called MapReduce.
(emphasis added to framework and its dual functionality)
Here, you see Hadoop is a framework (also referred to as an ecosystem) that has both storage and computing capabilities. Hadoop vendors such as Cloudera and Hortonworks bundle additional functionality into it (Hive, HBase, Pig, Spark, etc.) as well as a few security/auditing tools.
Additionally, hardware failure is handled differently by these two kinds of cluster. If a BOINC node dies, there is no fault tolerance; those resources are lost. In the case of Hadoop, data is replicated and tasks are re-run a certain number of times before eventually failing, and these steps are traceable as long as the logging services built into the framework are running.
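The "re-run a certain number of times before eventually failing" behaviour can be sketched as a bounded retry loop that keeps a trace of each failure. This is an illustration of the idea, not Hadoop code; the default of 4 attempts and the `flaky_task` helper are assumptions for the example.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def run_with_retries(
    task: Callable[[int], T], max_attempts: int = 4
) -> Tuple[T, List[str]]:
    """Run task(attempt) until it succeeds or max_attempts is exhausted."""
    failure_log: List[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt), failure_log
        except Exception as exc:
            # Keep a trace of every failed attempt, as a framework's logs would.
            failure_log.append(f"attempt {attempt} failed: {exc}")
    raise RuntimeError(f"task failed after {max_attempts} attempts: {failure_log}")


def flaky_task(attempt: int) -> str:
    # Hypothetical task that fails on its first two attempts.
    if attempt < 3:
        raise IOError("node lost")
    return "done"


result, log = run_with_retries(flaky_task)
print(result, log)
```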
Seems the Large Hadron Collider in EU is using BOINC, why not Hadoop?
Because BOINC provides software that anyone in the world can install to join the cluster, they gain a large amount of computing power from anywhere, practically for free.
They might be using Hadoop internally to do some storage and perhaps Spark to do additional computing, but buying commodity hardware in bulk and building/maintaining that cluster seems cost prohibitive.
What BOINC and Hadoop have in common is that both exploit the fact that a big problem can be solved in many parts. And both are most associated with distributing data across many computers, not an application.
The difference is the degree of synchronisation between the contributing machines. With Hadoop the synchronisation is very tight, and you expect that at some point all data will be collected from all machines to produce the final analysis. You literally wait for the last one, and nothing is returned until that last fraction of the job is completed.
With BOINC, there is no synchronicity at all. You have many thousands of jobs to be run. The BOINC server side run by the project maintainers orchestrates the delivery of jobs to run to the BOINC client sides run by volunteers.
With BOINC, the project maintainers have no control over the clients at all. If a client does not return a result, the work unit is sent out again elsewhere. With Hadoop, the whole cluster is accessible to the project maintainer. With BOINC, the application is provided for different platforms, since it is completely uncertain what platform a volunteer offers. With Hadoop everything is well-defined and typically very homogeneous. BOINC's largest projects have many tens of thousands of regular volunteers; Hadoop has what you can afford to buy or rent.
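The "sent elsewhere again" behaviour can be sketched as a server that reissues a work unit once its deadline passes without a result. This is a toy model, not BOINC's server code; the class name, the integer clock, and the deadline value are all hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Optional, Tuple


class WorkUnitServer:
    """Hands out work units and reissues any that miss their deadline."""

    def __init__(self, units: Iterable[str], deadline: int) -> None:
        self.pending: Deque[str] = deque(units)
        self.in_flight: Dict[str, Tuple[str, int]] = {}  # unit -> (client, issued_at)
        self.results: Dict[str, int] = {}
        self.deadline = deadline

    def _reissue_expired(self, now: int) -> None:
        for unit, (_client, issued_at) in list(self.in_flight.items()):
            if now - issued_at > self.deadline:
                # The server cannot reach the client; it just queues the unit again.
                del self.in_flight[unit]
                self.pending.append(unit)

    def request_work(self, client: str, now: int) -> Optional[str]:
        self._reissue_expired(now)
        if not self.pending:
            return None
        unit = self.pending.popleft()
        self.in_flight[unit] = (client, now)
        return unit

    def report(self, unit: str, result: int) -> None:
        # Accept the first result for a unit, even if it arrives late.
        self.in_flight.pop(unit, None)
        if unit in self.pending:
            self.pending.remove(unit)
        self.results.setdefault(unit, result)


server = WorkUnitServer(["wu1"], deadline=10)
print(server.request_work("alice", now=0))  # alice takes the unit and goes silent
print(server.request_work("bob", now=5))    # nothing available yet
print(server.request_work("bob", now=11))   # deadline passed, unit reissued to bob
server.report("wu1", 42)
print(server.results)
```

No client ever waits on another, which is the lack of synchronisation described above.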
I'm trying to figure out what would be the reasons for using Mesos. Can you come up with other ones?
Running all of your services in the same cluster instead of dedicated clusters (your end-applications + DevOps such as Jenkins)
Running applications of different maturity in the same cluster (dev, test, production). Is this viable? Kubernetes takes a similar approach with labels
Mesos simplifies the use of traditional distributed applications such as Hadoop through easier deployment, a unified API, and bin-packing of resources
Full-disclosure: I currently work at Twitter and I'm involved in both Apache Mesos and Aurora.
Mesos use cases can vary along a few dimensions: scale (10 servers vs 10s of thousands), available hardware (dedicated/static or in the public cloud/scalable), and workloads (primarily services, batch, or both).
Your list is a great start. Here are a few additional use cases / features to add.
Container Orchestration
As container runtimes like Docker have become popular, lots of potential users are looking at Mesos + a scheduler to manage orchestration once container images are created. Mesos is already quite mature and has been proven at scale, which I think has given it a leg up over some emergent solutions.
Increased Resource Utilization
For companies running >50 servers, a common motivation for adopting Mesos is to increase resource utilization to reduce CapEx. There are a number of examples of this in both the public and private cloud. eBay has been running Jenkins on Mesos and was able to reduce its VM footprint. Mesosphere has also published a case study of HubSpot (running on AWS) and how they were able to replace hundreds of smaller servers with dozens of larger ones by using their available hardware more efficiently.
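The utilization gain comes from packing many small workloads onto fewer machines. A classic way to illustrate that is first-fit decreasing bin packing, sketched below with a single resource dimension. This is a textbook heuristic chosen for illustration, not the algorithm Mesos or any particular scheduler uses; the demands and capacity are made-up numbers.

```python
from typing import List


def first_fit_decreasing(demands: List[int], capacity: int) -> List[List[int]]:
    """Pack demands (e.g. CPU cores per task) onto as few servers as the heuristic finds."""
    servers: List[List[int]] = []
    free: List[int] = []
    # Placing the largest tasks first tends to leave fewer awkward gaps.
    for demand in sorted(demands, reverse=True):
        if demand > capacity:
            raise ValueError(f"demand {demand} exceeds server capacity {capacity}")
        for i, available in enumerate(free):
            if demand <= available:
                servers[i].append(demand)
                free[i] -= demand
                break
        else:
            # No existing server has room, so start a new one.
            servers.append([demand])
            free.append(capacity - demand)
    return servers


placement = first_fit_decreasing([4, 8, 1, 4, 2, 1], capacity=10)
print(len(placement), placement)
```

Six tasks that might otherwise each get a dedicated small VM fit onto two larger servers here, which is the shape of the savings described in the case studies.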
Preemption
At Twitter we're running Mesos via one scheduler: Apache Aurora. One of the ways we can improve utilization relates to your use case of running applications of different maturity in the same cluster. Aurora has a concept of environments, so you can run applications that are production, development, or test. Aurora also has a built-in preemption feature that lets it prioritize production over non-production tasks, killing non-production tasks when their resources are needed to run production ones, and there is a priority system within each environment.
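The preemption idea can be sketched in a few lines: when a production task does not fit, evict non-production tasks until it does. This is a simplified illustration, not Aurora's actual algorithm; it tracks only CPUs, ignores priorities within an environment, and all task names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Task:
    name: str
    cpus: int
    production: bool


def admit(running: List[Task], incoming: Task, capacity: int) -> Tuple[bool, List[Task]]:
    """Return (admitted, tasks_to_preempt) for an incoming task."""
    free = capacity - sum(t.cpus for t in running)
    if incoming.cpus <= free:
        return True, []
    if not incoming.production:
        # Non-production tasks never preempt anything; they wait.
        return False, []

    victims: List[Task] = []
    for task in running:
        if task.production:
            continue
        victims.append(task)
        free += task.cpus
        if incoming.cpus <= free:
            return True, victims
    # Even evicting every non-production task would not make room.
    return False, []


running = [Task("web", 4, True), Task("dev-job", 2, False), Task("test-job", 2, False)]
print(admit(running, Task("api", 3, True), capacity=8))
print(admit(running, Task("scratch", 1, False), capacity=8))
```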
Long-term, functionality related to preemption will also be located in the Mesos core itself -- it's a killer feature related to both increased resource utilization and running different maturity applications (dev, test, prod). There are a few Mesos tickets to follow if you're interested in keeping up to date, including MESOS-155 for preemption, and MESOS-1474 for inverse offers.
Colocating Batch and Services
Running batch and services in a shared Mesos cluster will be key to driving up utilization even further as js84 points out. Check out Project Myriad, an effort to colocate Mesos and YARN workloads in the same cluster. At this time I'm not aware of any large deployments running both batch and services, but it's certainly the direction the community is moving in as it becomes easier for multiple frameworks to run in a shared cluster.
At least one additional use case comes to mind: a development SDK for building distributed applications. If you have a look at Mesos Frameworks, you will find a number of frameworks that have been developed on top of Mesos. Also interesting is Apple's framework that powers Siri.
Regarding your 1): one additional angle to keep in mind is scaling your applications within the same cluster. For example, at peak load on your website, you can easily shift resources towards the web servers while scaling down the Hadoop analytical processing.