Understanding Sparkling Water - H2O

I am new to Sparkling Water and have some quick questions:
1. Does Sparkling Water support all the algorithms that both Spark MLlib and H2O provide?
2. Does Sparkling Water itself provide algorithms that Spark MLlib and H2O don't support?
3. If I want to write code with pure Spark MLlib within a Sparkling Water context, do I have to use H2OContext or any Sparkling Water related API?
With these three questions, what I really want to understand is how Sparkling Water works. (At present, I know no more than that Sparkling Water brings Spark and H2O together.)
Thanks.
Questions (2017-01-11)
I am able to run the AirlinesWithWeatherDemo2 example with run-example.sh successfully, but I have two questions:
1. The H2O Flow web UI is opened while the application is running (it can be accessed through port 54321), but when the application finishes, the process serving port 54321 shuts down as well and the web UI becomes inaccessible. When I run the example, what functionality does the Flow UI provide, given that it may be this short-lived?
2. Sparkling Water is meant to integrate Spark and H2O. When I submit the example, I only need sparkling-water-assembly_2.11-2.0.3-all as the application jar (it contains the example classes). It looks like if I want to run H2O algorithms that Sparkling Water doesn't provide, I should add the H2O jars (h2o.jar) as dependent jars?

1. Yes.
2. Not really. We are, however, working on wrapping Spark MLlib algorithms so you can run them from H2O's Flow UI, and on wrapping H2O's algorithms so you can use them in MLlib pipelines.
3. You need H2OContext only if you want to run H2O-specific functionality.
Sparkling Water simply allows you to run H2O nodes inside Spark nodes, instead of bootstrapping the H2O cluster by hand. This also allows you to use data in both H2O and Spark.
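A minimal Scala sketch of that split, assuming the Sparkling Water 2.x Scala API and a hypothetical input file: plain Spark/MLlib code needs no H2O API at all, and H2OContext only comes in when you hand data to H2O.

```scala
import org.apache.spark.h2o.H2OContext
import org.apache.spark.sql.SparkSession

object SparklingWaterSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparklingWaterSketch")
      .getOrCreate()

    // Pure Spark / MLlib work: no H2OContext or Sparkling Water API needed.
    val df = spark.read.option("header", "true").csv("airlines.csv") // hypothetical path

    // Start H2O nodes inside the Spark executors (this is what Sparkling Water adds).
    val hc = H2OContext.getOrCreate(spark)

    // Only needed when you want to hand the data to an H2O algorithm.
    val h2oFrame = hc.asH2OFrame(df) // Spark DataFrame -> H2OFrame
    // ... train an H2O model on h2oFrame, then bring results back, e.g.:
    // val predictions = hc.asDataFrame(h2oPredictionsFrame)
  }
}
```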
Edit (answers to the 2017-01-11 questions):
1. None in a short demo, but you might have a long-running Spark job where you don't exit after the initial computation but keep the job alive (and have to kill it explicitly). Then you can use the Flow UI as normal. We simply start the HTTP server every time, even for demos; there is no reason not to.
2. You can either use one of our droplets (https://github.com/h2oai/h2o-droplets/tree/master/sparkling-water-droplet), which is a template project: you add your logic in the main class, run ./gradlew shadowJar, and submit the resulting jar with spark-submit; it already contains all the jars. Or, as you mentioned, you will need to provide all the necessary dependencies (through --jars or --packages), h2o.jar included.

Related

How to migrate on-prem Hadoop to GCP

I am trying to migrate our organization's Hadoop jobs to GCP... I am confused between GCP Dataflow and Dataproc...
I want to reuse the Hadoop jobs we have already created and minimize the management of the cluster as much as possible. We also want to be able to persist data beyond the life of the cluster...
Can anyone suggest an approach?
I would just start with Dataproc, as it is very close to what you have.
Check out Dataproc initialization actions (https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions), create a simple cluster, and get a feel for it.
Dataflow is fully managed and you don't operate any cluster resources, but you cannot migrate an on-site cluster to Dataflow as-is; you need to migrate (and sometimes rewrite) your Hive/Pig/Oozie workloads, etc.
Cost for Dataflow is also calculated differently: there is no upfront cost as with Dataproc, but every time you run a job on Dataflow you incur a cost associated with it.
A lot depends on the nature of your Hadoop jobs and the activities you are performing when choosing between Cloud Dataproc (a managed big data platform oriented toward Hadoop/Spark) and Cloud Dataflow (a managed big data platform oriented toward Apache Beam, particularly for streaming use cases).
To ensure persistence of data beyond the life of the cluster, you may want to consider storing your data on GCS, or on persistent disks (PDs) if that is an option, based on the needs of your use case.

Relevance of Hadoop & Streaming solutions when Spark exists

I am starting a big data initiative for my startup. In 2018, is there any reason to use Hadoop at all, since Spark is touted to be much faster because, unlike Hadoop's MapReduce, it mostly avoids writing intermediate data to disk?
I realize Spark has a higher need for RAM, but wouldn't that be just a one-time CAPEX cost that pays for itself?
In general, unless there are legacy projects, why should one pick up Hadoop at all when Spark is available?
I would appreciate real-world comparisons of the two, gotchas, etc.
Alternatively, are there use cases that Hadoop can solve but Spark cannot?
---- comment below describes the actual problem ----
I would use YARN as the resource manager with HDFS as the file system for Spark.
I also realize that Spark intersects quite a bit with the Hadoop ecosystem.
The comparisons are:
MapReduce vs. Spark code
Spark SQL vs. Hive
People mention Pig too, but not a whole lot of people want to learn its custom query language. And if I had to use Pig, as a data scientist why wouldn't I use, say, Apache NiFi with Hadoop?
I am also not sure how Spark handles the following:
If the data does not fit in RAM, then what? Back to a disk-based paradigm (not talking about streaming use cases here), so no better than MapReduce? How does Tez make MR2 better?
Hadoop 3 has support for erasure coding to reduce data replication. What does Spark do?
Where I am unclear is the plethora of overlapping choices. For example, streaming alone has:
Spark Streaming
Apache Storm
Apache Samza
Kafka Streams
Commercial CEP tools (Oracle CEP, TIBCO, etc.)
A lot of them use a DAG model similar to Spark's core engine, so it is hard to pick one over the other.
Use case:
The app sends data to the middleware until the end of an event. An event can end either at a specified periodicity or when a business condition is met.
The middleware must show, in real time, the sum of a value (simplifying) sent by users from their app instances. It is accepted that the middleware shows a floor of the actual sum and that the real value can be higher. I plan to use Kafka Streams here: a consumer adds all the inputs with minimal latency and posts the result to a cache that is polled by the apps to show the current additive value.
The middleware logs all input.
After the event ends, a big data job scans through the log data and database records to get an accurate count by comparing all DB values and log entries (an audit) against the value shown via Kafka. The value calculated by this scheme is the final value.
Design choices:
I like Kafka because it decouples the application from the middleware and provides low-latency, high-throughput messaging, and Kafka Streams code is easy to write. I am happy for someone to make a counter-argument for using Spark Streaming, Apache Storm, or Apache Samza instead.
The application itself is Java code on a Tomcat server with REST endpoints for iOS/Android clients. We are not doing client-side caching because the additive value needs to be explicitly live.
You're confusing Hadoop with just MapReduce. Hadoop is an ecosystem of MapReduce, HDFS, and YARN.
First of all, Spark doesn't have a filesystem. That's primarily why Hadoop is nice, in my book. Sure, you can use S3 or many other cloud object stores, or bare-metal data stores like Ceph or GlusterFS, but from what I've researched, HDFS is by far the fastest when processing data.
Maybe you're not familiar with the concept of rack locality that YARN offers. If you use Spark standalone mode with any file system not mounted under the Spark executors, then all your data requests will need to be pulled over a network connection, saturating the network and causing a bottleneck, regardless of memory. Compare that to Spark executors running on YARN NodeManagers, where the HDFS DataNodes are ideally the same machines as the NodeManagers.
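A rough Scala sketch of that difference (the HDFS path is a made-up example): the same read issued from executors that YARN schedules next to the DataNodes is served from local disks, whereas from a standalone cluster backed by remote storage every block crosses the network.

```scala
import org.apache.spark.sql.SparkSession

// Submitted with something like `spark-submit --master yarn ...`, YARN can place
// executors on NodeManagers that also host the HDFS blocks, so reads stay local.
object LocalityDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LocalityDemo")
      .getOrCreate()

    // Hypothetical HDFS path; with Spark standalone plus remote storage,
    // every one of these blocks would be pulled over the network instead.
    val logs = spark.read.textFile("hdfs:///data/access-logs/")
    println(s"lines: ${logs.count()}")

    spark.stop()
  }
}
```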
A similar problem: people say Hive is slow and Spark SQL is faster. Well, that's true if you run Hive with the MapReduce execution engine instead of Tez or Spark.
Now, if you want streaming and real-time events rather than the batch world commonly associated with Hadoop, you might want to research the SMACK stack.
Update
Pig as a data scientist why wouldn’t I use say an Apache NiFi with Hadoop
Pig is not comparable to NiFi.
You can use NiFi; nothing is stopping you. It would run closer to real time than Spark micro-batches, and it is a good tool to pair with Kafka.
plethora of overlapping choices
Yes, and you didn't even list them all... It's up to a big data architect in your company to come up with a solution. You'll find that vendor support from Confluent is mostly for Kafka; I haven't seen them talk about Samza much. Hortonworks will support Storm, NiFi, and Spark, but they aren't running the latest version of Kafka if you want fancy features like KSQL. StreamSets is a similar company offering a tool that competes with NiFi, staffed by employees with backgrounds in other batch/streaming Apache projects.
Storm and Samza are two ways to do the same thing, as far as I know. I think Flink is more programmer-friendly than Storm. I don't have experience with Samza, though I work closely with people who primarily use Kafka Streams instead. And Kafka Streams isn't DAG-based; it's just a high-level Kafka library, embeddable in any JVM application.
If data does not fit in RAM then what ?
By default, it spills to disk... Spark has storage-level parameters you can configure if you don't want disk to be touched; in that case, your jobs die of OOM more quickly, obviously.
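As a small illustration (the dataset path is hypothetical), the storage level chosen when caching controls whether partitions that don't fit in memory are spilled to disk or kept memory-only:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object SpillSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SpillSketch").getOrCreate()
    val events = spark.read.parquet("hdfs:///data/events") // hypothetical path

    // MEMORY_AND_DISK (the default for Dataset.cache()): cached partitions that
    // don't fit in memory are spilled to local disk instead of failing the job.
    events.persist(StorageLevel.MEMORY_AND_DISK)

    // MEMORY_ONLY: nothing is spilled; partitions that don't fit are dropped
    // and recomputed on access, and memory pressure bites much sooner.
    // events.persist(StorageLevel.MEMORY_ONLY)

    println(events.count())
    spark.stop()
  }
}
```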
How does Tez make MR2 better?
Tez isn't MR. It creates more optimized DAGs, like Spark does. Go read about it.
Hadoop 3 has support for Erasure coding to reduce data replication. What does Spark do?
Spark has no filesystem; we already covered this. Erasure coding is primarily for data at rest, not for data being processed. I actually don't know if Spark supports Hadoop 3 yet.
Application itself is Java code on Tomcat server with REST end points for iOS/ Android clients
Personally, I would use Kafka Streams here because 1) you are using Java already and 2) it runs as standalone threads in your code that can read from and publish to Kafka without a Hadoop/YARN or Spark cluster. It's not clear what your question has to do with Hadoop given your described client-server architecture, but feel free to string an additional line from a Kafka topic to a database/analytics engine of your choice; the Kafka Connect framework has many connectors for you to choose from.
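A minimal sketch of that option in Scala against the Kafka Streams Java API; the topic names, key/value types, and bootstrap server are assumptions, and the app simply keeps a running sum per key:

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.{Consumed, Grouped, Produced, Reducer}

object RunningTotal extends App {
  // Hypothetical topics: per-user values arrive on "event-values",
  // the running sum per key is published to "event-totals".
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "running-total")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()
  builder
    .stream("event-values", Consumed.`with`(Serdes.String(), Serdes.Long()))
    .groupByKey(Grouped.`with`(Serdes.String(), Serdes.Long()))
    .reduce(new Reducer[java.lang.Long] {
      // Running sum per key; the result is a KTable backed by a state store.
      override def apply(a: java.lang.Long, b: java.lang.Long): java.lang.Long =
        java.lang.Long.valueOf(a.longValue + b.longValue)
    })
    .toStream()
    .to("event-totals", Produced.`with`(Serdes.String(), Serdes.Long()))

  // Runs as plain threads inside the JVM; no YARN or Spark cluster needed.
  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}
```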
You could also use NiFi as your mobile REST API to just ExposeHTTP and send requests to it, then route flows based on attributes in the data, and then manipulate the data and publish it to Kafka as well as other systems.
Spark and Hadoop work in pretty similar ways when it comes to solving MapReduce-style problems.
Hadoop is still quite relevant from an HDFS point of view: HDFS is a well-known, widely used solution for big data storage. But your question is about MapReduce.
Spark is the best option if you have good machines with a really good configuration of memory and network throughput. But we know that kind of machine is expensive, and sometimes your best option is to use Hadoop to process your data. Spark is great and fast, but you can run into maddening memory issues when you don't have a good cluster and try to fit too much data into memory; Hadoop can be better in that case. This problem, however, becomes less relevant year after year.
So Hadoop is here to complement Spark. Hadoop is not only MapReduce; Hadoop is an ecosystem. Spark doesn't have a distributed file system, and for Spark to work well you need one; Spark doesn't have a resource manager either, while Hadoop has one, called YARN, and Spark in cluster mode needs a resource manager.
Conclusion
Hadoop is still relevant as an ecosystem, but MapReduce on its own is hardly used anymore.

Bluemix Spark and Hadoop Service Configuration

Having run through the configuration of both the Hadoop (BigInsights) and Apache Spark services on Bluemix, I noticed that Hadoop is very configurable. I have a choice of how many nodes there will be in the cluster, the RAM and CPU cores of those nodes, and the hard disk space.
But the Spark service seems less configurable. The only choice I have is between 2 and 30 Spark executors.
I am working with Bluemix as part of an IBM IC4 project to evaluate these services, so I have a few questions about this.
Is it possible to configure the Spark service in a similar way to the Hadoop service? i.e. choose the number of nodes, their RAM, CPU cores, etc.
What are Spark executors in this context? Are they nodes? If so, what are their specifications?
Is there a plan to improve the options for Spark's configuration in the future?
Apologies for the questions but I need to know these specifications in order to carry out my work.
The BigInsights service is what some would call a hosted service: when you provision an instance of this service, you get your own cluster with nodes configured as specified in the chosen plan. Consequently, you'll want to know exactly what each node you're paying for gives you.
The Apache Spark service, on the other hand, is a shared compute service, wherein you pay for compute to run your Spark programs. Running Spark is about in-memory compute and creating RDDs over sources of data hosted by other data services. So in this context, what matters is how many concurrent jobs you can run and how many parallel tasks you can run with how much memory, and so on. In the Spark service plan, the executors seem to be an abstraction over this compute horsepower; unfortunately, it is hard for you to map that to physical hardware if you care about that. The plan description needs more elaboration and details about how one translates this abstraction to your workload needs.
However, I understand that this should be improved considerably at some point in the near future. There have been rumors about moving to a single Spark service plan where you can dial in, whenever you want, how much compute you need, and that would take effect when you click "go" for all Spark jobs from that point forward; it seems you could twiddle the dials until you get what you want, see what that would cost, then lock it in until the next time you need to change it. I can imagine something even more dynamic than that on a per-job basis. But anyway, that seems to be the direction things may be going for this compute service.

EC2 Container Service vs Apache Mesos

We are looking to use Docker containers to run our batch jobs in a cluster environment.
We are evaluating AWS ECS (EC2 Container Service), Chronos, and Mesos.
As far as I know, Apache Mesos has some features/purposes that overlap with ECS, like cluster management. Chronos is a distributed scheduler.
I am having difficulty correlating all these technologies to create an architecture!
Does the ECS service replace Mesos? What about the scheduler?
We are a small team with little experience in cluster development. Which stack is better for our batch processing?
EDIT
I have made a big edit, and I think I now understand the architecture:
Here is a sample picture with two clusters being managed by Mesos.
Reading the ECS Container Service documentation (http://docs.aws.amazon.com/AmazonECS/latest/developerguide/scheduling_tasks.html), AWS is on the way to integrating ECS with the Apache Mesos framework. So I imagine that in the future we will be able to use the Mesos framework to manage the resources in the ECS cluster, making it possible to use Chronos (for batch scheduling) and Marathon (for long-running apps).
EDIT
At this moment we don't have distributed jobs, like Hadoop or Spark jobs. Our jobs are much simpler and run on single EC2 instances. We are planning to use Docker to run our batch jobs.
I'd argue it depends on the type of batch jobs, but the Apache Mesos ecosystem is certainly more flexible than ECS in accommodating your needs. The flexibility comes from the fact that Mesos uses a so-called two-level scheduling model, which is a fancy way of saying that it outsources scheduling decisions to frameworks (rather than trying to implement each and every existing and future workload-scheduling strategy in its core itself).
You mentioned one such framework already, Chronos, which is a good workhorse; just maybe don't use its job dependencies, OK? Then there is another great batch framework called Cook. Depending on your needs (for example, SQL-based batch report generation), you could use Apache Spark. And so on and so forth.
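For instance, a Spark batch job can be pointed straight at a Mesos master as its cluster manager; the master URL, bucket, and paths below are placeholders, not something from the question:

```scala
import org.apache.spark.sql.SparkSession

object NightlyReport {
  def main(args: Array[String]): Unit = {
    // Mesos acts as the cluster manager; the mesos:// URL is a placeholder.
    val spark = SparkSession.builder()
      .appName("NightlyReport")
      .master("mesos://zk://zk1:2181/mesos")
      .getOrCreate()

    // SQL-style batch report over data in S3 (hypothetical bucket and layout).
    val orders = spark.read.parquet("s3a://my-bucket/orders/")
    orders.createOrReplaceTempView("orders")
    spark.sql("SELECT status, count(*) AS n FROM orders GROUP BY status")
      .write.mode("overwrite").parquet("s3a://my-bucket/reports/order-status/")

    spark.stop()
  }
}
```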
BTW, did I already mention that with Mesos you don't risk vendor lock-in, while still being able to deploy it, depending on your needs, fully in one cloud (such as AWS), in a hybrid cloud (say, AWS and GCP/Azure), or on-premises?
UPDATE: to clarify, of course Mesos has first-class Docker support.

Is it possible to run Hadoop MR jobs using Google's Dataflow?

Is it possible to run Hadoop MR jobs using Google's Dataflow service?
I have several Hadoop MR jobs which I'd like to be able to run on the Dataflow service. I'd like to be able to take advantage of the Dataflow service without having to completely rewrite my Hadoop jobs.
To make migration easier, I think it should be possible to define a generic Dataflow Transform which could wrap Hadoop Mappers and Reducers so the code could be reused in Dataflow Pipelines.
Here is a very minimal implementation, AvroMRTransform, that acts as a wrapper for AvroMapper and AvroReducer (i.e., it can only be used for inputs and outputs that are Avro data).
AvroMRTransform works, but there are almost certainly cases it doesn't handle. It also doesn't support a number of Hadoop features, such as counters.
For these reasons, I wouldn't recommend this as anything other than a temporary stopgap measure (e.g., your application contains many MR jobs and you don't want to rewrite them all at once).
The Hadoop MR API strikes me as being very large, so ultimately supporting every feature in Dataflow is probably going to be more work than just rewriting your application.
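The linked AvroMRTransform is not reproduced here, but a rough Scala sketch of the same wrapping idea, covering only the mapper side and written against the Apache Beam SDK (the open-source successor of the Dataflow SDK), might look like this; the class below is illustrative, not the author's code:

```scala
import org.apache.avro.mapred.{AvroCollector, AvroMapper}
import org.apache.hadoop.mapred.Reporter
import org.apache.beam.sdk.transforms.DoFn
import org.apache.beam.sdk.transforms.DoFn.ProcessElement

// Wraps an existing AvroMapper so its map() logic can run inside a ParDo.
// The mapper is instantiated lazily on each worker because Hadoop mapper
// classes are generally not Serializable.
class AvroMapperFn[IN, OUT](mapperClass: Class[_ <: AvroMapper[IN, OUT]])
    extends DoFn[IN, OUT] {

  @transient private var mapper: AvroMapper[IN, OUT] = _

  @ProcessElement
  def processElement(c: DoFn[IN, OUT]#ProcessContext): Unit = {
    if (mapper == null) mapper = mapperClass.getDeclaredConstructor().newInstance()
    // Adapt the Avro MR collector to Beam's output mechanism.
    val collector = new AvroCollector[OUT] {
      override def collect(datum: OUT): Unit = c.output(datum)
    }
    mapper.map(c.element(), collector, Reporter.NULL)
  }
}

// Usage sketch: ParDo.of(new AvroMapperFn(classOf[MyAvroMapper])).
// Wrapping the reducer would additionally need a GroupByKey in front of it,
// and things like counters and JobConf-based configuration are not handled.
```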
