I have a few questions or doubts on sparkling water and why is it needed.
Lets assume that I have a generated h2o model with both binary and pojo.
Now I want to deploy the model into production and have an option for using pojo and binary (sparkling water) both.
Which one should I use? Direct spark with pojo or sparkling water with Binary.
What is the exact use of sparkling water, when we can easily deploy a model using pojo and spark itself?
Is sparkling water needed only when you have to train model on huge amounts of data? Or it can be used in PROD deployments of model's as well.
Example: https://github.com/h2oai/h2o-droplets/blob/master/h2o-pojo-on-spark-droplet/src/main/scala/examples/PojoExample.scala
Uses spark to run a pojo model.
Example: https://github.com/h2oai/h2o-droplets/blob/master/sparkling-water-droplet/src/main/scala/water/droplets/SparklingWaterDroplet.scala
Trains / Runs a model in sparkling water.
What are the advantages which sparkling water h2o provides over normal spark?
Which one should I use? Direct spark with pojo or sparkling water with Binary.
There is no 'right' answer, it depends on your use case. It sounds like what you want is the POJO/MOJO in Spark, so you can do scoring without the added dependency of having an H2O cluster up.
What is the exact use of sparkling water, when we can easily deploy a model using pojo and spark itself?
The exact use of Sparkling Water is to have an H2O available within a Spark context. This is particularly useful for training: you can leverage Spark's many data connectors, munging capabilities etc. POJO/MOJO + Spark is sufficient for scoring
Is sparkling water needed only when you have to train model on huge amounts of data? Or it can be used in PROD deployments of model's as well.
Sparkling Water is needed when you want to leverage H2O's algorithms in a context that plays nicely w/ the Spark ecosystem.
If putting a model in "production" means having "always on" scoring exposed as a REST endpoint or similar: the POJO/MOJO is the way you want to go (H2O clusters are not highly available). You'll need to make sure you're handling incoming data correctly yourself though.
If you are doing batch scoring, nightly or otherwise, then it may make sense to use the binary model w/ Sparkling Water because parsing incoming data becomes trivial (asH2OFrame(..)) and scoring is easy as predict()
Related
Which technology is efficient to analyze the data hadoop or python? and which technology is speed between the above two?
So Hadoop mostly uses spark. If the underlying framework you are using to analyse or crunch your data contains spark, you are good to go with either Scala, PySpark or maybe R. Using alone python won't give you benefits of Spark which makes data analysis faster and also various transformations on Big Data. So whichever you use, its about using spark.
Scala or PySpark : both contains almost all of these features.
Whenever analyzing data and considering speed as a criteria, two key components determine the speed: The amount of data you have and where the data is located.
If you have Big Data, consider using Hadoop or Spark to analyze it. This will make it much faster and you will not be dependent of load time. If you have a few gigabytes of data it maybe best to use python but it still may slow down your machine.
Now to address where the data is, if you have your data on premise then python is the best approach. If your data is located in cloud server, then Azure, GCP, or even AWS have big data tools available to make this data exploration easier. All three cloud systems have big data tools available for use.
So in terms of speed, it really depends on the two constraints. If you have Big Data and your data is located in a cloud system. Consider using Hadoop to analyze your data. If you have only a few gigabytes of data and on-premise, use python to analyze your data.
I am planning to spin my development cluster for trend analysis for Infrastructure Monitoring application which I am planning to build using Spark for analysing failure trend and Cassandra for storing incoming data and analysed data.
Consider collecting performance matrix from around 25000 machines/servers (probably set of same application on different servers). I am expecting performance matrix of size 2MB/sec from each machine, which I am planning to push into Cassandra table having timestamp, server as primary key and application along with some important matrix as clustering key. I will be running Spark job on top of this stored information for performance matrix failure trend analysis.
Comming to the question, How many nodes (machines) and of what configuration in terms of CPU and Memory do I need to kick start my cluster considering above scenario.
Cassandra needs a well planned out data model for things to run well. It is very much worth spending time planning things out at this stage before you have a large data set and find out you probably would have done better re-arranging the data model!
The "general" rule of thumb is you shape your model to the queries, while paying attention to avoiding things like really large rows, large deletes, batches and such the like which can have big performance penalties.
The docs give a good start on planning and testing you would probably find useful. I would also recommend using the Cassandra stress tool. You can use it to push performance tests into your Cassandra cluster to check latencies and any performance problems. You can use your own schema too which I personally think is super-useful!
If you are using cloud based hardware like AWS then its relatively easy to scale up / down and see what works best for you. You dont need to throw big hardware at Cassandra, its easier to scale horizontally than vertically.
I'm assuming you are pulling back the data into a separate spark cluster for the analytics side too so these nodes would be running plain Cassandra (less hardware specs). If however you are using the Datastax Enterprise version (where you can run nodes in spark "mode") then you will need more beefier hardware with the additional load you need for spark driver programs, executors and such the like. Another good docs link is the DSE hardware recommendations
I have a lot of data saved into Cassandra on a daily basis and I want to compare one datapoint with last 5 versions of data for different regions.
Lets say there is a price datapoint of a product and there are 2000 products in a context/region(say US). I want to show a heat map dash board showing when the price change happened for different regions.
I am new to hadoop, hive and pig. Which path would help me achieve my goal and some details appreciated.
Thanks.
This sounds like a good use case for either traditional mapreduce or spark. You have relatively infrequent updates, so a batch job running over the data and updating a table that in turn provides the data for the heatmap seems like the right way to go. Since the updates are infrequent, you probably don't need to worry about spark streaming- just a traditional batch job run a few times a day is fine.
Here's some info from datastax on reading from cassandra in a spark job: http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkSCcontext.html
For either spark or mapreduce, you are going to want to leverage the (spark or MR) framework's ability to partition the task- if you are manually connecting to cassandra and reading/writing the data like you would from a traditional RDBMS, you are probably doing something wrong. If you write your job correctly, the framework will be responsible for spinning up multiple readers (one for each node that contains the source data that you are interested in), distributing the calculation tasks, and routing the results to the appropriate machine to store them.
Some more examples are here:
http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkIntro.html
and
http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/byoh/byohIntro.html
Either way, MapReduce is probably a little simpler, and Spark is probably a little more future proof.
We are working on developing a platform which models flow of entities across a graph. The system has to answer questions of the kind how many entities having these properties are sitting at a given node on the graph , what is the inflow on a node, outflow on a node etc. Flow data is fed to the system in a stream. We are thinking of breaking the flow data in time buckets(say 5 mins) and pre-compute various aggregates against different properties and storing the aggregates in DynamoDB to serve queries.
With regards to this we are evaluating the following options:
EMR: Put flow data in AWS -S3/DynamoDB run a Map Reduce/hive job
Putting recent data into AWS- RDS, computing the aggregates via sql
Akka: It is a framework to build distributed applications via Actors
and Message passing.
If anyone has worked on similar usecase or has used any of the above technologies, please let me know what approach would be best fit for our use case.
I have used EMR to process data in S3... works pretty well. And the best part is you can spin up hadoop clusters of various sizes that fit the work load.
you may want to look into Storm for stream processing
I am also collecting a list of big-data tools here: http://hadoopilluminated.com/hadoop_book/Bigdata_Ecosystem.html
The final solution employed AWS Redshift, the driving reason was the requirement of high speed data ingestion, which Redshift provides via the COPY command.
Hadoop is built to store the data efficiently, however it does not gurantees a sub-second sla for ingestion, neither does it provide an SLA for when the data will be available for MR jobs, this was the main reason we did not go with EMR or Hadoop in general.
We have a lot of user interaction data from various websites stored in Cassandra such as cookies, page-visits, ads-viewed, ads-clicked, etc.. that we would like to do reporting on. Our current Cassandra schema supports basic reporting and querying. However we also would like to build large queries that would typically involve Joins on large Column Families (containing millions of rows).
What approach is best suited for this? One possibility is to extract data out to a relational database such as mySQL and do data mining there. Alternate could be to attempt at use hadoop with hive or pig to run map reduce queries for this purpose? I must admit I have zero experience with the latter.
Anyone have experience of performance differences in one one vs the other? Would you run map reduce queries on a live Cassandra production instance or on a backup copy to prevent query load from affecting write performance?
In my experience Cassandra is better suited to processes where you need real-time access to your data, fast random reads and just generally handle large traffic loads. However, if you start doing complex analytics, the availability of your Cassandra cluster will probably suffer noticeably. In general from what I've seen it's in your best interest to leave the Cassandra cluster alone, otherwise the availability starts suffering.
Sounds like you need an analytics platform, and I would definitely advise exporting your reporting data out of Cassandra to use in an offline data-warehouse system.
If you can afford it, having a real data-warehouse would allow you to do complex queries with complex joins on multiples tables. These data-warehouse systems are widely used for reporting, here is a list of what are in my opinion the key players:
Netezza
Aster/TeraData
Vertica
A recent one which is gaining a lot of momentum is Amazon Redshift, but it is currently in beta, but if you can get your hands on it you could give this a try since it looks like a solid analytics platform with a pricing much more attractive than the above solutions.
Alternatives like using Hadoop MapReduce/Hive/Pig are also interesting to look at, but probably not a replacement for Hadoop technologies. I would recommend Hive if you have a SQL background because it will be very easy to understand what you're doing and you can scale easily. There are actually already libraries integrated with Hadoop, like Apache Mahout, which allow you to do data-mining on a Hadoop cluster, you should definitely give this a try and see if it fits your needs.
To give you an idea, an approach that I've used that has been working well so far is pre-aggregating the results in Hive and then have the reports themselves generated in a data-warehouse like Netezza to compute complex joins .
Disclosure: I'm an engineer at DataStax.
In addition to Charles' suggestions, you might want to look into DataStax Enterprise (DSE), which offers a nice integration of Cassandra with Hadoop, Hive, Pig, and Mahout.
As Charles mentioned, you don't want to run your analytics directly against Cassandra nodes that are handling your real-time application needs because they can have a substantial impact on performance. To avoid this, DSE allows you to devote a portion of your cluster strictly to analytics by using multiple virtual "datacenters" (in the NetworkToplogyStrategy sense of the term). Queries performed as part of a Hadoop job will only impact those nodes, essentially leaving your normal Cassandra nodes unaffected. Additionally, you can scale each portion of the cluster up or down separately based on your performance needs.
There are a couple of upsides to the DSE approach. The first is that you don't need to perform any ETL prior to processing your data; Cassandra's normal replication mechanisms keep the nodes devoted to analytics up to date. Second, you don't need an external Hadoop cluster. DSE includes a drop-in replacement for HDFS called CFS (CassandraFS), so all source data, intermediate results, and final results from a Hadoop job can be stored in the Cassandra cluster.