Differences between H2O frames and Spark RDDs - h2o

I am investigating the H2O framework a bit in order to use its extra machine learning tools. I am just curious what the differences are between H2O frames and Spark RDDs. Can H2O frames be cached or persisted like Spark RDDs?

H2O frames are not lazy, contrary to Spark's data structures. There is therefore no need for explicit caching/persisting, as we load the whole frame into memory anyway. This can be problematic if your dataset is bigger than the cluster's memory, but we do it this way for performance reasons. In Spark you'd cache RDDs for machine learning anyway. There are 2 requirements for an H2O frame:
the total cluster memory has to be big enough to hold the whole frame
a single row of the frame has to fully fit on every machine (we do not distribute frames column-wise, only row-wise, meaning we take X rows, a chunk, and put them all on a single node)
H2O frames, just like RDDs, are fully distributed and only parts of the frame are located on each node. Most of our algorithms take advantage of data locality (i.e. each node only uses rows stored on it for computations), but you can also shuffle data around so every node uses the whole frame.
When converting an RDD into an H2O frame we materialize the whole dataset in memory. When doing the opposite there's no memory overhead, as we're just iterating over an H2O frame.
H2O frames are less generic than RDDs but thanks to that we can highly compress the data in memory.
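To make the conversion described above concrete, here is a minimal sketch using Sparkling Water's pysparkling package. The exact method names (asH2OFrame/asSparkFrame) and the H2OContext bootstrap vary slightly between Sparkling Water versions, so treat this as an illustration rather than a definitive API reference:

    # Sketch: converting between a Spark DataFrame and an H2O frame with
    # Sparkling Water (pysparkling). Assumes the Sparkling Water dependency
    # is available; method names may differ by version.
    from pyspark.sql import SparkSession
    from pysparkling import H2OContext

    spark = SparkSession.builder.appName("h2o-frame-demo").getOrCreate()
    hc = H2OContext.getOrCreate()        # starts or attaches to the H2O cluster

    df = spark.range(0, 1000).withColumnRenamed("id", "x")

    # Spark -> H2O: the data is materialized and fully loaded into the
    # H2O cluster's memory (no laziness, no explicit cache() needed).
    h2o_frame = hc.asH2OFrame(df, "my_frame")

    # H2O -> Spark: a wrapper that iterates over the existing H2O chunks,
    # so there is no extra memory overhead.
    df_back = hc.asSparkFrame(h2o_frame)
    df_back.show(5)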

Related

Spark RDD - are partitions always in RAM?

We all know Spark does its computation in memory. I am just curious about the following.
If I create 10 RDDs in my PySpark shell from HDFS, does it mean all these 10 RDDs' data will reside in the Spark workers' memory?
If I do not delete an RDD, will it be in memory forever?
If my dataset (file) size exceeds the available RAM size, where will the data be stored?
If I create 10 RDDs in my PySpark shell from HDFS, does it mean all these 10 RDDs' data will reside in Spark memory?
Yes, all 10 RDDs' data will be spread across the Spark workers' RAM, but it is not necessary for every machine to have a partition of each RDD. Of course, an RDD will only have data in memory if an action has been performed on it, since RDDs are lazily evaluated.
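To see that lazy evaluation in action, a short PySpark sketch (the HDFS path is hypothetical):

    # Sketch: RDDs are lazy, so nothing is read from HDFS or held in worker
    # memory until an action runs.
    from pyspark import SparkContext

    sc = SparkContext(appName="lazy-eval-demo")

    rdd = sc.textFile("hdfs:///data/events.log")     # no I/O happens here
    words = rdd.flatMap(lambda line: line.split())   # still nothing executed

    words.cache()                                    # only marks the RDD for caching

    print(words.count())  # first action: reads HDFS, computes, populates the cache
    print(words.count())  # second action: served from the cached partitions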
If I do not delete an RDD, will it be in memory forever?
Spark automatically unpersists RDDs or DataFrames when they are no longer used. To find out whether an RDD or DataFrame is cached, go to the Spark UI -> Storage tab and check the memory details. You can use df.unpersist() or sqlContext.uncacheTable("sparktable") to remove the DataFrame or tables from memory.
link to read more
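As a small illustration of the calls mentioned above (using the spark.catalog equivalents of the sqlContext methods; a sketch, not the only way to do it):

    # Sketch: caching and explicitly releasing a DataFrame / temp table.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("unpersist-demo").getOrCreate()

    df = spark.range(0, 1000000)
    df.cache()
    df.count()                                   # action materializes the cache
    print(df.storageLevel)                       # shows the current storage level

    df.createOrReplaceTempView("sparktable")
    spark.catalog.cacheTable("sparktable")
    print(spark.catalog.isCached("sparktable"))  # True

    df.unpersist()                               # drop the DataFrame from memory
    spark.catalog.uncacheTable("sparktable")     # drop the cached table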
If my dataset size exceeds the available RAM size, where will the data be stored?
If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.
link to read more
If we are saying the RDD is already in RAM, meaning it is in memory, what is the need to persist()? -- as per comment
To answer your question: when an action is triggered on an RDD and there is not enough memory for it, Spark can evict RDDs that were not cached/persisted.
In general, we persist RDDs that need a lot of computation and/or shuffling (by default Spark persists shuffled RDDs to avoid costly network I/O), so that when any action is performed on a persisted RDD it simply performs that action, rather than computing the RDD again from the start as per the lineage graph; check RDD persistence levels here.
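A minimal sketch of persisting an expensive (shuffled) RDD so that later actions reuse it instead of replaying the lineage (the data here is generated purely for illustration):

    # Sketch: persist an RDD that is expensive to compute (e.g. after a shuffle)
    # so later actions reuse it instead of recomputing from the lineage graph.
    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="persist-demo")

    pairs = sc.parallelize(range(1000000)).map(lambda x: (x % 100, x))
    totals = pairs.reduceByKey(lambda a, b: a + b)   # shuffle happens here

    # MEMORY_AND_DISK spills partitions to disk if they do not fit in memory.
    totals.persist(StorageLevel.MEMORY_AND_DISK)

    print(totals.count())   # computes and caches
    print(totals.take(5))   # reuses the persisted result

    totals.unpersist()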
If I create 10 RDDs in my PySpark shell, does it mean all these 10 RDDs' data will reside in Spark memory?
Answer: An RDD only contains the lineage graph (the applied transformations), so an RDD is not data! Whenever we perform an action on an RDD, all the transformations are applied before the action. So if the RDD is not explicitly cached (of course there are some optimisations which cache implicitly), each time an action is performed, the whole chain of transformations and the action are executed again.
E.g. if you create an RDD from HDFS, apply some transformations and perform 2 actions on the transformed RDD, the HDFS read and the transformations will be executed twice.
So, if you want to avoid the re-computation, you have to persist the RDD. For persisting you can choose a combination of one or more of on-heap memory, off-heap memory, and disk.
If I do not delete the RDD, will it be in memory forever?
Answer: Considering that an RDD is just a lineage graph, it follows the same scope and lifetime rules as the hosting language. But if you have already persisted the computed result, you can unpersist it.
If my dataset size exceeds the available RAM size, where will the data be stored?
Answer: Assuming you have actually persisted/cached the RDD in memory, it will be stored in memory, and LRU is used to evict data. See how memory management is done in Spark for more information.

Amount of data storage: HDFS vs NoSQL

Several sources on the internet explain that HDFS is built to handle a greater amount of data than NoSQL technologies (Cassandra, for example). In general, when we go beyond 1 TB we must start thinking about Hadoop (HDFS) rather than NoSQL.
Besides the architecture, the fact that HDFS supports batch processing while most NoSQL technologies (e.g. Cassandra) perform random I/O, and the schema design differences, why can't NoSQL solutions (again, for example Cassandra) handle as much data as HDFS?
Why can't we use a NoSQL technology as a data lake? Why should we only use it as hot storage in a big data architecture?
why can't NoSQL Solutions (... for example Cassandra) handle as much data as HDFS?
HDFS has been designed to store massive amounts of data and support batch mode (OLAP) whereas Cassandra was designed for online transactional use-cases (OLTP).
The current recommendation for server density is 1TB/node for spinning disk and 3TB/node when using SSD.
In the Cassandra 3.x series, the storage engine has been rewritten to improve node density. Furthermore there are a few JIRA tickets to improve server density in the future.
There is a limit right now for server density in Cassandra because of:
repair. With an eventually consistent DB, repair is mandatory to re-sync data in case of failures. The more data you have on one server, the longer it takes to repair (more precisely, to compute the Merkle tree, a binary tree of digests). But the issue of repair has mostly been solved with the incremental repair introduced in Cassandra 2.1.
compaction. With an LSM-tree data structure, any mutation results in a new write on disk, so compaction is necessary to get rid of obsolete or deleted data. The more data you have on one node, the longer compaction takes. There are also some solutions to address this issue, mainly the new DateTieredCompactionStrategy, which has some tuning knobs to stop compacting data after a time threshold. There are a few people using DateTiered compaction in production with a density of up to 10 TB/node.
node rebuild. Imagine one node crashes and is completely lost; you'll need to rebuild it by streaming data from other replicas. The higher the node density, the longer it takes to rebuild the node (see the rough sketch after this list).
load distribution. The more data you have on a node, the higher the average load (high disk I/O and high CPU usage). This will greatly impact the node's latency for real-time requests. Whereas a difference of 100 ms is negligible for a batch scenario that takes 10 h to complete, it is critical for a real-time database/application subject to a tight SLA.
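As a rough illustration of the node-rebuild point above, here is a back-of-envelope calculation; the streaming throughput is an assumed figure, not a measured one:

    # Back-of-envelope sketch: how node density affects the time to re-stream
    # a lost Cassandra node from its replicas.
    STREAM_THROUGHPUT_MB_S = 200  # assumed sustained streaming rate, not measured

    def rebuild_hours(node_density_tb):
        mb = node_density_tb * 1024 * 1024
        return mb / STREAM_THROUGHPUT_MB_S / 3600

    for density in (1, 3, 10):    # TB per node
        print("%2d TB/node -> ~%.1f h to rebuild" % (density, rebuild_hours(density)))
    # ~1.5 h at 1 TB/node, ~4.4 h at 3 TB/node, ~14.6 h at 10 TB/node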

Hadoop comparison to RDBMS

I really do not understand the actual reason behind Hadoop scaling better than an RDBMS. Can anyone please explain at a granular level? Does this have something to do with the underlying data structures and algorithms?
RDBMSs have challenges handling huge data volumes of terabytes and petabytes. Even if you have a Redundant Array of Independent/Inexpensive Disks (RAID) and data sharding, it does not scale well for huge volumes of data, and you need very expensive hardware.
EDIT:
To answer why an RDBMS cannot scale, have a look at the overheads of an RDBMS:
Logging. Assembling log records and tracking down all changes in database structures slows performance. Logging may not be necessary if recoverability is not a requirement or if recoverability is provided through other means (e.g., other sites on the network).
Locking. Traditional two-phase locking poses a sizeable overhead since all accesses to database structures are governed by a separate entity, the Lock Manager.
Latching. In a multi-threaded database, many data structures have to be latched before they can be accessed. Removing this feature and going to a single-threaded approach has a noticeable performance impact.
Buffer management. A main memory database system does not need to access pages through a buffer pool, eliminating a level of indirection on every record access.
How does Hadoop handle this?
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment and can run on commodity hardware. It is useful for storing and retrieving huge volumes of data.
This scalability and efficiency are made possible by Hadoop's implementation of the storage mechanism (HDFS) and of processing jobs (YARN MapReduce jobs). Apart from scalability, Hadoop provides high availability of stored data.
Scalability, high availability, and flexible processing of huge volumes of data (structured, unstructured, and semi-structured) are key to the success of Hadoop.
Data is stored on thousands of nodes, and processing is done on the node where the data is stored (most of the time) through MapReduce jobs. Data locality on the processing front is one key area of Hadoop's success.
This has been achieved with the NameNode, DataNode and ResourceManager.
To understand how Hadoop achieves this, you should visit these links: HDFS Architecture, YARN Architecture and HDFS Federation.
An RDBMS is still good for multiple writes/reads/updates and consistent ACID transactions on gigabytes of data, but not good for processing terabytes and petabytes of data. NoSQL, which provides two of the three CAP-theorem attributes (consistency, availability, partition tolerance), is good in some use cases.
But Hadoop is not meant for real-time transaction support with ACID properties. It is good for business intelligence reporting with batch processing - the "write once, read many times" paradigm.
Have a look at one more related SE question:
NoSql vs Relational database
First, Hadoop IS NOT a DB replacement.
RDBMSs scale vertically and Hadoop scales horizontally.
This means that to scale an RDBMS to twice the capacity you need hardware with double the memory, double the storage and double the CPU. That is very expensive and has limits: there isn't a server with 10 TB of RAM, for example. With Hadoop it's different; you don't need expensive cutting-edge hardware. Instead you can use several commodity servers working together to simulate a bigger server (with some limitations). You can have a cluster with 10 TB of RAM distributed across several nodes.
Another advantage is that, instead of having to buy a new, more powerful server and drop the old one, scaling a distributed system only requires adding new nodes to the cluster.
The one issue I have with the description above is that parallel RDBMSs require expensive hardware. Teradata and Netezza need special hardware. Greenplum and Vertica can be put on commodity hardware. (Now I will admit I am biased, like everyone else.) I have seen Greenplum scan petabytes of information daily. (Walmart was up to 2.5 petabytes last I heard.) I have dealt with both HAWQ and Impala. They both require about 30% more hardware to do the same job on structured data. HBase is less efficient.
There is no silver bullet. It has been my experience that both structured and unstructured stores have their place. Hadoop is great for ingesting large amounts of data and scanning through it a small number of times. We use it as part of our load procedures. An RDBMS is great at scanning the same data over and over with highly complex queries.
You always have to structure the data to make use of it. That structuring takes time somewhere. You either structure it before you put it into an RDBMS or at query time.
In an RDBMS, data is structured, and indeed indexed.
Retrieving any particular 'nth' column means reading entire rows and then selecting that column.
Whereas in Hadoop, say Hive with a columnar storage format, only the particular column is loaded from the entire data set.
Moreover, the data loading is also done by MapReduce programs, which work in a distributed fashion and reduce the overall time.
Hence the two advantages of using Hadoop and its tools.
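To make the column-loading point concrete, here is a small PySpark sketch with a columnar format (Parquet); the path and column names are hypothetical:

    # Sketch: with a columnar format (Parquet/ORC), Spark/Hive reads only the
    # selected column from disk instead of whole rows.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("column-pruning-demo").getOrCreate()

    df = spark.read.parquet("hdfs:///warehouse/sales")   # columnar, splittable files

    # Only the 'amount' column is scanned; the other columns are skipped
    # thanks to column pruning.
    total = df.select("amount").groupBy().sum("amount")
    total.explain()   # the physical plan shows the ReadSchema limited to 'amount'
    total.show()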

What makes Spark fast if data size exceeds available memory?

Everywhere I try to understand Spark, it says it is fast because it keeps data in memory as opposed to MapReduce. Let's take this example:
I have a 5-node Spark cluster with 100 GB RAM each. Let's say I have 500 TB of data to run a Spark job against. The total data Spark can keep in memory is 100*5 = 500 GB. If it can keep at most 500 GB of data in memory at any point in time, what makes it lightning fast?
Spark isn't magical and can't change fundamental principles of computing. Spark uses memory as a progressive enhancement and will fall back to disk I/O for huge datasets that cannot be kept in memory. In a scenario where tables must be scanned from disk, Spark's performance should be comparable to other parallel solutions that involve scanning tables from disk.
Suppose only 0.1% of the 500 TB is "interesting". For instance, in a marketing funnel there are a lot of ad impressions, fewer clicks, even fewer sales, and still fewer repeat sales. A program can filter through a huge dataset and tell Spark to cache in memory the smaller, filtered and corrected dataset needed for further processing. Spark's caching of a smaller, filtered data set is obviously much faster than repeated disk table scans and repeated processing of the larger raw data.
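A minimal sketch of that filter-then-cache pattern (the input path and column names are hypothetical):

    # Sketch: filter the huge raw dataset down to the "interesting" slice and
    # cache only that, so repeated analysis avoids rescanning 500 TB from disk.
    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("filter-then-cache").getOrCreate()

    events = spark.read.parquet("hdfs:///data/ad_events")    # ~500 TB on disk

    sales = events.filter(events.event_type == "sale")       # ~0.1% of the data
    sales.persist(StorageLevel.MEMORY_AND_DISK)              # spills if too big

    sales.count()                                # one full scan, then cached
    sales.groupBy("campaign").count().show()     # served from the cached subset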

Why is Spark fast when word count? [duplicate]

This question already has answers here: Why is Spark faster than Hadoop Map Reduce (2 answers). Closed 5 years ago.
Test case: word counting over 6 GB of data in 20+ seconds with Spark.
I understand the MapReduce, FP, and stream programming models, but couldn't figure out why the word counting is so amazingly fast.
I think it's an I/O-intensive computation in this case, and it's impossible to scan 6 GB of files in 20+ seconds. I guess some indexing is performed before the word counting, like Lucene does. The magic should be in the RDD (Resilient Distributed Dataset) design, which I don't understand well enough.
I'd appreciate it if anyone could explain RDDs for the word counting case. Thanks!
First is startup time. Hadoop MapReduce job startup requires starting a number of separate JVMs, which is not fast. Spark job startup (on an existing Spark cluster) causes an existing JVM to fork new task threads, which is many times faster than starting a new JVM.
Next, there is no indexing and no magic. The 6 GB file is stored in 47 blocks of 128 MB each. Imagine you have a big enough Hadoop cluster that all of these 47 HDFS blocks reside on different JBOD HDDs. Each of them would deliver a 70 MB/sec scan rate, which means you can read this data in ~2 seconds. With a 10GbE network in your cluster, you can transfer all of this data from one machine to another in just 7 seconds.
Lastly, Hadoop puts intermediate data on disk a number of times. It puts map output on disk at least once (and more if the map output is big and on-disk merges happen). It puts the data on disk again on the reduce side before the reduce itself is executed. Spark puts the data on HDDs only once, during the shuffle phase, and the reference Spark implementation recommends increasing the filesystem write cache so that this 'shuffle' data does not hit the disks.
All of this gives Spark a big performance boost compared to Hadoop. There is no magic in Spark RDDs related to this question.
Other than the factors mentioned by 0x0FFF, local combining of results also makes Spark run word count more efficiently. Spark, by default, combines results on each node before sending them to other nodes.
In the case of a word count job, Spark calculates the count for each word on a node and then sends the results to other nodes. This reduces the amount of data to be transferred over the network. To achieve the same functionality in Hadoop MapReduce, you need to specify a combiner class: job.setCombinerClass(CustomCombiner.class).
By using combineByKey() in Spark, you can specify a custom combiner.
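For reference, a word-count sketch in PySpark; reduceByKey (and combineByKey) perform the local, map-side combining described above before the shuffle (the input path is hypothetical):

    # Sketch: word count with map-side combining.
    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount-demo")

    counts = (
        sc.textFile("hdfs:///data/corpus.txt")
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b)   # combines locally, then shuffles
    )

    # Equivalent with an explicit combiner, mirroring Hadoop's setCombinerClass:
    counts2 = (
        sc.textFile("hdfs:///data/corpus.txt")
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .combineByKey(lambda v: v,              # createCombiner
                        lambda acc, v: acc + v,   # mergeValue (per node)
                        lambda a, b: a + b)       # mergeCombiners (after shuffle)
    )

    print(counts.take(10))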
Apache Spark processes data in-memory, while Hadoop MapReduce persists back to disk after a map or reduce action. But Spark needs a lot of memory.
Spark loads a process into memory and keeps it there until further notice, for the sake of caching.
Spark's Resilient Distributed Dataset (RDD) allows you to transparently store data in memory and persist it to disk when needed.
Since Spark works in memory, there's no synchronisation barrier slowing you down. This is a major reason for Spark's performance.
Rather than just processing a batch of stored data, as is the case with MapReduce, Spark can also manipulate data in real time using Spark Streaming.
The DataFrames API was inspired by data frames in R and Python (pandas), but designed from the ground up as an extension to the existing RDD API.
A DataFrame is a distributed collection of data organized into named columns, with richer optimizations under the hood that contribute to Spark's speed.
Using RDDs, Spark simplifies complex operations like join and groupBy; in the backend, you're dealing with partitioned data. That partitioning is what enables Spark to execute in parallel.
Spark allows you to develop complex, multi-step data pipelines using the directed acyclic graph (DAG) pattern. It supports in-memory data sharing across DAGs, so that different jobs can work with the same data. DAGs are a major part of Spark's speed.
Hope this helps.
