Doubts on RDD Spark - hadoop

I want to understand the following about the RDD concept in Spark.
Is an RDD just a concept of copying the required data from HDFS storage into some node's RAM to speed up execution?
If a file is split across the cluster, does the RDD bring all the required data for that single file from the other nodes?
If the second point is correct, how does it decide which node's JVM to execute on? How does data locality work here?

The RDD is at the core of Apache Spark; it is a data abstraction for a distributed collection of objects. RDDs are immutable distributed collections of elements of your data that can be stored in memory or on disk across a cluster of machines. The data is partitioned across the machines in your cluster so that it can be operated on in parallel with a low-level API that offers transformations and actions. RDDs are fault tolerant because they track data lineage information to rebuild lost data automatically on failure. Ref: https://databricks.com/blog/2016/06/22/apache-spark-key-terms-explained.html
If a file is split across the cluster upon loading, the calculations are done on the nodes where the RDDs reside. That is, the compute is performed where the data resides (as well as it can) to minimize the need for performing shuffles. For more information concerning Spark and Data locality, please refer to: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html.
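As a rough illustration of "compute where the data resides", the sketch below models locality-aware task placement in plain Python. It is a toy model, not Spark's scheduler: the block names, host names and the `place_tasks` helper are all hypothetical.

```python
# Toy model of locality-aware task placement (not Spark's real scheduler).
# block_hosts maps each HDFS block to the hosts holding a replica (made-up names).

def place_tasks(block_hosts, executors):
    """Assign each block's task to an executor host, preferring a replica host."""
    load = {host: 0 for host in executors}
    placement = {}
    for block, hosts in sorted(block_hosts.items()):
        local = [h for h in hosts if h in load]
        # Prefer the least-loaded host that already stores the block;
        # otherwise fall back to the least-loaded executor anywhere.
        candidates = local if local else list(load)
        chosen = min(candidates, key=lambda h: (load[h], h))
        load[chosen] += 1
        placement[block] = (chosen, bool(local))
    return placement


block_hosts = {
    "blk_0": ["node1", "node2"],
    "blk_1": ["node2", "node3"],
    "blk_2": ["node1", "node3"],
    "blk_3": ["node9"],  # replica lives on a host with no executor
}
placement = place_tasks(block_hosts, ["node1", "node2", "node3"])
for block, (host, is_local) in placement.items():
    print(block, "->", host, "(local read)" if is_local else "(remote read)")
```

Only the task for `blk_3` has to read its data over the network, because no executor runs on the host that stores it.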
Note: for more information about Spark research, please refer to http://spark.apache.org/research.html; more specifically, see Zaharia et al.'s paper, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf).

Related

Is it possible to convert from HBase to a Spark RDD efficiently?

I have a large dataset of items in HBase that I want to load into a Spark RDD for processing. My understanding is that HBase is optimized for low-latency single-item lookups on Hadoop, so I am wondering if it's possible to efficiently query for 100 million items in HBase (~10 TB in size)?
Here is some general advice on making Spark and HBase work together.
Data colocation and partitioning
Spark avoids shuffling: if your Spark workers and HBase regions are located on the same machines, Spark will create partitions according to regions.
A good region split in HBase will map to a good partitioning in Spark.
If possible, consider working on your rowkeys and region splits.
Operations in Spark vs operations in HBase
Rule of thumb: use only scans on the HBase side, and do everything else with Spark.
To avoid shuffling in your Spark operations, consider working on your partitions. For example, you can join two Spark RDDs built from HBase scans on their rowkey or rowkey prefix without any shuffling.
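The idea behind a shuffle-free join can be sketched in plain Python. This is a toy model rather than Spark or HBase code: the split points, the sample rows and the `partition` and `local_join` helpers are hypothetical, and the sketch assumes both datasets are partitioned by the same rowkey ranges.

```python
# Toy model: two datasets partitioned by the same rowkey ranges can be joined
# partition by partition, so no record has to move between partitions.
from bisect import bisect_right

SPLITS = ["g", "p"]  # hypothetical region boundaries: [..g), [g..p), [p..]

def partition(rows, splits):
    """Place (rowkey, value) pairs into range partitions, like HBase regions."""
    parts = [[] for _ in range(len(splits) + 1)]
    for key, value in rows:
        parts[bisect_right(splits, key)].append((key, value))
    return parts

def local_join(left_part, right_part):
    """Join two partitions that cover the same key range."""
    lookup = dict(right_part)
    return [(k, (v, lookup[k])) for k, v in left_part if k in lookup]

users = partition([("alice", 1), ("harry", 2), ("zoe", 3)], SPLITS)
orders = partition([("alice", "A1"), ("zoe", "Z9"), ("mike", "M4")], SPLITS)

# Each pair of partitions is joined independently; nothing is shuffled.
joined = [row for left, right in zip(users, orders)
          for row in local_join(left, right)]
print(joined)
```

Because matching keys always land in the partition with the same index on both sides, each partition pair can be joined on the machine that already holds it.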
HBase configuration tweaks
This discussion is a bit old (some configurations are not up to date) but still interesting: http://community.cloudera.com/t5/Storage-Random-Access-HDFS/How-to-optimise-Full-Table-Scan-FTS-in-HBase/td-p/97
The link below also has some leads:
http://blog.asquareb.com/blog/2015/01/01/configuration-parameters-that-can-influence-hbase-performance/
You might find multiple sources (including the ones above) suggesting that you change the scanner cache config, but this applies only to HBase < 1.x
We had this exact question at Splice Machine. We found the following based on our tests.
HBase had performance challenges if you attempted to perform remote scans from spark/mapreduce.
The large scans hurt performance of ongoing small scans by forcing garbage collection.
There was not a clear resource management dividing line between OLTP and OLAP queries and resources.
We ended up writing a custom reader that reads the HFiles directly from HDFS and performs incremental deltas with the memstore during scans. With this, Spark could perform quickly enough for most OLAP applications. We also separated the resource management so the OLAP resources were allocated via YARN (On Premise) or Mesos (Cloud) and would not disturb normal OLTP apps.
I wish you luck on your endeavor. Splice Machine is open source and you are welcome to check out our code and approach.

Spark performance advantage vs. Hadoop MapReduce [duplicate]

This question already has answers here:
Why is Spark faster than Hadoop Map Reduce
(2 answers)
Closed 5 years ago.
I am hearing that Spark has an advantage over Hadoop due to Spark's in-memory computation. However, one obvious problem is that not all the data can fit into one computer's memory. So is Spark then limited to smaller datasets? At the same time, there is the notion of a Spark cluster. So I am not following the purported advantages of Spark over Hadoop MR.
Thanks
Hadoop MapReduce has been the mainstay on Hadoop for batch jobs for a long time. However, two very promising technologies have emerged: Apache Drill, which is a low-latency SQL engine for self-service data exploration, and Apache Spark, which is a general-purpose compute engine that allows you to run batch, interactive and streaming jobs on the cluster using the same unified framework. Let's dig a little bit more into Spark.
To understand Spark, you really have to understand three big concepts.
First is RDDs, the resilient distributed datasets. This is really a representation of the data that's coming into your system in an object format, and it allows you to do computations on top of it. RDDs are resilient because they have a long lineage. Whenever there's a failure in the system, they can recompute themselves from the prior information using that lineage.
The second concept is transformations. Transformations are what you do to RDDs to get other resilient RDDs. Examples of transformations would be things like opening a file and creating an RDD, or applying functions like filter that then create other resilient RDDs.
The third and final concept is actions. These are things where you're actually asking for an answer that the system needs to provide, for instance a count, or asking what's the first line that has Spark in it. The interesting thing with Spark is that it does lazy evaluation, which means that these RDDs are not loaded and pushed into the system as and when the system encounters an RDD; they are only materialized when there is actually an action to be performed.
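The split between transformations and actions can be sketched in plain Python. This is a toy model and not the Spark API: the `ToyRDD` class, its methods and the sample lines are hypothetical, and exist only to show that nothing is read until an action runs.

```python
# Toy model of lazy evaluation (not the Spark API): transformations only
# record what to do; work happens when an action such as count() is called.

class ToyRDD:
    def __init__(self, source, steps=()):
        self.source = source      # zero-argument function that yields records
        self.steps = steps        # recorded transformations, in order

    def map(self, fn):
        return ToyRDD(self.source, self.steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self.source, self.steps + (("filter", pred),))

    def _run(self):
        for record in self.source():
            keep = True
            for kind, fn in self.steps:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                yield record

    # Actions: these trigger the actual computation.
    def count(self):
        return sum(1 for _ in self._run())

    def first(self):
        return next(self._run())


reads = []

def load_lines():
    reads.append("read")          # lets us observe when data is really loaded
    return iter(["Hadoop MapReduce", "Apache Spark", "Spark RDD"])

lines = ToyRDD(load_lines)
spark_lines = lines.filter(lambda s: "Spark" in s)   # nothing is read yet
assert reads == []

print(spark_lines.count())   # action: the source is read now
print(spark_lines.first())   # another action: the source is read again
```

In this sketch every action re-reads the source, which is the behaviour that caching an RDD is meant to avoid.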
One question that comes up with RDDs, given that they are resilient and held in main memory, is how they compare with the distributed shared memory architectures that most of us are familiar with from the past. There are a few differences. Let's go through them briefly. First of all, writes in RDDs are coarse-grained. They happen at the RDD level. Writes in distributed shared memory are typically fine-grained. Reads in distributed shared memory are fine-grained as well. Reads in RDDs can be fine- or coarse-grained.
The second piece is recovery. What happens if there is a fault in the system, and how do we recover from it? Since RDDs build this lineage graph, if something goes bad they can go back, recompute based on that graph, and regenerate the RDD. Lineage is relied on heavily in RDDs for recovery. In distributed shared memory we typically go back to checkpointing done at intervals or some other semantic checkpointing mechanism. Consistency is relatively trivial in RDDs because the data underneath is assumed to be immutable. If, however, the data were changing, then consistency would be a problem. Distributed shared memory doesn't make any assumptions about mutability and therefore leaves the consistency semantics to the application.
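Recovery through lineage can be illustrated with a small plain-Python sketch. It is a toy model, not Spark internals: the `Partition` class and its fields are hypothetical, and losing data is simulated by clearing a cached result.

```python
# Toy model of lineage-based recovery (not Spark internals): a partition is
# described by its source data plus the functions applied to it, so a lost
# result can be rebuilt by replaying those functions.

class Partition:
    def __init__(self, source, lineage=()):
        self.source = source        # immutable input records
        self.lineage = lineage      # functions applied so far, in order
        self.cached = None          # materialized result, may be lost

    def map(self, fn):
        return Partition(self.source, self.lineage + (fn,))

    def compute(self):
        if self.cached is None:     # missing (never computed or lost): replay
            data = list(self.source)
            for fn in self.lineage:
                data = [fn(x) for x in data]
            self.cached = data
        return self.cached


part = Partition((1, 2, 3)).map(lambda x: x * 10).map(lambda x: x + 1)
before = part.compute()

part.cached = None                  # simulate losing the in-memory copy
after = part.compute()              # rebuilt from source + lineage

print(before, after)
```

The rebuild only works because the source is immutable, which is the same assumption that keeps consistency simple for RDDs.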
Finally, let's look at the benefits of Spark:
Spark provides full recovery using lineage.
Spark optimizes both the computations themselves and their placement using the directed acyclic graph.
A very easy programming paradigm using transformations and actions on RDDs, as well as rich library support for machine learning, graph processing and, recently, data frames.
At this point a question comes up. If Spark is so great, does Spark actually replace Hadoop? The answer is clearly no, because Spark provides an application framework for you to write your big data applications. However, it still needs to run on a storage system or on a NoSQL system.
Spark is not limited to smaller datasets, and it is not always about in-memory computation. Spark has a good number of higher-level APIs. Spark can process data in the GB range as well. In my real-world experience I have used Spark for a streaming application where we usually received data at a rate of GBs per hour. We have also used Spark in telecommunications to handle bigger datasets. See RDD Persistence for how to accommodate bigger datasets.
Real-world problems usually cannot be solved by a single MapReduce program with one Mapper class and one Reducer class; we mostly need to build a pipeline. A pipeline consists of multiple stages, each with its own MapReduce program, and the output of one stage is fed one or more times to the subsequent stages. This is a pain because of the amount of IO involved.
In MapReduce, the Map and Reduce tasks are followed by a synchronization barrier, and the data must be persisted to disk. This feature of the MapReduce framework was developed so that jobs can be recovered in case of failure, but the drawback is that it does not leverage the memory of the Hadoop cluster to the maximum. This becomes worse when you have an iterative algorithm in your pipeline: every iteration causes a significant amount of disk IO.
To solve this problem, Spark introduced a new data structure called the RDD: a data structure that holds information such as how the data can be read from disk and what to compute. Spark also provides an easy programming paradigm for creating a pipeline (DAG) by transforming RDDs. What you get is a series of RDDs that know how to get the data and what to compute.
Finally, when an action is invoked, the Spark framework internally optimizes the pipeline, groups together the portions that can be executed together (map phases), and creates a final optimized execution plan from the logical pipeline. It then executes the plan. It also gives the user the flexibility to select which data should be cached. Hence Spark is able to achieve roughly 10 to 100 times faster batch processing than MapReduce.
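Grouping map phases so that they execute together can be pictured with Python generators. This is a toy sketch, not how Spark builds stages: the step functions and the sample log are hypothetical, and the point is only that each record passes through every step in one pass with no intermediate dataset written between steps.

```python
# Toy model of pipelining: consecutive per-record steps are chained with
# generators, so each record flows through all of them in one pass and no
# intermediate dataset is written out between steps.

def parse(lines):
    for line in lines:
        yield line.split(",")

def keep_errors(rows):
    for row in rows:
        if row[1] == "ERROR":
            yield row

def to_host(rows):
    for row in rows:
        yield row[0]

log = ["web1,ERROR", "web2,INFO", "db1,ERROR"]

# Building the pipeline does no work; it only wires the steps together.
pipeline = to_host(keep_errors(parse(log)))

# The "action": pulling results drives all three steps, record by record.
error_hosts = list(pipeline)
print(error_hosts)
```

A chain of separate MapReduce jobs would instead write the output of each step to disk before the next one starts.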
Spark advantages over Hadoop:
As Spark tasks across stages can be executed on the same executor nodes, the time to spawn the executor is saved across multiple tasks.
Even if you have huge memory, MapReduce can never take advantage of caching data in memory and using the in-memory data for subsequent steps.
Spark, on the other hand, can cache data if a large JVM is available to it. The in-memory data is used across stages.
In Spark, tasks run as threads on the same executor, making the task memory footprint light.
In MapReduce, the Map or Reduce tasks are processes, not threads.
Spark uses an efficient serialization format to store data on disk.
Follow this for a detailed understanding: http://bytepadding.com/big-data/spark/understanding-spark-through-map-reduce/

Hadoop comparison to RDBMS

I really do not understand the actual reason behind Hadoop scaling better than an RDBMS. Can anyone please explain at a granular level? Has this got something to do with the underlying data structures and algorithms?
RDBMSs have challenges in handling huge data volumes of terabytes and petabytes. Even if you have a Redundant Array of Independent/Inexpensive Disks (RAID) and data sharding, an RDBMS does not scale well for huge volumes of data. You require very expensive hardware.
EDIT:
To answer why an RDBMS cannot scale, have a look at the overheads of an RDBMS.
Logging. Assembling log records and tracking down all changes in database structures slows performance. Logging may not be necessary if recoverability is not a requirement or if recoverability is provided through other means (e.g., other sites on the network).
Locking. Traditional two-phase locking poses a sizeable overhead since all accesses to database structures are governed by a separate entity, the Lock Manager.
Latching. In a multi-threaded database, many data structures have to be latched before they can be accessed. Removing this feature and going to a single-threaded approach has a noticeable performance impact.
Buffer management. A main memory database system does not need to access pages through a buffer pool, eliminating a level of indirection on every record access.
How Hadoop handles this:
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment, which can run on commodity hardware. It is useful for storing & retrieval of huge volumes of data.
This scalability & efficiency are possible with Hadoop implementation of storage mechanism (HDFS) & processing jobs (YARN Map reduce jobs). Apart from scalability, Hadoop provides high availability of stored data.
Scalability, high availability, and flexible processing of huge volumes of data (structured, unstructured and semi-structured) are key to the success of Hadoop.
Data is stored on thousands of nodes, and processing is done on the node where the data is stored (most of the time) through MapReduce jobs. Data locality on the processing front is one key area of Hadoop's success.
This has been achieved with Name Node, Data Node & Resource Manager.
To understand how Hadoop achieves this, you should visit these links: HDFS Architecture, YARN Architecture and HDFS Federation
Still, an RDBMS is good for multiple writes/reads/updates and consistent ACID transactions on gigabytes of data, but not good for processing terabytes and petabytes of data. NoSQL, which offers two of the three CAP theorem attributes (Consistency, Availability, Partition tolerance), is good for some use cases.
But Hadoop is not meant for real time transaction support with ACID properties. It is good for Business intelligence reporting with batch processing - "Write once, multiple read" paradigm.
From slideshare.net
Have a look at one more related SE question :
NoSql vs Relational database
First, Hadoop IS NOT a DB replacement.
RDBMSs scale vertically and Hadoop scales horizontally.
This means that to double the capacity of an RDBMS you need hardware with double the memory, double the storage and double the CPU. That is very expensive and has limits. There isn't a server with 10 TB of RAM, for example. With Hadoop it is different: you don't need expensive cutting-edge technology; instead you can use several commodity servers working together to simulate a bigger server (with some limitations). You can have a cluster with 10 TB of RAM distributed across several nodes.
Another advantage is that instead of having to buy a new, more powerful server and drop the old one, scaling a distributed system only requires adding new nodes to the cluster.
The one issue I have with the description above is the claim that parallel RDBMSs require expensive hardware. Teradata and Netezza need special hardware. Greenplum and Vertica can be put on commodity hardware. (Now I will admit I am biased, like everyone else.) I have seen Greenplum scan petabytes of information daily. (Walmart was up to 2.5 petabytes last I heard.) I have dealt with both HAWQ and Impala. They both require about 30% more hardware to do the same job on structured data. HBase is less efficient.
There is no silver bullet. It has been my experience that both structured and unstructured have their place. Hadoop is great for ingesting large amounts of data and scanning through it a small number of times. We use it as part of our load procedures. An RDBMS is great at scanning the same data over and over with highly complex queries.
You always have to structure the data to make use of it. That structuring takes time somewhere. You either structure it before you put it into an RDBMS or at query time.
In an RDBMS, data is structured; more precisely, it is indexed.
Retrieving any particular 'nth' column means loading the entire database and then selecting the 'nth' column.
Whereas in Hadoop, say with Hive, we load only the particular column from the entire data set.
Moreover, the data loading is also done by MapReduce programs, which run in a distributed fashion and reduce the overall time.
Hence, there are two advantages of using Hadoop and its tools.

spark + hadoop data locality

I got an RDD of filenames, so an RDD[String]. I get that by parallelizing a list of filenames (of files inside hdfs).
Now I map this rdd and my code opens a hadoop stream using FileSystem.open(path). Then I process it.
When I run my task, I use spark UI/Stages and I see the "Locality Level" = "PROCESS_LOCAL" for all the tasks. I don't think spark could possibly achieve data locality the way I run the task (on a cluster of 4 data nodes), how is that possible?
When FileSystem.open(path) gets executed in Spark tasks, the file content is loaded into a local variable in the same JVM process, which then prepares the RDD partition(s). So the data locality for that RDD is always PROCESS_LOCAL.
-- vanekjar has already commented on the question
Additional information about data locality in Spark:
There are several levels of locality based on the data’s current location. In order from closest to farthest:
PROCESS_LOCAL data is in the same JVM as the running code. This is the best locality possible
NODE_LOCAL data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes
NO_PREF data is accessed equally quickly from anywhere and has no locality preference
RACK_LOCAL data is on the same rack of servers. Data is on a different server on the same rack so needs to be sent over the network, typically through a single switch
ANY data is elsewhere on the network and not in the same rack
Spark prefers to schedule all tasks at the best locality level, but this is not always possible. In situations where there is no unprocessed data on any idle executor, Spark switches to lower locality levels.
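The ordering of these levels can be sketched in plain Python. This is a toy classifier, not Spark's code: locations are modelled as hypothetical `(rack, host, process)` tuples, and the `locality` and `best_executor` helpers are made up for illustration.

```python
# Toy classifier for the locality levels listed above (not Spark's code).
# Locations are (rack, host, process_id) tuples; all names are made up.

LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "NO_PREF", "RACK_LOCAL", "ANY"]

def locality(data_loc, task_loc):
    """Return the locality level of a task relative to its data."""
    if data_loc is None:                 # data has no preferred location
        return "NO_PREF"
    if data_loc == task_loc:
        return "PROCESS_LOCAL"
    if data_loc[:2] == task_loc[:2]:     # same rack and host, other process
        return "NODE_LOCAL"
    if data_loc[0] == task_loc[0]:       # same rack, different host
        return "RACK_LOCAL"
    return "ANY"

def best_executor(data_loc, executors):
    """Pick the executor with the closest locality level to the data."""
    return min(executors, key=lambda e: LEVELS.index(locality(data_loc, e)))

data = ("rack1", "node1", "datanode")
executors = [("rack2", "node7", "exec3"), ("rack1", "node2", "exec2"),
             ("rack1", "node1", "exec1")]

for e in executors:
    print(e, locality(data, e))
print("chosen:", best_executor(data, executors))
```

In the question's setup the task itself opens the file, so the data it processes is created inside the task's own JVM, which corresponds to the `PROCESS_LOCAL` case here.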
Data locality is one of Spark's features that increases its processing speed. See the Data Locality section of the Spark tuning guide. At the start, when you write sc.textFile("path"), the data locality level is determined by the path you specified, but after that Spark tries to bring the locality level to PROCESS_LOCAL, optimizing processing speed by starting the process where the data is present (locally).

Why is Spark fast when word count? [duplicate]

This question already has answers here:
Why is Spark faster than Hadoop Map Reduce
(2 answers)
Closed 5 years ago.
Test case: Spark counts the words in 6 GB of data in 20+ seconds.
I understand the MapReduce, FP and stream programming models, but couldn't figure out why the word counting is so amazingly fast.
I think it's an I/O-intensive computation in this case, and it should be impossible to scan 6 GB of files in 20+ seconds. I guess some indexing is performed before the word counting, like Lucene does. The magic should be in the RDD (Resilient Distributed Datasets) design, which I don't understand well enough.
I would appreciate it if anyone could explain RDDs for the word counting case. Thanks!
First is startup time. Hadoop MapReduce job startup requires starting a number of separate JVMs, which is not fast. Spark job startup (on an existing Spark cluster) causes the existing JVMs to fork new task threads, which is many times faster than starting a JVM.
Next, no indexing and no magic. A 6GB file is stored in 47 blocks of 128MB each. Imagine you have a big enough Hadoop cluster that all of these 47 HDFS blocks reside on different JBOD HDDs. Each of them would deliver a 70 MB/sec scan rate, which means you can read this data in ~2 seconds. With a 10GbE network in your cluster you can transfer all of this data from one machine to another in just 7 seconds.
Lastly, Hadoop puts intermediate data to disk a number of times. It puts map output to disk at least once (and more if the map output is big and on-disk merges happen). It puts the data to disk again on the reduce side before the reduce itself is executed. Spark puts the data to HDDs only once, during the shuffle phase, and the reference Spark implementation recommends increasing the filesystem write cache so that this 'shuffle' data does not hit the disks.
All of this gives Spark a big performance boost compared to Hadoop. There is no magic in Spark RDDs related to this question
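The scan-time estimate in this answer can be checked with a few lines of arithmetic. The sketch below assumes the 6GB file means 6000 MB (decimal), which reproduces the 47 blocks; the network figure it prints is an idealized line-rate bound, so the 7 seconds quoted above presumably allows for overhead.

```python
import math

# Back-of-the-envelope check of the numbers above. The file size is taken as
# 6000 MB (decimal GB), which is an assumption that reproduces the 47 blocks.
file_mb = 6000
block_mb = 128
disk_mb_per_s = 70        # per-disk scan rate quoted above
network_gbit_per_s = 10   # 10GbE

blocks = math.ceil(file_mb / block_mb)
# If every block sits on its own disk, all blocks are read in parallel,
# so the scan takes as long as reading one block.
parallel_scan_s = block_mb / disk_mb_per_s
# Line-rate lower bound for moving the whole file over one 10GbE link;
# protocol overhead pushes real transfers above this figure.
transfer_s = file_mb * 8 / (network_gbit_per_s * 1000)

print(blocks, round(parallel_scan_s, 2), round(transfer_s, 1))
```

Under these assumptions the parallel scan takes under 2 seconds and the transfer about 5 seconds at line rate, so a 20-second word count needs no index.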
Other than the factors mentioned by 0x0FFF, local combining of results also makes Spark run word count more efficiently. Spark, by default, combines results on each node before sending the results to other nodes.
In the case of a word count job, Spark calculates the count for each word on a node and then sends the results to other nodes. This reduces the amount of data to be transferred over the network. To achieve the same functionality in Hadoop MapReduce, you need to specify a combiner class with job.setCombinerClass(CustomCombiner.class)
By using combineByKey() in Spark, you can specify a custom combiner.
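The effect of local combining can be sketched in plain Python. This is a toy model, not the Spark or Hadoop API: the two inner lists stand for the lines held by two hypothetical nodes, and the sketch counts how many key/value pairs would have to leave each node with and without a combiner.

```python
from collections import Counter

# Toy model of map-side combining (not the Spark API). Each inner list stands
# for the lines held by one node.
partitions = [
    ["spark is fast", "spark is lazy"],
    ["hadoop is stable", "spark runs on hadoop"],
]

def local_counts(lines):
    """Count words within one partition before anything is sent anywhere."""
    return Counter(word for line in lines for word in line.split())

# Without combining, one (word, 1) pair per word occurrence crosses the network.
pairs_without_combiner = sum(len(line.split()) for p in partitions for line in p)

# With combining, each node sends one (word, count) pair per distinct word.
combined = [local_counts(p) for p in partitions]
pairs_with_combiner = sum(len(c) for c in combined)

# "Reduce" side: merge the partial counts from every node.
totals = sum(combined, Counter())

print(pairs_without_combiner, pairs_with_combiner, totals["spark"])
```

The saving grows with repetition: on real text, where common words occur many times per partition, far fewer pairs leave each node.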
Apache Spark processes data in memory, while Hadoop MapReduce persists back to disk after a map or reduce action. But Spark needs a lot of memory.
Spark loads a process into memory and keeps it there until further notice, for the sake of caching.
Spark uses the Resilient Distributed Dataset (RDD), which allows you to transparently store data in memory and persist it to disk if needed.
Since Spark works in memory, there's no synchronisation barrier slowing you down. This is a major reason for Spark's performance.
Rather than just processing a batch of stored data, as is the case with MapReduce, Spark can also manipulate data in real time using Spark Streaming.
The DataFrames API was inspired by data frames in R and Python (Pandas), but designed from the ground up to be an extension to the existing RDD API.
A DataFrame is a distributed collection of data organized into named columns, but with richer optimizations under the hood that contribute to Spark's speed.
Using RDDs, Spark simplifies complex operations like join and groupBy, and in the backend you're dealing with fragmented data. That fragmentation is what enables Spark to execute in parallel.
Spark allows you to develop complex, multi-step data pipelines using the directed acyclic graph (DAG) pattern. It supports in-memory data sharing across DAGs, so that different jobs can work with the same data. DAGs are a major part of Spark's speed.
Hope this helps.

Resources